[ https://issues.apache.org/jira/browse/HAWQ-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858621#comment-15858621 ]
ASF GitHub Bot commented on HAWQ-1304:
--------------------------------------

Github user dyozie commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/94#discussion_r100185543

--- Diff: markdown/pxf/HivePXF.html.md.erb ---
@@ -164,9 +164,143 @@ Create a Hive table to expose our sample data set.
 
 In examples later in this section, you will access the `sales_info` Hive table directly via PXF. You will also insert `sales_info` data into tables of other Hive file format types, and use PXF to access those directly as well.
+
+## <a id="hcatalog"></a>Using PXF and HCatalog to Query Hive
+
+You can query Hive tables directly through HCatalog integration with HAWQ and PXF, regardless of the underlying file storage format. This integration allows HAWQ to use table metadata stored in HCatalog directly.
+
+HCatalog is built on top of the Hive metastore and incorporates Hive's DDL. This provides several advantages:
+
+- You do not need to know the table schema of your Hive tables.
+- You do not need to manually enter information about Hive table location or format.
+- If Hive table metadata changes, HCatalog provides updated metadata. This is in contrast to the use of static external PXF tables to define Hive table metadata for HAWQ.
+
+The following diagram depicts how HAWQ integrates with HCatalog to query Hive tables:
+
+<img src="../images/hawq_hcatalog.png" id="hcatalog__image_ukw_h2v_c5" class="image" width="672" />
+
+1. HAWQ retrieves table metadata from HCatalog using PXF.
+2. HAWQ creates in-memory catalog tables from the retrieved metadata. If a table is referenced multiple times in a transaction, HAWQ uses its in-memory metadata to reduce external calls to HCatalog.
+3. PXF queries Hive using the table metadata stored in the HAWQ in-memory catalog tables. The table metadata is dropped at the end of the transaction.
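+
+The metadata flow above can be sketched as a `psql` session. (This is a hypothetical example; it assumes the `sales_info` Hive table from this section is accessible through HCatalog.)
+
+``` sql
+postgres=# BEGIN;
+-- first reference: HAWQ retrieves the table metadata from HCatalog via PXF
+postgres=# SELECT COUNT(*) FROM hcatalog.default.sales_info;
+-- a second reference in the same transaction reuses the in-memory metadata,
+-- avoiding another call to HCatalog
+postgres=# SELECT * FROM hcatalog.default.sales_info;
+-- the in-memory table metadata is dropped when the transaction ends
+postgres=# COMMIT;
+```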
+
+
+### <a id="topic_j1l_enabling"></a>Enabling HCatalog Integration
+
+To enable HCatalog query integration in HAWQ, perform the following steps:
+
+1. Make sure your deployment meets the requirements listed in [Prerequisites](#installingthepxfhiveplugin).
+2. If necessary, set the `pxf_service_address` global configuration property to the hostname or IP address and port where you have installed the PXF Hive plug-in. By default, the value is set to `localhost:51200`.
+
+    ``` sql
+    postgres=# SET pxf_service_address TO <hivenode>:51200;
+    ```
+
+3. HCatalog internally uses the `pxf` protocol to issue queries. Grant this protocol privilege to all roles that require access:
+
+    ``` sql
+    postgres=# GRANT ALL ON PROTOCOL pxf TO <role>;
+    ```
+
+4. Creating a HAWQ table with the `WITH (OIDS)` clause is not recommended. If any user tables were created with the `WITH (OIDS)` clause, additional operations are required to enable HCatalog integration: to access a Hive table via HCatalog, HAWQ users must have `SELECT` permission on every user table in the same schema that was created with the `WITH (OIDS)` clause.
+
+    1. Determine which user tables were created using the `WITH (OIDS)` clause:
+
+        ``` sql
+        postgres=# SELECT oid, relname FROM pg_class
+                   WHERE relhasoids = true
+                   AND relnamespace <> (SELECT oid FROM pg_namespace WHERE nspname = 'pg_catalog');
+        ```
+
+    2. Grant `SELECT` privilege on all returned tables to all roles to which you chose to provide HCatalog query access. For example:
+
+        ``` sql
+        postgres=# GRANT SELECT ON <table-created-WITH-OIDS> TO <role>;
+        ```
+
+### <a id="topic_j1l_y55_c5"></a>Usage
+
+To query a Hive table with HCatalog integration, query HCatalog directly from HAWQ.
The query syntax is:
+
+``` sql
+postgres=# SELECT * FROM hcatalog.hive-db-name.hive-table-name;
+```
+
+For example:
+
+``` sql
+postgres=# SELECT * FROM hcatalog.default.sales_info;
+```
+
+To obtain a description of a Hive table with HCatalog integration, you can use the `psql` client interface.
+
+- Within HAWQ, use either the `\d hcatalog.hive-db-name.hive-table-name` or `\d+ hcatalog.hive-db-name.hive-table-name` command to describe a single table. `\d` displays only HAWQ's interpretation of the underlying source (in this case, Hive) data type, while `\d+` displays both the HAWQ-interpreted and Hive source data types. For example, from the `psql` client interface:
+
+    ``` shell
+    $ psql -d postgres
+    ```
+
+    ``` sql
+    postgres=# \d+ hcatalog.default.sales_info;
+    ```
+
+    ``` pre
+    PXF Hive Table "default.sales_info"
+          Column      |  Type  | Source type
+    ------------------+--------+-------------
+     location         | text   | string
+     month            | text   | string
+     number_of_orders | int4   | int
+     total_sales      | float8 | double
+    ```
+- Use `\d hcatalog.hive-db-name.*` to describe a whole database schema, i.e. all tables in `hive-db-name`.
+- Use `\d hcatalog.*.*` to describe the whole schema, i.e. all databases and tables.
+
+When you use the `\d` or `\d+` commands in the HAWQ `psql` client, `hcatalog` is not listed as a database. If you use other `psql`-compatible clients, `hcatalog` is listed as a database with a size value of `-1`, since `hcatalog` is not a real database in HAWQ.
+
+Alternatively, you can use the `pxf_get_item_fields` user-defined function (UDF) to obtain Hive table descriptions from other client interfaces or third-party applications. The UDF takes a PXF profile and a table pattern string as its input parameters. **Note:** The only supported input profile at this time is `'Hive'`.
+
+- The following statement returns a description of a specific table.
The description includes path, itemname (table), fieldname, fieldtype (the HAWQ type), and sourcefieldtype (the Hive type).
+
+    ``` sql
+    postgres=# SELECT * FROM pxf_get_item_fields('Hive', 'default.sales_info');
+    ```
+
+    ``` pre
+      path   |  itemname  |    fieldname     | fieldtype | sourcefieldtype
+    ---------+------------+------------------+-----------+-----------------
+     default | sales_info | location         | text      | string
+     default | sales_info | month            | text      | string
+     default | sales_info | number_of_orders | int4      | int
+     default | sales_info | total_sales      | float8    | double
+    ```
+
+- The following statement returns table descriptions from the default database.
+
+    ``` sql
+    postgres=# SELECT * FROM pxf_get_item_fields('Hive', 'default.*');
+    ```
+
+- The following statement returns a description of the entire schema.
+
+    ``` sql
+    postgres=# SELECT * FROM pxf_get_item_fields('Hive', '*.*');
+    ```
+
+### <a id="topic_r5k_pst_25"></a>Limitations
+
+HCatalog integration has the following limitations:
+
+- HCatalog queries on Hive tables with complex type fields return those fields serialized as text.
+- Even for primitive types, the HCatalog metadata descriptions produced by `\d` are HAWQ's interpretation of the underlying Hive data types. For example, the Hive type `tinyint` is converted to the HAWQ type `int2`. (See [Data Type Mapping](#hive_primdatatypes).)
+- HAWQ reserves the database name `hcatalog` for system use. You cannot connect to or alter the system `hcatalog` database.
+
+
 ## <a id="topic_p2s_lvl_28"></a>Querying External Hive Data
 
-The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`, and `HiveRC`.
+In the previous section, you used HCatalog integration to query a Hive table. You can also create a PXF/HAWQ external table to access Hive table data. This Hive table access mechanism requires that you identify an appropriate Hive profile.
+
+The PXF Hive plug-in supports several Hive-related profiles.
These include `Hive`, `HiveText`, and `HiveRC`. `HiveText` and `HiveRC` profiles are specifically optimized for text and RC file formats, respectively. The `Hive` profile is optimized for all file storage types; use the `Hive` profile when the underlying Hive table is composed of multiple partitions with differing file formats.
--- End diff --

Third sentence here should start with "The"

> documentation changes for HAWQ-1228
> -----------------------------------
>
>                 Key: HAWQ-1304
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1304
>             Project: Apache HAWQ
>          Issue Type: New Feature
>          Components: Documentation
>    Affects Versions: 2.1.0.0-incubating
>            Reporter: Lisa Owen
>            Assignee: David Yozie
>            Priority: Minor
>
> - new pxf-profiles.xml outputFormat parameter
> - hive table access via external table and hcatalog now uses optimal profile
> for each fragment
> - others

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)