[ https://issues.apache.org/jira/browse/HAWQ-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15612281#comment-15612281 ]
ASF GitHub Bot commented on HAWQ-1107:
--------------------------------------

Github user kavinderd commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/33#discussion_r85362384

--- Diff: pxf/HDFSFileDataPXF.html.md.erb ---
@@ -2,506 +2,449 @@ title: Accessing HDFS File Data
 ---
-## <a id="installingthepxfhdfsplugin"></a>Prerequisites
+HDFS is the primary distributed storage mechanism used by Apache Hadoop applications. The PXF HDFS plug-in reads file data stored in HDFS. The plug-in supports plain delimited and comma-separated-value format text files. The HDFS plug-in also supports the Avro binary format.

-Before working with HDFS file data using HAWQ and PXF, you should perform the following operations:
+This section describes how to use PXF to access HDFS data, including how to create and query an external table from files in the HDFS data store.

-- Test PXF on HDFS before connecting to Hive or HBase.
-- Ensure that all HDFS users have read permissions to HDFS services and that write permissions have been limited to specific users.
+## <a id="hdfsplugin_prereq"></a>Prerequisites

-## <a id="syntax1"></a>Syntax
+Before working with HDFS file data using HAWQ and PXF, ensure that:

-The syntax for creating an external HDFS file is as follows:
+- The HDFS plug-in is installed on all cluster nodes. See [Installing PXF Plug-ins](InstallPXFPlugins.html) for PXF plug-in installation information.
+- All HDFS users have read permissions to HDFS services and write permissions have been restricted to specific users.

-``` sql
-CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name
-    ( column_name data_type [, ...] | LIKE other_table )
-LOCATION ('pxf://host[:port]/path-to-data?<pxf parameters>[&custom-option=value...]')
-  FORMAT '[TEXT | CSV | CUSTOM]' (<formatting_properties>);
-```
+## <a id="hdfsplugin_fileformats"></a>HDFS File Formats

-where `<pxf parameters>` is:
+The PXF HDFS plug-in supports reading the following file formats:

-``` pre
-   FRAGMENTER=fragmenter_class&ACCESSOR=accessor_class&RESOLVER=resolver_class]
- | PROFILE=profile-name
-```
+- Text File - comma-separated value (.csv) or delimited format plain text file
+- Avro - JSON-defined, schema-based data serialization format

-**Note:** Omit the `FRAGMENTER` parameter for `READABLE` external tables.
+The PXF HDFS plug-in includes the following profiles to support the file formats listed above:

-Use an SQL `SELECT` statement to read from an HDFS READABLE table:
+- `HdfsTextSimple` - text files
+- `HdfsTextMulti` - text files with embedded line feeds
+- `Avro` - Avro files

-``` sql
-SELECT ... FROM table_name;
+If you find that the pre-defined PXF HDFS profiles do not meet your needs, you may choose to create a custom HDFS profile from the existing HDFS serialization and deserialization classes. Refer to [Adding and Updating Profiles](ReadWritePXF.html#addingandupdatingprofiles) for information on creating a custom profile.
+
+## <a id="hdfsplugin_cmdline"></a>HDFS Shell Commands
+Hadoop includes command-line tools that interact directly with HDFS. These tools support typical file system operations including copying and listing files, changing file permissions, and so forth.
+
+The HDFS file system command syntax is `hdfs dfs <options> [<file>]`. Invoked with no options, `hdfs dfs` lists the file system options supported by the tool.
+
+`hdfs dfs` options used in this topic are:
+
+| Option | Description |
+|-------|-------------------------------------|
+| `-cat` | Display file contents. |
+| `-mkdir` | Create directory in HDFS. |
+| `-put` | Copy file from local file system to HDFS. |
+
+Examples:
+
+Create a directory in HDFS:
+
+``` shell
+$ sudo -u hdfs hdfs dfs -mkdir -p /data/exampledir
```

-Use an SQL `INSERT` statement to add data to an HDFS WRITABLE table:
+Copy a text file to HDFS:

-``` sql
-INSERT INTO table_name ...;
+``` shell
+$ sudo -u hdfs hdfs dfs -put /tmp/example.txt /data/exampledir/
```

-To read the data in the files or to write based on the existing format, use `FORMAT`, `PROFILE`, or one of the classes.
-
-This topic describes the following:
-
-- FORMAT clause
-- Profile
-- Accessor
-- Resolver
-- Avro
-
-**Note:** For more details about the API and classes, see [PXF External Tables and API](PXFExternalTableandAPIReference.html#pxfexternaltableandapireference).
-
-### <a id="formatclause"></a>FORMAT clause
-
-Use one of the following formats to read data with any PXF connector:
-
-- `FORMAT 'TEXT'`: Use with plain delimited text files on HDFS.
-- `FORMAT 'CSV'`: Use with comma-separated value files on HDFS.
-- `FORMAT 'CUSTOM'`: Use with all other files, including Avro format and binary formats. Must always be used with the built-in formatter '`pxfwritable_import`' (for read) or '`pxfwritable_export`' (for write).
-
-**Note:** When creating PXF external tables, you cannot use the `HEADER` option in your `FORMAT` specification.
-
-### <a id="topic_ab2_sxy_bv"></a>Profile
-
-For plain or comma-separated text files in HDFS use either the `HdfsTextSimple` or `HdfsTextMulti` Profile, or the classname org.apache.hawq.pxf.plugins.hdfs.*HdfsDataFragmenter*. Use the `Avro` profile for Avro files. See [Using Profiles to Read and Write Data](ReadWritePXF.html#readingandwritingdatawithpxf) for more information.
-
-**Note:** For read tables, you must include a Profile or a Fragmenter in the table definition.
-
-### <a id="accessor"></a>Accessor
-
-The choice of an Accessor depends on the HDFS data file type.
-
-**Note:** You must include either a Profile or an Accessor in the table definition.
-
-<table>
-<colgroup>
-<col width="25%" />
-<col width="25%" />
-<col width="25%" />
-<col width="25%" />
-</colgroup>
-<thead>
-<tr class="header">
-<th>File Type</th>
-<th>Accessor</th>
-<th>FORMAT clause</th>
-<th>Comments</th>
-</tr>
-</thead>
-<tbody>
-<tr class="odd">
-<td>Plain Text delimited</td>
-<td>org.apache.hawq.pxf.plugins. hdfs.LineBreakAccessor</td>
-<td>FORMAT 'TEXT' (<em>format param list</em>)</td>
-<td> Read + Write
-<p>You cannot use the <code class="ph codeph">HEADER</code> option.</p></td>
-</tr>
-<tr class="even">
-<td>Plain Text CSV </td>
-<td>org.apache.hawq.pxf.plugins. hdfs.LineBreakAccessor</td>
-<td>FORMAT 'CSV' (<em>format param list</em>) </td>
-<td><p>LineBreakAccessor is parallel and faster.</p>
-<p>Use if each logical data row is a physical data line.</p>
-<p>Read + Write </p>
-<p>You cannot use the <code class="ph codeph">HEADER</code> option.</p></td>
-</tr>
-<tr class="odd">
-<td>Plain Text CSV </td>
-<td>org.apache.hawq.pxf.plugins. hdfs.QuotedLineBreakAccessor</td>
-<td>FORMAT 'CSV' (<em>format param list</em>) </td>
-<td><p>QuotedLineBreakAccessor is slower and non-parallel.</p>
-<p>Use if the data includes embedded (quoted) linefeed characters.</p>
-<p>Read Only </p>
-<p>You cannot use the <code class="ph codeph">HEADER</code> option.</p></td>
-</tr>
-<tr class="even">
-<td>SequenceFile</td>
-<td>org.apache.hawq.pxf.plugins. hdfs.SequenceFileAccessor</td>
-<td>FORMAT 'CUSTOM' (formatter='pxfwritable_import')</td>
-<td> Read + Write (use formatter='pxfwritable_export' for write)</td>
-</tr>
-<tr class="odd">
-<td>AvroFile</td>
-<td>org.apache.hawq.pxf.plugins. hdfs.AvroFileAccessor</td>
-<td>FORMAT 'CUSTOM' (formatter='pxfwritable_import')</td>
-<td> Read Only</td>
-</tr>
-</tbody>
-</table>
-
-### <a id="resolver"></a>Resolver
-
-Choose the Resolver format if data records are serialized in the HDFS file.
-
-**Note:** You must include a Profile or a Resolver in the table definition.
-
-<table>
-<colgroup>
-<col width="33%" />
-<col width="33%" />
-<col width="33%" />
-</colgroup>
-<thead>
-<tr class="header">
-<th>Record Serialization</th>
-<th>Resolver</th>
-<th>Comments</th>
-</tr>
-</thead>
-<tbody>
-<tr class="odd">
-<td>Avro</td>
-<td>org.apache.hawq.pxf.plugins. hdfs.AvroResolver</td>
-<td><ul>
-<li>Avro files include the record schema, Avro serialization can be used in other file types (e.g, Sequence File). </li>
-<li>For Avro serialized records outside of an Avro file, include a schema file name (.avsc) in the url under the optional <code class="ph codeph">Schema-Data </code>option.</li>
-<li>Deserialize Only (Read) .</li>
-</ul></td>
-</tr>
-<tr class="even">
-<td>Java Writable</td>
-<td>org.apache.hawq.pxf.plugins. hdfs.WritableResolver</td>
-<td><ul>
-<li>Include the name of the Java class that uses Writable serialization in the URL under the optional <code class="ph codeph">Schema-Data.</code></li>
-<li>The class file must exist in the public stage directory (or in Hadoop's class path).</li>
-<li>Deserialize and Serialize (Read + Write). </li>
-<li>See <a href="#customizedwritableschemafileguidelines">Customized Writable Schema File Guidelines</a>.</li>
-</ul></td>
-</tr>
-<tr class="odd">
-<td>None (plain text)</td>
-<td>org.apache.hawq.pxf.plugins. hdfs.StringPassResolver</td>
-<td><ul>
-<li>Does not serialize plain text records. The database parses plain records. Passes records as they are.</li>
-<li>Deserialize and Serialize (Read + Write).</li>
-</ul></td>
-</tr>
-</tbody>
-</table>
-
-#### <a id="customizedwritableschemafileguidelines"></a>Schema File Guidelines for WritableResolver
-
-When using a WritableResolver, a schema file needs to be defined. The file needs to be a Java class file and must be on the class path of PXF.
-
-The class file must follow the following requirements:
-
-1. Must implement org.apache.hadoop.io.Writable interface.
-2. WritableResolver uses reflection to recreate the schema and populate its fields (for both read and write). Then it uses the Writable interface functions to read/write. Therefore, fields must be public, to enable access to them. Private fields will be ignored.
-3. Fields are accessed and populated in the order in which they are declared in the class file.
-4. Supported field types:
-   - boolean
-   - byte array
-   - double
-   - float
-   - int
-   - long
-   - short
-   - string
-   - Arrays of any of the above types are supported, but the constructor must define the array size so the reflection will work.
-
-### <a id="additionaloptions"></a>Additional Options
-
-<a id="additionaloptions__table_skq_kpz_4p"></a>
-
-<table>
-<caption><span class="tablecap">Table 1. Additional PXF Options</span></caption>
-<colgroup>
-<col width="50%" />
-<col width="50%" />
-</colgroup>
-<thead>
-<tr class="header">
-<th>Option Name</th>
-<th>Description</th>
-</tr>
-</thead>
-<tbody>
-<tr class="odd">
-<td>COLLECTION_DELIM</td>
-<td>(Avro or Hive profiles only.) The delimiter character(s) to place between entries in a top-level array, map, or record field when PXF maps a Hive or Avro complex data type to a text column. The default is a "," character.</td>
-</tr>
-<tr class="even">
-<td>COMPRESSION_CODEC</td>
-<td><ul>
-<li>Useful for WRITABLE PXF tables.</li>
-<li>Specifies the compression codec class name for compressing the written data. The class must implement the org.apache.hadoop.io.compress.CompressionCodec interface.</li>
-<li> Some valid values are org.apache.hadoop.io.compress.DefaultCodec org.apache.hadoop.io.compress.GzipCodec org.apache.hadoop.io.compress.BZip2Codec.</li>
-<li>Note: org.apache.hadoop.io.compress.BZip2Codec runs in a single thread and can be slow.</li>
-<li>This option has no default value. </li>
-<li>When the option is not defined, no compression will be done.</li>
-</ul></td>
-</tr>
-<tr class="odd">
-<td>COMPRESSION_TYPE</td>
-<td><ul>
-<li>Useful WRITABLE PXF tables with SequenceFileAccessor.</li>
-<li>Ignored when COMPRESSION_CODEC is not defined.</li>
-<li>Specifies the compression type for sequence file.</li>
-<li>Valid options are:
-<ul>
-<li>RECORD - only the value part of each row is compressed.</li>
-<li>BLOCK - both keys and values are collected in 'blocks' separately and compressed.</li>
-</ul></li>
-<li>Default value: RECORD.</li>
-</ul></td>
-</tr>
-<tr class="even">
-<td>MAPKEY_DELIM</td>
-<td>(Avro or Hive profiles only.) The delimiter character(s) to place between the key and value of a map entry when PXF maps a Hive or Avro complex data type to a text colum. The default is a ":" character.</td>
-</tr>
-<tr class="odd">
-<td>RECORDKEY_DELIM</td>
-<td>(Avro profile only.) The delimiter character(s) to place between the field name and value of a record entry when PXF maps an Avro complex data type to a text colum. The default is a ":" character.</td>
-</tr>
-<tr class="even">
-<td>SCHEMA-DATA</td>
-<td>The data schema file used to create and read the HDFS file. For example, you could create an avsc (for Avro), or a Java class (for Writable Serialization) file. Make sure that you have added any JAR files containing the schema to <code class="ph codeph">pxf-public.classpath</code>.
-<p>This option has no default value.</p></td>
-</tr>
-<tr class="odd">
-<td>THREAD-SAFE</td>
-<td>Determines if the table query can run in multithread mode or not. When set to FALSE, requests will be handled in a single thread.
-<p>Should be set when a plug-in or other elements that are not thread safe are used (e.g. compression codec).</p>
-<p>Allowed values: TRUE, FALSE. Default value is TRUE - requests can run in multithread mode.</p></td>
-</tr>
-<tr class="even">
-<td> <custom></td>
-<td>Any option added to the pxf URI string will be accepted and passed, along with its value, to the Fragmenter, Accessor, and Resolver implementations.</td>
-</tr>
-</tbody>
-</table>
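To make the write-related options above concrete, here is a minimal hedged sketch (the table name, HDFS path, and NameNode host `namenode:51200` are assumptions for illustration, not taken from the PR) of a writable table that combines the `HdfsTextSimple` profile with `COMPRESSION_CODEC`:

``` sql
-- Sketch only: writes gzip-compressed delimited text to an assumed HDFS directory.
CREATE WRITABLE EXTERNAL TABLE sales_compressed (location text, month text, num_orders int, total_sales float8)
    LOCATION ('pxf://namenode:51200/data/pxf_examples/sales_compressed'
              '?PROFILE=HdfsTextSimple'
              '&COMPRESSION_CODEC=org.apache.hadoop.io.compress.GzipCodec')
    FORMAT 'TEXT' (delimiter=',');

-- Rows inserted into the writable table are written to HDFS as compressed files.
INSERT INTO sales_compressed VALUES ('Prague', 'Jan', 101, 4875.33);
```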
-
-## <a id="accessingdataonahighavailabilityhdfscluster"></a>Accessing Data on a High Availability HDFS Cluster
-
-To access data on a High Availability HDFS cluster, change the authority in the URI in the LOCATION. Use *HA\_nameservice* instead of *name\_node\_host:51200*.
+Display the contents of a text file in HDFS:
+
+``` shell
+$ sudo -u hdfs hdfs dfs -cat /data/exampledir/example.txt
+```
+
+
+## <a id="hdfsplugin_queryextdata"></a>Querying External HDFS Data
+The PXF HDFS plug-in supports the `HdfsTextSimple`, `HdfsTextMulti`, and `Avro` profiles.
+
+Use the following syntax to create a HAWQ external table representing HDFS data:

``` sql
-CREATE [READABLE|WRITABLE] EXTERNAL TABLE <tbl name> (<attr list>)
-LOCATION ('pxf://<HA nameservice>/<path to file or directory>?Profile=profile[&<additional options>=<value>]')
-FORMAT '[TEXT | CSV | CUSTOM]' (<formatting properties>);
+CREATE EXTERNAL TABLE <table_name>
+    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
+LOCATION ('pxf://<host>[:<port>]/<path-to-hdfs-file>
+    ?PROFILE=HdfsTextSimple|HdfsTextMulti|Avro[&<custom-option>=<value>[...]]')
+FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);
```

-The opposite is true when a highly available HDFS cluster is reverted to a single namenode configuration. In that case, any table definition that has the nameservice specified should use the <NN host>:<NN rest port> syntax.
+HDFS-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](../reference/sql/CREATE-EXTERNAL-TABLE.html) call are described in the table below.

-## <a id="recordkeyinkey-valuefileformats"></a>Using a Record Key with Key-Value File Formats
+| Keyword | Value |
+|-------|-------------------------------------|
+| \<host\>[:\<port\>] | The HDFS NameNode and port. |
+| \<path-to-hdfs-file\> | The path to the file in the HDFS data store. |
+| PROFILE | The `PROFILE` keyword must specify one of the values `HdfsTextSimple`, `HdfsTextMulti`, or `Avro`. |
+| \<custom-option\> | \<custom-option\> is profile-specific. Profile-specific options are discussed in the relevant profile topic later in this section. |
+| FORMAT 'TEXT' | Use '`TEXT`' `FORMAT` with the `HdfsTextSimple` profile when \<path-to-hdfs-file\> references a plain text delimited file. |
+| FORMAT 'CSV' | Use '`CSV`' `FORMAT` with the `HdfsTextSimple` and `HdfsTextMulti` profiles when \<path-to-hdfs-file\> references a comma-separated value file. |
+| FORMAT 'CUSTOM' | Use the `CUSTOM` `FORMAT` with the `Avro` profile. The `Avro` '`CUSTOM`' `FORMAT` supports only the built-in `(formatter='pxfwritable_import')` \<formatting-property\>. |
+| \<formatting-properties\> | \<formatting-properties\> are profile-specific. Profile-specific formatting options are discussed in the relevant profile topic later in this section. |

-For sequence file and other file formats that store rows in a key-value format, the key value can be accessed through HAWQ by using the saved keyword '`recordkey`' as a field name.
+*Note*: When creating PXF external tables, you cannot use the `HEADER` option in your `FORMAT` specification.
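For a concrete sense of the `FORMAT 'CUSTOM'` row above, a hedged sketch of an Avro-backed table follows; the file path, NameNode host, and column list are assumptions, and the Avro file's embedded schema would need matching fields:

``` sql
-- Sketch only: the Avro profile reads the schema embedded in the .avro file,
-- so the column list must line up with that schema.
CREATE EXTERNAL TABLE pxf_avro_sketch (id bigint, username text)
    LOCATION ('pxf://namenode:51200/data/pxf_examples/users.avro?PROFILE=Avro')
    FORMAT 'CUSTOM' (formatter='pxfwritable_import');
```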

-The field type must correspond to the key type, much as the other fields must match the HDFS data.
+## <a id="profile_hdfstextsimple"></a>HdfsTextSimple Profile

-WritableResolver supports read and write of recordkey, which can be of the following Writable Hadoop types:
+Use the `HdfsTextSimple` profile when reading plain text delimited or .csv files where each row is a single record.

-- BooleanWritable
-- ByteWritable
-- DoubleWritable
-- FloatWritable
-- IntWritable
-- LongWritable
-- Text
+\<formatting-properties\> supported by the `HdfsTextSimple` profile include:

-If the `recordkey` field is not defined, the key is ignored in read, and a default value (segment id as LongWritable) is written in write.
+| Keyword | Value |
+|-------|-------------------------------------|
+| delimiter | The delimiter character in the file. Default value is a comma `,`. |

-### <a id="example1"></a>Example
+### <a id="profile_hdfstextsimple_query"></a>Example: Using the HdfsTextSimple Profile

-A data schema `Babies.class` contains three fields: (name text, birthday text, weight float). An external table must include these three fields, and can either include or ignore the recordkey.
+Perform the following steps to create a sample data file, copy the file to HDFS, and use the `HdfsTextSimple` profile to create PXF external tables to query the data:

-``` sql
--- writable table with recordkey
-CREATE WRITABLE EXTERNAL TABLE babies_registry (recordkey int, name text, birthday text, weight float)
-    LOCATION ('pxf://namenode_host:51200/babies_1940s'
-              '?ACCESSOR=org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor'
-              '&RESOLVER=org.apache.hawq.pxf.plugins.hdfs.WritableResolver'
-              '&DATA-SCHEMA=Babies')
-    FORMAT 'CUSTOM' (formatter='pxfwritable_export');
-INSERT INTO babies_registry VALUES (123456, "James Paul McCartney", "June 18, 1942", 3.800);
-
--- writable table without recordkey
-CREATE WRITABLE EXTERNAL TABLE babies_registry2 (name text, birthday text, weight float)
-    LOCATION ('pxf://namenode_host:51200/babies_1940s'
-              '?ACCESSOR=org.apache.hawq.pxf.plugins.SequenceFileAccessor'
-              '&RESOLVER=org.apache.hawq.pxf.plugins.WritableResolver'
-              '&DATA-SCHEMA=Babies')
-    FORMAT 'CUSTOM' (formatter='pxfwritable_export');
-
--- this record's key will have some default value
-INSERT INTO babies_registry VALUES ("Richard Starkey", "July 7, 1940", 4.0);
-```
+1. Create an HDFS directory for PXF example data files:

-The same goes for reading data from an existing file with a key-value format, e.g. a Sequence file.
+
+    ``` shell
+    $ sudo -u hdfs hdfs dfs -mkdir -p /data/pxf_examples
+    ```

-``` sql
--- readable table with recordkey
-CREATE EXTERNAL TABLE babies_1940 (recordkey int, name text, birthday text, weight float)
-    LOCATION ('pxf://namenode_host:51200/babies_1940s'
-              '?FRAGMENTER=org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter'
-              '&ACCESSOR=org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor'
-              '&RESOLVER=org.apache.hawq.pxf.plugins.hdfs.WritableResolver'
-              '&DATA-SCHEMA=Babies')
-    FORMAT 'CUSTOM' (formatter='pxfwritable_import');
--- retrieve each record's key
-SELECT * FROM babies_1940;
-
--- readable table without recordkey
-CREATE EXTERNAL TABLE babies_1940_2 (name text, birthday text, weight float)
-    LOCATION ('pxf://namenode_host:51200/babies_1940s'
-              '?FRAGMENTER=org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter'
-              '&ACCESSOR=org.apache.hawq.pxf.plugins.hdfs.SequenceFileAccessor'
-              '&RESOLVER=org.apache.hawq.pxf.plugins.hdfs.WritableResolver'
-              '&DATA-SCHEMA=Babies')
-    FORMAT 'CUSTOM' (formatter='pxfwritable_import');
--- ignores the records' key
-SELECT * FROM babies_1940_2;
-```
+2. Create a delimited plain text data file named `pxf_hdfs_simple.txt`:
+
+    ``` shell
+    $ echo 'Prague,Jan,101,4875.33
+Rome,Mar,87,1557.39
+Bangalore,May,317,8936.99
+Beijing,Jul,411,11600.67' >> pxf_hdfs_simple.txt
--- End diff --

Maybe change the redirect operator to `>` otherwise if a user already has a `pxf_hdfs_simple.txt` file for whatever reason the above will append to it
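The quoted hunk ends at step 2, so the table definition the example presumably builds afterwards is not shown here. Purely as a hedged sketch (the column names, the HDFS destination `/data/pxf_examples/pxf_hdfs_simple.txt`, and the NameNode host are assumptions, and the file would first have to be copied into HDFS with `hdfs dfs -put`), querying the sample data might look like:

``` sql
-- Sketch only: read the sample file through the HdfsTextSimple profile.
CREATE EXTERNAL TABLE pxf_hdfs_textsimple_sketch (location text, month text, num_orders int, total_sales float8)
    LOCATION ('pxf://namenode:51200/data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=HdfsTextSimple')
    FORMAT 'TEXT' (delimiter=',');

SELECT * FROM pxf_hdfs_textsimple_sketch;
```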
> PXF HDFS documentation - restructure content and include more examples
> ----------------------------------------------------------------------
>
>                 Key: HAWQ-1107
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1107
>             Project: Apache HAWQ
>          Issue Type: Improvement
>          Components: Documentation
>            Reporter: Lisa Owen
>            Assignee: David Yozie
>            Priority: Minor
>             Fix For: 2.0.1.0-incubating
>
>
> the current PXF HDFS documentation does not include any runnable examples.
> add runnable examples for all (HdfsTextSimple, HdfsTextMulti, SerialWritable,
> Avro) profiles. restructure the content as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)