[jira] [Commented] (HAWQ-1119) create new documentation topic for PXF writable profiles

ASF GitHub Bot (JIRA) Wed, 02 Nov 2016 08:59:46 -0700

    [ 
https://issues.apache.org/jira/browse/HAWQ-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15629381#comment-15629381
 ]


ASF GitHub Bot commented on HAWQ-1119:
--------------------------------------

Github user lisakowen commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/46#discussion_r86175112
  
    --- Diff: pxf/HDFSWritablePXF.html.md.erb ---
    @@ -0,0 +1,416 @@
    +---
    +title: Writing Data to HDFS
    +---
    +
    +The PXF HDFS plug-in supports writable external tables using the 
`HdfsTextSimple` and `SequenceWritable` profiles.  You might create a writable 
table to export data from a HAWQ internal table to binary or text HDFS files.
    +
    +Use the `HdfsTextSimple` profile when writing text data. Use the 
`SequenceWritable` profile when dealing with binary data.
    +
    +This section describes how to use these PXF profiles to create writable 
external tables.
    +
    +**Note**: Tables that you create with writable profiles can only be used 
for INSERT operations.  If you want to query inserted data, you must define a 
separate external readable table that references the new HDFS file using the 
equivalent readable profile.  ??You can also create a Hive table to access the 
HDFS file.??
    +
    +## <a id="pxfwrite_prereq"></a>Prerequisites
    +
    +Before working with HDFS file data using HAWQ and PXF, ensure that:
    +
    +-   The HDFS plug-in is installed on all cluster nodes. See [Installing 
PXF Plug-ins](InstallPXFPlugins.html) for PXF plug-in installation information.
    +-   All HDFS users have read permissions to HDFS services.
    +-   HDFS write permissions are provided to a restricted set of users.
    +
    +## <a id="hdfsplugin_writeextdata"></a>Writing to PXF External Tables
    +The PXF HDFS plug-in supports two writable profiles: `HdfsTextSimple` and 
`SequenceWritable`.
    +
    +Use the following syntax to create a HAWQ external writable table 
representing HDFS data: 
    +
    +``` sql
    +CREATE WRITABLE EXTERNAL TABLE <table_name> 
    +    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
    +LOCATION ('pxf://<host>[:<port>]/<path-to-hdfs-file>
    +    
?PROFILE=HdfsTextSimple|SequenceWritable[&<custom-option>=<value>[...]]')
    +FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);
    +```
    +
    +HDFS-plug-in-specific keywords and values used in the [CREATE EXTERNAL 
TABLE](../reference/sql/CREATE-EXTERNAL-TABLE.html) call are described in the 
table below.
    +
    +| Keyword  | Value |
    +|-------|-------------------------------------|
    +| \<host\>[:\<port\>]    | The HDFS NameNode and port. |
    +| \<path-to-hdfs-file\>    | The path to the file in the HDFS data store. |
    +| PROFILE    | The `PROFILE` keyword must specify one of the values 
`HdfsTextSimple` or `SequenceWritable`. |
    +| \<custom-option\>  | \<custom-option\> is profile-specific. These 
options are discussed in the next topic.|
    +| FORMAT 'TEXT' | Use '`TEXT`' `FORMAT` with the `HdfsTextSimple` profile 
to create a plain-text-delimited file at the location specified by 
\<path-to-hdfs-file\>. The `HdfsTextSimple` '`TEXT`' `FORMAT` supports only the 
built-in `(delimiter=<delim>)` \<formatting-property\>. |
    +| FORMAT 'CSV' | Use '`CSV`' `FORMAT` with the `HdfsTextSimple` profile to 
create a comma-separated-value file at the location specified by 
\<path-to-hdfs-file\>.  |
    +| FORMAT 'CUSTOM' | Use the `'CUSTOM'` `FORMAT` with the 
`SequenceWritable` profile. The `SequenceWritable` '`CUSTOM`' `FORMAT` supports 
only the built-in `(formatter='pxfwritable_export)` (write) and 
`(formatter='pxfwritable_import)` (read) \<formatting-properties\>.
    +
    +**Note**: When creating PXF external tables, you cannot use the `HEADER` 
option in your `FORMAT` specification.
    +
    +## <a id="profile_hdfstextsimple"></a>Custom Options
    +
    +The `HdfsTextSimple` and `SequenceWritable` profiles support the following 
custom options:
    +
    +| Option  | Value Description | Profile |
    +|-------|-------------------------------------|--------|
    +| COMPRESSION_CODEC    | The compression codec Java class name. If this 
option is not provided, no data compression is performed. Supported compression 
codecs include: `org.apache.hadoop.io.compress.DefaultCodec` and 
`org.apache.hadoop.io.compress.BZip2Codec` | HdfsTextSimple, SequenceWritable |
    +|    |  `org.apache.hadoop.io.compress.GzipCodec` | HdfsTextSimple |
    +| COMPRESSION_TYPE    | The compression type to employ; supported values 
are `RECORD` (the default) or `BLOCK`. | HdfsTextSimple, SequenceWritable |
    +| DATA-SCHEMA    | The name of the writer serialization/deserialization 
class. The jar file in which this class resides must be in the PXF class path. 
This option is required for the `SequenceWritable` profile and has no default 
value. | SequenceWritable|
    +| THREAD-SAFE | Boolean value determining if a table query can run in 
multi-threaded mode. The default value is `TRUE`. Set this option to `FALSE` to 
handle all requests in a single thread for operations that are not thread-safe 
(for example, compression). | HdfsTextSimple, SequenceWritable| 
    +
    +## <a id="profile_hdfstextsimple"></a>HdfsTextSimple Profile
    +
    +Use the `HdfsTextSimple` profile when writing delimited data to a plain 
text file where each row is a single record.
    +
    +Writable tables created using the `HdfsTextSimple` profile can optionally 
use record or block compression. The following compression codecs are supported:
    +
    +- org.apache.hadoop.io.compress.DefaultCodec
    +- org.apache.hadoop.io.compress.GzipCodec
    +- org.apache.hadoop.io.compress.BZip2Codec
    +
    +The `HdfsTextSimple` profile supports the following 
\<formatting-properties\>:
    +
    +| Keyword  | Value |
    +|-------|-------------------------------------|
    +| delimiter    | The delimiter character to use when writing the file. 
Default value is a comma `,`.|
    +
    +
    +### <a id="profile_hdfstextsimple_writing"></a>Example: Writing Data Using 
the HdfsTextSimple Profile
    +
    +This example uses the data schema introduced in [Example: Using the 
HdfsTextSimple Profile] (HDFSFileDataPXF.html#profile_hdfstextsimple_query):
    +
    +
    +| Field Name  | Data Type |
    +|-------|-------------------------------------|
    +| location | text |
    +| month | text |
    +| number\_of\_orders | int |
    +| total\_sales | float8 |
    +
    +The example also uses the HAWQ table `pxf_hdfs_textsimple` created in that 
exercise and expects it to exist.
    +
    +
    +Perform the following operations to use the PXF `HdfsTextSimple` profile 
to create a HAWQ writable external table with the same data schema as defined 
above. You will also create a separate external readable table to read the 
associated HDFS file.
    +
    +1. Create a writable HAWQ external table with the data schema described 
above. Write to the HDFS file 
`/data/pxf_examples/pxfwritable_hdfs_textsimple1`. Create the table specifying 
a comma `,` as the delimiter:
    +
    +    ``` sql
    +    gpadmin=# CREATE WRITABLE EXTERNAL TABLE 
pxf_hdfs_writabletbl_1(location text, month text, num_orders int, total_sales 
float8)
    +                LOCATION 
('pxf://namenode:51200/data/pxf_examples/pxfwritable_hdfs_textsimple1?PROFILE=HdfsTextSimple')
    +              FORMAT 'TEXT' (delimiter=E',');
    +    ```
    +    
    +    The `FORMAT` subclause `delimiter` value is specified as the single 
ascii comma character `,`. `E` escapes the character.
    +
    +2. Write a few records to the `pxfwritable_hdfs_textsimple1` HDFS file by 
invoking the SQL `INSERT` command on `pxf_hdfs_writabletbl_1`:
    +
    +    ``` sql
    +    gpadmin=# INSERT INTO pxf_hdfs_writabletbl_1 VALUES ( 'Frankfurt', 
'Mar', 777, 3956.98 );
    +    gpadmin=# INSERT INTO pxf_hdfs_writabletbl_1 VALUES ( 'Cleveland', 
'Oct', 3812, 96645.37 );
    +    ```
    +
    +3. Insert the contents of the `pxf_hdfs_textsimple` table created in 
[Example: Using the HdfsTextSimple Profile] 
(HDFSFileDataPXF.html#profile_hdfstextsimple_query) into 
`pxf_hdfs_writabletbl_1`:
    +
    +    ``` sql
    +    gpadmin=# INSERT INTO pxf_hdfs_writabletbl_1 SELECT * FROM 
pxf_hdfs_textsimple;
    +    ```
    +
    +4. View the file contents in HDFS:
    +
    +    ``` shell
    +    $ hdfs dfs -cat /data/pxf_examples/pxfwritable_hdfs_textsimple1/*
    +    Frankfurt,Mar,777,3956.98
    +    Cleveland,Oct,3812,96645.37
    +    Prague,Jan,101,4875.33
    +    Rome,Mar,87,1557.39
    +    Bangalore,May,317,8936.99
    +    Beijing,Jul,411,11600.67
    +    ```
    +
    +    Because you specified comma `,` as the delimiter, this character is 
the field separator used in each record of the HDFS file.
    +
    +5. Querying an external writable table is not supported in HAWQ. To query 
data from the newly-created HDFS file, create a readable external HAWQ table 
referencing the HDFS file:
    +
    +    ``` sql
    +    gpadmin=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple_r1(location text, 
month text, num_orders int, total_sales float8)
    +                LOCATION 
('pxf://namenode:51200/data/pxf_examples/pxfwritable_hdfs_textsimple1?PROFILE=HdfsTextSimple')
    +                       FORMAT 'CSV';
    +    ```
    +
    +    The table is created with the `'CSV'` `FORMAT` because the delimiter 
character used when creating the writable table was a comma `,`.
    +
    +6. Query the readable external table `pxf_hdfs_textsimple_r1`:
    +
    +    ``` sql
    +    gpadmin=# SELECT * FROM pxf_hdfs_textsimple_r1;
    +    ```
    +
    +    ``` pre
    +     location  | month | num_orders | total_sales
    +    -----------+-------+------------+-------------
    +      Frankfurt | Mar   |        777 |     3956.98
    +      Cleveland | Oct   |       3812 |    96645.37
    +      Prague    | Jan   |        101 |     4875.33
    +      Rome      | Mar   |         87 |     1557.39
    +      Bangalore | May   |        317 |     8936.99
    +      Beijing   | Jul   |        411 |    11600.67
    +    (6 rows)
    +    ```
    +
    +    The table includes the records you individually inserted, as well as 
the full contents of the `pxf_hdfs_textsimple` table.
    +
    +7. Create a second HAWQ external writable table, this time using Gzip 
compression and employing a colon `:` as the delimiter:
    +
    +    ``` sql
    +    gpadmin=# CREATE WRITABLE EXTERNAL TABLE pxf_hdfs_writabletbl_2 
(location text, month text, num_orders int, total_sales float8)
    +                LOCATION 
('pxf://namenode:51200/data/pxf_examples/pxfwritable_hdfs_textsimple2?PROFILE=HdfsTextSimple&COMPRESSION_CODEC=org.apache.hadoop.io.compress.GzipCodec')
    +              FORMAT 'TEXT' (delimiter=E':');
    +    ```
    +
    +8. Write a few records to the `pxfwritable_hdfs_textsimple2` HDFS file by 
inserting into the `pxf_hdfs_writabletbl_2` table:
    +
    +    ``` sql
    +    gpadmin=# INSERT INTO pxf_hdfs_writabletbl_2 VALUES ( 'Frankfurt', 
'Mar', 777, 3956.98 );
    +    gpadmin=# INSERT INTO pxf_hdfs_writabletbl_2 VALUES ( 'Cleveland', 
'Oct', 3812, 96645.37 );
    +    ```
    +
    +9. View the file contents in HDFS; use the `-text` option to `hdfs dfs` to 
view the compressed data as text:
    +
    +    ``` shell
    +    $ hdfs dfs -text /data/pxf_examples/pxfwritable_hdfs_textsimple2/*
    +    Frankfurt:Mar:777:3956.98
    +    Cleveland:Oct:3812:96645.3
    +    ```
    +
    +    Notice that the colon `:` is the field separator in the HDFS file.
    +
    +    As described in Step 5 above, to query data from the newly-created 
HDFS file named `pxfwritable_hdfs_textsimple2`, you can create a readable 
external HAWQ table referencing this HDFS file.
    +
    +
    +## <a id="profile_hdfsseqwritable_using"></a>SequenceWritable Profile
    +
    +Use the HDFS plug-in `SequenceWritable` profile when writing SequenceFile 
format files. Files of this type consist of binary key/value pairs. Sequence 
files are a common data transfer format between MapReduce jobs. 
    +
    +SequenceFile format files can optionally use record or block compression. 
The following compression codecs are supported:
    +
    +- org.apache.hadoop.io.compress.DefaultCodec
    +- org.apache.hadoop.io.compress.BZip2Codec
    +
    +When using the `SequenceWritable` profile to write a SequenceFile format 
file, you must provide the name of the Java class used to serialize/deserialize 
the data. This class will provide custom read and write methods for the fields 
in the schema associated with the data you will be writing.
    +
    +### <a id="profile_hdfsseqwritable_writing"></a>Example: Writing Using the 
SequenceWritable Profile
    +
    +Before creating a writable external table with the `SequenceWritable` 
profile, you will first create a Java class named `PxfExample_CustomWritable` 
that will serialize/deserialize the fields in the sample schema used in 
previous examples.
    +
    +Perform the following steps to create the Java class and writable table.
    +
    +1. Prepare to create the sample Java class:
    +
    +    ``` shell
    +    $ mkdir -p pxfex/com/hawq/example/pxf/hdfs/writable/dataschema
    +    $ cd pxfex/com/hawq/example/pxf/hdfs/writable/dataschema
    +    $ vi PxfExample_CustomWritable.java
    +    ```
    +
    +2. Copy and paste the following text into the 
`PxfExample_CustomWritable.java` file:
    +
    +    ``` java
    +    package com.hawq.example.pxf.hdfs.writable.dataschema;
    +    
    +    import org.apache.hadoop.io.*;
    +import java.io.DataInput;
    +import java.io.DataOutput;
    +import java.io.IOException;
    +import java.lang.reflect.Field;
    +
    +    /**
    + * PxfExample_CustomWritable class - used to serialize and deserialize 
data with
    + * text, int, and float data types
    + */
    +public class PxfExample_CustomWritable implements Writable {
    +
    +    public String st1, st2;
    +    public int int1;
    +    public float ft;
    +
    +    public PxfExample_CustomWritable() {
    +        st1 = new String("");
    +        st2 = new String("");
    +        int1 = 0;
    +        ft = 0.f;
    +    }
    +
    +    public PxfExample_CustomWritable(int i1, int i2, int i3) {
    +
    +        st1 = new String("short_string___" + i1);
    +        st2 = new String("short_string___" + i1);
    +        int1 = i2;
    +        ft = i1 * 10.f * 2.3f;
    +
    +    }
    +
    +    String GetSt1() {
    +        return st1;
    +    }
    +
    +    String GetSt2() {
    +        return st2;
    +    }
    +
    +    int GetInt1() {
    +        return int1;
    +    }
    +    
    +    float GetFt() {
    +        return ft;
    +    }
    +
    +    @Override
    +    public void write(DataOutput out) throws IOException {
    +
    +        Text txt = new Text();
    +        txt.set(st1);
    +        txt.write(out);
    +        txt.set(st2);
    +        txt.write(out);
    +
    +        IntWritable intw = new IntWritable();
    +        intw.set(int1);
    +        intw.write(out);
    +
    +        FloatWritable fw = new FloatWritable();
    +        fw.set(ft);
    +        fw.write(out);
    +    }
    +
    +    @Override
    +    public void readFields(DataInput in) throws IOException {
    +
    +        Text txt = new Text();
    +        txt.readFields(in);
    +        st1 = txt.toString();
    +        txt.readFields(in);
    +        st2 = txt.toString();
    +
    +        IntWritable intw = new IntWritable();
    +        intw.readFields(in);
    +        int1 = intw.get();
    +
    +        FloatWritable fw = new FloatWritable();
    +        fw.readFields(in);
    +        ft = fw.get();
    +    }
    +
    +    public void printFieldTypes() {
    +        Class myClass = this.getClass();
    +        Field[] fields = myClass.getDeclaredFields();
    +
    +        for (int i = 0; i < fields.length; i++) {
    +            System.out.println(fields[i].getType().getName());
    +        }
    +    }
    +}
    +    ```
    +
    +3. Compile and create a Java class jar file for 
`PxfExample_CustomWritable`:
    +
    +    ``` shell
    +    $ javac -classpath /usr/hdp/2.5.0.0-1245/hadoop/hadoop-common.jar  
PxfExample_CustomWritable.java
    +    $ cd ../../../../../../../
    +    $ jar cf pxfex-customwritable.jar com
    +    $ cp pxfex-customwritable.jar /tmp/
    +    ```
    +
    +4. Include the new jar file in the PXF Agent class path by updating the 
`pxf-public.classpath`. If you use Ambari to manage your cluster, add the 
following line via the Ambari UI and restart the PXF Agent:
    +
    +    ``` pre
    +    /tmp/pxfex-customwritable.jar
    +    ```
    +
    +    If you have a command-line-managed HAWQ cluster, perform the following 
steps on each node in your HAWQ cluster:
    +   - Directly edit `/etc/pxf/conf/pxf-public.classpath` and add the line 
above.
    +   - Restart the PXF Agent:
    +
    +        ``` shell
    +        $ sudo service pxf-service restart
    +        ```
    +
    +5. Use the PXF `SequenceWritable` profile to create a writable HAWQ 
external table. Identify the serialization/deserialization Java class you 
created above in the `DATA-SCHEMA` \<custom-option\>. Use `BLOCK` mode 
compression with `BZip2` when creating the writable table.
    +
    +    ``` sql
    +    gpadmin=# CREATE WRITABLE EXTERNAL TABLE pxf_tbl_seqwrit (location 
text, month text, number_of_orders integer, total_sales real)
    +                LOCATION 
('pxf://namenode:51200/data/pxf_examples/pxf_seqwrit_file?PROFILE=SequenceWritable&DATA-SCHEMA=com.hawq.example.pxf.hdfs.writable.dataschema.PxfExample_CustomWritable&COMPRESSION_TYPE=BLOCK&COMPRESSION_CODEC=org.apache.hadoop.io.compress.BZip2Codec')
    +              FORMAT 'CUSTOM' (formatter='pxfwritable_export');
    +    ```
    +
    +    Notice that the `'CUSTOM'` `FORMAT` \<formatting-properties\> specify 
the built-in `pxfwritable_export` formatter.
    +
    +6. Insert some data into `pxf_tbl_seqwrit`:
    +
    +    ``` sql
    +    gpadmin=# INSERT INTO pxf_tbl_seqwrit VALUES ( 'Frankfurt', 'Mar', 
777, 3956.98 );
    +    gpadmin=# INSERT INTO pxf_tbl_seqwrit VALUES ( 'Cleveland', 'Oct', 
3812, 96645.37 );
    +    ```
    +
    +6. Recall that querying an external writable table is not supported in 
HAWQ. To read the newly-created writable table, create a HAWQ readable external 
table referencing the writable table's HDFS file:
    +
    +    ``` sql
    +    gpadmin=# CREATE EXTERNAL TABLE read_pxf_tbl_seqwrit (location text, 
month text, number_of_orders integer, total_sales real)
    +                LOCATION 
('pxf://namenode:51200/data/pxf_examples/pxf_seqwrit_file?PROFILE=SequenceWritable&DATA-SCHEMA=com.hawq.example.pxf.hdfs.writable.dataschema.PxfExample_CustomWritable')
    +              FORMAT 'CUSTOM' (formatter='pxfwritable_import');
    +    ```
    +    The `DATA-SCHEMA` \<custom-option\> must be specified when reading an 
HDFS file via the `SequenceWritable` profile. Compression-related options need 
not be provided.
    +    
    +6. Query the readable external table `read_pxf_tbl_seqwrit`:
    +
    +    ``` sql
    +    gpadmin=# SELECT * FROM read_pxf_tbl_seqwrit;
    +    ```
    +    
    +    ``` pre
    +     location  | month | number_of_orders | total_sales 
    +    -----------+-------+------------------+-------------
    +     Frankfurt | Mar   |              777 |     3956.98
    +     Cleveland | Oct   |             3812 |     96645.4
    +    (2 rows)
    +    ```
    +
    +7. To read the `pxf_seqwrit_file` HDFS file via Hive:
    +
    +    IS THIS A VALID USE CASE???  COULD NOT GET THIS TO WORK.
    +
    +## <a id="recordkeyinkey-valuefileformats"></a>Reading the Record Key 
    +
    +Sequence file and other file formats that store rows in a key-value format 
can access the key value through HAWQ by using the `recordkey` keyword as a 
field name.
    +
    +The field type of `recordkey` must correspond to the key type, much as the 
other fields must match the HDFS data. 
    +
    +`recordkey` can be any of the following Hadoop types:
    +
    +-   BooleanWritable
    +-   ByteWritable
    +-   DoubleWritable
    +-   FloatWritable
    +-   IntWritable
    +-   LongWritable
    +-   Text
    +
    +### <a id="example1"></a>Example: Using Record Keys
    +
    +Create an external readable table to access the record keys from the 
writable table `XXX` we created in SequenceFile format in the section above.  
The `recordkey` is of type `int8`. (We did not identify a record key when 
creating the writable table.
    +
    +``` sql
    +gpadmin=# CREATE  EXTERNAL TABLE read_pxf_tbl_seqwrit_RECKEY (recordkey 
int8, location text, month text, number_of_orders integer, total_sales real)
    +                LOCATION 
('pxf://namenode:51200/data/pxf_examples/pxf_seqwrit_file?PROFILE=SequenceWritable&DATA-SCHEMA=com.hawq.example.pxf.hdfs.writable.dataschema.PxfExample_CustomWritable')
    +          FORMAT 'CUSTOM' (formatter='pxfwritable_import');
    +gpadmin=# SELECT * FROM read_pxf_tbl_seqwrit_RECKEY;
    +```
    +
    +``` pre
    + recordkey |  location   | month | number_of_orders | total_sales 
    +-----------+-------------+-------+------------------+-------------
    +         0 | Frankfurt   | Mar   |              777 |     3956.98
    +         0 | Cleveland   | Oct   |             3812 |     96645.4
    +(2 rows)
    +```
    +
    +The `recordkey` is display as `0`; a record key was not identified when 
you inserted entries in the writable table.
    --- End diff --
    
    yep, i need to rework the example a bit.


> create new documentation topic for PXF writable profiles
> --------------------------------------------------------
>
>                 Key: HAWQ-1119
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1119
>             Project: Apache HAWQ
>          Issue Type: Improvement
>          Components: Documentation
>            Reporter: Lisa Owen
>            Assignee: David Yozie
>             Fix For: 2.0.1.0-incubating
>
>
> certain profiles supported by the existing PXF plug-ins support writable 
> tables.  create some documentation content for these profiles.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HAWQ-1119) create new documentation topic for PXF writable profiles

Reply via email to