Github user sansanichfb commented on a diff in the pull request: https://github.com/apache/incubator-hawq/pull/1353#discussion_r182618597 --- Diff: pxf/pxf-jdbc/README.md --- @@ -1,139 +1,235 @@ -# Accessing Jdbc Table Data +# PXF JDBC plugin -The PXF JDBC plug-in reads data stored in Traditional relational database,ie : mysql,ORACLE,postgresql. +The PXF JDBC plugin allows access to external databases that implement [the Java Database Connectivity API](http://www.oracle.com/technetwork/java/javase/jdbc/index.html). Both read (SELECT) and write (INSERT) operations are supported by the plugin. -PXF-JDBC plug-in is the client of the database, the host running the database engine does not need to -deploy PXF. +The PXF JDBC plugin is a JDBC client. The host running the external database does not need to deploy PXF. -# Prerequisites +## Prerequisites -Check the following before using PXF to access JDBC Table: -* The PXF JDBC plug-in is installed on all cluster nodes. -* The JDBC JAR files are installed on all cluster nodes, and added to file - 'pxf-public.classpath' -* You have tested PXF on HDFS. +Check the following before using the PXF JDBC plugin: -# Using PXF Tables to Query JDBC Table -Jdbc tables are defined in same schema in PXF.The PXF table has the same column name -as Jdbc Table, and the column type requires a mapping of Jdbc-HAWQ. +* The PXF JDBC plugin is installed on all PXF nodes; +* The JDBC driver for the external database is installed on all PXF nodes; +* All PXF nodes are allowed to connect to the external database. -## Syntax Example -The following PXF table definition is valid for Jdbc Table. - CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name - ( column_name data_type [, ...] | LIKE other_table ) - LOCATION ('pxf://namenode[:port]/jdbc-schema-name.jdbc-table-name?<pxf-parameters><&custom-parameters>') - FORMAT 'CUSTOM' (formatter='pxfwritable_import') -If `jdbc-schema-name` is omitted, pxf will default to the `default` schema. 
+## Limitations -The `column_name` must exists in jdbc-table,`data_type` equals or similar to -the jdbc-column type. +The **PXF table** and the **table in the external database** **must have the same definition**. Their columns must have the same names, and the columns' types must correspond. -where `<pxf-parameters>` is: +**Not all data types are supported** by the plugin. The following PXF data types are supported: - [FRAGMENTER=org.apache.hawq.pxf.plugins.jdbc.JdbcPartitionFragmenter - &ACCESSOR=org.apache.hawq.pxf.plugins.jdbc.JdbcReadAccessor - &RESOLVER=org.apache.hawq.pxf.plugins.jdbc.JdbcReadResolver] - | PROFILE=Jdbc +* `INTEGER`, `BIGINT`, `SMALLINT` +* `REAL`, `FLOAT8` +* `NUMERIC` +* `BOOLEAN` +* `VARCHAR`, `BPCHAR`, `TEXT` +* `DATE` +* `TIMESTAMP` +* `BYTEA` -where `<custom-parameters>` is: +The `<full_external_table_name>` (see below) **must not match** the [pattern](https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html) `/.*/[0-9]*-[0-9]*_[0-9]*` (that is, the name must not start with `/` and end with `/` followed by three groups of digits of arbitrary length, the first two separated by `-` and the last two separated by `_`. For example, the following table name is not allowed: `/public.table/1-2_3`). - JDBC_DRIVER=<jdbc-driver-class-name> - &DB_URL=<jdbc-url>&USER=<database-user>&PASS=<password> +At the moment, one PXF external table cannot serve both SELECT and INSERT queries. A separate PXF external table is required for each type of query. -## Jdbc Table to HAWQ Data Type Mapping -Jdbc-table and hawq-table data type system is similar to, does not require -a special type of mapping. -# Usage -The following to mysql, for example, describes the use of PDF-JDBC. -To query MySQL Table in HAWQ, perform the following steps: -1. create Table in MySQL +## Syntax +``` +CREATE [ READABLE | WRITABLE ] EXTERNAL TABLE <table_name> ( + { <column_name> <data_type> [, ...] 
| LIKE <other_table> } +) +LOCATION ( + 'pxf://<full_external_table_name>?<pxf_parameters><jdbc_required_parameters><jdbc_login_parameters><plugin_parameters>' +) +FORMAT 'CUSTOM' (FORMATTER={'pxfwritable_import' | 'pxfwritable_export'}) +``` - mysql> use demodb; - mysql> create table myclass( - id int(4) not null primary key, - name varchar(20) not null, - gender int(4) not null default '0', - degree double(16,2));` -2. insert test data +The **`<pxf_parameters>`** are: +``` +{ +PROFILE=JDBC +| +FRAGMENTER=org.apache.hawq.pxf.plugins.jdbc.JdbcPartitionFragmenter +&ACCESSOR=org.apache.hawq.pxf.plugins.jdbc.JdbcAccessor +&RESOLVER=org.apache.hawq.pxf.plugins.jdbc.JdbcResolver +} +``` - insert into myclass values(1,"tom",1,90); - insert into myclass values(2,'john',0,94); - insert into myclass values(3,'simon',1,79); -3. copy mysql-jdbc jar files to `/usr/lib/pxf` (on all cluster nodes), and -edit `/etc/pxf/conf/pxf-public.classpath` , add : +The **`<jdbc_required_parameters>`** are: +``` +&JDBC_DRIVER=<external_database_jdbc_driver> +&DB_URL=<external_database_url> +``` - /usr/lib/pxf/mysql-connector-java-*.jar +The **`<jdbc_login_parameters>`** are **optional**, but if provided, both of them must be present: +``` +&USER=<external_database_login> +&PASS=<external_database_password> +``` - Restart all pxf-engine. +The **`<plugin_parameters>`** are **optional**. -4. create Table in HAWQ: +The meaning of `BATCH_SIZE` is given in section [batching of INSERT queries](#Batching). - gpadmin=# CREATE EXTERNAL TABLE myclass(id integer, - name text, - gender integer, - degree float8) - LOCATION ('pxf://localhost:51200/demodb.myclass' - '?PROFILE=JDBC' - '&JDBC_DRIVER=com.mysql.jdbc.Driver' - '&DB_URL=jdbc:mysql://192.168.200.6:3306/demodb&USER=root&PASS=root' - ) - FORMAT 'CUSTOM' (Formatter='pxfwritable_import'); +The meaning of other parameters is given in section [partitioning](#Partitioning). 
+``` +{ +&BATCH_SIZE=<batch_size> +| +&PARTITION_BY=<column>:<column_type> +&RANGE=<start_value>:<end_value> +[&INTERVAL=<value>[:<unit>]] +} +``` -MySQL instance IP: 192.168.200.6, port: 3306. -5. query mysql data in HAWQ: +## SELECT queries - gpadmin=# select * from myclass; - gpadmin=# select * from myclass where id=2; +The PXF JDBC plugin allows performing SELECT queries on external tables. -# Jdbc Table Fragments -## intro -PXF-JDBC plug-in as a client to access jdbc database.By default, there is -only one pxf-instance connectied JDBC Table.If the jdbc table data is large, -you can also use multiple pxf-instance to access the JDBC table by fragments. +To perform SELECT queries, create an `EXTERNAL READABLE TABLE` or just an `EXTERNAL TABLE` with `FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import')` in PXF. -## Syntax -where `<custom-parameters>` can use following partition parameters: - - PARTITION_BY=column_name:column_type&RANGE=start_value[:end_value]&INTERVAL=interval_num[:interval_unit] -The `PARTITION_BY` parameter indicates which column to use as the partition column. -It can be split by colon(':'),the `column_type` current supported : `date|int|enum` . -The Date format is `yyyy-MM-dd`. -The `PARTITION_BY` parameter can be null, and there will be only one fragment. - -The `RANGE` parameter indicates the range of data to be queried , it can be split by colon(':'). - The range is left-closed, ie: `>= start_value AND < end_value` . - -The `INTERVAL` parameter can be split by colon(':'), indicate the interval - value of one fragment. When `column_type` is `date`,this parameter must - be split by colon, and `interval_unit` can be `year|month|day`. When - `column_type` is int, the `interval_unit` can be empty. When `column_type` - is enum,the `INTERVAL` parameter can be empty. 
- -The syntax examples is : - - * PARTITION_BY=createdate:date&RANGE=2008-01-01:2010-01-01&INTERVAL=1:month' - * PARTITION_BY=year:int&RANGE=2008:2010&INTERVAL=1 - * PARTITION_BY=grade:enum&RANGE=excellent:good:general:bad - -## Usage -MySQL Table: - - CREATE TABLE sales (id int primary key, cdate date, amt decimal(10,2),grade varchar(30)) -HAWQ Table: - - CREATE EXTERNAL TABLE sales(id integer, - cdate date, - amt float8, - grade text) - LOCATION ('pxf://localhost:51200/sales' - '?PROFILE=JDBC' - '&JDBC_DRIVER=com.mysql.jdbc.Driver' - '&DB_URL=jdbc:mysql://192.168.200.6:3306/demodb&USER=root&PASS=root' - '&PARTITION_BY=cdate:date&RANGE=2008-01-01:2010-01-01&INTERVAL=1:year' - ) - FORMAT 'CUSTOM' (Formatter='pxfwritable_import'); -At PXF-JDBC plugin,this will generate 2 fragments.Then HAWQ assign these fragments to 2 PXF-instance -to access jdbc table data. \ No newline at end of file +The `BATCH_SIZE` parameter is *not used* in such tables. *However*, if this parameter is present, its value will be checked for correctness (it must be an integer). + + +## INSERT queries + +The PXF JDBC plugin allows performing INSERT queries on external tables. + +To perform INSERT queries, create an `EXTERNAL WRITABLE TABLE` with `FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export')` in PXF. + +The `PARTITION_BY`, `RANGE` and `INTERVAL` parameters are ignored in such tables. + + +### Batching + +The INSERT queries can be batched. This may significantly increase performance if batching is supported by the external database. + +If an **external database does not support transactions** and the INSERT of one of the tuples fails, some tuples may still be inserted into the external database. The actual result depends on the external database. + +To enable batching, create an external table with the parameter `BATCH_SIZE` set to: +* `integer > 1`. Batches of the given size will be used; +* `integer < 0`. A batch of infinite size will be used (all tuples will be sent in one huge JDBC query). 
Note that this behaviour **may cause errors**, as each database has its own limit on the size of JDBC queries; +* `0` or `1`. Batching will not be used. + +Batching must be supported by the JDBC driver of the external database. If the driver does not support batching, it will not be used, but the PXF plugin will still try to INSERT the values, and an informational message will be added to the PXF logs. + +By default (`BATCH_SIZE` is absent), batching is not used. + + +## Partitioning + +The PXF JDBC plugin supports simultaneous *read* access to an external table from multiple PXF segments. This feature is called partitioning. + + +### Syntax +Use the following `<plugin_parameters>` (mentioned above) to activate partitioning: + +``` +&PARTITION_BY=<column>:<column_type> +&RANGE=<start_value>:<end_value> +[&INTERVAL=<value>[:<unit>]] +``` + +* The `PARTITION_BY` parameter indicates which column to use as a partition column. Only one column can be used as the partition column. + * The `<column>` is the name of a partition column; + * The `<column_type>` is the data type of a partition column. At the moment, the **supported types** are `INT`, `DATE` and `ENUM`. + +* The `RANGE` parameter indicates the range of data to be queried. + * If the partition type is `ENUM`, the `RANGE` parameter must be a list of values, each of which forms its own fragment; + * If the partition type is `INT` or `DATE`, the `RANGE` parameter must be a finite left-closed range ( `... >= start_value AND ... < end_value`); + * For `DATE` partitions, the date format must be `yyyy-MM-dd`. + +* The `INTERVAL` parameter is **required** for `INT` and `DATE` partitions. It is ignored if `<column_type>` is `ENUM`. + * The `<value>` is the size of each fragment (the last one may be made smaller by the plugin); + * The `<unit>` **must** be provided if `<column_type>` is `DATE`. `year`, `month` and `day` are supported. This parameter is ignored for any other `<column_type>`. 
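+Putting the parameters above together, a readable external table partitioned by date into one-month fragments might be defined as follows. This is a sketch: the host, port, database, credentials, table name and column layout are hypothetical, and the PostgreSQL JDBC driver is assumed to be on the PXF classpath.
+```
+CREATE EXTERNAL TABLE sales_ext (id INTEGER, cdate DATE, amt FLOAT8)
+LOCATION (
+    'pxf://public.sales?PROFILE=JDBC'
+    '&JDBC_DRIVER=org.postgresql.Driver'
+    '&DB_URL=jdbc:postgresql://dbhost:5432/demodb'
+    '&USER=pxfuser&PASS=pxfpass'
+    '&PARTITION_BY=cdate:date&RANGE=2008-01-01:2010-01-01&INTERVAL=1:month'
+)
+FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
+```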
+ +Example partitions: +* `&PARTITION_BY=id:int&RANGE=42:142&INTERVAL=2` +* `&PARTITION_BY=createdate:date&RANGE=2008-01-01:2010-01-01&INTERVAL=1:month` +* `&PARTITION_BY=grade:enum&RANGE=excellent:good:general:bad` + + +### Mechanism + +Note that **by default PXF does not support more than 100 fragments**. --- End diff -- Can you please explain where 100 comes from?
---