Github user sansanichfb commented on a diff in the pull request: https://github.com/apache/incubator-hawq/pull/1353#discussion_r182618597 --- Diff: pxf/pxf-jdbc/README.md --- @@ -1,139 +1,235 @@ -# Accessing Jdbc Table Data +# PXF JDBC plugin -The PXF JDBC plug-in reads data stored in Traditional relational database,ie : mysql,ORACLE,postgresql. +The PXF JDBC plugin allows access to external databases that implement [the Java Database Connectivity API](http://www.oracle.com/technetwork/java/javase/jdbc/index.html). Both read (SELECT) and write (INSERT) operations are supported by the plugin. -PXF-JDBC plug-in is the client of the database, the host running the database engine does not need to -deploy PXF. +The PXF JDBC plugin is a JDBC client. The host running the external database does not need to deploy PXF. -# Prerequisites +## Prerequisites -Check the following before using PXF to access JDBC Table: -* The PXF JDBC plug-in is installed on all cluster nodes. -* The JDBC JAR files are installed on all cluster nodes, and added to file - 'pxf-public.classpath' -* You have tested PXF on HDFS. +Check the following before using the PXF JDBC plugin: -# Using PXF Tables to Query JDBC Table -Jdbc tables are defined in same schema in PXF.The PXF table has the same column name -as Jdbc Table, and the column type requires a mapping of Jdbc-HAWQ. +* The PXF JDBC plugin is installed on all PXF nodes; +* The JDBC driver for the external database is installed on all PXF nodes; +* All PXF nodes are allowed to connect to the external database. -## Syntax Example -The following PXF table definition is valid for Jdbc Table. - CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name - ( column_name data_type [, ...] | LIKE other_table ) - LOCATION ('pxf://namenode[:port]/jdbc-schema-name.jdbc-table-name?<pxf-parameters><&custom-parameters>') - FORMAT 'CUSTOM' (formatter='pxfwritable_import') -If `jdbc-schema-name` is omitted, pxf will default to the `default` schema. 
+## Limitations -The `column_name` must exists in jdbc-table,`data_type` equals or similar to -the jdbc-column type. +The **PXF table** and the **table in the external database** **must have the same definition**. Their columns must have the same names, and the columns' types must correspond. -where `<pxf-parameters>` is: +**Not all data types are supported** by the plugin. The following PXF data types are supported: - [FRAGMENTER=org.apache.hawq.pxf.plugins.jdbc.JdbcPartitionFragmenter - &ACCESSOR=org.apache.hawq.pxf.plugins.jdbc.JdbcReadAccessor - &RESOLVER=org.apache.hawq.pxf.plugins.jdbc.JdbcReadResolver] - | PROFILE=Jdbc +* `INTEGER`, `BIGINT`, `SMALLINT` +* `REAL`, `FLOAT8` +* `NUMERIC` +* `BOOLEAN` +* `VARCHAR`, `BPCHAR`, `TEXT` +* `DATE` +* `TIMESTAMP` +* `BYTEA` -where `<custom-parameters>` is: +The `<full_external_table_name>` (see below) **must not match** the [pattern](https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html) `/.*/[0-9]*-[0-9]*_[0-9]*` (that is, the name must not start with `/` and end with `/` followed by three groups of digits of arbitrary length, the first two separated by `-` and the last two separated by `_`. For example, the following table name is not allowed: `/public.table/1-2_3`). - JDBC_DRIVER=<jdbc-driver-class-name> - &DB_URL=<jdbc-url>&USER=<database-user>&PASS=<password> +At the moment, one PXF external table cannot serve both SELECT and INSERT queries. A separate PXF external table is required for each type of query. -## Jdbc Table to HAWQ Data Type Mapping -Jdbc-table and hawq-table data type system is similar to, does not require -a special type of mapping. -# Usage -The following to mysql, for example, describes the use of PDF-JDBC. -To query MySQL Table in HAWQ, perform the following steps: -1. create Table in MySQL +## Syntax +``` +CREATE [ READABLE | WRITABLE ] EXTERNAL TABLE <table_name> ( + { <column_name> <data_type> [, ...] 
| LIKE <other_table> } +) +LOCATION ( + 'pxf://<full_external_table_name>?<pxf_parameters><jdbc_required_parameters><jdbc_login_parameters><plugin_parameters>' +) +FORMAT 'CUSTOM' (FORMATTER={'pxfwritable_import' | 'pxfwritable_export'}) +``` - mysql> use demodb; - mysql> create table myclass( - id int(4) not null primary key, - name varchar(20) not null, - gender int(4) not null default '0', - degree double(16,2));` -2. insert test data +The **`<pxf_parameters>`** are: +``` +{ +PROFILE=JDBC +| +FRAGMENTER=org.apache.hawq.pxf.plugins.jdbc.JdbcPartitionFragmenter +&ACCESSOR=org.apache.hawq.pxf.plugins.jdbc.JdbcAccessor +&RESOLVER=org.apache.hawq.pxf.plugins.jdbc.JdbcResolver +} +``` - insert into myclass values(1,"tom",1,90); - insert into myclass values(2,'john',0,94); - insert into myclass values(3,'simon',1,79); -3. copy mysql-jdbc jar files to `/usr/lib/pxf` (on all cluster nodes), and -edit `/etc/pxf/conf/pxf-public.classpath` , add : +The **`<jdbc_required_parameters>`** are: +``` +&JDBC_DRIVER=<external_database_jdbc_driver> +&DB_URL=<external_database_url> +``` - /usr/lib/pxf/mysql-connector-java-*.jar +The **`<jdbc_login_parameters>`** are **optional**, but if provided, both of them must be present: +``` +&USER=<external_database_login> +&PASS=<external_database_password> +``` - Restart all pxf-engine. +The **`<plugin_parameters>`** are **optional**. -4. create Table in HAWQ: +The meaning of `BATCH_SIZE` is given in section [batching of INSERT queries](#Batching). - gpadmin=# CREATE EXTERNAL TABLE myclass(id integer, - name text, - gender integer, - degree float8) - LOCATION ('pxf://localhost:51200/demodb.myclass' - '?PROFILE=JDBC' - '&JDBC_DRIVER=com.mysql.jdbc.Driver' - '&DB_URL=jdbc:mysql://192.168.200.6:3306/demodb&USER=root&PASS=root' - ) - FORMAT 'CUSTOM' (Formatter='pxfwritable_import'); +The meaning of other parameters is given in section [partitioning](#Partitioning). 
+``` +{ +&BATCH_SIZE=<batch_size> +| +&PARTITION_BY=<column>:<column_type> +&RANGE=<start_value>:<end_value> +[&INTERVAL=<value>[:<unit>]] +} +``` -MySQL instance IP: 192.168.200.6, port: 3306. -5. query mysql data in HAWQ: +## SELECT queries - gpadmin=# select * from myclass; - gpadmin=# select * from myclass where id=2; +The PXF JDBC plugin allows performing SELECT queries on external tables. -# Jdbc Table Fragments -## intro -PXF-JDBC plug-in as a client to access jdbc database.By default, there is -only one pxf-instance connectied JDBC Table.If the jdbc table data is large, -you can also use multiple pxf-instance to access the JDBC table by fragments. +To perform SELECT queries, create an `EXTERNAL READABLE TABLE` or just an `EXTERNAL TABLE` with `FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import')` in PXF. -## Syntax -where `<custom-parameters>` can use following partition parameters: - - PARTITION_BY=column_name:column_type&RANGE=start_value[:end_value]&INTERVAL=interval_num[:interval_unit] -The `PARTITION_BY` parameter indicates which column to use as the partition column. -It can be split by colon(':'),the `column_type` current supported : `date|int|enum` . -The Date format is `yyyy-MM-dd`. -The `PARTITION_BY` parameter can be null, and there will be only one fragment. - -The `RANGE` parameter indicates the range of data to be queried , it can be split by colon(':'). - The range is left-closed, ie: `>= start_value AND < end_value` . - -The `INTERVAL` parameter can be split by colon(':'), indicate the interval - value of one fragment. When `column_type` is `date`,this parameter must - be split by colon, and `interval_unit` can be `year|month|day`. When - `column_type` is int, the `interval_unit` can be empty. When `column_type` - is enum,the `INTERVAL` parameter can be empty. 
- -The syntax examples is : - - * PARTITION_BY=createdate:date&RANGE=2008-01-01:2010-01-01&INTERVAL=1:month' - * PARTITION_BY=year:int&RANGE=2008:2010&INTERVAL=1 - * PARTITION_BY=grade:enum&RANGE=excellent:good:general:bad - -## Usage -MySQL Table: - - CREATE TABLE sales (id int primary key, cdate date, amt decimal(10,2),grade varchar(30)) -HAWQ Table: - - CREATE EXTERNAL TABLE sales(id integer, - cdate date, - amt float8, - grade text) - LOCATION ('pxf://localhost:51200/sales' - '?PROFILE=JDBC' - '&JDBC_DRIVER=com.mysql.jdbc.Driver' - '&DB_URL=jdbc:mysql://192.168.200.6:3306/demodb&USER=root&PASS=root' - '&PARTITION_BY=cdate:date&RANGE=2008-01-01:2010-01-01&INTERVAL=1:year' - ) - FORMAT 'CUSTOM' (Formatter='pxfwritable_import'); -At PXF-JDBC plugin,this will generate 2 fragments.Then HAWQ assign these fragments to 2 PXF-instance -to access jdbc table data. \ No newline at end of file +The `BATCH_SIZE` parameter is *not used* in such tables. *However*, if this parameter is present, its value will be checked for correctness (it must be an integer). + + +## INSERT queries + +The PXF JDBC plugin allows performing INSERT queries on external tables. + +To perform INSERT queries, create an `EXTERNAL WRITABLE TABLE` with `FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export')` in PXF. + +The `PARTITION_BY`, `RANGE` and `INTERVAL` parameters are ignored in such tables. + + +### Batching + +The INSERT queries can be batched. This may significantly increase performance if batching is supported by the external database. + +If an **external database does not support transactions** and the INSERT of one of the tuples fails, some tuples may still be inserted into the external database. The actual result depends on the external database. + +To enable batching, create an external table with the parameter `BATCH_SIZE` set to: +* `integer > 1`. Batches of the given size will be used; +* `integer < 0`. A batch of infinite size will be used (all tuples will be sent in one huge JDBC query). 
Note that this behaviour **may cause errors**, as each database has its own limit on the size of JDBC queries; +* `0` or `1`. Batching will not be used. + +Batching must be supported by the JDBC driver of the external database. If the driver does not support batching, it will not be used, but the PXF plugin will still try to INSERT the values, and an informational message will be added to the PXF logs. + +By default (`BATCH_SIZE` is absent), batching is not used. + + +## Partitioning + +The PXF JDBC plugin supports simultaneous *read* access to an external table from multiple PXF segments. This feature is called partitioning. + + +### Syntax +Use the following `<plugin_parameters>` (mentioned above) to activate partitioning: + +``` +&PARTITION_BY=<column>:<column_type> +&RANGE=<start_value>:<end_value> +[&INTERVAL=<value>[:<unit>]] +``` + +* The `PARTITION_BY` parameter indicates which column to use as a partition column. Only one column can be used as the partition column. + * The `<column>` is the name of a partition column; + * The `<column_type>` is the data type of a partition column. At the moment, the **supported types** are `INT`, `DATE` and `ENUM`. + +* The `RANGE` parameter indicates the range of data to be queried. + * If the partition type is `ENUM`, the `RANGE` parameter must be a list of values, each of which forms its own fragment; + * If the partition type is `INT` or `DATE`, the `RANGE` parameter must be a finite left-closed range ( `... >= start_value AND ... < end_value`); + * For `DATE` partitions, the date format must be `yyyy-MM-dd`. + +* The `INTERVAL` parameter is **required** for `INT` and `DATE` partitions. It is ignored if `<column_type>` is `ENUM`. + * The `<value>` is the size of each fragment (the last one may be made smaller by the plugin); + * The `<unit>` **must** be provided if `<column_type>` is `DATE`. `year`, `month` and `day` are supported. This parameter is ignored for any other `<column_type>`. 
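+Putting the parameters above together, a readable external table partitioned by date into one-month fragments might be defined as follows. This is a sketch: the host, port, database, credentials, table name and column layout are hypothetical, and the PostgreSQL JDBC driver is assumed to be on the PXF classpath.
+```
+CREATE EXTERNAL TABLE sales_ext (id INTEGER, cdate DATE, amt FLOAT8)
+LOCATION (
+    'pxf://public.sales?PROFILE=JDBC'
+    '&JDBC_DRIVER=org.postgresql.Driver'
+    '&DB_URL=jdbc:postgresql://dbhost:5432/demodb'
+    '&USER=pxfuser&PASS=pxfpass'
+    '&PARTITION_BY=cdate:date&RANGE=2008-01-01:2010-01-01&INTERVAL=1:month'
+)
+FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
+```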
+ +Example partitions: +* `&PARTITION_BY=id:int&RANGE=42:142&INTERVAL=2` +* `&PARTITION_BY=createdate:date&RANGE=2008-01-01:2010-01-01&INTERVAL=1:month` +* `&PARTITION_BY=grade:enum&RANGE=excellent:good:general:bad` + + +### Mechanism + +Note that **by default PXF does not support more than 100 fragments**. --- End diff -- Can you please explain where 100 comes from?
---