[ 
https://issues.apache.org/jira/browse/HAWQ-435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Goden Yao updated HAWQ-435:
---------------------------
    Description: 
*Overview*
When retrieving data via PXF, there are two ways to limit the data PXF needs to 
read:

1. Defining which files or directories PXF needs to access, using the LOCATION 
clause when the table is created:
{code:sql}
CREATE EXTERNAL TABLE … LOCATION 
('pxf://<ip:port>/some/directory/or/files*?PROFILE=...') …
{code}

PXF supports limited wildcard notation (the same as HDFS); there is no support 
for full regular expressions.

2. Querying for a subset of the data using predicate filtering:
e.g.
{code:sql}
SELECT … FROM table WHERE a > 10 ...
{code}

The PXF framework supports partial filter pushdown: it passes the 
predicates/filters from PXF to the plugins, and a plugin can use that 
information to filter the data before transferring it back to HAWQ.

Currently, this feature only supports basic operators (=, !=, >, <, >=, <=) and 
basic types (int, text, etc.).
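
As an illustration of what a plugin can do with these pushed-down conditions, here is a minimal sketch that applies the basic operators above to a record before it is sent back to HAWQ. This is not the actual PXF plugin API; the Condition/RecordFilter names are hypothetical, and it assumes the encoded filter string has already been parsed into (column, operator, constant) triples.
{code:java}
import java.util.List;
import java.util.Map;

// Hypothetical, simplified form of one pushed-down condition such as "a > 10".
class Condition {
    final String column;
    final String op;          // one of =, !=, >, <, >=, <=
    final Comparable value;   // the constant from the WHERE clause (int, text, ...)

    Condition(String column, String op, Comparable value) {
        this.column = column;
        this.op = op;
        this.value = value;
    }

    @SuppressWarnings("unchecked")
    boolean matches(Comparable fieldValue) {
        int cmp = fieldValue.compareTo(value);
        switch (op) {
            case "=":  return cmp == 0;
            case "!=": return cmp != 0;
            case ">":  return cmp > 0;
            case "<":  return cmp < 0;
            case ">=": return cmp >= 0;
            case "<=": return cmp <= 0;
            default:   throw new IllegalArgumentException("unsupported operator: " + op);
        }
    }
}

// A plugin keeps a record only if every condition holds (AND semantics), so
// non-matching rows are never transferred back to HAWQ.
class RecordFilter {
    static boolean keep(Map<String, Comparable> record, List<Condition> conditions) {
        for (Condition c : conditions) {
            Comparable v = record.get(c.column);
            if (v == null || !c.matches(v)) {
                return false;
            }
        }
        return true;
    }
}
{code}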

*HDFS Filter Pushdown Support*
Users often store their data on HDFS in a hierarchical directory structure, 
e.g.
{noformat}
/data/sales/
--- California/
--- --- Palo Alto/
--- --- --- PA01/
--- --- --- --- Grocery/
--- --- --- --- Produce/
--- --- --- --- Bakery/
...
--- --- San Diego/
--- --- --- SD05/
--- --- --- --- Grocery/
--- --- --- --- Produce/
--- --- --- --- Bakery/
{noformat}

We can either 1) create an external table that contains only the subset of the 
data we're interested in (i.e. we *filter* the data through the table 
definition), e.g.
{code:sql}
CREATE EXTERNAL TABLE sales_CA (
  sale_date date, price float, product_id bigint, customer_id bigint, -- regular fields
  state text, city text, store_name text, department text             -- template fields
)
LOCATION ('pxf://ip:port/data/sales/California/*?PROFILE=...')
FORMAT 'TEXT' (delimiter ',');
{code}
However, this is not user friendly and is hard to maintain (e.g. when the 
holistic table needs to be joined with other tables) if users have to create 
multiple tables just for the purpose of filtering data.

2) Or we create ONE external table for all the data, but support filter 
pushdown (this JIRA) to filter the data at query time.
The root directory defined as the table's path will be "/data/sales/". The 
template will be of the form "/state/city/store_name/department/", and the 
table will have additional fields "state", "city", "store_name" and 
"department" that correspond to the template components.

{code:sql}
-- still create one external table
CREATE EXTERNAL TABLE sales_by_location (
  sale_date date, price float, product_id bigint, customer_id bigint, -- regular fields
  state text, city text, store_name text, department text             -- template fields
)
LOCATION 
('pxf://ip:port/data/sales?PROFILE=HdfsTextPartition&TEMPLATE=/state/city/store_name/department/')
FORMAT 'TEXT' (delimiter ',');
{code}
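
To make the template idea concrete, here is a minimal sketch (class and method names are illustrative, not part of PXF) of how a profile could derive the template field values for a data file from its path relative to the table's root directory:
{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: map the directory components of a data file's path onto the template
// fields. For TEMPLATE=/state/city/store_name/department/ and the file
// /data/sales/California/Palo Alto/PA01/Grocery/part-0001, this yields
// state=California, city=Palo Alto, store_name=PA01, department=Grocery.
class TemplateResolver {
    static Map<String, String> resolve(String root, String template, String filePath) {
        String[] fields = template.replaceAll("^/|/$", "").split("/");
        String relative = filePath.substring(root.length()).replaceAll("^/", "");
        String[] parts = relative.split("/");

        Map<String, String> values = new LinkedHashMap<>();
        // The last path component is the data file itself; the directories
        // before it supply one value per template field.
        for (int i = 0; i < fields.length && i < parts.length - 1; i++) {
            values.put(fields[i], parts[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(resolve("/data/sales",
                "/state/city/store_name/department/",
                "/data/sales/California/Palo Alto/PA01/Grocery/part-0001"));
        // -> {state=California, city=Palo Alto, store_name=PA01, department=Grocery}
    }
}
{code}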

Then a query for all grocery profits in California would be:
{code:sql}
SELECT * FROM sales_by_location WHERE state = 'California' AND 
department = 'Grocery';
{code}
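
Because both predicates reference only template fields, a profile supporting this pushdown would not need to open any file outside the matching directories; conceptually it only has to keep the paths under /data/sales/California/*/*/Grocery/. A rough sketch of that pruning step, reusing the hypothetical TemplateResolver above:
{code:java}
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch: keep only the data file paths whose template field values satisfy the
// pushed-down equality predicates (state=California AND department=Grocery).
class PartitionPruner {
    static List<String> prune(List<String> filePaths,
                              Map<String, String> equalityPredicates,
                              String root, String template) {
        return filePaths.stream()
                .filter(path -> {
                    Map<String, String> values = TemplateResolver.resolve(root, template, path);
                    return equalityPredicates.entrySet().stream()
                            .allMatch(p -> p.getValue().equals(values.get(p.getKey())));
                })
                .collect(Collectors.toList());
    }
}
{code}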

*Possible Scenarios*
Folder structure components can be plain strings or time-series values. We 
should consider supporting both scenarios, but break the story down into 
sub-tasks so that each scenario can be verified.
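
For the time-series case, one plausible layout (an assumption, not something this JIRA defines) is a date-based template such as /year/month/day/, where a range predicate like sale_date >= '2016-01-01' could also be turned into directory pruning. A minimal sketch of the per-directory check:
{code:java}
import java.time.LocalDate;

// Sketch for an assumed date-based template /year/month/day/: decide whether a
// leaf directory (e.g. .../2016/01/15/) falls inside a pushed-down date range.
class DateRangePruner {
    static boolean inRange(String year, String month, String day,
                           LocalDate from, LocalDate to) {
        LocalDate dir = LocalDate.of(Integer.parseInt(year),
                                     Integer.parseInt(month),
                                     Integer.parseInt(day));
        return !dir.isBefore(from) && !dir.isAfter(to);
    }
}
{code}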



> Support HDFS profile filter push down
> -------------------------------------
>
>                 Key: HAWQ-435
>                 URL: https://issues.apache.org/jira/browse/HAWQ-435
>             Project: Apache HAWQ
>          Issue Type: New Feature
>          Components: PXF
>            Reporter: Goden Yao
>            Assignee: Goden Yao
>              Labels: gsoc2016
>             Fix For: backlog
>


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
