[ 
https://issues.apache.org/jira/browse/HAWQ-435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Goden Yao updated HAWQ-435:
---------------------------
    Fix Version/s: backlog

> Support HDFS profile filter push down
> -------------------------------------
>
>                 Key: HAWQ-435
>                 URL: https://issues.apache.org/jira/browse/HAWQ-435
>             Project: Apache HAWQ
>          Issue Type: New Feature
>          Components: PXF
>            Reporter: Goden Yao
>            Assignee: Goden Yao
>              Labels: gsoc2016
>             Fix For: backlog
>
>
> *Overview*
> When retrieving data via PXF, there're two ways to limit the data it needs to 
> read:
> 1. Defining which file or directories pxf needs to access, using the LOCATION 
> clause in the table’s creation:
> {code:sql}
> CREATE EXTERNAL TABLE … LOCATION 
> (‘pxf://<ip:port>/some/directory/or/files*?PROFILE=...’) …
> {code}
> PXF supports limited wildcard notation (same as HDFS), there is no support 
> for full regular expressions.
> 2. Querying for a subset of the data using predicates filtering:
> e.g.
> {code:sql}
> SELECT … FROM table WHERE a > 10 ...
> {code}
> PXF framework supports partial filter pushdown, where it passes the 
> predicates/filters from pxf to plugins and plugins can use the information to 
> filter the data before it transfers it back to HAWQ.
> Currently, this feature only supports basic operators (=, !=, >, <, >=, <=) 
> and basic types (int, text, etc.).
> *HDFS Filter Pushdown Support*
> Users often store their data on HDFS in a hierarchic directory structure. 
> e.g.
> {noformat}
> /data/sales /
> --- California /
> --- --- Palo Alto /
> --- --- ---  PA01 / 
> --- --- --- ---  Grocery /
> --- --- --- ---  Produce /
> --- --- --- ---  Bakery / 
> ...
> --- --- San Diego /
> --- --- ---  SD05 / 
> --- --- --- ---  Grocery /
> --- --- --- ---  Produce /
> --- --- --- ---  Bakery / 
> {noformat}
> We can either 1) create an external table which only contains a subset of the 
> data we're interested in (so we *filter* the data) e.g.
> {code:sql}
> CREATE EXTERNAL TABLE sales_CA 
> (sale_date date, price float, product_id bigint, customer_id bigint -- /* 
> regular fields */
>  state text, city text, store_name text, department text -- /* template 
> fields */ )
> LOCATION (‘pxf://ip:port/data/sales/California/*?PROFILE=...’)
> FORMAT ‘TEXT’ (delimiter “,”);
> {code}
> However, this is not user friendly and hard to maintain (e.g. need to join 
> the holistic table with other tables) if users have to create multiple tables 
> just for the purpose of data filtering.
> 2) We still create ONE external table for all data , but support filter 
> pushdown (this JIRA) to filter the data during query.
> {code:sql}
> -- still create one external table
> CREATE EXTERNAL TABLE sales_by_location 
> (sale_date date, price float, product_id bigint, customer_id bigint -- /* 
> regular fields */
>  state text, city text, store_name text, department text -- /* template 
> fields */ )
> LOCATION 
> (‘pxf://ip:port/data/sales?PROFILE=HdfsTextPartition&TEMPLATE=/state/city/store_name/department/’)
> FORMAT ‘TEXT’ (delimiter “,”);
> {code}
> Then a query for all grocery profits in California will be
> {code:sql}
> SELECT * FROM sales_by_location WHERE state=”California” AND 
> department=”Grocery”;
> {code}
> *Possible Scenarios*
> Folder structures can be strings or time series. We should consider 
> supporting both scenarios but break down the stories to sub-tasks so each 
> scenario can be verified.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to