[ https://issues.apache.org/jira/browse/HAWQ-435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Goden Yao updated HAWQ-435: --------------------------- Labels: gsoc2016 (was: ) > Support HDFS profile filter push down > ------------------------------------- > > Key: HAWQ-435 > URL: https://issues.apache.org/jira/browse/HAWQ-435 > Project: Apache HAWQ > Issue Type: New Feature > Components: PXF > Reporter: Goden Yao > Assignee: Goden Yao > Labels: gsoc2016 > > *Overview* > When retrieving data via PXF, there're two ways to limit the data it needs to > read: > 1. Defining which file or directories pxf needs to access, using the LOCATION > clause in the table’s creation: > {code:sql} > CREATE EXTERNAL TABLE … LOCATION > (‘pxf://<ip:port>/some/directory/or/files*?PROFILE=...’) … > {code} > PXF supports limited wildcard notation (same as HDFS), there is no support > for full regular expressions. > 2. Querying for a subset of the data using predicates filtering: > e.g. > {code:sql} > SELECT … FROM table WHERE a > 10 ... > {code} > PXF framework supports partial filter pushdown, where it passes the > predicates/filters from pxf to plugins and plugins can use the information to > filter the data before it transfers it back to HAWQ. > Currently, this feature only supports basic operators (=, !=, >, <, >=, <=) > and basic types (int, text, etc.). > *HDFS Filter Pushdown Support* > Users often store their data on HDFS in a hierarchic directory structure. > e.g. > {noformat} > /data/sales / > --- California / > --- --- Palo Alto / > --- --- --- PA01 / > --- --- --- --- Grocery / > --- --- --- --- Produce / > --- --- --- --- Bakery / > ... > --- --- San Diego / > --- --- --- SD05 / > --- --- --- --- Grocery / > --- --- --- --- Produce / > --- --- --- --- Bakery / > {noformat} > We can either 1) create an external table which only contains a subset of the > data we're interested in (so we *filter* the data) e.g. > {code:sql} > CREATE EXTERNAL TABLE sales_CA > (sale_date date, price float, product_id bigint, customer_id bigint -- /* > regular fields */ > state text, city text, store_name text, department text -- /* template > fields */ ) > LOCATION (‘pxf://ip:port/data/sales/California/*?PROFILE=...’) > FORMAT ‘TEXT’ (delimiter “,”); > {code} > However, this is not user friendly and hard to maintain (e.g. need to join > the holistic table with other tables) if users have to create multiple tables > just for the purpose of data filtering. > 2) We still create ONE external table for all data , but support filter > pushdown (this JIRA) to filter the data during query. > {code:sql} > -- still create one external table > CREATE EXTERNAL TABLE sales_by_location > (sale_date date, price float, product_id bigint, customer_id bigint -- /* > regular fields */ > state text, city text, store_name text, department text -- /* template > fields */ ) > LOCATION > (‘pxf://ip:port/data/sales?PROFILE=HdfsTextPartition&TEMPLATE=/state/city/store_name/department/’) > FORMAT ‘TEXT’ (delimiter “,”); > {code} > Then a query for all grocery profits in California will be > {code:sql} > SELECT * FROM sales_by_location WHERE state=”California” AND > department=”Grocery”; > {code} > *Possible Scenarios* > Folder structures can be strings or time series. We should consider > supporting both scenarios but break down the stories to sub-tasks so each > scenario can be verified. -- This message was sent by Atlassian JIRA (v6.3.4#6332)