[ https://issues.apache.org/jira/browse/HAWQ-435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Goden Yao updated HAWQ-435: --------------------------- Description: *Overview* When retrieving data via PXF, there're two ways to limit the data it needs to read: 1. Defining which file or directories pxf needs to access, using the LOCATION clause in the table’s creation: {code:sql} CREATE EXTERNAL TABLE … LOCATION (‘pxf://<ip:port>/some/directory/or/files*?PROFILE=...’) … {code} PXF supports limited wildcard notation (same as HDFS), there is no support for full regular expressions. 2. Querying for a subset of the data using predicates filtering: e.g. {code:sql} SELECT … FROM table WHERE a > 10 ... {code} PXF framework supports partial filter pushdown, where it passes the predicates/filters from pxf to plugins and plugins can use the information to filter the data before it transfers it back to HAWQ. Currently, this feature only supports basic operators (=, !=, >, <, >=, <=) and basic types (int, text, etc.). *HDFS Filter Pushdown Support* Users often store their data on HDFS in a hierarchic directory structure. e.g. {noformat} /data/sales / --- California / --- --- Palo Alto / --- --- --- PA01 / --- --- --- --- Grocery / --- --- --- --- Produce / --- --- --- --- Bakery / ... --- --- San Diego / --- --- --- SD05 / --- --- --- --- Grocery / --- --- --- --- Produce / --- --- --- --- Bakery / {noformat} We can either 1) create an external table which only contains a subset of the data we're interested in (so we *filter* the data) e.g. {code:sql} CREATE EXTERNAL TABLE sales_CA (sale_date date, price float, product_id bigint, customer_id bigint -- /* regular fields */ state text, city text, store_name text, department text -- /* template fields */ ) LOCATION (‘pxf://ip:port/data/sales/California/*?PROFILE=...’) FORMAT ‘TEXT’ (delimiter “,”); {code} However, this is not user friendly and hard to maintain (e.g. need to join the holistic table with other tables) if users have to create multiple tables just for the purpose of data filtering. 2) We still create ONE external table for all data , but support filter pushdown (this JIRA) to filter the data during query. The root directory that defines as the table’s path will be “/data/sales/”. A template will be of the form “/state/city/store_name/department/” and the table will have additional fields “state”, “city”, “store_name” and “department” that will correspond to the template. {code:sql} -- still create one external table CREATE EXTERNAL TABLE sales_by_location (sale_date date, price float, product_id bigint, customer_id bigint -- /* regular fields */ state text, city text, store_name text, department text -- /* template fields */ ) LOCATION (‘pxf://ip:port/data/sales?PROFILE=HdfsTextPartition&TEMPLATE=/state/city/store_name/department/’) FORMAT ‘TEXT’ (delimiter “,”); {code} Then a query for all grocery profits in California will be {code:sql} SELECT * FROM sales_by_location WHERE state=”California” AND department=”Grocery”; {code} *Possible Scenarios* Folder structures can be strings or time series. We should consider supporting both scenarios but break down the stories to sub-tasks so each scenario can be verified. was: *Overview* When retrieving data via PXF, there're two ways to limit the data it needs to read: 1. Defining which file or directories pxf needs to access, using the LOCATION clause in the table’s creation: {code:sql} CREATE EXTERNAL TABLE … LOCATION (‘pxf://<ip:port>/some/directory/or/files*?PROFILE=...’) … {code} PXF supports limited wildcard notation (same as HDFS), there is no support for full regular expressions. 2. Querying for a subset of the data using predicates filtering: e.g. {code:sql} SELECT … FROM table WHERE a > 10 ... {code} PXF framework supports partial filter pushdown, where it passes the predicates/filters from pxf to plugins and plugins can use the information to filter the data before it transfers it back to HAWQ. Currently, this feature only supports basic operators (=, !=, >, <, >=, <=) and basic types (int, text, etc.). *HDFS Filter Pushdown Support* Users often store their data on HDFS in a hierarchic directory structure. e.g. {noformat} /data/sales / --- California / --- --- Palo Alto / --- --- --- PA01 / --- --- --- --- Grocery / --- --- --- --- Produce / --- --- --- --- Bakery / ... --- --- San Diego / --- --- --- SD05 / --- --- --- --- Grocery / --- --- --- --- Produce / --- --- --- --- Bakery / {noformat} We can either 1) create an external table which only contains a subset of the data we're interested in (so we *filter* the data) e.g. {code:sql} CREATE EXTERNAL TABLE sales_CA (sale_date date, price float, product_id bigint, customer_id bigint -- /* regular fields */ state text, city text, store_name text, department text -- /* template fields */ ) LOCATION (‘pxf://ip:port/data/sales/California/*?PROFILE=...’) FORMAT ‘TEXT’ (delimiter “,”); {code} However, this is not user friendly and hard to maintain (e.g. need to join the holistic table with other tables) if users have to create multiple tables just for the purpose of data filtering. 2) We still create ONE external table for all data , but support filter pushdown (this JIRA) to filter the data during query. {code:sql} -- still create one external table CREATE EXTERNAL TABLE sales_by_location (sale_date date, price float, product_id bigint, customer_id bigint -- /* regular fields */ state text, city text, store_name text, department text -- /* template fields */ ) LOCATION (‘pxf://ip:port/data/sales?PROFILE=HdfsTextPartition&TEMPLATE=/state/city/store_name/department/’) FORMAT ‘TEXT’ (delimiter “,”); {code} Then a query for all grocery profits in California will be {code:sql} SELECT * FROM sales_by_location WHERE state=”California” AND department=”Grocery”; {code} *Possible Scenarios* Folder structures can be strings or time series. We should consider supporting both scenarios but break down the stories to sub-tasks so each scenario can be verified. > Support HDFS profile filter push down > ------------------------------------- > > Key: HAWQ-435 > URL: https://issues.apache.org/jira/browse/HAWQ-435 > Project: Apache HAWQ > Issue Type: New Feature > Components: PXF > Reporter: Goden Yao > Assignee: Goden Yao > Labels: gsoc2016 > Fix For: backlog > > > *Overview* > When retrieving data via PXF, there're two ways to limit the data it needs to > read: > 1. Defining which file or directories pxf needs to access, using the LOCATION > clause in the table’s creation: > {code:sql} > CREATE EXTERNAL TABLE … LOCATION > (‘pxf://<ip:port>/some/directory/or/files*?PROFILE=...’) … > {code} > PXF supports limited wildcard notation (same as HDFS), there is no support > for full regular expressions. > 2. Querying for a subset of the data using predicates filtering: > e.g. > {code:sql} > SELECT … FROM table WHERE a > 10 ... > {code} > PXF framework supports partial filter pushdown, where it passes the > predicates/filters from pxf to plugins and plugins can use the information to > filter the data before it transfers it back to HAWQ. > Currently, this feature only supports basic operators (=, !=, >, <, >=, <=) > and basic types (int, text, etc.). > *HDFS Filter Pushdown Support* > Users often store their data on HDFS in a hierarchic directory structure. > e.g. > {noformat} > /data/sales / > --- California / > --- --- Palo Alto / > --- --- --- PA01 / > --- --- --- --- Grocery / > --- --- --- --- Produce / > --- --- --- --- Bakery / > ... > --- --- San Diego / > --- --- --- SD05 / > --- --- --- --- Grocery / > --- --- --- --- Produce / > --- --- --- --- Bakery / > {noformat} > We can either 1) create an external table which only contains a subset of the > data we're interested in (so we *filter* the data) e.g. > {code:sql} > CREATE EXTERNAL TABLE sales_CA > (sale_date date, price float, product_id bigint, customer_id bigint -- /* > regular fields */ > state text, city text, store_name text, department text -- /* template > fields */ ) > LOCATION (‘pxf://ip:port/data/sales/California/*?PROFILE=...’) > FORMAT ‘TEXT’ (delimiter “,”); > {code} > However, this is not user friendly and hard to maintain (e.g. need to join > the holistic table with other tables) if users have to create multiple tables > just for the purpose of data filtering. > 2) We still create ONE external table for all data , but support filter > pushdown (this JIRA) to filter the data during query. > The root directory that defines as the table’s path will be “/data/sales/”. A > template will be of the form “/state/city/store_name/department/” and the > table will have additional fields “state”, “city”, “store_name” and > “department” that will correspond to the template. > {code:sql} > -- still create one external table > CREATE EXTERNAL TABLE sales_by_location > (sale_date date, price float, product_id bigint, customer_id bigint -- /* > regular fields */ > state text, city text, store_name text, department text -- /* template > fields */ ) > LOCATION > (‘pxf://ip:port/data/sales?PROFILE=HdfsTextPartition&TEMPLATE=/state/city/store_name/department/’) > FORMAT ‘TEXT’ (delimiter “,”); > {code} > Then a query for all grocery profits in California will be > {code:sql} > SELECT * FROM sales_by_location WHERE state=”California” AND > department=”Grocery”; > {code} > *Possible Scenarios* > Folder structures can be strings or time series. We should consider > supporting both scenarios but break down the stories to sub-tasks so each > scenario can be verified. -- This message was sent by Atlassian JIRA (v6.3.4#6332)