[jira] [Commented] (HIVE-2050) batch processing partition pruning process
[ https://issues.apache.org/jira/browse/HIVE-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012635#comment-13012635 ]

Namit Jain commented on HIVE-2050:
----------------------------------

+1

> batch processing partition pruning process
> ------------------------------------------
>
>                 Key: HIVE-2050
>                 URL: https://issues.apache.org/jira/browse/HIVE-2050
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>         Attachments: HIVE-2050.2.patch, HIVE-2050.3.patch, HIVE-2050.4.patch, HIVE-2050.patch
>
>
> For partition predicates that cannot be pushed down to JDO filtering
> (HIVE-2049), we should fall back to the old approach of listing all partition
> names first and using Hive's expression evaluation engine to select the correct
> partitions. Then the partition pruner should hand Hive a list of partition
> names and get back a list of Partition objects (this should be added to the
> Hive API).
> A possible optimization is that the partition pruner should give Hive a
> set of ranges of partition names (say [ts=01, ts=11], [ts=20, ts=24]), and
> the JDO query should be formulated as range queries. Range queries are
> possible because the first step lists all partition names in sorted order.
> It is easy to come up with a range, and the JDO range query results are
> guaranteed to be equivalent to those of the query with a list of partition
> names.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
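The range optimization described in the issue can be illustrated with a small sketch. This is not code from any of the attached patches; it only shows, under stated assumptions, how a sorted listing of all partition names plus the set of names selected by the pruner could be collapsed into inclusive ranges such as [ts=01, ts=11], [ts=20, ts=24]. The function name `names_to_ranges` and its parameters are hypothetical.

```python
def names_to_ranges(all_names_sorted, selected):
    """Collapse selected partition names into inclusive (low, high) ranges.

    all_names_sorted: every partition name of the table, in sorted order.
    selected: the set of names the pruner kept.

    A range is extended only while consecutive names in the sorted listing
    are all selected, so the ranges cover exactly the selected partitions.
    """
    ranges = []
    start = prev = None
    for name in all_names_sorted:
        if name in selected:
            if start is None:
                start = name          # open a new range
            prev = name               # extend the current range
        elif start is not None:
            ranges.append((start, prev))  # a gap closes the current range
            start = prev = None
    if start is not None:
        ranges.append((start, prev))      # close the trailing range
    return ranges


# Example with the partition names used in the issue description.
all_names = [f"ts={i:02d}" for i in range(1, 25)]
selected = {f"ts={i:02d}" for i in list(range(1, 12)) + list(range(20, 25))}
print(names_to_ranges(all_names, selected))
# [('ts=01', 'ts=11'), ('ts=20', 'ts=24')]
```

Because a range is closed as soon as an unselected name appears in the sorted listing, a range query over each pair returns the same partitions as a query over the explicit name list, which is the equivalence the description relies on.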
[jira] [Commented] (HIVE-2050) batch processing partition pruning process
[ https://issues.apache.org/jira/browse/HIVE-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012543#comment-13012543 ]

Ning Zhang commented on HIVE-2050:
----------------------------------

Updated the review board. My tests also passed.

> batch processing partition pruning process
> ------------------------------------------
>
>         Attachments: HIVE-2050.2.patch, HIVE-2050.3.patch, HIVE-2050.4.patch, HIVE-2050.patch
[jira] [Commented] (HIVE-2050) batch processing partition pruning process
[ https://issues.apache.org/jira/browse/HIVE-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012339#comment-13012339 ]

Namit Jain commented on HIVE-2050:
----------------------------------

Can you update the review board as well?

> batch processing partition pruning process
> ------------------------------------------
>
>         Attachments: HIVE-2050.2.patch, HIVE-2050.3.patch, HIVE-2050.4.patch, HIVE-2050.patch
[jira] [Commented] (HIVE-2050) batch processing partition pruning process
[ https://issues.apache.org/jira/browse/HIVE-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012283#comment-13012283 ]

Namit Jain commented on HIVE-2050:
----------------------------------

The test pcr.q is failing. Can you take a look? The results look wrong.

> batch processing partition pruning process
> ------------------------------------------
>
>         Attachments: HIVE-2050.2.patch, HIVE-2050.3.patch, HIVE-2050.patch
[jira] [Commented] (HIVE-2050) batch processing partition pruning process
[ https://issues.apache.org/jira/browse/HIVE-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012167#comment-13012167 ]

Namit Jain commented on HIVE-2050:
----------------------------------

+1

> batch processing partition pruning process
> ------------------------------------------
>
>         Attachments: HIVE-2050.2.patch, HIVE-2050.3.patch, HIVE-2050.patch
[jira] [Commented] (HIVE-2050) batch processing partition pruning process
[ https://issues.apache.org/jira/browse/HIVE-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012149#comment-13012149 ]

Namit Jain commented on HIVE-2050:
----------------------------------

Comments posted on the review board.

> batch processing partition pruning process
> ------------------------------------------
>
>         Attachments: HIVE-2050.2.patch, HIVE-2050.patch
[jira] [Commented] (HIVE-2050) batch processing partition pruning process
[ https://issues.apache.org/jira/browse/HIVE-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011518#comment-13011518 ]

Namit Jain commented on HIVE-2050:
----------------------------------

Based on an offline review, this may increase memory usage. We need to return the partition names periodically to put a bound on memory.

> batch processing partition pruning process
> ------------------------------------------
>
>         Attachments: HIVE-2050.patch
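One possible reading of the memory-bound suggestion above is to hand the partition names over in fixed-size batches instead of all at once. The sketch below is an illustration of that idea only, not the approach taken in the patch: `fetch_partitions_in_batches`, the `fetch_by_names` callback, and the default `batch_size` are all hypothetical.

```python
from typing import Callable, Iterable, Iterator, List


def fetch_partitions_in_batches(
    names: Iterable[str],
    fetch_by_names: Callable[[List[str]], List[object]],
    batch_size: int = 1000,
) -> Iterator[object]:
    """Yield partition objects while holding at most batch_size names at once.

    fetch_by_names is a hypothetical callback standing in for whatever call
    turns a list of partition names into partition objects.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[str] = []
    for name in names:
        batch.append(name)
        if len(batch) == batch_size:
            yield from fetch_by_names(batch)  # flush a full batch
            batch = []                        # start a fresh list
    if batch:
        yield from fetch_by_names(batch)      # flush the final partial batch
```

With this shape the caller can consume partitions lazily, so the number of names buffered at any moment is capped by `batch_size` rather than by the number of partitions in the table.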
[jira] [Commented] (HIVE-2050) batch processing partition pruning process
[ https://issues.apache.org/jira/browse/HIVE-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010001#comment-13010001 ]

Ning Zhang commented on HIVE-2050:
----------------------------------

Note that this patch implements a simple API that passes a list of partition names rather than a range of partition names. My performance testing indicates that the bottleneck is not in the JDO query itself. The JDO query that gets the list of all MPartitions takes about 5 seconds for a list of 20k partitions. However, converting these 20k MPartitions to Partitions took about 3 minutes, and committing the transaction took another 3 minutes.

Note that converting MPartitions to Partitions and committing transactions are common operations. Even if we use JDO pushdown (HIVE-2048) or range queries, these costs are still there. We need to optimize these costs away in the next step.

> batch processing partition pruning process
> ------------------------------------------
>
>         Attachments: HIVE-2050.patch
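To make the bottleneck argument above concrete, the reported timings can be turned into shares of total time and per-partition costs. The figures below are the approximate numbers quoted in the comment (about 5 seconds, 3 minutes, and 3 minutes for roughly 20k partitions); the variable names are illustrative only.

```python
# Approximate timings from the comment above, in seconds, for ~20k partitions.
timings_s = {"jdo_query": 5.0, "convert": 180.0, "commit": 180.0}
num_partitions = 20_000

total_s = sum(timings_s.values())

# Fraction of the total time spent in each phase.
shares = {phase: t / total_s for phase, t in timings_s.items()}

# Average cost per partition for each phase, in milliseconds.
per_partition_ms = {phase: 1000.0 * t / num_partitions
                    for phase, t in timings_s.items()}

for phase in timings_s:
    print(f"{phase}: {shares[phase]:.1%} of total, "
          f"{per_partition_ms[phase]:.2f} ms per partition")
```

On these numbers the JDO query accounts for under 2% of the total, while conversion and commit each cost roughly 9 ms per partition, which is why neither pushdown nor range queries alone would remove most of the time.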
[jira] [Commented] (HIVE-2050) batch processing partition pruning process
[ https://issues.apache.org/jira/browse/HIVE-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009993#comment-13009993 ]

Ning Zhang commented on HIVE-2050:
----------------------------------

Passed all unit tests.

> batch processing partition pruning process
> ------------------------------------------
>
>         Attachments: HIVE-2050.patch