[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2014-11-11 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207516#comment-14207516
 ] 

Lefty Leverenz commented on HIVE-4963:
--

bq.  Could someone either document this on the Wiki or explain it to me?

The wiki doesn't have a section about PTFs yet, and the description of 
*hive.join.cache.size* hasn't been changed since Hive 0.5.0:  "How many rows in 
the joining tables (except the streaming table) should be cached in memory."

So I'm adding a TODOC12 label.  What should the wiki say?

> Support in memory PTF partitions
> 
>
> Key: HIVE-4963
> URL: https://issues.apache.org/jira/browse/HIVE-4963
> Project: Hive
>  Issue Type: New Feature
>  Components: PTF-Windowing
>Reporter: Harish Butani
>Assignee: Harish Butani
> Fix For: 0.12.0
>
> Attachments: HIVE-4963.D11955.1.patch, HIVE-4963.D12279.1.patch, 
> HIVE-4963.D12279.2.patch, HIVE-4963.D12279.3.patch, PTFRowContainer.patch
>
>
> PTF partitions apply the defensive mode of assuming that partitions will not 
> fit in memory. Because of this there is a significant deserialization 
> overhead when accessing elements. 
> Allow the user to specify that there is enough memory to hold partitions 
> through a 'hive.ptf.partition.fits.in.mem' option.  
> Savings depend on partition size and, in the case of windowing, on the number of 
> UDAFs and the window ranges. For example, in the following (admittedly extreme) 
> case the PTFOperator exec time went from 39 secs to 8 secs.
>  
> {noformat}
> select t, s, i, b, f, d,
> min(t) over(partition by 1 rows between unbounded preceding and current row), 
> min(s) over(partition by 1 rows between unbounded preceding and current row), 
> min(i) over(partition by 1 rows between unbounded preceding and current row), 
> min(b) over(partition by 1 rows between unbounded preceding and current row) 
> from over10k
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2013-09-23 Thread Harish Butani (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13775896#comment-13775896
 ] 

Harish Butani commented on HIVE-4963:
-

Sorry, I forgot to respond.
The original plan was to have the user give a hint on whether partitions fit in 
memory. This would reduce serialization/deserialization cost when 
partitions fit in memory. But based on discussions with Ashutosh, we decided to 
move to using RowContainers for holding rows in a Partition; this way we share 
the same code as Joins and get the functionality and performance benefits of 
RowContainers. PTFPartitions are now controlled by ConfVars.HIVEJOINCACHESIZE; 
use of ConfVars.HIVE_PTF_PARTITION_PERSISTENT_SIZE has been removed.
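
The idea of a row-count-limited container can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Hive's RowContainer: the class name SpillingRowBuffer, its methods, and the use of Java serialization to a temp file are all hypothetical. The cacheSize argument plays the role that a row-count setting such as hive.join.cache.size plays in the comment above: rows up to that count stay in memory, and anything beyond it is serialized to disk and has to be deserialized when read back.

```java
import java.io.Closeable;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT Hive's RowContainer.
class SpillingRowBuffer implements Closeable {
    private final int cacheSize;                 // max rows held in memory (a row count, not bytes)
    private final List<Object[]> inMemory = new ArrayList<>();
    private File spillFile;                      // created lazily on the first spill
    private ObjectOutputStream spillOut;
    private int spilledRows;

    SpillingRowBuffer(int cacheSize) {
        this.cacheSize = cacheSize;
    }

    // Keep the row in memory while there is room; otherwise serialize it to disk.
    void add(Object[] row) throws IOException {
        if (inMemory.size() < cacheSize) {
            inMemory.add(row);
            return;
        }
        if (spillOut == null) {
            spillFile = File.createTempFile("row-buffer", ".spill");
            spillFile.deleteOnExit();
            spillOut = new ObjectOutputStream(new FileOutputStream(spillFile));
        }
        spillOut.writeObject(row);
        spilledRows++;
    }

    int spilledRows() {
        return spilledRows;
    }

    // Returns rows in insertion order; only spilled rows pay the deserialization cost.
    List<Object[]> readAll() throws IOException, ClassNotFoundException {
        List<Object[]> rows = new ArrayList<>(inMemory);
        if (spillOut != null) {
            spillOut.flush();
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(spillFile))) {
                for (int i = 0; i < spilledRows; i++) {
                    rows.add((Object[]) in.readObject());
                }
            }
        }
        return rows;
    }

    @Override
    public void close() throws IOException {
        if (spillOut != null) {
            spillOut.close();
            spillFile.delete();
        }
    }

    public static void main(String[] args) throws Exception {
        try (SpillingRowBuffer buffer = new SpillingRowBuffer(2)) {
            for (int i = 0; i < 5; i++) {
                buffer.add(new Object[] {i, "row-" + i});
            }
            System.out.println("spilled rows: " + buffer.spilledRows());
            System.out.println("total rows:   " + buffer.readAll().size());
        }
    }
}
```

With a cache size at least as large as the partition, no row is ever serialized, which is the case the original 'fits in memory' hint was meant to cover.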


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2013-09-02 Thread Lars Francke (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13755946#comment-13755946
 ] 

Lars Francke commented on HIVE-4963:


Could someone either document this on the Wiki or explain it to me? The 
proposed configuration parameter {{hive.ptf.partition.fits.in.mem}} does not 
seem to be added by this patch. Instead it uses {{hive.join.cache.size}}, 
correct? What are the semantics of this?



[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2013-08-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749764#comment-13749764
 ] 

Hudson commented on HIVE-4963:
--

ABORTED: Integrated in Hive-trunk-hadoop2 #380 (See 
[https://builds.apache.org/job/Hive-trunk-hadoop2/380/])
HIVE-4963 : Support in memory PTF partitions (Harish Butani via Ashutosh 
Chauhan) (hashutosh: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1517236)
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPersistence.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/RowContainer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/PTFTranslator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDeserializer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFLeadLag.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/NPath.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionEvaluator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionResolver.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/WindowingTableFunction.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/persistence/TestPTFRowContainer.java
* /hive/trunk/ql/src/test/queries/clientpositive/ptf_reuse_memstore.q
* /hive/trunk/ql/src/test/queries/clientpositive/windowing_adjust_rowcontainer_sz.q
* /hive/trunk/ql/src/test/results/clientpositive/ptf_reuse_memstore.q.out
* /hive/trunk/ql/src/test/results/clientpositive/windowing_adjust_rowcontainer_sz.q.out




[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2013-08-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749703#comment-13749703
 ] 

Hudson commented on HIVE-4963:
--

FAILURE: Integrated in Hive-trunk-h0.21 #2288 (See 
[https://builds.apache.org/job/Hive-trunk-h0.21/2288/])
HIVE-4963 : Support in memory PTF partitions (Harish Butani via Ashutosh 
Chauhan) (hashutosh: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1517236)
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPersistence.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/RowContainer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/PTFTranslator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDeserializer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFLeadLag.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/NPath.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionEvaluator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionResolver.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/WindowingTableFunction.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/persistence/TestPTFRowContainer.java
* /hive/trunk/ql/src/test/queries/clientpositive/ptf_reuse_memstore.q
* /hive/trunk/ql/src/test/queries/clientpositive/windowing_adjust_rowcontainer_sz.q
* /hive/trunk/ql/src/test/results/clientpositive/ptf_reuse_memstore.q.out
* /hive/trunk/ql/src/test/results/clientpositive/windowing_adjust_rowcontainer_sz.q.out




[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2013-08-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749572#comment-13749572
 ] 

Hudson commented on HIVE-4963:
--

FAILURE: Integrated in Hive-trunk-hadoop1-ptest #137 (See 
[https://builds.apache.org/job/Hive-trunk-hadoop1-ptest/137/])
HIVE-4963 : Support in memory PTF partitions (Harish Butani via Ashutosh 
Chauhan) (hashutosh: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1517236)
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPersistence.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/RowContainer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/PTFTranslator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDeserializer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFLeadLag.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/NPath.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionEvaluator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionResolver.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/WindowingTableFunction.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/persistence/TestPTFRowContainer.java
* /hive/trunk/ql/src/test/queries/clientpositive/ptf_reuse_memstore.q
* /hive/trunk/ql/src/test/queries/clientpositive/windowing_adjust_rowcontainer_sz.q
* /hive/trunk/ql/src/test/results/clientpositive/ptf_reuse_memstore.q.out
* /hive/trunk/ql/src/test/results/clientpositive/windowing_adjust_rowcontainer_sz.q.out




[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2013-08-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749567#comment-13749567
 ] 

Hudson commented on HIVE-4963:
--

FAILURE: Integrated in Hive-trunk-hadoop2-ptest #69 (See 
[https://builds.apache.org/job/Hive-trunk-hadoop2-ptest/69/])
HIVE-4963 : Support in memory PTF partitions (Harish Butani via Ashutosh 
Chauhan) (hashutosh: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1517236)
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPersistence.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/RowContainer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/PTFTranslator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDeserializer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFLeadLag.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/NPath.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionEvaluator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionResolver.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/WindowingTableFunction.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/persistence/TestPTFRowContainer.java
* /hive/trunk/ql/src/test/queries/clientpositive/ptf_reuse_memstore.q
* /hive/trunk/ql/src/test/queries/clientpositive/windowing_adjust_rowcontainer_sz.q
* /hive/trunk/ql/src/test/results/clientpositive/ptf_reuse_memstore.q.out
* /hive/trunk/ql/src/test/results/clientpositive/windowing_adjust_rowcontainer_sz.q.out




[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2013-08-23 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749314#comment-13749314
 ] 

Phabricator commented on HIVE-4963:
---

ashutoshc has accepted the revision "HIVE-4963 [jira] Support in memory PTF 
partitions".

  +1

REVISION DETAIL
  https://reviews.facebook.net/D12279

BRANCH
  HIVE-4963-2

ARCANIST PROJECT
  hive

To: JIRA, ashutoshc, hbutani


> Support in memory PTF partitions
> 
>
> Key: HIVE-4963
> URL: https://issues.apache.org/jira/browse/HIVE-4963
> Project: Hive
>  Issue Type: New Feature
>  Components: PTF-Windowing
>Reporter: Harish Butani
>Assignee: Harish Butani
> Attachments: HIVE-4963.D11955.1.patch, HIVE-4963.D12279.1.patch, 
> HIVE-4963.D12279.2.patch, HIVE-4963.D12279.3.patch, PTFRowContainer.patch
>
>


[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2013-08-21 Thread Edward Capriolo (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13746617#comment-13746617
 ] 

Edward Capriolo commented on HIVE-4963:
---

{quote}
Yes this is much more work to do. More importantly, it's not PTF specific 
either, it's in existing code which Harish has chosen to reuse. I don't think it's 
fair to hold on to this patch for this. It can be done in a follow-up.
{quote}
Agreed. If this extends an existing component that already does it this way, 
changing both is out of scope.

> Support in memory PTF partitions
> 
>
> Key: HIVE-4963
> URL: https://issues.apache.org/jira/browse/HIVE-4963
> Project: Hive
>  Issue Type: New Feature
>  Components: PTF-Windowing
>Reporter: Harish Butani
>Assignee: Harish Butani
> Attachments: HIVE-4963.D11955.1.patch, HIVE-4963.D12279.1.patch, 
> HIVE-4963.D12279.2.patch, PTFRowContainer.patch
>
>


[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2013-08-21 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13746557#comment-13746557
 ] 

Ashutosh Chauhan commented on HIVE-4963:


Harish, also can you get rid of the config variables in HiveConf which were about 
the size of the persistence byte list? Those will become irrelevant after this patch. 
Also, do you think we can word the title of this jira better, so it helps folks 
understand this work?

> Support in memory PTF partitions
> 
>
> Key: HIVE-4963
> URL: https://issues.apache.org/jira/browse/HIVE-4963
> Project: Hive
>  Issue Type: New Feature
>  Components: PTF-Windowing
>Reporter: Harish Butani
> Attachments: HIVE-4963.D11955.1.patch, HIVE-4963.D12279.1.patch, 
> HIVE-4963.D12279.2.patch, PTFRowContainer.patch
>
>


[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2013-08-21 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13746554#comment-13746554
 ] 

Ashutosh Chauhan commented on HIVE-4963:


Yes this is much more work to do. More importantly, it's not PTF specific 
either, it's in existing code which Harish has chosen to reuse. I don't think it's 
fair to hold on to this patch for this. It can be done in a follow-up.

> Support in memory PTF partitions
> 
>
> Key: HIVE-4963
> URL: https://issues.apache.org/jira/browse/HIVE-4963
> Project: Hive
>  Issue Type: New Feature
>  Components: PTF-Windowing
>Reporter: Harish Butani
> Attachments: HIVE-4963.D11955.1.patch, HIVE-4963.D12279.1.patch, 
> HIVE-4963.D12279.2.patch, PTFRowContainer.patch
>
>


[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2013-08-21 Thread Edward Capriolo (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13746535#comment-13746535
 ] 

Edward Capriolo commented on HIVE-4963:
---

{quote}
ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java:89 This I think 
you need to do because current RowContainers can only hold crisp Java objects. 
Seems like we can improve this by writing a RowContainer which can hold 
Writables, thus avoiding unnecessary deserialization and mem-cpy here. 
Something worth exploring as a follow-up issue.
{quote}
Is it much more work to do this now? There are already a number of PTF 
to-be-cleaned-ups and I would hate to add more.

> Support in memory PTF partitions
> 
>
> Key: HIVE-4963
> URL: https://issues.apache.org/jira/browse/HIVE-4963
> Project: Hive
>  Issue Type: Bug
>  Components: PTF-Windowing
>Reporter: Harish Butani
> Attachments: HIVE-4963.D11955.1.patch, HIVE-4963.D12279.1.patch, 
> HIVE-4963.D12279.2.patch, PTFRowContainer.patch
>
>


[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2013-08-21 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13746530#comment-13746530
 ] 

Phabricator commented on HIVE-4963:
---

ashutoshc has commented on the revision "HIVE-4963 [jira] Support in memory PTF 
partitions".

  Seems like there are more opportunities to make this efficient, but those can 
be dug into later. This patch is a step in the right direction by reusing 
existing infra. Any improvements we now make may benefit other spilling 
operators like join too. Really makes me happy : )
  Apart from code comments, I will also request you to add a testcase which 
sets the config value (cachesize) to zero, so that it spills for every record 
and exercises all these new code paths.

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java:89 This I think 
you need to do because current RowContainers can only hold crisp java objects. 
Seems like we can improve this by writing a RowContainer which can hold 
writables, thus avoiding unnecessary deserialization and mem-cpy here. 
Something worth exploring as a follow-up issue.
  ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java:57 This config 
should really govern how much memory we are willing to allocate (in bytes), not 
the number of rows, but that's a topic for another jira since you are reusing 
existing code.
  ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java:148 This sanity 
check is in a tight loop. Ideally we should not have such checks in an inner 
loop. But let's leave it here till we get more confidence in the code. It would 
be good to add a note about what the assumption will be if we are to get rid of 
this check in future.
  ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java:137 Instead of 
try-catch-rethrow, shall we just add throws to the method signature? It makes 
the code more readable and arguably faster.
  ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java:160 Similar 
comment about try-catch-rethrow.
  ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java:80 
Awesome comments!
  ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java:94 
If I get this right, this function will again do serialization before spilling, 
so in case of memory pressure, we are doing a round trip of ser-deser without 
performing useful work. This ties back to my earlier comment on eager 
deserialization.
  This whole mechanism is worth exploring later.
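
The try-catch-rethrow comments above can be illustrated with a small sketch. 
The class and method names are hypothetical and are not taken from the patch; 
wrapping in RuntimeException is also an assumption made only to keep the 
example self-contained.

```java
import java.io.IOException;

public class RethrowExample {

    /** Hypothetical stand-in for a container read that can fail. */
    public static Object readRow(int i) throws IOException {
        if (i < 0) {
            throw new IOException("negative row index: " + i);
        }
        return "row-" + i;
    }

    // The style the review questions: catch, wrap, rethrow.
    public static Object getAtWrapped(int i) {
        try {
            return readRow(i);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // The suggested style: declare the exception and let it propagate.
    public static Object getAtDeclared(int i) throws IOException {
        return readRow(i);
    }
}
```

The second form keeps the original exception type visible to callers and 
removes a layer of wrapping from the stack trace.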

REVISION DETAIL
  https://reviews.facebook.net/D12279

To: JIRA, ashutoshc, hbutani




[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2013-08-21 Thread Edward Capriolo (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13746034#comment-13746034
 ] 

Edward Capriolo commented on HIVE-4963:
---

I have a couple of small comments.

I do not think we need the variable sz. Can't we determine the size from the 
collection? Also, in a couple of places we are using ArrayList on the left side 
of the declaration (rather than List).
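
Both points can be sketched as follows. RowBuffer and its methods are 
hypothetical names used only for illustration; they do not come from the patch.

```java
import java.util.ArrayList;
import java.util.List;

public class RowBuffer {
    // Declared against the List interface; only the constructor call names ArrayList.
    private final List<Object[]> rows = new ArrayList<>();

    public void append(Object[] row) {
        rows.add(row);
    }

    // No separate "sz" counter: the collection already tracks its own size.
    public int size() {
        return rows.size();
    }

    public Object[] getAt(int i) {
        return rows.get(i);
    }
}
```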



[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2013-08-15 Thread Harish Butani (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13741680#comment-13741680
 ] 

Harish Butani commented on HIVE-4963:
-

No, XMLEncoder doesn't honor the transient qualifier. 
http://www.oracle.com/technetwork/java/persistence4-140124.html#transient
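
For readers unfamiliar with the workaround, here is a minimal sketch of the 
general JavaBeans technique: the "transient" attribute is set on a property's 
PropertyDescriptor rather than on the field. PlanDesc, markTransient and 
isMarkedTransient are hypothetical names; whether PTFUtils.makeTransient is 
implemented exactly this way is an assumption.

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TransientProperties {

    /** Hypothetical bean standing in for a plan descriptor such as PTFDesc. */
    public static class PlanDesc {
        private String name;
        private Object cfg; // runtime-only state that should not be encoded

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public Object getCfg() { return cfg; }
        public void setCfg(Object cfg) { this.cfg = cfg; }
    }

    /** Sets the "transient" attribute on the named bean properties. */
    public static void markTransient(Class<?> beanClass, String... propertyNames) {
        Set<String> names = new HashSet<>(Arrays.asList(propertyNames));
        for (PropertyDescriptor pd : descriptors(beanClass)) {
            if (names.contains(pd.getName())) {
                pd.setValue("transient", Boolean.TRUE);
            }
        }
    }

    /** Reports whether the named property currently carries the attribute. */
    public static boolean isMarkedTransient(Class<?> beanClass, String propertyName) {
        for (PropertyDescriptor pd : descriptors(beanClass)) {
            if (pd.getName().equals(propertyName)) {
                return Boolean.TRUE.equals(pd.getValue("transient"));
            }
        }
        return false;
    }

    private static PropertyDescriptor[] descriptors(Class<?> beanClass) {
        try {
            BeanInfo info = Introspector.getBeanInfo(beanClass);
            return info.getPropertyDescriptors();
        } catch (IntrospectionException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The attribute lives on the BeanInfo that the Introspector caches for the 
class, so it applies to every instance of that class rather than to a single 
object.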



[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2013-08-15 Thread Edward Capriolo (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13741670#comment-13741670
 ] 

Edward Capriolo commented on HIVE-4963:
---

Why can't we mark the fields as transient? Do they need to be serialized in 
other contexts? If they need to be serialized sometimes and not others, maybe 
what we need is two different fields?



[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2013-08-15 Thread Harish Butani (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13741665#comment-13741665
 ] 

Harish Butani commented on HIVE-4963:
-

This is to get around the issue of XMLEncoder trying to serialize all fields 
with accessors.




[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2013-08-15 Thread Edward Capriolo (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13740980#comment-13740980
 ] 

Edward Capriolo commented on HIVE-4963:
---

Can you please describe why these calls are needed?

{noformat}
  PTFUtils.makeTransient(PTFDesc.class, "llInfo");
  PTFUtils.makeTransient(PTFDesc.class, "cfg");
{noformat}

This looks like a code-smell. Is there any other way of handling this?




[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2013-08-13 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738834#comment-13738834
 ] 

Ashutosh Chauhan commented on HIVE-4963:


Thanks for the explanation. Sounds good. Let's proceed with this.


[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2013-08-13 Thread Harish Butani (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738822#comment-13738822
 ] 

Harish Butani commented on HIVE-4963:
-

- PTFRecordWriter is needed to provide access to the underlying SeqFile.Writer, 
so that at the time of writing to the Container we can record the locations in 
the file where the individual Blocks start.
- PTFHiveSequenceFileOutputFormat is there so that on a getHiveRecordWriter call 
we return the PTFRecordWriter.
- PTFSequenceFileRecordReader allows the PTFRowContainer to seek to the 
startOffset of the block. So a getAt request that needs to fetch data first 
figures out the Split to read and then seeks to the startOffset, from where the 
RecordReader should start.
- PTFSequenceFileInputFormat is needed to return PTFSequenceFileRecordReader 
in the getRecordReader call.
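
The idea behind these classes (record where each block starts while writing, 
then seek to that offset to serve a getAt) can be sketched without Hadoop. 
The sketch below is a simplified stand-in that uses java.io.RandomAccessFile 
instead of a SequenceFile; BlockIndexedRows and its members are hypothetical 
names, and rows are plain strings for brevity.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class BlockIndexedRows implements AutoCloseable {
    private final RandomAccessFile file;
    private final int rowsPerBlock;
    // File offset at which each block of rows begins, recorded at write time.
    private final List<Long> blockStarts = new ArrayList<>();
    private int rowCount = 0;

    public BlockIndexedRows(int rowsPerBlock) {
        this.rowsPerBlock = rowsPerBlock;
        try {
            File tmp = File.createTempFile("rows", ".bin");
            tmp.deleteOnExit();
            this.file = new RandomAccessFile(tmp, "rw");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public void append(String row) {
        try {
            file.seek(file.length());
            if (rowCount % rowsPerBlock == 0) {
                blockStarts.add(file.getFilePointer()); // a new block starts here
            }
            file.writeUTF(row);
            rowCount++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Random access: seek to the start of the row's block, then scan within it. */
    public String getAt(int i) {
        if (i < 0 || i >= rowCount) {
            throw new IndexOutOfBoundsException("row " + i);
        }
        try {
            file.seek(blockStarts.get(i / rowsPerBlock));
            String row = null;
            for (int remaining = i % rowsPerBlock; remaining >= 0; remaining--) {
                row = file.readUTF();
            }
            return row;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            file.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because rows are variable length, the block offsets are what make random 
access cheap: a lookup reads at most rowsPerBlock rows instead of scanning 
the file from the beginning.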


[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2013-08-13 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738806#comment-13738806
 ] 

Ashutosh Chauhan commented on HIVE-4963:


Thanks a lot [~rhbutani] for digging into this. Much appreciated. I think this 
is the right direction to go. We should eventually get rid of ByteBasedList and 
friends and use this approach. 

One implementation question I have is why you needed to have PTFRecordWriter, 
PTFOutputFormat, PTFInputFormat etc. It seems they don't have any special 
logic. What's the reason we need those and can't simply use 
HiveSequenceFileOutputFormat and friends?


[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2013-08-13 Thread Harish Butani (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738692#comment-13738692
 ] 

Harish Butani commented on HIVE-4963:
-

[~ashutoshc] I have attached a patch with PTFRowContainer, which extends 
RowContainer. PTFRowContainer is different because it needs to provide random 
access to rows. PTFRowContainer would replace the classes in PTFPersistence: 
ByteBasedList, PartitionedByteBasedList... PTFRowContainer does utilize a lot 
of the code from RowContainer; another advantage is that all data is in 1 
SeqFile. Can you please take a look to see if this approach is acceptable? I 
will work on connecting PTFPartition to PTFRowContainer.


[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2013-08-09 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13735068#comment-13735068
 ] 

Ashutosh Chauhan commented on HIVE-4963:


I would also suggest taking a look at how Join Operator handles this. It has the 
same problem to solve and solves it in nearly the same fashion (at least 
conceptually). Instead of building an alternative infra for spilling to disk 
under memory load, it would be better to reuse those classes and mechanisms, if 
possible.


[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2013-08-09 Thread Harish Butani (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13735038#comment-13735038
 ] 

Harish Butani commented on HIVE-4963:
-

We already do this. The rows are accumulated in a ByteBasedList; when it fills 
up it is spilled to disk and a new ByteBasedList is added. So if fewer than 
32 MB are needed (or whatever is set by the user), there is no I/O.
The saving here comes from not holding the objects in a serialized form. 
Currently every field access goes through deserialization. InMemoryPartition 
was going to be the case where the user guarantees that there is enough memory, 
so we just hold the deserialized objects. I am working on a Caching wrapper on 
the PTFPartition which would hold onto deserialized objects, but is backed by 
the serialized bytes in case we run out of memory.
 
But yes, it would be nice to merge these 2 concepts into one thing. There is an 
overhead in Caching over InMemoryPartition: at least an extra serialization, 
potentially more in both time and space. But the overhead may not matter that 
much. Give me a couple of days to work through this.
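
A rough sketch of what such a caching wrapper could look like follows. This is 
not the code from the patch: CachingPartition, dropCache and the long[] row 
type are assumptions made purely for illustration.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class CachingPartition {
    // Serialized form of every row; always present, so the cache can be dropped safely.
    private final List<byte[]> serialized = new ArrayList<>();
    // Deserialized rows; a null entry means "not cached, rebuild from bytes".
    private final List<long[]> cache = new ArrayList<>();
    private int deserializations = 0;

    public void append(long[] row) {
        // The extra serialization mentioned above: paid even if the row stays cached.
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES * row.length);
        for (long v : row) {
            buf.putLong(v);
        }
        serialized.add(buf.array());
        cache.add(row.clone());
    }

    public long[] getAt(int i) {
        long[] row = cache.get(i);
        if (row == null) {
            row = deserialize(serialized.get(i));
            cache.set(i, row);
        }
        return row;
    }

    /** Simulates memory pressure: forget the deserialized objects, keep the bytes. */
    public void dropCache() {
        for (int i = 0; i < cache.size(); i++) {
            cache.set(i, null);
        }
    }

    public int deserializationCount() {
        return deserializations;
    }

    private long[] deserialize(byte[] bytes) {
        deserializations++;
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long[] row = new long[bytes.length / Long.BYTES];
        for (int i = 0; i < row.length; i++) {
            row[i] = buf.getLong();
        }
        return row;
    }
}
```

While the cache is intact, repeated field access costs no deserialization; 
the price is that each row is held in both forms, which is the time and space 
overhead relative to a purely in-memory partition.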


[jira] [Commented] (HIVE-4963) Support in memory PTF partitions

2013-08-08 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734401#comment-13734401
 ] 

Ashutosh Chauhan commented on HIVE-4963:


I have a question: while rows are accumulating we serialize and store them in 
PersistenceByteList (PBL); once they cross the limit (32MB) we spill the list to 
disk. Now by adding this new config, we assume that since the accumulated data 
will fit into memory, we don't need PBL and create a new type of PTFPartition. 
So what we are saving is the serialization into and deserialization out of this 
list. Is that correct? If so, I think a better way might be to not write the 
first 32MB into PBL and just keep them in memory; once they cross the limit, 
serialize them and dump them to disk.
I don't like this new config knob, since the user has no way of knowing when to 
turn the flag on; it depends both on the query and on the data. If we can get 
rid of this knob and do this smartly, that would be really cool.
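
The suggestion can be sketched as a list that keeps rows in memory until a 
limit is crossed and only then serializes them to disk. SpillingRowList is a 
hypothetical name, and the limit here is counted in rows rather than bytes 
purely to keep the sketch short.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class SpillingRowList {
    private final int maxRowsInMemory;
    private final List<String> inMemory = new ArrayList<>();
    private final File spillFile;
    private int spilledRows = 0;
    private int spillCount = 0;

    public SpillingRowList(int maxRowsInMemory) {
        this.maxRowsInMemory = maxRowsInMemory;
        try {
            this.spillFile = File.createTempFile("spill", ".bin");
            this.spillFile.deleteOnExit();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public void add(String row) {
        inMemory.add(row);                 // no serialization on the fast path
        if (inMemory.size() > maxRowsInMemory) {
            spill();                       // serialize only once the limit is crossed
        }
    }

    private void spill() {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(spillFile, true)))) {
            for (String row : inMemory) {
                out.writeUTF(row);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        spilledRows += inMemory.size();
        spillCount++;
        inMemory.clear();
    }

    /** Returns all rows in insertion order: spilled rows first, then buffered ones. */
    public List<String> readAll() {
        List<String> all = new ArrayList<>(spilledRows + inMemory.size());
        if (spilledRows > 0) {
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(spillFile)))) {
                for (int i = 0; i < spilledRows; i++) {
                    all.add(in.readUTF());
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        all.addAll(inMemory);
        return all;
    }

    public int spillCount() {
        return spillCount;
    }
}
```

With a limit of zero every added row is spilled immediately, which is a 
convenient way to exercise the disk path in a test.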
