[ https://issues.apache.org/jira/browse/ORC-299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750632#comment-16750632 ]
Yulei Yang edited comment on ORC-299 at 1/24/19 2:26 AM:
---------------------------------------------------------
We met this issue in Hive 1.2.1 and resolved it by reducing the value of
orc.row.index.stride.
The root cause is that the row group is too large to build a dictionary for.
You can see another exception with the same root cause:
{noformat}
Caused by: java.lang.ArrayIndexOutOfBoundsException: xxxxx
    at org.apache.hadoop.hive.ql.io.orc.DynamicByteArray.add(DynamicByteArray.java:115)
{noformat}
BTW, setting hive.exec.orc.dictionary.key.size.threshold=0 does not work.
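For anyone hitting the same thing, here is a minimal sketch (our own code,
not from the ORC project) of applying a smaller stride through the Hive
1.2.x writer API; the path, schema, and the value 5000 are illustrative
only, and at the table level the same knob is the orc.row.index.stride
table property:
{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Writer;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;

public class SmallStrideWriter {
    static class Row {
        String key;
        Row(String key) { this.key = key; }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        ObjectInspector inspector = ObjectInspectorFactory.getReflectionObjectInspector(
            Row.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
        // Shrink the stride from the default 10000; this is the workaround
        // that resolved the failure for us.
        Writer writer = OrcFile.createWriter(new Path("/tmp/orc299-demo.orc"),
            OrcFile.writerOptions(conf)
                .inspector(inspector)
                .rowIndexStride(5000));
        writer.addRow(new Row("example"));
        writer.close();
    }
}
{noformat}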
> Improve heuristics for bailing on dictionary encoding
> -----------------------------------------------------
>
> Key: ORC-299
> URL: https://issues.apache.org/jira/browse/ORC-299
> Project: ORC
> Issue Type: Improvement
> Reporter: Chris Drome
> Priority: Major
>
> Recently a user ran into the following failure:
> {noformat}
> Caused by: java.lang.NullPointerException
>     at java.lang.System.arraycopy(Native Method)
>     at org.apache.hadoop.hive.ql.io.orc.DynamicByteArray.add(DynamicByteArray.java:115)
>     at org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.addNewKey(StringRedBlackTree.java:48)
>     at org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.add(StringRedBlackTree.java:55)
>     at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.write(WriterImpl.java:1250)
>     at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StructTreeWriter.write(WriterImpl.java:1797)
>     at org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:2469)
>     at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:86)
>     at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:753)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
>     at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
>     at org.apache.hadoop.hive.ql.exec.FilterOperator.process(FilterOperator.java:122)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
>     at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:110)
>     at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:165)
>     at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:536)
>     ... 18 more
> {noformat}
>
> I tracked this down to the following in DynamicByteArray.java, which is being
> used to create the dictionary for a particular column:
> {noformat}
> private int length;
> {noformat}
>
> This has the side-effect of capping the memory available for the dictionary
> at 2GB.
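>
> To make the cap concrete, here is a small self-contained sketch (plain
> Java, not ORC code) of what a signed 32-bit length implies:
> {noformat}
> public class IntCapDemo {
>     public static void main(String[] args) {
>         // A Java int is signed 32-bit, so a byte count tops out at 2^31 - 1.
>         System.out.printf("max length: %d bytes (%.2f GiB)%n",
>             Integer.MAX_VALUE, Integer.MAX_VALUE / (double) (1L << 30));
>         // Past the cap the running length wraps negative; offsets derived
>         // from it then point at buffers that were never allocated, which is
>         // consistent with the NullPointerException in System.arraycopy above.
>         int length = Integer.MAX_VALUE;
>         length += 100;
>         System.out.println("after overflow: " + length); // a negative value
>     }
> }
> {noformat}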
>
> Given the size of column values in this use case, and the fact that the user
> is exceeding this 2GB limit, there should probably be some heuristics that
> bail early on dictionary creation, so this limitation is never reached. Given
> the size of data that would be required to hit this limit, it is unlikely
> that a dictionary would be useful.
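>
> One possible shape for such a heuristic, as a minimal self-contained
> sketch rather than actual writer code (all names here are hypothetical):
> {noformat}
> public class DictionaryBailHeuristic {
>     // Leave headroom below Integer.MAX_VALUE so the dictionary's byte
>     // array is never asked to grow near the signed 32-bit cap.
>     private static final long MAX_DICTIONARY_BYTES = Integer.MAX_VALUE - (1 << 20);
>
>     private long dictionaryBytes = 0;  // long, so this check cannot overflow
>     private boolean useDictionary = true;
>
>     /** Returns true while dictionary encoding is still worth attempting. */
>     public boolean offerKey(byte[] key) {
>         if (useDictionary && dictionaryBytes + key.length > MAX_DICTIONARY_BYTES) {
>             useDictionary = false;  // bail to direct encoding early
>         }
>         if (useDictionary) {
>             dictionaryBytes += key.length;
>         }
>         return useDictionary;
>     }
> }
> {noformat}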