[jira] [Created] (HIVE-18576) Support to read nested complex type with Parquet in vectorization mode
Colin Ma created HIVE-18576:
---

Summary: Support to read nested complex type with Parquet in vectorization mode
Key: HIVE-18576
URL: https://issues.apache.org/jira/browse/HIVE-18576
Project: Hive
Issue Type: Sub-task
Reporter: Colin Ma
Assignee: Colin Ma

Nested complex types are commonly used, e.g. a struct with a field s2 of list type. Currently, nested complex types can't be parsed in vectorization mode; this ticket targets supporting them.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-18411) Fix ArrayIndexOutOfBoundsException for VectorizedListColumnReader
Colin Ma created HIVE-18411:
---

Summary: Fix ArrayIndexOutOfBoundsException for VectorizedListColumnReader
Key: HIVE-18411
URL: https://issues.apache.org/jira/browse/HIVE-18411
Project: Hive
Issue Type: Sub-task
Reporter: Colin Ma
Assignee: Colin Ma
Priority: Critical

ColumnVector should be initialized to the default size at the beginning of readBatch(); otherwise an ArrayIndexOutOfBoundsException will be thrown, because the previous readBatch() call may have changed the ColumnVector's size.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
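The defensive re-initialization described above can be sketched as follows. This is a minimal stand-in for Hive's ColumnVector, not the actual Hive API; the class and method names here are hypothetical.

```java
// Sketch of re-initializing a column vector at the start of readBatch(),
// so a backing array shrunk by a previous batch cannot cause an
// ArrayIndexOutOfBoundsException. SimpleColumnVector is a hypothetical
// stand-in for Hive's ColumnVector.
public class ReadBatchSketch {
    static final int DEFAULT_SIZE = 1024;

    static class SimpleColumnVector {
        long[] vector = new long[DEFAULT_SIZE];

        // Re-allocate to the default size if a previous readBatch() left
        // the backing array smaller than a full batch.
        void ensureDefaultSize() {
            if (vector.length < DEFAULT_SIZE) {
                vector = new long[DEFAULT_SIZE];
            }
        }
    }

    static int readBatch(SimpleColumnVector cv, long[] source) {
        cv.ensureDefaultSize();               // the fix: init before filling
        int n = Math.min(source.length, DEFAULT_SIZE);
        for (int i = 0; i < n; i++) {
            cv.vector[i] = source[i];
        }
        return n;
    }

    public static void main(String[] args) {
        SimpleColumnVector cv = new SimpleColumnVector();
        cv.vector = new long[10];             // simulate a shrunken vector from a prior batch
        int n = readBatch(cv, new long[100]); // would overflow without ensureDefaultSize()
        System.out.println(n + " rows, capacity " + cv.vector.length);
    }
}
```

Without the `ensureDefaultSize()` call, the loop would index past the 10-element array left over from the prior batch.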
[jira] [Created] (HIVE-18211) Support to read multiple level definition for Map type in Parquet file
Colin Ma created HIVE-18211:
---

Summary: Support to read multiple level definition for Map type in Parquet file
Key: HIVE-18211
URL: https://issues.apache.org/jira/browse/HIVE-18211
Project: Hive
Issue Type: Sub-task
Reporter: Colin Ma
Assignee: Colin Ma

In the current implementation of VectorizedParquetRecordReader, only the following definition of the map type is supported:

{code}
repeated group map (MAP_KEY_VALUE) {
  required binary key (UTF8);
  optional binary value (UTF8);
}
{code}

The implementation should also support a multi-level definition like:

{code}
optional group m1 (MAP) {
  repeated group map (MAP_KEY_VALUE) {
    required binary key (UTF8);
    optional binary value (UTF8);
  }
}
{code}

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
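The difference between the two schemas above is only how many wrapper levels sit above the repeated key/value group. A reader can handle both by descending through single-child wrapper groups until it reaches the repeated group, rather than assuming it is at the top. The sketch below illustrates that idea with a toy `Node` type, not Parquet's actual schema classes.

```java
// Hypothetical sketch: locate the repeated key/value group of a map,
// whether it is the top-level group or wrapped in an outer optional
// group. Node is a stand-in for a Parquet schema node.
import java.util.ArrayList;
import java.util.List;

public class MapSchemaSketch {
    static class Node {
        String name;
        boolean repeated;
        List<Node> children = new ArrayList<>();

        Node(String name, boolean repeated) { this.name = name; this.repeated = repeated; }
        Node add(Node c) { children.add(c); return this; }
    }

    // Walk down single-child wrapper groups until the repeated
    // key/value group is found; returns null if there is none.
    static Node findRepeatedGroup(Node n) {
        while (!n.repeated && n.children.size() == 1) {
            n = n.children.get(0);
        }
        return n.repeated ? n : null;
    }

    public static void main(String[] args) {
        // one-level form: repeated group map { key; value; }
        Node flat = new Node("map", true);
        // multi-level form: optional group m1 { repeated group map { key; value; } }
        Node nested = new Node("m1", false).add(new Node("map", true));
        System.out.println(findRepeatedGroup(flat).name + " / " + findRepeatedGroup(nested).name);
    }
}
```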
[jira] [Created] (HIVE-18209) Fix wrong API call in VectorizedListColumnReader to get value from BytesColumnVector
Colin Ma created HIVE-18209:
---

Summary: Fix wrong API call in VectorizedListColumnReader to get value from BytesColumnVector
Key: HIVE-18209
URL: https://issues.apache.org/jira/browse/HIVE-18209
Project: Hive
Issue Type: Sub-task
Reporter: Colin Ma
Assignee: Colin Ma

BytesColumnVector.setRef() should be used instead of BytesColumnVector.setVal() to get the expected result.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
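The practical difference between the two calls is reference vs. copy semantics: `setRef()` stores a reference to the caller's byte array, while `setVal()` copies the bytes into the vector's own buffer. The toy class below illustrates only that distinction; `MiniBytesVector` is a hypothetical stand-in, not Hive's `BytesColumnVector`.

```java
// Toy illustration of the setRef()/setVal() distinction:
// setRef() aliases the caller's buffer, setVal() takes a private copy.
public class SetRefVsSetVal {
    static class MiniBytesVector {
        byte[][] vector = new byte[8][];

        void setRef(int row, byte[] bytes) { vector[row] = bytes; }         // keep a reference
        void setVal(int row, byte[] bytes) { vector[row] = bytes.clone(); } // copy the bytes
    }

    public static void main(String[] args) {
        byte[] source = {1, 2, 3};
        MiniBytesVector v = new MiniBytesVector();
        v.setRef(0, source);
        v.setVal(1, source);
        source[0] = 9;                 // mutate the shared buffer afterwards
        // row 0 (setRef) sees the mutation, row 1 (setVal) does not
        System.out.println(v.vector[0][0] + " vs " + v.vector[1][0]);
    }
}
```

Which call is correct in a reader therefore depends on whether the source buffer outlives the batch; the ticket's fix picks `setRef()` for this code path.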
[jira] [Created] (HIVE-18207) vector_complex_join
Colin Ma created HIVE-18207:
---

Summary: vector_complex_join
Key: HIVE-18207
URL: https://issues.apache.org/jira/browse/HIVE-18207
Project: Hive
Issue Type: Bug
Reporter: Colin Ma

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-18159) Vectorization: Support Map type in MapWork
Colin Ma created HIVE-18159:
---

Summary: Vectorization: Support Map type in MapWork
Key: HIVE-18159
URL: https://issues.apache.org/jira/browse/HIVE-18159
Project: Hive
Issue Type: Improvement
Reporter: Colin Ma
Assignee: Colin Ma

Support for complex types in vectorization was finished in HIVE-16589, but the Map type is still not supported in MapWork. This ticket targets supporting it in MapWork when vectorization is enabled.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-18048) Add qtests for Struct type with vectorization
Colin Ma created HIVE-18048:
---

Summary: Add qtests for Struct type with vectorization
Key: HIVE-18048
URL: https://issues.apache.org/jira/browse/HIVE-18048
Project: Hive
Issue Type: Sub-task
Reporter: Colin Ma
Assignee: Colin Ma

The Struct type is supported in vectorization, but there are no qtests covering that case.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-18043) Vectorization: Support List type in MapWork
Colin Ma created HIVE-18043:
---

Summary: Vectorization: Support List type in MapWork
Key: HIVE-18043
URL: https://issues.apache.org/jira/browse/HIVE-18043
Project: Hive
Issue Type: Improvement
Reporter: Colin Ma
Assignee: Colin Ma

Support for complex types in vectorization was finished in HIVE-16589, but the List type is still not supported in MapWork. It should be supported to improve performance when vectorization is enabled.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17972) Implement Parquet vectorization reader for Map type
Colin Ma created HIVE-17972:
---

Summary: Implement Parquet vectorization reader for Map type
Key: HIVE-17972
URL: https://issues.apache.org/jira/browse/HIVE-17972
Project: Hive
Issue Type: Sub-task
Reporter: Colin Ma
Assignee: Colin Ma
Priority: Major

The Parquet vectorized reader doesn't support the map type; it should be supported to improve the performance of queries that use map columns.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17931) Implement Parquet vectorization reader for Array type
Colin Ma created HIVE-17931:
---

Summary: Implement Parquet vectorization reader for Array type
Key: HIVE-17931
URL: https://issues.apache.org/jira/browse/HIVE-17931
Project: Hive
Issue Type: Sub-task
Reporter: Colin Ma
Assignee: Colin Ma

The Parquet vectorized reader doesn't support the array type; it should be supported to improve the performance of queries that use array columns.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17033) Miss the jar when create slider package for llap
Colin Ma created HIVE-17033:
---

Summary: Miss the jar when create slider package for llap
Key: HIVE-17033
URL: https://issues.apache.org/jira/browse/HIVE-17033
Project: Hive
Issue Type: Bug
Components: llap
Affects Versions: 3.0.0
Reporter: Colin Ma
Assignee: Colin Ma
Fix For: 3.0.0

When creating the slider package for LLAP, the jar for log4j-1.2-api is missing. The root cause: org.apache.log4j.NDC is used to locate the jar of log4j-1.2-api, but this class also exists in the log4j jar, so the log4j-1.2-api jar is not the one resolved and won't be included. As a result, log4j-1.2-api-2.6.2.jar can't be found in llap-2.2.0-S.zip.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
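The lookup pattern behind this bug can be sketched with the standard `ProtectionDomain`/`CodeSource` API: resolving a class to the jar it was loaded from. When the same class exists in two jars (as `org.apache.log4j.NDC` does in log4j and log4j-1.2-api), the lookup returns whichever jar the classloader happened to pick, which is why the intended jar can be skipped. This is an illustrative sketch, not the packaging script's actual code.

```java
// Resolve a class to the location (jar or directory) it was loaded from.
// For a class that exists in two jars on the classpath, this returns
// only the one the classloader chose -- the ambiguity at the heart of
// HIVE-17033.
import java.net.URL;

public class JarOfClass {
    static URL jarOf(Class<?> cls) {
        // CodeSource can be null for bootstrap classes (e.g. java.lang.String)
        return cls.getProtectionDomain().getCodeSource().getLocation();
    }

    public static void main(String[] args) {
        // Prints the location this class itself was loaded from.
        System.out.println(jarOf(JarOfClass.class));
    }
}
```

A more robust packaging approach is to pick the class by a name that exists only in the desired jar, or to list the jar explicitly instead of resolving it from a class.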
[jira] [Created] (HIVE-16969) Improvement performance of MapOperator for Parquet
Colin Ma created HIVE-16969:
---

Summary: Improvement performance of MapOperator for Parquet
Key: HIVE-16969
URL: https://issues.apache.org/jira/browse/HIVE-16969
Project: Hive
Issue Type: Improvement
Affects Versions: 3.0.0
Reporter: Colin Ma
Assignee: Colin Ma
Fix For: 3.0.0

For a table with many partition files, MapOperator.cloneConfsForNestedColPruning() updates hive.io.file.readNestedColumn.paths many times. A large value of hive.io.file.readNestedColumn.paths causes poor performance in ParquetHiveSerDe.processRawPrunedPaths(). So unnecessary paths should not be appended to hive.io.file.readNestedColumn.paths.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
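The fix amounts to deduplicating when merging per-partition column paths into the comma-separated property, so the value does not grow with every partition. A minimal sketch of that merge, with an illustrative method name rather than Hive's actual one:

```java
// Sketch: append nested-column paths to a comma-separated property only
// if they are not already present, keeping the property's size bounded
// by the number of distinct paths rather than the number of partitions.
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class NestedPathMerge {
    static String appendPaths(String existing, List<String> newPaths) {
        LinkedHashSet<String> merged = new LinkedHashSet<>();
        if (existing != null && !existing.isEmpty()) {
            merged.addAll(Arrays.asList(existing.split(",")));
        }
        merged.addAll(newPaths);   // duplicates are dropped by the set
        return String.join(",", merged);
    }

    public static void main(String[] args) {
        // "s.a" is already present, so only "s.c" is actually appended
        System.out.println(appendPaths("s.a,s.b", Arrays.asList("s.a", "s.c")));
    }
}
```

`LinkedHashSet` keeps insertion order, so the existing property value is preserved verbatim with new paths appended once each.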
[jira] [Created] (HIVE-16765) ParquetFileReader should be closed to avoid resource leak
Colin Ma created HIVE-16765:
---

Summary: ParquetFileReader should be closed to avoid resource leak
Key: HIVE-16765
URL: https://issues.apache.org/jira/browse/HIVE-16765
Project: Hive
Issue Type: Bug
Reporter: Colin Ma
Assignee: Colin Ma
Priority: Critical

ParquetFileReader should be closed to avoid a resource leak.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-16672) Parquet vectorization doesn't work for tables with partition info
Colin Ma created HIVE-16672:
---

Summary: Parquet vectorization doesn't work for tables with partition info
Key: HIVE-16672
URL: https://issues.apache.org/jira/browse/HIVE-16672
Project: Hive
Issue Type: Sub-task
Reporter: Colin Ma
Assignee: Colin Ma
Priority: Critical
Fix For: 3.0.0

VectorizedParquetRecordReader doesn't check and update partition columns; this should be fixed.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-16465) NullPointer Exception when enable vectorization for Parquet file format
Colin Ma created HIVE-16465:
---

Summary: NullPointer Exception when enable vectorization for Parquet file format
Key: HIVE-16465
URL: https://issues.apache.org/jira/browse/HIVE-16465
Project: Hive
Issue Type: Bug
Reporter: Colin Ma
Assignee: Colin Ma
Priority: Critical

A NullPointerException is thrown when vectorization is enabled for the Parquet file format. It is caused by a null InputSplit.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-16311) Improve the performance for FastHiveDecimalImpl.fastDivide
Colin Ma created HIVE-16311:
---

Summary: Improve the performance for FastHiveDecimalImpl.fastDivide
Key: HIVE-16311
URL: https://issues.apache.org/jira/browse/HIVE-16311
Project: Hive
Issue Type: Improvement
Affects Versions: 2.2.0
Reporter: Colin Ma
Assignee: Colin Ma
Fix For: 2.2.0

FastHiveDecimalImpl.fastDivide performs poorly when evaluating an expression such as 12345.67/123.45. Two points can be improved:

1. Don't always use HiveDecimal.MAX_SCALE as the scale for BigDecimal.divide().
2. Get the precision of a BigInteger in a fast way when possible.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
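The first point can be sketched with plain `java.math.BigDecimal`: derive a sufficient scale from the operands instead of always dividing at the maximum (38 fractional digits), then strip trailing zeros. The scale formula below is illustrative, not the one the Hive patch actually uses.

```java
// Sketch of improvement 1: divide at a scale derived from the operands
// rather than a fixed HiveDecimal.MAX_SCALE of 38, then drop trailing
// zeros. The headroom constant (+6) is an assumption for illustration.
import java.math.BigDecimal;
import java.math.RoundingMode;

public class FastDivideSketch {
    static BigDecimal divide(BigDecimal a, BigDecimal b) {
        // enough fractional digits for the operands plus some headroom,
        // capped at the 38-digit maximum
        int scale = Math.min(38, a.scale() + b.precision() + 6);
        return a.divide(b, scale, RoundingMode.HALF_UP).stripTrailingZeros();
    }

    // String-based helper to keep call sites simple
    static String divideToString(String a, String b) {
        return divide(new BigDecimal(a), new BigDecimal(b)).toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(divideToString("12345.67", "123.45"));
    }
}
```

Dividing `12345.67/123.45` at scale 13 instead of 38 does substantially less work in `BigDecimal.divide()` while leaving the rounded result unchanged after `stripTrailingZeros()`.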
[jira] [Created] (HIVE-16004) OutOfMemory in SparkReduceRecordHandler with vectorization mode
Colin Ma created HIVE-16004:
---

Summary: OutOfMemory in SparkReduceRecordHandler with vectorization mode
Key: HIVE-16004
URL: https://issues.apache.org/jira/browse/HIVE-16004
Project: Hive
Issue Type: Bug
Reporter: Colin Ma
Assignee: Colin Ma

For query 28 of TPCx-BB with 1 TB of data and executor memory set to 30 GB, the following exception is thrown:

java.lang.OutOfMemoryError
    at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.setVector(VectorizedBatchUtil.java:467)
    at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.addRowToBatchFrom(VectorizedBatchUtil.java:238)
    at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:367)
    at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:286)
    at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:220)
    at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:49)
    at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28)
    at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
    at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
    at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974)
    at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

I think the DataOutputBuffer not being cleared in time causes this problem.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
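The suspected fix pattern is to reset the reusable serialization buffer per row instead of letting it accumulate across the whole reduce batch. A minimal sketch, using `ByteArrayOutputStream` as a stand-in for Hadoop's `DataOutputBuffer` (the method name is illustrative):

```java
// Sketch: reset the reusable buffer before each row so its size stays
// bounded by one row's payload instead of growing with every row --
// the growth pattern visible in the ByteArrayOutputStream frames of
// the stack trace above.
import java.io.ByteArrayOutputStream;

public class BufferResetSketch {
    static int sizeAfterRows(int rows, int rowBytes) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] row = new byte[rowBytes];
        for (int i = 0; i < rows; i++) {
            buf.reset();                     // without this, size grows to rows * rowBytes
            buf.write(row, 0, row.length);   // serialize one row's value
        }
        return buf.size();
    }

    public static void main(String[] args) {
        // bounded at one row's size regardless of row count
        System.out.println(sizeAfterRows(1000, 128));
    }
}
```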
[jira] [Created] (HIVE-15718) Fix the NullPointer problem caused by split phase
Colin Ma created HIVE-15718:
---

Summary: Fix the NullPointer problem caused by split phase
Key: HIVE-15718
URL: https://issues.apache.org/jira/browse/HIVE-15718
Project: Hive
Issue Type: Sub-task
Reporter: Colin Ma
Assignee: Colin Ma
Priority: Critical

VectorizedParquetRecordReader.initialize() throws a NullPointerException because the input split is null. Such a split should be ignored.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)