[jira] [Created] (HIVE-18576) Support to read nested complex type with Parquet in vectorization mode

2018-01-29 Thread Colin Ma (JIRA)
Colin Ma created HIVE-18576:
---

 Summary: Support to read nested complex type with Parquet in 
vectorization mode
 Key: HIVE-18576
 URL: https://issues.apache.org/jira/browse/HIVE-18576
 Project: Hive
  Issue Type: Sub-task
Reporter: Colin Ma
Assignee: Colin Ma


Nested complex types are commonly used, e.g. Struct, s2 
List>. Currently, nested complex types can't be parsed in vectorization 
mode; this ticket aims to support them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-18411) Fix ArrayIndexOutOfBoundsException for VectorizedListColumnReader

2018-01-08 Thread Colin Ma (JIRA)
Colin Ma created HIVE-18411:
---

 Summary: Fix ArrayIndexOutOfBoundsException for 
VectorizedListColumnReader
 Key: HIVE-18411
 URL: https://issues.apache.org/jira/browse/HIVE-18411
 Project: Hive
  Issue Type: Sub-task
Reporter: Colin Ma
Assignee: Colin Ma
Priority: Critical


ColumnVector should be initialized to the default size at the beginning of 
readBatch(); otherwise an ArrayIndexOutOfBoundsException will be thrown, because 
the size of the ColumnVector may have been changed by the previous readBatch().
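
A minimal sketch of this failure mode, using a simplified stand-in rather than
Hive's real classes. ToyColumnVector, ToyListReader, DEFAULT_SIZE and
simulatePreviousBatch are hypothetical names, and the assumption that the
previous batch left a smaller backing array is made only for illustration;
the ticket does not say how the size changes.

```java
import java.util.Arrays;

// Hypothetical stand-in for a column vector with a resizable backing array.
final class ToyColumnVector {
    static final int DEFAULT_SIZE = 1024; // illustrative default batch size
    long[] values = new long[DEFAULT_SIZE];

    // Restore the default capacity, discarding whatever the last batch left.
    void resetToDefaultSize() {
        if (values.length != DEFAULT_SIZE) {
            values = new long[DEFAULT_SIZE];
        } else {
            Arrays.fill(values, 0L);
        }
    }
}

final class ToyListReader {
    // Assumption for illustration: an earlier batch replaced the backing array.
    static void simulatePreviousBatch(ToyColumnVector vector, int size) {
        vector.values = new long[size];
    }

    // Writes `count` values (count <= DEFAULT_SIZE). Without the reset, the
    // loop trusts a capacity that the previous batch may have changed.
    static void readBatch(ToyColumnVector vector, int count, boolean resetFirst) {
        if (resetFirst) {
            vector.resetToDefaultSize();
        }
        for (int i = 0; i < count; i++) {
            vector.values[i] = i;
        }
    }
}
```

Calling readBatch with resetFirst set to false after simulatePreviousBatch(v, 10)
fails as soon as the index passes the smaller array; with the reset it succeeds.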





[jira] [Created] (HIVE-18211) Support to read multiple level definition for Map type in Parquet file

2017-12-03 Thread Colin Ma (JIRA)
Colin Ma created HIVE-18211:
---

 Summary: Support to read multiple level definition for Map type in 
Parquet file
 Key: HIVE-18211
 URL: https://issues.apache.org/jira/browse/HIVE-18211
 Project: Hive
  Issue Type: Sub-task
Reporter: Colin Ma
Assignee: Colin Ma


The current implementation of VectorizedParquetRecordReader supports only the 
following definition for the map type:
{code}
repeated group map (MAP_KEY_VALUE) {
  required binary key (UTF8);
  optional binary value (UTF8);
}
{code}
The implementation should also support a multi-level definition such as:
{code}
optional group m1 (MAP) {
  repeated group map (MAP_KEY_VALUE) {
    required binary key (UTF8);
    optional binary value (UTF8);
  }
}
{code}





[jira] [Created] (HIVE-18209) Fix wrong API call in VectorizedListColumnReader to get value from BytesColumnVector

2017-12-03 Thread Colin Ma (JIRA)
Colin Ma created HIVE-18209:
---

 Summary: Fix wrong API call in VectorizedListColumnReader to get 
value from BytesColumnVector
 Key: HIVE-18209
 URL: https://issues.apache.org/jira/browse/HIVE-18209
 Project: Hive
  Issue Type: Sub-task
Reporter: Colin Ma
Assignee: Colin Ma


BytesColumnVector.setRef() should be used instead of BytesColumnVector.setVal() 
to get the expected result.
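
The ticket does not say why setRef() is the right call here, so the sketch
below only illustrates the general difference between the two methods: setVal()
copies the bytes, while setRef() stores a reference to the caller's array.
ToyBytesVector is a hypothetical, much simplified stand-in; the real
BytesColumnVector copies into a shared internal buffer rather than a fresh
array per element.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in showing copy versus reference semantics.
final class ToyBytesVector {
    private final byte[][] vector;
    private final int[] start;
    private final int[] length;

    ToyBytesVector(int size) {
        vector = new byte[size][];
        start = new int[size];
        length = new int[size];
    }

    // Copy semantics: later changes to `source` are not visible.
    void setVal(int i, byte[] source, int offset, int len) {
        vector[i] = Arrays.copyOfRange(source, offset, offset + len);
        start[i] = 0;
        length[i] = len;
    }

    // Reference semantics: the element points into the caller's array.
    void setRef(int i, byte[] source, int offset, int len) {
        vector[i] = source;
        start[i] = offset;
        length[i] = len;
    }

    String asString(int i) {
        return new String(vector[i], start[i], length[i], StandardCharsets.UTF_8);
    }
}
```

If the source array is modified after both calls, the element written with
setVal() keeps its old contents and the element written with setRef() reflects
the change.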





[jira] [Created] (HIVE-18207) vector_complex_join

2017-12-03 Thread Colin Ma (JIRA)
Colin Ma created HIVE-18207:
---

 Summary: vector_complex_join
 Key: HIVE-18207
 URL: https://issues.apache.org/jira/browse/HIVE-18207
 Project: Hive
  Issue Type: Bug
Reporter: Colin Ma








[jira] [Created] (HIVE-18159) Vectorization: Support Map type in MapWork

2017-11-27 Thread Colin Ma (JIRA)
Colin Ma created HIVE-18159:
---

 Summary: Vectorization: Support Map type in MapWork
 Key: HIVE-18159
 URL: https://issues.apache.org/jira/browse/HIVE-18159
 Project: Hive
  Issue Type: Improvement
Reporter: Colin Ma
Assignee: Colin Ma


Support for complex types in vectorization was completed in HIVE-16589, but the 
Map type is still not supported in MapWork. This ticket aims to support it in 
MapWork when vectorization is enabled.





[jira] [Created] (HIVE-18048) Add qtests for Struct type with vectorization

2017-11-12 Thread Colin Ma (JIRA)
Colin Ma created HIVE-18048:
---

 Summary: Add qtests for Struct type with vectorization
 Key: HIVE-18048
 URL: https://issues.apache.org/jira/browse/HIVE-18048
 Project: Hive
  Issue Type: Sub-task
Reporter: Colin Ma
Assignee: Colin Ma


The Struct type is supported in vectorization, but there are no qtests covering 
this case.





[jira] [Created] (HIVE-18043) Vectorization: Support List type in MapWork

2017-11-10 Thread Colin Ma (JIRA)
Colin Ma created HIVE-18043:
---

 Summary: Vectorization: Support List type in MapWork
 Key: HIVE-18043
 URL: https://issues.apache.org/jira/browse/HIVE-18043
 Project: Hive
  Issue Type: Improvement
Reporter: Colin Ma
Assignee: Colin Ma


Support for complex types in vectorization was completed in HIVE-16589, but the 
List type is still not supported in MapWork. It should be supported to improve 
performance when vectorization is enabled.





[jira] [Created] (HIVE-17972) Implement Parquet vectorization reader for Map type

2017-11-02 Thread Colin Ma (JIRA)
Colin Ma created HIVE-17972:
---

 Summary: Implement Parquet vectorization reader for Map type
 Key: HIVE-17972
 URL: https://issues.apache.org/jira/browse/HIVE-17972
 Project: Hive
  Issue Type: Sub-task
Reporter: Colin Ma
Assignee: Colin Ma
Priority: Major


The Parquet vectorized reader doesn't support the map type. It should be 
supported to improve the performance of queries that use map types.





[jira] [Created] (HIVE-17931) Implement Parquet vectorization reader for Array type

2017-10-29 Thread Colin Ma (JIRA)
Colin Ma created HIVE-17931:
---

 Summary: Implement Parquet vectorization reader for Array type
 Key: HIVE-17931
 URL: https://issues.apache.org/jira/browse/HIVE-17931
 Project: Hive
  Issue Type: Sub-task
Reporter: Colin Ma
Assignee: Colin Ma


The Parquet vectorized reader doesn't support the array type. It should be 
supported to improve the performance of queries that use array types.





[jira] [Created] (HIVE-17033) Miss the jar when create slider package for llap

2017-07-04 Thread Colin Ma (JIRA)
Colin Ma created HIVE-17033:
---

 Summary: Miss the jar when create slider package for llap
 Key: HIVE-17033
 URL: https://issues.apache.org/jira/browse/HIVE-17033
 Project: Hive
  Issue Type: Bug
  Components: llap
Affects Versions: 3.0.0
Reporter: Colin Ma
Assignee: Colin Ma
 Fix For: 3.0.0


When creating the Slider package for LLAP, the jar for log4j-1.2-api is missing. 
The root cause is that org.apache.log4j.NDC is used to locate the log4j-1.2-api 
jar, but this class also exists in the log4j jar, so the log4j-1.2-api jar 
isn't included.
As a result, log4j-1.2-api-2.6.2.jar can't be found in llap-2.2.0-S.zip.
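
The root cause can be illustrated with a toy model of classpath lookup. This is
a sketch under stated assumptions, not the packaging code itself: ToyClasspath,
the jar name "log4j.jar", the class name "example.OnlyInBridge" and the
classpath order are all hypothetical. It shows that locating a jar through a
class that lives in two jars returns whichever jar comes first.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical model: jars in classpath order, each with its class names.
final class ToyClasspath {
    private final Map<String, Set<String>> jars = new LinkedHashMap<>();

    ToyClasspath add(String jarName, String... classNames) {
        jars.put(jarName, new HashSet<>(Arrays.asList(classNames)));
        return this;
    }

    // Returns the first jar, in classpath order, that contains the class.
    Optional<String> jarContaining(String className) {
        for (Map.Entry<String, Set<String>> entry : jars.entrySet()) {
            if (entry.getValue().contains(className)) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }
}
```

With "log4j.jar" listed before "log4j-1.2-api-2.6.2.jar" and both containing
org.apache.log4j.NDC, the lookup resolves to "log4j.jar". Looking up a class
that exists only in the intended jar avoids the ambiguity.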





[jira] [Created] (HIVE-16969) Improvement performance of MapOperator for Parquet

2017-06-26 Thread Colin Ma (JIRA)
Colin Ma created HIVE-16969:
---

 Summary: Improvement performance of MapOperator for Parquet
 Key: HIVE-16969
 URL: https://issues.apache.org/jira/browse/HIVE-16969
 Project: Hive
  Issue Type: Improvement
Affects Versions: 3.0.0
Reporter: Colin Ma
Assignee: Colin Ma
 Fix For: 3.0.0


For a table with many partition files, 
MapOperator.cloneConfsForNestedColPruning() updates 
hive.io.file.readNestedColumn.paths many times. A larger value of 
hive.io.file.readNestedColumn.paths causes poor performance in 
ParquetHiveSerDe.processRawPrunedPaths(). 
So unnecessary paths should not be appended to 
hive.io.file.readNestedColumn.paths.
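
One way to keep the value from growing is sketched below. It assumes that the
property holds a comma-separated list of paths and that "unnecessary" means
"already present"; neither is stated in the ticket. NestedPathAppender and
appendDistinct are hypothetical names, not Hive methods.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

final class NestedPathAppender {
    // Appends newPaths to a comma-separated value, skipping paths that are
    // already present, so repeated calls do not make the value grow.
    static String appendDistinct(String existing, String... newPaths) {
        Set<String> paths = new LinkedHashSet<>();
        if (existing != null && !existing.isEmpty()) {
            paths.addAll(Arrays.asList(existing.split(",")));
        }
        for (String path : newPaths) {
            if (path != null && !path.isEmpty()) {
                paths.add(path);
            }
        }
        return String.join(",", paths);
    }
}
```

Calling appendDistinct once per partition file with the same paths leaves the
value unchanged after the first call, whereas plain concatenation would grow
linearly with the number of files.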





[jira] [Created] (HIVE-16765) ParquetFileReader should be closed to avoid resource leak

2017-05-25 Thread Colin Ma (JIRA)
Colin Ma created HIVE-16765:
---

 Summary: ParquetFileReader should be closed to avoid resource leak
 Key: HIVE-16765
 URL: https://issues.apache.org/jira/browse/HIVE-16765
 Project: Hive
  Issue Type: Bug
Reporter: Colin Ma
Assignee: Colin Ma
Priority: Critical


ParquetFileReader should be closed to avoid a resource leak.
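
A common way to guarantee the close is try-with-resources. The sketch below
uses a hypothetical ToyFileReader as a stand-in for any Closeable reader; the
open-reader counter and the simulated failure exist only to make the behaviour
observable and are not part of any real API.

```java
import java.io.Closeable;

// Hypothetical reader that tracks how many instances are still open.
final class ToyFileReader implements Closeable {
    static int openReaders = 0;

    ToyFileReader() {
        openReaders++;
    }

    int readRowCount(boolean fail) {
        if (fail) {
            throw new IllegalStateException("simulated read failure");
        }
        return 42;
    }

    @Override
    public void close() {
        openReaders--;
    }
}

final class ReaderUsage {
    // The reader is closed on both the normal and the exceptional path.
    static int countRows(boolean fail) {
        try (ToyFileReader reader = new ToyFileReader()) {
            return reader.readRowCount(fail);
        }
    }
}
```

After countRows returns or throws, ToyFileReader.openReaders is back to zero,
which is the property the ticket asks for.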





[jira] [Created] (HIVE-16672) Parquet vectorization doesn't work for tables with partition info

2017-05-16 Thread Colin Ma (JIRA)
Colin Ma created HIVE-16672:
---

 Summary: Parquet vectorization doesn't work for tables with 
partition info
 Key: HIVE-16672
 URL: https://issues.apache.org/jira/browse/HIVE-16672
 Project: Hive
  Issue Type: Sub-task
Reporter: Colin Ma
Assignee: Colin Ma
Priority: Critical
 Fix For: 3.0.0


VectorizedParquetRecordReader doesn't check and update partition columns; this 
should be fixed.





[jira] [Created] (HIVE-16465) NullPointer Exception when enable vectorization for Parquet file format

2017-04-17 Thread Colin Ma (JIRA)
Colin Ma created HIVE-16465:
---

 Summary: NullPointer Exception when enable vectorization for 
Parquet file format
 Key: HIVE-16465
 URL: https://issues.apache.org/jira/browse/HIVE-16465
 Project: Hive
  Issue Type: Bug
Reporter: Colin Ma
Assignee: Colin Ma
Priority: Critical


A NullPointerException is thrown when vectorization is enabled for the Parquet 
file format. It is caused by a null InputSplit.





[jira] [Created] (HIVE-16311) Improve the performance for FastHiveDecimalImpl.fastDivide

2017-03-28 Thread Colin Ma (JIRA)
Colin Ma created HIVE-16311:
---

 Summary: Improve the performance for FastHiveDecimalImpl.fastDivide
 Key: HIVE-16311
 URL: https://issues.apache.org/jira/browse/HIVE-16311
 Project: Hive
  Issue Type: Improvement
Affects Versions: 2.2.0
Reporter: Colin Ma
Assignee: Colin Ma
 Fix For: 2.2.0


FastHiveDecimalImpl.fastDivide performs poorly when evaluating an expression 
such as 12345.67/123.45.
Two points can be improved:
1. Don't always use HiveDecimal.MAX_SCALE as the scale when calling 
BigDecimal.divide.
2. Get the precision of a BigInteger in a fast way where possible.
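
The two points above can be sketched as follows. This is an illustration, not
the FastHiveDecimalImpl code: DecimalDivideSketch is a hypothetical class, the
caller-supplied scale and the HALF_UP rounding mode are assumptions, and the
digit-counting shortcut is just one possible fast path.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

final class DecimalDivideSketch {
    // Same value as HiveDecimal.MAX_SCALE; kept only for comparison.
    static final int MAX_SCALE = 38;

    // Point 1: divide at the scale the result needs, not always MAX_SCALE.
    static BigDecimal divide(BigDecimal left, BigDecimal right, int scale) {
        return left.divide(right, scale, RoundingMode.HALF_UP);
    }

    // Point 2: count decimal digits with long arithmetic when the value
    // fits in a long, and fall back to the string length otherwise.
    static int precision(BigInteger value) {
        BigInteger abs = value.abs();
        if (abs.bitLength() < 64) {
            long v = abs.longValue();
            int digits = 1;
            while (v >= 10) {
                v /= 10;
                digits++;
            }
            return digits;
        }
        return abs.toString().length();
    }
}
```

For example, dividing 12345.67 by 123.45 at scale 6 yields 100.005427 without
computing 38 fractional digits first.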





[jira] [Created] (HIVE-16004) OutOfMemory in SparkReduceRecordHandler with vectorization mode

2017-02-21 Thread Colin Ma (JIRA)
Colin Ma created HIVE-16004:
---

 Summary: OutOfMemory in SparkReduceRecordHandler with 
vectorization mode
 Key: HIVE-16004
 URL: https://issues.apache.org/jira/browse/HIVE-16004
 Project: Hive
  Issue Type: Bug
Reporter: Colin Ma
Assignee: Colin Ma


For query 28 of TPCx-BB with 1 TB of data and executor memory set to 30 GB, 
the following exception is thrown:
java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.setVector(VectorizedBatchUtil.java:467)
at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.addRowToBatchFrom(VectorizedBatchUtil.java:238)
at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:367)
at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:286)
at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:220)
at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:49)
at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28)
at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I think the problem is caused by DataOutputBuffer not being cleared in time.
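
The suspected cause can be illustrated with a small sketch. It uses the JDK's
ByteArrayOutputStream as a stand-in for DataOutputBuffer, and RowBufferSketch
and bufferSizeAfter are hypothetical names. Whether this matches the actual
code path in SparkReduceRecordHandler is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class RowBufferSketch {
    // Writes every row into one shared buffer and returns the buffer size
    // after the last row. Without a reset per row the buffer keeps growing.
    static int bufferSizeAfter(byte[][] rows, boolean resetPerRow) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        try {
            for (byte[] row : rows) {
                if (resetPerRow) {
                    bytes.reset(); // drop the previous row's bytes
                }
                out.write(row, 0, row.length);
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }
}
```

With 1000 rows of 100 bytes each, the buffer ends at 100000 bytes without the
reset and at 100 bytes with it. On a large enough input the first variant is
what eventually reaches ByteArrayOutputStream.hugeCapacity().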





[jira] [Created] (HIVE-15718) Fix the NullPointer problem caused by split phase

2017-01-24 Thread Colin Ma (JIRA)
Colin Ma created HIVE-15718:
---

 Summary: Fix the NullPointer problem caused by split phase
 Key: HIVE-15718
 URL: https://issues.apache.org/jira/browse/HIVE-15718
 Project: Hive
  Issue Type: Sub-task
Reporter: Colin Ma
Assignee: Colin Ma
Priority: Critical


VectorizedParquetRecordReader.initialize() throws a NullPointerException 
because the input split is null. Such a split should be ignored.
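
A minimal sketch of the proposed guard, using hypothetical stand-ins (ToySplit,
ToyVectorizedReader) rather than Hive's real classes. Returning a boolean to
signal that the split was ignored is an illustrative choice, not the actual
signature of initialize().

```java
// Hypothetical stand-in for an input split.
final class ToySplit {
    final String path;

    ToySplit(String path) {
        this.path = path;
    }
}

final class ToyVectorizedReader {
    private String path; // stays null until a usable split has been seen

    // Returns true if the reader was initialized, false if the split was
    // null and therefore ignored instead of being dereferenced.
    boolean initialize(ToySplit split) {
        if (split == null) {
            return false;
        }
        path = split.path;
        return true;
    }

    boolean hasInput() {
        return path != null;
    }
}
```

Callers can then skip readers for which hasInput() is false instead of failing
during initialization.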


