[jira] [Commented] (HIVE-7219) Improve performance of serialization utils in ORC

2014-08-09 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091875#comment-14091875
 ] 

Lefty Leverenz commented on HIVE-7219:
--

Configuration parameter *hive.exec.orc.encoding.strategy* is documented in the 
wiki here for release 0.14.0:

* [Configuration Properties -- hive.exec.orc.encoding.strategy | 
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.exec.orc.encoding.strategy]

 Improve performance of serialization utils in ORC
 -

 Key: HIVE-7219
 URL: https://issues.apache.org/jira/browse/HIVE-7219
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 0.14.0
Reporter: Prasanth J
Assignee: Prasanth J
 Fix For: 0.14.0

 Attachments: HIVE-7219.1.patch, HIVE-7219.2.patch, HIVE-7219.3.patch, 
 HIVE-7219.4.patch, orc-read-perf-jmh-benchmark.png


 ORC uses serialization utils heavily for reading and writing data. The 
 bitpacking and unpacking code in writeInts() and readInts() can be unrolled 
 for better performance. Also double reader/writer performance can be improved 
 by bulk reading/writing from/to byte array.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7219) Improve performance of serialization utils in ORC

2014-06-19 Thread Prasanth J (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14037771#comment-14037771
 ] 

Prasanth J commented on HIVE-7219:
--

bq. Question: Should the following information from Prasanth J also be 
documented, and if so does it belong in the ORC wikidoc or with the parameter 
description in Configuration Properties?
bq. For integers, this patch will improve only very specific cases. If the 
encoding uses SHORT_REPEAT, DELTA (esp. fixed delta), PATCHED_BLOB then this 
patch will NOT have any effect, as these encodings does not use bit packing. 
The bit packed encodings like DIRECT, DELTA (variable delta) will see 
improvements.

I think these are too specific for it to be put into user documentation.

 Improve performance of serialization utils in ORC
 -

 Key: HIVE-7219
 URL: https://issues.apache.org/jira/browse/HIVE-7219
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 0.14.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: TODOC14
 Fix For: 0.14.0

 Attachments: HIVE-7219.1.patch, HIVE-7219.2.patch, HIVE-7219.3.patch, 
 HIVE-7219.4.patch, orc-read-perf-jmh-benchmark.png


 ORC uses serialization utils heavily for reading and writing data. The 
 bitpacking and unpacking code in writeInts() and readInts() can be unrolled 
 for better performance. Also double reader/writer performance can be improved 
 by bulk reading/writing from/to byte array.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7219) Improve performance of serialization utils in ORC

2014-06-17 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034100#comment-14034100
 ] 

Hive QA commented on HIVE-7219:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12650656/HIVE-7219.4.patch

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 5653 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_parquet_columnar
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_root_dir_external_table
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_authorization_ctas
org.apache.hadoop.hive.ql.exec.tez.TestTezTask.testSubmit
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/491/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/491/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-Build-491/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12650656

 Improve performance of serialization utils in ORC
 -

 Key: HIVE-7219
 URL: https://issues.apache.org/jira/browse/HIVE-7219
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 0.14.0
Reporter: Prasanth J
Assignee: Prasanth J
 Attachments: HIVE-7219.1.patch, HIVE-7219.2.patch, HIVE-7219.3.patch, 
 HIVE-7219.4.patch, orc-read-perf-jmh-benchmark.png


 ORC uses serialization utils heavily for reading and writing data. The 
 bitpacking and unpacking code in writeInts() and readInts() can be unrolled 
 for better performance. Also double reader/writer performance can be improved 
 by bulk reading/writing from/to byte array.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7219) Improve performance of serialization utils in ORC

2014-06-15 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032062#comment-14032062
 ] 

Hive QA commented on HIVE-7219:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12650413/HIVE-7219.3.patch

{color:red}ERROR:{color} -1 due to 20 failed/errored test(s), 5578 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_filter
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_groupby
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_join
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_part
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_union
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynpart_sort_opt_vectorization
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_analyze
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_predicate_pushdown
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_split_elimination
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_parquet_columnar
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vector_decimal_aggregate
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vector_decimal_mapjoin
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorization_short_regress
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_root_dir_external_table
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_authorization_ctas
org.apache.hadoop.hive.ql.io.orc.TestFileDump.testDictionaryThreshold
org.apache.hadoop.hive.ql.io.orc.TestFileDump.testDump
org.apache.hive.hcatalog.pig.TestHCatLoader.testReadDataPrimitiveTypes
org.apache.hive.hcatalog.pig.TestOrcHCatPigStorer.testStoreFuncAllSimpleTypes
org.apache.hive.jdbc.miniHS2.TestHiveServer2.testConnection
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/476/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/476/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-Build-476/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 20 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12650413

 Improve performance of serialization utils in ORC
 -

 Key: HIVE-7219
 URL: https://issues.apache.org/jira/browse/HIVE-7219
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 0.14.0
Reporter: Prasanth J
Assignee: Prasanth J
 Attachments: HIVE-7219.1.patch, HIVE-7219.2.patch, HIVE-7219.3.patch, 
 orc-read-perf-jmh-benchmark.png


 ORC uses serialization utils heavily for reading and writing data. The 
 bitpacking and unpacking code in writeInts() and readInts() can be unrolled 
 for better performance. Also double reader/writer performance can be improved 
 by bulk reading/writing from/to byte array.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7219) Improve performance of serialization utils in ORC

2014-06-15 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032090#comment-14032090
 ] 

Gunther Hagleitner commented on HIVE-7219:
--

These failures look related to the patch (at least some of them). Looked at 
orc_analyze: Need to update golden files with new sizes. orc_split_elimination: 
Seems the order of records has changed in some queries, not sure how this patch 
causes it, but should take a look.

 Improve performance of serialization utils in ORC
 -

 Key: HIVE-7219
 URL: https://issues.apache.org/jira/browse/HIVE-7219
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 0.14.0
Reporter: Prasanth J
Assignee: Prasanth J
 Attachments: HIVE-7219.1.patch, HIVE-7219.2.patch, HIVE-7219.3.patch, 
 orc-read-perf-jmh-benchmark.png


 ORC uses serialization utils heavily for reading and writing data. The 
 bitpacking and unpacking code in writeInts() and readInts() can be unrolled 
 for better performance. Also double reader/writer performance can be improved 
 by bulk reading/writing from/to byte array.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7219) Improve performance of serialization utils in ORC

2014-06-14 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031758#comment-14031758
 ] 

Gunther Hagleitner commented on HIVE-7219:
--

+1 assuming tests will pass.

 Improve performance of serialization utils in ORC
 -

 Key: HIVE-7219
 URL: https://issues.apache.org/jira/browse/HIVE-7219
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 0.14.0
Reporter: Prasanth J
Assignee: Prasanth J
 Attachments: HIVE-7219.1.patch, HIVE-7219.2.patch, HIVE-7219.3.patch, 
 orc-read-perf-jmh-benchmark.png


 ORC uses serialization utils heavily for reading and writing data. The 
 bitpacking and unpacking code in writeInts() and readInts() can be unrolled 
 for better performance. Also double reader/writer performance can be improved 
 by bulk reading/writing from/to byte array.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7219) Improve performance of serialization utils in ORC

2014-06-13 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031324#comment-14031324
 ] 

Gopal V commented on HIVE-7219:
---

Posted comments about config options, but that is from a documentation  user 
configuration angle.

Doing a 1Tb insert with this patch in a few minutes, will measure perf at scale.

+1 (NB)

 Improve performance of serialization utils in ORC
 -

 Key: HIVE-7219
 URL: https://issues.apache.org/jira/browse/HIVE-7219
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 0.14.0
Reporter: Prasanth J
Assignee: Prasanth J
 Attachments: HIVE-7219.1.patch, HIVE-7219.2.patch, HIVE-7219.3.patch, 
 orc-read-perf-jmh-benchmark.png


 ORC uses serialization utils heavily for reading and writing data. The 
 bitpacking and unpacking code in writeInts() and readInts() can be unrolled 
 for better performance. Also double reader/writer performance can be improved 
 by bulk reading/writing from/to byte array.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7219) Improve performance of serialization utils in ORC

2014-06-13 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031393#comment-14031393
 ] 

Gopal V commented on HIVE-7219:
---

1243 seconds to insert 1Tb, which is not bad at all.

The only ORC method which shows up on my profiler is the RedBlackTree::add().

 Improve performance of serialization utils in ORC
 -

 Key: HIVE-7219
 URL: https://issues.apache.org/jira/browse/HIVE-7219
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 0.14.0
Reporter: Prasanth J
Assignee: Prasanth J
 Attachments: HIVE-7219.1.patch, HIVE-7219.2.patch, HIVE-7219.3.patch, 
 orc-read-perf-jmh-benchmark.png


 ORC uses serialization utils heavily for reading and writing data. The 
 bitpacking and unpacking code in writeInts() and readInts() can be unrolled 
 for better performance. Also double reader/writer performance can be improved 
 by bulk reading/writing from/to byte array.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7219) Improve performance of serialization utils in ORC

2014-06-13 Thread Prasanth J (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031396#comment-14031396
 ] 

Prasanth J commented on HIVE-7219:
--

Thanks [~gopalv] for running this experiment! Yes, we should fix RBTree 
improvements in a different jira.

 Improve performance of serialization utils in ORC
 -

 Key: HIVE-7219
 URL: https://issues.apache.org/jira/browse/HIVE-7219
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 0.14.0
Reporter: Prasanth J
Assignee: Prasanth J
 Attachments: HIVE-7219.1.patch, HIVE-7219.2.patch, HIVE-7219.3.patch, 
 orc-read-perf-jmh-benchmark.png


 ORC uses serialization utils heavily for reading and writing data. The 
 bitpacking and unpacking code in writeInts() and readInts() can be unrolled 
 for better performance. Also double reader/writer performance can be improved 
 by bulk reading/writing from/to byte array.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7219) Improve performance of serialization utils in ORC

2014-06-13 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031418#comment-14031418
 ] 

Gopal V commented on HIVE-7219:
---

inserted table was sizes in Gb in both modes.

 Improve performance of serialization utils in ORC
 -

 Key: HIVE-7219
 URL: https://issues.apache.org/jira/browse/HIVE-7219
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 0.14.0
Reporter: Prasanth J
Assignee: Prasanth J
 Attachments: HIVE-7219.1.patch, HIVE-7219.2.patch, HIVE-7219.3.patch, 
 orc-read-perf-jmh-benchmark.png


 ORC uses serialization utils heavily for reading and writing data. The 
 bitpacking and unpacking code in writeInts() and readInts() can be unrolled 
 for better performance. Also double reader/writer performance can be improved 
 by bulk reading/writing from/to byte array.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7219) Improve performance of serialization utils in ORC

2014-06-13 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031417#comment-14031417
 ] 

Gopal V commented on HIVE-7219:
---

Ran both mode inserts (1243 vs 1269s) with Snappy (which is only 2% faster, but 
the columns had 1:1:2 ratio of int, double, string). 

String columns will speed up with HIVE-7144, hopefully.

||Table || Size || Speed || Difference ||
|customer | 10.5 | 10.6| + 0.98 % |
|lineitem | 204.7 | 220.7 | + 7.81% | 
|orders | 48.7 | 50.8 | + 4.31% |
|part | 4.3 | 4.6| + 6.97 % |
|partsupp | 39.1 | 39.5 | + 1.02%|
|supplier | 0.68 | 0.69 | + 1.47% |

 Improve performance of serialization utils in ORC
 -

 Key: HIVE-7219
 URL: https://issues.apache.org/jira/browse/HIVE-7219
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 0.14.0
Reporter: Prasanth J
Assignee: Prasanth J
 Attachments: HIVE-7219.1.patch, HIVE-7219.2.patch, HIVE-7219.3.patch, 
 orc-read-perf-jmh-benchmark.png


 ORC uses serialization utils heavily for reading and writing data. The 
 bitpacking and unpacking code in writeInts() and readInts() can be unrolled 
 for better performance. Also double reader/writer performance can be improved 
 by bulk reading/writing from/to byte array.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7219) Improve performance of serialization utils in ORC

2014-06-13 Thread Prasanth J (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031423#comment-14031423
 ] 

Prasanth J commented on HIVE-7219:
--

For integers, this patch will improve only very specific cases. If the encoding 
uses SHORT_REPEAT, DELTA (esp. fixed delta), PATCHED_BLOB then this patch will 
NOT have any effect, as these encodings does not use bit packing. The bit 
packed encodings like DIRECT, DELTA (variable delta) will see improvements.

 Improve performance of serialization utils in ORC
 -

 Key: HIVE-7219
 URL: https://issues.apache.org/jira/browse/HIVE-7219
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 0.14.0
Reporter: Prasanth J
Assignee: Prasanth J
 Attachments: HIVE-7219.1.patch, HIVE-7219.2.patch, HIVE-7219.3.patch, 
 orc-read-perf-jmh-benchmark.png


 ORC uses serialization utils heavily for reading and writing data. The 
 bitpacking and unpacking code in writeInts() and readInts() can be unrolled 
 for better performance. Also double reader/writer performance can be improved 
 by bulk reading/writing from/to byte array.



--
This message was sent by Atlassian JIRA
(v6.2#6252)