[jira] [Commented] (HIVE-7219) Improve performance of serialization utils in ORC
[ https://issues.apache.org/jira/browse/HIVE-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091875#comment-14091875 ] Lefty Leverenz commented on HIVE-7219: -- Configuration parameter *hive.exec.orc.encoding.strategy* is documented in the wiki here for release 0.14.0: * [Configuration Properties -- hive.exec.orc.encoding.strategy | https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.exec.orc.encoding.strategy] Improve performance of serialization utils in ORC - Key: HIVE-7219 URL: https://issues.apache.org/jira/browse/HIVE-7219 Project: Hive Issue Type: Improvement Components: File Formats Affects Versions: 0.14.0 Reporter: Prasanth J Assignee: Prasanth J Fix For: 0.14.0 Attachments: HIVE-7219.1.patch, HIVE-7219.2.patch, HIVE-7219.3.patch, HIVE-7219.4.patch, orc-read-perf-jmh-benchmark.png ORC uses serialization utils heavily for reading and writing data. The bitpacking and unpacking code in writeInts() and readInts() can be unrolled for better performance. Also double reader/writer performance can be improved by bulk reading/writing from/to byte array. -- This message was sent by Atlassian JIRA (v6.2#6252)
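The issue description says that double reader/writer performance can be improved by bulk reading/writing from/to a byte array. A minimal sketch of that idea follows, written in Python rather than the Java of the actual patch. The function names, the use of `io.BytesIO` as the stream, and the little-endian byte order are assumptions made for illustration; none of them is taken from the HIVE-7219 patches.

```python
import io
import struct


def write_double_bytewise(stream, value):
    """Write one double as eight separate single-byte writes."""
    # Reinterpret the IEEE 754 double as a 64-bit unsigned integer.
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    for shift in range(0, 64, 8):
        # Least significant byte first (little-endian, assumed here).
        stream.write(bytes([(bits >> shift) & 0xFF]))


def write_double_buffered(stream, value, scratch):
    """Fill a reusable 8-byte scratch buffer, then issue a single write."""
    struct.pack_into("<d", scratch, 0, value)
    stream.write(scratch)


def read_doubles(data):
    """Decode every 8-byte group of `data` as a little-endian double."""
    return [v for (v,) in struct.iter_unpack("<d", data)]


values = [0.0, 1.5, -2.25, 1e300]
slow, fast = io.BytesIO(), io.BytesIO()
scratch = bytearray(8)  # hypothetical reusable buffer, allocated once
for v in values:
    write_double_bytewise(slow, v)
    write_double_buffered(fast, v, scratch)
```

Both writers produce the same bytes; the buffered variant simply replaces eight stream calls per value with one, which is the kind of saving the description alludes to.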
[jira] [Commented] (HIVE-7219) Improve performance of serialization utils in ORC
[ https://issues.apache.org/jira/browse/HIVE-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037771#comment-14037771 ] Prasanth J commented on HIVE-7219: --
bq. Question: Should the following information from Prasanth J also be documented, and if so does it belong in the ORC wikidoc or with the parameter description in Configuration Properties?
bq. For integers, this patch will improve only very specific cases. If the encoding uses SHORT_REPEAT, DELTA (esp. fixed delta), or PATCHED_BASE, then this patch will NOT have any effect, as these encodings do not use bit packing. The bit-packed encodings like DIRECT and DELTA (variable delta) will see improvements.
I think these details are too specific to put into user documentation.
Improve performance of serialization utils in ORC - Key: HIVE-7219 URL: https://issues.apache.org/jira/browse/HIVE-7219 Project: Hive Issue Type: Improvement Components: File Formats Affects Versions: 0.14.0 Reporter: Prasanth J Assignee: Prasanth J Labels: TODOC14 Fix For: 0.14.0 Attachments: HIVE-7219.1.patch, HIVE-7219.2.patch, HIVE-7219.3.patch, HIVE-7219.4.patch, orc-read-perf-jmh-benchmark.png
[jira] [Commented] (HIVE-7219) Improve performance of serialization utils in ORC
[ https://issues.apache.org/jira/browse/HIVE-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034100#comment-14034100 ] Hive QA commented on HIVE-7219: ---
{color:red}Overall{color}: -1 at least one tests failed
Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12650656/HIVE-7219.4.patch
{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 5653 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_parquet_columnar
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_root_dir_external_table
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_authorization_ctas
org.apache.hadoop.hive.ql.exec.tez.TestTezTask.testSubmit
{noformat}
Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/491/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/491/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-Build-491/
Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}
This message is automatically generated.
ATTACHMENT ID: 12650656
[jira] [Commented] (HIVE-7219) Improve performance of serialization utils in ORC
[ https://issues.apache.org/jira/browse/HIVE-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032062#comment-14032062 ] Hive QA commented on HIVE-7219: ---
{color:red}Overall{color}: -1 at least one tests failed
Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12650413/HIVE-7219.3.patch
{color:red}ERROR:{color} -1 due to 20 failed/errored test(s), 5578 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_filter
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_groupby
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_join
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_part
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_union
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynpart_sort_opt_vectorization
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_analyze
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_predicate_pushdown
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_split_elimination
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_parquet_columnar
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vector_decimal_aggregate
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vector_decimal_mapjoin
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorization_short_regress
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_root_dir_external_table
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_authorization_ctas
org.apache.hadoop.hive.ql.io.orc.TestFileDump.testDictionaryThreshold
org.apache.hadoop.hive.ql.io.orc.TestFileDump.testDump
org.apache.hive.hcatalog.pig.TestHCatLoader.testReadDataPrimitiveTypes
org.apache.hive.hcatalog.pig.TestOrcHCatPigStorer.testStoreFuncAllSimpleTypes
org.apache.hive.jdbc.miniHS2.TestHiveServer2.testConnection
{noformat}
Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/476/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/476/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-Build-476/
Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 20 tests failed
{noformat}
This message is automatically generated.
ATTACHMENT ID: 12650413
[jira] [Commented] (HIVE-7219) Improve performance of serialization utils in ORC
[ https://issues.apache.org/jira/browse/HIVE-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032090#comment-14032090 ] Gunther Hagleitner commented on HIVE-7219: -- These failures look related to the patch (at least some of them). I looked at two of them. orc_analyze: the golden files need to be updated with the new sizes. orc_split_elimination: the order of records seems to have changed in some queries; I am not sure how this patch causes that, but it should be looked at.
[jira] [Commented] (HIVE-7219) Improve performance of serialization utils in ORC
[ https://issues.apache.org/jira/browse/HIVE-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031758#comment-14031758 ] Gunther Hagleitner commented on HIVE-7219: -- +1 assuming tests will pass.
[jira] [Commented] (HIVE-7219) Improve performance of serialization utils in ORC
[ https://issues.apache.org/jira/browse/HIVE-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031324#comment-14031324 ] Gopal V commented on HIVE-7219: --- Posted comments about config options, but those are from a documentation/user-configuration angle. Doing a 1 TB insert with this patch in a few minutes; will measure perf at scale. +1 (NB)
[jira] [Commented] (HIVE-7219) Improve performance of serialization utils in ORC
[ https://issues.apache.org/jira/browse/HIVE-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031393#comment-14031393 ] Gopal V commented on HIVE-7219: --- 1243 seconds to insert 1 TB, which is not bad at all. The only ORC method that shows up in my profiler is RedBlackTree::add().
[jira] [Commented] (HIVE-7219) Improve performance of serialization utils in ORC
[ https://issues.apache.org/jira/browse/HIVE-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031396#comment-14031396 ] Prasanth J commented on HIVE-7219: -- Thanks [~gopalv] for running this experiment! Yes, we should address the RBTree improvements in a separate JIRA.
[jira] [Commented] (HIVE-7219) Improve performance of serialization utils in ORC
[ https://issues.apache.org/jira/browse/HIVE-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031418#comment-14031418 ] Gopal V commented on HIVE-7219: --- The figures in the table I posted are sizes in GB, for both modes.
[jira] [Commented] (HIVE-7219) Improve performance of serialization utils in ORC
[ https://issues.apache.org/jira/browse/HIVE-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031417#comment-14031417 ] Gopal V commented on HIVE-7219: --- Ran the inserts in both modes (1243 s vs 1269 s) with Snappy. That is only about 2% faster, but the columns had a 1:1:2 ratio of int, double, string. String columns will speed up with HIVE-7144, hopefully.
||Table||Size||Speed||Difference||
|customer|10.5|10.6|+0.98%|
|lineitem|204.7|220.7|+7.81%|
|orders|48.7|50.8|+4.31%|
|part|4.3|4.6|+6.97%|
|partsupp|39.1|39.5|+1.02%|
|supplier|0.68|0.69|+1.47%|
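The Difference column and the "2%" timing figure in this comment can be rechecked with a short calculation. The sketch below assumes that the first numeric column is the baseline and the second is the value being compared against it; the comment does not state this, and the names `percent_increase`, `sizes`, and `timing_gap` are illustrative only.

```python
def percent_increase(baseline, other):
    """Relative change from `baseline` to `other`, in percent."""
    return (other - baseline) / baseline * 100.0


# (first column, second column) pairs as printed in the comment's table.
sizes = {
    "customer": (10.5, 10.6),
    "lineitem": (204.7, 220.7),
    "orders": (48.7, 50.8),
    "part": (4.3, 4.6),
    "partsupp": (39.1, 39.5),
    "supplier": (0.68, 0.69),
}

differences = {name: percent_increase(a, b) for name, (a, b) in sizes.items()}

# Gap between the two insert runs (1243 s vs 1269 s), relative to the faster one.
timing_gap = percent_increase(1243, 1269)
```

Recomputing from the rounded figures reproduces the posted percentages to within about 0.01 points for every table except customer, where the rounded inputs give roughly 0.95% instead of the posted 0.98%. The posted value was presumably computed from unrounded sizes, though that is a guess.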
[jira] [Commented] (HIVE-7219) Improve performance of serialization utils in ORC
[ https://issues.apache.org/jira/browse/HIVE-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031423#comment-14031423 ] Prasanth J commented on HIVE-7219: -- For integers, this patch will improve only very specific cases. If the encoding uses SHORT_REPEAT, DELTA (esp. fixed delta), or PATCHED_BASE, then this patch will NOT have any effect, as these encodings do not use bit packing. The bit-packed encodings like DIRECT and DELTA (variable delta) will see improvements.
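To make the bit-packing point concrete, here is a small sketch of what packing integers at a fixed bit width looks like, next to a version specialized ("unrolled") for one width. It is written in Python for illustration only and is not the writeInts()/readInts() code from the patch; the function names, the most-significant-bit-first layout, and the choice of a 4-bit width are all assumptions.

```python
def pack_generic(values, width):
    """Pack each value into `width` bits, most significant bit first."""
    out = bytearray()
    acc = 0      # bits collected so far, right-aligned
    nbits = 0    # number of valid bits in acc
    mask = (1 << width) - 1
    for v in values:
        acc = (acc << width) | (v & mask)
        nbits += width
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1  # keep only the bits not yet emitted
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # zero-pad the last byte
    return bytes(out)


def unpack_generic(data, width, count):
    """Read `count` values of `width` bits each, one bit at a time."""
    values = []
    bit_pos = 0
    for _ in range(count):
        v = 0
        for _ in range(width):
            bit = (data[bit_pos >> 3] >> (7 - (bit_pos & 7))) & 1
            v = (v << 1) | bit
            bit_pos += 1
        values.append(v)
    return values


def pack_width4(values):
    """Specialized packer for width 4: two values per byte, no inner loop."""
    out = bytearray()
    for i in range(0, len(values) - 1, 2):
        out.append(((values[i] & 0x0F) << 4) | (values[i + 1] & 0x0F))
    if len(values) % 2:
        out.append((values[-1] & 0x0F) << 4)
    return bytes(out)


def unpack_width4(data, count):
    """Specialized reader for width 4: split each byte into two nibbles."""
    values = []
    for byte in data[: (count + 1) // 2]:
        values.append(byte >> 4)
        values.append(byte & 0x0F)
    return values[:count]
```

The specialized functions do the same work as the generic ones for width 4 but with no per-bit bookkeeping, which is the general shape of the optimization discussed in this issue. In Python the difference is only structural; any speedup in a JIT-compiled language would have to be measured there.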