[jira] [Commented] (HIVE-9333) Move parquet serialize implementation to DataWritableWriter to improve write speeds
[ https://issues.apache.org/jira/browse/HIVE-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317583#comment-14317583 ]

Hive QA commented on HIVE-9333:
---

{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12698234/HIVE-9333.7.patch

{color:green}SUCCESS:{color} +1 7540 tests passed

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2767/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2767/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2767/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12698234 - PreCommit-HIVE-TRUNK-Build

Move parquet serialize implementation to DataWritableWriter to improve write speeds
---
Key: HIVE-9333
URL: https://issues.apache.org/jira/browse/HIVE-9333
Project: Hive
Issue Type: Sub-task
Reporter: Sergio Peña
Assignee: Sergio Peña
Attachments: HIVE-9333.5.patch, HIVE-9333.6.patch, HIVE-9333.7.patch

The serialize process in ParquetHiveSerDe parses a Hive object into a Writable object by looping through all of the Hive object's children and creating new Writable objects per child. These final Writable objects are then passed to the Parquet writing function and parsed again in the DataWritableWriter class by looping through the ArrayWritable object. These two loops (ParquetHiveSerDe.serialize() and DataWritableWriter.write()) may be reduced to a single loop inside the DataWritableWriter.write() method in order to speed up the write path for Hive Parquet.

To achieve this, we can wrap the Hive object and object inspector in the ParquetHiveSerDe.serialize() method into an object that implements the Writable interface, thus avoiding the loop that serialize() does, and leave the looping to the DataWritableWriter.write() method. We can see how ORC does this with the OrcSerde.OrcSerdeRow class.

Writable objects are organized differently across storage formats, so I don't think it is necessary to create and keep the Writable objects in the serialize() method, as they won't be used until the writing process starts (DataWritableWriter.write()).

This performance issue was found using the microbenchmark tests from HIVE-8121.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
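The wrapper described above can be sketched as follows. This is an illustrative sketch only, not the actual patch: the class name ParquetHiveRecord and its accessors are assumptions, and tiny stand-in interfaces replace the real org.apache.hadoop.io.Writable and org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector so the example is self-contained.

```java
// Minimal stand-ins for the real Hadoop/Hive interfaces (assumption: the
// real code would implement org.apache.hadoop.io.Writable and carry a real
// ObjectInspector).
interface ObjectInspector { }

interface Writable {
    void write(java.io.DataOutput out) throws java.io.IOException;
    void readFields(java.io.DataInput in) throws java.io.IOException;
}

// Analogous to OrcSerde.OrcSerdeRow: serialize() becomes a constant-time
// wrap of the row and its inspector, and the expensive per-field traversal
// happens exactly once, inside DataWritableWriter.write().
class ParquetHiveRecord implements Writable {
    private final Object row;
    private final ObjectInspector inspector;

    ParquetHiveRecord(Object row, ObjectInspector inspector) {
        this.row = row;
        this.inspector = inspector;
    }

    // The writer unwraps the raw row and inspector and walks the fields itself.
    Object getRow() { return row; }
    ObjectInspector getInspector() { return inspector; }

    // The row never travels through the Writable byte protocol on this path,
    // so both methods can refuse to be called.
    public void write(java.io.DataOutput out) {
        throw new UnsupportedOperationException("should not be called");
    }

    public void readFields(java.io.DataInput in) {
        throw new UnsupportedOperationException("should not be called");
    }
}
```

With a wrapper like this, ParquetHiveSerDe.serialize() would reduce to something like `return new ParquetHiveRecord(obj, objInspector);`, eliminating its per-child loop and the intermediate Writable tree entirely.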
[jira] [Commented] (HIVE-9333) Move parquet serialize implementation to DataWritableWriter to improve write speeds
[ https://issues.apache.org/jira/browse/HIVE-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307770#comment-14307770 ]

Brock Noland commented on HIVE-9333:

+1
[jira] [Commented] (HIVE-9333) Move parquet serialize implementation to DataWritableWriter to improve write speeds
[ https://issues.apache.org/jira/browse/HIVE-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14300930#comment-14300930 ]

Hive QA commented on HIVE-9333:
---

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12695870/HIVE-9333.6.patch

{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 7407 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby3_map
org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler.org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler
org.apache.hive.hcatalog.templeton.TestWebHCatE2e.getHiveVersion
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2612/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2612/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2612/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12695870 - PreCommit-HIVE-TRUNK-Build
[jira] [Commented] (HIVE-9333) Move parquet serialize implementation to DataWritableWriter to improve write speeds
[ https://issues.apache.org/jira/browse/HIVE-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299274#comment-14299274 ]

Hive QA commented on HIVE-9333:
---

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12695588/HIVE-9333.5.patch

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 7407 tests executed

*Failed tests:*
{noformat}
org.apache.hive.hcatalog.streaming.TestStreaming.testTransactionBatchCommit_Json
org.apache.hive.hcatalog.templeton.TestWebHCatE2e.getHiveVersion
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2591/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2591/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2591/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12695588 - PreCommit-HIVE-TRUNK-Build
[jira] [Commented] (HIVE-9333) Move parquet serialize implementation to DataWritableWriter to improve write speeds
[ https://issues.apache.org/jira/browse/HIVE-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298629#comment-14298629 ]

Hive QA commented on HIVE-9333:
---

{color:red}Overall{color}: -1 no tests executed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12695450/HIVE-9333.4.patch

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2585/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2585/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2585/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hive-ptest/working/scratch/source-prep.sh' failed with exit status 1 and output '+ [[ -n /usr/java/jdk1.7.0_45-cloudera ]]
+ export JAVA_HOME=/usr/java/jdk1.7.0_45-cloudera
+ JAVA_HOME=/usr/java/jdk1.7.0_45-cloudera
+ export PATH=/usr/java/jdk1.7.0_45-cloudera/bin/:/usr/local/apache-maven-3.0.5/bin:/usr/java/jdk1.7.0_45-cloudera/bin:/usr/local/apache-ant-1.9.1/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hiveptest/bin
+ PATH=/usr/java/jdk1.7.0_45-cloudera/bin/:/usr/local/apache-maven-3.0.5/bin:/usr/java/jdk1.7.0_45-cloudera/bin:/usr/local/apache-ant-1.9.1/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hiveptest/bin
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'M2_OPTS=-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128'
+ M2_OPTS='-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128'
+ cd /data/hive-ptest/working/
+ tee /data/hive-ptest/logs/PreCommit-HIVE-TRUNK-Build-2585/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ svn = \s\v\n ]]
+ [[ -n '' ]]
+ [[ -d apache-svn-trunk-source ]]
+ [[ ! -d apache-svn-trunk-source/.svn ]]
+ [[ ! -d apache-svn-trunk-source ]]
+ cd apache-svn-trunk-source
+ svn revert -R .
Reverted '.gitignore'
Reverted 'cli/src/java/org/apache/hadoop/hive/cli/CliDriver.java'
Reverted 'ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestFileDump.java'
Reverted 'ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestRecordReaderImpl.java'
Reverted 'ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestOrcFile.java'
Reverted 'ql/src/test/resources/orc-file-dump-dictionary-threshold.out'
Reverted 'ql/src/test/resources/orc-file-has-null.out'
Reverted 'ql/src/test/resources/orc-file-dump.out'
Reverted 'ql/src/protobuf/org/apache/hadoop/hive/ql/io/orc/orc_proto.proto'
Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java'
Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcOutputFormat.java'
Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcFile.java'
Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/orc/StreamName.java'
Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/orc/FileDump.java'
Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java'
Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/orc/WriterImpl.java'
Reverted 'ql/src/gen/protobuf/gen-java/org/apache/hadoop/hive/ql/io/orc/OrcProto.java'
++ awk '{print $2}'
++ egrep -v '^X|^Performing status on external'
++ svn status --no-ignore
+ rm -rf target datanucleus.log ant/target shims/target shims/0.20S/target shims/0.23/target shims/aggregator/target shims/common/target shims/scheduler/target packaging/target hbase-handler/target testutils/target jdbc/target metastore/target itests/target itests/thirdparty itests/hcatalog-unit/target itests/test-serde/target itests/qtest/target itests/hive-unit-hadoop2/target itests/hive-minikdc/target itests/hive-jmh/target itests/hive-unit/target itests/custom-serde/target itests/util/target itests/qtest-spark/target hcatalog/target hcatalog/core/target hcatalog/streaming/target hcatalog/server-extensions/target hcatalog/webhcat/svr/target hcatalog/webhcat/java-client/target hcatalog/hcatalog-pig-adapter/target accumulo-handler/target hwi/target common/target common/src/gen spark-client/target contrib/target service/target serde/target beeline/target odbc/target cli/target ql/dependency-reduced-pom.xml ql/target ql/src/test/org/apache/hadoop/hive/ql/io/filters ql/src/test/resources/orc-file-dump-bloomfilter.out ql/src/test/resources/orc-file-dump-bloomfilter2.out ql/src/java/org/apache/hadoop/hive/ql/io/filters ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcUtils.java
+ svn update
Fetching external item into 'hcatalog/src/test/e2e/harness'
External at revision 1656014.
At revision 1656014.
+ patchCommandPath=/data/hive-ptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hive-ptest/working/scratch/build.patch
+ [[ -f /data/hive-ptest/working/scratch/build.patch ]]
+
[jira] [Commented] (HIVE-9333) Move parquet serialize implementation to DataWritableWriter to improve write speeds
[ https://issues.apache.org/jira/browse/HIVE-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298082#comment-14298082 ]

Ferdinand Xu commented on HIVE-9333:

Thanks Sergio for your patch. LGTM +1
[jira] [Commented] (HIVE-9333) Move parquet serialize implementation to DataWritableWriter to improve write speeds
[ https://issues.apache.org/jira/browse/HIVE-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294145#comment-14294145 ]

Hive QA commented on HIVE-9333:
---

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12694817/HIVE-9333.3.patch

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 7396 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udaf_histogram_numeric
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2535/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2535/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2535/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12694817 - PreCommit-HIVE-TRUNK-Build
[jira] [Commented] (HIVE-9333) Move parquet serialize implementation to DataWritableWriter to improve write speeds
[ https://issues.apache.org/jira/browse/HIVE-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292502#comment-14292502 ]

Hive QA commented on HIVE-9333:
---

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12694605/HIVE-9333.1.patch

{color:red}ERROR:{color} -1 due to 6 failed/errored test(s), 7373 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_parquet_decimal
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_parquet_decimal1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_parquet_types
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorized_parquet
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vectorized_parquet
org.apache.hive.hcatalog.streaming.TestStreaming.testEndpointConnection
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2522/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2522/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2522/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 6 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12694605 - PreCommit-HIVE-TRUNK-Build
[jira] [Commented] (HIVE-9333) Move parquet serialize implementation to DataWritableWriter to improve write speeds
[ https://issues.apache.org/jira/browse/HIVE-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292936#comment-14292936 ]

Hive QA commented on HIVE-9333:
---

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12694678/HIVE-9333.2.patch

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 7394 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udaf_histogram_numeric
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2527/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2527/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2527/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12694678 - PreCommit-HIVE-TRUNK-Build
[jira] [Commented] (HIVE-9333) Move parquet serialize implementation to DataWritableWriter to improve write speeds
[ https://issues.apache.org/jira/browse/HIVE-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292844#comment-14292844 ]

Ferdinand Xu commented on HIVE-9333:

Thanks Sergio for your patch. I have left some general questions on the review board.