[jira] [Commented] (HIVE-9333) Move parquet serialize implementation to DataWritableWriter to improve write speeds
[ https://issues.apache.org/jira/browse/HIVE-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317583#comment-14317583 ]

Hive QA commented on HIVE-9333:
---

{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12698234/HIVE-9333.7.patch

{color:green}SUCCESS:{color} +1 7540 tests passed

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2767/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2767/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2767/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12698234 - PreCommit-HIVE-TRUNK-Build

> Move parquet serialize implementation to DataWritableWriter to improve write speeds
> ---
>
> Key: HIVE-9333
> URL: https://issues.apache.org/jira/browse/HIVE-9333
> Project: Hive
> Issue Type: Sub-task
> Reporter: Sergio Peña
> Assignee: Sergio Peña
> Attachments: HIVE-9333.5.patch, HIVE-9333.6.patch, HIVE-9333.7.patch
>
>
> The serialize process in ParquetHiveSerDe converts a Hive object into a Writable object by
> looping through all of the Hive object's children and creating a new Writable object per child.
> These Writable objects are then passed to the Parquet writing function and parsed again in the
> DataWritableWriter class by looping through the ArrayWritable object. These two loops
> (ParquetHiveSerDe.serialize() and DataWritableWriter.write()) may be reduced to a single loop
> inside the DataWritableWriter.write() method in order to increase the Parquet write speed in Hive.
> To achieve this, we can wrap the Hive object and its object inspector in the
> ParquetHiveSerDe.serialize() method inside an object that implements the Writable interface,
> avoiding the per-field loop that serialize() currently does and leaving the traversal to
> DataWritableWriter.write(). We can see how ORC does this with the OrcSerde.OrcSerdeRow class.
> Writable objects are organized differently in each storage format, so I don't think it is
> necessary to create and keep the Writable objects in the serialize() method, as they won't be
> used until the writing process starts (DataWritableWriter.write()).
> This performance issue was found using the microbenchmark tests from HIVE-8121.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
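The wrapper approach described above can be sketched roughly as follows. This is an illustrative sketch only, modeled on OrcSerde.OrcSerdeRow; the class name ParquetHiveRecord and its accessor names are assumptions for this example and are not necessarily what the attached patches use. The point is that the carrier just holds the row and its ObjectInspector, so no per-field Writable objects are created in serialize().

{code:java}
// Hypothetical sketch (names are assumptions, not necessarily the patch's):
// serialize() wraps the row and its ObjectInspector instead of copying each
// field into an ArrayWritable, so the single traversal happens later in
// DataWritableWriter.write().
import java.io.DataInput;
import java.io.DataOutput;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.io.Writable;

public class ParquetHiveRecord implements Writable {
  private final Object row;                 // the Hive row object, untouched
  private final ObjectInspector inspector;  // how to read the row's fields

  public ParquetHiveRecord(Object row, ObjectInspector inspector) {
    this.row = row;
    this.inspector = inspector;
  }

  public Object getObject() {
    return row;
  }

  public ObjectInspector getObjectInspector() {
    return inspector;
  }

  // The record never goes through Hadoop serialization; it is only a carrier
  // between ParquetHiveSerDe.serialize() and DataWritableWriter.write().
  @Override
  public void write(DataOutput out) {
    throw new UnsupportedOperationException("not intended to be serialized");
  }

  @Override
  public void readFields(DataInput in) {
    throw new UnsupportedOperationException("not intended to be deserialized");
  }
}
{code}

With a carrier like this, ParquetHiveSerDe.serialize() would simply return the wrapper, and DataWritableWriter.write() would walk getObject() using getObjectInspector() while emitting fields to Parquet, so each row is traversed once instead of twice.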
[jira] [Commented] (HIVE-9333) Move parquet serialize implementation to DataWritableWriter to improve write speeds
[ https://issues.apache.org/jira/browse/HIVE-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307770#comment-14307770 ]

Brock Noland commented on HIVE-9333:

+1
[jira] [Commented] (HIVE-9333) Move parquet serialize implementation to DataWritableWriter to improve write speeds
[ https://issues.apache.org/jira/browse/HIVE-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14300930#comment-14300930 ]

Hive QA commented on HIVE-9333:
---

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12695870/HIVE-9333.6.patch

{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 7407 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby3_map
org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler.org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler
org.apache.hive.hcatalog.templeton.TestWebHCatE2e.getHiveVersion
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2612/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2612/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2612/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12695870 - PreCommit-HIVE-TRUNK-Build
[jira] [Commented] (HIVE-9333) Move parquet serialize implementation to DataWritableWriter to improve write speeds
[ https://issues.apache.org/jira/browse/HIVE-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299274#comment-14299274 ]

Hive QA commented on HIVE-9333:
---

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12695588/HIVE-9333.5.patch

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 7407 tests executed

*Failed tests:*
{noformat}
org.apache.hive.hcatalog.streaming.TestStreaming.testTransactionBatchCommit_Json
org.apache.hive.hcatalog.templeton.TestWebHCatE2e.getHiveVersion
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2591/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2591/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2591/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12695588 - PreCommit-HIVE-TRUNK-Build
[jira] [Commented] (HIVE-9333) Move parquet serialize implementation to DataWritableWriter to improve write speeds
[ https://issues.apache.org/jira/browse/HIVE-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298629#comment-14298629 ]

Hive QA commented on HIVE-9333:
---

{color:red}Overall{color}: -1 no tests executed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12695450/HIVE-9333.4.patch

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2585/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2585/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2585/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hive-ptest/working/scratch/source-prep.sh' failed with exit status 1 and output '+ [[ -n /usr/java/jdk1.7.0_45-cloudera ]]
+ export JAVA_HOME=/usr/java/jdk1.7.0_45-cloudera
+ JAVA_HOME=/usr/java/jdk1.7.0_45-cloudera
+ export PATH=/usr/java/jdk1.7.0_45-cloudera/bin/:/usr/local/apache-maven-3.0.5/bin:/usr/java/jdk1.7.0_45-cloudera/bin:/usr/local/apache-ant-1.9.1/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hiveptest/bin
+ PATH=/usr/java/jdk1.7.0_45-cloudera/bin/:/usr/local/apache-maven-3.0.5/bin:/usr/java/jdk1.7.0_45-cloudera/bin:/usr/local/apache-ant-1.9.1/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hiveptest/bin
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'M2_OPTS=-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128'
+ M2_OPTS='-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128'
+ cd /data/hive-ptest/working/
+ tee /data/hive-ptest/logs/PreCommit-HIVE-TRUNK-Build-2585/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ svn = \s\v\n ]]
+ [[ -n '' ]]
+ [[ -d apache-svn-trunk-source ]]
+ [[ ! -d apache-svn-trunk-source/.svn ]]
+ [[ ! -d apache-svn-trunk-source ]]
+ cd apache-svn-trunk-source
+ svn revert -R .
Reverted '.gitignore'
Reverted 'cli/src/java/org/apache/hadoop/hive/cli/CliDriver.java'
Reverted 'ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestFileDump.java'
Reverted 'ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestRecordReaderImpl.java'
Reverted 'ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestOrcFile.java'
Reverted 'ql/src/test/resources/orc-file-dump-dictionary-threshold.out'
Reverted 'ql/src/test/resources/orc-file-has-null.out'
Reverted 'ql/src/test/resources/orc-file-dump.out'
Reverted 'ql/src/protobuf/org/apache/hadoop/hive/ql/io/orc/orc_proto.proto'
Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java'
Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcOutputFormat.java'
Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcFile.java'
Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/orc/StreamName.java'
Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/orc/FileDump.java'
Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java'
Reverted 'ql/src/java/org/apache/hadoop/hive/ql/io/orc/WriterImpl.java'
Reverted 'ql/src/gen/protobuf/gen-java/org/apache/hadoop/hive/ql/io/orc/OrcProto.java'
++ awk '{print $2}'
++ egrep -v '^X|^Performing status on external'
++ svn status --no-ignore
+ rm -rf target datanucleus.log ant/target shims/target shims/0.20S/target shims/0.23/target shims/aggregator/target shims/common/target shims/scheduler/target packaging/target hbase-handler/target testutils/target jdbc/target metastore/target itests/target itests/thirdparty itests/hcatalog-unit/target itests/test-serde/target itests/qtest/target itests/hive-unit-hadoop2/target itests/hive-minikdc/target itests/hive-jmh/target itests/hive-unit/target itests/custom-serde/target itests/util/target itests/qtest-spark/target hcatalog/target hcatalog/core/target hcatalog/streaming/target hcatalog/server-extensions/target hcatalog/webhcat/svr/target hcatalog/webhcat/java-client/target hcatalog/hcatalog-pig-adapter/target accumulo-handler/target hwi/target common/target common/src/gen spark-client/target contrib/target service/target serde/target beeline/target odbc/target cli/target ql/dependency-reduced-pom.xml ql/target ql/src/test/org/apache/hadoop/hive/ql/io/filters ql/src/test/resources/orc-file-dump-bloomfilter.out ql/src/test/resources/orc-file-dump-bloomfilter2.out ql/src/java/org/apache/hadoop/hive/ql/io/filters ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcUtils.java
+ svn update
Fetching external item into 'hcatalog/src/test/e2e/harness'
External at revision 1656014.
At revision 1656014.
+ patchCommandPath=/data/hive-ptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hive-ptest/working/scratch/build.patch
+ [[ -f /data/hive-ptest/working/scratch/build.patch ]]
+
[jira] [Commented] (HIVE-9333) Move parquet serialize implementation to DataWritableWriter to improve write speeds
[ https://issues.apache.org/jira/browse/HIVE-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298082#comment-14298082 ]

Ferdinand Xu commented on HIVE-9333:

Thanks Sergio for your patch. LGTM +1
[jira] [Commented] (HIVE-9333) Move parquet serialize implementation to DataWritableWriter to improve write speeds
[ https://issues.apache.org/jira/browse/HIVE-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294145#comment-14294145 ]

Hive QA commented on HIVE-9333:
---

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12694817/HIVE-9333.3.patch

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 7396 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udaf_histogram_numeric
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2535/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2535/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2535/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12694817 - PreCommit-HIVE-TRUNK-Build
[jira] [Commented] (HIVE-9333) Move parquet serialize implementation to DataWritableWriter to improve write speeds
[ https://issues.apache.org/jira/browse/HIVE-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292936#comment-14292936 ]

Hive QA commented on HIVE-9333:
---

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12694678/HIVE-9333.2.patch

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 7394 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udaf_histogram_numeric
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2527/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2527/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2527/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12694678 - PreCommit-HIVE-TRUNK-Build
[jira] [Commented] (HIVE-9333) Move parquet serialize implementation to DataWritableWriter to improve write speeds
[ https://issues.apache.org/jira/browse/HIVE-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292844#comment-14292844 ]

Ferdinand Xu commented on HIVE-9333:

Thanks Sergio for your patch. I have left some general questions on the review board.
[jira] [Commented] (HIVE-9333) Move parquet serialize implementation to DataWritableWriter to improve write speeds
[ https://issues.apache.org/jira/browse/HIVE-9333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292502#comment-14292502 ]

Hive QA commented on HIVE-9333:
---

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12694605/HIVE-9333.1.patch

{color:red}ERROR:{color} -1 due to 6 failed/errored test(s), 7373 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_parquet_decimal
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_parquet_decimal1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_parquet_types
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorized_parquet
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vectorized_parquet
org.apache.hive.hcatalog.streaming.TestStreaming.testEndpointConnection
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2522/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2522/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2522/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 6 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12694605 - PreCommit-HIVE-TRUNK-Build

> Move parquet serialize implementation to DataWritableWriter to improve write speeds
> ---
>
> Key: HIVE-9333
> URL: https://issues.apache.org/jira/browse/HIVE-9333
> Project: Hive
> Issue Type: Sub-task
> Reporter: Sergio Peña
> Assignee: Sergio Peña
> Attachments: HIVE-9333.1.patch
>
>
> The serialize process in ParquetHiveSerDe converts a Hive object into a Writable object by
> looping through all of the Hive object's children and creating a new Writable object per child.
> These Writable objects are then passed to the Parquet writing function and parsed again in the
> DataWritableWriter class by looping through the ArrayWritable object. These two loops
> (ParquetHiveSerDe.serialize() and DataWritableWriter.write()) may be reduced to a single loop
> inside the DataWritableWriter.write() method in order to increase the Parquet write speed in Hive.
> To achieve this, we can wrap the Hive object and its object inspector in the
> ParquetHiveSerDe.serialize() method inside an object that implements the Writable interface,
> avoiding the per-field loop that serialize() currently does and leaving the traversal to
> DataWritableWriter.write(). We can see how ORC does this with the OrcSerde.OrcSerdeRow class.
> Writable objects are organized differently in each storage format, so I don't think it is
> necessary to create and keep the Writable objects in the serialize() method, as they won't be
> used until the writing process starts (DataWritableWriter.write()).
> We might save 200% of the extra time by making this change.
> This performance issue was found using the microbenchmark tests from HIVE-8121.
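For completeness, here is a rough sketch of the other half of the idea: the single traversal inside a DataWritableWriter-style write path, which reads field values through the ObjectInspector and emits them straight to Parquet's RecordConsumer. The class and method names (SinglePassRecordWriterSketch, writeRecord, writeValue) are invented for illustration, the parquet.io.api package name assumes the pre-org.apache Parquet bundle Hive used at the time, and only a few primitive types are handled; the real DataWritableWriter covers all Hive types.

{code:java}
// Illustrative only: a simplified single-pass writer loop in the spirit of
// DataWritableWriter.write(). Nothing is copied into ArrayWritable first; each
// field is read through its ObjectInspector exactly once while the Parquet
// record is being built.
import java.util.List;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.BooleanObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.DoubleObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.IntObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;
import parquet.io.api.Binary;
import parquet.io.api.RecordConsumer;

public class SinglePassRecordWriterSketch {
  private final RecordConsumer recordConsumer;

  public SinglePassRecordWriterSketch(RecordConsumer recordConsumer) {
    this.recordConsumer = recordConsumer;
  }

  /** Walks the Hive row once, emitting each field straight to Parquet. */
  public void writeRecord(Object row, StructObjectInspector inspector) {
    recordConsumer.startMessage();
    List<? extends StructField> fields = inspector.getAllStructFieldRefs();
    for (int i = 0; i < fields.size(); i++) {
      StructField field = fields.get(i);
      Object value = inspector.getStructFieldData(row, field);
      if (value == null) {
        continue; // null fields are simply not written
      }
      recordConsumer.startField(field.getFieldName(), i);
      writeValue(value, field.getFieldObjectInspector());
      recordConsumer.endField(field.getFieldName(), i);
    }
    recordConsumer.endMessage();
  }

  /** Only a few primitive types are shown; the real writer handles all Hive types. */
  private void writeValue(Object value, ObjectInspector inspector) {
    switch (inspector.getCategory()) {
      case PRIMITIVE: {
        PrimitiveObjectInspector poi = (PrimitiveObjectInspector) inspector;
        switch (poi.getPrimitiveCategory()) {
          case INT:
            recordConsumer.addInteger(((IntObjectInspector) poi).get(value));
            break;
          case DOUBLE:
            recordConsumer.addDouble(((DoubleObjectInspector) poi).get(value));
            break;
          case BOOLEAN:
            recordConsumer.addBoolean(((BooleanObjectInspector) poi).get(value));
            break;
          case STRING:
            recordConsumer.addBinary(
                Binary.fromString(((StringObjectInspector) poi).getPrimitiveJavaObject(value)));
            break;
          default:
            throw new UnsupportedOperationException(
                "Type not covered by this sketch: " + poi.getPrimitiveCategory());
        }
        break;
      }
      default:
        // Lists, maps, structs and unions would recurse here in the real writer.
        throw new UnsupportedOperationException(
            "Category not covered by this sketch: " + inspector.getCategory());
    }
  }
}
{code}

Compared with the current two-pass flow, the ObjectInspector-driven loop above replaces both the per-field Writable creation in serialize() and the second pass over the ArrayWritable in the writer.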