[ https://issues.apache.org/jira/browse/HIVE-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708056#action_12708056 ]
He Yongqiang edited comment on HIVE-477 at 5/11/09 10:06 PM:
-------------------------------------------------------------

Using hadoop-streaming.jar

RCFile:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar -input /user/hive/warehouse/tablerc1 -output testHiveWriter -inputformat org.apache.hadoop.hive.ql.io.RCFileInputFormat -outputformat org.apache.hadoop.hive.ql.io.RCFileOutputFormat -mapper org.apache.hadoop.mapred.lib.IdentityMapper -jobconf mapred.work.output.dir=. -jobconf hive.io.rcfile.column.number.conf=32 -jobconf mapred.output.compress=true -numReduceTasks 0

It takes 100+3 seconds. To execute this command successfully, we need to change RCFile's generic signature to <WritableComparable,...>.

      was (Author: he yongqiang):
    Using hadoop-streaming.jar

RCFile:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar -input /user/hive/warehouse/tablerc1 -output testHiveWriter -mapper org.apache.hadoop.mapred.lib.IdentityMapper -numReduceTasks 0

SequenceFile:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar -input /user/hive/warehouse/tableseq1 -output testHiveWriter -mapper org.apache.hadoop.mapred.lib.IdentityMapper -numReduceTasks 0

They both take less than 10 seconds.

> Some optimization thoughts for Hive
> -----------------------------------
>
>                 Key: HIVE-477
>                 URL: https://issues.apache.org/jira/browse/HIVE-477
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: He Yongqiang
>
> Before we start working on HIVE-461, I am doing some profiling of Hive.
> Here are some thoughts for improvements:
> minor:
> 1) add a new HiveText to replace Text. It can avoid a byte copy when
> initializing LazyString. I have a draft implementation; it shows a ~1%
> performance gain.
> 2) change StructObjectInspector's
> {noformat}
> public List<Object> getStructFieldsDataAsList(Object data);
> {noformat}
> to
> {noformat}
> public Object[] getStructFieldsDataAsArray(Object data);
> {noformat}
> In my profiling test this shows some performance gain, but in actual
> execution it did not. In any case, returning a Java array reduces the GC
> burden of collecting ArrayList objects.
> not so minor:
> 3) move FileSinkOperator's Writer into a separate thread, with a
> producer-consumer array as the bridge between the operators' thread and the
> writer thread.
> 4) the operator stack is rather deep. To avoid instruction cache misses and
> make better use of the data cache, I suggest letting Hive's operators
> process an array of rows instead of only one row at a time.
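
A minimal sketch of point 3 (moving the writer onto its own thread), assuming a bounded java.util.concurrent.ArrayBlockingQueue as the producer-consumer bridge. AsyncRowWriter, process and close are hypothetical names used only for illustration, and the in-memory list stands in for the real file writer; this is not the FileSinkOperator code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch: rows are handed to a dedicated writer thread. */
public class AsyncRowWriter {
    // Sentinel row (compared by identity) that tells the writer thread to stop.
    private static final Object[] END = new Object[0];

    private final BlockingQueue<Object[]> queue;
    // Stand-in for the real output; a real writer would append to a file.
    private final List<Object[]> sink = new ArrayList<>();
    private final Thread writer;

    public AsyncRowWriter(int capacity) {
        // Bounded queue: the producer blocks when the writer falls behind.
        queue = new ArrayBlockingQueue<>(capacity);
        writer = new Thread(() -> {
            try {
                for (Object[] row = queue.take(); row != END; row = queue.take()) {
                    sink.add(row);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();
    }

    /** Called on the operator thread; blocks only while the queue is full. */
    public void process(Object[] row) {
        enqueue(row);
    }

    /** Signals end of input, waits for the writer, and returns what it wrote. */
    public List<Object[]> close() {
        enqueue(END);
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while closing", e);
        }
        return sink;
    }

    private void enqueue(Object[] row) {
        try {
            queue.put(row);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while queueing a row", e);
        }
    }
}
```

The queue capacity bounds memory use and gives back-pressure; whether the extra thread pays off would still need to be measured in the same profiling setup.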