[jira] Commented: (HIVE-477) Some optimization thoughts for Hive

He Yongqiang (JIRA) Sun, 10 May 2009 20:18:12 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707872#action_12707872
 ]


He Yongqiang commented on HIVE-477:
-----------------------------------

I ran a test to see how much time is spent in the RecordWriter and how much 
is spent in operator processing.

Query used: insert overwrite table tableRC2 select * from tableRC1;

tableRC1 is about 132 MB, and the job uses 2 map tasks (block size is 64 MB).

Normal run:
{noformat} 
The job takes about 126s, and each map takes about 114s.
{noformat}

With outWriter.write(recordValue) commented out in FileSinkOperator's process 
method:
{noformat}
The whole job takes about 80s. However, one mapper runs for only about 32s, 
while the other runs very slowly (about 70s). The slow mapper pushes the whole 
job's time up to 80s (I think the slow mapper is caused by something else).
{noformat}

With the whole of ExecMapper's map() commented out, which amounts to only 
reading the input and doing nothing with it:
{noformat}
The job takes about 27s, and each map takes about 15s.
{noformat}

Using hadoop-streaming.jar:
$HADOOP_HOME/bin/hadoop jar \
  $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar \
  -input /user/hive/warehouse/tablerc1/HiveStackTestData \
  -output testHiveWriter \
  -mapper org.apache.hadoop.mapred.lib.IdentityMapper -numReduceTasks 0
{noformat}
The job takes about 55s. One mapper takes about 5s; the other takes about 50s.
{noformat}
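
A rough per-mapper breakdown can be read off the first three measurements. This is only a sketch under two assumptions that were not measured: that the phases add up linearly, and that the 32s mapper (not the slow 70s one) is representative of the run without the write call. The class and method names are hypothetical.

```java
public class TimingBreakdown {
    // Per-mapper wall-clock seconds taken from the measurements above.
    static final int FULL_MAP = 114;  // normal run
    static final int NO_WRITE = 32;   // outWriter.write(...) commented out (fast mapper)
    static final int READ_ONLY = 15;  // ExecMapper's map() commented out

    // Time attributed to the RecordWriter: full run minus run without the write.
    static int writerSeconds() { return FULL_MAP - NO_WRITE; }

    // Time attributed to operator processing: no-write run minus read-only run.
    static int operatorSeconds() { return NO_WRITE - READ_ONLY; }

    // Time attributed to reading the input.
    static int readSeconds() { return READ_ONLY; }

    // Share of the full per-mapper time, rounded to a whole percent.
    static long percentOf(int part) { return Math.round(100.0 * part / FULL_MAP); }

    public static void main(String[] args) {
        System.out.println("writer:    ~" + writerSeconds() + "s, " + percentOf(writerSeconds()) + "%");
        System.out.println("operators: ~" + operatorSeconds() + "s, " + percentOf(operatorSeconds()) + "%");
        System.out.println("read:      ~" + readSeconds() + "s, " + percentOf(readSeconds()) + "%");
    }
}
```

Under those assumptions the writer accounts for roughly 82s of the 114s per mapper, operator processing for roughly 17s, and reading for roughly 15s.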


> Some optimization thoughts for Hive
> -----------------------------------
>
>                 Key: HIVE-477
>                 URL: https://issues.apache.org/jira/browse/HIVE-477
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: He Yongqiang
>
> Before we can start working on HIVE-461, I am doing some profiling of Hive. 
> Here are some thoughts for improvements:
> minor:
> 1) Add a new HiveText to replace Text. It can avoid a byte copy when 
> initializing LazyString. I have written a draft, and it shows a ~1% 
> performance gain.
> 2) Change StructObjectInspector's 
>     {noformat}
>      public List<Object> getStructFieldsDataAsList(Object data);
>     {noformat}
> to be 
>     {noformat}
>      public Object[] getStructFieldsDataAsArray(Object data);
>     {noformat}
> In my profiling test it shows some performance gain, but in actual execution 
> it did not. Either way, returning a Java array will reduce the GC burden of 
> collecting ArrayLists.
> not so minor:
> 3) Split FileSinkOperator's Writer into a separate thread, adding a 
> producer-consumer array as the bridge between the operators' thread and the 
> writer thread.
> 4) The operator stack is rather deep. To avoid instruction cache misses and 
> make better use of the data cache, I suggest letting Hive's operators 
> process an array of rows instead of only one row at a time.
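
Item 3 above could look roughly like the following. This is a minimal sketch, not Hive code: RowWriter is a hypothetical stand-in for the RecordWriter used by FileSinkOperator, rows are plain strings, and a bounded java.util.concurrent.ArrayBlockingQueue plays the role of the producer-consumer array.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class AsyncWriterSketch {
    /** Hypothetical stand-in for the RecordWriter; not a Hive or Hadoop interface. */
    public interface RowWriter { void write(String row) throws Exception; }

    // Sentinel compared by identity, so a real row containing "EOF" is not mistaken for it.
    private static final String EOF = new String("EOF");

    private final BlockingQueue<String> queue;
    private final Thread writerThread;
    private volatile Exception failure;

    public AsyncWriterSketch(RowWriter out, int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
        writerThread = new Thread(() -> {
            try {
                // Consumer: drain rows until the sentinel arrives.
                for (String row = queue.take(); row != EOF; row = queue.take()) {
                    if (failure == null) {
                        try { out.write(row); } catch (Exception e) { failure = e; }
                    }
                    // After a failure, keep draining so the producer never blocks forever.
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "writer");
        writerThread.start();
    }

    /** Called from the operator thread; blocks only when the queue is full. */
    public void process(String row) {
        try {
            queue.put(row);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while queueing row", e);
        }
    }

    /** Signals end of input, waits for the writer thread, and surfaces any write error. */
    public void close() {
        process(EOF);
        try {
            writerThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for writer", e);
        }
        if (failure != null) throw new RuntimeException("write failed", failure);
    }

    public static void main(String[] args) {
        List<String> sink = new ArrayList<>();
        AsyncWriterSketch writer = new AsyncWriterSketch(sink::add, 2);
        for (int i = 0; i < 5; i++) writer.process("row-" + i);
        writer.close();
        System.out.println(sink);
    }
}
```

The bounded queue gives back-pressure: if the writer falls behind, the operator thread blocks in process() instead of buffering rows without limit. Whether the overlap pays off in Hive would still have to be measured.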

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
