[jira] [Commented] (MAPREDUCE-2841) Task level native optimization

2011-08-29 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093007#comment-13093007
 ] 

He Yongqiang commented on MAPREDUCE-2841:
-

We are also evaluating the approach of optimizing the existing Hadoop Java
map-side sort algorithms (applying the same set of tricks used in this C++
implementation: bucket sort, prefix key comparison, a better CRC32, etc.).

The main question we are interested in is how big the memory problem is for
the Java implementation.

It would also be very useful to define an open benchmark here.

 Task level native optimization
 --

 Key: MAPREDUCE-2841
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: task
 Environment: x86-64 Linux
Reporter: Binglin Chang
Assignee: Binglin Chang
 Attachments: MAPREDUCE-2841.v1.patch, dualpivot-0.patch, 
 dualpivotv20-0.patch


 I have recently been working on native optimization for MapTask based on JNI.
 The basic idea is to add a NativeMapOutputCollector to handle the k/v pairs
 emitted by the mapper, so that sort, spill, and IFile serialization can all be
 done in native code. A preliminary test (on Xeon E5410, jdk6u24) showed
 promising results:
 1. Sort is about 3x-10x as fast as Java (only binary string comparison is
 supported).
 2. IFile serialization is about 3x as fast as Java, about 500MB/s. If hardware
 CRC32C is used, things can get much faster (1GB/s).
 3. Merge code is not completed yet, so the test uses a large enough io.sort.mb
 to prevent mid-spill.
 This leads to a total speedup of 2x~3x for the whole MapTask, if
 IdentityMapper (a mapper that does nothing) is used.
 There are limitations, of course: currently only Text and BytesWritable are
 supported, and I have not thought through many things yet, such as how to
 support map-side combine. I had some discussion with somebody familiar with
 Hive, and it seems that these limitations won't be much of a problem for Hive
 to benefit from those optimizations, at least. Advice or discussion about
 improving compatibility is most welcome :)
 Currently NativeMapOutputCollector has a static method called canEnable(),
 which checks whether the key/value types, comparator type, and combiner are
 all compatible; MapTask can then choose to enable NativeMapOutputCollector.
 This is only a preliminary test, and more work needs to be done. I expect
 better final results, and I believe similar optimizations can be adopted for
 the reduce task and shuffle too.
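
The compatibility check described above might look roughly like the sketch
below. This is not the code from the attached patch: the class name
NativeCollectorCheck, the string-based parameters, and the rule that a null
comparator or combiner means "use the default / none configured" are all
assumptions made for illustration. Only the supported types (Text and
BytesWritable) and the three things being checked come from the description.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the canEnable() check; not the patch's real code.
final class NativeCollectorCheck {
    // The description reports only Text and BytesWritable as supported so far.
    private static final Set<String> SUPPORTED_TYPES = new HashSet<>(Arrays.asList(
            "org.apache.hadoop.io.Text",
            "org.apache.hadoop.io.BytesWritable"));

    // Returns true only when key/value types, comparator, and combiner all
    // fit what the native collector is said to handle.
    static boolean canEnable(String keyClass, String valueClass,
                             String comparatorClass, String combinerClass) {
        boolean typesOk = SUPPORTED_TYPES.contains(keyClass)
                && SUPPORTED_TYPES.contains(valueClass);
        // Assumption: null means the default byte-wise comparator is in use.
        boolean comparatorOk = comparatorClass == null;
        // Assumption: null means no combiner (map-side combine is unsupported).
        boolean combinerOk = combinerClass == null;
        return typesOk && comparatorOk && combinerOk;
    }

    public static void main(String[] args) {
        System.out.println(canEnable("org.apache.hadoop.io.Text",
                "org.apache.hadoop.io.BytesWritable", null, null));
        System.out.println(canEnable("org.apache.hadoop.io.LongWritable",
                "org.apache.hadoop.io.Text", null, null));
    }
}
```

A caller would fall back to the existing Java collector whenever the check
returns false, so unsupported jobs keep working unchanged.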

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-2841) Task level native optimization

2011-08-29 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093174#comment-13093174
 ] 

He Yongqiang commented on MAPREDUCE-2841:
-

bq. The bucketed sort used from 0.10 to 0.16 had more internal fragmentation 
and a less predictable memory footprint (particularly for jobs with lots of 
reducers).

If the Java implementation used an approach similar to the C++ one here, the
only difference would be the language, right? Sorry, can you explain more about
how the C++ implementation can do a better job of keeping a predictable memory
footprint? In the current Java implementation, all records (no matter which
reducer they are going to) are stored in a central byte array. In the C++
implementation, within one map task, each reducer has one corresponding
partition bucket, which maintains its own memory buffer. From what I
understand, one partition bucket is for one reducer, and all records going to
that reducer from the current map task are stored there, and will be sorted and
spilled from there. The benefit for the sort is that it saves comparisons,
since the original sort needs to compare records from different reducers.
The C++ implementation also has the trick of doing prefix comparison, which
reduces the number of CPU operations (an 8-byte compare becomes one long
compare op).
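
To make the two tricks concrete, here is a minimal Java sketch of per-reducer
partition buckets combined with an 8-byte prefix comparison. It is only an
illustration of the idea as I understand it, not the C++ code from the patch:
the class PrefixBucketSort, the Record holder, and the hash-based partitioning
are all hypothetical names and choices.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: one bucket per reducer, keys compared by cached prefix.
final class PrefixBucketSort {
    // Pack the first 8 key bytes big-endian into a long, zero-padding short keys.
    static long prefix(byte[] key) {
        long p = 0;
        for (int i = 0; i < 8; i++) {
            int b = i < key.length ? (key[i] & 0xFF) : 0;
            p = (p << 8) | b;
        }
        return p;
    }

    // Full unsigned byte-wise comparison, used only when prefixes tie.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    static final class Record {
        final byte[] key;
        final long prefix; // computed once when the record is collected
        Record(byte[] key) { this.key = key; this.prefix = prefix(key); }
    }

    // One unsigned long compare settles most cases; fall back on ties.
    static int compare(Record a, Record b) {
        int c = Long.compareUnsigned(a.prefix, b.prefix);
        return c != 0 ? c : compareBytes(a.key, b.key);
    }

    // Route each key to its reducer's bucket, then sort every bucket alone,
    // so records bound for different reducers are never compared.
    static List<List<Record>> collectAndSort(List<byte[]> keys, int numPartitions) {
        List<List<Record>> buckets = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) buckets.add(new ArrayList<>());
        for (byte[] k : keys) {
            int p = (Arrays.hashCode(k) & Integer.MAX_VALUE) % numPartitions;
            buckets.get(p).add(new Record(k));
        }
        for (List<Record> bucket : buckets) bucket.sort(PrefixBucketSort::compare);
        return buckets;
    }
}
```

Nothing in this sketch needs native code, which is why the open question is
the memory cost of the Java version rather than whether it can be written.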

bq. Subsequent implementations focused on reducing the number of spills for 
each task, because the cost of spilling dominated the cost of the sort. Even 
with a significant speedup in the sort step, avoiding a merge by managing 
memory more carefully usually effects faster task times.

I totally agree that the spill will be the dominant factor if it happens. So
the question is how much more memory the Java implementation will need
compared to the C++ one: 20%, 50%, or 100%? With that number we can estimate
how many spills could be avoided by using the C++ implementation.
(Note: based on our analysis of jobs run during the past month, most jobs need
to shuffle less than 700MB of data per mapper.)
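
As a rough way to frame that calculation, the sketch below estimates the
number of spills from the map output size, the sort buffer size, and a memory
overhead percentage. The model is deliberately simplistic and is my own
assumption: it ignores the spill threshold and per-record metadata, and the
1000MB buffer is a hypothetical value chosen only for illustration. The 700MB
figure is the one from the note above.

```java
// Hypothetical back-of-the-envelope model; not measured data.
final class SpillEstimate {
    // Spills needed when the in-memory size of the output is inflated by
    // overheadPercent and the buffer holds bufferMb. Integer math only.
    static long estimateSpills(long outputMb, long bufferMb, long overheadPercent) {
        long needed = outputMb * (100 + overheadPercent); // scaled by 100
        long capacity = bufferMb * 100;                   // same scale
        return (needed + capacity - 1) / capacity;        // ceiling division
    }

    public static void main(String[] args) {
        long outputMb = 700;  // per-mapper shuffle size from the note above
        long bufferMb = 1000; // assumed sort buffer size, for illustration
        for (long pct : new long[]{0, 20, 50, 100}) {
            System.out.println(pct + "% overhead: "
                    + estimateSpills(outputMb, bufferMb, pct) + " spill(s)");
        }
    }
}
```

Under these assumed numbers, 0% and 20% overhead fit in a single spill while
50% and 100% need two, which would force a merge. Where that boundary falls is
exactly what a shared benchmark would have to establish.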






[jira] [Commented] (MAPREDUCE-2841) Task level native optimization

2011-08-29 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093324#comment-13093324
 ] 

He Yongqiang commented on MAPREDUCE-2841:
-

Sorry, I am kind of confused. I should probably make myself clearer: we are
trying to evaluate and compare the C++ implementation in HCE (and also in this
JIRA) against a pure Java re-implementation. So the thing we care about most
is whether there is something that the C++ implementation can do and a Java
re-implementation cannot. If there is, we need to find out how big that
difference is. From there we can get a better understanding of each approach
and decide which one to pursue.





[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2010-01-27 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805457#action_12805457
 ] 

He Yongqiang commented on MAPREDUCE-1270:
-

Hi Dong / Shouyan,
Are you going to open source this? If yes, can you post your recent work? This
would help others understand it better.

 Hadoop C++ Extention
 

 Key: MAPREDUCE-1270
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: task
Affects Versions: 0.20.1
 Environment:  hadoop linux
Reporter: Wang Shouyan

   Hadoop C++ Extension is an internal project at Baidu. We started it for
 these reasons:
1  To provide a C++ API. We mostly used Streaming before, and we also tried
 PIPES, but we did not find PIPES to be more efficient than Streaming. So we
 think a new C++ extension is needed for us.
2  Even when using PIPES or Streaming, it is hard to control the memory of the
 Hadoop map/reduce Child JVM.
3  It costs too much to read/write/sort TB/PB of data in Java. When using
 PIPES or Streaming, a pipe or socket is not efficient enough to carry such
 huge data.
What we want to do:
1 We do not use the map/reduce Child JVM to do any data processing. It just
 prepares the environment, starts the C++ mapper, tells the mapper which split
 it should deal with, and reads reports from the mapper until it finishes. The
 mapper will read records, invoke the user-defined map, partition, write
 spills, combine, and merge into file.out. We think these operations can be
 done by C++ code.
2 The reducer is similar to the mapper: it is started after the sort finishes,
 reads from the sorted files, invokes the user-defined reduce, and writes to
 the user-defined record writer.
3 We also intend to rewrite shuffle and sort in C++, for efficiency and
 memory control.
First 1 and 2, then 3.
What's the difference from PIPES:
1 Yes, we will reuse most of the PIPES code.
2 And we should do it more completely: nothing changes in scheduling and
 management, but everything changes in execution.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-765) eliminate the usage of FileSystem.create( ) depracated by Hadoop-5438

2009-07-17 Thread He Yongqiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Yongqiang updated MAPREDUCE-765:
---

Attachment: mapreduce-765-2009-07-18.patch

Incorporates Nicholas's comments

 eliminate the usage of FileSystem.create( ) depracated by Hadoop-5438 
 --

 Key: MAPREDUCE-765
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-765
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: He Yongqiang
Priority: Minor
 Attachments: mapreduce-765-2009-07-15.patch, 
 mapreduce-765-2009-07-18.patch







[jira] Updated: (MAPREDUCE-765) eliminate the usage of FileSystem.create( ) depracated by Hadoop-5438

2009-07-16 Thread He Yongqiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Yongqiang updated MAPREDUCE-765:
---

Status: Patch Available  (was: Open)




[jira] Created: (MAPREDUCE-765) eliminate the usage of FileSystem.create( ) depracated by Hadoop-5438

2009-07-15 Thread He Yongqiang (JIRA)
eliminate the usage of FileSystem.create( ) depracated by Hadoop-5438 
--

 Key: MAPREDUCE-765
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-765
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: He Yongqiang
Priority: Minor







[jira] Updated: (MAPREDUCE-765) eliminate the usage of FileSystem.create( ) depracated by Hadoop-5438

2009-07-15 Thread He Yongqiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Yongqiang updated MAPREDUCE-765:
---

Attachment: mapreduce-765-2009-07-15.patch
