[jira] [Commented] (MAPREDUCE-6417) MapReduceClient's primitives.h is toxic and should be extirpated

2015-12-09 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049842#comment-15049842
 ] 

Sean Zhong commented on MAPREDUCE-6417:
---

Have you done a micro-benchmark on this? primitives.h was introduced 
because the built-in gcc implementation is very slow.
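
For reference, a micro-benchmark along the following lines could settle this. 
It is only a sketch (not from any patch): chunk_copy merely stands in for the 
primitives.h style of copy, and the buffer size, iteration count, and alignment 
are assumptions.

{code}
// Micro-benchmark sketch (not from any patch): compares the library memcpy
// with an 8-byte-chunk copy loop standing in for the primitives.h-style copy.
// Note: the type-punned loads below are exactly the kind of trick primitives.h
// relies on and are not portable; this is illustration only.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

static void chunk_copy(void* dst, const void* src, size_t len) {
  uint64_t* d = static_cast<uint64_t*>(dst);
  const uint64_t* s = static_cast<const uint64_t*>(src);
  size_t chunks = len / 8;
  for (size_t i = 0; i < chunks; ++i) {
    d[i] = s[i];
  }
  std::memcpy(static_cast<char*>(dst) + chunks * 8,
              static_cast<const char*>(src) + chunks * 8, len % 8);
}

template <typename Copy>
static double time_copies(Copy copy, size_t len, int iters) {
  std::vector<char> src(len, 'x');
  std::vector<char> dst(len);
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    copy(dst.data(), src.data(), len);
  }
  std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  // A serious benchmark would also prevent the compiler from eliding the copy.
  return elapsed.count();
}

int main() {
  const size_t len = 64;              // a typical small key/value size
  const int iters = 10 * 1000 * 1000;
  std::printf("memcpy:     %.3f s\n",
              time_copies([](void* d, const void* s, size_t n) {
                std::memcpy(d, s, n);
              }, len, iters));
  std::printf("chunk copy: %.3f s\n", time_copies(chunk_copy, len, iters));
  return 0;
}
{code}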

[~decster], more comments?

> MapReduceClient's primitives.h is toxic and should be extirpated
> 
>
> Key: MAPREDUCE-6417
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6417
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: client
>Affects Versions: 3.0.0
>Reporter: Alan Burlison
>Assignee: Alan Burlison
>Priority: Blocker
> Attachments: MAPREDUCE-6417.001.patch
>
>
> MapReduceClient's primitives.h attempts to provide optimised versions of 
> standard library memory copy and comparison functions. It has been the 
> subject of several portability-related bugs:
> * HADOOP-11505 hadoop-mapreduce-client-nativetask uses bswap where be32toh is 
> needed, doesn't work on non-x86
> * HADOOP-11665 Provide and unify cross platform byteorder support in native 
> code
> * MAPREDUCE-6397 MAPREDUCE makes many endian-dependent assumptions
> * HADOOP-11484 hadoop-mapreduce-client-nativetask fails to build on ARM 
> AARCH64 due to x86 asm statements
> At present it only works on x86 and ARM64 as it lacks definitions for bswap 
> and bswap64 for any platforms other than those.
> However, it has even more serious problems on non-x86 architectures; for 
> example, on SPARC simple_memcpy simply doesn't work at all:
> {code}
> $ cat bang.cc
> #include <stdint.h>
> #define SIMPLE_MEMCPY
> #include "primitives.h"
> int main(int argc, char **argv)
> {
> char b1[9];
> char b2[9];
> simple_memcpy(b2, b1, sizeof(b1));
> }
> $ gcc -o bang bang.cc && ./bang
> Bus Error (core dumped)
> {code}
> That's because simple_memcpy does pointer fiddling that results in misaligned 
> accesses, which are illegal on SPARC.
> fmemcmp is also broken. Even if a definition of bswap is provided, on 
> big-endian architectures the result is simply wrong because of its 
> unconditional use of bswap:
> {code}
> $ cat thud.cc
> #include <stdio.h>
> #include <string.h>
> #include "primitives.h"
> int main(int argc, char **argv)
> {
> char a[] = { 0,1,2,0 };
> char b[] = { 0,2,1,0 };
> printf("%lld %d\n", fmemcmp(a, b, sizeof(a)), memcmp(a, b, sizeof(a)));
> }
> $ g++ -o thud thud.cc && ./thud
> 65280 -1
> {code}
> In addition, fmemcmp suffers from the same misalignment issues as 
> simple_memcpy and core dumps on SPARC when asked to compare odd-sized buffers.
> primitives.h provides the following functions:
> * bswap - used in 12 files in MRC but as HADOOP-11505 points out, mostly 
> incorrectly as it takes no account of platform endianness
> * bswap64 - used in 4 files in MRC, same comments as per bswap apply
> * simple_memcpy - used in 3 files in MRC, should be replaced with the 
> standard memcpy
> * fmemcmp - used in 1 file, should be replaced with the standard memcmp
> * fmemeq - used in 1 file, should be replaced with the standard memcmp
> * frmemeq - not used at all, should just be removed
> *Summary*: primitives.h should simply be deleted and replaced with the 
> standard memory copy & compare functions, or with thin wrappers around them 
> where the APIs are different.
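
For reference, the thin-wrapper option from the summary could look roughly like 
the sketch below. The signatures are assumed from the usage shown earlier in 
this report, not taken from primitives.h itself.

{code}
// Sketch of the "thin wrapper" option: keep the fmemcmp/fmemeq call sites but
// back them with the standard library. Signatures are assumptions inferred
// from the usage above, not copied from primitives.h.
#include <cstdint>
#include <cstring>

inline int64_t fmemcmp(const void* src, const void* dest, uint32_t len) {
  // memcmp's sign convention is assumed to be what call sites need,
  // per the summary above.
  return std::memcmp(src, dest, len);
}

inline bool fmemeq(const void* src, const void* dest, uint32_t len) {
  return std::memcmp(src, dest, len) == 0;
}
{code}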



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-2841) Task level native optimization

2014-05-18 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001286#comment-14001286
 ] 

Sean Zhong commented on MAPREDUCE-2841:
---

Updates on this: https://github.com/intel-hadoop/nativetask


> Task level native optimization
> --
>
> Key: MAPREDUCE-2841
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: task
> Environment: x86-64 Linux/Unix
>Reporter: Binglin Chang
>Assignee: Binglin Chang
> Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, 
> MAPREDUCE-2841.v2.patch, dualpivot-0.patch, dualpivotv20-0.patch, 
> fb-shuffle.patch
>
>
> I've recently been working on a JNI-based native optimization for MapTask. 
> The basic idea is to add a NativeMapOutputCollector to handle the k/v pairs 
> emitted by the mapper, so that sort, spill, and IFile serialization can all be 
> done in native code. A preliminary test (on Xeon E5410, jdk6u24) showed 
> promising results:
> 1. Sort is about 3x-10x as fast as Java (only binary string compare is 
> supported)
> 2. IFile serialization speed is about 3x that of Java, about 500MB/s; if 
> hardware CRC32C is used, things can get much faster (1G/
> 3. Merge code is not completed yet, so the test uses enough io.sort.mb to 
> prevent mid-spill
> This leads to a total speedup of 2x~3x for the whole MapTask if 
> IdentityMapper (a mapper that does nothing) is used.
> There are limitations, of course: currently only Text and BytesWritable are 
> supported, and I have not thought through many things yet, such as how to 
> support map-side combine. I had some discussion with somebody familiar with 
> Hive, and it seems these limitations won't be much of a problem for Hive, at 
> least, to benefit from these optimizations. Advice or discussion about 
> improving compatibility is most welcome :)
> Currently NativeMapOutputCollector has a static method called canEnable(), 
> which checks whether the key/value types, comparator type, and combiner are 
> all compatible; MapTask can then choose to enable NativeMapOutputCollector.
> This is only a preliminary test; more work needs to be done. I expect better 
> final results, and I believe similar optimizations can be adopted for the 
> reduce task and shuffle too. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-2841) Task level native optimization

2014-06-23 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040691#comment-14040691
 ] 

Sean Zhong commented on MAPREDUCE-2841:
---

The latest native task code is posted at 
https://github.com/intel-hadoop/nativetask/tree/native_output_collector for 
easy review. Currently the code is patched against Hadoop 2.2.

Some feature highlights:
1. Full performance test coverage: 
https://github.com/intel-hadoop/nativetask/tree/native_output_collector#what-is-the-benefit
2. Supports all value types that extend Writable.
3. Supports all key types in hadoop.io, and most key types in the Hive, Pig, 
Mahout, and HBase projects. For a list of supported key types, please check 
https://github.com/intel-hadoop/nativetask/wiki#supported-key-types
4. Fully supports Java combiners.
5. Supports large keys and values.
6. A full test suite covering key/value combinations.
7. Supports GZIP, LZ4, and Snappy.

Items we are still working on:
1. Extract the Hive/Pig/HBase/Mahout platform support into standalone jars, and 
decouple that dependency from the native task source code.
2. More documentation describing the API.

For design, test, and doc, please check
https://github.com/intel-hadoop/nativetask/tree/native_output_collector
https://github.com/intel-hadoop/nativetask/wiki



> Task level native optimization
> --
>
> Key: MAPREDUCE-2841
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: task
> Environment: x86-64 Linux/Unix
>Reporter: Binglin Chang
>Assignee: Binglin Chang
> Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, 
> MAPREDUCE-2841.v2.patch, dualpivot-0.patch, dualpivotv20-0.patch, 
> fb-shuffle.patch
>
>
> I've recently been working on a JNI-based native optimization for MapTask. 
> The basic idea is to add a NativeMapOutputCollector to handle the k/v pairs 
> emitted by the mapper, so that sort, spill, and IFile serialization can all be 
> done in native code. A preliminary test (on Xeon E5410, jdk6u24) showed 
> promising results:
> 1. Sort is about 3x-10x as fast as Java (only binary string compare is 
> supported)
> 2. IFile serialization speed is about 3x that of Java, about 500MB/s; if 
> hardware CRC32C is used, things can get much faster (1G/
> 3. Merge code is not completed yet, so the test uses enough io.sort.mb to 
> prevent mid-spill
> This leads to a total speedup of 2x~3x for the whole MapTask if 
> IdentityMapper (a mapper that does nothing) is used.
> There are limitations, of course: currently only Text and BytesWritable are 
> supported, and I have not thought through many things yet, such as how to 
> support map-side combine. I had some discussion with somebody familiar with 
> Hive, and it seems these limitations won't be much of a problem for Hive, at 
> least, to benefit from these optimizations. Advice or discussion about 
> improving compatibility is most welcome :)
> Currently NativeMapOutputCollector has a static method called canEnable(), 
> which checks whether the key/value types, comparator type, and combiner are 
> all compatible; MapTask can then choose to enable NativeMapOutputCollector.
> This is only a preliminary test; more work needs to be done. I expect better 
> final results, and I believe similar optimizations can be adopted for the 
> reduce task and shuffle too. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-2841) Task level native optimization

2014-06-25 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044206#comment-14044206
 ] 

Sean Zhong commented on MAPREDUCE-2841:
---


First, Arun and Todd, thank you both for your honest opinions! You are both 
respected!

I believe the differences will narrow once we are looking at the same facts, so 
I'd like to state the facts and clear up some confusion:

1. How many lines of code are there, exactly?

Here is a breakdown for branch 
https://github.com/intel-hadoop/nativetask/tree/native_output_collector:

java code (*.java)                          122 files,  8057 lines
  nativetask                                 62 files,  4080 lines
  nativetask unit tests                      14 files,  1222 lines
  other platforms (pig/mahout/hbase/hive)    25 files,   477 lines
  scenario tests                             21 files,  2278 lines

native code (*.h, *.cc)                     128 files, 47048 lines
  nativetask                                 85 files, 11713 lines
  nativetask unit tests                      33 files,  4911 lines
  other platforms (pig/mahout/hbase/hive)     2 files,  1083 lines
  thirdparty gtest lib header files           3 files, 28699 lines
  thirdparty lz4/snappy/cityhash              5 files,   642 lines

(Note: license header lines in each source file are not counted; blank lines 
and other comments are counted.)

If we measure LOC as a proxy for code complexity, then:
Third-party code such as the Google Test header files should not be counted; 
the gtest header alone has 28699 lines of code.
Pig/Mahout/HBase/Hive code will eventually be removed from the code repository, 
and should not be counted.
Scenario test code need not be included either, as you can always write new 
scenario tests.

So after those deductions, the effective code is:
NativeTask source code (Java + native C++): 15793 lines
NativeTask unit test code (Java + native C++): 6133 lines

2. Is this patch an alternate implementation of the MapReduce runtime, like 
Tez?

No. The whole purpose of this patch submission is to act as a map output 
collector that transparently improves MapReduce performance, NOT to be a new MR 
engine.

The code is posted at branch 
https://github.com/intel-hadoop/nativetask/tree/native_output_collector; it 
only includes code for the map output collector.

3. Why is there Pig/Mahout/HBase/Hive code in the native task source code?

As I commented above, we are working on removing the platform 
(Hive/Pig/HBase/Mahout) code from the native task source and providing it as 
standalone jars. We rushed to post the link without a full cleanup so that we 
could get some early feedback from the community.

4. Is the full native runtime included?

No, the full native runtime is not included in this patch; the related code has 
been stripped out. The repo 
https://github.com/intel-hadoop/nativetask/tree/native_output_collector only 
contains code for the transparent collector.

5. Is there an intention to contribute the full native runtime mode to Hadoop, 
or will it act as a separate project?

Supporting the full native runtime mode is not the purpose of this patch; the 
goal of this patch is to make existing MR jobs run better on modern CPUs with a 
native map output collector.

Full native runtime mode is another topic. There is a long way to go before it 
is ready for submission, and we don't want to consider it now.

6. Are there interface compatibility issues?

This patch is not about the full native runtime mode, which supports native 
mappers and native reducers.

This patch is only about a custom map output collector used in transparent 
mode. We use the existing Java interfaces, and people still run their Java 
mappers/reducers without recompilation. A user can make a small configuration 
change to enable the nativetask collector, and whenever there is a case that 
nativetask doesn't support, it simply falls back to the default MR 
implementation.

7. Are there C++ ABI issues?

The concern makes sense.

Regarding the ABI: if the user doesn't need a custom key comparator, they will 
never need to implement a native comparator against nativetask.so, so there is 
no ABI issue.
If the user does want to write a native comparator, the nativetask native 
interface involved is very limited, only:

typedef int (*ComparatorPtr)(const char * src, uint32_t srcLength, 
const char * dest, uint32_t destLength);

However, the current code assumes the user includes the whole "NativeTask.h", 
which contains more than the typedef above.
We will work on making sure that "NativeTask.h" only exposes the necessary 
minimum API. After we do that, there should be no big ABI issue.
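
For illustration only, a user-supplied comparator matching that typedef might 
look like the sketch below; the lexicographic body is an assumption, not code 
from the NativeTask sources.

{code}
// Sketch only: a user-supplied native comparator matching the ComparatorPtr
// typedef above. The lexicographic body is illustrative, not NativeTask code.
#include <cstdint>
#include <cstring>

int lexicographicCompare(const char* src, uint32_t srcLength,
                         const char* dest, uint32_t destLength) {
  uint32_t minLen = srcLength < destLength ? srcLength : destLength;
  int cmp = std::memcmp(src, dest, minLen);
  if (cmp != 0) {
    return cmp;
  }
  // Equal prefixes: the shorter key sorts first.
  if (srcLength == destLength) {
    return 0;
  }
  return srcLength < destLength ? -1 : 1;
}
{code}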

8. How can you make sure of the quality of this code?

The code has been under active development for more than a year. It has been 
used and tested in production for a long time, and there is also a full set of 
unit tests and scenario tests for coverage.

9. Can this be done on Tez instead?

We believe it is good for MapReduce; we know people are still

[jira] [Updated] (MAPREDUCE-2841) Task level native optimization

2014-07-17 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated MAPREDUCE-2841:
--

Status: Open  (was: Patch Available)

> Task level native optimization
> --
>
> Key: MAPREDUCE-2841
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: task
> Environment: x86-64 Linux/Unix
>Reporter: Binglin Chang
>Assignee: Sean Zhong
> Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, 
> MAPREDUCE-2841.v2.patch, dualpivot-0.patch, dualpivotv20-0.patch, 
> fb-shuffle.patch
>
>
> I've recently been working on a JNI-based native optimization for MapTask. 
> The basic idea is to add a NativeMapOutputCollector to handle the k/v pairs 
> emitted by the mapper, so that sort, spill, and IFile serialization can all be 
> done in native code. A preliminary test (on Xeon E5410, jdk6u24) showed 
> promising results:
> 1. Sort is about 3x-10x as fast as Java (only binary string compare is 
> supported)
> 2. IFile serialization speed is about 3x that of Java, about 500MB/s; if 
> hardware CRC32C is used, things can get much faster (1G/
> 3. Merge code is not completed yet, so the test uses enough io.sort.mb to 
> prevent mid-spill
> This leads to a total speedup of 2x~3x for the whole MapTask if 
> IdentityMapper (a mapper that does nothing) is used.
> There are limitations, of course: currently only Text and BytesWritable are 
> supported, and I have not thought through many things yet, such as how to 
> support map-side combine. I had some discussion with somebody familiar with 
> Hive, and it seems these limitations won't be much of a problem for Hive, at 
> least, to benefit from these optimizations. Advice or discussion about 
> improving compatibility is most welcome :)
> Currently NativeMapOutputCollector has a static method called canEnable(), 
> which checks whether the key/value types, comparator type, and combiner are 
> all compatible; MapTask can then choose to enable NativeMapOutputCollector.
> This is only a preliminary test; more work needs to be done. I expect better 
> final results, and I believe similar optimizations can be adopted for the 
> reduce task and shuffle too. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-2841) Task level native optimization

2014-07-17 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064878#comment-14064878
 ] 

Sean Zhong commented on MAPREDUCE-2841:
---

Hi Todd,

The patch is uploaded to: 
https://raw.githubusercontent.com/intel-hadoop/nativetask/native_output_collector/patch/hadoop-3.0-mapreduce-2841-2014-7-17.patch
(it is too big to be uploaded here).

It is patched against Hadoop 3.0 trunk.




> Task level native optimization
> --
>
> Key: MAPREDUCE-2841
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: task
> Environment: x86-64 Linux/Unix
>Reporter: Binglin Chang
>Assignee: Sean Zhong
> Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, 
> MAPREDUCE-2841.v2.patch, dualpivot-0.patch, dualpivotv20-0.patch, 
> fb-shuffle.patch
>
>
> I've recently been working on a JNI-based native optimization for MapTask. 
> The basic idea is to add a NativeMapOutputCollector to handle the k/v pairs 
> emitted by the mapper, so that sort, spill, and IFile serialization can all be 
> done in native code. A preliminary test (on Xeon E5410, jdk6u24) showed 
> promising results:
> 1. Sort is about 3x-10x as fast as Java (only binary string compare is 
> supported)
> 2. IFile serialization speed is about 3x that of Java, about 500MB/s; if 
> hardware CRC32C is used, things can get much faster (1G/
> 3. Merge code is not completed yet, so the test uses enough io.sort.mb to 
> prevent mid-spill
> This leads to a total speedup of 2x~3x for the whole MapTask if 
> IdentityMapper (a mapper that does nothing) is used.
> There are limitations, of course: currently only Text and BytesWritable are 
> supported, and I have not thought through many things yet, such as how to 
> support map-side combine. I had some discussion with somebody familiar with 
> Hive, and it seems these limitations won't be much of a problem for Hive, at 
> least, to benefit from these optimizations. Advice or discussion about 
> improving compatibility is most welcome :)
> Currently NativeMapOutputCollector has a static method called canEnable(), 
> which checks whether the key/value types, comparator type, and combiner are 
> all compatible; MapTask can then choose to enable NativeMapOutputCollector.
> This is only a preliminary test; more work needs to be done. I expect better 
> final results, and I believe similar optimizations can be adopted for the 
> reduce task and shuffle too. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-2841) Task level native optimization

2014-07-17 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated MAPREDUCE-2841:
--

Attachment: hadoop-3.0-mapreduce-2841-2014-7-17.patch

> Task level native optimization
> --
>
> Key: MAPREDUCE-2841
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: task
> Environment: x86-64 Linux/Unix
>Reporter: Binglin Chang
>Assignee: Sean Zhong
> Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, 
> MAPREDUCE-2841.v2.patch, dualpivot-0.patch, dualpivotv20-0.patch, 
> fb-shuffle.patch, hadoop-3.0-mapreduce-2841-2014-7-17.patch
>
>
> I've recently been working on a JNI-based native optimization for MapTask. 
> The basic idea is to add a NativeMapOutputCollector to handle the k/v pairs 
> emitted by the mapper, so that sort, spill, and IFile serialization can all be 
> done in native code. A preliminary test (on Xeon E5410, jdk6u24) showed 
> promising results:
> 1. Sort is about 3x-10x as fast as Java (only binary string compare is 
> supported)
> 2. IFile serialization speed is about 3x that of Java, about 500MB/s; if 
> hardware CRC32C is used, things can get much faster (1G/
> 3. Merge code is not completed yet, so the test uses enough io.sort.mb to 
> prevent mid-spill
> This leads to a total speedup of 2x~3x for the whole MapTask if 
> IdentityMapper (a mapper that does nothing) is used.
> There are limitations, of course: currently only Text and BytesWritable are 
> supported, and I have not thought through many things yet, such as how to 
> support map-side combine. I had some discussion with somebody familiar with 
> Hive, and it seems these limitations won't be much of a problem for Hive, at 
> least, to benefit from these optimizations. Advice or discussion about 
> improving compatibility is most welcome :)
> Currently NativeMapOutputCollector has a static method called canEnable(), 
> which checks whether the key/value types, comparator type, and combiner are 
> all compatible; MapTask can then choose to enable NativeMapOutputCollector.
> This is only a preliminary test; more work needs to be done. I expect better 
> final results, and I believe similar optimizations can be adopted for the 
> reduce task and shuffle too. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-2841) Task level native optimization

2014-07-17 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065064#comment-14065064
 ] 

Sean Zhong commented on MAPREDUCE-2841:
---

Ah, thanks for pointing this out. I am not sure why this happened.

I just uploaded the patch to this jira: 
https://issues.apache.org/jira/secure/attachment/12656288/hadoop-3.0-mapreduce-2841-2014-7-17.patch

Updates:
1. Removed the HBase/Hive/Mahout/Pig related code; that code will be posted 
elsewhere in another jira or hosted on GitHub.
2. Use a ServiceLoader to discover custom platforms (to support custom key 
types).

> Task level native optimization
> --
>
> Key: MAPREDUCE-2841
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: task
> Environment: x86-64 Linux/Unix
>Reporter: Binglin Chang
>Assignee: Sean Zhong
> Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, 
> MAPREDUCE-2841.v2.patch, dualpivot-0.patch, dualpivotv20-0.patch, 
> fb-shuffle.patch, hadoop-3.0-mapreduce-2841-2014-7-17.patch
>
>
> I've recently been working on a JNI-based native optimization for MapTask. 
> The basic idea is to add a NativeMapOutputCollector to handle the k/v pairs 
> emitted by the mapper, so that sort, spill, and IFile serialization can all be 
> done in native code. A preliminary test (on Xeon E5410, jdk6u24) showed 
> promising results:
> 1. Sort is about 3x-10x as fast as Java (only binary string compare is 
> supported)
> 2. IFile serialization speed is about 3x that of Java, about 500MB/s; if 
> hardware CRC32C is used, things can get much faster (1G/
> 3. Merge code is not completed yet, so the test uses enough io.sort.mb to 
> prevent mid-spill
> This leads to a total speedup of 2x~3x for the whole MapTask if 
> IdentityMapper (a mapper that does nothing) is used.
> There are limitations, of course: currently only Text and BytesWritable are 
> supported, and I have not thought through many things yet, such as how to 
> support map-side combine. I had some discussion with somebody familiar with 
> Hive, and it seems these limitations won't be much of a problem for Hive, at 
> least, to benefit from these optimizations. Advice or discussion about 
> improving compatibility is most welcome :)
> Currently NativeMapOutputCollector has a static method called canEnable(), 
> which checks whether the key/value types, comparator type, and combiner are 
> all compatible; MapTask can then choose to enable NativeMapOutputCollector.
> This is only a preliminary test; more work needs to be done. I expect better 
> final results, and I believe similar optimizations can be adopted for the 
> reduce task and shuffle too. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5975) Fix native-task build on Ubuntu 13.10

2014-07-21 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069674#comment-14069674
 ] 

Sean Zhong commented on MAPREDUCE-5975:
---

looks good. +1

> Fix native-task build on Ubuntu 13.10
> -
>
> Key: MAPREDUCE-5975
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5975
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Blocker
> Attachments: mr-5975.txt
>
>
> I'm having some issues building the native-task branch on my Ubuntu 13.10 
> box. This JIRA is to figure out and fix whatever's going on.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5974) Allow map output collector fallback

2014-07-21 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069686#comment-14069686
 ] 

Sean Zhong commented on MAPREDUCE-5974:
---

This change is simple and compatible, +1

> Allow map output collector fallback
> ---
>
> Key: MAPREDUCE-5974
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5974
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Affects Versions: 2.6.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Attachments: mapreduce-5974.txt
>
>
> Currently we only allow specifying a single MapOutputCollector implementation 
> class in a job. It would be nice to allow a comma-separated list of classes: 
> we should try each collector implementation in the user-specified order until 
> we find one that can be successfully instantiated and initted.
> This is useful for cases where a particular optimized collector 
> implementation cannot operate on all key/value types, or requires native 
> code. The cluster administrator can configure the cluster to try to use the 
> optimized collector and fall back to the default collector.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5978) native-task CompressTest failure on Ubuntu

2014-07-21 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069720#comment-14069720
 ] 

Sean Zhong commented on MAPREDUCE-5978:
---

Hi Todd,

This is expected. We currently do not support BZip2Codec and DefaultCodec. 
These two UT cases are placeholders to track which codecs we don't support yet.
Maybe we can change the expected result to false and add a warning like 
"Support for BZip2Codec is not implemented yet; please switch the flag back to 
true once this codec is implemented."

Does that make sense?



> native-task CompressTest failure on Ubuntu
> --
>
> Key: MAPREDUCE-5978
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5978
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>
> The MR-2841 branch fails the following unit tests on my box:
>   CompressTest.testBzip2Compress:84 file compare result: if they are the same, 
> then return true expected:<true> but was:<false>
>   CompressTest.testDefaultCompress:116 file compare result: if they are the 
> same, then return true expected:<true> but was:<false>
> We need to fix these before merging.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5984) native-task: upgrade lz4 to lastest version

2014-07-21 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069724#comment-14069724
 ] 

Sean Zhong commented on MAPREDUCE-5984:
---

Binglin,

Looks good to me.
Only one minor thing: I noticed there are tab characters used for alignment, 
while the rest uses space characters. We should keep this consistent.

> native-task: upgrade lz4 to lastest version
> ---
>
> Key: MAPREDUCE-5984
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5984
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Binglin Chang
>Assignee: Binglin Chang
>Priority: Minor
> Attachments: MAPREDUCE-5984.v1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-2841) Task level native optimization

2014-07-21 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069754#comment-14069754
 ] 

Sean Zhong commented on MAPREDUCE-2841:
---

Hi Binglin,

The TestGlibCBug UT failure is not expected. I am investigating why. I will open 
a subtask for this.

> Task level native optimization
> --
>
> Key: MAPREDUCE-2841
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: task
> Environment: x86-64 Linux/Unix
>Reporter: Binglin Chang
>Assignee: Sean Zhong
> Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, 
> MAPREDUCE-2841.v2.patch, dualpivot-0.patch, dualpivotv20-0.patch, 
> fb-shuffle.patch, hadoop-3.0-mapreduce-2841-2014-7-17.patch
>
>
> I've recently been working on a JNI-based native optimization for MapTask. 
> The basic idea is to add a NativeMapOutputCollector to handle the k/v pairs 
> emitted by the mapper, so that sort, spill, and IFile serialization can all be 
> done in native code. A preliminary test (on Xeon E5410, jdk6u24) showed 
> promising results:
> 1. Sort is about 3x-10x as fast as Java (only binary string compare is 
> supported)
> 2. IFile serialization speed is about 3x that of Java, about 500MB/s; if 
> hardware CRC32C is used, things can get much faster (1G/
> 3. Merge code is not completed yet, so the test uses enough io.sort.mb to 
> prevent mid-spill
> This leads to a total speedup of 2x~3x for the whole MapTask if 
> IdentityMapper (a mapper that does nothing) is used.
> There are limitations, of course: currently only Text and BytesWritable are 
> supported, and I have not thought through many things yet, such as how to 
> support map-side combine. I had some discussion with somebody familiar with 
> Hive, and it seems these limitations won't be much of a problem for Hive, at 
> least, to benefit from these optimizations. Advice or discussion about 
> improving compatibility is most welcome :)
> Currently NativeMapOutputCollector has a static method called canEnable(), 
> which checks whether the key/value types, comparator type, and combiner are 
> all compatible; MapTask can then choose to enable NativeMapOutputCollector.
> This is only a preliminary test; more work needs to be done. I expect better 
> final results, and I believe similar optimizations can be adopted for the 
> reduce task and shuffle too. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MAPREDUCE-5987) native-task: Unit test TestGlibCBug fails on ubuntu

2014-07-21 Thread Sean Zhong (JIRA)
Sean Zhong created MAPREDUCE-5987:
-

 Summary: native-task: Unit test TestGlibCBug fails on ubuntu
 Key: MAPREDUCE-5987
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5987
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: task
Reporter: Sean Zhong
Assignee: Sean Zhong
Priority: Minor


On  ubuntu12, glibc: 2.15-0ubuntu10.3, UT TestGlibCBug fails

[ RUN  ] IFile.TestGlibCBug
14/07/21 15:55:30 INFO TestGlibCBug ./testData/testGlibCBugSpill.out
/home/decster/projects/hadoop-trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/src/main/native/test/TestIFile.cc:186:
 Failure
Value of: realKey
  Actual: 1127504685
Expected: expect[index]
Which is: 4102672832
[  FAILED  ] IFile.TestGlibCBug (0 ms)
[--] 2 tests from IFile (240 ms total)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5976) native-task should not fail to build if snappy is missing

2014-07-22 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071258#comment-14071258
 ] 

Sean Zhong commented on MAPREDUCE-5976:
---

Thanks, Manu,

The patch looks good. +1

> native-task should not fail to build if snappy is missing
> -
>
> Key: MAPREDUCE-5976
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5976
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Sean Zhong
> Attachments: mapreduce-5976.txt
>
>
> Other native parts of Hadoop will automatically disable snappy support if 
> snappy is not present and -Drequire.snappy is not passed. native-task should 
> do the same. (right now, it fails to build if snappy is missing)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5994) native-task: TestBytesUtil fails

2014-07-23 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071832#comment-14071832
 ] 

Sean Zhong commented on MAPREDUCE-5994:
---

We did some testing; the tests pass.

> native-task: TestBytesUtil fails
> 
>
> Key: MAPREDUCE-5994
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5994
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Attachments: mapreduce-5994.txt
>
>
> This class appears to have some bugs. Two tests fail consistently on my 
> system. BytesUtil itself appears to duplicate a lot of code from guava - we 
> should probably just use the Guava functions.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5994) native-task: TestBytesUtil fails

2014-07-23 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072856#comment-14072856
 ] 

Sean Zhong commented on MAPREDUCE-5994:
---

yes, thanks

> native-task: TestBytesUtil fails
> 
>
> Key: MAPREDUCE-5994
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5994
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Attachments: mapreduce-5994.txt
>
>
> This class appears to have some bugs. Two tests fail consistently on my 
> system. BytesUtil itself appears to duplicate a lot of code from guava - we 
> should probably just use the Guava functions.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5997) native-task: Use DirectBufferPool from Hadoop Common

2014-07-23 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072900#comment-14072900
 ] 

Sean Zhong commented on MAPREDUCE-5997:
---

looks good to me, +1

> native-task: Use DirectBufferPool from Hadoop Common
> 
>
> Key: MAPREDUCE-5997
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5997
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Minor
> Attachments: mapreduce-5997.txt
>
>
> The native task code has its own direct buffer pool, but Hadoop already has 
> an implementation. HADOOP-10882 will move that implementation into Common, 
> and this JIRA is to remove the duplicate code and use that one instead.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-6000) native-task: Simplify ByteBufferDataReader/Writer

2014-07-24 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072981#comment-14072981
 ] 

Sean Zhong commented on MAPREDUCE-6000:
---

looks great, +1

> native-task: Simplify ByteBufferDataReader/Writer
> -
>
> Key: MAPREDUCE-6000
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6000
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Minor
> Attachments: mapreduce-6000.txt, mapreduce-6000.txt
>
>
> The ByteBufferDataReader and ByteBufferDataWriter class are more complex than 
> necessary:
> - several methods related to reading/writing strings and char arrays are 
> implemented but never used by the native task code. Given that the use case 
> for these classes is limited to serializing binary data to/from the native 
> code, it seems unlikely people will want to use these methods in any 
> performance-critical space. So, let's do simpler implementations that are 
> less likely to be buggy, even if they're slightly less performant.
> - methods like readLine() are even less likely to be used. Since it's a 
> complex implementation, let's just throw UnsupportedOperationException
> - in the test case, we can use Mockito to shorten the amount of new code



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5987) native-task: Unit test TestGlibCBug fails on ubuntu

2014-07-24 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073003#comment-14073003
 ] 

Sean Zhong commented on MAPREDUCE-5987:
---

The steps to reproduce this bug:

1. Allocate a small direct buffer, say 10 bytes.
2. Prepare a large data set on the Java side, say 1MB, and make the source data 
an incremental sequence.
3. Write the data; it first tries to fill the direct buffer, and when the 
buffer is full it notifies the native side to fetch the data, over and over.
4. On the native side, check the flushed data and make sure it is also 
sequential. Occasionally, one data element is corrupted.
5. The bug can only be reproduced when the direct buffer size is extremely 
small.

After the Glibc update to https://rhn.redhat.com/errata/RHBA-2013-0279.html, 
this no longer happens.
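
For illustration, the native-side check described in steps 3-4 amounts to 
something like the sketch below; the helper name and the 0..255 wrap-around 
pattern are assumptions, not the actual test code.

{code}
// Sketch of the native-side check from steps 3-4: the Java side writes an
// incremental byte sequence through a tiny direct buffer, and the native side
// verifies that every flushed chunk continues that sequence. Hypothetical
// helper, not code from the actual test.
#include <cstddef>
#include <cstdint>
#include <cstdio>

bool verifyChunk(const uint8_t* buf, size_t len, uint8_t* expected) {
  for (size_t i = 0; i < len; ++i) {
    if (buf[i] != *expected) {
      std::fprintf(stderr, "corruption at offset %zu: got %u, expected %u\n",
                   i, static_cast<unsigned>(buf[i]),
                   static_cast<unsigned>(*expected));
      return false;
    }
    ++(*expected);  // wraps at 255, matching a repeating 0..255 source pattern
  }
  return true;
}
{code}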

> native-task: Unit test TestGlibCBug fails on ubuntu
> ---
>
> Key: MAPREDUCE-5987
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5987
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Sean Zhong
>Assignee: Sean Zhong
>Priority: Minor
>
> On  ubuntu12, glibc: 2.15-0ubuntu10.3, UT TestGlibCBug fails
> [ RUN  ] IFile.TestGlibCBug
> 14/07/21 15:55:30 INFO TestGlibCBug ./testData/testGlibCBugSpill.out
> /home/decster/projects/hadoop-trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/src/main/native/test/TestIFile.cc:186:
>  Failure
> Value of: realKey
>   Actual: 1127504685
> Expected: expect[index]
> Which is: 4102672832
> [  FAILED  ] IFile.TestGlibCBug (0 ms)
> [--] 2 tests from IFile (240 ms total)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5976) native-task should not fail to build if snappy is missing

2014-07-24 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073229#comment-14073229
 ] 

Sean Zhong commented on MAPREDUCE-5976:
---

Thanks, Manu. Looks good, +1

Changes in the new patch:
1. Use the system-provided snappy header files and remove the built-in snappy 
headers.
2. The Java side delegates the codec check to a native function, 
NativeRuntime.supportCompressionCodec(codecName : String).


> native-task should not fail to build if snappy is missing
> -
>
> Key: MAPREDUCE-5976
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5976
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Sean Zhong
> Attachments: mapreduce-5976-v2.txt, mapreduce-5976.txt
>
>
> Other native parts of Hadoop will automatically disable snappy support if 
> snappy is not present and -Drequire.snappy is not passed. native-task should 
> do the same. (right now, it fails to build if snappy is missing)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5987) native-task: Unit test TestGlibCBug fails on ubuntu

2014-07-25 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074169#comment-14074169
 ] 

Sean Zhong commented on MAPREDUCE-5987:
---

By the way, memcpy seems to perform better than memmove; that is why I have not 
changed the code.

> native-task: Unit test TestGlibCBug fails on ubuntu
> ---
>
> Key: MAPREDUCE-5987
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5987
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Sean Zhong
>Assignee: Sean Zhong
>Priority: Minor
>
> On  ubuntu12, glibc: 2.15-0ubuntu10.3, UT TestGlibCBug fails
> [ RUN  ] IFile.TestGlibCBug
> 14/07/21 15:55:30 INFO TestGlibCBug ./testData/testGlibCBugSpill.out
> /home/decster/projects/hadoop-trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/src/main/native/test/TestIFile.cc:186:
>  Failure
> Value of: realKey
>   Actual: 1127504685
> Expected: expect[index]
> Which is: 4102672832
> [  FAILED  ] IFile.TestGlibCBug (0 ms)
> [--] 2 tests from IFile (240 ms total)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5987) native-task: Unit test TestGlibCBug fails on ubuntu

2014-07-25 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074164#comment-14074164
 ] 

Sean Zhong commented on MAPREDUCE-5987:
---

Hi Binglin,

Good point. I remember now that when I was troubleshooting it, it pointed to 
memcpy, and I did check the test carefully to make sure the source doesn't 
overlap with the dest.
I will check the code again to prove or refute your point.



> native-task: Unit test TestGlibCBug fails on ubuntu
> ---
>
> Key: MAPREDUCE-5987
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5987
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Sean Zhong
>Assignee: Sean Zhong
>Priority: Minor
>
> On  ubuntu12, glibc: 2.15-0ubuntu10.3, UT TestGlibCBug fails
> [ RUN  ] IFile.TestGlibCBug
> 14/07/21 15:55:30 INFO TestGlibCBug ./testData/testGlibCBugSpill.out
> /home/decster/projects/hadoop-trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/src/main/native/test/TestIFile.cc:186:
>  Failure
> Value of: realKey
>   Actual: 1127504685
> Expected: expect[index]
> Which is: 4102672832
> [  FAILED  ] IFile.TestGlibCBug (0 ms)
> [--] 2 tests from IFile (240 ms total)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-6005) native-task: fix some valgrind errors

2014-07-30 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080460#comment-14080460
 ] 

Sean Zhong commented on MAPREDUCE-6005:
---

1. About toHex
{quote}
-string StringUtil::ToString(const void * v, uint32_t len) {
+static char ToHex(uint8_t v) {
+  return v < 10 ? (v + '0') : (v - 10 + 'a');
+}
{quote}

It is not safe to skip the range check in a public function; besides, a 
correct general binary-to-string conversion should use base64 encoding. Since 
StringUtil::ToString(const void * v, uint32_t len) is only used for md5 
conversion, 
{quote}
  case MD5HashType:
dest.append(StringUtil::ToString(data, length));
{quote}

I believe we can rename StringUtil::ToString(const void * v, uint32_t len) to 
StringUtil::md5BinaryToString(const void * v, uint32_t len), and also make 
ToHex(uint8_t v) private or inline it into md5BinaryToString (a sketch follows 
after these comments).

2. Replacing memcpy with memmove is good, thanks.

3. About 
{quote}
   } else { // no more, pop heap
+delete _heap[0];
{quote}
There is another leak in Merge:
{quote}
MergeEntryPtr * base = &(_heap[0]);
popHeap(base, base + cur_heap_size, _comparator);
_heap.pop_back();
{quote}
Also, I suggest adding a comment in the source about why we delete _heap[0].


Everything else looks good, +1
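
For illustration, the rename suggested in point 1 could look like the sketch 
below; the placement and everything beyond the md5BinaryToString/ToHex names 
are assumptions, not the actual patch.

{code}
// Sketch only of the rename suggested in point 1: a dedicated md5BinaryToString
// with the hex helper kept internal.
#include <cstdint>
#include <string>

namespace {
inline char toHex(uint8_t v) {
  return v < 10 ? static_cast<char>('0' + v) : static_cast<char>('a' + v - 10);
}
}  // namespace

std::string md5BinaryToString(const void* v, uint32_t len) {
  const uint8_t* bytes = static_cast<const uint8_t*>(v);
  std::string out;
  out.reserve(len * 2);
  for (uint32_t i = 0; i < len; ++i) {
    out.push_back(toHex(bytes[i] >> 4));    // high nibble first
    out.push_back(toHex(bytes[i] & 0x0f));  // each nibble is < 16, no range check needed
  }
  return out;
}
{code}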

> native-task: fix some valgrind errors 
> --
>
> Key: MAPREDUCE-6005
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6005
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Binglin Chang
>Assignee: Binglin Chang
> Attachments: MAPREDUCE-6005.v1.patch, MAPREDUCE-6005.v2.patch
>
>
> Running test with valgrind shows there are some bugs, this jira try to fix 
> them.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5978) native-task CompressTest failure on Ubuntu

2014-07-30 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated MAPREDUCE-5978:
--

Assignee: Manu Zhang  (was: Mykola Nikishov)

> native-task CompressTest failure on Ubuntu
> --
>
> Key: MAPREDUCE-5978
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5978
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Manu Zhang
> Attachments: mapreduce-5978.txt
>
>
> The MR-2841 branch fails the following unit tests on my box:
>   CompressTest.testBzip2Compress:84 file compare result: if they are the same, 
> then return true expected:<true> but was:<false>
>   CompressTest.testDefaultCompress:116 file compare result: if they are the 
> same, then return true expected:<true> but was:<false>
> We need to fix these before merging.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5978) native-task CompressTest failure on Ubuntu

2014-07-30 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated MAPREDUCE-5978:
--

Assignee: Mykola Nikishov

> native-task CompressTest failure on Ubuntu
> --
>
> Key: MAPREDUCE-5978
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5978
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Mykola Nikishov
> Attachments: mapreduce-5978.txt
>
>
> The MR-2841 branch fails the following unit tests on my box:
>   CompressTest.testBzip2Compress:84 file compare result: if they are the same, 
> then return true expected:<true> but was:<false>
>   CompressTest.testDefaultCompress:116 file compare result: if they are the 
> same, then return true expected:<true> but was:<false>
> We need to fix these before merging.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-6005) native-task: fix some valgrind errors

2014-07-30 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080500#comment-14080500
 ] 

Sean Zhong commented on MAPREDUCE-6005:
---

Hi Binglin,

About the leak: at src/main/native/src/lib/Merge.cc there is a similar memory 
leak; you only fixed the leak in PartitionBucketIterator.cc in the patch 
https://issues.apache.org/jira/secure/attachment/12658416/MAPREDUCE-6005.v2.patch

About toHexString, the name is good. However, it may be better to use 
snprintf(buf_ptr, 3, "%02X", ...):

for (i = 0; i < size; i++)
{
buf_ptr += snprintf(buf_ptr, 3, "%02X", buf[i]);
}

(The size argument of 3 covers two hex digits plus the terminating NUL, which 
the next iteration overwrites.)

> native-task: fix some valgrind errors 
> --
>
> Key: MAPREDUCE-6005
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6005
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Binglin Chang
>Assignee: Binglin Chang
> Attachments: MAPREDUCE-6005.v1.patch, MAPREDUCE-6005.v2.patch
>
>
> Running test with valgrind shows there are some bugs, this jira try to fix 
> them.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-6005) native-task: fix some valgrind errors

2014-07-30 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080546#comment-14080546
 ] 

Sean Zhong commented on MAPREDUCE-6005:
---

{quote}
In merger, all MergeEntryPtr is owned by Merger::_entries, and is deleted in 
~Merger at end, so it doesn't require additional care.
{quote}

you are right, +1.

> native-task: fix some valgrind errors 
> --
>
> Key: MAPREDUCE-6005
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6005
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Binglin Chang
>Assignee: Binglin Chang
> Attachments: MAPREDUCE-6005.v1.patch, MAPREDUCE-6005.v2.patch, 
> MAPREDUCE-6005.v3.patch
>
>
> Running test with valgrind shows there are some bugs, this jira try to fix 
> them.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-6005) native-task: fix some valgrind errors

2014-07-30 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080567#comment-14080567
 ] 

Sean Zhong commented on MAPREDUCE-6005:
---

One more:

Can you also fix 
{quote}
string StringUtil::ToString(int32_t v) {
  char tmp[32];
  snprintf(tmp, 32, "%d", v);
  return tmp;
}

string StringUtil::ToString(uint32_t v) {
  char tmp[32];
  snprintf(tmp, 32, "%u", v);
  return tmp;
}

string StringUtil::ToString(int64_t v) {
  char tmp[32];
  snprintf(tmp, 32, "%lld", (long long int)v);
  return tmp;
}

string StringUtil::ToString(int64_t v, char pad, int64_t len) {
  char tmp[32];
  snprintf(tmp, 32, "%%%c%lldlld", pad, len);
  return Format(tmp, v);
}

string StringUtil::ToString(uint64_t v) {
  char tmp[32];
  snprintf(tmp, 32, "%llu", (long long unsigned int)v);
  return tmp;
}

string StringUtil::ToString(bool v) {
  if (v) {
return "true";
  } else {
return "false";
  }
}

string StringUtil::ToString(float v) {
  char tmp[32];
  snprintf(tmp, 32, "%f", v);
  return tmp;
}

string StringUtil::ToString(double v) {
  char tmp[32];
  snprintf(tmp, 32, "%lf", v);
  return tmp;
}

{quote}

1) It is not safe to convert a char array to a string like this. It will 
trigger the string copy constructor, and per 
http://www.cplusplus.com/reference/string/string/string/,
{quote}
string (const char* s);
{quote}
the C string needs to be null-terminated.

2) The snprintf(tmp, 32, "%lf", v) implementation is platform-dependent when 
the formatted length of v reaches the size of 32. It may truncate the raw data, 
or may drop the null terminator. Per http://linux.die.net/man/3/snprintf, 
{quote}
The functions snprintf() and vsnprintf() write at most size bytes (including 
the terminating null byte ('\0')) to str.
{quote}
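
For illustration, a truncation-safe variant of these ToString overloads could 
check snprintf's return value, as in the sketch below; this is one possible 
fix written as an assumption, not the proposed patch.

{code}
// Sketch only: a ToString variant that guards against snprintf truncation by
// checking the return value instead of trusting the fixed 32-byte buffer.
#include <cstdio>
#include <string>
#include <vector>

std::string ToStringChecked(double v) {
  char tmp[32];
  int n = std::snprintf(tmp, sizeof(tmp), "%lf", v);
  if (n < 0) {
    return "";  // encoding error
  }
  if (static_cast<size_t>(n) < sizeof(tmp)) {
    // Fits: construct with an explicit length, no reliance on strlen.
    return std::string(tmp, static_cast<size_t>(n));
  }
  // Truncated: retry with a buffer of exactly the required size.
  std::vector<char> big(static_cast<size_t>(n) + 1);
  std::snprintf(big.data(), big.size(), "%lf", v);
  return std::string(big.data(), static_cast<size_t>(n));
}
{code}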

> native-task: fix some valgrind errors 
> --
>
> Key: MAPREDUCE-6005
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6005
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Binglin Chang
>Assignee: Binglin Chang
> Attachments: MAPREDUCE-6005.v1.patch, MAPREDUCE-6005.v2.patch, 
> MAPREDUCE-6005.v3.patch
>
>
> Running test with valgrind shows there are some bugs, this jira try to fix 
> them.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-6005) native-task: fix some valgrind errors

2014-07-31 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080651#comment-14080651
 ] 

Sean Zhong commented on MAPREDUCE-6005:
---

Thanks. +1

> native-task: fix some valgrind errors 
> --
>
> Key: MAPREDUCE-6005
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6005
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Binglin Chang
>Assignee: Binglin Chang
> Attachments: MAPREDUCE-6005.v1.patch, MAPREDUCE-6005.v2.patch, 
> MAPREDUCE-6005.v3.patch, MAPREDUCE-6005.v4.patch
>
>
> Running test with valgrind shows there are some bugs, this jira try to fix 
> them.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5984) native-task: upgrade lz4 to lastest version

2014-08-04 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085733#comment-14085733
 ] 

Sean Zhong commented on MAPREDUCE-5984:
---

Looks good, +1

> native-task: upgrade lz4 to lastest version
> ---
>
> Key: MAPREDUCE-5984
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5984
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Binglin Chang
>Assignee: Binglin Chang
>Priority: Minor
> Attachments: MAPREDUCE-5984.v1.patch, MAPREDUCE-5984.v2.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5976) native-task should not fail to build if snappy is missing

2014-08-04 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085734#comment-14085734
 ] 

Sean Zhong commented on MAPREDUCE-5976:
---

[~tlipcon], can you take a look at the new patch?

> native-task should not fail to build if snappy is missing
> -
>
> Key: MAPREDUCE-5976
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5976
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Sean Zhong
> Attachments: mapreduce-5976-v2.txt, mapreduce-5976.txt
>
>
> Other native parts of Hadoop will automatically disable snappy support if 
> snappy is not present and -Drequire.snappy is not passed. native-task should 
> do the same. (right now, it fails to build if snappy is missing)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5978) native-task CompressTest failure on Ubuntu

2014-08-04 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085735#comment-14085735
 ] 

Sean Zhong commented on MAPREDUCE-5978:
---

Ok, patch looks good, +1

> native-task CompressTest failure on Ubuntu
> --
>
> Key: MAPREDUCE-5978
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5978
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Manu Zhang
> Attachments: mapreduce-5978.txt
>
>
> The MR-2841 branch fails the following unit tests on my box:
>   CompressTest.testBzip2Compress:84 file compare result: if they are the same, 
> then return true expected:<true> but was:<false>
>   CompressTest.testDefaultCompress:116 file compare result: if they are the 
> same, then return true expected:<true> but was:<false>
> We need to fix these before merging.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5976) native-task should not fail to build if snappy is missing

2014-08-06 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087375#comment-14087375
 ] 

Sean Zhong commented on MAPREDUCE-5976:
---

Committed to branch at r1616115. Thanks!

> native-task should not fail to build if snappy is missing
> -
>
> Key: MAPREDUCE-5976
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5976
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Sean Zhong
> Attachments: mapreduce-5976-v2.txt, mapreduce-5976-v3.txt, 
> mapreduce-5976-v4.txt, mapreduce-5976.txt
>
>
> Other native parts of Hadoop will automatically disable snappy support if 
> snappy is not present and -Drequire.snappy is not passed. native-task should 
> do the same. (right now, it fails to build if snappy is missing)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAPREDUCE-5976) native-task should not fail to build if snappy is missing

2014-08-06 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong resolved MAPREDUCE-5976.
---

  Resolution: Fixed
Hadoop Flags: Reviewed

Committed to branch MR-2841 at r1616115.

> native-task should not fail to build if snappy is missing
> -
>
> Key: MAPREDUCE-5976
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5976
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Sean Zhong
> Attachments: mapreduce-5976-v2.txt, mapreduce-5976-v3.txt, 
> mapreduce-5976-v4.txt, mapreduce-5976.txt
>
>
> Other native parts of Hadoop will automatically disable snappy support if 
> snappy is not present and -Drequire.snappy is not passed. native-task should 
> do the same. (right now, it fails to build if snappy is missing)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAPREDUCE-5978) native-task CompressTest failure on Ubuntu

2014-08-06 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong resolved MAPREDUCE-5978.
---

Resolution: Fixed

Committed to branch MR-2841 at r1616116. Thanks, Manu.

> native-task CompressTest failure on Ubuntu
> --
>
> Key: MAPREDUCE-5978
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5978
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Manu Zhang
> Attachments: mapreduce-5978.txt
>
>
> The MR-2841 branch fails the following unit tests on my box:
>   CompressTest.testBzip2Compress:84 file compare result: if they are the same 
> ,then return true expected: but was:
>   CompressTest.testDefaultCompress:116 file compare result: if they are the 
> same ,then return true expected: but was:
> We need to fix these before merging.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5987) native-task: Unit test TestGlibCBug fails on ubuntu

2014-08-06 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087387#comment-14087387
 ] 

Sean Zhong commented on MAPREDUCE-5987:
---

Hi, I cannot reproduce this either.

We will try to test it on more machines.

> native-task: Unit test TestGlibCBug fails on ubuntu
> ---
>
> Key: MAPREDUCE-5987
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5987
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Sean Zhong
>Assignee: Sean Zhong
>Priority: Minor
>
> On  ubuntu12, glibc: 2.15-0ubuntu10.3, UT TestGlibCBug fails
> [ RUN  ] IFile.TestGlibCBug
> 14/07/21 15:55:30 INFO TestGlibCBug ./testData/testGlibCBugSpill.out
> /home/decster/projects/hadoop-trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/src/main/native/test/TestIFile.cc:186:
>  Failure
> Value of: realKey
>   Actual: 1127504685
> Expected: expect[index]
> Which is: 4102672832
> [  FAILED  ] IFile.TestGlibCBug (0 ms)
> [--] 2 tests from IFile (240 ms total)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAPREDUCE-5987) native-task: Unit test TestGlibCBug fails on ubuntu

2014-08-10 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong resolved MAPREDUCE-5987.
---

Resolution: Cannot Reproduce

> native-task: Unit test TestGlibCBug fails on ubuntu
> ---
>
> Key: MAPREDUCE-5987
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5987
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Sean Zhong
>Assignee: Sean Zhong
>Priority: Minor
>
> On  ubuntu12, glibc: 2.15-0ubuntu10.3, UT TestGlibCBug fails
> [ RUN  ] IFile.TestGlibCBug
> 14/07/21 15:55:30 INFO TestGlibCBug ./testData/testGlibCBugSpill.out
> /home/decster/projects/hadoop-trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/src/main/native/test/TestIFile.cc:186:
>  Failure
> Value of: realKey
>   Actual: 1127504685
> Expected: expect[index]
> Which is: 4102672832
> [  FAILED  ] IFile.TestGlibCBug (0 ms)
> [--] 2 tests from IFile (240 ms total)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5987) native-task: Unit test TestGlibCBug fails on ubuntu

2014-08-10 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092347#comment-14092347
 ] 

Sean Zhong commented on MAPREDUCE-5987:
---

We tested in several environments and cannot reproduce it.

Let's close it for now.

> native-task: Unit test TestGlibCBug fails on ubuntu
> ---
>
> Key: MAPREDUCE-5987
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5987
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Sean Zhong
>Assignee: Sean Zhong
>Priority: Minor
>
> On  ubuntu12, glibc: 2.15-0ubuntu10.3, UT TestGlibCBug fails
> [ RUN  ] IFile.TestGlibCBug
> 14/07/21 15:55:30 INFO TestGlibCBug ./testData/testGlibCBugSpill.out
> /home/decster/projects/hadoop-trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/src/main/native/test/TestIFile.cc:186:
>  Failure
> Value of: realKey
>   Actual: 1127504685
> Expected: expect[index]
> Which is: 4102672832
> [  FAILED  ] IFile.TestGlibCBug (0 ms)
> [--] 2 tests from IFile (240 ms total)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (MAPREDUCE-5977) Fix or suppress native-task gcc warnings

2014-08-10 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong reassigned MAPREDUCE-5977:
-

Assignee: Manu Zhang  (was: Todd Lipcon)

Assigning this to Manu.

> Fix or suppress native-task gcc warnings
> 
>
> Key: MAPREDUCE-5977
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5977
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Manu Zhang
> Attachments: mapreduce-5977.txt
>
>
> Currently, building the native task code on gcc 4.8 has a fair number of 
> warnings. We should fix or suppress them so that new warnings are easier to 
> see.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5977) Fix or suppress native-task gcc warnings

2014-08-10 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092369#comment-14092369
 ] 

Sean Zhong commented on MAPREDUCE-5977:
---

Hi, Manu,

Can you attach a file with the GCC compile logs?

IMHO, the static_cast from int to uint32_t change doesn't sound necessary or 
clean; maybe we should suppress this kind of warning. 
+1 for PRIu64 and PRId64; we need to test this on multiple platforms to make 
sure it is safe.

For the gtest warnings, we can suppress the warning messages as described here: 
http://stackoverflow.com/questions/1867065/how-to-suppress-gcc-warnings-from-library-headers
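On the PRIu64/PRId64 point above, here is a minimal sketch (my own illustration, 
not code from the patch) of how the inttypes.h format macros keep 64-bit printf 
formats portable across platforms:

{code}
// Sketch only: print 64-bit counters portably instead of hard-coding "%lu"/"%llu".
#define __STDC_FORMAT_MACROS  // required by some older compilers before <inttypes.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main() {
  uint64_t outputSize = 1234567890123ULL;
  int64_t delta = -42;
  // PRIu64/PRId64 expand to the correct length modifier on every platform.
  printf("output bytes: %" PRIu64 ", delta: %" PRId64 "\n", outputSize, delta);
  return 0;
}
{code}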

> Fix or suppress native-task gcc warnings
> 
>
> Key: MAPREDUCE-5977
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5977
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Manu Zhang
> Attachments: mapreduce-5977.txt
>
>
> Currently, building the native task code on gcc 4.8 has a fair number of 
> warnings. We should fix or suppress them so that new warnings are easier to 
> see.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-6026) native-task: fix logging

2014-08-10 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092380#comment-14092380
 ] 

Sean Zhong commented on MAPREDUCE-6026:
---

+1

> native-task: fix logging
> 
>
> Key: MAPREDUCE-6026
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6026
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Attachments: mapreduce-6026.txt
>
>
> nativetask should use commons-logging and add log4j.properties in test 
> configuration as per hadoop standard



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5992) native-task test logs should not write to console

2014-08-10 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092385#comment-14092385
 ] 

Sean Zhong commented on MAPREDUCE-5992:
---

It is tricky to redirect the native console output to the Java logger using JNI. 
Another option would be log4cpp, but that requires an additional dependency and 
also a new config file, log4cpp.properties, so it is not clean either.

Maybe we should leave it as is?
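To give a feel for why the JNI route is awkward, here is a rough sketch (the Java 
class and method names below are hypothetical, not code from the branch): every 
native log call would need an attached JNIEnv plus a Java round trip.

{code}
#include <jni.h>

// Hedged sketch: forward one native log line to a hypothetical Java helper.
// Assumes a JavaVM* cached in JNI_OnLoad; class/method names are illustrative.
void forwardToJavaLogger(JavaVM* jvm, const char* msg) {
  JNIEnv* env = NULL;
  // Each native thread that wants to log must first attach to the JVM.
  if (jvm->AttachCurrentThread(reinterpret_cast<void**>(&env), NULL) != JNI_OK) {
    return;  // cannot log without an attached environment
  }
  jclass cls = env->FindClass("org/apache/hadoop/mapred/nativetask/NativeLog");
  if (cls == NULL) { env->ExceptionClear(); return; }
  jmethodID mid = env->GetStaticMethodID(cls, "info", "(Ljava/lang/String;)V");
  if (mid == NULL) { env->ExceptionClear(); return; }
  jstring jmsg = env->NewStringUTF(msg);
  env->CallStaticVoidMethod(cls, mid, jmsg);
  env->DeleteLocalRef(jmsg);
  env->DeleteLocalRef(cls);
}
{code}

Doing this for every line of native stdout/stderr adds a JNI transition per 
message, which is why neither this nor pulling in log4cpp looks clean.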

> native-task test logs should not write to console
> -
>
> Key: MAPREDUCE-5992
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5992
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>
> Most of our unit tests are configured with a log4j.properties test resource 
> so they don't spout a bunch of output to the console. We need to do the same 
> for native-task.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-6025) native-task: fix native library distribution

2014-08-10 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092394#comment-14092394
 ] 

Sean Zhong commented on MAPREDUCE-6025:
---

Can this config be moved to the pom of the sub-project 
hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask?

It seems unclean to have the nativetask distribution configuration in the 
top-level pom.xml.

> native-task: fix native library distribution
> 
>
> Key: MAPREDUCE-6025
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6025
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Manu Zhang
>Assignee: Manu Zhang
> Attachments: mapreduce-6025-v2.txt, mapreduce-6025-v3.txt, 
> mapreduce-6025.txt
>
>
> currently running "mvn install -Pdist" fails and nativetask native library is 
> not distributed to hadoop tar



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-6025) native-task: fix native library distribution

2014-08-15 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098356#comment-14098356
 ] 

Sean Zhong commented on MAPREDUCE-6025:
---

Thanks, +1

> native-task: fix native library distribution
> 
>
> Key: MAPREDUCE-6025
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6025
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Manu Zhang
>Assignee: Manu Zhang
> Attachments: mapreduce-6025-v2.txt, mapreduce-6025-v3.txt, 
> mapreduce-6025-v4.txt, mapreduce-6025.txt
>
>
> currently running "mvn install -Pdist" fails and nativetask native library is 
> not distributed to hadoop tar



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-6025) native-task: fix native library distribution

2014-08-15 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098362#comment-14098362
 ] 

Sean Zhong commented on MAPREDUCE-6025:
---

Hi Binglin, Manu made patch 4, which changes cp to cp -R and addresses the other 
concerns. Are you OK with that? I will commit after you confirm.

> native-task: fix native library distribution
> 
>
> Key: MAPREDUCE-6025
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6025
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Manu Zhang
>Assignee: Manu Zhang
> Attachments: mapreduce-6025-v2.txt, mapreduce-6025-v3.txt, 
> mapreduce-6025-v4.txt, mapreduce-6025.txt
>
>
> currently running "mvn install -Pdist" fails and nativetask native library is 
> not distributed to hadoop tar



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-6025) native-task: fix native library distribution

2014-08-15 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098367#comment-14098367
 ] 

Sean Zhong commented on MAPREDUCE-6025:
---

Thanks, committed.

> native-task: fix native library distribution
> 
>
> Key: MAPREDUCE-6025
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6025
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Manu Zhang
>Assignee: Manu Zhang
> Attachments: mapreduce-6025-v2.txt, mapreduce-6025-v3.txt, 
> mapreduce-6025-v4.txt, mapreduce-6025.txt
>
>
> currently running "mvn install -Pdist" fails and nativetask native library is 
> not distributed to hadoop tar



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAPREDUCE-6025) native-task: fix native library distribution

2014-08-17 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong resolved MAPREDUCE-6025.
---

Resolution: Fixed

> native-task: fix native library distribution
> 
>
> Key: MAPREDUCE-6025
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6025
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Manu Zhang
>Assignee: Manu Zhang
> Attachments: mapreduce-6025-v2.txt, mapreduce-6025-v3.txt, 
> mapreduce-6025-v4.txt, mapreduce-6025.txt
>
>
> currently running "mvn install -Pdist" fails and nativetask native library is 
> not distributed to hadoop tar



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5977) Fix or suppress native-task gcc warnings

2014-08-17 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100235#comment-14100235
 ] 

Sean Zhong commented on MAPREDUCE-5977:
---

gtest.h is moved to the folder gtest/include/gtest/; is this because it is easier 
to treat it as a system header and remove the compilation warnings?

The new patch looks good.
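For reference, a minimal sketch (my own illustration, not taken from the patch) of 
the two common ways to keep warnings coming from a bundled third-party header out 
of the build: pass the gtest include directory with -isystem so the compiler 
treats it as a system header, or wrap the include in diagnostic pragmas.

{code}
// Sketch only: keep warnings that originate inside gtest headers quiet while
// our own test code still compiles with full warnings enabled.
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-parameter"
#pragma GCC diagnostic ignored "-Wsign-compare"
#include "gtest/gtest.h"
#pragma GCC diagnostic pop

TEST(WarningSuppressionSketch, Compiles) {
  EXPECT_EQ(2, 1 + 1);  // links against gtest/gtest_main as usual
}
{code}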

> Fix or suppress native-task gcc warnings
> 
>
> Key: MAPREDUCE-5977
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5977
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Manu Zhang
> Attachments: gcc_compile.log, mapreduce-5977-v2.txt, 
> mapreduce-5977-v3.txt, mapreduce-5977.txt
>
>
> Currently, building the native task code on gcc 4.8 has a fair number of 
> warnings. We should fix or suppress them so that new warnings are easier to 
> see.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5977) Fix or suppress native-task gcc warnings

2014-08-24 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108675#comment-14108675
 ] 

Sean Zhong commented on MAPREDUCE-5977:
---

Hi Todd,

I encountered errors when trying to commit; can you try this on your side?

Commit failed (details follow):
Changing file
 
'C:\myData\MR-28412\hadoop-mapreduce-project\hadoop-mapreduce-client\hadoop-mapreduce-client-nativetask\src\main\native\gtest\gtest.h'
 is forbidden by the server
Access to
 
'/repos/asf/!svn/txr/1620252-ziae/hadoop/common/branches/MR-2841/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/src/main/native/gtest/gtest.h'
 forbidden
Additional errors:
DELETE of
 
'/repos/asf/!svn/txr/1620252-ziae/hadoop/common/branches/MR-2841/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/src/main/native/gtest/gtest.h':
 403 Forbidden



> Fix or suppress native-task gcc warnings
> 
>
> Key: MAPREDUCE-5977
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5977
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Manu Zhang
> Attachments: gcc_compile.log, mapreduce-5977-v2.txt, 
> mapreduce-5977-v3.txt, mapreduce-5977.txt
>
>
> Currently, building the native task code on gcc 4.8 has a fair number of 
> warnings. We should fix or suppress them so that new warnings are easier to 
> see.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-6058) native-task: KVTest and LargeKVTest should check mr job is sucessful

2014-09-01 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117831#comment-14117831
 ] 

Sean Zhong commented on MAPREDUCE-6058:
---

Looks good.

Should we move the following lines to a startUp() function, so that when adding a 
new test function we don't need to add these two lines?
   @Test
   public void testKVCompability() {
+Assume.assumeTrue(NativeCodeLoader.isNativeCodeLoaded());
+Assume.assumeTrue(NativeRuntime.isNativeLibraryLoaded());



> native-task: KVTest and LargeKVTest should check mr job is sucessful
> 
>
> Key: MAPREDUCE-6058
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6058
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Binglin Chang
>Assignee: Binglin Chang
>Priority: Minor
> Attachments: MAPREDUCE-6058.v1.patch, MAPREDUCE-6058.v2.patch
>
>
> When running KVTest and LargeKVTest, if the job failed for some reason(lack 
> libhadoop.so etc), both native and normal job failed, and both compare empty 
> output directory, so the test passes without noticing failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6058) native-task: KVTest and LargeKVTest should check mr job is sucessful

2014-09-01 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117843#comment-14117843
 ] 

Sean Zhong commented on MAPREDUCE-6058:
---

+1

> native-task: KVTest and LargeKVTest should check mr job is sucessful
> 
>
> Key: MAPREDUCE-6058
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6058
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Binglin Chang
>Assignee: Binglin Chang
>Priority: Minor
> Attachments: MAPREDUCE-6058.v1.patch, MAPREDUCE-6058.v2.patch, 
> MAPREDUCE-6058.v3.patch
>
>
> When running KVTest and LargeKVTest, if the job failed for some reason(lack 
> libhadoop.so etc), both native and normal job failed, and both compare empty 
> output directory, so the test passes without noticing failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6055) native-task: findbugs, interface annotations, and other misc cleanup

2014-09-01 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117844#comment-14117844
 ] 

Sean Zhong commented on MAPREDUCE-6055:
---

DefaultSerializer
Platform
INativeComparable 
INativeSerializer
should be public and evolving. 

The other changes look good. We can ignore the createNativeTaskOutput warning.

> native-task: findbugs, interface annotations, and other misc cleanup
> 
>
> Key: MAPREDUCE-6055
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6055
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Attachments: mapreduce-6055.txt
>
>
> A few items which we need to address before merge:
> - fix findbugs errors
> - add interface and stability annotations to all public classes
> - fix eclipse warnings where possible



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6065) native-task: warnings about illegal Progress values

2014-09-02 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated MAPREDUCE-6065:
--
Assignee: Manu Zhang

> native-task: warnings about illegal Progress values
> ---
>
> Key: MAPREDUCE-6065
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6065
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Manu Zhang
>
> In running terasort tests, I see a few warnings like this:
> 2014-09-02 18:50:34,623 WARN [main] org.apache.hadoop.util.Progress: Illegal 
> progress value found, progress is larger than 1. Progress will be changed to 1
> It sounds like we're improperly calculating task progress somewhere. We 
> should fix this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-2841) Task level native optimization

2014-09-02 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119200#comment-14119200
 ] 

Sean Zhong commented on MAPREDUCE-2841:
---

Hi Todd,

We typically choose a block size of 512MB, tune io.sort.mb so that each task 
spills only once, and use 10GbE for testing. The other settings are the typical 
defaults.

> Task level native optimization
> --
>
> Key: MAPREDUCE-2841
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: task
> Environment: x86-64 Linux/Unix
>Reporter: Binglin Chang
>Assignee: Sean Zhong
> Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, 
> MAPREDUCE-2841.v2.patch, dualpivot-0.patch, dualpivotv20-0.patch, 
> fb-shuffle.patch, hadoop-3.0-mapreduce-2841-2014-7-17.patch, 
> micro-benchmark.txt
>
>
> I'm recently working on native optimization for MapTask based on JNI. 
> The basic idea is that, add a NativeMapOutputCollector to handle k/v pairs 
> emitted by mapper, therefore sort, spill, IFile serialization can all be done 
> in native code, preliminary test(on Xeon E5410, jdk6u24) showed promising 
> results:
> 1. Sort is about 3x-10x as fast as java(only binary string compare is 
> supported)
> 2. IFile serialization speed is about 3x of java, about 500MB/s, if hardware 
> CRC32C is used, things can get much faster(1G/
> 3. Merge code is not completed yet, so the test use enough io.sort.mb to 
> prevent mid-spill
> This leads to a total speed up of 2x~3x for the whole MapTask, if 
> IdentityMapper(mapper does nothing) is used
> There are limitations of course, currently only Text and BytesWritable is 
> supported, and I have not think through many things right now, such as how to 
> support map side combine. I had some discussion with somebody familiar with 
> hive, it seems that these limitations won't be much problem for Hive to 
> benefit from those optimizations, at least. Advices or discussions about 
> improving compatibility are most welcome:) 
> Currently NativeMapOutputCollector has a static method called canEnable(), 
> which checks if key/value type, comparator type, combiner are all compatible, 
> then MapTask can choose to enable NativeMapOutputCollector.
> This is only a preliminary test, more work need to be done. I expect better 
> final results, and I believe similar optimization can be adopt to reduce task 
> and shuffle too. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6065) native-task: warnings about illegal Progress values

2014-09-02 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119204#comment-14119204
 ] 

Sean Zhong commented on MAPREDUCE-6065:
---

Sure, we will take care of it.

> native-task: warnings about illegal Progress values
> ---
>
> Key: MAPREDUCE-6065
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6065
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Manu Zhang
>
> In running terasort tests, I see a few warnings like this:
> 2014-09-02 18:50:34,623 WARN [main] org.apache.hadoop.util.Progress: Illegal 
> progress value found, progress is larger than 1. Progress will be changed to 1
> It sounds like we're improperly calculating task progress somewhere. We 
> should fix this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5992) native-task test logs should not write to console

2014-09-02 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119207#comment-14119207
 ] 

Sean Zhong commented on MAPREDUCE-5992:
---

Should we mark this as won't fix? 

I cannot find a clean solution currently. 

> native-task test logs should not write to console
> -
>
> Key: MAPREDUCE-5992
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5992
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>
> Most of our unit tests are configured with a log4j.properties test resource 
> so they don't spout a bunch of output to the console. We need to do the same 
> for native-task.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6058) native-task: KVTest and LargeKVTest should check mr job is sucessful

2014-09-03 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120844#comment-14120844
 ] 

Sean Zhong commented on MAPREDUCE-6058:
---

Agree with Manu

> native-task: KVTest and LargeKVTest should check mr job is sucessful
> 
>
> Key: MAPREDUCE-6058
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6058
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Binglin Chang
>Assignee: Binglin Chang
>Priority: Minor
> Attachments: MAPREDUCE-6058.v1.patch, MAPREDUCE-6058.v2.patch, 
> MAPREDUCE-6058.v3.patch
>
>
> When running KVTest and LargeKVTest, if the job failed for some reason(lack 
> libhadoop.so etc), both native and normal job failed, and both compare empty 
> output directory, so the test passes without noticing failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6069) native-task: Style fixups and dead code removal

2014-09-03 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121021#comment-14121021
 ] 

Sean Zhong commented on MAPREDUCE-6069:
---

Todd,

I cannot thank you enough for this! I know cleaning this up will take a huge 
amount of time, and it is an essential step towards making this acceptable to the 
community.

Considering this patch is huge, maybe it would be easier to break it down into 
several small patches, so that we can track the changes step by step. We can work 
on your patch to break it down into several smaller ones.

> native-task: Style fixups and dead code removal
> ---
>
> Key: MAPREDUCE-6069
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6069
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
> Attachments: mr-6069.txt
>
>
> A few more cleanup things we should address:
> - fix style issues (eg lines too long, bad indentation, commented-out code 
> blocks, etc) both in Java and C++
> - Found a few pieces of unused code by running a coverage tool. We should 
> remove them



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6067) native-task: spilled records counter is incorrect

2014-09-03 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121028#comment-14121028
 ] 

Sean Zhong commented on MAPREDUCE-6067:
---

{quote}
//assertEquals("Native Reduce reduce group counter should equal orignal 
reduce group counter",
//nativeReduceGroups.getValue(), normalReduceGroups.getValue());
{quote}

Hi Todd,

I made that change one year ago. The idea is that since the combiner is an 
optional step, whether the combiner reduces 50% of the data or 90% of the data, 
both results are correct. So, for some optimizations, we may not need to combine 
every key, and can just leave them to be handled by the reducer.

> native-task: spilled records counter is incorrect
> -
>
> Key: MAPREDUCE-6067
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6067
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Binglin Chang
> Attachments: MAPREDUCE-6067.v1.patch, native-counters.html, 
> trunk-counters.html
>
>
> After running a terasort, I see the spilled records counter at 5028651606, 
> which is about half what I expected to see. Using the non-native collector I 
> see the expected count of 100. It seems the correct number of records 
> were indeed spilled, because the job's output record count is correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6067) native-task: spilled records counter is incorrect

2014-09-04 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121306#comment-14121306
 ] 

Sean Zhong commented on MAPREDUCE-6067:
---

Hi Binglin,

Thanks for your patch. I have some doubts about the latest version:
{quote}
void MCollectorOutputHandler::handleInput(ByteBuffer & in) {
  char * buff = in.current();
  uint32_t length = in.remain();

  const char * end = buff + length;
  char * pos = buff;
  if (_kvContainer.remain() > 0) {
uint32_t filledLength = _kvContainer.fill(pos, length);
pos += filledLength;
  }

  while (end - pos > 0) {
KVBufferWithParititionId * kvBuffer = (KVBufferWithParititionId *)pos;

if (unlikely(end - pos < KVBuffer::headerLength())) {
  THROW_EXCEPTION(IOException, "k/v meta information incomplete");
}

if (_endium == LARGE_ENDIUM) {
  kvBuffer->partitionId = bswap(kvBuffer->partitionId);
  kvBuffer->buffer.keyLength = bswap(kvBuffer->buffer.keyLength);
  kvBuffer->buffer.valueLength = bswap(kvBuffer->buffer.valueLength);
}

uint32_t kvLength = kvBuffer->buffer.length();

KVBuffer * dest = allocateKVBuffer(kvBuffer->partitionId, kvLength);
_kvContainer.wrap((char *)dest, kvLength);

pos += 4; //skip the partition length
uint32_t filledLength = _kvContainer.fill(pos, end - pos);
pos += filledLength;

+_mapOutputRecords->increase();
+uint32_t outputSize = kvLength-KVBuffer::headerLength();
+_mapOutputBytes->increase(outputSize);
  }
}
{quote}
Since the newly added lines lie in the performance-critical path, maybe it is 
risky to change here?

{quote}
+  Counter * _mapOutputRecords;
+  Counter * _mapOutputBytes;
{quote}

These two are not initialized in the constructor. That will trigger compile 
warnings, which we put great effort into resolving in MAPREDUCE-5977. There are 
similar issues in other files. Make sure all newly added fields are initialized 
in the constructor, in the same order as the field definitions, otherwise there 
will be GCC warnings.
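A minimal sketch of the convention being asked for here (illustrative names only, 
not the actual handler class): every field appears in the constructor's 
initializer list, in declaration order, so GCC's -Wreorder and 
uninitialized-member warnings stay quiet.

{code}
#include <cstddef>

class Counter;  // forward declaration, as in the native handler code

// Sketch only: fields are initialized in the same order they are declared.
class OutputHandlerSketch {
 private:
  bool _spillDone;
  Counter* _mapOutputRecords;
  Counter* _mapOutputBytes;

 public:
  OutputHandlerSketch()
      : _spillDone(false),
        _mapOutputRecords(NULL),
        _mapOutputBytes(NULL) {
  }
};
{code}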

{quote}
-  const uint64_t M = 100; //million
-  LOG("[MapOutputCollector::final_merge_and_spill] Spilling file path: %s", 
filepath.c_str());
{quote}

Is the log removed because it is too noisy? The log was added after real pain and 
troubleshooting in practice; it gives out important information when spilling. I 
also noticed some other log messages are removed; do we have enough reason for 
this? 
 
{quote}
-  } else {
-LOG("MemoryPool is full, fail to allocate new MemBlock, block size: 
%d, kv length: %d", expect, kvLength);
{quote}
I can understand why you removed these two lines. But these two lines also helped 
when troubleshooting a real bug where the KV length memory was corrupted: the LOG 
printed a huge kvLength value, which really helped in the troubleshooting.

{quote}
+  assertEquals(reason, true, compareRet);
+  ResultVerifier.verifyCounters(normaljob, nativejob);
+}
fs.close();
{quote}
I am OK with the change as long as all regression tests pass.

 in KVTest.java
{quote}
   }
 
+  Job normalJob;
+  Job nativeJob;
+
@Test
public void testKVCompability() throws Exception {
{quote}

Can we make normalJob and nativeJob local variables instead of field members? 
Since this is a test file, test cases should share nothing except immutable 
things defined in the test setup.

{quote}
-if(compareRet){
-  final FileSystem fs = FileSystem.get(hadoopkvtestconf);
-  fs.delete(new Path(nativeoutput), true);
-  fs.delete(new Path(normaloutput), true);
-  fs.delete(new Path(input), true);
-  fs.close();
-}
{quote}

By deleting the cleanup code, have you confirmed that it will not leak any 
garbage files on the local disk?

{quote}
   @AfterClass
@@ -150,6 +148,7 @@ private String runNativeTest(String jobname, Class 
keyclass, Class valuecl
 nativekvtestconf.set(TestConstants.NATIVETASK_KVTEST_CREATEFILE, "true");
 final KVJob keyJob = new KVJob(jobname, nativekvtestconf, keyclass, 
valueclass, inputpath, outputpath);
 assertTrue("job should complete successfully", keyJob.runJob());
+nativeJob = keyJob.job;
 return outputpath;
   }
{quote}

This line of change is confusing when looked at on its own. 


{quote}
+Counters normalCounters = normalJob.getCounters();
+Counters nativeCounters = nativeJob.getCounters();
+assertEquals(
+normalCounters.findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue(),
+nativeCounters.findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue());
+assertEquals(
+normalCounters.findCounter(TaskCounter.REDUCE_INPUT_GROUPS).getValue(),
+
nativeCounters.findCounter(TaskCounter.REDUCE_INPUT_GROUPS).getValue());
+assertEquals(
+
normalCounters.findCounter(TaskCounter.REDUCE_INPUT_RECORDS).getValue(),
+
nativeCounters.findCounter(TaskCounter.REDUCE_INPUT_RECORDS).getValue());
{quote}

Maybe we can

[jira] [Commented] (MAPREDUCE-6067) native-task: spilled records counter is incorrect

2014-09-04 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121309#comment-14121309
 ] 

Sean Zhong commented on MAPREDUCE-6067:
---

By the way, the patch doesn't apply cleanly to my local source folder. 

> native-task: spilled records counter is incorrect
> -
>
> Key: MAPREDUCE-6067
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6067
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Binglin Chang
> Attachments: MAPREDUCE-6067.v1.patch, MAPREDUCE-6067.v2.patch, 
> MAPREDUCE-6067.v3.patch, MAPREDUCE-6067.v4.patch, native-counters.html, 
> trunk-counters.html
>
>
> After running a terasort, I see the spilled records counter at 5028651606, 
> which is about half what I expected to see. Using the non-native collector I 
> see the expected count of 100. It seems the correct number of records 
> were indeed spilled, because the job's output record count is correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6067) native-task: spilled records counter is incorrect

2014-09-04 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122344#comment-14122344
 ] 

Sean Zhong commented on MAPREDUCE-6067:
---

Hi Binglin, 

{quote}
Thanks for the comments Manu and Sean.
Since the new added line lies in the critical path of performance. May be it is 
risky to change here?
1. The added code just increase 2 counters, the performance impact should be 
negligible, and we need a way to get the counter number right? java side also 
increase counters for every kv pair.
{quote}

I understand 100% what you are doing; the counter is important. I also believe 
there should be no performance impact. The reason I raise this is that this part 
of the code was tuned for CPU cache efficiency over months, and during that 
process I did see some seemingly trivial changes, like one or two lines, impact 
the CPU cache efficiency. 

I am OK with you committing this change; I just want to point out the potential risk.

{quote}
 In common practice, log added when troubleshooting bug should be remove after 
the bug is found and fixed. 
{quote}
The troubleshooting I mean is the operations team troubleshooting for customers.

{quote}
spill file path is useful only for debugging only
{quote}
I think the operations team will want this. In the field, I myself have benefited 
from this line of code during onsite customer support. I will want to know where 
the intermediate files go.

{quote}
LOG("MemoryPool is full, fail to allocate new MemBlock, block size: %d, kv 
length: %d", expect, kvLength);
{quote}
For this line, I agree that you should remove it.

For the merge log and spill log, I do think we should keep them. First, there are 
only a few lines of them, so they will not impact readability. Second, they 
indicate the normal application flow.
It is strange that there is no log like this for the combiner when I enable the 
combiner.
{quote}
-  if (total_record != 0) {
-LOG("[Merge] Merged segment#: %lu, record#: %"PRIu64", avg record size: 
%"PRIu64", uncompressed total bytes: %"PRIu64", compressed total bytes: 
%"PRIu64", time: %"PRIu64" ms",
-_entries.size(),
-total_record,
-output_size / (total_record),
-output_size,
-real_output_size,
-interval / M);
{quote}

{quote}
4. Simple use local var doesn't work, if we want to eliminate field member, we 
need a way to get both outputpath and job from sub-methods(runNativeTest, 
runNormalTest), perhaps just inline them into test method, this is lot change 
compare to current approach, if you think it's OK, I will make more aggressive 
changes.
{quote}

I think more changes are better here. Some Maven test plugins allow tests to be 
run in parallel; sharing mutable state between test cases is wrong.

{quote}
 I see you already add cleanUp method to remove root dir, so the old cleanup 
code is removed
{quote}

OK.

{quote}
+  Counter * materializedBytes = 
NativeObjectFactory::GetCounter(TaskCounters::TASK_COUNTER_GROUP,
+  TaskCounters::MAP_OUTPUT_MATERIALIZED_BYTES);
{quote}

Yes, but you only have a declaration and never use it? 

> native-task: spilled records counter is incorrect
> -
>
> Key: MAPREDUCE-6067
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6067
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Binglin Chang
> Attachments: MAPREDUCE-6067.v1.patch, MAPREDUCE-6067.v2.patch, 
> MAPREDUCE-6067.v3.patch, MAPREDUCE-6067.v4.patch, native-counters.html, 
> trunk-counters.html
>
>
> After running a terasort, I see the spilled records counter at 5028651606, 
> which is about half what I expected to see. Using the non-native collector I 
> see the expected count of 100. It seems the correct number of records 
> were indeed spilled, because the job's output record count is correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6067) native-task: spilled records counter is incorrect

2014-09-04 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122356#comment-14122356
 ] 

Sean Zhong commented on MAPREDUCE-6067:
---

Here, the variable name is different
{quote}
+
+  _outputBytes->increase(realOutputSize);
{quote}

Oh, yes, do you mind using the same name?

+1 for the other parts.

> native-task: spilled records counter is incorrect
> -
>
> Key: MAPREDUCE-6067
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6067
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Binglin Chang
> Attachments: MAPREDUCE-6067.v1.patch, MAPREDUCE-6067.v2.patch, 
> MAPREDUCE-6067.v3.patch, MAPREDUCE-6067.v4.patch, native-counters.html, 
> trunk-counters.html
>
>
> After running a terasort, I see the spilled records counter at 5028651606, 
> which is about half what I expected to see. Using the non-native collector I 
> see the expected count of 100. It seems the correct number of records 
> were indeed spilled, because the job's output record count is correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6067) native-task: spilled records counter is incorrect

2014-09-04 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122449#comment-14122449
 ] 

Sean Zhong commented on MAPREDUCE-6067:
---

+1 for the change.

Have you run the tests against this?

> native-task: spilled records counter is incorrect
> -
>
> Key: MAPREDUCE-6067
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6067
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Binglin Chang
> Attachments: MAPREDUCE-6067.v1.patch, MAPREDUCE-6067.v2.patch, 
> MAPREDUCE-6067.v3.patch, MAPREDUCE-6067.v4.patch, MAPREDUCE-6067.v5.patch, 
> native-counters.html, trunk-counters.html
>
>
> After running a terasort, I see the spilled records counter at 5028651606, 
> which is about half what I expected to see. Using the non-native collector I 
> see the expected count of 100. It seems the correct number of records 
> were indeed spilled, because the job's output record count is correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6069) native-task: Style fixups and dead code removal

2014-09-04 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122455#comment-14122455
 ] 

Sean Zhong commented on MAPREDUCE-6069:
---

Hi Todd,

where is the new patch?

> native-task: Style fixups and dead code removal
> ---
>
> Key: MAPREDUCE-6069
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6069
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
> Attachments: mr-6069.txt, mr-6069.txt
>
>
> A few more cleanup things we should address:
> - fix style issues (eg lines too long, bad indentation, commented-out code 
> blocks, etc) both in Java and C++
> - Found a few pieces of unused code by running a coverage tool. We should 
> remove them



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6067) native-task: spilled records counter is incorrect

2014-09-04 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122474#comment-14122474
 ] 

Sean Zhong commented on MAPREDUCE-6067:
---

+1

> native-task: spilled records counter is incorrect
> -
>
> Key: MAPREDUCE-6067
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6067
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Binglin Chang
> Attachments: MAPREDUCE-6067.v1.patch, MAPREDUCE-6067.v2.patch, 
> MAPREDUCE-6067.v3.patch, MAPREDUCE-6067.v4.patch, MAPREDUCE-6067.v5.patch, 
> native-counters.html, trunk-counters.html
>
>
> After running a terasort, I see the spilled records counter at 5028651606, 
> which is about half what I expected to see. Using the non-native collector I 
> see the expected count of 100. It seems the correct number of records 
> were indeed spilled, because the job's output record count is correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6069) native-task: Style fixups and dead code removal

2014-09-04 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122476#comment-14122476
 ] 

Sean Zhong commented on MAPREDUCE-6069:
---

The file you uploaded is only 8K; it looks like you uploaded the wrong file.

> native-task: Style fixups and dead code removal
> ---
>
> Key: MAPREDUCE-6069
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6069
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
> Attachments: mr-6069.txt, mr-6069.txt
>
>
> A few more cleanup things we should address:
> - fix style issues (eg lines too long, bad indentation, commented-out code 
> blocks, etc) both in Java and C++
> - Found a few pieces of unused code by running a coverage tool. We should 
> remove them



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6069) native-task: Style fixups and dead code removal

2014-09-05 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122538#comment-14122538
 ] 

Sean Zhong commented on MAPREDUCE-6069:
---

+1, thanks!

> native-task: Style fixups and dead code removal
> ---
>
> Key: MAPREDUCE-6069
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6069
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
> Attachments: mr-6069.txt, mr-6069.txt, mr-6069.txt
>
>
> A few more cleanup things we should address:
> - fix style issues (eg lines too long, bad indentation, commented-out code 
> blocks, etc) both in Java and C++
> - Found a few pieces of unused code by running a coverage tool. We should 
> remove them



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MAPREDUCE-6067) native-task: fix some counter issues

2014-09-05 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong resolved MAPREDUCE-6067.
---
Resolution: Fixed

Marking as resolved.

> native-task: fix some counter issues
> 
>
> Key: MAPREDUCE-6067
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6067
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Binglin Chang
> Attachments: MAPREDUCE-6067.v1.patch, MAPREDUCE-6067.v2.patch, 
> MAPREDUCE-6067.v3.patch, MAPREDUCE-6067.v4.patch, MAPREDUCE-6067.v5.patch, 
> native-counters.html, trunk-counters.html
>
>
> After running a terasort, I see the spilled records counter at 5028651606, 
> which is about half what I expected to see. Using the non-native collector I 
> see the expected count of 100. It seems the correct number of records 
> were indeed spilled, because the job's output record count is correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-2841) Task level native optimization

2014-09-05 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124270#comment-14124270
 ] 

Sean Zhong commented on MAPREDUCE-2841:
---

Hi Todd,

In that case, we can remove the CustomModule SDK example. 

> Task level native optimization
> --
>
> Key: MAPREDUCE-2841
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: task
> Environment: x86-64 Linux/Unix
>Reporter: Binglin Chang
>Assignee: Sean Zhong
> Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, 
> MAPREDUCE-2841.v2.patch, MR-2841benchmarks.pdf, dualpivot-0.patch, 
> dualpivotv20-0.patch, fb-shuffle.patch, 
> hadoop-3.0-mapreduce-2841-2014-7-17.patch, micro-benchmark.txt, 
> mr-2841-merge.txt
>
>
> I'm recently working on native optimization for MapTask based on JNI. 
> The basic idea is that, add a NativeMapOutputCollector to handle k/v pairs 
> emitted by mapper, therefore sort, spill, IFile serialization can all be done 
> in native code, preliminary test(on Xeon E5410, jdk6u24) showed promising 
> results:
> 1. Sort is about 3x-10x as fast as java(only binary string compare is 
> supported)
> 2. IFile serialization speed is about 3x of java, about 500MB/s, if hardware 
> CRC32C is used, things can get much faster(1G/
> 3. Merge code is not completed yet, so the test use enough io.sort.mb to 
> prevent mid-spill
> This leads to a total speed up of 2x~3x for the whole MapTask, if 
> IdentityMapper(mapper does nothing) is used
> There are limitations of course, currently only Text and BytesWritable is 
> supported, and I have not think through many things right now, such as how to 
> support map side combine. I had some discussion with somebody familiar with 
> hive, it seems that these limitations won't be much problem for Hive to 
> benefit from those optimizations, at least. Advices or discussions about 
> improving compatibility are most welcome:) 
> Currently NativeMapOutputCollector has a static method called canEnable(), 
> which checks if key/value type, comparator type, combiner are all compatible, 
> then MapTask can choose to enable NativeMapOutputCollector.
> This is only a preliminary test, more work need to be done. I expect better 
> final results, and I believe similar optimization can be adopt to reduce task 
> and shuffle too. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-2841) Task level native optimization

2014-09-05 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated MAPREDUCE-2841:
--
Status: In Progress  (was: Patch Available)

> Task level native optimization
> --
>
> Key: MAPREDUCE-2841
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: task
> Environment: x86-64 Linux/Unix
>Reporter: Binglin Chang
>Assignee: Sean Zhong
> Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, 
> MAPREDUCE-2841.v2.patch, MR-2841benchmarks.pdf, dualpivot-0.patch, 
> dualpivotv20-0.patch, fb-shuffle.patch, 
> hadoop-3.0-mapreduce-2841-2014-7-17.patch, micro-benchmark.txt, 
> mr-2841-merge.txt
>
>
> I'm recently working on native optimization for MapTask based on JNI. 
> The basic idea is that, add a NativeMapOutputCollector to handle k/v pairs 
> emitted by mapper, therefore sort, spill, IFile serialization can all be done 
> in native code, preliminary test(on Xeon E5410, jdk6u24) showed promising 
> results:
> 1. Sort is about 3x-10x as fast as java(only binary string compare is 
> supported)
> 2. IFile serialization speed is about 3x of java, about 500MB/s, if hardware 
> CRC32C is used, things can get much faster(1G/
> 3. Merge code is not completed yet, so the test use enough io.sort.mb to 
> prevent mid-spill
> This leads to a total speed up of 2x~3x for the whole MapTask, if 
> IdentityMapper(mapper does nothing) is used
> There are limitations of course, currently only Text and BytesWritable is 
> supported, and I have not think through many things right now, such as how to 
> support map side combine. I had some discussion with somebody familiar with 
> hive, it seems that these limitations won't be much problem for Hive to 
> benefit from those optimizations, at least. Advices or discussions about 
> improving compatibility are most welcome:) 
> Currently NativeMapOutputCollector has a static method called canEnable(), 
> which checks if key/value type, comparator type, combiner are all compatible, 
> then MapTask can choose to enable NativeMapOutputCollector.
> This is only a preliminary test, more work need to be done. I expect better 
> final results, and I believe similar optimization can be adopt to reduce task 
> and shuffle too. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-2841) Task level native optimization

2014-09-05 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated MAPREDUCE-2841:
--
Attachment: mr-2841-merge-2.txt

New patch after removing the CustomModule.

> Task level native optimization
> --
>
> Key: MAPREDUCE-2841
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: task
> Environment: x86-64 Linux/Unix
>Reporter: Binglin Chang
>Assignee: Sean Zhong
> Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, 
> MAPREDUCE-2841.v2.patch, MR-2841benchmarks.pdf, dualpivot-0.patch, 
> dualpivotv20-0.patch, fb-shuffle.patch, 
> hadoop-3.0-mapreduce-2841-2014-7-17.patch, micro-benchmark.txt, 
> mr-2841-merge-2.txt, mr-2841-merge.txt
>
>
> I'm recently working on native optimization for MapTask based on JNI. 
> The basic idea is that, add a NativeMapOutputCollector to handle k/v pairs 
> emitted by mapper, therefore sort, spill, IFile serialization can all be done 
> in native code, preliminary test(on Xeon E5410, jdk6u24) showed promising 
> results:
> 1. Sort is about 3x-10x as fast as java(only binary string compare is 
> supported)
> 2. IFile serialization speed is about 3x of java, about 500MB/s, if hardware 
> CRC32C is used, things can get much faster(1G/
> 3. Merge code is not completed yet, so the test use enough io.sort.mb to 
> prevent mid-spill
> This leads to a total speed up of 2x~3x for the whole MapTask, if 
> IdentityMapper(mapper does nothing) is used
> There are limitations of course, currently only Text and BytesWritable is 
> supported, and I have not think through many things right now, such as how to 
> support map side combine. I had some discussion with somebody familiar with 
> hive, it seems that these limitations won't be much problem for Hive to 
> benefit from those optimizations, at least. Advices or discussions about 
> improving compatibility are most welcome:) 
> Currently NativeMapOutputCollector has a static method called canEnable(), 
> which checks if key/value type, comparator type, combiner are all compatible, 
> then MapTask can choose to enable NativeMapOutputCollector.
> This is only a preliminary test, more work need to be done. I expect better 
> final results, and I believe similar optimization can be adopt to reduce task 
> and shuffle too. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-2841) Task level native optimization

2014-09-05 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated MAPREDUCE-2841:
--
Status: Patch Available  (was: In Progress)

> Task level native optimization
> --
>
> Key: MAPREDUCE-2841
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: task
> Environment: x86-64 Linux/Unix
>Reporter: Binglin Chang
>Assignee: Sean Zhong
> Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, 
> MAPREDUCE-2841.v2.patch, MR-2841benchmarks.pdf, dualpivot-0.patch, 
> dualpivotv20-0.patch, fb-shuffle.patch, 
> hadoop-3.0-mapreduce-2841-2014-7-17.patch, micro-benchmark.txt, 
> mr-2841-merge-2.txt, mr-2841-merge.txt
>
>
> I'm currently working on a JNI-based native optimization for MapTask. 
> The basic idea is to add a NativeMapOutputCollector to handle the k/v pairs 
> emitted by the mapper, so that sort, spill, and IFile serialization can all be 
> done in native code. Preliminary tests (on a Xeon E5410, jdk6u24) showed promising 
> results:
> 1. Sort is about 3x-10x as fast as Java (only binary string comparison is 
> supported).
> 2. IFile serialization is about 3x the speed of Java, roughly 500MB/s; if hardware 
> CRC32C is used, it can get much faster (1G/
> 3. Merge code is not completed yet, so the tests use a large enough io.sort.mb to 
> prevent mid-spill.
> This leads to a total speedup of 2x~3x for the whole MapTask when an 
> IdentityMapper (a mapper that does nothing) is used.
> There are limitations, of course: currently only Text and BytesWritable are 
> supported, and I have not yet thought through many things, such as how to 
> support map-side combine. I had some discussion with people familiar with 
> Hive, and it seems these limitations won't be much of a problem for Hive, at 
> least, to benefit from these optimizations. Advice or discussion about 
> improving compatibility is most welcome :) 
> Currently NativeMapOutputCollector has a static method called canEnable(), 
> which checks whether the key/value types, comparator type, and combiner are all 
> compatible; MapTask can then choose to enable NativeMapOutputCollector.
> This is only a preliminary test and more work needs to be done. I expect better 
> final results, and I believe similar optimizations can be applied to the reduce 
> task and shuffle too. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6074) native-task: fix release audit, javadoc, javac warnings

2014-09-05 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124276#comment-14124276
 ] 

Sean Zhong commented on MAPREDUCE-6074:
---

It looks good. +1. 

I will clean up the CustomModule in another subtask. 


> native-task: fix release audit, javadoc, javac warnings
> ---
>
> Key: MAPREDUCE-6074
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6074
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
> Attachments: mapreduce-6074.txt, mapreduce-6074.txt
>
>
> RAT is showing some release audit warnings. They all look spurious; we just 
> need to do a little cleanup and add excludes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAPREDUCE-6077) Remove CustomModule examples in nativetask

2014-09-05 Thread Sean Zhong (JIRA)
Sean Zhong created MAPREDUCE-6077:
-

 Summary: Remove CustomModule examples in nativetask
 Key: MAPREDUCE-6077
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6077
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
Reporter: Sean Zhong
Assignee: Sean Zhong
Priority: Minor


Currently, we don't need to support custom key types. So, this module can be 
removed for now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6077) Remove CustomModule examples in nativetask

2014-09-05 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated MAPREDUCE-6077:
--
Attachment: MAPREDUCE-6077.patch

Attaching a patch for this.

> Remove CustomModule examples in nativetask
> --
>
> Key: MAPREDUCE-6077
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6077
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Sean Zhong
>Assignee: Sean Zhong
>Priority: Minor
> Attachments: MAPREDUCE-6077.patch
>
>
> Currently, we don't need to support custom key types. So, this module can be 
> removed for now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-2841) Task level native optimization

2014-09-05 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated MAPREDUCE-2841:
--
Attachment: mr-2841-merge-3.patch

Used git diff to generate this patch.
It also includes the fixes from MAPREDUCE-6077 and MAPREDUCE-6074.

> Task level native optimization
> --
>
> Key: MAPREDUCE-2841
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: task
> Environment: x86-64 Linux/Unix
>Reporter: Binglin Chang
>Assignee: Sean Zhong
> Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, 
> MAPREDUCE-2841.v2.patch, MR-2841benchmarks.pdf, dualpivot-0.patch, 
> dualpivotv20-0.patch, fb-shuffle.patch, 
> hadoop-3.0-mapreduce-2841-2014-7-17.patch, micro-benchmark.txt, 
> mr-2841-merge-2.txt, mr-2841-merge-3.patch, mr-2841-merge.txt
>
>
> I'm currently working on a JNI-based native optimization for MapTask. 
> The basic idea is to add a NativeMapOutputCollector to handle the k/v pairs 
> emitted by the mapper, so that sort, spill, and IFile serialization can all be 
> done in native code. Preliminary tests (on a Xeon E5410, jdk6u24) showed promising 
> results:
> 1. Sort is about 3x-10x as fast as Java (only binary string comparison is 
> supported).
> 2. IFile serialization is about 3x the speed of Java, roughly 500MB/s; if hardware 
> CRC32C is used, it can get much faster (1G/
> 3. Merge code is not completed yet, so the tests use a large enough io.sort.mb to 
> prevent mid-spill.
> This leads to a total speedup of 2x~3x for the whole MapTask when an 
> IdentityMapper (a mapper that does nothing) is used.
> There are limitations, of course: currently only Text and BytesWritable are 
> supported, and I have not yet thought through many things, such as how to 
> support map-side combine. I had some discussion with people familiar with 
> Hive, and it seems these limitations won't be much of a problem for Hive, at 
> least, to benefit from these optimizations. Advice or discussion about 
> improving compatibility is most welcome :) 
> Currently NativeMapOutputCollector has a static method called canEnable(), 
> which checks whether the key/value types, comparator type, and combiner are all 
> compatible; MapTask can then choose to enable NativeMapOutputCollector.
> This is only a preliminary test and more work needs to be done. I expect better 
> final results, and I believe similar optimizations can be applied to the reduce 
> task and shuffle too. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-2841) Task level native optimization

2014-09-05 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated MAPREDUCE-2841:
--
Status: In Progress  (was: Patch Available)

> Task level native optimization
> --
>
> Key: MAPREDUCE-2841
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: task
> Environment: x86-64 Linux/Unix
>Reporter: Binglin Chang
>Assignee: Sean Zhong
> Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, 
> MAPREDUCE-2841.v2.patch, MR-2841benchmarks.pdf, dualpivot-0.patch, 
> dualpivotv20-0.patch, fb-shuffle.patch, 
> hadoop-3.0-mapreduce-2841-2014-7-17.patch, micro-benchmark.txt, 
> mr-2841-merge-2.txt, mr-2841-merge-3.patch, mr-2841-merge.txt
>
>
> I'm currently working on a JNI-based native optimization for MapTask. 
> The basic idea is to add a NativeMapOutputCollector to handle the k/v pairs 
> emitted by the mapper, so that sort, spill, and IFile serialization can all be 
> done in native code. Preliminary tests (on a Xeon E5410, jdk6u24) showed promising 
> results:
> 1. Sort is about 3x-10x as fast as Java (only binary string comparison is 
> supported).
> 2. IFile serialization is about 3x the speed of Java, roughly 500MB/s; if hardware 
> CRC32C is used, it can get much faster (1G/
> 3. Merge code is not completed yet, so the tests use a large enough io.sort.mb to 
> prevent mid-spill.
> This leads to a total speedup of 2x~3x for the whole MapTask when an 
> IdentityMapper (a mapper that does nothing) is used.
> There are limitations, of course: currently only Text and BytesWritable are 
> supported, and I have not yet thought through many things, such as how to 
> support map-side combine. I had some discussion with people familiar with 
> Hive, and it seems these limitations won't be much of a problem for Hive, at 
> least, to benefit from these optimizations. Advice or discussion about 
> improving compatibility is most welcome :) 
> Currently NativeMapOutputCollector has a static method called canEnable(), 
> which checks whether the key/value types, comparator type, and combiner are all 
> compatible; MapTask can then choose to enable NativeMapOutputCollector.
> This is only a preliminary test and more work needs to be done. I expect better 
> final results, and I believe similar optimizations can be applied to the reduce 
> task and shuffle too. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-2841) Task level native optimization

2014-09-05 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated MAPREDUCE-2841:
--
Status: Patch Available  (was: In Progress)

> Task level native optimization
> --
>
> Key: MAPREDUCE-2841
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: task
> Environment: x86-64 Linux/Unix
>Reporter: Binglin Chang
>Assignee: Sean Zhong
> Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, 
> MAPREDUCE-2841.v2.patch, MR-2841benchmarks.pdf, dualpivot-0.patch, 
> dualpivotv20-0.patch, fb-shuffle.patch, 
> hadoop-3.0-mapreduce-2841-2014-7-17.patch, micro-benchmark.txt, 
> mr-2841-merge-2.txt, mr-2841-merge-3.patch, mr-2841-merge.txt
>
>
> I'm currently working on a JNI-based native optimization for MapTask. 
> The basic idea is to add a NativeMapOutputCollector to handle the k/v pairs 
> emitted by the mapper, so that sort, spill, and IFile serialization can all be 
> done in native code. Preliminary tests (on a Xeon E5410, jdk6u24) showed promising 
> results:
> 1. Sort is about 3x-10x as fast as Java (only binary string comparison is 
> supported).
> 2. IFile serialization is about 3x the speed of Java, roughly 500MB/s; if hardware 
> CRC32C is used, it can get much faster (1G/
> 3. Merge code is not completed yet, so the tests use a large enough io.sort.mb to 
> prevent mid-spill.
> This leads to a total speedup of 2x~3x for the whole MapTask when an 
> IdentityMapper (a mapper that does nothing) is used.
> There are limitations, of course: currently only Text and BytesWritable are 
> supported, and I have not yet thought through many things, such as how to 
> support map-side combine. I had some discussion with people familiar with 
> Hive, and it seems these limitations won't be much of a problem for Hive, at 
> least, to benefit from these optimizations. Advice or discussion about 
> improving compatibility is most welcome :) 
> Currently NativeMapOutputCollector has a static method called canEnable(), 
> which checks whether the key/value types, comparator type, and combiner are all 
> compatible; MapTask can then choose to enable NativeMapOutputCollector.
> This is only a preliminary test and more work needs to be done. I expect better 
> final results, and I believe similar optimizations can be applied to the reduce 
> task and shuffle too. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MAPREDUCE-6077) Remove CustomModule examples in nativetask

2014-09-06 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong resolved MAPREDUCE-6077.
---
Resolution: Fixed

committed.

> Remove CustomModule examples in nativetask
> --
>
> Key: MAPREDUCE-6077
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6077
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Sean Zhong
>Assignee: Sean Zhong
>Priority: Minor
> Attachments: MAPREDUCE-6077.patch
>
>
> Currently, we don't need to support custom key types. So, this module can be 
> removed for now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-2841) Task level native optimization

2014-09-06 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated MAPREDUCE-2841:
--
Status: In Progress  (was: Patch Available)

> Task level native optimization
> --
>
> Key: MAPREDUCE-2841
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: task
> Environment: x86-64 Linux/Unix
>Reporter: Binglin Chang
>Assignee: Sean Zhong
> Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, 
> MAPREDUCE-2841.v2.patch, MR-2841benchmarks.pdf, dualpivot-0.patch, 
> dualpivotv20-0.patch, fb-shuffle.patch, 
> hadoop-3.0-mapreduce-2841-2014-7-17.patch, micro-benchmark.txt, 
> mr-2841-merge-2.txt, mr-2841-merge-3.patch, mr-2841-merge.txt
>
>
> I'm currently working on a JNI-based native optimization for MapTask. 
> The basic idea is to add a NativeMapOutputCollector to handle the k/v pairs 
> emitted by the mapper, so that sort, spill, and IFile serialization can all be 
> done in native code. Preliminary tests (on a Xeon E5410, jdk6u24) showed promising 
> results:
> 1. Sort is about 3x-10x as fast as Java (only binary string comparison is 
> supported).
> 2. IFile serialization is about 3x the speed of Java, roughly 500MB/s; if hardware 
> CRC32C is used, it can get much faster (1G/
> 3. Merge code is not completed yet, so the tests use a large enough io.sort.mb to 
> prevent mid-spill.
> This leads to a total speedup of 2x~3x for the whole MapTask when an 
> IdentityMapper (a mapper that does nothing) is used.
> There are limitations, of course: currently only Text and BytesWritable are 
> supported, and I have not yet thought through many things, such as how to 
> support map-side combine. I had some discussion with people familiar with 
> Hive, and it seems these limitations won't be much of a problem for Hive, at 
> least, to benefit from these optimizations. Advice or discussion about 
> improving compatibility is most welcome :) 
> Currently NativeMapOutputCollector has a static method called canEnable(), 
> which checks whether the key/value types, comparator type, and combiner are all 
> compatible; MapTask can then choose to enable NativeMapOutputCollector.
> This is only a preliminary test and more work needs to be done. I expect better 
> final results, and I believe similar optimizations can be applied to the reduce 
> task and shuffle too. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-2841) Task level native optimization

2014-09-06 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated MAPREDUCE-2841:
--
Status: Patch Available  (was: In Progress)

> Task level native optimization
> --
>
> Key: MAPREDUCE-2841
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: task
> Environment: x86-64 Linux/Unix
>Reporter: Binglin Chang
>Assignee: Sean Zhong
> Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, 
> MAPREDUCE-2841.v2.patch, MR-2841benchmarks.pdf, dualpivot-0.patch, 
> dualpivotv20-0.patch, fb-shuffle.patch, 
> hadoop-3.0-mapreduce-2841-2014-7-17.patch, micro-benchmark.txt, 
> mr-2841-merge-2.txt, mr-2841-merge-3.patch, mr-2841-merge.txt
>
>
> I'm currently working on a JNI-based native optimization for MapTask. 
> The basic idea is to add a NativeMapOutputCollector to handle the k/v pairs 
> emitted by the mapper, so that sort, spill, and IFile serialization can all be 
> done in native code. Preliminary tests (on a Xeon E5410, jdk6u24) showed promising 
> results:
> 1. Sort is about 3x-10x as fast as Java (only binary string comparison is 
> supported).
> 2. IFile serialization is about 3x the speed of Java, roughly 500MB/s; if hardware 
> CRC32C is used, it can get much faster (1G/
> 3. Merge code is not completed yet, so the tests use a large enough io.sort.mb to 
> prevent mid-spill.
> This leads to a total speedup of 2x~3x for the whole MapTask when an 
> IdentityMapper (a mapper that does nothing) is used.
> There are limitations, of course: currently only Text and BytesWritable are 
> supported, and I have not yet thought through many things, such as how to 
> support map-side combine. I had some discussion with people familiar with 
> Hive, and it seems these limitations won't be much of a problem for Hive, at 
> least, to benefit from these optimizations. Advice or discussion about 
> improving compatibility is most welcome :) 
> Currently NativeMapOutputCollector has a static method called canEnable(), 
> which checks whether the key/value types, comparator type, and combiner are all 
> compatible; MapTask can then choose to enable NativeMapOutputCollector.
> This is only a preliminary test and more work needs to be done. I expect better 
> final results, and I believe similar optimizations can be applied to the reduce 
> task and shuffle too. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-2841) Task level native optimization

2014-09-06 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated MAPREDUCE-2841:
--
Status: Open  (was: Patch Available)

> Task level native optimization
> --
>
> Key: MAPREDUCE-2841
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: task
> Environment: x86-64 Linux/Unix
>Reporter: Binglin Chang
>Assignee: Sean Zhong
> Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, 
> MAPREDUCE-2841.v2.patch, MR-2841benchmarks.pdf, dualpivot-0.patch, 
> dualpivotv20-0.patch, fb-shuffle.patch, 
> hadoop-3.0-mapreduce-2841-2014-7-17.patch, micro-benchmark.txt, 
> mr-2841-merge-2.txt, mr-2841-merge-3.patch, mr-2841-merge.txt
>
>
> I'm currently working on a JNI-based native optimization for MapTask. 
> The basic idea is to add a NativeMapOutputCollector to handle the k/v pairs 
> emitted by the mapper, so that sort, spill, and IFile serialization can all be 
> done in native code. Preliminary tests (on a Xeon E5410, jdk6u24) showed promising 
> results:
> 1. Sort is about 3x-10x as fast as Java (only binary string comparison is 
> supported).
> 2. IFile serialization is about 3x the speed of Java, roughly 500MB/s; if hardware 
> CRC32C is used, it can get much faster (1G/
> 3. Merge code is not completed yet, so the tests use a large enough io.sort.mb to 
> prevent mid-spill.
> This leads to a total speedup of 2x~3x for the whole MapTask when an 
> IdentityMapper (a mapper that does nothing) is used.
> There are limitations, of course: currently only Text and BytesWritable are 
> supported, and I have not yet thought through many things, such as how to 
> support map-side combine. I had some discussion with people familiar with 
> Hive, and it seems these limitations won't be much of a problem for Hive, at 
> least, to benefit from these optimizations. Advice or discussion about 
> improving compatibility is most welcome :) 
> Currently NativeMapOutputCollector has a static method called canEnable(), 
> which checks whether the key/value types, comparator type, and combiner are all 
> compatible; MapTask can then choose to enable NativeMapOutputCollector.
> This is only a preliminary test and more work needs to be done. I expect better 
> final results, and I believe similar optimizations can be applied to the reduce 
> task and shuffle too. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-2841) Task level native optimization

2014-09-06 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated MAPREDUCE-2841:
--
Attachment: mr-2841-merge-4.patch

Rebased the patch against trunk.

> Task level native optimization
> --
>
> Key: MAPREDUCE-2841
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: task
> Environment: x86-64 Linux/Unix
>Reporter: Binglin Chang
>Assignee: Sean Zhong
> Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, 
> MAPREDUCE-2841.v2.patch, MR-2841benchmarks.pdf, dualpivot-0.patch, 
> dualpivotv20-0.patch, fb-shuffle.patch, 
> hadoop-3.0-mapreduce-2841-2014-7-17.patch, micro-benchmark.txt, 
> mr-2841-merge-2.txt, mr-2841-merge-3.patch, mr-2841-merge-4.patch, 
> mr-2841-merge.txt
>
>
> I'm currently working on a JNI-based native optimization for MapTask. 
> The basic idea is to add a NativeMapOutputCollector to handle the k/v pairs 
> emitted by the mapper, so that sort, spill, and IFile serialization can all be 
> done in native code. Preliminary tests (on a Xeon E5410, jdk6u24) showed promising 
> results:
> 1. Sort is about 3x-10x as fast as Java (only binary string comparison is 
> supported).
> 2. IFile serialization is about 3x the speed of Java, roughly 500MB/s; if hardware 
> CRC32C is used, it can get much faster (1G/
> 3. Merge code is not completed yet, so the tests use a large enough io.sort.mb to 
> prevent mid-spill.
> This leads to a total speedup of 2x~3x for the whole MapTask when an 
> IdentityMapper (a mapper that does nothing) is used.
> There are limitations, of course: currently only Text and BytesWritable are 
> supported, and I have not yet thought through many things, such as how to 
> support map-side combine. I had some discussion with people familiar with 
> Hive, and it seems these limitations won't be much of a problem for Hive, at 
> least, to benefit from these optimizations. Advice or discussion about 
> improving compatibility is most welcome :) 
> Currently NativeMapOutputCollector has a static method called canEnable(), 
> which checks whether the key/value types, comparator type, and combiner are all 
> compatible; MapTask can then choose to enable NativeMapOutputCollector.
> This is only a preliminary test and more work needs to be done. I expect better 
> final results, and I believe similar optimizations can be applied to the reduce 
> task and shuffle too. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-2841) Task level native optimization

2014-09-06 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated MAPREDUCE-2841:
--
Status: Patch Available  (was: Open)

Rebased the patch against trunk.

> Task level native optimization
> --
>
> Key: MAPREDUCE-2841
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: task
> Environment: x86-64 Linux/Unix
>Reporter: Binglin Chang
>Assignee: Sean Zhong
> Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, 
> MAPREDUCE-2841.v2.patch, MR-2841benchmarks.pdf, dualpivot-0.patch, 
> dualpivotv20-0.patch, fb-shuffle.patch, 
> hadoop-3.0-mapreduce-2841-2014-7-17.patch, micro-benchmark.txt, 
> mr-2841-merge-2.txt, mr-2841-merge-3.patch, mr-2841-merge-4.patch, 
> mr-2841-merge.txt
>
>
> I'm currently working on a JNI-based native optimization for MapTask. 
> The basic idea is to add a NativeMapOutputCollector to handle the k/v pairs 
> emitted by the mapper, so that sort, spill, and IFile serialization can all be 
> done in native code. Preliminary tests (on a Xeon E5410, jdk6u24) showed promising 
> results:
> 1. Sort is about 3x-10x as fast as Java (only binary string comparison is 
> supported).
> 2. IFile serialization is about 3x the speed of Java, roughly 500MB/s; if hardware 
> CRC32C is used, it can get much faster (1G/
> 3. Merge code is not completed yet, so the tests use a large enough io.sort.mb to 
> prevent mid-spill.
> This leads to a total speedup of 2x~3x for the whole MapTask when an 
> IdentityMapper (a mapper that does nothing) is used.
> There are limitations, of course: currently only Text and BytesWritable are 
> supported, and I have not yet thought through many things, such as how to 
> support map-side combine. I had some discussion with people familiar with 
> Hive, and it seems these limitations won't be much of a problem for Hive, at 
> least, to benefit from these optimizations. Advice or discussion about 
> improving compatibility is most welcome :) 
> Currently NativeMapOutputCollector has a static method called canEnable(), 
> which checks whether the key/value types, comparator type, and combiner are all 
> compatible; MapTask can then choose to enable NativeMapOutputCollector.
> This is only a preliminary test and more work needs to be done. I expect better 
> final results, and I believe similar optimizations can be applied to the reduce 
> task and shuffle too. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6078) native-task: fix gtest build on macosx

2014-09-10 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128222#comment-14128222
 ] 

Sean Zhong commented on MAPREDUCE-6078:
---

Hi Binglin,

The docs say the expression in the else and endif clauses is optional.

From http://www.cmake.org/cmake/help/v2.8.8/cmake.html#command:if
{quote}
Evaluates the given expression. If the result is true, the commands in the THEN 
section are invoked. Otherwise, the commands in the else section are invoked. 
The elseif and else sections are optional. You may have multiple elseif 
clauses. *** Note that the expression in the else and endif clause is 
optional.*** Long expressions can be used and there is a traditional order of 
precedence.
{quote}
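
As a concrete illustration of the quoted rule (a minimal sketch, not taken from the nativetask CMakeLists.txt), both of the following spellings are accepted by CMake:

{code}
# Illustrative only; not from the nativetask build files.

if(UNIX AND NOT APPLE)
  message(STATUS "building for Linux/Unix")
else()                            # empty expression is fine
  message(STATUS "building for another platform")
endif()                           # empty expression is fine

if(UNIX AND NOT APPLE)
  message(STATUS "building for Linux/Unix")
else(UNIX AND NOT APPLE)          # repeating the condition is also legal
  message(STATUS "building for another platform")
endif(UNIX AND NOT APPLE)         # repeating the condition is also legal
{code}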

> native-task: fix gtest build on macosx
> --
>
> Key: MAPREDUCE-6078
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6078
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: task
>Reporter: Binglin Chang
>Assignee: Binglin Chang
>Priority: Trivial
> Attachments: MAPREDUCE-6078.v1.patch
>
>
> I tried to compile the HEAD code on macOS, but it failed. It looks like 
> MAPREDUCE-5977 separated the gtest compile from nttest in order to suppress 
> compile warnings, but forgot that the additional compile flags added to nttest 
> are also required for the gtest build; this patch fixes that. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-2841) Task level native optimization

2014-09-13 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133079#comment-14133079
 ] 

Sean Zhong commented on MAPREDUCE-2841:
---

Thanks, everyone!

> Task level native optimization
> --
>
> Key: MAPREDUCE-2841
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: task
> Environment: x86-64 Linux/Unix
>Reporter: Binglin Chang
>Assignee: Sean Zhong
> Fix For: 3.0.0
>
> Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, 
> MAPREDUCE-2841.v2.patch, MR-2841benchmarks.pdf, dualpivot-0.patch, 
> dualpivotv20-0.patch, fb-shuffle.patch, 
> hadoop-3.0-mapreduce-2841-2014-7-17.patch, micro-benchmark.txt, 
> mr-2841-merge-2.txt, mr-2841-merge-3.patch, mr-2841-merge-4.patch, 
> mr-2841-merge.txt
>
>
> I'm currently working on a JNI-based native optimization for MapTask. 
> The basic idea is to add a NativeMapOutputCollector to handle the k/v pairs 
> emitted by the mapper, so that sort, spill, and IFile serialization can all be 
> done in native code. Preliminary tests (on a Xeon E5410, jdk6u24) showed promising 
> results:
> 1. Sort is about 3x-10x as fast as Java (only binary string comparison is 
> supported).
> 2. IFile serialization is about 3x the speed of Java, roughly 500MB/s; if hardware 
> CRC32C is used, it can get much faster (1G/
> 3. Merge code is not completed yet, so the tests use a large enough io.sort.mb to 
> prevent mid-spill.
> This leads to a total speedup of 2x~3x for the whole MapTask when an 
> IdentityMapper (a mapper that does nothing) is used.
> There are limitations, of course: currently only Text and BytesWritable are 
> supported, and I have not yet thought through many things, such as how to 
> support map-side combine. I had some discussion with people familiar with 
> Hive, and it seems these limitations won't be much of a problem for Hive, at 
> least, to benefit from these optimizations. Advice or discussion about 
> improving compatibility is most welcome :) 
> Currently NativeMapOutputCollector has a static method called canEnable(), 
> which checks whether the key/value types, comparator type, and combiner are all 
> compatible; MapTask can then choose to enable NativeMapOutputCollector.
> This is only a preliminary test and more work needs to be done. I expect better 
> final results, and I believe similar optimizations can be applied to the reduce 
> task and shuffle too. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)