Re: collecting CPU, mem, iops of hadoop jobs

2011-12-20 Thread He Chen
You may want to look at Ganglia. It is a cluster monitoring tool.
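
If you just need the raw numbers, below is a minimal sketch of pulling the metrics XML that a gmond daemon publishes, assuming a stock Ganglia setup with gmond listening on its default TCP port 8649 (the host name is a placeholder). Ganglia reports per-host CPU/memory/IO, so correlating the samples with a particular job is still up to you.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.Socket;

// Dump the cluster metrics XML that gmond serves on its TCP accept channel.
public class GangliaDump {
  public static void main(String[] args) throws Exception {
    String host = args.length > 0 ? args[0] : "gmond-host";  // placeholder host
    Socket s = new Socket(host, 8649);  // gmond sends the XML and then closes the socket
    BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line);  // per-host cpu_*, mem_*, load and network metrics
    }
    in.close();
    s.close();
  }
}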

On Tue, Dec 20, 2011 at 2:44 PM, Patai Sangbutsarakum 
silvianhad...@gmail.com wrote:

 Hi Hadoopers,

 We're running Hadoop 0.20 on CentOS 5.5. I am looking for a way to collect
 the CPU time, memory usage, and IOPS of each Hadoop job.
 What would be a good starting point? A document? An API?

 Thanks in advance
 -P



Re: More cores Vs More Nodes ?

2011-12-13 Thread He Chen
Hi Brad

This is a really interesting experiment. I am curious why you did not use 32
nodes with 2 cores each; that would make the number of CPU cores in the two
groups equal.

Chen

On Tue, Dec 13, 2011 at 7:15 PM, Brad Sarsfield b...@bing.com wrote:

 Hi Prashant,

 In each case I had a single tasktracker per node. I oversubscribed the
 total tasks per tasktracker/node by 1.5 x # of cores.

 So for the 64 core allocation comparison.
In A: 8 cores; Each machine had a single tasktracker with 8 maps /
 4 reduce slots for 12 task slots total per machine x 8 machines (including
 head node)
    In B: 2 cores; Each machine had a single tasktracker with 2
 maps / 1 reduce slots for 3 slots total per machine x 29 machines
 (including the head node, which was running 8 cores)

 The experiment was done in a cloud hosted environment running set of VMs.

 ~Brad

 -Original Message-
 From: Prashant Kommireddi [mailto:prash1...@gmail.com]
 Sent: Tuesday, December 13, 2011 9:46 AM
 To: common-user@hadoop.apache.org
 Subject: Re: More cores Vs More Nodes ?

 Hi Brad, how many taskstrackers did you have on each node in both cases?

 Thanks,
 Prashant

 Sent from my iPhone

 On Dec 13, 2011, at 9:42 AM, Brad Sarsfield b...@bing.com wrote:

  Praveenesh,
 
  Your question is not naïve; in fact, optimal hardware design can
 ultimately be a very difficult question to answer on what would be
 better. If you made me pick one without much information I'd go for more
 machines.  But...
 
  It all depends; and there is no right answer :)
 
  More machines
 +May run your workload faster
 +Will give you a higher degree of reliability protection from node /
 hardware / hard drive failure.
 +More aggregate IO capabilities
 -Capex / opex may be higher than allocating more cores

  More cores
 +May run your workload faster
 +More cores may allow for more tasks to run on the same machine
 +More cores/tasks may reduce network contention, increasing
 task-to-task data flow performance.
 
  Notice that "May run your workload faster" appears in both lists, as it can be very
 workload dependent.
 
  My Experience:
  I did a recent experiment and found that given the same number of cores
 (64) with the exact same network / machine configuration;
 A: I had 8 machines with 8 cores
 B: I had 28 machines with 2 cores (and 1x8 core head node)
 
  B was able to outperform A by 2x using teragen and terasort. These
 machines were running in a virtualized environment; where some of the IO
 capabilities behind the scenes were being regulated to 400Mbps per node
 when running in the 2 core configuration vs 1Gbps on the 8 core.  So I
 would expect the non-throttled scenario to work even better.
 
  ~Brad
 
 
  -Original Message-
  From: praveenesh kumar [mailto:praveen...@gmail.com]
  Sent: Monday, December 12, 2011 8:51 PM
  To: common-user@hadoop.apache.org
  Subject: More cores Vs More Nodes ?
 
  Hey Guys,
 
  So I have a very naive question in my mind regarding Hadoop cluster
 nodes ?
 
  More cores or more nodes: shall I spend money on going from 2-core to 4-core
 machines, or spend it on buying more nodes with fewer cores each, e.g. two
 machines of 2 cores instead?
 
  Thanks,
  Praveenesh
 




Re: Matrix multiplication in Hadoop

2011-11-19 Thread He Chen
Did you try Hama?

There are many methods:

1) Use Hadoop MPI, which lets you run MPI MM code on top of Hadoop;

2) Hama is designed for MM;

3) Use pure Hadoop Java MapReduce.

I did this before, though it may not be an optimal algorithm: put your first
matrix in the DistributedCache and take each line of the second matrix as an
input split. For each line, a mapper multiplies that row vector with the matrix
in the DistributedCache, and a reducer collects the result matrix. This
algorithm is limited by the DistributedCache size, so it is suitable for
multiplying a small matrix with a huge matrix.
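
For what it's worth, here is a minimal sketch of the mapper for option 3). It is a
variation where the matrix in the DistributedCache is the right-hand operand and each
input line is one row of the left-hand matrix, so each map() call emits one row of the
product. It assumes the cached matrix fits in task memory and that both matrices are
stored as text, one row per line in the form "rowIndex<TAB>space-separated values";
the class name and file layout are made up for illustration, and the old 0.20 API is
used. An identity reducer (or a map-only job) is enough to collect the rows.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Computes A x B where B (the small matrix) is shipped in the DistributedCache
// and each input line is one row of A.
public class RowTimesCachedMatrixMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, Text> {

  private double[][] b;  // the cached matrix, loaded once per task

  @Override
  public void configure(JobConf conf) {
    try {
      Path[] cached = DistributedCache.getLocalCacheFiles(conf);
      List<double[]> rows = new ArrayList<double[]>();
      BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
      String line;
      while ((line = in.readLine()) != null) {
        String[] tok = line.trim().split("\\s+");
        double[] r = new double[tok.length];
        for (int j = 0; j < tok.length; j++) r[j] = Double.parseDouble(tok[j]);
        rows.add(r);
      }
      in.close();
      b = rows.toArray(new double[rows.size()][]);
    } catch (IOException e) {
      throw new RuntimeException("could not load cached matrix", e);
    }
  }

  @Override
  public void map(LongWritable key, Text value,
                  OutputCollector<LongWritable, Text> out, Reporter reporter)
      throws IOException {
    String[] parts = value.toString().split("\t");
    long rowIndex = Long.parseLong(parts[0]);
    String[] tok = parts[1].trim().split("\\s+");
    double[] aRow = new double[tok.length];
    for (int k = 0; k < tok.length; k++) aRow[k] = Double.parseDouble(tok[k]);

    // one row of A times the whole cached matrix B -> one row of the product
    int n = b[0].length;
    StringBuilder sb = new StringBuilder();
    for (int j = 0; j < n; j++) {
      double sum = 0.0;
      for (int k = 0; k < aRow.length; k++) sum += aRow[k] * b[k][j];
      if (j > 0) sb.append(' ');
      sb.append(sum);
    }
    out.collect(new LongWritable(rowIndex), new Text(sb.toString()));
  }
}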

Chen
On Sat, Nov 19, 2011 at 10:34 AM, Tim Broberg tim.brob...@exar.com wrote:

 Perhaps this is a good candidate for a native library, then?

 
 From: Mike Davis [xmikeda...@gmail.com]
 Sent: Friday, November 18, 2011 7:39 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Matrix multiplication in Hadoop

  On Friday, November 18, 2011, Mike Spreitzer mspre...@us.ibm.com wrote:
   Why is matrix multiplication ill-suited for Hadoop?

 IMHO, a huge issue here is the JVM's inability to fully support cpu vendor
 specific SIMD instructions and, by extension, optimized BLAS routines.
 Running a large MM task using intel's MKL rather than relying on generic
 compiler optimization is orders of magnitude faster on a single multicore
 processor. I see almost no way that Hadoop could win such a CPU intensive
 task against an mpi cluster with even a tenth of the nodes running with a
 decently tuned BLAS library. Racing even against a single CPU might be
 difficult, given the i/o overhead.

 Still, it's a reasonably common problem and we shouldn't murder the good in
 favor of the best. I'm certain a MM/LinAlg Hadoop library with even
 mediocre performance, wrt C, would get used.

 --
 Mike Davis




Re: Matrix multiplication in Hadoop

2011-11-19 Thread He Chen
Right, I agree with Edward Capriolo, Hadoop + GPGPU is a better choice.



On Sat, Nov 19, 2011 at 10:53 AM, Edward Capriolo edlinuxg...@gmail.comwrote:

 Sounds like a job for next gen map reduce native libraries and gpu's. A
 modern day Dr frankenstein for sure.

 On Saturday, November 19, 2011, Tim Broberg tim.brob...@exar.com wrote:
  Perhaps this is a good candidate for a native library, then?
 
  
  From: Mike Davis [xmikeda...@gmail.com]
  Sent: Friday, November 18, 2011 7:39 PM
  To: common-user@hadoop.apache.org
  Subject: Re: Matrix multiplication in Hadoop
 
  On Friday, November 18, 2011, Mike Spreitzer mspre...@us.ibm.com
 wrote:
   Why is matrix multiplication ill-suited for Hadoop?
 
  IMHO, a huge issue here is the JVM's inability to fully support cpu
 vendor
  specific SIMD instructions and, by extension, optimized BLAS routines.
  Running a large MM task using intel's MKL rather than relying on generic
  compiler optimization is orders of magnitude faster on a single multicore
  processor. I see almost no way that Hadoop could win such a CPU intensive
  task against an mpi cluster with even a tenth of the nodes running with a
  decently tuned BLAS library. Racing even against a single CPU might be
  difficult, given the i/o overhead.
 
  Still, it's a reasonably common problem and we shouldn't murder the good
 in
  favor of the best. I'm certain a MM/LinAlg Hadoop library with even
  mediocre performance, wrt C, would get used.
 
  --
  Mike Davis
 
 



Re: HDFS file into Blocks

2011-09-25 Thread He Chen
Hi

It is interesting that a guy from Huawei is also working on the Hadoop project.
:)

Chen

On Sun, Sep 25, 2011 at 11:29 PM, Uma Maheswara Rao G 72686 
mahesw...@huawei.com wrote:


 Hi,

  You can find the code in DFSOutputStream.java.
  There is a DataStreamer thread there. This thread picks packets from the dataQueue
 and writes them to the sockets.
  Before that, when actually writing the chunks, it uses the block size parameter
 passed from the client to set the last-packet flag in Packet.
  If the streamer thread finds that it is the last packet of the block, it ends the
 block; that means it closes the sockets that were used for writing the block.
  The streamer thread then repeats the loop. When it finds there are no open sockets,
 it creates the pipeline for the next block again.
  Go through the flow from writeChunk() in DFSOutputStream.java, which is exactly
 where the packets are enqueued into the dataQueue.

 Regards,
 Uma
 - Original Message -
 From: kartheek muthyala kartheek0...@gmail.com
 Date: Sunday, September 25, 2011 11:06 am
 Subject: HDFS file into Blocks
 To: common-user@hadoop.apache.org

  Hi all,
  I am working around the code to understand where HDFS divides a
  file into
  blocks. Can anyone point me to this section of the code?
  Thanks,
  Kartheek
 



Re: phases of Hadoop Jobs

2011-09-19 Thread He Chen
Hi Kai

Thank you  for the reply.

 The reduce() will not start because the shuffle phase has not finished, and
the shuffle phase will not finish until all mappers end.

I am curious about the design purpose of overlapping the map and reduce
stages. Was this only to save shuffle time, or are there other reasons?

Best wishes!

Chen
On Mon, Sep 19, 2011 at 12:36 AM, Kai Voigt k...@123.org wrote:

 Hi Chen,

 the times when nodes running instances of the map and reduce nodes overlap.
 But map() and reduce() execution will not.

 reduce nodes will start copying data from map nodes, that's the shuffle
 phase. And the map nodes are still running during that copy phase. My
 observation had been that if the map phase progresses from 0 to 100%, it
 matches with the reduce phase progress from 0-33%. For example, if you map
 progress shows 60%, reduce might show 20%.

 But the reduce() will not start until all the map() code has processed the
 entire input. So you will never see the reduce progress higher than 66% when
 map progress didn't reach 100%.

 If you see map phase reaching 100%, but reduce phase not making any higher
 number than 66%, it means your reduce() code is broken or slow because it
 doesn't produce any output. An infinite loop is a common mistake.

 Kai

 Am 19.09.2011 um 07:29 schrieb He Chen:

  Hi Arun
 
  I have a question. Do you know what is the reason that hadoop allows the
 map
  and the reduce stage overlap? Or anyone knows about it. Thank you in
  advance.
 
  Chen
 
  On Sun, Sep 18, 2011 at 11:17 PM, Arun C Murthy a...@hortonworks.com
 wrote:
 
  Nan,
 
  The 'phase' is implicitly understood by the 'progress' (value) made by
 the
  map/reduce tasks (see o.a.h.mapred.TaskStatus.Phase).
 
  For e.g.
  Reduce:
  0-33% - Shuffle
  34-66% - Sort (actually, just 'merge', there is no sort in the reduce
  since all map-outputs are sorted)
  67-100% - Reduce
 
  With 0.23 onwards the Map has phases too:
  0-90% - Map
  91-100% - Final Sort/merge
 
  Now,about starting reduces early - this is done to ensure shuffle can
  proceed for completed maps while rest of the maps run, there-by
 pipelining
  shuffle and map completion. There is a 'reduce slowstart' feature to
 control
  this - by default, reduces aren't started until 5% of maps are complete.
  Users can set this higher.
 
  Arun
 
  On Sep 18, 2011, at 7:24 PM, Nan Zhu wrote:
 
  Hi, all
 
  recently, I was hit by a question, how is a hadoop job divided into 2
  phases?,
 
  In textbooks, we are told that the mapreduce jobs are divided into 2
  phases,
  map and reduce, and for reduce, we further divided it into 3 stages,
  shuffle, sort, and reduce, but in hadoop codes, I never think about
  this question, I didn't see any variable members in JobInProgress class
  to indicate this information,
 
  and according to my understanding on the source code of hadoop, the
  reduce
  tasks are unnecessarily started until all mappers are finished, in
  constract, we can see the reduce tasks are in shuffle stage while there
  are
  mappers which are still in running,
  So how can I indicate the phase which the job is belonging to?
 
  Thanks
  --
  Nan Zhu
  School of Electronic, Information and Electrical Engineering,229
  Shanghai Jiao Tong University
  800,Dongchuan Road,Shanghai,China
  E-Mail: zhunans...@gmail.com
 
 

  --
 Kai Voigt
 k...@123.org







Re: phases of Hadoop Jobs

2011-09-19 Thread He Chen
Or we could just separate shuffle from the reduce stage and integrate it into the
map stage. Then we could clearly differentiate the map stage (before shuffle
finishes) and the reduce stage (after shuffle finishes).


On Mon, Sep 19, 2011 at 1:20 AM, He Chen airb...@gmail.com wrote:

 Hi Kai

 Thank you  for the reply.

  The reduce() will not start because the shuffle phase has not finished, and
 the shuffle phase will not finish until all mappers end.

 I am curious about the design purpose of overlapping the map and reduce
 stages. Was this only to save shuffle time, or are there other reasons?

 Best wishes!

 Chen
   On Mon, Sep 19, 2011 at 12:36 AM, Kai Voigt k...@123.org wrote:

 Hi Chen,

 the times when nodes running instances of the map and reduce nodes
 overlap. But map() and reduce() execution will not.

 reduce nodes will start copying data from map nodes, that's the shuffle
 phase. And the map nodes are still running during that copy phase. My
 observation had been that if the map phase progresses from 0 to 100%, it
 matches with the reduce phase progress from 0-33%. For example, if you map
 progress shows 60%, reduce might show 20%.

 But the reduce() will not start until all the map() code has processed the
 entire input. So you will never see the reduce progress higher than 66% when
 map progress didn't reach 100%.

 If you see map phase reaching 100%, but reduce phase not making any higher
 number than 66%, it means your reduce() code is broken or slow because it
  doesn't produce any output. An infinite loop is a common mistake.

 Kai

 Am 19.09.2011 um 07:29 schrieb He Chen:

  Hi Arun
 
  I have a question. Do you know what is the reason that hadoop allows the
 map
  and the reduce stage overlap? Or anyone knows about it. Thank you in
  advance.
 
  Chen
 
  On Sun, Sep 18, 2011 at 11:17 PM, Arun C Murthy a...@hortonworks.com
 wrote:
 
  Nan,
 
  The 'phase' is implicitly understood by the 'progress' (value) made by
 the
  map/reduce tasks (see o.a.h.mapred.TaskStatus.Phase).
 
  For e.g.
  Reduce:
  0-33% - Shuffle
  34-66% - Sort (actually, just 'merge', there is no sort in the reduce
  since all map-outputs are sorted)
  67-100% - Reduce
 
  With 0.23 onwards the Map has phases too:
  0-90% - Map
  91-100% - Final Sort/merge
 
  Now,about starting reduces early - this is done to ensure shuffle can
  proceed for completed maps while rest of the maps run, there-by
 pipelining
  shuffle and map completion. There is a 'reduce slowstart' feature to
 control
  this - by default, reduces aren't started until 5% of maps are
 complete.
  Users can set this higher.
 
  Arun
 
  On Sep 18, 2011, at 7:24 PM, Nan Zhu wrote:
 
  Hi, all
 
  recently, I was hit by a question, how is a hadoop job divided into 2
  phases?,
 
  In textbooks, we are told that the mapreduce jobs are divided into 2
  phases,
  map and reduce, and for reduce, we further divided it into 3 stages,
  shuffle, sort, and reduce, but in hadoop codes, I never think about
  this question, I didn't see any variable members in JobInProgress
 class
  to indicate this information,
 
  and according to my understanding on the source code of hadoop, the
  reduce
  tasks are unnecessarily started until all mappers are finished, in
  constract, we can see the reduce tasks are in shuffle stage while
 there
  are
  mappers which are still in running,
  So how can I indicate the phase which the job is belonging to?
 
  Thanks
  --
  Nan Zhu
  School of Electronic, Information and Electrical Engineering,229
  Shanghai Jiao Tong University
  800,Dongchuan Road,Shanghai,China
  E-Mail: zhunans...@gmail.com
 
 

  --
 Kai Voigt
 k...@123.org








Re: phases of Hadoop Jobs

2011-09-18 Thread He Chen
Hi Nan

I have had the same question for a while. In some research papers, people like
to make the reduce stage slow-start. In this way, the map stage and the
reduce stage are easy to differentiate: you can use the number of remaining
unallocated map tasks to detect which stage your job is in.

Letting the reduce stage overlap with the map stage blurs the boundary
between the two stages. I think it may decrease the execution time of the whole
job (I am not sure whether this is the main reason that people allow the early
start or not).

However, the early start has its side effects. It is hard to get a global view
of the map stage's output, and then the reduce stage's load balance and data
locality are not easy to solve.
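
For reference, the knob behind this on the 0.20 branch is
mapred.reduce.slowstart.completed.maps, the fraction of maps that must complete
before reducers are launched. A sketch of forcing a strict map-then-reduce
separation for an experiment:

  <!-- mapred-site.xml, or per job: do not launch reducers until every map is done -->
  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>1.00</value>
  </property>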

Chen
Research Assistant of Holland Computing Center
PhD student of CSE Department
University of Nebraska-Lincoln


On Sun, Sep 18, 2011 at 9:24 PM, Nan Zhu zhunans...@gmail.com wrote:

 Hi, all

  recently, I was hit by a question, how is a hadoop job divided into 2
 phases?,

 In textbooks, we are told that the mapreduce jobs are divided into 2
 phases,
 map and reduce, and for reduce, we further divided it into 3 stages,
 shuffle, sort, and reduce, but in hadoop codes, I never think about
 this question, I didn't see any variable members in JobInProgress class
 to indicate this information,

 and according to my understanding on the source code of hadoop, the reduce
 tasks are unnecessarily started until all mappers are finished, in
 constract, we can see the reduce tasks are in shuffle stage while there are
 mappers which are still in running,
 So how can I indicate the phase which the job is belonging to?

 Thanks
 --
 Nan Zhu
 School of Electronic, Information and Electrical Engineering,229
 Shanghai Jiao Tong University
 800,Dongchuan Road,Shanghai,China
 E-Mail: zhunans...@gmail.com



Re: phases of Hadoop Jobs

2011-09-18 Thread He Chen
Hi Arun

I have a question: do you know the reason that Hadoop allows the map stage and
the reduce stage to overlap? Or does anyone else know? Thank you in
advance.

Chen

On Sun, Sep 18, 2011 at 11:17 PM, Arun C Murthy a...@hortonworks.com wrote:

 Nan,

  The 'phase' is implicitly understood by the 'progress' (value) made by the
 map/reduce tasks (see o.a.h.mapred.TaskStatus.Phase).

  For e.g.
  Reduce:
  0-33% - Shuffle
  34-66% - Sort (actually, just 'merge', there is no sort in the reduce
 since all map-outputs are sorted)
  67-100% - Reduce

  With 0.23 onwards the Map has phases too:
  0-90% - Map
  91-100% - Final Sort/merge

  Now, about starting reduces early: this is done to ensure shuffle can
 proceed for completed maps while the rest of the maps run, thereby pipelining
 shuffle and map completion. There is a 'reduce slowstart' feature to control
 this - by default, reduces aren't started until 5% of maps are complete.
 Users can set this higher.

 Arun

 On Sep 18, 2011, at 7:24 PM, Nan Zhu wrote:

  Hi, all
 
  recently, I was hit by a question, how is a hadoop job divided into 2
  phases?,
 
  In textbooks, we are told that the mapreduce jobs are divided into 2
 phases,
  map and reduce, and for reduce, we further divided it into 3 stages,
  shuffle, sort, and reduce, but in hadoop codes, I never think about
  this question, I didn't see any variable members in JobInProgress class
  to indicate this information,
 
  and according to my understanding on the source code of hadoop, the
 reduce
  tasks are unnecessarily started until all mappers are finished, in
  constract, we can see the reduce tasks are in shuffle stage while there
 are
  mappers which are still in running,
  So how can I indicate the phase which the job is belonging to?
 
  Thanks
  --
  Nan Zhu
  School of Electronic, Information and Electrical Engineering,229
  Shanghai Jiao Tong University
  800,Dongchuan Road,Shanghai,China
  E-Mail: zhunans...@gmail.com




Re: Poor IO performance on a 10 node cluster.

2011-05-30 Thread He Chen
Hi Gyuribácsi

I would suggest you divide MapReduce program execution time into 3 parts

a) Map Stage
In this stage, the input data of your wc job is split and map tasks are generated.
Each map task processes one block (by default; you can change this in
FileInputFormat.java). As Brian said, if you have a larger block size, you may
have fewer map tasks, and then probably less overhead.

b) Reduce Stage
2) shuffle phase
 In this phase, each reduce task collects intermediate results from every node
that has executed map tasks. Each reduce task can use many concurrent threads
to obtain data (you can configure this in mapred-site.xml; it is
mapreduce.reduce.shuffle.parallelcopies). But be careful about your data
popularity. For example, say you have Hadoop, Hadoop, Hadoop, hello. The
default Hadoop partitioner will assign the 3 <Hadoop, 1> key-value pairs to one
node. Thus, if you have two nodes running reduce tasks, one of them will copy 3
times more data than the other, which will make that node slower than the
other. You may rewrite the partitioner (a sketch follows after the list).

3) sort and reduce phase
I think the Hadoop UI will give you some hints about how long this phase
takes.
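
For illustration, here is a sketch of the kind of partitioner rewrite mentioned in
2): give known heavy keys their own reduce task and hash everything else over the
remaining reducers. The class name and the hot-key list are made up, and it uses the
old 0.20 API; set it on the job with conf.setPartitionerClass(HotKeyPartitioner.class).

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Skew-aware partitioner: known hot keys get a dedicated reduce task,
// everything else is hash-partitioned over the remaining reducers.
public class HotKeyPartitioner implements Partitioner<Text, IntWritable> {

  // Hypothetical list of keys known to dominate the map output.
  private static final String[] HOT_KEYS = { "Hadoop" };
  private Map<String, Integer> hotIndex;

  @Override
  public void configure(JobConf job) {
    hotIndex = new HashMap<String, Integer>();
    for (int i = 0; i < HOT_KEYS.length; i++) {
      hotIndex.put(HOT_KEYS[i], i);  // reducer i is reserved for hot key i
    }
  }

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    Integer reserved = hotIndex.get(key.toString());
    if (reserved != null && reserved < numPartitions - 1) {
      return reserved;  // dedicated reducer for this hot key
    }
    // hash everything else over the reducers that are not reserved
    int firstShared = Math.min(HOT_KEYS.length, numPartitions - 1);
    int shared = numPartitions - firstShared;
    return firstShared + (key.hashCode() & Integer.MAX_VALUE) % shared;
  }
}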

By dividing the MapReduce application into these 3 parts, you can easily find
which one is your bottleneck and do some profiling. And I don't know why my
font changed to this type. :(

Hope it will be helpful.
Chen

On Mon, May 30, 2011 at 12:32 PM, Harsh J ha...@cloudera.com wrote:

 Psst. The cats speak in their own language ;-)

 On Mon, May 30, 2011 at 10:31 PM, James Seigel ja...@tynt.com wrote:
  Not sure that will help ;)
 
  Sent from my mobile. Please excuse the typos.
 
  On 2011-05-30, at 9:23 AM, Boris Aleksandrovsky balek...@gmail.com
 wrote:
 
 
 Ljddfjfjfififfifjftjiifjfjjjffkxbznzsjxodiewisshsudddudsjidhddueiweefiuftttoitfiirriifoiffkllddiririiriioerorooiieirrioeekroooeoooirjjfdijdkkduddjudiiehs
  On May 30, 2011 5:28 AM, Gyuribácsi bogyo...@gmail.com wrote:
 
 
  Hi,
 
  I have a 10 node cluster (IBM blade servers, 48GB RAM, 2x500GB Disk, 16
 HT
  cores).
 
  I've uploaded 10 files to HDFS. Each file is 10GB. I used the streaming
  jar
  with 'wc -l' as mapper and 'cat' as reducer.
 
  I use 64MB block size and the default replication (3).
 
   The wc on the 100 GB took about 220 seconds, which translates to about 3.5
  Gbit/sec processing speed. One disk can do sequential reads at 1 Gbit/sec, so
  I would expect something around 20 Gbit/sec (minus some overhead), and I'm
  getting only 3.5.

   Is my expectation valid?
 
   I checked the jobtracker and it seems all nodes are working, each reading
  the right blocks. I have not played with the number of mappers and reducers
  yet. It seems the number of mappers is the same as the number of blocks and
  the number of reducers is 20 (there are 20 disks). This looks OK to me.

   We also did an experiment with TestDFSIO, with similar results: aggregated
  read IO speed is around 3.5 Gbit/sec. It is just too far from my
  expectation :(
 
  Please help!
 
  Thank you,
  Gyorgy
  --
  View this message in context:
 
 http://old.nabble.com/Poor-IO-performance-on-a-10-node-cluster.-tp31732971p31732971.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 



 --
 Harsh J



How can I let datanode do not save blocks from other datanodes after a MapReduce job

2011-05-25 Thread He Chen
Hi all,

I remember there is a parameter that can turn this off; I mean, not allow the
tasktracker to keep blocks from other datanodes after a MapReduce job has
finished.

I met a problem when using hadoop-0.21.0.

First of all, I balanced the cluster according to the number of blocks on every
datanode. That is to say, for example, under /user/test/ I have 100 blocks of
data. The replication number is 2, so there are 200 blocks in total under
/user/test/. I have 10 datanodes, and what I did was let every datanode hold
20 of those blocks.

However, after about 300 MapReduce jobs finished, I found that the number of
blocks on the datanodes had changed. It is no longer 20 for every datanode;
some have 21 and some have 19. I had turned off the Hadoop balancer.

What is the reason for this behavior? Any suggestion will be appreciated!

Best wishes!

Chen


Change block size from 64M to 128M does not work on Hadoop-0.21

2011-05-04 Thread He Chen
Hi all

I met a problem when changing the block size from 64M to 128M. I am sure I
modified the correct configuration file (hdfs-site.xml), because I can change
the replication number correctly. However, changing the block size does not
take effect.

For example:

I change the dfs.block.size to 134217728 bytes.

I upload a file which is 128M and use fsck to find how many blocks this
file has. It shows:
/user/file1/file 134217726 bytes, 2 blocks(s): OK
0. blk_xx len=67108864 repl=2 [192.168.0.3:50010, 192.168.0.32:50010
]
1. blk_xx len=67108862 repl=2 [192.168.0.9:50010, 192.168.0.8:50010]

The hadoop version is 0.21. Any suggestion will be appreciated!

thanks

Chen


Re: Change block size from 64M to 128M does not work on Hadoop-0.21

2011-05-04 Thread He Chen
Hi Harsh

Thank you for the reply.

Actually, the hadoop directory is on my NFS server; every node reads the
same file from the NFS server, so I think this is not the problem.

I like your second solution, but I am not sure whether the namenode
will divide those 128MB blocks into smaller ones in the future or not.

Chen

On Wed, May 4, 2011 at 3:00 PM, Harsh J ha...@cloudera.com wrote:

 Your client (put) machine must have the same block size configuration
 during upload as well.

 Alternatively, you may do something explicit like `hadoop dfs
 -Ddfs.block.size=size -put file file`

 On Thu, May 5, 2011 at 12:59 AM, He Chen airb...@gmail.com wrote:
  Hi all
 
  I met a problem about changing block size from 64M to 128M. I am sure I
  modified the correct configuration file hdfs-site.xml. Because I can
 change
  the replication number correctly. However, it does not work on block size
  changing.
 
  For example:
 
  I change the dfs.block.size to 134217728 bytes.
 
  I upload a file which is 128M and use fsck to find how many blocks this
  file has. It shows:
  /user/file1/file 134217726 bytes, 2 blocks(s): OK
  0. blk_xx len=67108864 repl=2 [192.168.0.3:50010,
 192.168.0.32:50010
  ]
  1. blk_xx len=67108862 repl=2 [192.168.0.9:50010,
 192.168.0.8:50010]
 
  The hadoop version is 0.21. Any suggestion will be appreciated!
 
  thanks
 
  Chen
 



 --
 Harsh J



Re: Change block size from 64M to 128M does not work on Hadoop-0.21

2011-05-04 Thread He Chen
Tried second solution. Does not work, still 2 64M blocks. h

On Wed, May 4, 2011 at 3:16 PM, He Chen airb...@gmail.com wrote:

 Hi Harsh

 Thank you for the reply.

 Actually, the hadoop directory is on my NFS server, every node reads the
 same file from NFS server. I think this is not a problem.

 I like your second solution. But I am not sure, whether the namenode
 will divide those 128MB

  blocks to smaller ones in future or not.

 Chen

 On Wed, May 4, 2011 at 3:00 PM, Harsh J ha...@cloudera.com wrote:

 Your client (put) machine must have the same block size configuration
 during upload as well.

 Alternatively, you may do something explicit like `hadoop dfs
 -Ddfs.block.size=size -put file file`

 On Thu, May 5, 2011 at 12:59 AM, He Chen airb...@gmail.com wrote:
  Hi all
 
  I met a problem about changing block size from 64M to 128M. I am sure I
  modified the correct configuration file hdfs-site.xml. Because I can
 change
  the replication number correctly. However, it does not work on block
 size
  changing.
 
  For example:
 
  I change the dfs.block.size to 134217728 bytes.
 
  I upload a file which is 128M and use fsck to find how many blocks
 this
  file has. It shows:
  /user/file1/file 134217726 bytes, 2 blocks(s): OK
  0. blk_xx len=67108864 repl=2 [192.168.0.3:50010,
 192.168.0.32:50010
  ]
  1. blk_xx len=67108862 repl=2 [192.168.0.9:50010,
 192.168.0.8:50010]
 
  The hadoop version is 0.21. Any suggestion will be appreciated!
 
  thanks
 
  Chen
 



 --
 Harsh J





Re: Change block size from 64M to 128M does not work on Hadoop-0.21

2011-05-04 Thread He Chen
Got it. Thank you, Harsh. BTW,
it is `hadoop dfs -Ddfs.blocksize=size -put file file`; no dot between
block and size.
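
For anyone who hits the same thing, a minimal example of the working form,
assuming a 0.21 client and a 128MB target block size (file names are placeholders):

  hadoop dfs -Ddfs.blocksize=134217728 -put myfile /user/test/myfile
  hadoop fsck /user/test/myfile -files -blocks

The same value can also be set persistently in the client's hdfs-site.xml; on 0.21
the key is dfs.blocksize, while dfs.block.size is the older 0.20-era name.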

On Wed, May 4, 2011 at 3:18 PM, He Chen airb...@gmail.com wrote:

 Tried second solution. Does not work, still 2 64M blocks. h


 On Wed, May 4, 2011 at 3:16 PM, He Chen airb...@gmail.com wrote:

 Hi Harsh

 Thank you for the reply.

 Actually, the hadoop directory is on my NFS server, every node reads the
 same file from NFS server. I think this is not a problem.

 I like your second solution. But I am not sure, whether the namenode
 will divide those 128MB

  blocks to smaller ones in future or not.

 Chen

 On Wed, May 4, 2011 at 3:00 PM, Harsh J ha...@cloudera.com wrote:

 Your client (put) machine must have the same block size configuration
 during upload as well.

 Alternatively, you may do something explicit like `hadoop dfs
 -Ddfs.block.size=size -put file file`

 On Thu, May 5, 2011 at 12:59 AM, He Chen airb...@gmail.com wrote:
  Hi all
 
  I met a problem about changing block size from 64M to 128M. I am sure I
  modified the correct configuration file hdfs-site.xml. Because I can
 change
  the replication number correctly. However, it does not work on block
 size
  changing.
 
  For example:
 
  I change the dfs.block.size to 134217728 bytes.
 
  I upload a file which is 128M and use fsck to find how many blocks
 this
  file has. It shows:
  /user/file1/file 134217726 bytes, 2 blocks(s): OK
  0. blk_xx len=67108864 repl=2 [192.168.0.3:50010,
 192.168.0.32:50010
  ]
  1. blk_xx len=67108862 repl=2 [192.168.0.9:50010,
 192.168.0.8:50010]
 
  The hadoop version is 0.21. Any suggestion will be appreciated!
 
  thanks
 
  Chen
 



 --
 Harsh J






Apply HADOOP-4667 to branch-0.20

2011-04-26 Thread He Chen
Hey everyone

I tried to apply the HADOOP-4667 patch to branch-0.20, but it always failed.
My cluster is based on branch-0.20 and I want to test the delay scheduling
method's performance, but I do not want to re-format HDFS.

So I tried to apply HADOOP-4667 to branch-0.20. Has anyone done this before,
or does anyone have a suggestion? Thank you in advance.

Best wishes!

Chen


Any one know where to get Hadoop production cluster log

2011-03-16 Thread He Chen
Hi all

I am working on a Hadoop scheduler, but I do not know where to get logs from
Hadoop production clusters. Any suggestions?

Bests

Chen


Re: Not able to Run C++ code in Hadoop Cluster

2011-03-14 Thread He Chen
I agree with Keith Wiley; we use streaming as well.
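
For reference, a minimal streaming invocation of the kind we use, with placeholder
paths and binary name; -file ships the native executable into each task's working
directory:

  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input /user/test/input \
    -output /user/test/output \
    -mapper ./mybinary \
    -reducer /bin/cat \
    -file mybinary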

On Mon, Mar 14, 2011 at 11:40 AM, Keith Wiley kwi...@keithwiley.com wrote:

 Not to speak against pipes because I don't have much experience with it,
 but I eventually abandoned my pipes efforts and went with streaming.  If you
 don't get pipes to work, you might take a look at streaming as an
 alternative.

 Cheers!


 
 Keith Wiley kwi...@keithwiley.com keithwiley.com
 music.keithwiley.com

 I used to be with it, but then they changed what it was.  Now, what I'm
 with
 isn't it, and what's it seems weird and scary to me.
   --  Abe (Grandpa) Simpson

 




Anyone knows how to attach a figure on Hadoop Wiki page?

2011-03-14 Thread He Chen
Hi all

Any suggestions?

Bests

Chen


Re: Cuda Program in Hadoop Cluster

2011-03-09 Thread He Chen
Hi, Adarsh Sharma

For C code:

My friend employs Hadoop streaming to run CUDA C code. You can send an email to
him: p...@cse.unl.edu.

Best wishes!

Chen


On Thu, Mar 3, 2011 at 11:18 PM, Adarsh Sharma adarsh.sha...@orkash.comwrote:

 Dear all,

  I followed a fantastic tutorial and was able to run the Wordcount C++ program in
 a Hadoop cluster.


 http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_2.2_--_Running_C%2B%2B_Programs_on_Hadoop

  But now I want to run a CUDA program in the Hadoop cluster, and it results in
 errors.
  Has anyone done this before who can guide me on how to do it?

 I attached the both files. Please find the attachment.


  Thanks & best regards,

 Adarsh Sharma



Re: hadoop balancer

2011-03-04 Thread He Chen
Thank you very much, icebergs.

I rewrote the balancer. Now, given a directory like /user/foo/, I can balance
the blocks under that directory evenly across every node in the cluster.

Best wishes!

Chen

On Thu, Mar 3, 2011 at 11:14 PM, icebergs hkm...@gmail.com wrote:

 try this command
 hadoop fs -setrep -R -w 2 xx
 maybe help

 2011/3/2 He Chen airb...@gmail.com

  Hi all
 
   I met a problem when trying to balance a certain HDFS directory across the
   cluster. For example, I have a directory /user/xxx/ with 100 blocks, and I
   want to balance them across my 5-node cluster so that each node has 40
   blocks (2 replicas). The problem is transferring a block from one datanode
   to another. I followed the balancer's method, but it always waits for the
   response of the destination datanode and halts. I attached the code:
  .
 
    Socket sock = new Socket();
    DataOutputStream out = null;
    DataInputStream in = null;

    try {
      sock.connect(NetUtils.createSocketAddr(target.getName()),
                   HdfsConstants.READ_TIMEOUT);
      sock.setKeepAlive(true);
      System.out.println(sock.isConnected());

      out = new DataOutputStream(new BufferedOutputStream(
          sock.getOutputStream(), FSConstants.BUFFER_SIZE));

      // request header for OP_REPLACE_BLOCK
      out.writeShort(DataTransferProtocol.DATA_TRANSFER_VERSION);
      out.writeByte(DataTransferProtocol.OP_REPLACE_BLOCK);
      out.writeLong(block2move.getBlockId());
      out.writeLong(block2move.getGenerationStamp());
      Text.writeString(out, source.getStorageID());
      System.out.println("Ready to move");

      source.write(out);
      System.out.println("Write to output stream");

      out.flush();
      System.out.println("out has been flushed!");

      in = new DataInputStream(new BufferedInputStream(
          sock.getInputStream(), FSConstants.BUFFER_SIZE));

      // It stops here and waits for the response.
      short status = in.readShort();
      System.out.println("Got the response from input stream! " + status);

      if (status != DataTransferProtocol.OP_STATUS_SUCCESS) {
        throw new IOException("block move is failed\t" + status);
      }
    } catch (IOException e) {
      LOG.warn("Error moving block " + block2move.getBlockId() +
               " from " + source.getName() + " to " +
               target.getName() + " through " +
               source.getName() + ": " + e.toString());
    } finally {
      IOUtils.closeStream(out);
      IOUtils.closeStream(in);
      IOUtils.closeSocket(sock);
    }
  ..
 
  Any reply will be appreciated. Thank you in advance!
 
  Chen
 



Re: CUDA on Hadoop

2011-02-10 Thread He Chen
Thank you, Steve Loughran. I just created a new page on the Hadoop wiki; however,
how can I create a new document page on the Hadoop wiki?

Best wishes

Chen

On Thu, Feb 10, 2011 at 5:38 AM, Steve Loughran ste...@apache.org wrote:

 On 09/02/11 17:31, He Chen wrote:

 Hi sharma

 I shared our slides about CUDA performance on Hadoop clusters. Feel free
 to
 modified it, please mention the copyright!


 This is nice. If you stick it up online you should link to it from the
 Hadoop wiki pages -maybe start a hadoop+cuda page and refer to it




Re: CUDA on Hadoop

2011-02-09 Thread He Chen
Hi  Sharma

I have some experience working with hybrid Hadoop + GPU setups. Our group has
tested CUDA performance on Hadoop clusters. We obtained a 20x speedup and saved
up to 95% power consumption in some computation-intensive test cases.

You can parallelize your Java code by using JCuda, which is an API that helps
you call CUDA from your Java code.

Chen

On Wed, Feb 9, 2011 at 8:45 AM, Steve Loughran ste...@apache.org wrote:

 On 09/02/11 13:58, Harsh J wrote:

 You can check-out this project which did some work for Hama+CUDA:
 http://code.google.com/p/mrcl/


 Amazon let you bring up a Hadoop cluster on machines with GPUs you can code
 against, but I haven't heard of anyone using it. The big issue is bandwidth;
 it just doesn't make sense for a classic scan through the logs kind of
 problem as the disk:GPU bandwidth ratio is even worse than disk:CPU.

 That said, if you were doing something that involved a lot of compute on a
 block of data (e.g. rendering tiles in a map), this could work.



Re: Question about Hadoop Default FCFS Job Scheduler

2011-01-17 Thread He Chen
Hi Nan,

Thank you for the reply. I understand what you mean. What concerns me is that
inside the obtainNewLocalMapTask(...) method, it only assigns one task at a
time.

Now I understand why it only assigns one task at a time: it is because of the
outer loop:

for (i = 0; i < MapperCapacity; ++i) {

(..)

}

I mean, why does this loop exist here? Why does the scheduler use this type of
loop? It imposes overhead on the task assignment process if it only assigns one
task at a time. Obviously, a node could be assigned all the available local
tasks it can afford in one obtainNewLocalMapTask(...) method call.

Bests

Chen

On Mon, Jan 17, 2011 at 8:28 AM, Nan Zhu zhunans...@gmail.com wrote:

 Hi, Chen

 How is it going recently?

 Actually I think you misundertand the code in assignTasks() in
 JobQueueTaskScheduler.java, see the following structure of the interesting
 codes:

 //I'm sorry, I hacked the code so much, the name of the variables may be
 different from the original version

  for (i = 0; i < MapperCapacity; ++i) {
    ...
    for (JobInProgress job : jobQueue) {
      // try to schedule a node-local or rack-local map task
      // here is the interesting place
      t = job.obtainNewLocalMapTask(...);
      if (t != null) {
        ...
        break; // the break statement here sends the control flow back to
               // "for (job : jobQueue)", which means it restarts the map task
               // selection procedure from the first job; so it actually
               // schedules all of the first job's local mappers first, until
               // the map slots are full
      }
    }
  }

 BTW, we can only schedule a reduce task in a single heartbeat



 Best,
 Nan
 On Sat, Jan 15, 2011 at 1:45 PM, He Chen airb...@gmail.com wrote:

  Hey all
 
  Why does the FCFS scheduler only let a node chooses one task at a time in
  one job? In order to increase the data locality,
  it is reasonable to let a node to choose all its local tasks (if it can)
  from a job at a time.
 
  Any reply will be appreciated.
 
  Thanks
 
  Chen
 



Question about Hadoop Default FCFS Job Scheduler

2011-01-14 Thread He Chen
Hey all

Why does the FCFS scheduler only let a node choose one task at a time from a
job? To increase data locality,
it would be reasonable to let a node choose all of its local tasks (if it can)
from a job at once.

Any reply will be appreciated.

Thanks

Chen


Re: Why Hadoop uses HTTP for file transmission between Map and Reduce?

2011-01-13 Thread He Chen
Actually, PhEDEx uses GridFTP for its data transfers.

On Thu, Jan 13, 2011 at 5:34 AM, Steve Loughran ste...@apache.org wrote:

 On 13/01/11 08:34, li ping wrote:

 That is also my concerns. Is it efficient for data transmission.


 It's long lived TCP connections, reasonably efficient for bulk data xfer,
 has all the throttling of TCP built in, and comes with some excellently
 debugged client and server code in the form of jetty and httpclient. In
 maintenance costs alone, those libraries justify HTTP unless you have a
 vastly superior option *and are willing to maintain it forever*

 FTPs limits are well known (security), NFS limits well known (security, UDP
 version doesn't throttle), self developed protocols will have whatever
 problems you want.

 There are better protocols for long-haul data transfer over fat pipes, such
 as GridFTP , PhedEX ( http://www.gridpp.ac.uk/papers/ah05_phedex.pdf ),
 which use multiple TCP channels in parallel to reduce the impact of a single
 lost packet, but within a datacentre, you shouldn't have to worry about
 this. If you do find lots of packets get lost, raise the issue with the
 networking team.

 -Steve



 On Thu, Jan 13, 2011 at 4:27 PM, Nan Zhuzhunans...@gmail.com  wrote:

  Hi, all

  I have a question about the file transmission between the Map and Reduce
  stages. In the current implementation, the Reducers get the results generated
  by the Mappers through HTTP GET. I don't understand why HTTP was selected;
  why not FTP, or a self-developed protocol?

 Just for HTTP's simple?

 thanks

 Nan








FW:FW

2010-12-31 Thread He Chen
I bought some items from a commercial site, because of the unique
channel of purchases,
product prices unexpected, I think you can go to see: elesales.com ,
high-quality products can also attract you.


Jcuda on Hadoop

2010-12-09 Thread He Chen
Hello everyone, I've got a problem when I write a JCuda program based on Hadoop
MapReduce. I use the JCuda utilities; the KernelLauncherSample can be
successfully executed on my worker node. However, when I submit a program
containing JCuda to Hadoop MapReduce, I get the following errors. Any reply
will be appreciated!

10/12/09 15:41:39 INFO mapred.JobClient: Running job: job_201012091523_0002
10/12/09 15:41:40 INFO mapred.JobClient: map 0% reduce 0%
10/12/09 15:41:53 INFO mapred.JobClient: Task Id : attempt_201012091523_0002_m_00_0, Status : FAILED
Error: Could not load native library
attempt_201012091523_0002_m_00_0: [GC 7402K->1594K(12096K), 0.0045650 secs]
attempt_201012091523_0002_m_00_0: [GC 108666K->104610K(116800K), 0.0106860 secs]
attempt_201012091523_0002_m_00_0: [Full GC 104610K->104276K(129856K), 0.0482530 secs]
attempt_201012091523_0002_m_00_0: Error while loading native library with base name JCudaDriver
attempt_201012091523_0002_m_00_0: Operating system name: Linux
attempt_201012091523_0002_m_00_0: Architecture : amd64
attempt_201012091523_0002_m_00_0: Architecture bit size: 64
10/12/09 15:42:00 INFO mapred.JobClient: Task Id : attempt_201012091523_0002_m_00_1, Status : FAILED
Error: Could not load native library
attempt_201012091523_0002_m_00_1: [GC 7373K->1573K(18368K), 0.0045230 secs]
attempt_201012091523_0002_m_00_1: Error while loading native library with base name JCudaDriver
attempt_201012091523_0002_m_00_1: Operating system name: Linux
attempt_201012091523_0002_m_00_1: Architecture : amd64
attempt_201012091523_0002_m_00_1: Architecture bit size: 64

It looks like the JCuda library file cannot be successfully loaded. Actually, I
tried many combinations:
1) I include all the JCuda library files in my TaskTracker's classpath, and
also include the JCuda library files in my MapReduce program.
2) TaskTracker's classpath w/o the JCuda libraries, but my program contains them.
3) TaskTracker's classpath w/ the JCuda libraries, but my program w/o them.
All of them report the same error above.


Re: Jcuda on Hadoop

2010-12-09 Thread He Chen
I am still in the test stage; I mean I only start one JobTracker and one
TaskTracker.
I copied all the jcuda.*.jar files into HADOOP_HOME/lib/.

From `ps aux | grep java`

I can confirm that the JT and TT processes both have the jcuda.*.jar files on
their classpath.

On Thu, Dec 9, 2010 at 4:05 PM, Mathias Herberts mathias.herbe...@gmail.com
 wrote:

 You need to have the native libs on all tasktrackers and have
 java.library.path correctly set.
 On Dec 9, 2010 11:01 PM, He Chen airb...@gmail.com wrote:
  Hello everyone, I 've got a problem when I write some Jcuda program based
 on
  Hadoop MapReduce. I use the jcudaUtill. The KernelLauncherSample can be
  successfully executed on my worker node. However, When I submit a program
  containing jcuda to Hadoop MapReduce. I got following errors. Any reply
 will
  be appreciated! 10/12/09 15:41:39 INFO mapred.JobClient: Running job:
  job_201012091523_0002 10/12/09 15:41:40 INFO mapred.JobClient: map 0%
 reduce
  0% 10/12/09 15:41:53 INFO mapred.JobClient: Task Id :
  attempt_201012091523_0002_m_00_0, Status : FAILED Error: Could not
 load
  native library attempt_201012091523_0002_m_00_0: [GC
  7402K-1594K(12096K), 0.0045650 secs]
 attempt_201012091523_0002_m_00_0:
  [GC 108666K-104610K(116800K), 0.0106860 secs]
  attempt_201012091523_0002_m_00_0: [Full GC 104610K-104276K(129856K),
  0.0482530 secs] attempt_201012091523_0002_m_00_0: Error while loading
  native library with base name JCudaDriver
  attempt_201012091523_0002_m_00_0: Operating system name: Linux
  attempt_201012091523_0002_m_00_0: Architecture : amd64
  attempt_201012091523_0002_m_00_0: Architecture bit size: 64 10/12/09
  15:42:00 INFO mapred.JobClient: Task Id :
  attempt_201012091523_0002_m_00_1, Status : FAILED Error: Could not
 load
  native library attempt_201012091523_0002_m_00_1: [GC
  7373K-1573K(18368K), 0.0045230 secs]
 attempt_201012091523_0002_m_00_1:
  Error while loading native library with base name JCudaDriver
  attempt_201012091523_0002_m_00_1: Operating system name: Linux
  attempt_201012091523_0002_m_00_1: Architecture : amd64
  attempt_201012091523_0002_m_00_1: Architecture bit size: 64 It looks
  like The jcuda library file can not be sucessfully loaded. Actually, I
 tried
  many combinations. 1) I include all the jcuda library files in my
  TaskTracker's classpath, and also include jcuda library file into my
  mapreduce program. 2) TaskTracker's classpath w/o jcuda library, but my
  program contains them 3) TaskTracker's classpath w/ jcuda librara, but my
  program w/o them All of them report the same error above.



Re: Jcuda on Hadoop

2010-12-09 Thread He Chen
Thank you so much, Mathias Herberts

This really helps
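
For anyone hitting the same error, here is a sketch of the kind of fix Mathias
describes, assuming the JCuda native .so files have been copied to
/usr/local/jcuda/lib on every tasktracker (the path and heap size are just
placeholders, and check whether your setup already passes its own
-Djava.library.path to child JVMs):

  <!-- mapred-site.xml on every tasktracker: point the child task JVMs at the native libs -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx200m -Djava.library.path=/usr/local/jcuda/lib</value>
  </property>

The jars on the classpath are not enough on their own; the native libraries have
to be loadable from java.library.path on every node that runs tasks.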

On Thu, Dec 9, 2010 at 4:17 PM, He Chen airb...@gmail.com wrote:

 I am still in the test stage. I mean I only start one JobTracker and one
 TaskTracker.
 I copied all jcuda.*.jar into  HADOOP_HOME/lib/

 from the ps aux|grep java

 I can confirm the JT and TT processes all contain the jcuda.*.jar files


 On Thu, Dec 9, 2010 at 4:05 PM, Mathias Herberts 
 mathias.herbe...@gmail.com wrote:

 You need to have the native libs on all tasktrackers and have
 java.library.path correctly set.
 On Dec 9, 2010 11:01 PM, He Chen airb...@gmail.com wrote:
  Hello everyone, I 've got a problem when I write some Jcuda program
 based
 on
  Hadoop MapReduce. I use the jcudaUtill. The KernelLauncherSample can be
  successfully executed on my worker node. However, When I submit a
 program
  containing jcuda to Hadoop MapReduce. I got following errors. Any reply
 will
  be appreciated! 10/12/09 15:41:39 INFO mapred.JobClient: Running job:
  job_201012091523_0002 10/12/09 15:41:40 INFO mapred.JobClient: map 0%
 reduce
  0% 10/12/09 15:41:53 INFO mapred.JobClient: Task Id :
  attempt_201012091523_0002_m_00_0, Status : FAILED Error: Could not
 load
  native library attempt_201012091523_0002_m_00_0: [GC
  7402K-1594K(12096K), 0.0045650 secs]
 attempt_201012091523_0002_m_00_0:
  [GC 108666K-104610K(116800K), 0.0106860 secs]
  attempt_201012091523_0002_m_00_0: [Full GC
 104610K-104276K(129856K),
  0.0482530 secs] attempt_201012091523_0002_m_00_0: Error while
 loading
  native library with base name JCudaDriver
  attempt_201012091523_0002_m_00_0: Operating system name: Linux
  attempt_201012091523_0002_m_00_0: Architecture : amd64
  attempt_201012091523_0002_m_00_0: Architecture bit size: 64 10/12/09
  15:42:00 INFO mapred.JobClient: Task Id :
  attempt_201012091523_0002_m_00_1, Status : FAILED Error: Could not
 load
  native library attempt_201012091523_0002_m_00_1: [GC
  7373K-1573K(18368K), 0.0045230 secs]
 attempt_201012091523_0002_m_00_1:
  Error while loading native library with base name JCudaDriver
  attempt_201012091523_0002_m_00_1: Operating system name: Linux
  attempt_201012091523_0002_m_00_1: Architecture : amd64
  attempt_201012091523_0002_m_00_1: Architecture bit size: 64 It looks
  like The jcuda library file can not be sucessfully loaded. Actually, I
 tried
  many combinations. 1) I include all the jcuda library files in my
  TaskTracker's classpath, and also include jcuda library file into my
  mapreduce program. 2) TaskTracker's classpath w/o jcuda library, but my
  program contains them 3) TaskTracker's classpath w/ jcuda librara, but
 my
  program w/o them All of them report the same error above.





Re: Two questions.

2010-11-03 Thread He Chen
Both options 1 and 3 will work; a sketch of option 3 follows below.
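
For reference, a sketch of option 3 as it is usually done on the 0.20 line,
reusing Raj's file name; the exclude file is a plain local file on the namenode:

  <!-- conf/hdfs-site.xml (or hadoop-site.xml) on the namenode -->
  <property>
    <name>dfs.hosts.exclude</name>
    <value>/home/hadoop/myjob.exclude</value>
  </property>

/home/hadoop/myjob.exclude lists one hostname per line (hadoop-480 ... hadoop-511).
Then, on the namenode:

  bin/hadoop dfsadmin -refreshNodes

Note that dfs.hosts.exclude only decommissions the datanodes; to keep map and
reduce tasks off those machines you still need to stop their tasktrackers or use
mapred.hosts.exclude (the availability of a corresponding refresh command depends
on your version).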

On Wed, Nov 3, 2010 at 9:28 PM, James Seigel ja...@tynt.com wrote:

 Option 1 = good

 Sent from my mobile. Please excuse the typos.

 On 2010-11-03, at 8:27 PM, shangan shan...@corp.kaixin001.com wrote:

   I don't think the first two options can work; even if you stop the
  tasktracker, these to-be-retired nodes are still connected to the namenode.
   Option 3 can work. You only need to add this exclude file on the namenode,
  and it is a regular file. Add a key named dfs.hosts.exclude to your
  conf/hadoop-site.xml file; the value associated with this key provides the
  full path to a file on the NameNode's local file system which contains a list
  of machines that are not permitted to connect to HDFS.

   Then you can run the command bin/hadoop dfsadmin -refreshNodes, and the
  cluster will decommission the nodes in the exclude file. This might take a
  period of time, as the cluster needs to move data from those retired nodes to
  the remaining nodes.

   After this you can use these retired nodes as a new cluster. But remember to
  remove those nodes from the slaves file, and you can delete the exclude file
  afterward.
 
 
  2010-11-04
 
 
 
  shangan
 
 
 
 
  From: Raj V
  Sent: 2010-11-04 10:05:44
  To: common-user
  Cc:
  Subject: Two questions.
 
  1. I have a 512-node cluster. I need to have 32 nodes do something else. They
  can be datanodes, but I cannot run any map or reduce jobs on them. So I see
  three options:
  1. Stop the tasktracker on those nodes, leave the datanode running.
  2. Set mapred.tasktracker.reduce.tasks.maximum and
  mapred.tasktracker.map.tasks.maximum to 0 on these nodes and make these final.
  3. Use the parameter mapred.hosts.exclude.
  I am assuming that any of the three methods would work. To start with, I went
  with option 3. I used a local file /home/hadoop/myjob.exclude, and the file
  myjob.exclude had the hostname of one host per line (hadoop-480 .. hadoop-511).
  But I see both map and reduce jobs being scheduled to all the 511 nodes.
  I understand there is an inherent inefficiency in running only the datanode on
  these 32 nodes.
  Here are my questions:
  1. Will all three methods work?
  2. If I choose method 3, does this file exist as a DFS file or a regular file?
  If a regular file, does it need to exist on all the nodes or only on the node
  where the job is submitted?
  Many thanks in advance.
  Raj



Re: Can not upload local file to HDFS

2010-09-27 Thread He Chen
Thanks, but I think that goes a bit too far from the problem itself.

On Sun, Sep 26, 2010 at 11:43 AM, Nan Zhu zhunans...@gmail.com wrote:

 Have you ever check the log file in the directory?

 I always find some important information there,

 I suggest you to recompile hadoop with ant since mapred daemons also don't
 work

 Nan

 On Sun, Sep 26, 2010 at 7:29 PM, He Chen airb...@gmail.com wrote:

  The problem is every datanode may be listed in the error report. That
 means
  all my datanodes are bad?
 
  One thing I forgot to mention. I can not use start-all.sh and stop-all.sh
  to
  start and stop all dfs and mapred processes on my clusters. But the
  jobtracker and namenode web interface still work.
 
  I think I can solve this problem by ssh to every node and kill current
  hadoop processes and restart them again. The previous problem will also
 be
  solved( it's my opinion). But I really want to know why the HDFS reports
 me
  previous errors.
 
 
  On Sat, Sep 25, 2010 at 11:20 PM, Nan Zhu zhunans...@gmail.com wrote:
 
   Hi Chen,
  
   It seems that you have a bad datanode? maybe you should reformat them?
  
   Nan
  
   On Sun, Sep 26, 2010 at 10:42 AM, He Chen airb...@gmail.com wrote:
  
Hello Neil
   
No matter how big the file is. It always report this to me. The file
  size
is
from 10KB to 100MB.
   
On Sat, Sep 25, 2010 at 6:08 PM, Neil Ghosh neil.gh...@gmail.com
   wrote:
   
 How Big is the file? Did you try Formatting Name node and Datanode?

 On Sun, Sep 26, 2010 at 2:12 AM, He Chen airb...@gmail.com
 wrote:

  Hello everyone
 
  I can not load local file to HDFS. It gave the following errors.
 
  WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception
   for
 block
  blk_-236192853234282209_419415java.io.EOFException
 at
  java.io.DataInputStream.readFully(DataInputStream.java:197)
 at
  java.io.DataInputStream.readLong(DataInputStream.java:416)
 at
 
 

   
  
 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2397)
  10/09/25 15:38:25 WARN hdfs.DFSClient: Error Recovery for block
  blk_-236192853234282209_419415 bad datanode[0]
 192.168.0.23:50010
  10/09/25 15:38:25 WARN hdfs.DFSClient: Error Recovery for block
  blk_-236192853234282209_419415 in pipeline 192.168.0.23:50010,
  192.168.0.39:50010: bad datanode 192.168.0.23:50010
  Any response will be appreciated!
 
 
  
 




-- 
Best Wishes!
Best regards!

--
Chen He
(402)613-9298
PhD. student of CSE Dept.
Research Assistant of Holland Computing Center
University of Nebraska-Lincoln
Lincoln NE 68588


Re: Can not upload local file to HDFS

2010-09-26 Thread He Chen
The problem is that every datanode may be listed in the error report. Does that
mean all my datanodes are bad?

One thing I forgot to mention: I cannot use start-all.sh and stop-all.sh to
start and stop all the dfs and mapred processes on my cluster, but the
jobtracker and namenode web interfaces still work.

I think I can solve this problem by ssh-ing to every node, killing the current
Hadoop processes, and restarting them; the previous problem will probably also
be solved (that is my opinion). But I really want to know why HDFS reports the
previous errors.
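
A minimal sketch of the per-node restart I mean, assuming the usual 0.20 layout
(run over ssh on each node):

  cd $HADOOP_HOME
  bin/hadoop-daemon.sh stop tasktracker
  bin/hadoop-daemon.sh stop datanode
  bin/hadoop-daemon.sh start datanode
  bin/hadoop-daemon.sh start tasktracker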


On Sat, Sep 25, 2010 at 11:20 PM, Nan Zhu zhunans...@gmail.com wrote:

 Hi Chen,

 It seems that you have a bad datanode? maybe you should reformat them?

 Nan

 On Sun, Sep 26, 2010 at 10:42 AM, He Chen airb...@gmail.com wrote:

  Hello Neil
 
  No matter how big the file is. It always report this to me. The file size
  is
  from 10KB to 100MB.
 
  On Sat, Sep 25, 2010 at 6:08 PM, Neil Ghosh neil.gh...@gmail.com
 wrote:
 
   How Big is the file? Did you try Formatting Name node and Datanode?
  
   On Sun, Sep 26, 2010 at 2:12 AM, He Chen airb...@gmail.com wrote:
  
Hello everyone
   
I can not load local file to HDFS. It gave the following errors.
   
WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for
   block
blk_-236192853234282209_419415java.io.EOFException
   at java.io.DataInputStream.readFully(DataInputStream.java:197)
   at java.io.DataInputStream.readLong(DataInputStream.java:416)
   at
   
   
  
 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2397)
10/09/25 15:38:25 WARN hdfs.DFSClient: Error Recovery for block
blk_-236192853234282209_419415 bad datanode[0] 192.168.0.23:50010
10/09/25 15:38:25 WARN hdfs.DFSClient: Error Recovery for block
blk_-236192853234282209_419415 in pipeline 192.168.0.23:50010,
192.168.0.39:50010: bad datanode 192.168.0.23:50010
Any response will be appreciated!
   
   



Can not upload local file to HDFS

2010-09-25 Thread He Chen
Hello everyone

I cannot load a local file into HDFS. It gives the following errors.

WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block
blk_-236192853234282209_419415java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readLong(DataInputStream.java:416)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2397)
10/09/25 15:38:25 WARN hdfs.DFSClient: Error Recovery for block
blk_-236192853234282209_419415 bad datanode[0] 192.168.0.23:50010
10/09/25 15:38:25 WARN hdfs.DFSClient: Error Recovery for block
blk_-236192853234282209_419415 in pipeline 192.168.0.23:50010,
192.168.0.39:50010: bad datanode 192.168.0.23:50010
Any response will be appreciated!


-- 
Best Wishes!
Best regards!

--
Chen He


Re: Job performance issue: output.collect()

2010-09-01 Thread He Chen
Hey Oded Rosen

I am not sure what your map() method actually does. Intuitively, if the map()
output is the problem, move the computation from map() to reduce(): let map()
act only as a data reader and divider, and let reduce() do all of the
computation. That way the intermediate results are smaller than before, and
shuffle time can also be reduced.

If the computation is still slow, the bottleneck may not be the MapReduce
framework but your program. Hope this helps.
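
A rough sketch of what such a "reader/divider" map() could look like with the
old API; splitting on the first tab to form the grouping key is purely an
assumption about your input format:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Minimal pass-through mapper: emit the raw record keyed however the reducer
// needs to group it (here, hypothetically, the first tab-separated field),
// and leave the per-record expansion to the reduce side.
public class PassThroughMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String s = line.toString();
    int tab = s.indexOf('\t');
    String key = tab >= 0 ? s.substring(0, tab) : s;   // grouping key (assumed format)
    output.collect(new Text(key), line);               // no expansion in the map
  }
}

The point is only that nothing is expanded on the map side; the ~20x fan-out
per record then happens in reduce(), after the shuffle.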


Chen

On Wed, Sep 1, 2010 at 7:18 AM, Oded Rosen o...@legolas-media.com wrote:

 Hi all,

 My job (written in the old 0.18 API, but that's not the issue here) is
 producing large amounts of map output.
 Each map() call generates roughly 20 output.collect() calls, and each output
 is pretty big (~1K), so each map() produces about 20KB.
 All of this data is fed to a combiner that significantly reduces the output's
 size and record count.
 The job input is not that big: about 120M map input records.

 This job is pretty slow. Other jobs that work on the same input are much
 faster, since they do not produce so much output.
 Analyzing the job performance (timing the map() function parts), I've seen
 that much time is spent on the output.collect() line itself.

 I know that during the output.collect() call the output is written to local
 filesystem spills (when the spill buffer reaches an 80% limit), so I guessed
 that reducing the size of each output would improve performance.
 This was not the case - after cutting 30% of the map output size, the job
 took the same amount of time. The thing I cannot reduce is the number of
 output lines being written out of the map.

 I would like to know what happens in the output.collect() line that takes so
 much time, in order to cut down this job's running time.
 Please keep in mind that I have a combiner, and to my understanding
 different things happen to the map output when a combiner is present.

 Can anyone help me understand how can I save this precious time?
 Thanks,

 --
 Oded



Re: Best way to reduce a 8-node cluster in half and get hdfs to come out of safe mode

2010-08-06 Thread He Chen
Way#3

1) bring up all 8 dn and the nn
2) retire one of your 4 nodes:
   kill the datanode process
   hadoop dfsadmin -refreshNodes  (run this on the namenode)
3) repeat 2) for the other three nodes
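
If you prefer to drive this from code rather than the shell, the dfsadmin
commands are backed by the DFSAdmin tool class. A minimal sketch, assuming
roughly the 0.20 layout; it has to run with the namenode's configuration on
its classpath, and -refreshNodes makes the namenode re-read dfs.hosts and
dfs.hosts.exclude:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.tools.DFSAdmin;
import org.apache.hadoop.util.ToolRunner;

// Illustrative equivalent of "hadoop dfsadmin -refreshNodes".
public class RefreshNodes {
  public static void main(String[] args) throws Exception {
    int rc = ToolRunner.run(new Configuration(), new DFSAdmin(),
                            new String[] { "-refreshNodes" });
    System.exit(rc);
  }
}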

On Fri, Aug 6, 2010 at 1:21 AM, Allen Wittenauer
awittena...@linkedin.comwrote:


 On Aug 5, 2010, at 10:42 PM, Steve Kuo wrote:

  As part of our experimentation, the plan is to pull 4 slave nodes out of an
  8-slave/1-master cluster. With the replication factor set to 3, I thought
  losing half of the cluster might be too much for hdfs to recover from. Thus
  I copied all relevant data out of hdfs to local disk and reconfigured the
  cluster.

 It depends.  If you have configured Hadoop to have a topology such that the
 8 nodes were in 2 logical racks, then it would have worked just fine.  If
 you didn't have any topology configured, then each node is considered its
 own rack.  So pulling half of the grid down means you are likely losing a
 good chunk of all your blocks.




 
  The 4 slave nodes started okay but hdfs never left safe mode.  The nn.log
  has the following line.  What is the best way to deal with this?  Shall I
  restart the cluster with 8-node and then delete
  /data/hadoop-hadoop/mapred/system?  Or shall I reformat hdfs?

 Two ways to go:

 Way #1:

 1) configure dfs.hosts
 2) bring up all 8 nodes
 3) configure dfs.hosts.exclude to include the 4 you don't want
 4) dfsadmin -refreshNodes to start decommissioning the 4 you don't want

 Way #2:

 1) configure a topology
 2) bring up all 8 nodes
 3) setrep all files +1
 4) wait for nn to finish replication
 5) pull 4 nodes
 6) bring down nn
 7) remove topology
 8) bring nn up
 9) setrep -1






-- 
Best Wishes!
顺送商祺!

--
Chen He
(402)613-9298
PhD. student of CSE Dept.
Research Assistant of Holland Computing Center
University of Nebraska-Lincoln
Lincoln NE 68588


Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

2010-07-27 Thread He Chen
Hey Deepak Diwakar

Try to keep the /etc/hosts file the same across all your cluster nodes, and
see whether the problem disappears.

On Tue, Jul 27, 2010 at 2:31 PM, Deepak Diwakar ddeepa...@gmail.com wrote:

 Hey friends,

 I got stuck setting up an hdfs cluster and am getting this error while
 running the simple wordcount example (I did this 2 years back and had no
 problem then).

 Currently testing on hadoop-0.20.1 with 2 nodes; instructions followed from:
 http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29

 I checked the firewall settings and /etc/hosts; there is no issue there.
 Also, master and slave are accessible both ways.

 The input size is also very low (~3 MB), so there shouldn't be any issue
 with ulimit (it is set to 4096, by the way).

 I would be really thankful if anyone can guide me to resolve this.

 Thanks & regards,
 - Deepak Diwakar,




 On 28 June 2010 18:39, bmdevelopment bmdevelopm...@gmail.com wrote:

  Hi, Sorry for the cross-post. But just trying to see if anyone else
  has had this issue before.
  Thanks
 
 
  -- Forwarded message --
  From: bmdevelopment bmdevelopm...@gmail.com
  Date: Fri, Jun 25, 2010 at 10:56 AM
  Subject: Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES;
  bailing-out.
  To: mapreduce-u...@hadoop.apache.org
 
 
  Hello,
  Thanks so much for the reply.
  See inline.
 
  On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala yhema...@gmail.com
  wrote:
   Hi,
  
   I've been getting the following error when trying to run a very simple
   MapReduce job.
   Map finishes without problem, but error occurs as soon as it enters
   Reduce phase.
  
   10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
   attempt_201006241812_0001_r_00_0, Status : FAILED
   Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
  
   I am running a 5 node cluster and I believe I have all my settings
  correct:
  
   * ulimit -n 32768
   * DNS/RDNS configured properly
   * hdfs-site.xml : http://pastebin.com/xuZ17bPM
   * mapred-site.xml : http://pastebin.com/JraVQZcW
  
   The program is very simple - just counts a unique string in a log
 file.
   See here: http://pastebin.com/5uRG3SFL
  
   When I run, the job fails and I get the following output.
   http://pastebin.com/AhW6StEb
  
   However, runs fine when I do *not* use substring() on the value (see
   map function in code above).
  
   This runs fine and completes successfully:
  String str = val.toString();
  
   This causes error and fails:
  String str = val.toString().substring(0,10);
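
(As an aside, and separate from the shuffle error itself: if any input line is
shorter than 10 characters, substring(0, 10) throws
StringIndexOutOfBoundsException and fails the map task outright. A guarded
variant of that line, inside map(), would be:)

String s = val.toString();
String str = s.length() >= 10 ? s.substring(0, 10) : s;  // avoid out-of-bounds on short lines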
  
   Please let me know if you need any further information.
   It would be greatly appreciated if anyone could shed some light on
 this
  problem.
  
   It catches attention that changing the code to use a substring is
   causing a difference. Assuming it is consistent and not a red herring,
 
  Yes, this has been consistent over the last week. I was running 0.20.1
  first and then
  upgrade to 0.20.2 but results have been exactly the same.
 
   can you look at the counters for the two jobs using the JobTracker web
   UI - things like map records, bytes etc and see if there is a
   noticeable difference ?
 
  Ok, so here is the first job using write.set(value.toString()); having
  *no* errors:
  http://pastebin.com/xvy0iGwL
 
  And here is the second job using
  write.set(value.toString().substring(0, 10)); that fails:
  http://pastebin.com/uGw6yNqv
 
  And here is even another where I used a longer, and therefore unique
  string,
  by write.set(value.toString().substring(0, 20)); This makes every line
  unique, similar to first job.
  Still fails.
  http://pastebin.com/GdQ1rp8i
 
  Also, are the two programs being run against
   the exact same input data ?
 
  Yes, exactly the same input: a single csv file with 23K lines.
   Using a shorter string leads to more identical keys and therefore more
   combining/reducing, but going by the above it seems to fail whether the
   substring/key is entirely unique (23000 combine output records) or mostly
   the same (9 combine output records).
 
  
   Also, since the cluster size is small, you could also look at the
   tasktracker logs on the machines where the maps have run to see if
   there are any failures when the reduce attempts start failing.
 
  Here is the TT log from the last failed job. I do not see anything
  besides the shuffle failure, but there
  may be something I am overlooking or simply do not understand.
  http://pastebin.com/DKFTyGXg
 
  Thanks again!
 
  
   Thanks
   Hemanth
  
 




-- 
Best Wishes!
顺送商祺!

--
Chen He
(402)613-9298
PhD. student of CSE Dept.
Research Assistant of Holland Computing Center
University of Nebraska-Lincoln
Lincoln NE 68588


Re: hybrid map/reducer scheduler?

2010-06-28 Thread He Chen
You can write your own based on them; they are open source.

On Mon, Jun 28, 2010 at 6:13 PM, jiang licht licht_ji...@yahoo.com wrote:

 In addition to default FIFO scheduler, there are fair scheduler and
 capacity scheduler. In some sense, fair scheduler can be considered a
 user-based scheduling while capacity scheduler does a queue-based
 scheduling. Is there or will there be a hybrid scheduler that combines the
 good parts of the two (or a capacity scheduler that allows preemption, then
 different users are asked to submit jobs to different queues, in this way
 implicitly follow user-based scheduling as well, more or less)?

 Thanks,

 --Michael





Re: Shuffle error

2010-05-24 Thread He Chen
Problem solved. It was caused by an inconsistency in the /etc/hosts files
across the nodes.

2010/5/24 He Chen airb...@gmail.com

 Hey, every one

 I have a problem when I run hadoop programs. Some of my worker nodes always
 report following Shuffle Error.

 Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
 (the same message repeated many times)

 I tried to find out whether this is caused by certain nodes' hardware (for
 example, memory or hard drive), but it looks like any of the worker nodes can
 end up with this error. I am using 0.20.1.

 Any suggestion will be appreciated!

 --
 Best Wishes!
 顺送商祺!

 --
 Chen He
 (402)613-9298
 PhD. student of CSE Dept.
 Holland Computing Center
 University of Nebraska-Lincoln
 Lincoln NE 68588




-- 
Best Wishes!
顺送商祺!

--
Chen He
(402)613-9298
PhD. student of CSE Dept.
Holland Computing Center
University of Nebraska-Lincoln
Lincoln NE 68588


Re: Any possible to set hdfs block size to a value smaller than 64MB?

2010-05-18 Thread He Chen
If you know how to use AspectJ for aspect-oriented programming, you can write
an aspect class and let it monitor the whole MapReduce process.
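
A minimal annotation-style sketch of such an aspect, timing each map() call.
The pointcut and class name are just an illustration, and you would still need
the AspectJ runtime/weaver wired into the task JVMs, which is left out here:

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

// Illustrative aspect: times every call to an old-API Mapper.map() and writes
// the duration to the task's stderr log.
@Aspect
public class MapTimingAspect {

  @Around("execution(* org.apache.hadoop.mapred.Mapper+.map(..))")
  public Object timeMap(ProceedingJoinPoint pjp) throws Throwable {
    long start = System.nanoTime();
    try {
      return pjp.proceed();                              // run the real map() call
    } finally {
      long ms = (System.nanoTime() - start) / 1000000L;
      System.err.println("map() took " + ms + " ms");
    }
  }
}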

On Tue, May 18, 2010 at 10:00 AM, Patrick Angeles patr...@cloudera.comwrote:

 Should be evident in the total job running time... that's the only metric
 that really matters :)

 On Tue, May 18, 2010 at 10:39 AM, Pierre ANCELOT pierre...@gmail.com
 wrote:

  Thank you,
  Any way I can measure the startup overhead in terms of time?
 
 
  On Tue, May 18, 2010 at 4:27 PM, Patrick Angeles patr...@cloudera.com
  wrote:
 
   Pierre,
  
   Adding to what Brian has said (some things are not explicitly mentioned
  in
   the HDFS design doc)...
  
    - If you have small files that take up < 64MB, you do not actually use
    the entire 64MB block on disk.
   - You *do* use up RAM on the NameNode, as each block represents
 meta-data
   that needs to be maintained in-memory in the NameNode.
   - Hadoop won't perform optimally with very small block sizes. Hadoop
 I/O
  is
   optimized for high sustained throughput per single file/block. There is
 a
   penalty for doing too many seeks to get to the beginning of each block.
   Additionally, you will have a MapReduce task per small file. Each
  MapReduce
   task has a non-trivial startup overhead.
   - The recommendation is to consolidate your small files into large
 files.
   One way to do this is via SequenceFiles... put the filename in the
   SequenceFile key field, and the file's bytes in the SequenceFile value
   field.
  
   In addition to the HDFS design docs, I recommend reading this blog
 post:
   http://www.cloudera.com/blog/2009/02/the-small-files-problem/
  
   Happy Hadooping,
  
   - Patrick
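
A minimal sketch of that SequenceFile packing idea: read each small local
file, use its name as the key and its bytes as the value. The class name and
paths are just placeholders:

import java.io.File;
import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical packer: rolls a local directory of small files into one
// SequenceFile on HDFS, keyed by file name.
public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[1]);                        // e.g. /user/me/packed.seq
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
    try {
      for (File f : new File(args[0]).listFiles()) {     // local directory of small files
        byte[] buf = new byte[(int) f.length()];
        FileInputStream in = new FileInputStream(f);
        try {
          IOUtils.readFully(in, buf, 0, buf.length);
        } finally {
          in.close();
        }
        writer.append(new Text(f.getName()), new BytesWritable(buf));  // name -> bytes
      }
    } finally {
      writer.close();
    }
  }
}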
  
   On Tue, May 18, 2010 at 9:11 AM, Pierre ANCELOT pierre...@gmail.com
   wrote:
  
Okay, thank you :)
   
   
On Tue, May 18, 2010 at 2:48 PM, Brian Bockelman 
 bbock...@cse.unl.edu
wrote:
   

 On May 18, 2010, at 7:38 AM, Pierre ANCELOT wrote:

  Hi, thanks for this fast answer :)
   If so, what do you mean by blocks? If a file has to be split, will it be
   split only when it is larger than 64MB?
 

 For every 64MB of the file, Hadoop will create a separate block.
  So,
   if
 you have a 32KB file, there will be one block of 32KB.  If the file
  is
65MB,
 then it will have one block of 64MB and another block of 1MB.

 Splitting files is very useful for load-balancing and distributing
  I/O
 across multiple nodes.  At 32KB / file, you don't really need to
  split
the
 files at all.

 I recommend reading the HDFS design document for background issues
  like
 this:

 http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html

 Brian
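
On the original question of using a block size below 64MB: the block size is a
per-file property fixed when the file is created, so a client can request it
explicitly (or set dfs.block.size on its side) without touching the cluster
default. A minimal sketch, not a recommendation to go small given the caveats
above; the path and sizes are just placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative write with a non-default block size for one file only.
public class SmallBlockWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    long blockSize = 8 * 1024 * 1024;                    // 8 MB instead of the 64 MB default
    short replication = 3;
    int bufferSize = conf.getInt("io.file.buffer.size", 4096);
    FSDataOutputStream out = fs.create(new Path(args[0]), true,
                                       bufferSize, replication, blockSize);
    out.writeBytes("hello\n");                           // block size applies to this file only
    out.close();
  }
}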

 
 
 
  On Tue, May 18, 2010 at 2:34 PM, Brian Bockelman 
   bbock...@cse.unl.edu
 wrote:
 
  Hey Pierre,
 
  These are not traditional filesystem blocks - if you save a file
smaller
  than 64MB, you don't lose 64MB of file space..
 
  Hadoop will use 32KB to store a 32KB file (ok, plus a KB of
  metadata
or
  so), not 64MB.
 
  Brian
 
  On May 18, 2010, at 7:06 AM, Pierre ANCELOT wrote:
 
   Hi,
   I'm porting a legacy application to hadoop and it uses a bunch of small
   files.
   I'm aware that having such small files isn't a good idea, but I'm not
   making the technical decisions and the port has to be done for yesterday...
   Of course such small files are a problem; loading 64MB blocks for a few
   lines of text is an evident loss.
   What will happen if I set a smaller, or even much smaller (32kB), block
   size?
 
  Thank you.
 
  Pierre ANCELOT.
 
 
 
 
  --
  http://www.neko-consulting.com
  Ego sum quis ego servo
  Je suis ce que je protège
  I am what I protect


   
   
--
http://www.neko-consulting.com
Ego sum quis ego servo
Je suis ce que je protège
I am what I protect
   
  
 
 
 
  --
  http://www.neko-consulting.com
  Ego sum quis ego servo
  Je suis ce que je protège
  I am what I protect
 




-- 
Best Wishes!
顺送商祺!

--
Chen He
(402)613-9298
PhD. student of CSE Dept.
Holland Computing Center
University of Nebraska-Lincoln
Lincoln NE 68588


Hadoop does not follow my setting

2010-04-22 Thread He Chen
Hi everyone

I am doing a benchmark using Hadoop 0.20.0's wordcount example. I have a
30GB file and I plan to test the performance of different numbers of mappers.
For example, for a wordcount job, I plan to test 22 mappers, 44 mappers, 66
mappers and 110 mappers.

However, when I set mapred.map.tasks equal to 22 and ran the job, it showed
436 mappers in total.

I thought maybe wordcount sets its parameters inside its own program, so I
passed -Dmapred.map.tasks=22 to it, but it was still 436 in my next try. I
found out that 30GB divided by 436 is roughly 64MB, which is just my block
size.

Any suggestions will be appreciated.

Thank you in advance!

-- 
Best Wishes!
顺送商祺!

--
Chen He
(402)613-9298
PhD. student of CSE Dept.
Holland Computing Center
University of Nebraska-Lincoln
Lincoln NE 68588


Re: Hadoop does not follow my setting

2010-04-22 Thread He Chen
Hey Eric Sammer

Thank you for the reply. Actually, I only care about the number of mappers
in my circumstance. Looks like, I should write the wordcount program with my
own InputFormat class.
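
Before writing a custom InputFormat, it may be enough to raise the minimum
split size: FileInputFormat picks splitSize = max(minSplitSize,
min(totalSize/requestedMaps, blockSize)), so the 22-map hint alone loses to
the 64MB block size. A rough driver sketch; the 30GB figure and class names
are just placeholders for my job:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount-22-maps");

    // Raising the minimum split size (~1.4 GB for a ~30 GB input, an assumed
    // figure) forces roughly 22 splits instead of one split per 64MB block.
    conf.setNumMapTasks(22);
    conf.setLong("mapred.min.split.size", 30L * 1024 * 1024 * 1024 / 22);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    // conf.setMapperClass(...); conf.setReducerClass(...);  // the usual wordcount classes

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

Fewer, larger splits do give up some data locality, so it is worth watching
whether the job actually gets faster.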

2010/4/22 Eric Sammer esam...@cloudera.com

 This is normal and expected. The mapred.map.tasks parameter is only a
 hint. The InputFormat gets to decide how to calculate splits.
 FileInputFormat and all subclasses, including TextInputFormat, use a
 few parameters to figure out what the appropriate split size will be
 but under most circumstances, this winds up being the block size. If
 you used fewer map tasks than blocks, you would sacrifice data
 locality which would only hurt performance.

 2010/4/22 He Chen airb...@gmail.com:
   Hi everyone
 
  I am doing a benchmark by using Hadoop 0.20.0's wordcount example. I have
 a
  30GB file. I plan to test differenct number of mappers' performance. For
  example, for a wordcount job, I plan to test 22 mappers, 44 mappers, 66
  mappers and 110 mappers.
 
  However, I set the mapred.map.tasks equals to 22. But when I ran the
 job,
  it shows 436 mappers total.
 
  I think maybe the wordcount set its parameters inside the its own
 program. I
  give -Dmapred.map.tasks=22 to this program. But it is still 436 again
 in
  my another try.  I found out that 30GB divide by 436 is just 64MB, it is
  just my block size.
 
  Any suggestions will be appreciated.
 
  Thank you in advance!
 
  --
  Best Wishes!
  顺送商祺!
 
  --
  Chen He
  (402)613-9298
  PhD. student of CSE Dept.
  Holland Computing Center
  University of Nebraska-Lincoln
  Lincoln NE 68588
 



 --
 Eric Sammer
 phone: +1-917-287-2675
 twitter: esammer
 data: www.cloudera.com



Re: Hadoop does not follow my setting

2010-04-22 Thread He Chen
Hi Raymond Jennings III

I use 22 mappers because I have 22 cores in my clusters. Is this what you
want?

On Thu, Apr 22, 2010 at 11:55 AM, Raymond Jennings III 
raymondj...@yahoo.com wrote:

 Isn't the number of mappers specified only a suggestion ?

 --- On Thu, 4/22/10, He Chen airb...@gmail.com wrote:

  From: He Chen airb...@gmail.com
  Subject: Hadoop does not follow my setting
  To: common-user@hadoop.apache.org
  Date: Thursday, April 22, 2010, 12:50 PM
   Hi everyone
 
  I am doing a benchmark by using Hadoop 0.20.0's wordcount
  example. I have a
  30GB file. I plan to test differenct number of mappers'
  performance. For
  example, for a wordcount job, I plan to test 22 mappers, 44
  mappers, 66
  mappers and 110 mappers.
 
  However, I set the mapred.map.tasks equals to 22. But
  when I ran the job,
  it shows 436 mappers total.
 
  I think maybe the wordcount set its parameters inside the
  its own program. I
  give -Dmapred.map.tasks=22 to this program. But it is
  still 436 again in
  my another try.  I found out that 30GB divide by 436
  is just 64MB, it is
  just my block size.
 
  Any suggestions will be appreciated.
 
  Thank you in advance!
 
  --
  Best Wishes!
  顺送商祺!
 
  --
  Chen He
  (402)613-9298
  PhD. student of CSE Dept.
  Holland Computing Center
  University of Nebraska-Lincoln
  Lincoln NE 68588
 






-- 
Best Wishes!
顺送商祺!

--
Chen He
(402)613-9298
PhD. student of CSE Dept.
Holland Computing Center
University of Nebraska-Lincoln
Lincoln NE 68588


Re: Hadoop does not follow my setting

2010-04-22 Thread He Chen
Yes, but if you have more mappers, you may have more waves to execute. I mean,
if I have 110 mappers for a job and only 22 cores, then it will execute
approximately 5 waves. If I have only 22 mappers, it will save that per-wave
startup overhead.

2010/4/22 Edward Capriolo edlinuxg...@gmail.com

 2010/4/22 He Chen airb...@gmail.com

  Hi Raymond Jennings III
 
  I use 22 mappers because I have 22 cores in my clusters. Is this what you
  want?
 
  On Thu, Apr 22, 2010 at 11:55 AM, Raymond Jennings III 
  raymondj...@yahoo.com wrote:
 
   Isn't the number of mappers specified only a suggestion ?
  
   --- On Thu, 4/22/10, He Chen airb...@gmail.com wrote:
  
From: He Chen airb...@gmail.com
Subject: Hadoop does not follow my setting
To: common-user@hadoop.apache.org
Date: Thursday, April 22, 2010, 12:50 PM
 Hi everyone
   
I am doing a benchmark by using Hadoop 0.20.0's wordcount
example. I have a
30GB file. I plan to test differenct number of mappers'
performance. For
example, for a wordcount job, I plan to test 22 mappers, 44
mappers, 66
mappers and 110 mappers.
   
However, I set the mapred.map.tasks equals to 22. But
when I ran the job,
it shows 436 mappers total.
   
I think maybe the wordcount set its parameters inside the
its own program. I
give -Dmapred.map.tasks=22 to this program. But it is
still 436 again in
my another try.  I found out that 30GB divide by 436
is just 64MB, it is
just my block size.
   
Any suggestions will be appreciated.
   
Thank you in advance!
   
--
Best Wishes!
顺送商祺!
   
--
Chen He
(402)613-9298
PhD. student of CSE Dept.
Holland Computing Center
University of Nebraska-Lincoln
Lincoln NE 68588
   
  
  
  
  
 
 
  --
  Best Wishes!
  顺送商祺!
 
  --
  Chen He
  (402)613-9298
  PhD. student of CSE Dept.
  Holland Computing Center
  University of Nebraska-Lincoln
  Lincoln NE 68588
 

 No matter how many total mappers exist for the job only a certain number of
 them run at once.




-- 
Best Wishes!
顺送商祺!

--
Chen He
(402)613-9298
PhD. student of CSE Dept.
Holland Computing Center
University of Nebraska-Lincoln
Lincoln NE 68588


Re: Hadoop does not follow my setting

2010-04-22 Thread He Chen
To some extent, for a 30GB file that is well balanced across the cluster, the
overhead imposed by losing data locality may not be too much. We will see; I
will report my results to this mailing list.

On Thu, Apr 22, 2010 at 2:44 PM, Allen Wittenauer
awittena...@linkedin.comwrote:


 On Apr 22, 2010, at 11:46 AM, He Chen wrote:

  Yes, but if you have more mappers, you may have more waves to execute. I
  mean if I have 110 mappers for a job and I only have 22 cores. Then, it
 will
  execute 5 waves approximately, If I have only 22 mappers, It will save
 the
  overhead time.

 But you'll sacrifice data locality, which means that instead of testing the
 cpu, you'll be testing cpu+network.





-- 
Best Wishes!
顺送商祺!

--
Chen He
(402)613-9298
PhD. student of CSE Dept.
Holland Computing Center
University of Nebraska-Lincoln
Lincoln NE 68588


Re: Reducer-side join example

2010-04-05 Thread He Chen
For the Map function:
input key: default
input value: lines from File A and File B

output key: A, B, C, ... (the first column of the final result)
output value: 12, 24, Car, 13, Van, SUV, ...

For the Reduce function, take the Map output and do:

for each key
{
    if the value is an integer
        save it to array1;
    else
        save it to array2;
}
for ith element in array1
    for jth element in array2
        output(key, array1[i] + "\t" + array2[j]);

Hope this helps.
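
If it helps, here is a rough Java version of that reduce() idea for the old
API. It assumes the mapper has already emitted key = Field1 and tagged each
value with its source file; the "A#" / "B#" prefixes are just an illustrative
convention, and both sides are buffered in memory, which is fine for small
groups:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical reduce-side join: values tagged "A#12" come from File A,
// values tagged "B#Car\t..." come from File B.
public class JoinReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    List<String> fromA = new ArrayList<String>();   // e.g. "12", "24"
    List<String> fromB = new ArrayList<String>();   // e.g. "Car\t...", "Truck\t..."
    while (values.hasNext()) {
      String v = values.next().toString();
      if (v.startsWith("A#")) {
        fromA.add(v.substring(2));
      } else {
        fromB.add(v.substring(2));
      }
    }
    // The cross product of the two sides gives one joined row per (A-row, B-row) pair.
    for (String a : fromA) {
      for (String b : fromB) {
        output.collect(key, new Text(a + "\t" + b));
      }
    }
  }
}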


On Mon, Apr 5, 2010 at 4:10 PM, M B machac...@gmail.com wrote:

 Hi, I need a good java example to get me started with some joining we need
 to do, any examples would be appreciated.

 File A:
 Field1  Field2
 A       12
 B       13
 C       22
 A       24

 File B:
 Field1  Field2   Field3
 A       Car      ...
 B       Truck    ...
 B       SUV      ...
 B       Van      ...

 So, we need to first join File A and B on Field1 (say both are string
 fields).  The result would just be:
 A   12   Car   ...
 A   24   Car   ...
 B   13   Truck   ...
 B   13   SUV   ...
 B   13   Van   ...
 and so on - with all the fields from both files returning.

 Once we have that, we sometimes need to then transform it so we have a
 single record per key (Field1):
 A (12,Car) (24,Car)
 B (13,Truck) (13,SUV) (13,Van)
 --however it looks, basically tuples for each key (we'll modify this later
 to return a conatenated set of fields from B, etc)

 At other times, instead of transforming to a single row, we just need to
 modify rows based on values.  So if B.Field2 equals Van, we need to set
 Output.Field2 = whatever then output to file ...

 Are there any good examples of this in native java (we can't use
 pig/hive/etc)?

 thanks.




-- 
Best Wishes!


--
Chen He
  PhD. student of CSE Dept.
Holland Computing Center
University of Nebraska-Lincoln
Lincoln NE 68588


Re: Defining the number of map tasks

2009-12-29 Thread He Chen
In the hadoop-site.xml (or hadoop-default.xml) file you can find a parameter
called mapred.map.tasks; change its value to 3. At the same time, set
mapred.tasktracker.map.tasks.maximum to 3 if you use only one tasktracker.
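
The same hint can also be set per job from the driver. Note that
mapred.tasktracker.map.tasks.maximum is read by each tasktracker at startup,
so it belongs in that daemon's hadoop-site.xml rather than in the job. A
minimal sketch with placeholder class names:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ThreeMapDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ThreeMapDriver.class);
    conf.setNumMapTasks(3);   // same as mapred.map.tasks=3; still only a hint to the InputFormat
    // set mapper/reducer classes and input/output paths here before submitting
    JobClient.runJob(conf);
  }
}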

On Wed, Dec 16, 2009 at 3:26 PM, psdc1978 psdc1...@gmail.com wrote:

 Hi,

 I would like to have several map tasks that execute the same work.
 For example, I have 3 map tasks (M1, M2 and M3) and 1 GB of input data
 to be read by each map. Each map should read the same input data and
 send its result to the same reduce. At the end, the reduce should
 produce the same 3 results.

 Put in conf/slaves file 3 instances of the same machine

 file
 localhost
 localhost
 localhost
 /file

 does it solve the problem?


 How I define the number of map tasks to run?



 Best regards,
 --
 xeon


Chen