Re: Data Locality Importance

2014-03-22 Thread Chen He
Hi Mike

Data locality rests on an assumption: accessing local storage (disk, SSD,
etc.) is faster than transferring the data over the network. Vinod has
already explained the benefits. But locality in the map stage does not
always help. If one fat node stores a large file, the current MR framework
may assign a lot of map tasks from a single job to that node and then
congest its network during the shuffle.
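
One way to see how much locality actually matters for your workload is to
read the job's locality counters after it finishes. A minimal sketch,
assuming the Hadoop 2.x mapreduce API and a handle to an already-completed
Job:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;

public class LocalityReport {
    // 'job' is assumed to be a handle to a completed MapReduce job.
    public static void print(Job job) throws Exception {
        long dataLocal = job.getCounters()
                .findCounter(JobCounter.DATA_LOCAL_MAPS).getValue();
        long rackLocal = job.getCounters()
                .findCounter(JobCounter.RACK_LOCAL_MAPS).getValue();
        long totalMaps = job.getCounters()
                .findCounter(JobCounter.TOTAL_LAUNCHED_MAPS).getValue();
        System.out.printf("data-local: %d, rack-local: %d, launched maps: %d%n",
                dataLocal, rackLocal, totalMaps);
    }
}

If most maps are already data-local or rack-local and the network is fast,
separating storage from compute may cost relatively little, which would be
consistent with what you saw on EMR.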

I am not sure how EMR is implemented at the physical layer. If the nodes
are all virtual machines, it is possible that your "separate" HDFS cluster
and MR cluster still benefit from local data access.

Chen


On Sat, Mar 22, 2014 at 11:07 PM, Sathya  wrote:

> "VOTE FOR MODI" or teach me how not to get mails
>
> -Original Message-
> From: Vinod Kumar Vavilapalli [mailto:vino...@hortonworks.com] On Behalf
> Of
> Vinod Kumar Vavilapalli
> Sent: Sunday, March 23, 2014 12:20 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Data Locality Importance
>
> Like you said, it depends both on the kind of network you have and the type
> of your workload.
>
> Given your point about S3, I'd guess your input files/blocks are not large
> enough that moving code to data trumps moving data itself to the code. When
> that balance tilts a lot, especially when moving large input data
> files/blocks, data-locality will help improve performance significantly.
> That or when the read throughput from a remote disk << reading it from a
> local disk.
>
> HTH
> +Vinod
>
> On Mar 21, 2014, at 7:06 PM, Mike Sam  wrote:
>
> > How important is Data Locality to Hadoop? I mean, if we prefer to
> > separate the HDFS cluster from the MR cluster, we will lose data
> > locality but my question is how bad is this assuming we provide a
> > reasonable network connection between the two clusters? EMR kills data
> > locality when using S3 as storage but we do not see a significant job
> > time difference running same job from the HDFS cluster of the same
> > setup. So, I am wondering how important is Data Locality to Hadoop in
> practice?
> >
> > Thanks,
> > Mike
>
>


Re: Hadoop and Cuda , JCuda (CPU+GPU architecture)

2012-09-25 Thread Chen He
Hi Sudha

Good question.

First of all, you need to describe your Hadoop environment clearly
(pseudo-distributed or a real cluster).

Secondly, you need to understand how Hadoop ships a job's jar file to the
worker nodes: it copies only the job jar itself, which does not contain
jcuda.jar. The MapReduce tasks may not find it even if you put jcuda.jar on
a worker node's classpath yourself.

I suggest including jcuda.jar inside your wordcount.jar (for example, in a
lib/ directory inside the job jar, which Hadoop adds to the task classpath).
Then, when Hadoop copies wordcount.jar to each worker node's temporary
working directory, you do not need to worry about this issue.
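
If you would rather not repackage the jar, a sketch of one alternative is
below: upload jcuda.jar to HDFS yourself (the path /libs/jcuda.jar here is
just an assumption) and add it to the task classpath through the
DistributedCache. Passing -libjars on the command line does the same thing
when your driver uses the Tool interface.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class JobWithJcuda {
    // Builds a JobConf whose tasks can load classes from jcuda.jar.
    // The HDFS path below is an assumption -- use wherever you uploaded it.
    public static JobConf createJobConf() throws Exception {
        Configuration conf = new Configuration();
        DistributedCache.addFileToClassPath(new Path("/libs/jcuda.jar"), conf);
        JobConf job = new JobConf(conf, JobWithJcuda.class);
        job.setJobName("wordcount-with-jcuda");
        // ...set mapper, reducer, input and output paths as usual...
        return job;
    }
}

Either way, the extra jar ends up on every task's classpath without touching
the worker nodes by hand.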

Let me know if you run into further questions.

Chen

On Tue, Sep 25, 2012 at 12:38 AM, sudha sadhasivam <
sudhasadhasi...@yahoo.com> wrote:

> Sir
> We tried to integrate hadoop and JCUDA.
> We tried a code from
>
>
> http://code.google.com/p/mrcl/source/browse/trunk/hama-mrcl/src/mrcl/mrcl/?r=76
>
> We re able to compile. We are not able to execute. It does not recognise
> JCUBLAS.jar. We tried setting the classpath
> We are herewith attaching the procedure for the same along with errors
> Kindly inform us how to proceed. It is our UG project
> Thanking you
> Dr G sudha Sadasivam
>
> --- On *Mon, 9/24/12, Chen He * wrote:
>
>
> From: Chen He 
> Subject: Re: Hadoop and Cuda , JCuda (CPU+GPU architecture)
> To: common-user@hadoop.apache.org
> Date: Monday, September 24, 2012, 9:03 PM
>
>
> http://wiki.apache.org/hadoop/CUDA%20On%20Hadoop
>
> On Mon, Sep 24, 2012 at 10:30 AM, Oleg Ruchovets wrote:
>
> > Hi
> >
> > I am going to process video analytics using hadoop
> > I am very interested about CPU+GPU architercute espessially using CUDA (
> > http://www.nvidia.com/object/cuda_home_new.html) and JCUDA (
> > http://jcuda.org/)
> > Does using HADOOP and CPU+GPU architecture bring significant performance
> > improvement and does someone succeeded to implement it in production
> > quality?
> >
> > I didn't fine any projects / examples  to use such technology.
> > If someone could give me a link to best practices and example using
> > CUDA/JCUDA + hadoop that would be great.
> > Thanks in advane
> > Oleg.
> >
>
>


Re: Hadoop and Cuda , JCuda (CPU+GPU architecture)

2012-09-24 Thread Chen He
Hi Oleg

I will answer your questions one by one.

1) file size

There is no exact file size that is guaranteed to work well for
GPGPU+Hadoop. You need to run a proof of concept (POC) for your project to
find that number.

I think GPU+Hadoop is well suited to computation-intensive and
data-intensive applications. However, be aware of the bottleneck between GPU
memory and CPU memory: the benefit you gain from the GPGPU should outweigh
the performance you sacrifice shipping data between GPU memory and CPU
memory.

Even if you only have computation-intensive applications that can be
parallelized on a GPGPU, CUDA+Hadoop still gives you a framework for
distributing the work across cluster nodes with fault tolerance.


2) Is it a good idea to process data as locally as possible (I mean process
data like one file per map task)?

Yes. Data-local map tasks generally run faster than non-local map tasks in
the Hadoop MapReduce framework.

3) During your project did you face any limitations or problems?

During my project, the video card was not fancy; it only allowed one CUDA
program to use the card at a time. Because of that, we configured only one
map slot and one reduce slot per cluster node. Now NVIDIA has more powerful
products that let multiple programs run on the same card simultaneously.

4) By the way, I didn't find any JCuda code examples with Hadoop. :-)

Your MapReduce code is written in Java, right? Integrate your JCuda code
into the map() or reduce() method of your MapReduce program (you can also do
this in the combiner, the partitioner, or wherever you need it). The JCuda
examples only show you how JCuda itself works.
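
A minimal skeleton of what that integration can look like (assumptions: the
old mapred API, a precompiled kernel shipped to the task working directory
as "kernel.ptx", a kernel named "processRecord", and a placeholder
launchKernel() helper standing in for the real host/device transfers and
cuLaunchKernel call):

import java.io.IOException;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUfunction;
import jcuda.driver.CUmodule;
import jcuda.driver.JCudaDriver;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class GpuMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final CUfunction kernel = new CUfunction();

    @Override
    public void configure(JobConf conf) {
        // One-time GPU setup per task JVM: initialize the driver, create a
        // context, and load the precompiled kernel. "kernel.ptx" and the
        // kernel name "processRecord" are assumptions for this sketch.
        JCudaDriver.cuInit(0);
        CUdevice device = new CUdevice();
        JCudaDriver.cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        JCudaDriver.cuCtxCreate(context, 0, device);
        CUmodule module = new CUmodule();
        JCudaDriver.cuModuleLoad(module, "kernel.ptx");
        JCudaDriver.cuModuleGetFunction(kernel, module, "processRecord");
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Hand the record to the GPU and emit whatever the kernel computes.
        int result = launchKernel(value.toString());
        output.collect(value, new IntWritable(result));
    }

    // Placeholder standing in for the real host-to-device copy,
    // JCudaDriver.cuLaunchKernel call, and device-to-host copy.
    private int launchKernel(String record) {
        return record.length();
    }
}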

Chen

On Mon, Sep 24, 2012 at 11:22 AM, Oleg Ruchovets wrote:

> Great ,
>Can you give some tips or best practices like:
> 1) file size
> 2) Is it good Idea to process data as locally as possble (I mean process a
> data like one file per one map)
> 3) During your project did you face with limitations , problems?
>
>
>Can you point me on which hartware is better to use( I understand in
> order to use GPU I need NVIDIA) .
> I mean using CPU only arthitecture I have 8-12 core per one computer(for
> example).
>  What should I do in orger to use CPU+GPU arthitecture? What kind of NVIDIA
> do I need for this.
>
> By the way I didn't fine code Jcuda example with Hadoop. :-)
>
> Thanks in advane
> Oleg.
>
> On Mon, Sep 24, 2012 at 6:07 PM, Chen He  wrote:
>
> > Please see the Jcuda example. I do refer from there. BTW, you can also
> > compile your cuda code in advance and let your hadoop code call those
> > compiled code through Jcuda. That is what I did in my program.
> >
> > On Mon, Sep 24, 2012 at 10:45 AM, Oleg Ruchovets  > >wrote:
> >
> > > Thank you very much.  I saw this link !!!  . Do you have any code ,
> > example
> > > shared in the network (github for example).
> > >
> > > On Mon, Sep 24, 2012 at 5:33 PM, Chen He  wrote:
> > >
> > > > http://wiki.apache.org/hadoop/CUDA%20On%20Hadoop
> > > >
> > > > On Mon, Sep 24, 2012 at 10:30 AM, Oleg Ruchovets <
> oruchov...@gmail.com
> > > > >wrote:
> > > >
> > > > > Hi
> > > > >
> > > > > I am going to process video analytics using hadoop
> > > > > I am very interested about CPU+GPU architercute espessially using
> > CUDA
> > > (
> > > > > http://www.nvidia.com/object/cuda_home_new.html) and JCUDA (
> > > > > http://jcuda.org/)
> > > > > Does using HADOOP and CPU+GPU architecture bring significant
> > > performance
> > > > > improvement and does someone succeeded to implement it in
> production
> > > > > quality?
> > > > >
> > > > > I didn't fine any projects / examples  to use such technology.
> > > > > If someone could give me a link to best practices and example using
> > > > > CUDA/JCUDA + hadoop that would be great.
> > > > > Thanks in advane
> > > > > Oleg.
> > > > >
> > > >
> > >
> >
>


Re: Hadoop and Cuda , JCuda (CPU+GPU architecture)

2012-09-24 Thread Chen He
Please see the JCuda examples; that is what I referred to. BTW, you can also
compile your CUDA code in advance and have your Hadoop code call the
compiled kernels through JCuda. That is what I did in my program.

On Mon, Sep 24, 2012 at 10:45 AM, Oleg Ruchovets wrote:

> Thank you very much.  I saw this link !!!  . Do you have any code , example
> shared in the network (github for example).
>
> On Mon, Sep 24, 2012 at 5:33 PM, Chen He  wrote:
>
> > http://wiki.apache.org/hadoop/CUDA%20On%20Hadoop
> >
> > On Mon, Sep 24, 2012 at 10:30 AM, Oleg Ruchovets  > >wrote:
> >
> > > Hi
> > >
> > > I am going to process video analytics using hadoop
> > > I am very interested about CPU+GPU architercute espessially using CUDA
> (
> > > http://www.nvidia.com/object/cuda_home_new.html) and JCUDA (
> > > http://jcuda.org/)
> > > Does using HADOOP and CPU+GPU architecture bring significant
> performance
> > > improvement and does someone succeeded to implement it in production
> > > quality?
> > >
> > > I didn't fine any projects / examples  to use such technology.
> > > If someone could give me a link to best practices and example using
> > > CUDA/JCUDA + hadoop that would be great.
> > > Thanks in advane
> > > Oleg.
> > >
> >
>


Re: Hadoop and Cuda , JCuda (CPU+GPU architecture)

2012-09-24 Thread Chen He
http://wiki.apache.org/hadoop/CUDA%20On%20Hadoop

On Mon, Sep 24, 2012 at 10:30 AM, Oleg Ruchovets wrote:

> Hi
>
> I am going to process video analytics using hadoop
> I am very interested about CPU+GPU architercute espessially using CUDA (
> http://www.nvidia.com/object/cuda_home_new.html) and JCUDA (
> http://jcuda.org/)
> Does using HADOOP and CPU+GPU architecture bring significant performance
> improvement and does someone succeeded to implement it in production
> quality?
>
> I didn't fine any projects / examples  to use such technology.
> If someone could give me a link to best practices and example using
> CUDA/JCUDA + hadoop that would be great.
> Thanks in advane
> Oleg.
>


Re: migrate cluster to different datacenter

2012-08-03 Thread Chen He
Sometimes, physically moving the hard drives helps. :)
On Aug 3, 2012 1:50 PM, "Patai Sangbutsarakum" 
wrote:

> Hi Hadoopers,
>
> We have a plan to migrate Hadoop cluster to a different datacenter
> where we can triple the size of the cluster.
> Currently, our 0.20.2 cluster have around 1PB of data. We use only
> Java/Pig.
>
> I would like to get some input how we gonna handle with transferring
> 1PB of data to a new site, and also keep up with
> new files that thrown into cluster all the time.
>
> Happy friday !!
>
> P
>


Re: Re: HDFS block physical location

2012-07-25 Thread Chen He
For the block-to-filename mapping, you can get it from my previous answer.

For the block-to-hard-disk mapping, you may need to traverse all the
directories used for HDFS; your OS knows which hard drive is mounted on
which directory.

With these two pieces of information, you can write a small Perl or Python
script (or a short Java program, as sketched below) to get what you want.
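
For example, a rough Java sketch along those lines (the data directory paths
below are hypothetical -- substitute whatever dfs.data.dir points to on your
nodes -- and the block name, e.g. "blk_1073741825", comes from the fsck
output):

import java.io.File;

public class FindBlockFile {
    // Hypothetical datanode storage directories -- list whatever
    // dfs.data.dir is set to on your nodes, one entry per disk.
    private static final String[] DATA_DIRS = {
        "/data1/dfs/data", "/data2/dfs/data"
    };

    public static void main(String[] args) {
        String blockName = args[0];  // e.g. "blk_1073741825"
        for (String dir : DATA_DIRS) {
            search(new File(dir), blockName);
        }
    }

    // Recursively walk a storage directory and print any file whose name
    // starts with the block name; its path tells you which disk holds it.
    private static void search(File dir, String blockName) {
        File[] children = dir.listFiles();
        if (children == null) {
            return;  // not a directory, or not readable
        }
        for (File f : children) {
            if (f.isDirectory()) {
                search(f, blockName);
            } else if (f.getName().startsWith(blockName)) {
                System.out.println(blockName + " -> " + f.getAbsolutePath());
            }
        }
    }
}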

Or

Take a look at NameNode.java and see where and how it stores the table of
block information.

Please correct me if there are any mistakes.

Chen


On Wed, Jul 25, 2012 at 6:10 PM, <20seco...@web.de> wrote:

>
> Thanks,
> but that just gives me the hostnames or am I overlooking something?
> I actually need the filename/harddisk on the node.
>
> JS
>
> Gesendet: Mittwoch, 25. Juli 2012 um 23:33 Uhr
> Von: "Chen He" 
> An: common-user@hadoop.apache.org
> Betreff: Re: HDFS block physical location
> >nohup hadoop fsck / -files -blocks -locations
> >cat nohup.out | grep [your block name]
>
> Hope this helps.
>
> On Wed, Jul 25, 2012 at 5:17 PM, <20seco...@web.de> wrote:
>
> > Hi,
> >
> > just a short question. Is there any way to figure out the physical
> storage
> > location of a given block?
> > I don't mean just a list of hostnames (which I know how to obtain), but
> > actually the file where it is being stored in.
> > We use several hard disks for hdfs data on each node, and I would need to
> > know which block ends up on which harddisk.
> >
> > Thanks!
> > JS
> >
> >
>
>
>
>


Re: HDFS block physical location

2012-07-25 Thread Chen He
>nohup hadoop fsck / -files -blocks -locations
>cat nohup.out | grep [your block name]

Hope this helps.

On Wed, Jul 25, 2012 at 5:17 PM, <20seco...@web.de> wrote:

> Hi,
>
> just a short question. Is there any way to figure out the physical storage
> location of a given block?
> I don't mean just a list of hostnames (which I know how to obtain), but
> actually the file where it is being stored in.
> We use several hard disks for hdfs data on each node, and I would need to
> know which block ends up on which harddisk.
>
> Thanks!
> JS
>
>


Re: AUTO: Yuan Jin is out of the office. (returning 07/25/2012)

2012-07-23 Thread Chen He
BTW, this is a Hadoop user group. You are welcome to ask questions and
offer solutions to help people. Please do not pollute this technical
environment.

To Yuan Jin: DO NOT send your auto email to my personal mailbox again. It
is not funny; it is rude. We will still respect you if you stop sending this
type of auto email to our technical mailing list and at least say "Excuse
me" to everyone on the list.



On Mon, Jul 23, 2012 at 9:31 PM, Chen He  wrote:

> Looks like that guy is your boss, Jason. It was you to let people forgive
> him last time. Tell him, remove the group mail-list from his auto email
> system.
>
> Looks like this Yuan has little contribution to the mail-list except for
> the spam auto emails.
>
> On Mon, Jul 23, 2012 at 6:12 PM, Jason  wrote:
>
>> Guys, just be nice
>>
>> On Tue, Jul 24, 2012 at 5:59 AM, Chen He  wrote:
>>
>> > Just kick this junk mail guy out of the group.
>> >
>> > On Mon, Jul 23, 2012 at 5:22 PM, Jean-Daniel Cryans <
>> jdcry...@apache.org
>> > >wrote:
>> >
>> > > Fifth offense.
>> > >
>> > > Yuan Jin is out of the office. - I will be out of the office starting
>> > > 06/22/2012 and will not return until 06/25/2012. I am out of
>> > > Jun 21
>> > >
>> > > Yuan Jin is out of the office. - I will be out of the office starting
>> > > 04/13/2012 and will not return until 04/16/2012. I am out of
>> > > Apr 12
>> > >
>> > > Yuan Jin is out of the office. - I will be out of the office starting
>> > > 04/02/2012 and will not return until 04/05/2012. I am out of
>> > > Apr 2
>> > >
>> > > Yuan Jin is out of the office. - I will be out of the office starting
>> > > 02/17/2012 and will not return until 02/20/2012. I am out of
>> > > Feb 16
>> > >
>> > >
>> > > On Mon, Jul 23, 2012 at 1:09 PM, Yuan Jin  wrote:
>> > > >
>> > > >
>> > > > I am out of the office until 07/25/2012.
>> > > >
>> > > > I am out of office.
>> > > >
>> > > > For HAMSTER related things, you can contact Jason(Deng Peng
>> > > Zhou/China/IBM)
>> > > > For CFM related things, you can contact Daniel(Liang SH
>> > > Su/China/Contr/IBM)
>> > > > For TMB related things, you can contact Flora(Jun Ying Li/China/IBM)
>> > > > For TWB related things, you can contact Kim(Yuan SH Jin/China/IBM)
>> > > > For others, I will reply you when I am back.
>> > > >
>> > > >
>> > > > Note: This is an automated response to your message  "Reducer
>> > > > MapFileOutpuFormat" sent on 24/07/2012 4:09:51.
>> > > >
>> > > > This is the only notification you will receive while this person is
>> > away.
>> > >
>> >
>>
>>
>>
>> --
>> Regards,
>>
>> Hao Tian
>>
>
>


Re: AUTO: Yuan Jin is out of the office. (returning 07/25/2012)

2012-07-23 Thread Chen He
Looks like that guy is your boss, Jason. Last time it was you who asked
people to forgive him. Tell him to remove the group mailing list from his
auto-email system.

It looks like Yuan has contributed little to the mailing list apart from
these spam auto emails.

On Mon, Jul 23, 2012 at 6:12 PM, Jason  wrote:

> Guys, just be nice
>
> On Tue, Jul 24, 2012 at 5:59 AM, Chen He  wrote:
>
> > Just kick this junk mail guy out of the group.
> >
> > On Mon, Jul 23, 2012 at 5:22 PM, Jean-Daniel Cryans  > >wrote:
> >
> > > Fifth offense.
> > >
> > > Yuan Jin is out of the office. - I will be out of the office starting
> > > 06/22/2012 and will not return until 06/25/2012. I am out of
> > > Jun 21
> > >
> > > Yuan Jin is out of the office. - I will be out of the office starting
> > > 04/13/2012 and will not return until 04/16/2012. I am out of
> > > Apr 12
> > >
> > > Yuan Jin is out of the office. - I will be out of the office starting
> > > 04/02/2012 and will not return until 04/05/2012. I am out of
> > > Apr 2
> > >
> > > Yuan Jin is out of the office. - I will be out of the office starting
> > > 02/17/2012 and will not return until 02/20/2012. I am out of
> > > Feb 16
> > >
> > >
> > > On Mon, Jul 23, 2012 at 1:09 PM, Yuan Jin  wrote:
> > > >
> > > >
> > > > I am out of the office until 07/25/2012.
> > > >
> > > > I am out of office.
> > > >
> > > > For HAMSTER related things, you can contact Jason(Deng Peng
> > > Zhou/China/IBM)
> > > > For CFM related things, you can contact Daniel(Liang SH
> > > Su/China/Contr/IBM)
> > > > For TMB related things, you can contact Flora(Jun Ying Li/China/IBM)
> > > > For TWB related things, you can contact Kim(Yuan SH Jin/China/IBM)
> > > > For others, I will reply you when I am back.
> > > >
> > > >
> > > > Note: This is an automated response to your message  "Reducer
> > > > MapFileOutpuFormat" sent on 24/07/2012 4:09:51.
> > > >
> > > > This is the only notification you will receive while this person is
> > away.
> > >
> >
>
>
>
> --
> Regards,
>
> Hao Tian
>


Re: AUTO: Yuan Jin is out of the office. (returning 07/25/2012)

2012-07-23 Thread Chen He
Just kick this junk mail guy out of the group.

On Mon, Jul 23, 2012 at 5:22 PM, Jean-Daniel Cryans wrote:

> Fifth offense.
>
> Yuan Jin is out of the office. - I will be out of the office starting
> 06/22/2012 and will not return until 06/25/2012. I am out of
> Jun 21
>
> Yuan Jin is out of the office. - I will be out of the office starting
> 04/13/2012 and will not return until 04/16/2012. I am out of
> Apr 12
>
> Yuan Jin is out of the office. - I will be out of the office starting
> 04/02/2012 and will not return until 04/05/2012. I am out of
> Apr 2
>
> Yuan Jin is out of the office. - I will be out of the office starting
> 02/17/2012 and will not return until 02/20/2012. I am out of
> Feb 16
>
>
> On Mon, Jul 23, 2012 at 1:09 PM, Yuan Jin  wrote:
> >
> >
> > I am out of the office until 07/25/2012.
> >
> > I am out of office.
> >
> > For HAMSTER related things, you can contact Jason(Deng Peng
> Zhou/China/IBM)
> > For CFM related things, you can contact Daniel(Liang SH
> Su/China/Contr/IBM)
> > For TMB related things, you can contact Flora(Jun Ying Li/China/IBM)
> > For TWB related things, you can contact Kim(Yuan SH Jin/China/IBM)
> > For others, I will reply you when I am back.
> >
> >
> > Note: This is an automated response to your message  "Reducer
> > MapFileOutpuFormat" sent on 24/07/2012 4:09:51.
> >
> > This is the only notification you will receive while this person is away.
>


Re: what does "keep 10% map, 40% reduce" mean in gridmix2's README?

2012-06-14 Thread Chen He
Let me know when you get the correct answer.

Chen

On Thu, Jun 14, 2012 at 11:42 AM, Nan Zhu  wrote:

> Hi, Chen,
>
> Thank you for your reply,
>
> but in its README, there is no value which is larger than 100%, it means
> that the size of intermediate results will never be larger than input size,
>
> it will not be the case, because the input data is compressed, the size of
> the generated data will expand to be very large
>
> it's just my guessing, can anyone correct me?
>
> Best,
>
> Nan
>
>
> On Thu, Jun 14, 2012 at 11:50 PM, Chen He  wrote:
>
> > Hi Nan
> >
> > probably the map stage will output 10% of the total input, and the reduce
> > stage will output 40% of intermediate results (10% of total input).
> >
> > For example, 500GB input, after the map stage, it will be 50GB and it
> will
> > become 20GB after the reduce stage.
> >
> > It may be similar to the loadgen in hadoop test example.
> >
> > Anyone has suggestion?
> >
> > Chen
> > System Architect Intern @ ZData
> > PhD student@CSE Dept.
> >
> >
> > On Thu, Jun 14, 2012 at 1:58 AM, Nan Zhu  wrote:
> >
> > > Hi, all
> > >
> > > I'm using gridmix2 to test my cluster, while in its README file, there
> > are
> > > statements like the following:
> > >
> > > +1) Three stage map/reduce job
> > > +  Input:  500GB compressed (2TB uncompressed) SequenceFile
> > > + (k,v) = (5 words, 100 words)
> > > + hadoop-env: FIXCOMPSEQ
> > > + *Compute1:   keep 10% map, 40% reduce
> > > +  Compute2:   keep 100% map, 77% reduce
> > > + Input from Compute1
> > > + Compute3:   keep 116% map, 91% reduce
> > > + Input from Compute2
> > > + *Motivation: Many user workloads are implemented as pipelined
> > > map/reduce
> > > + jobs, including Pig workloads
> > >
> > >
> > > Can anyone tell me what does "keep 10% map, 40% reduce" mean here?
> > >
> > > Best,
> > >
> > > --
> > > Nan Zhu
> > > School of Electronic, Information and Electrical Engineering,229
> > > Shanghai Jiao Tong University
> > > 800,Dongchuan Road,Shanghai,China
> > > E-Mail: zhunans...@gmail.com
> > >
> >
>
>
>
> --
> Nan Zhu
> School of Electronic, Information and Electrical Engineering,229
> Shanghai Jiao Tong University
> 800,Dongchuan Road,Shanghai,China
> E-Mail: zhunans...@gmail.com
>


Re: what does "keep 10% map, 40% reduce" mean in gridmix2's README?

2012-06-14 Thread Chen He
Hi Nan

Probably it means the map stage outputs 10% of the total input, and the
reduce stage outputs 40% of the intermediate results (which are themselves
10% of the total input).

For example, with 500GB of input, the data shrinks to 50GB after the map
stage and to 20GB after the reduce stage.

It may be similar to the loadgen job in the Hadoop test examples.
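
If that reading is right, the map stage of such a job behaves roughly like
the toy mapper below (purely illustrative, written against the old mapred
API, with the 10% keep ratio hard-coded):

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Toy "keep 10%" mapper: passes roughly 10% of the input records through
// and drops the rest, so the intermediate data is ~10% of the input size.
public class KeepTenPercentMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {

    private final Random random = new Random();

    public void map(LongWritable key, Text value,
                    OutputCollector<LongWritable, Text> out, Reporter reporter)
            throws IOException {
        if (random.nextDouble() < 0.10) {
            out.collect(key, value);
        }
    }
}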

Does anyone have a suggestion?

Chen
System Architect Intern @ ZData
PhD student@CSE Dept.


On Thu, Jun 14, 2012 at 1:58 AM, Nan Zhu  wrote:

> Hi, all
>
> I'm using gridmix2 to test my cluster, while in its README file, there are
> statements like the following:
>
> +1) Three stage map/reduce job
> +  Input:  500GB compressed (2TB uncompressed) SequenceFile
> + (k,v) = (5 words, 100 words)
> + hadoop-env: FIXCOMPSEQ
> + *Compute1:   keep 10% map, 40% reduce
> +  Compute2:   keep 100% map, 77% reduce
> + Input from Compute1
> + Compute3:   keep 116% map, 91% reduce
> + Input from Compute2
> + *Motivation: Many user workloads are implemented as pipelined
> map/reduce
> + jobs, including Pig workloads
>
>
> Can anyone tell me what does "keep 10% map, 40% reduce" mean here?
>
> Best,
>
> --
> Nan Zhu
> School of Electronic, Information and Electrical Engineering,229
> Shanghai Jiao Tong University
> 800,Dongchuan Road,Shanghai,China
> E-Mail: zhunans...@gmail.com
>


Re: Feedback on real world production experience with Flume

2012-04-21 Thread Chen He
Can the NFS become the bottleneck?

Chen

On Sat, Apr 21, 2012 at 5:23 PM, Edward Capriolo wrote:

> It seems pretty relevant. If you can directly log via NFS that is a
> viable alternative.
>
> On Sat, Apr 21, 2012 at 11:42 AM, alo alt 
> wrote:
> > We decided NO product and vendor advertising on apache mailing lists!
> > I do not understand why you'll put that closed source stuff from your
> employe in the room. It has nothing to do with flume or the use cases!
> >
> > --
> > Alexander Lorenz
> > http://mapredit.blogspot.com
> >
> > On Apr 21, 2012, at 4:06 PM, M. C. Srivas wrote:
> >
> >> Karl,
> >>
> >> since you did ask for alternatives,  people using MapR prefer to use the
> >> NFS access to directly deposit data (or access it).  Works seamlessly
> from
> >> all Linuxes, Solaris, Windows, AIX and a myriad of other legacy systems
> >> without having to load any agents on those machines. And it is fully
> >> automatic HA
> >>
> >> Since compression is built-in in MapR, the data gets compressed coming
> in
> >> over NFS automatically without much fuss.
> >>
> >> Wrt to performance,  can get about 870 MB/s per node if you have 10GigE
> >> attached (of course, with compression, the effective throughput will
> >> surpass that based on how good the data can be squeezed).
> >>
> >>
> >> On Fri, Apr 20, 2012 at 3:14 PM, Karl Hennig 
> wrote:
> >>
> >>> I am investigating automated methods of moving our data from the web
> tier
> >>> into HDFS for processing, a process that's performed periodically.
> >>>
> >>> I am looking for feedback from anyone who has actually used Flume in a
> >>> production setup (redundant, failover) successfully.  I understand it
> is
> >>> now being largely rearchitected during its incubation as Apache
> Flume-NG,
> >>> so I don't have full confidence in the old, stable releases.
> >>>
> >>> The other option would be to write our own tools.  What methods are you
> >>> using for these kinds of tasks?  Did you write your own or does Flume
> (or
> >>> something else) work for you?
> >>>
> >>> I'm also on the Flume mailing list, but I wanted to ask these questions
> >>> here because I'm interested in Flume _and_ alternatives.
> >>>
> >>> Thank you!
> >>>
> >>>
> >
>


Re: Yuan Jin is out of the office.

2012-04-12 Thread Chen He
This is the second time. Pure junk email. Could you avoid sending such
email to the public mailing list, Ms./Mr. Yuan Jin?

On Thu, Apr 12, 2012 at 6:22 PM, Chen He  wrote:

> who cares?
>
>
> On Thu, Apr 12, 2012 at 6:09 PM, Yuan Jin  wrote:
>
>>
>> I will be out of the office starting  04/13/2012 and will not return until
>> 04/16/2012.
>>
>> I am out of office, and will reply you when I am back.
>>
>> For HAMSTER related things, you can contact Jason(Deng Peng
>> Zhou/China/IBM)
>> or Anthony(Fei Xiong/China/IBM)
>> For CFM related things, you can contact Daniel(Liang SH
>> Su/China/Contr/IBM)
>> For TMB related things, you can contact Flora(Jun Ying Li/China/IBM)
>> For TWB related things, you can contact Kim(Yuan SH Jin/China/IBM)
>> For others, I will reply you when I am back.
>>
>
>


Re: Yuan Jin is out of the office.

2012-04-12 Thread Chen He
who cares?

On Thu, Apr 12, 2012 at 6:09 PM, Yuan Jin  wrote:

>
> I will be out of the office starting  04/13/2012 and will not return until
> 04/16/2012.
>
> I am out of office, and will reply you when I am back.
>
> For HAMSTER related things, you can contact Jason(Deng Peng Zhou/China/IBM)
> or Anthony(Fei Xiong/China/IBM)
> For CFM related things, you can contact Daniel(Liang SH Su/China/Contr/IBM)
> For TMB related things, you can contact Flora(Jun Ying Li/China/IBM)
> For TWB related things, you can contact Kim(Yuan SH Jin/China/IBM)
> For others, I will reply you when I am back.
>


Re: start hadoop slave over WAN

2012-03-30 Thread Chen He
Log in to your remote datanode and start the datanode manually to see what
happens.

Starting HDFS over a WAN is not as easy as within a single cluster; there
are many issues. The datanode log is the best place to troubleshoot.

Chen

On Fri, Mar 30, 2012 at 12:52 PM, Michael Segel
wrote:

> Probably a timeout.
> Really, not a good idea to do this in the first place...
>
> Sent from my iPhone
>
> On Mar 30, 2012, at 12:35 PM, "Ben Cuthbert" 
> wrote:
>
> > Strange thing is the datanode in the remote location has a log zero
> bytes. So nothing there.
> > Its strange it is like the master does and ssh, login, and then attempts
> to start it but nothing. Maybe there is a timeout?
> >
> >
> > On 30 Mar 2012, at 18:22, kasi subrahmanyam wrote:
> >
> >> Try checking the logs in the logs folder for the datanode.It might give
> >> some lead.
> >> Maybe there is a mismatch between the namespace iDs in the system and
> user
> >> itself while starting the datanode.
> >>
> >> On Fri, Mar 30, 2012 at 10:32 PM, Ben Cuthbert  >wrote:
> >>
> >>> All
> >>>
> >>> We have a master in one region and we are trying to start a slave
> datanode
> >>> in another region. When executing the scripts it looks to login to the
> >>> remote host, but
> >>> never starts the datanode. When executing hbase tho it does work. Is
> there
> >>> a timeout or something with hadoop?
> >
>


Re: mapred.map.tasks vs mapred.tasktracker.map.tasks.maximum

2012-03-09 Thread Chen He
If you do not call setNumMapTasks, the system will by default use the
number you configured for "mapred.map.tasks" in the conf/mapred-site.xml
file.

On Fri, Mar 9, 2012 at 7:19 PM, Mohit Anchlia wrote:

> What's the difference between setNumMapTasks and mapred.map.tasks?
>
> On Fri, Mar 9, 2012 at 5:00 PM, Chen He  wrote:
>
> > Hi Mohit
> >
> > " mapred.tasktracker.reduce(map).tasks.maximum " means how many
> reduce(map)
> > slot(s) you can have on each tasktracker.
> >
> > "mapred.job.reduce(maps)" means default number of reduce (map) tasks your
> > job will has.
> >
> > To set the number of mappers in your application. You can write like
> this:
> >
> > *configuration.setNumMapTasks(the number you want);*
> >
> > Chen
> >
> > Actually, you can just use configuration.set()
> >
> > On Fri, Mar 9, 2012 at 6:42 PM, Mohit Anchlia  > >wrote:
> >
> > > What's the difference between mapred.tasktracker.reduce.tasks.maximum
> and
> > > mapred.map.tasks
> > > **
> >  > I want my data to be split against only 10 mappers in the entire
> > cluster.
> > > Can I do that using one of the above parameters?
> > >
> >
>


Re: mapred.tasktracker.map.tasks.maximum not working

2012-03-09 Thread Chen He
Setting "mapred.tasktracker.map.tasks.maximum" in your job has no effect,
because the Hadoop MapReduce platform only reads this parameter when the
tasktracker starts; it is a system-level configuration.

You need to set it in the conf/mapred-site.xml file on each node and restart
your Hadoop MapReduce daemons.


On Fri, Mar 9, 2012 at 7:32 PM, Mohit Anchlia wrote:

> I have mapred.tasktracker.map.tasks.maximum set to 2 in my job and I have 5
> nodes. I was expecting this to have only 10 concurrent jobs. But I have 30
> mappers running. Does hadoop ignores this setting when supplied from the
> job?
>


Re: mapred.map.tasks vs mapred.tasktracker.map.tasks.maximum

2012-03-09 Thread Chen He
Hi Mohit

"mapred.tasktracker.map.tasks.maximum" (and its reduce counterpart) sets how
many map (reduce) slots each tasktracker offers.

"mapred.map.tasks" (and "mapred.reduce.tasks") sets the default number of
map (reduce) tasks for your job.

To set the default number of map tasks in your application, you can set it
on the JobConf:

*jobConf.setNumMapTasks(the number you want);*

Chen

Actually, you can also just call conf.set("mapred.map.tasks", ...) directly;
a short sketch of both forms is below.
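
A minimal sketch (old mapred API; the values are just examples, and note
that the map count is only a hint, since the InputFormat's splits decide the
real number):

import org.apache.hadoop.mapred.JobConf;

public class MapCountExample {
    public static JobConf configure(Class<?> driverClass) {
        JobConf conf = new JobConf(driverClass);
        // Ask for about 10 map tasks. This is only a hint: the InputFormat's
        // splits decide the real number, so it is not a hard cap.
        conf.setNumMapTasks(10);
        // Equivalent property form:
        conf.set("mapred.map.tasks", "10");
        // The reduce count, by contrast, is honored exactly:
        conf.setNumReduceTasks(4);
        return conf;
    }
}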

On Fri, Mar 9, 2012 at 6:42 PM, Mohit Anchlia wrote:

> What's the difference between mapred.tasktracker.reduce.tasks.maximum and
> mapred.map.tasks
> **
> I want my data to be split against only 10 mappers in the entire cluster.
> Can I do that using one of the above parameters?
>


Re: Incompatible namespaceIDs after formatting namenode

2012-01-15 Thread Chen He
In short, here is a script, run from your head node, that may be useful for
removing the VERSION file under the HDFS data directory on the DNs.

# datanode_hostnames.txt is a hypothetical file, one DN hostname per line
for dn in $(cat datanode_hostnames.txt); do
    ssh root@"$dn" "rm [your hdfs directory]/dfs/data/current/VERSION"
done

On Sun, Jan 15, 2012 at 7:22 AM, Uma Maheswara Rao G
wrote:

> Since you already formatted NN, why do you think dataloss if you remove
> storage directories of DNs here?
> Since you formatted the NN, new namespaceID will be generated. When DNs
> registering to it, they will have still old NamespaceID, so, it will say
> incompatible namespaceIDs. So, here currently the solution is to remove the
> storage directories of all DNs.
>
> Regards,
> Uma
>
> 
> From: gdan2000 [gdan2...@gmail.com]
> Sent: Sunday, January 15, 2012 2:15 PM
> To: core-u...@hadoop.apache.org
> Subject: Incompatible namespaceIDs after formatting namenode
>
> Hi
>
> We just started implemented hadoop on our system for the first
> time(Cloudera
> CDH3u2 )
>
> After reformatting a namenode for a few times, DataNode is not coming up
> with error "Incompatible namespaceIDs"
>
> I found a note on this
> http://pages.cs.brandeis.edu/~cs147a/lab/hadoop-troubleshooting/ but I'm
> really not sure about removing data node directories.
>
> How is it possible that data will not be lost? I have to do it on all
> datanodes...
>
> Please explain me how all this reformat tasks preserves user's data ?
>
> --
> View this message in context:
> http://old.nabble.com/Incompatible-namespaceIDs-after-formatting-namenode-tp33142065p33142065.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


Re: dual power for hadoop in datacenter?

2012-01-09 Thread Chen He
If your replication factor is 3 or more, I would suggest keeping half of
the nodes on each rack on dual power, especially nodes with larger or more
disks, as well as your namenode, resource manager, secondary namenode, and
all other master nodes.

On Mon, Jan 9, 2012 at 10:50 AM, Robert Evans  wrote:

> Be aware that if half of your cluster goes down, depending of the version
> and configuration of Hadoop, there may be a replication storm, as hadoop
> tries to bring it all back up to the proper number of replications.  Your
> cluster may still be unusable in this case.
>
> --Bobby Evans
>
> On 1/7/12 2:55 PM, "Alexander Lorenz"  wrote:
>
> NN, SN and JT must have separate power adapters; dual adapters are
> recommended for the entire cluster.
> For HBase and ZooKeeper servers / regionservers, dual adapters with
> separate power lines are also recommended.
>
> - Alex
>
> sent via my mobile device
>
> On Jan 7, 2012, at 11:23 AM, Koert Kuipers  wrote:
>
> > what are the thoughts on running a hadoop cluster in a datacenter with
> > respect to power? should all the boxes have redundant power supplies and
> be
> > on dual power? or just dual power for the namenode, secondary namenode,
> and
> > hbase master, and then perhaps switch the power source per rack for the
> > slaves to provide resilience to a power failure? or even just run
> > everything on single power and accept the risk that everything can do
> down
> > at once?
>
>