mappers-node relationship

2013-01-24 Thread jamal sasha
Hi,
A very, very lame question: does the number of mappers depend on the number
of nodes I have?
The way I imagine MapReduce is this: in the word count example, I have a
bunch of slave nodes and the documents are distributed across them.
Depending on how big the data is, it spreads across the slave nodes, and
that is how the number of mappers is decided.
I am sure this understanding is wrong, because even in pseudo-distributed
mode I can see multiple mappers.
So the question is: how does a single-node machine run multiple mappers? Do
they run in parallel or sequentially?
Any resources to learn this?
Thanks


Re: How to Backup HDFS data ?

2013-01-24 Thread Mahesh Balija
Hi Steve,

To add to Harsh's answer: besides a plain backup, there is a feature
called Snapshot offered by some third-party vendors such as MapR.
It is not really a backup, but rather a point in time that you
can revert to whenever needed.

Best,
Mahesh Balija,
CalsoftLabs.

On Fri, Jan 25, 2013 at 11:53 AM, Harsh J  wrote:

> You need some form of space capacity on the backup cluster that can
> withstand it. Lower replication (<3) may also be an option there to
> save yourself some disks/nodes?
>
> On Fri, Jan 25, 2013 at 5:04 AM, Steve Edison  wrote:
> > Backup to disks is what we do right now. Distcp would copy across HDFS
> > clusters, meaning by I will have to build another 12 node cluster ? Is
> that
> > correct ?
> >
> >
> > On Thu, Jan 24, 2013 at 3:32 PM, Mathias Herberts
> >  wrote:
> >>
> >> Backup on tape or on disk?
> >>
> >> On disk, have another Hadoop cluster dans do regular distcp.
> >>
> >> On tape, make sure you have a backup program which can backup streams
> >> so you don't have to materialize your TB files outside of your Hadoop
> >> cluster first... (I know Simpana can't do that :-().
> >>
> >> On Fri, Jan 25, 2013 at 12:29 AM, Steve Edison 
> >> wrote:
> >> > Folks,
> >> >
> >> > Its been an year and my HDFS / Solar /Hive setup is working flawless.
> >> > The
> >> > data logs which were meaningless to my business all of a sudden became
> >> > precious to the extent that our management wants to backup this data.
> I
> >> > am
> >> > talking about 20 TB of active HDFS data with an incremental of 2
> >> > TB/month.
> >> > We would like to have weekly and monthly backups upto 12 months.
> >> >
> >> > Any ideas how to do this ?
> >> >
> >> > -- Steve
> >
> >
>
>
>
> --
> Harsh J
>


Re: Copy files from remote folder to HDFS

2013-01-24 Thread Nitin Pawar
If this is a one-time activity:

just download the Hadoop binaries from Apache,
replace hdfs-site.xml and core-site.xml with the ones you have on the
Hadoop cluster,
allow this machine to connect to the Hadoop cluster,
and then you can just do it with the Hadoop command-line scripts.


On Fri, Jan 25, 2013 at 1:01 PM, Mahesh Balija
wrote:

> Hi Panshul,
>
>  I am also working on similar requirement, one approach is,
> mount your remote folder on your hadoop master node.
>  And simply write a shell script to copy the files to HDFS
> using crontab.
>
>  I believe Flume is literally a wrong choice as Flume is  a
> data collection and aggregation framework and NOT a file transfer tool and
> may NOT be a good choice when you actually want to copy the files as-is
> onto your cluster (NOT 100% sure as I am also working on that).
>
> Thanks,
> Mahesh Balija,
> CalsoftLabs.
>
> On Fri, Jan 25, 2013 at 6:39 AM, Panshul Whisper wrote:
>
>> Hello,
>>
>> I am trying to copy files, Json files from a remote folder - (a folder on
>> my local system, Cloudfiles folder or a folder on S3 server) to the HDFS of
>> a cluster running at a remote location.
>> The job submitting Application is based on Spring Hadoop.
>>
>> Can someone please suggest or point me in the right direction for best
>> option to achieve the above task:
>> 1. Use Spring Integration data pipelines to poll the folders for files
>> and copy them to the HDFS as they arrive in the source folder. - I have
>> tried to implement the solution in Spring Data book, but it does not run -
>> no idea what is wrong as it does not generate logs.
>>
>> 2. Use some other script method to transfer files.
>>
>> Main requirement, I need to transfer files from a remote folder to HDFS
>> everyday at a fixed time for processing in the hadoop cluster. These files
>> are collecting from various sources in the remote folders.
>>
>> Please suggest an efficient approach. I have been searching and finding a
>> lot of approaches but unable to decide what will work best. As this
>> transfer needs to be as fast as possible.
>> The files to be transferred will be almost 10 GB of Json files not more
>> than 6kb each file.
>>
>> Thanking You,
>>
>>
>> --
>>  Regards,
>> Ouch Whisper
>> 010101010101
>>
>
>


-- 
Nitin Pawar


Re: How to Backup HDFS data ?

2013-01-24 Thread Ted Dunning
Incremental backups are nice to avoid copying all your data again.

You can code these at the application layer if you have nice partitioning
and keep track correctly.

You can also use platform level capabilities such as provided for by the
MapR distribution.
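
For illustration only, a minimal Java sketch of that application-layer idea, assuming per-partition directories and a simple "copy what changed since the last run" rule (the NameNode URIs, the /data/events layout and the timestamp bookkeeping are all made-up placeholders, not anything prescribed in this thread):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class IncrementalBackup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem srcFs = FileSystem.get(URI.create("hdfs://prod-nn:8020"), conf);
    FileSystem dstFs = FileSystem.get(URI.create("hdfs://backup-nn:8020"), conf);

    // Epoch millis of the previous successful backup, tracked by the caller.
    long lastBackupTime = Long.parseLong(args[0]);

    // Copy only the partition directories modified since the last run.
    for (FileStatus part : srcFs.listStatus(new Path("/data/events"))) {
      if (part.isDir() && part.getModificationTime() > lastBackupTime) {
        FileUtil.copy(srcFs, part.getPath(),
            dstFs, new Path("/backup/events", part.getPath().getName()),
            false /* keep source */, conf);
      }
    }
  }
}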

On Fri, Jan 25, 2013 at 3:23 PM, Harsh J  wrote:

> You need some form of space capacity on the backup cluster that can
> withstand it. Lower replication (<3) may also be an option there to
> save yourself some disks/nodes?
>
> On Fri, Jan 25, 2013 at 5:04 AM, Steve Edison  wrote:
> > Backup to disks is what we do right now. Distcp would copy across HDFS
> > clusters, meaning by I will have to build another 12 node cluster ? Is
> that
> > correct ?
> >
> >
> > On Thu, Jan 24, 2013 at 3:32 PM, Mathias Herberts
> >  wrote:
> >>
> >> Backup on tape or on disk?
> >>
> >> On disk, have another Hadoop cluster dans do regular distcp.
> >>
> >> On tape, make sure you have a backup program which can backup streams
> >> so you don't have to materialize your TB files outside of your Hadoop
> >> cluster first... (I know Simpana can't do that :-().
> >>
> >> On Fri, Jan 25, 2013 at 12:29 AM, Steve Edison 
> >> wrote:
> >> > Folks,
> >> >
> >> > Its been an year and my HDFS / Solar /Hive setup is working flawless.
> >> > The
> >> > data logs which were meaningless to my business all of a sudden became
> >> > precious to the extent that our management wants to backup this data.
> I
> >> > am
> >> > talking about 20 TB of active HDFS data with an incremental of 2
> >> > TB/month.
> >> > We would like to have weekly and monthly backups upto 12 months.
> >> >
> >> > Any ideas how to do this ?
> >> >
> >> > -- Steve
> >
> >
>
>
>
> --
> Harsh J
>


Re: Copy files from remote folder to HDFS

2013-01-24 Thread Mahesh Balija
Hi Panshul,

I am also working on a similar requirement. One approach is to
mount your remote folder on your Hadoop master node
and simply write a shell script, driven by crontab, to copy the files
to HDFS.
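
Mahesh's suggestion is a cron-driven shell script; purely as an alternative sketch of the same copy step in Java (the mount point, NameNode URI and target directory below are made-up placeholders):

import java.io.File;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed NameNode address; use the value from the cluster's core-site.xml.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    File localDir = new File("/mnt/remote-json");      // the mounted remote folder
    Path target = new Path("/user/panshul/incoming");  // HDFS landing directory

    File[] files = localDir.listFiles();
    if (files != null) {
      for (File f : files) {
        if (f.isFile()) {
          // copyFromLocalFile(delSrc, overwrite, src, dst)
          fs.copyFromLocalFile(false, true, new Path(f.getAbsolutePath()), target);
        }
      }
    }
    fs.close();
  }
}

Whether this runs as a Java class or as the shell script described above is just a matter of taste; cron drives it either way.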

I believe Flume is the wrong choice here: Flume is a
data collection and aggregation framework and NOT a file transfer tool, and
may NOT be a good choice when you actually want to copy the files as-is
onto your cluster (NOT 100% sure, as I am also still working on that).

Thanks,
Mahesh Balija,
CalsoftLabs.

On Fri, Jan 25, 2013 at 6:39 AM, Panshul Whisper wrote:

> Hello,
>
> I am trying to copy files, Json files from a remote folder - (a folder on
> my local system, Cloudfiles folder or a folder on S3 server) to the HDFS of
> a cluster running at a remote location.
> The job submitting Application is based on Spring Hadoop.
>
> Can someone please suggest or point me in the right direction for best
> option to achieve the above task:
> 1. Use Spring Integration data pipelines to poll the folders for files and
> copy them to the HDFS as they arrive in the source folder. - I have tried
> to implement the solution in Spring Data book, but it does not run - no
> idea what is wrong as it does not generate logs.
>
> 2. Use some other script method to transfer files.
>
> Main requirement, I need to transfer files from a remote folder to HDFS
> everyday at a fixed time for processing in the hadoop cluster. These files
> are collecting from various sources in the remote folders.
>
> Please suggest an efficient approach. I have been searching and finding a
> lot of approaches but unable to decide what will work best. As this
> transfer needs to be as fast as possible.
> The files to be transferred will be almost 10 GB of Json files not more
> than 6kb each file.
>
> Thanking You,
>
>
> --
> Regards,
> Ouch Whisper
> 010101010101
>


Re: How to Backup HDFS data ?

2013-01-24 Thread Harsh J
You need some form of space capacity on the backup cluster that can
withstand it. Lower replication (<3) may also be an option there to
save yourself some disks/nodes?

On Fri, Jan 25, 2013 at 5:04 AM, Steve Edison  wrote:
> Backup to disks is what we do right now. Distcp would copy across HDFS
> clusters, meaning by I will have to build another 12 node cluster ? Is that
> correct ?
>
>
> On Thu, Jan 24, 2013 at 3:32 PM, Mathias Herberts
>  wrote:
>>
>> Backup on tape or on disk?
>>
>> On disk, have another Hadoop cluster dans do regular distcp.
>>
>> On tape, make sure you have a backup program which can backup streams
>> so you don't have to materialize your TB files outside of your Hadoop
>> cluster first... (I know Simpana can't do that :-().
>>
>> On Fri, Jan 25, 2013 at 12:29 AM, Steve Edison 
>> wrote:
>> > Folks,
>> >
>> > Its been an year and my HDFS / Solar /Hive setup is working flawless.
>> > The
>> > data logs which were meaningless to my business all of a sudden became
>> > precious to the extent that our management wants to backup this data. I
>> > am
>> > talking about 20 TB of active HDFS data with an incremental of 2
>> > TB/month.
>> > We would like to have weekly and monthly backups upto 12 months.
>> >
>> > Any ideas how to do this ?
>> >
>> > -- Steve
>
>



-- 
Harsh J


Re: Filesystem closed exception

2013-01-24 Thread Harsh J
It is pretty much the same in 0.20.x as well, IIRC. Your two points
are also correct (for a fix to this). Also see:
https://issues.apache.org/jira/browse/HADOOP-7973.
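
A minimal sketch of the second option Hemanth lists (an uncached instance the task code can safely close), using the fs.hdfs.impl.disable.cache property mentioned above; the side-file path is just a placeholder:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SideFileMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = new Configuration(context.getConfiguration());
    // Ask for a non-cached FileSystem so closing it cannot break the framework's cached handle.
    conf.setBoolean("fs.hdfs.impl.disable.cache", true);
    FileSystem fs = FileSystem.get(conf);
    try {
      fs.exists(new Path("/tmp/side-data"));  // example use of the private handle
    } finally {
      fs.close();  // safe only because this instance is uncached
    }
  }
}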

On Fri, Jan 25, 2013 at 6:56 AM, Hemanth Yamijala
 wrote:
> Hi,
>
> We are noticing a problem where we get a filesystem closed exception when a
> map task is done and is finishing execution. By map task, I literally mean
> the MapTask class of the map reduce code. Debugging this we found that the
> mapper is getting a handle to the filesystem object and itself calling a
> close on it. Because filesystem objects are cached, I believe the behaviour
> is as expected in terms of the exception.
>
> I just wanted to confirm that:
>
> - if we do have a requirement to use a filesystem object in a mapper or
> reducer, we should either not close it ourselves
> - or (seems better to me) ask for a new version of the filesystem instance
> by setting the fs.hdfs.impl.disable.cache property to true in job
> configuration.
>
> Also, does anyone know if this behaviour was any different in Hadoop 0.20 ?
>
> For some context, this behaviour is actually seen in Oozie, which runs a
> launcher mapper for a simple java action. Hence, the java action could very
> well interact with a file system. I know this is probably better addressed
> in Oozie context, but wanted to get the map reduce view of things.
>
>
> Thanks,
> Hemanth



-- 
Harsh J


Re: UTF16

2013-01-24 Thread Nitin Pawar
Hive by default supports UTF-8.

I am not sure about UTF-16;
you can refer to https://issues.apache.org/jira/browse/HIVE-2859
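
If the files do need to be re-encoded to UTF-8 before loading (an assumption; whether that is required depends on how HIVE-2859 applies to your version), a plain Java conversion sketch:

import java.io.*;
import java.nio.charset.Charset;

public class Utf16ToUtf8 {
  public static void main(String[] args) throws IOException {
    // args[0] = UTF-16 input file, args[1] = UTF-8 output file
    BufferedReader in = new BufferedReader(new InputStreamReader(
        new FileInputStream(args[0]), Charset.forName("UTF-16")));
    Writer out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(args[1]), Charset.forName("UTF-8")));
    try {
      char[] buf = new char[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);  // decode UTF-16, re-encode as UTF-8
      }
    } finally {
      in.close();
      out.close();
    }
  }
}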


On Fri, Jan 25, 2013 at 10:23 AM, Koert Kuipers  wrote:

> is it safe to upload UTF16 encoded (unicode) text files to hadoop for
> processing by map-reduce, hive, pig, etc?
> thanks! koert
>



-- 
Nitin Pawar


MapFileOutputFormat class is not found in hadoop-core 1.1.1

2013-01-24 Thread feng lu
Hi all,

I want to migrate the Nutch WebGraph class to the new MapReduce API
(https://issues.apache.org/jira/browse/NUTCH-1223), but I found that the
MapFileOutputFormat class is not in the
org.apache.hadoop.mapreduce.lib.output package in hadoop-core-1.1.1.

Does anyone know why this class is missing from hadoop-core-1.1.1? Or will
it be added in a later version?

Thanks.

-- 
Don't Grow Old, Grow Up... :-)


Re: Copy files from remote folder to HDFS

2013-01-24 Thread Mohammad Tariq
Hello Panshul,

  You might find Flume useful.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Fri, Jan 25, 2013 at 6:39 AM, Panshul Whisper wrote:

> Hello,
>
> I am trying to copy files, Json files from a remote folder - (a folder on
> my local system, Cloudfiles folder or a folder on S3 server) to the HDFS of
> a cluster running at a remote location.
> The job submitting Application is based on Spring Hadoop.
>
> Can someone please suggest or point me in the right direction for best
> option to achieve the above task:
> 1. Use Spring Integration data pipelines to poll the folders for files and
> copy them to the HDFS as they arrive in the source folder. - I have tried
> to implement the solution in Spring Data book, but it does not run - no
> idea what is wrong as it does not generate logs.
>
> 2. Use some other script method to transfer files.
>
> Main requirement, I need to transfer files from a remote folder to HDFS
> everyday at a fixed time for processing in the hadoop cluster. These files
> are collecting from various sources in the remote folders.
>
> Please suggest an efficient approach. I have been searching and finding a
> lot of approaches but unable to decide what will work best. As this
> transfer needs to be as fast as possible.
> The files to be transferred will be almost 10 GB of Json files not more
> than 6kb each file.
>
> Thanking You,
>
>
> --
>  Regards,
> Ouch Whisper
> 010101010101
>


UTF16

2013-01-24 Thread Koert Kuipers
is it safe to upload UTF16 encoded (unicode) text files to hadoop for
processing by map-reduce, hive, pig, etc?
thanks! koert


Re: Problems

2013-01-24 Thread ke yuan
Is there anything hardware-related going on? I hit this problem on a ThinkPad
T430, but on about 100 other machines there is no problem at all. All the
machines run Red Hat 6.0, with JDKs ranging from 1.5 to 1.6, so I think this
has something to do with the hardware. Any ideas?

2013/1/22 Jean-Marc Spaggiari 

> Hi Sean,
>
> Will you be able to run the memtest86 on this VM? Maybe it's an issue
> with the way the VM is managing the memory?
>
> I ran HBase+Hadoop on a desktop with only 1.5G. So you should not have
> any issue with 6GB.
>
> I don't think the issue you are facing is related to hadoop. Can you
> try to run a simple Java application in you JVM? Something which will
> use lot of memory. And see if it works?
>
> JM
>
> 2013/1/22, Sean Hudson :
> > Hi Jean-Marc,
> > The Linux machine on which I am attempting to get
> > Hadoop running is actually Linux running in a VM partition. This VM
> > partition had 2 Gigs of RAM when I first encountered the problem. This
> RAM
> > allocation has been bumped up to 6 Gigs, but the problem still persists,
> > i.e
> >
> > bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
>  still
> > crashes out as before.
> >
> > Is there a minimum RAM size requirement?
> > Will Hadoop run correctly on Linux in a VM partition?
> >
> > I had attempted to run Hadoop in
> Pseudo-Distributed
> >
> > Operation mode and this included modifying the conf/core-site.xml,
> > conf/hdfs-site.xml and the conf/mapred-site.xml files as per the Quick
> Start
> >
> > instructions. I also formatted a new distributed-filesystem as per the
> > instructions. To re-test in Standalone mode with 6 Gigs of RAM, I
> reversed
> > the changes to the above three .xml files in /conf. However, I don't see
> a
> > way to back-out the distributed-filesystem. Will the existence of this
> > distributed-filesystem interfere with my Standalone tests?
> >
> > Regards,
> >
> > Sean Hudson
> >
> > -Original Message-
> > From: Jean-Marc Spaggiari
> > Sent: Friday, January 18, 2013 3:24 PM
> > To: user@hadoop.apache.org
> > Subject: Re: Problems
> >
> > Hi Sean,
> >
> > It's strange. You should not faced that.  I faced same kind of issues
> > on a desktop with memory errors. Can you install memtest86 and fullty
> > test your memory (one pass is enought) to make sure you don't have
> > issues on that side?
> >
> > 2013/1/18, Sean Hudson :
> >> Leo,
> >> I downloaded the suggested 1.6.0_32 Java version to my home
> >> directory, but I am still experiencing the same problem (See error
> >> below).
> >> The only thing that I have set in my hadoop-env.sh file is the JAVA_HOME
> >> environment variable. I have also tried it with the Java directory added
> >> to
> >>
> >> PATH.
> >>
> >> export JAVA_HOME=/home/shu/jre1.6.0_32
> >> export PATH=$PATH:/home/shu/jre1.6.0_32
> >>
> >> Every other environment variable is defaulted.
> >>
> >> Just to clarify, I have tried this in Local Standalone mode and also in
> >> Pseudo-Distributed Mode with the same result.
> >>
> >> Frustrating to say the least,
> >>
> >> Sean Hudson
> >>
> >>
> >> shu@meath-nua:~/hadoop-1.0.4> bin/hadoop jar hadoop-examples-1.0.4.jar
> >> grep
> >>
> >> input output 'dfs[a-z.]+'
> >> #
> >> # A fatal error has been detected by the Java Runtime Environment:
> >> #
> >> #  SIGFPE (0x8) at pc=0xb7fc51fb, pid=23112, tid=3075554208
> >> #
> >> # JRE version: 6.0_32-b05
> >> # Java VM: Java HotSpot(TM) Client VM (20.7-b02 mixed mode, sharing
> >> linux-x86 )
> >> # Problematic frame:
> >> # C  [ld-linux.so.2+0x91fb]  double+0xab
> >> #
> >> # An error report file with more information is saved as:
> >> # /home/shu/hadoop-1.0.4/hs_err_pid23112.log
> >> #
> >> # If you would like to submit a bug report, please visit:
> >> #   http://java.sun.com/webapps/bugreport/crash.jsp
> >> # The crash happened outside the Java Virtual Machine in native code.
> >> # See problematic frame for where to report the bug.
> >> #
> >> Aborted
> >>
> >> -Original Message-
> >> From: Leo Leung
> >> Sent: Thursday, January 17, 2013 6:46 PM
> >> To: user@hadoop.apache.org
> >> Subject: RE: Problems
> >>
> >> Use Sun/Oracle  1.6.0_32+   Build should be 20.7-b02+
> >>
> >> 1.7 causes failure and AFAIK,  not supported,  but you are free to try
> >> the
> >> latest version and report back.
> >>
> >>
> >>
> >> -Original Message-
> >> From: Sean Hudson [mailto:sean.hud...@ostiasolutions.com]
> >> Sent: Thursday, January 17, 2013 6:57 AM
> >> To: user@hadoop.apache.org
> >> Subject: Re: Problems
> >>
> >> Hi,
> >>   My Java version is
> >>
> >> java version "1.6.0_25"
> >> Java(TM) SE Runtime Environment (build 1.6.0_25-b06) Java HotSpot(TM)
> >> Client
> >>
> >> VM (build 20.0-b11, mixed mode, sharing)
> >>
> >> Would you advise obtaining a later Java version?
> >>
> >> Sean
> >>
> >> -Original Message-
> >> From: Jean-Marc Spaggiari
> >> Sent: Thursday, January 17, 2013 2:52 PM
> >> To: user@h

Re: Spring for hadoop

2013-01-24 Thread Radim Kolar

Dne 23.1.2013 22:55, Panshul Whisper napsal(a):

Hello Radim,

Your solution sounds interesting. Is it possible for me to try the
solution before I buy it?

I do not ship a demo version since it would be identical to the production
version. I do onsite presentations with a live demo, and you will get all
the code examples used. It is possible to code something simple on custom
demand during the presentation; for example, I got a request to show how to
use a Hadoop sequence file for storing data extracted from JMS messages
received by Spring Integration.




Re: hdfs du periodicity and hdfs not respond at that time

2013-01-24 Thread Xibin Liu
I'm using ext3, and using df instead of du is a good way to solve this
problem. Thank you all.

2013/1/24 Harsh J 

> I missed the periodicity part of your question. Unfortunately the "du"
> refresh interval is hard-coded today, although the "df" interval is
> configurable. Perhaps this is a bug - I filed
> https://issues.apache.org/jira/browse/HADOOP-9241 to make it configurable.
>
> Also, your problem reminded me of a similar issue my team and I faced
> once before, and
>
> http://allthingshadoop.com/2011/05/20/faster-datanodes-with-less-wait-io-using-df-instead-of-du/
> helped us temporarily there. Perhaps it may help you as well, its a
> good tip.
>
> On Thu, Jan 24, 2013 at 1:24 PM, Xibin Liu  wrote:
> > Thanks, http://search-hadoop.com/m/LLBgUiH0Bg2 is my issue , but I still
> > dont't know how to solve this problem, 3 minutes not respond once an hour
> > is a big problem for me, any clue for this?
> >
> >
> > 2013/1/24 Harsh J 
> >>
> >> Hi,
> >>
> >> HDFS does this to estimate space reports. Perhaps the discussion here
> >> may help you: http://search-hadoop.com/m/LLBgUiH0Bg2
> >>
> >> On Thu, Jan 24, 2013 at 12:51 PM, Xibin Liu 
> >> wrote:
> >> > hi all,
> >> > I found hdfs du periodicity(one hour), and because my disk is big, the
> >> > smallest one is 15T, so when hdfs exec du, datanode will not respond
> for
> >> > about 3 minuts because of io loading, this cause a lot of problem,
> >> > anybody
> >> > knows why hdfs doing this and how to disable it?
> >> >
> >> > --
> >> > Thanks &  Regards
> >> > Xibin Liu
> >> >
> >>
> >>
> >>
> >> --
> >> Harsh J
> >
> >
> >
> >
> > --
> > Thanks & Best Regards
> > Xibin Liu
> >
>
>
>
> --
> Harsh J
>



-- 
Thanks & Best Regards
Xibin Liu


Re: Moving data in hadoop

2013-01-24 Thread Raj hadoop
Thank you. I am looking into it now.

On Fri, Jan 25, 2013 at 7:27 AM, Mohit Anchlia wrote:

> Have you looked at distcp?
>
>
> On Thu, Jan 24, 2013 at 5:55 PM, Raj hadoop  wrote:
>
>> Hi,
>>
>> Can you please suggest me what is the good way to move 1 peta byte of
>> data from one cluster to another cluster?
>>
>> Thanks
>> Raj
>>
>
>


unsubscribe

2013-01-24 Thread 徐永睿


2013-01-25

Any streaming options that specify data type of key value pairs?

2013-01-24 Thread Yuncong Chen
Hi,

What would be the best way in Hadoop streaming to send a binary object (e.g. a
Python dict or array) as the value in <key, value> pairs?

I know I can dump the object to a string with pickle.dumps() and encode it to
eliminate unintended '\t' and '\n' characters before sending it to stdout, but
I wonder if there are native streaming options that specify the data type of
<key, value> pairs?

Thanks,
Yuncong

Re: Moving data in hadoop

2013-01-24 Thread Mohit Anchlia
Have you looked at distcp?

On Thu, Jan 24, 2013 at 5:55 PM, Raj hadoop  wrote:

> Hi,
>
> Can you please suggest me what is the good way to move 1 peta byte of data
> from one cluster to another cluster?
>
> Thanks
> Raj
>


Moving data in hadoop

2013-01-24 Thread Raj hadoop
Hi,

Can you please suggest me what is the good way to move 1 peta byte of data
from one cluster to another cluster?

Thanks
Raj


Re: Help with DataDrivenDBInputFormat: splits are created properly but zero records are sent to the mappers

2013-01-24 Thread Stephen Boesch
It turns out to be an apparent problem with one of the two
DataDrivenDBInputFormat.setInput() methods. The version I used does not work
as shown: it needs to have a primary key column set somehow, but I could find
no information / documentation on how to set that column. So I converted to
using the other setInput() method, as follows:

DataDrivenDBInputFormat.setInput(job, DBTextWritable.class,
  APP_DETAILS_CRAWL_QUEUE_V, null, "id", "id");

Now this is working.




2013/1/24 Stephen Boesch 

>
> I have made an attempt to implement a job using DataDrivenDBInputFormat.
> The result is that the input splits are created successfully with 45K
> records apeice, but zero records are then actually sent to the mappers.
>
> If anyone can point to working example(s) of using DataDrivenDBInputFormat
> it would be much appreciated.
>
>
> Here are further details of my attempt:
>
>
> DBConfiguration.configureDB(job.getConfiguration(), props.getDriver(),
> props.getUrl(), props.getUser(), props.getPassword());
> // Note: i also include code here to verify able to get
> java.sql.Connection using the above props..
>
> DataDrivenDBInputFormat.setInput(job,
>   DBLongWritable.class,
>   "select id,status from app_detail_active_crawl_queue_v where " +
>  DataDrivenDBInputFormat.SUBSTITUTE_TOKEN,
>   "SELECT MIN(id),MAX(id) FROM app_detail_active_crawl_queue_v ");
> // I verified by stepping with debugger that the input query were
> successfully applied by DataDrivenDBInputFormat to create two splits of 40K
> records each
> );
>
> ..   ..
> // Register a custom DBLongWritable class
>   static {
> WritableComparator.define(DBLongWritable.class, new
> DBLongWritable.DBLongKeyComparator());
> int x  = 1;
>   }
>
>
> Here is the job output. No rows were processed (even though 90K rows were
> identified in the INputSplits phase and divided into two 45K splits..So why
> were the input splits not processed?
>
> [Thu Jan 24 12:19:59] Successfully connected to
> driver=com.mysql.jdbc.Driver url=jdbc:mysql://localhost:3306/classint
> user=stephenb
> [Thu Jan 24 12:19:59] select id,status from
> app_detail_active_crawl_queue_v where $CONDITIONS
> 13/01/24 12:20:03 INFO mapred.JobClient: Running job: job_201301102125_0069
> 13/01/24 12:20:05 INFO mapred.JobClient:  map 0% reduce 0%
> 13/01/24 12:20:22 INFO mapred.JobClient:  map 50% reduce 0%
> 13/01/24 12:20:25 INFO mapred.JobClient:  map 100% reduce 0%
> 13/01/24 12:20:30 INFO mapred.JobClient: Job complete:
> job_201301102125_0069
> 13/01/24 12:20:30 INFO mapred.JobClient: Counters: 17
> 13/01/24 12:20:30 INFO mapred.JobClient:   Job Counters
> 13/01/24 12:20:30 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=21181
> 13/01/24 12:20:30 INFO mapred.JobClient: Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 13/01/24 12:20:30 INFO mapred.JobClient: Total time spent by all maps
> waiting after reserving slots (ms)=0
> 13/01/24 12:20:30 INFO mapred.JobClient: Launched map tasks=2
> 13/01/24 12:20:30 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
> 13/01/24 12:20:30 INFO mapred.JobClient:   File Output Format Counters
> 13/01/24 12:20:30 INFO mapred.JobClient: Bytes Written=0
> 13/01/24 12:20:30 INFO mapred.JobClient:   FileSystemCounters
> 13/01/24 12:20:30 INFO mapred.JobClient: HDFS_BYTES_READ=215
> 13/01/24 12:20:30 INFO mapred.JobClient: FILE_BYTES_WRITTEN=44010
> 13/01/24 12:20:30 INFO mapred.JobClient:   File Input Format Counters
> 13/01/24 12:20:30 INFO mapred.JobClient: Bytes Read=0
> 13/01/24 12:20:30 INFO mapred.JobClient:   Map-Reduce Framework
> 13/01/24 12:20:30 INFO mapred.JobClient: Map input records=0
> 13/01/24 12:20:30 INFO mapred.JobClient: Physical memory (bytes)
> snapshot=200056832
> 13/01/24 12:20:30 INFO mapred.JobClient: Spilled Records=0
> 13/01/24 12:20:30 INFO mapred.JobClient: CPU time spent (ms)=2960
> 13/01/24 12:20:30 INFO mapred.JobClient: Total committed heap usage
> (bytes)=247201792
> 13/01/24 12:20:30 INFO mapred.JobClient: Virtual memory (bytes)
> snapshot=4457689088
> 13/01/24 12:20:30 INFO mapred.JobClient: Map output records=0
> 13/01/24 12:20:30 INFO mapred.JobClient: SPLIT_RAW_BYTES=215
>
>


Filesystem closed exception

2013-01-24 Thread Hemanth Yamijala
Hi,

We are noticing a problem where we get a filesystem closed exception when a
map task is done and is finishing execution. By map task, I literally mean
the MapTask class of the map reduce code. Debugging this we found that the
mapper is getting a handle to the filesystem object and itself calling a
close on it. Because filesystem objects are cached, I believe the behaviour
is as expected in terms of the exception.

I just wanted to confirm that:

- if we do have a requirement to use a filesystem object in a mapper or
reducer, we should either not close it ourselves
- or (seems better to me) ask for a new version of the filesystem instance
by setting the fs.hdfs.impl.disable.cache property to true in job
configuration.

Also, does anyone know if this behaviour was any different in Hadoop 0.20 ?

For some context, this behaviour is actually seen in Oozie, which runs a
launcher mapper for a simple java action. Hence, the java action could very
well interact with a file system. I know this is probably better addressed
in Oozie context, but wanted to get the map reduce view of things.


Thanks,
Hemanth


Copy files from remote folder to HDFS

2013-01-24 Thread Panshul Whisper
Hello,

I am trying to copy files (JSON files) from a remote folder (a folder on
my local system, a Cloudfiles folder, or a folder on an S3 server) to the HDFS
of a cluster running at a remote location.
The job-submitting application is based on Spring Hadoop.

Can someone please suggest, or point me in the right direction towards, the
best option to achieve the above task:
1. Use Spring Integration data pipelines to poll the folders for files and
copy them to HDFS as they arrive in the source folder. I have tried to
implement the solution in the Spring Data book, but it does not run, and I
have no idea what is wrong as it does not generate logs.

2. Use some other scripted method to transfer files.

Main requirement: I need to transfer files from a remote folder to HDFS every
day at a fixed time for processing in the Hadoop cluster. These files are
collected from various sources into the remote folders.

Please suggest an efficient approach. I have been searching and finding a lot
of approaches but am unable to decide which will work best, as this transfer
needs to be as fast as possible.
The files to be transferred will total almost 10 GB of JSON files, no file
larger than 6 KB.

Thanking You,


-- 
Regards,
Ouch Whisper
010101010101


Re: How to Backup HDFS data ?

2013-01-24 Thread Steve Edison
Backup to disks is what we do right now. Distcp would copy across HDFS
clusters, meaning I will have to build another 12-node cluster? Is that
correct?


On Thu, Jan 24, 2013 at 3:32 PM, Mathias Herberts <
mathias.herbe...@gmail.com> wrote:

> Backup on tape or on disk?
>
> On disk, have another Hadoop cluster dans do regular distcp.
>
> On tape, make sure you have a backup program which can backup streams
> so you don't have to materialize your TB files outside of your Hadoop
> cluster first... (I know Simpana can't do that :-().
>
> On Fri, Jan 25, 2013 at 12:29 AM, Steve Edison 
> wrote:
> > Folks,
> >
> > Its been an year and my HDFS / Solar /Hive setup is working flawless. The
> > data logs which were meaningless to my business all of a sudden became
> > precious to the extent that our management wants to backup this data. I
> am
> > talking about 20 TB of active HDFS data with an incremental of 2
> TB/month.
> > We would like to have weekly and monthly backups upto 12 months.
> >
> > Any ideas how to do this ?
> >
> > -- Steve
>


Re: How to Backup HDFS data ?

2013-01-24 Thread Mathias Herberts
Backup on tape or on disk?

On disk, have another Hadoop cluster and do regular distcp.

On tape, make sure you have a backup program which can backup streams
so you don't have to materialize your TB files outside of your Hadoop
cluster first... (I know Simpana can't do that :-().

On Fri, Jan 25, 2013 at 12:29 AM, Steve Edison  wrote:
> Folks,
>
> Its been an year and my HDFS / Solar /Hive setup is working flawless. The
> data logs which were meaningless to my business all of a sudden became
> precious to the extent that our management wants to backup this data. I am
> talking about 20 TB of active HDFS data with an incremental of 2 TB/month.
> We would like to have weekly and monthly backups upto 12 months.
>
> Any ideas how to do this ?
>
> -- Steve


How to Backup HDFS data ?

2013-01-24 Thread Steve Edison
Folks,

It's been a year and my HDFS / Solr / Hive setup is working flawlessly. The
data logs which were meaningless to my business all of a sudden became
precious, to the extent that our management wants to back up this data. I am
talking about 20 TB of active HDFS data with an increment of 2 TB/month.
We would like to have weekly and monthly backups kept for up to 12 months.

Any ideas how to do this ?

-- Steve


Error after installing Hadoop-1.0.4

2013-01-24 Thread Deepti Garg
Hi,

 

I installed Hadoop 1.0.4 from this link on my Windows 7 machine using
cygwin:

http://hadoop.apache.org/docs/r1.0.4/single_node_setup.html

 

I am getting an error on running 

bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'

 

Error:

cd: pattern "--" not found in "c:/cloudera/hadoop-1.0.4"

Warning: $HADOOP_HOME is deprecated.

 

java.lang.NoClassDefFoundError: org/apache/hadoop/util/PlatformName

Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.util.PlatformName

at java.net.URLClassLoader$1.run(URLClassLoader.java:202)

at java.security.AccessController.doPrivileged(Native Method)

at java.net.URLClassLoader.findClass(URLClassLoader.java:190)

at java.lang.ClassLoader.loadClass(ClassLoader.java:307)

at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)

at java.lang.ClassLoader.loadClass(ClassLoader.java:248)

Could not find the main class: org.apache.hadoop.util.PlatformName.  Program
will exit.

Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/hadoop/util/RunJar

Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.RunJar

at java.net.URLClassLoader$1.run(URLClassLoader.java:202)

at java.security.AccessController.doPrivileged(Native Method)

at java.net.URLClassLoader.findClass(URLClassLoader.java:190)

at java.lang.ClassLoader.loadClass(ClassLoader.java:307)

at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)

at java.lang.ClassLoader.loadClass(ClassLoader.java:248)

Could not find the main class: org.apache.hadoop.util.RunJar.  Program will
exit.

Exception in thread "main"

 

I read in some other posts to add "CLASSPATH=`cygpath -wp "$CLASSPATH"`" to
hadoop-config.sh, but the error did not go away after adding it. 

 

I get this on running bin/hadoop:

cd: pattern "--" not found in "c:/cloudera/hadoop-1.0.4"

Warning: $HADOOP_HOME is deprecated.

 

Usage: hadoop [--config confdir] COMMAND

where COMMAND is one of:

  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  mradmin              run a Map-Reduce admin client
  fsck                 run a DFS filesystem checking utility
  fs                   run a generic filesystem user client
  balancer             run a cluster balancing utility
  fetchdt              fetch a delegation token from the NameNode
  jobtracker           run the MapReduce job Tracker node
  pipes                run a Pipes job
  tasktracker          run a MapReduce task Tracker node
  historyserver        run job history servers as a standalone daemon
  job                  manipulate MapReduce jobs
  queue                get information regarding JobQueues
  version              print the version
  jar <jar>            run a jar file
  distcp <srcurl> <desturl>   copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest>   create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME

Most commands print help when invoked w/o parameters.

 

Any help is appreciated.

 

Thanks.,

Deepti



unsubscribe

2013-01-24 Thread Luiz Fernando Figueiredo




Re: Submitting MapReduce job from remote server using JobClient

2013-01-24 Thread bejoy . hadoop
Hi Amit,

Apart from the Hadoop jars, do you also have the same config files
($HADOOP_HOME/conf) on your analytics server as on the cluster?

If you have the default config files on the analytics server, then your MR job
will run locally and not on the cluster.
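
In other words, the client-side Configuration must point at the cluster. A minimal sketch using the usual Hadoop 1.x property names (the host names are placeholders):

import org.apache.hadoop.conf.Configuration;

public class ClusterConf {
  public static Configuration create() {
    Configuration conf = new Configuration();
    // Without these (or the cluster's *-site.xml files on the client classpath),
    // the client falls back to the LocalJobRunner and you see "job_local_..." job IDs.
    conf.set("fs.default.name", "hdfs://namenode-host:8020");  // placeholder host
    conf.set("mapred.job.tracker", "jobtracker-host:8021");    // placeholder host
    return conf;
  }
}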

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Amit Sela 
Date: Thu, 24 Jan 2013 18:15:49 
To: 
Reply-To: user@hadoop.apache.org
Subject: Re: Submitting MapReduce job from remote server using JobClient

Hi Harsh,
I'm using Job.waitForCompletion() method to run the job but I can't see it
in the webapp and it doesn't seem to finish...
I get:
org.apache.hadoop.mapred.JobClient - Running job: job_local_0001
INFO  org.apache.hadoop.util.ProcessTree - setsid exited with exit code 0
2013-01-24 08:10:12.521 [Thread-51] INFO  org.apache.hadoop.mapred.Task - Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@7db1be6
2013-01-24 08:10:12.536 [Thread-51] INFO  org.apache.hadoop.mapred.MapTask - io.sort.mb = 100
2013-01-24 08:10:12.573 [Thread-51] INFO  org.apache.hadoop.mapred.MapTask - data buffer = 79691776/99614720
2013-01-24 08:10:12.573 [Thread-51] INFO  org.apache.hadoop.mapred.MapTask - record buffer = 262144/327680
2013-01-24 08:10:12.599 [Thread-51] INFO  org.apache.hadoop.mapred.MapTask - Starting flush of map output
2013-01-24 08:10:12.608 [Thread-51] INFO  org.apache.hadoop.mapred.Task - Task:attempt_local_0001_m_00_0 is done. And is in the process of commiting
2013-01-24 08:10:13.348 [org.springframework.scheduling.quartz.SchedulerFactoryBean#0_Worker-1] INFO  org.apache.hadoop.mapred.JobClient -  map 0% reduce 0%
2013-01-24 08:10:15.509 [Thread-51] INFO  org.apache.hadoop.mapred.LocalJobRunner -
2013-01-24 08:10:15.510 [Thread-51] INFO  org.apache.hadoop.mapred.Task - Task 'attempt_local_0001_m_00_0' done.
2013-01-24 08:10:15.511 [Thread-51] INFO  org.apache.hadoop.mapred.Task - Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@6b02b23d
2013-01-24 08:10:15.512 [Thread-51] INFO  org.apache.hadoop.mapred.MapTask - io.sort.mb = 100
2013-01-24 08:10:15.549 [Thread-51] INFO  org.apache.hadoop.mapred.MapTask - data buffer = 79691776/99614720
2013-01-24 08:10:15.550 [Thread-51] INFO  org.apache.hadoop.mapred.MapTask - record buffer = 262144/327680
2013-01-24 08:10:15.557 [Thread-51] INFO  org.apache.hadoop.mapred.MapTask - Starting flush of map output
2013-01-24 08:10:15.560 [Thread-51] INFO  org.apache.hadoop.mapred.Task - Task:attempt_local_0001_m_01_0 is done. And is in the process of commiting
2013-01-24 08:10:16.358 [org.springframework.scheduling.quartz.SchedulerFactoryBean#0_Worker-1] INFO  org.apache.hadoop.mapred.JobClient -  map 100% reduce 0%

And after that, instead of going to the Reduce phase I keep getting map
attempts like:

INFO  org.apache.hadoop.mapred.MapTask - io.sort.mb = 100
2013-01-24 08:10:21.563 [Thread-51] INFO  org.apache.hadoop.mapred.MapTask - data buffer = 79691776/99614720
2013-01-24 08:10:21.563 [Thread-51] INFO  org.apache.hadoop.mapred.MapTask - record buffer = 262144/327680
2013-01-24 08:10:21.570 [Thread-51] INFO  org.apache.hadoop.mapred.MapTask - Starting flush of map output
2013-01-24 08:10:21.573 [Thread-51] INFO  org.apache.hadoop.mapred.Task - Task:attempt_local_0001_m_03_0 is done. And is in the process of commiting
2013-01-24 08:10:24.529 [Thread-51] INFO  org.apache.hadoop.mapred.LocalJobRunner -
2013-01-24 08:10:24.529 [Thread-51] INFO  org.apache.hadoop.mapred.Task - Task 'attempt_local_0001_m_03_0' done.
2013-01-24 08:10:24.530 [Thread-51] INFO  org.apache.hadoop.mapred.Task - Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@42e87d99
Any clues ?
Thanks for the help.

On Thu, Jan 24, 2013 at 5:12 PM, Harsh J  wrote:

> The Job cla

Re: Support of RHEL version

2013-01-24 Thread Alexander Alten-Lorenz
Do you mean Transparent Huge Page Defrag 
(https://bugzilla.redhat.com/show_bug.cgi?id=805593)?

Do echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag

- Alex

On Jan 24, 2013, at 6:46 PM, Dheeren Bebortha  wrote:

> I do not think this is a cdh specific issues. If this is a hbase compaction 
> issue it would be all pervading as long as it si RHEL62 and above!
> Am I reading it right?
> -Dheeren
> 
> -Original Message-
> From: Alexander Alten-Lorenz [mailto:wget.n...@gmail.com] 
> Sent: Thursday, January 24, 2013 9:41 AM
> To: cdh-u...@cloudera.org
> Subject: Re: Support of RHEL version
> 
> Hi,
> 
> Moving the post to cdh-u...@cloudera.org
> (https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!forum/cdh-user)
> as it is CDH4 you specifically are asking about. BCC'd user@hadoop lists, 
> lets carry forward the discussion on the CDH lists. My response below.
> 
> RHEL 5.x and 6.x wil be supported. Means, RHEL 6.3 too.
> 
> Cheers,
> Alex
> 
> On Jan 24, 2013, at 6:28 PM,  wrote:
> 
>> Hi,
>> 
>> We are working on implementing Cloudera distributed Hadoop (CDH 4.x) on our 
>> environment.  Cloudera website talks about supporting RHEL 6.1 version with 
>> challenges/issues with the newer version. It also though provides a 
>> workaround for it. Wanted to hear from the community on the supported 
>> versions of RedHat Linux and any guidance on which version to choose?
>> 
>> -Nilesh
>> 
>> 
>> https://ccp.cloudera.com/display/CDH4DOC/Known+Issues+and+Work+Arounds
>> +in+CDH4
>> 
>> 
>> Red Hat Linux (RHEL 6.2 and 6.3)
>> - Poor performance running Hadoop on RHEL 6.2 or later when 
>> transparent hugepage compaction is enabled RHEL 6.2 and 6.3 include a 
>> feature called "transparent hugepage compaction" which interacts poorly with 
>> Hadoop workloads. This can cause a serious performance regression compared 
>> to other operating system versions on the same hardware.
>> Symptom: top and other system monitoring tools show a large percentage of 
>> the CPU usage classified as "system CPU". If system CPU usage is 30% or more 
>> of the total CPU usage, your system may be experiencing this issue.
>> Bug: https://bugzilla.redhat.com/show_bug.cgi?id=805593
>> Severity: Medium (up to 3x performance loss) Anticipated Resolution: 
>> Currently working with Red Hat to resolve for a future RHEL update
>> Workaround: Add the following command to /etc/rc.local to disable 
>> transparent hugepage compaction:
>> echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
>> 
>> 
> 
> --
> Alexander Alten-Lorenz
> http://mapredit.blogspot.com
> German Hadoop LinkedIn Group: http://goo.gl/N8pCF
> 

--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF



RE: Support of RHEL version

2013-01-24 Thread Dheeren Bebortha
I do not think this is a CDH-specific issue. If this is an HBase compaction
issue, it would be all-pervading as long as it is RHEL 6.2 and above!
Am I reading it right?
-Dheeren

-Original Message-
From: Alexander Alten-Lorenz [mailto:wget.n...@gmail.com] 
Sent: Thursday, January 24, 2013 9:41 AM
To: cdh-u...@cloudera.org
Subject: Re: Support of RHEL version

Hi,

Moving the post to cdh-u...@cloudera.org
(https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!forum/cdh-user)
as it is CDH4 you specifically are asking about. BCC'd user@hadoop lists, lets 
carry forward the discussion on the CDH lists. My response below.

RHEL 5.x and 6.x wil be supported. Means, RHEL 6.3 too.

Cheers,
 Alex

On Jan 24, 2013, at 6:28 PM,  wrote:

> Hi,
> 
> We are working on implementing Cloudera distributed Hadoop (CDH 4.x) on our 
> environment.  Cloudera website talks about supporting RHEL 6.1 version with 
> challenges/issues with the newer version. It also though provides a 
> workaround for it. Wanted to hear from the community on the supported 
> versions of RedHat Linux and any guidance on which version to choose?
> 
> -Nilesh
> 
> 
> https://ccp.cloudera.com/display/CDH4DOC/Known+Issues+and+Work+Arounds
> +in+CDH4
> 
> 
> Red Hat Linux (RHEL 6.2 and 6.3)
> - Poor performance running Hadoop on RHEL 6.2 or later when 
> transparent hugepage compaction is enabled RHEL 6.2 and 6.3 include a feature 
> called "transparent hugepage compaction" which interacts poorly with Hadoop 
> workloads. This can cause a serious performance regression compared to other 
> operating system versions on the same hardware.
> Symptom: top and other system monitoring tools show a large percentage of the 
> CPU usage classified as "system CPU". If system CPU usage is 30% or more of 
> the total CPU usage, your system may be experiencing this issue.
> Bug: https://bugzilla.redhat.com/show_bug.cgi?id=805593
> Severity: Medium (up to 3x performance loss) Anticipated Resolution: 
> Currently working with Red Hat to resolve for a future RHEL update
> Workaround: Add the following command to /etc/rc.local to disable transparent 
> hugepage compaction:
> echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
> 
> 

--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF



Re: Support of RHEL version

2013-01-24 Thread Alexander Alten-Lorenz
Hi,

Moving the post to cdh-u...@cloudera.org 
(https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!forum/cdh-user)
as it is CDH4 you specifically are asking about. BCC'd user@hadoop
lists, lets carry forward the discussion on the CDH lists. My response
below.

RHEL 5.x and 6.x will be supported, which means RHEL 6.3 too.

Cheers,
 Alex

On Jan 24, 2013, at 6:28 PM,  wrote:

> Hi,
> 
> We are working on implementing Cloudera distributed Hadoop (CDH 4.x) on our 
> environment.  Cloudera website talks about supporting RHEL 6.1 version with 
> challenges/issues with the newer version. It also though provides a 
> workaround for it. Wanted to hear from the community on the supported 
> versions of RedHat Linux and any guidance on which version to choose?
> 
> -Nilesh
> 
> 
> https://ccp.cloudera.com/display/CDH4DOC/Known+Issues+and+Work+Arounds+in+CDH4
> 
> 
> Red Hat Linux (RHEL 6.2 and 6.3)
> - Poor performance running Hadoop on RHEL 6.2 or later when transparent 
> hugepage compaction is enabled
> RHEL 6.2 and 6.3 include a feature called "transparent hugepage compaction" 
> which interacts poorly with Hadoop workloads. This can cause a serious 
> performance regression compared to other operating system versions on the 
> same hardware.
> Symptom: top and other system monitoring tools show a large percentage of the 
> CPU usage classified as "system CPU". If system CPU usage is 30% or more of 
> the total CPU usage, your system may be experiencing this issue.
> Bug: https://bugzilla.redhat.com/show_bug.cgi?id=805593
> Severity: Medium (up to 3x performance loss)
> Anticipated Resolution: Currently working with Red Hat to resolve for a 
> future RHEL update
> Workaround: Add the following command to /etc/rc.local to disable transparent 
> hugepage compaction:
> echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
> 
> 

--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF



Support of RHEL version

2013-01-24 Thread Nilesh_Sangani
Hi,

We are working on implementing Cloudera distributed Hadoop (CDH 4.x) on our 
environment.  Cloudera website talks about supporting RHEL 6.1 version with 
challenges/issues with the newer version. It also though provides a workaround 
for it. Wanted to hear from the community on the supported versions of RedHat 
Linux and any guidance on which version to choose?

-Nilesh


https://ccp.cloudera.com/display/CDH4DOC/Known+Issues+and+Work+Arounds+in+CDH4


Red Hat Linux (RHEL 6.2 and 6.3)
- Poor performance running Hadoop on RHEL 6.2 or later when transparent 
hugepage compaction is enabled
RHEL 6.2 and 6.3 include a feature called "transparent hugepage compaction" 
which interacts poorly with Hadoop workloads. This can cause a serious 
performance regression compared to other operating system versions on the same 
hardware.
Symptom: top and other system monitoring tools show a large percentage of the 
CPU usage classified as "system CPU". If system CPU usage is 30% or more of the 
total CPU usage, your system may be experiencing this issue.
Bug: https://bugzilla.redhat.com/show_bug.cgi?id=805593
Severity: Medium (up to 3x performance loss)
Anticipated Resolution: Currently working with Red Hat to resolve for a future 
RHEL update
Workaround: Add the following command to /etc/rc.local to disable transparent 
hugepage compaction:
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag




Re: Modifying Hadoop For join Operation

2013-01-24 Thread Praveen Sripati
Vikas,

Check the below paper on the different ways of performing joins in MR:

http://lintool.github.com/MapReduceAlgorithms/index.html

Also, `Hadoop - The Definitive Guide` has a section on the different
approaches and when to use them.


Thanks,
Praveen

Cloudera Certified Developer for Apache Hadoop CDH4 (95%)
http://www.thecloudavenue.com/
http://stackoverflow.com/users/614157/praveen-sripati

If you aren’t taking advantage of big data, then you don’t have big data,
you have just a pile of data.


On Thu, Jan 24, 2013 at 8:39 PM, Harsh J  wrote:

> Hi,
>
> Can you also define 'efficient way' and the idea you have in mind to
> implement that isn't already doable today?
>
> On Thu, Jan 24, 2013 at 6:51 PM, Vikas Jadhav 
> wrote:
> > Anyone has idea about how should i modify Hadoop Code for
> > Performing Join operation in efficient Way.
> > Thanks.
> >
> > --
> >
> >
> > Thanx and Regards
> >  Vikas Jadhav
>
>
>
> --
> Harsh J
>


Re: Submitting MapReduce job from remote server using JobClient

2013-01-24 Thread Amit Sela
Hi Harsh,
I'm using Job.waitForCompletion() method to run the job but I can't see it
in the webapp and it doesn't seem to finish...
I get:
org.apache.hadoop.mapred.JobClient - Running job: job_local_0001
INFO  org.apache.hadoop.util.ProcessTree - setsid exited with exit code 0
2013-01-24 08:10:12.521 [Thread-51] INFO  org.apache.hadoop.mapred.Task - Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@7db1be6
2013-01-24 08:10:12.536 [Thread-51] INFO  org.apache.hadoop.mapred.MapTask - io.sort.mb = 100
2013-01-24 08:10:12.573 [Thread-51] INFO  org.apache.hadoop.mapred.MapTask - data buffer = 79691776/99614720
2013-01-24 08:10:12.573 [Thread-51] INFO  org.apache.hadoop.mapred.MapTask - record buffer = 262144/327680
2013-01-24 08:10:12.599 [Thread-51] INFO  org.apache.hadoop.mapred.MapTask - Starting flush of map output
2013-01-24 08:10:12.608 [Thread-51] INFO  org.apache.hadoop.mapred.Task - Task:attempt_local_0001_m_00_0 is done. And is in the process of commiting
2013-01-24 08:10:13.348 [org.springframework.scheduling.quartz.SchedulerFactoryBean#0_Worker-1] INFO  org.apache.hadoop.mapred.JobClient -  map 0% reduce 0%
2013-01-24 08:10:15.509 [Thread-51] INFO  org.apache.hadoop.mapred.LocalJobRunner -
2013-01-24 08:10:15.510 [Thread-51] INFO  org.apache.hadoop.mapred.Task - Task 'attempt_local_0001_m_00_0' done.
2013-01-24 08:10:15.511 [Thread-51] INFO  org.apache.hadoop.mapred.Task - Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@6b02b23d
2013-01-24 08:10:15.512 [Thread-51] INFO  org.apache.hadoop.mapred.MapTask - io.sort.mb = 100
2013-01-24 08:10:15.549 [Thread-51] INFO  org.apache.hadoop.mapred.MapTask - data buffer = 79691776/99614720
2013-01-24 08:10:15.550 [Thread-51] INFO  org.apache.hadoop.mapred.MapTask - record buffer = 262144/327680
2013-01-24 08:10:15.557 [Thread-51] INFO  org.apache.hadoop.mapred.MapTask - Starting flush of map output
2013-01-24 08:10:15.560 [Thread-51] INFO  org.apache.hadoop.mapred.Task - Task:attempt_local_0001_m_01_0 is done. And is in the process of commiting
2013-01-24 08:10:16.358 [org.springframework.scheduling.quartz.SchedulerFactoryBean#0_Worker-1] INFO  org.apache.hadoop.mapred.JobClient -  map 100% reduce 0%

And after that, instead of going to the Reduce phase I keep getting map
attempts like:

INFO  org.apache.hadoop.mapred.MapTask - io.sort.mb = 100
2013-01-24 08:10:21.563 [Thread-51] INFO  org.apache.hadoop.mapred.MapTask - data buffer = 79691776/99614720
2013-01-24 08:10:21.563 [Thread-51] INFO  org.apache.hadoop.mapred.MapTask - record buffer = 262144/327680
2013-01-24 08:10:21.570 [Thread-51] INFO  org.apache.hadoop.mapred.MapTask - Starting flush of map output
2013-01-24 08:10:21.573 [Thread-51] INFO  org.apache.hadoop.mapred.Task - Task:attempt_local_0001_m_03_0 is done. And is in the process of commiting
2013-01-24 08:10:24.529 [Thread-51] INFO  org.apache.hadoop.mapred.LocalJobRunner -
2013-01-24 08:10:24.529 [Thread-51] INFO  org.apache.hadoop.mapred.Task - Task 'attempt_local_0001_m_03_0' done.
2013-01-24 08:10:24.530 [Thread-51] INFO  org.apache.hadoop.mapred.Task - Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@42e87d99
Any clues ?
Thanks for the help.

On Thu, Jan 24, 2013 at 5:12 PM, Harsh J  wrote:

> The Job class itself has a blocking and non-blocking submitter that is
> similar to JobConf's runJob method you discovered. See
>
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Job.html#submit()
> and its following method waitForCompletion(). These seem to be what
> you're looking for.
>
> On Thu, Jan 24, 2013 at 5:43 PM, Amit Sela  wrote:
> > Hi all,
> >
> > I want to run a MapReduce job using the Hadoop Java api from my analytics
> > server. It is not the master or even a data node but it has the same
> Hadoop
> > i

Re: HDFS File/Folder permission control with POSIX standard

2013-01-24 Thread Harsh J
Hi,

Please give the HDFS Permissions Guide a read, it should answer your
questions: http://hadoop.apache.org/docs/stable/hdfs_permissions_guide.html
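
For completeness, a small sketch of what locking the two trees down can look like through the FileSystem API (the paths, owner/group names and the 0750 mode are assumptions; the equivalent hadoop fs -chown / -chmod commands do the same thing):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class LockDownDirs {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // One directory per team, accessible only to its owner and group.
    Path research = new Path("/user/research");
    Path development = new Path("/user/development");

    fs.setOwner(research, "research", "research");              // needs superuser rights
    fs.setPermission(research, new FsPermission((short) 0750));

    fs.setOwner(development, "development", "development");
    fs.setPermission(development, new FsPermission((short) 0750));

    fs.close();
  }
}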

On Thu, Jan 24, 2013 at 9:17 PM, Dhanasekaran Anbalagan
 wrote:
> Hi Guys,
>
> In our scenario we have two HDFS users, research and development.
>
> In the current environment each can access the other user's data; we want to
> protect the user-level data folders so that one user can't see the other's
> data, for example by editing or removing files.
>
> How do I change folder permissions in HDFS?
>
> It follows the POSIX standard, right? So removing the other user's permissions
> means they can't access the data, right?
>
> Please guide me.
>
> -Dhanasekaran.
>
>
>
> Did I learn something today? If not, I wasted it.



-- 
Harsh J


Re: Submitting MapReduce job from remote server using JobClient

2013-01-24 Thread Harsh J
The Job class itself has a blocking and non-blocking submitter that is
similar to JobConf's runJob method you discovered. See
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Job.html#submit()
and its following method waitForCompletion(). These seem to be what
you're looking for.
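
For illustration, a minimal submission-and-monitoring sketch along those lines; the identity mapper/reducer and the argument paths are only there to keep it self-contained, and the Configuration is assumed to carry the cluster's settings (otherwise the client falls back to the LocalJobRunner):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RemoteSubmit {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // must point at the cluster's fs/jobtracker
    Job job = new Job(conf, "remote-submit-example");
    job.setJarByClass(RemoteSubmit.class);
    job.setMapperClass(Mapper.class);          // identity mapper, just for the sketch
    job.setReducerClass(Reducer.class);        // identity reducer
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.submit();                              // non-blocking; waitForCompletion(true) blocks instead
    while (!job.isComplete()) {                // simple polling-based monitoring
      System.out.printf("map %.0f%% reduce %.0f%%%n",
          job.mapProgress() * 100, job.reduceProgress() * 100);
      Thread.sleep(5000);
    }
    System.out.println(job.isSuccessful() ? "job succeeded" : "job failed");
  }
}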

On Thu, Jan 24, 2013 at 5:43 PM, Amit Sela  wrote:
> Hi all,
>
> I want to run a MapReduce job using the Hadoop Java api from my analytics
> server. It is not the master or even a data node but it has the same Hadoop
> installation as all the nodes in the cluster.
> I tried using JobClient.runJob() but it accepts JobConf as argument and when
> using JobConf it is possible to set only mapred Mapper classes and I use
> mapreduce...
> I tried using JobControl and ControlledJob but it seems like it tries to run
> the job locally. the map phase just keeps attempting...
> Anyone tried it before ?
> I'm just looking for a way to submit MapReduce jobs from Java code and be
> able to monitor them.
>
> Thanks,
>
> Amit.



-- 
Harsh J


Re: Join Operation Using Hadoop MapReduce

2013-01-24 Thread Harsh J
The Hadoop: The Definitive Guide (and also other books) has a detailed
topic on Joins and types of Joins in MR, in its MapReduce chapters.
Looking the word up in the index would probably help you find some
good things to read on this topic.
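
As a concrete illustration of the most common variant those chapters cover, a simplified reduce-side join sketch with the new API, assuming two tab-separated inputs whose first column is the join key (the file and class names are made up):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper: emit (joinKey, "sourceTag\trestOfRecord"), tagging each record with its input file.
public class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  private String tag;

  @Override
  protected void setup(Context context) {
    tag = ((FileSplit) context.getInputSplit()).getPath().getName();  // e.g. "customers" or "orders"
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t", 2);
    if (fields.length == 2) {
      context.write(new Text(fields[0]), new Text(tag + "\t" + fields[1]));
    }
  }
}

// Reducer: buffer one side and cross it with the other (an inner join per key).
class JoinReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<String> left = new ArrayList<String>();
    List<String> right = new ArrayList<String>();
    for (Text v : values) {
      String[] parts = v.toString().split("\t", 2);
      if (parts[0].startsWith("customers")) {  // hypothetical file name; adjust to real inputs
        left.add(parts[1]);
      } else {
        right.add(parts[1]);
      }
    }
    for (String l : left) {
      for (String r : right) {
        context.write(key, new Text(l + "\t" + r));
      }
    }
  }
}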

On Thu, Jan 24, 2013 at 6:48 PM, Vikas Jadhav  wrote:
> Hi I am working join operation using MapReduce
> So if anyone has useful information plz share it.
> Example Code or New Technique along with existing one.
>
> Thank You.
>
> --
>
>
> Thanx and Regards
>  Vikas Jadhav



-- 
Harsh J


Re: Modifying Hadoop For join Operation

2013-01-24 Thread Harsh J
Hi,

Can you also define 'efficient way' and the idea you have in mind to
implement that isn't already doable today?

On Thu, Jan 24, 2013 at 6:51 PM, Vikas Jadhav  wrote:
> Anyone has idea about how should i modify Hadoop Code for
> Performing Join operation in efficient Way.
> Thanks.
>
> --
>
>
> Thanx and Regards
>  Vikas Jadhav



-- 
Harsh J


Re: hdfs du periodicity and hdfs not respond at that time

2013-01-24 Thread Harsh J
I missed the periodicity part of your question. Unfortunately the "du"
refresh interval is hard-coded today, although the "df" interval is
configurable. Perhaps this is a bug - I filed
https://issues.apache.org/jira/browse/HADOOP-9241 to make it configurable.

Also, your problem reminded me of a similar issue my team and I faced
once before, and
http://allthingshadoop.com/2011/05/20/faster-datanodes-with-less-wait-io-using-df-instead-of-du/
helped us temporarily there. Perhaps it may help you as well, its a
good tip.

On Thu, Jan 24, 2013 at 1:24 PM, Xibin Liu  wrote:
> Thanks, http://search-hadoop.com/m/LLBgUiH0Bg2 is my issue , but I still
> dont't know how to solve this problem, 3 minutes not respond once an hour
> is a big problem for me, any clue for this?
>
>
> 2013/1/24 Harsh J 
>>
>> Hi,
>>
>> HDFS does this to estimate space reports. Perhaps the discussion here
>> may help you: http://search-hadoop.com/m/LLBgUiH0Bg2
>>
>> On Thu, Jan 24, 2013 at 12:51 PM, Xibin Liu 
>> wrote:
>> > hi all,
>> > I found hdfs du periodicity(one hour), and because my disk is big, the
>> > smallest one is 15T, so when hdfs exec du, datanode will not respond for
>> > about 3 minuts because of io loading, this cause a lot of problem,
>> > anybody
>> > knows why hdfs doing this and how to disable it?
>> >
>> > --
>> > Thanks &  Regards
>> > Xibin Liu
>> >
>>
>>
>>
>> --
>> Harsh J
>
>
>
>
> --
> Thanks & Best Regards
> Xibin Liu
>



--
Harsh J


Re: hdfs du periodicity and hdfs not respond at that time

2013-01-24 Thread Chris Embree
What type of FS are you using  under HDFS?  XFS, ext3, ext4?  The type and
configuration of the underlying FS will impact performance.
Most notably, ext3 has a lock-up effect when flushing disk cache.

On Thu, Jan 24, 2013 at 2:54 AM, Xibin Liu  wrote:

> Thanks, http://search-hadoop.com/m/LLBgUiH0Bg2 is my issue , but I still
> dont't know how to solve this problem, 3 minutes not respond once an hour
>  is a big problem for me, any clue for this?
>
>
> 2013/1/24 Harsh J 
>
>> Hi,
>>
>> HDFS does this to estimate space reports. Perhaps the discussion here
>> may help you: http://search-hadoop.com/m/LLBgUiH0Bg2
>>
>> On Thu, Jan 24, 2013 at 12:51 PM, Xibin Liu 
>> wrote:
>> > hi all,
>> > I found hdfs du periodicity(one hour), and because my disk is big, the
>> > smallest one is 15T, so when hdfs exec du, datanode will not respond for
>> > about 3 minuts because of io loading, this cause a lot of problem,
>> anybody
>> > knows why hdfs doing this and how to disable it?
>> >
>> > --
>> > Thanks &  Regards
>> > Xibin Liu
>> >
>>
>>
>>
>> --
>> Harsh J
>>
>
>
>
> --
> Thanks & Best Regards
> Xibin Liu
>
>


Submitting MapReduce job from remote server using JobClient

2013-01-24 Thread Amit Sela
Hi all,

I want to run a MapReduce job using the Hadoop Java api from my analytics
server. It is not the master or even a data node but it has the same Hadoop
installation as all the nodes in the cluster.
I tried using JobClient.runJob() but it accepts JobConf as argument and
when using JobConf it is possible to set only mapred Mapper classes and I
use mapreduce...
I tried using JobControl and ControlledJob but it seems like it tries to
run the job locally. the map phase just keeps attempting...
Anyone tried it before ?
I'm just looking for a way to submit MapReduce jobs from Java code and be
able to monitor them.

Thanks,

Amit.


Re: Hadoop Nutch Mkdirs failed to create file

2013-01-24 Thread samir das mohapatra
Just try applying:
$> chmod 755 -R /home/wj/apps/apache-nutch-1.6

and then try again.



On Wed, Jan 23, 2013 at 9:23 PM, 吴靖  wrote:

> Hi everyone,
> I want to use Nutch to crawl web pages, but a problem shows up in the
> log as below. I think it may be a permissions problem, but I am not sure.
> Any help will be appreciated. Thank you.
>
> 2013-01-23 07:37:21,809 ERROR mapred.FileOutputCommitter - Mkdirs failed
> to create file
> :/home/wj/apps/apache-nutch-1.6/bin/crawl/crawldb/190684692/_temporary
> 2013-01-23 07:37:24,836 WARN  mapred.LocalJobRunner - job_local_0002
> java.io.IOException: The temporary job-output directory
> file:/home/wj/apps/apache-nutch-1.6/bin/crawl/crawldb/190684692/_temporary
> doesn't exist!
> at
> org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
> at
> org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)
> at
> org.apache.hadoop.mapred.MapFileOutputFormat.getRecordWriter(MapFileOutputFormat.java:46)
> at
> org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:448)
> at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:490)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
>
>
>


RE: Hadoop Cluster

2013-01-24 Thread Henjarappa, Savitha
Thank you so much for your quick response.

From: Mohammad Tariq [mailto:donta...@gmail.com]
Sent: Tuesday, January 22, 2013 11:51 PM
To: user@hadoop.apache.org; Bejoy Ks
Subject: Re: Hadoop Cluster

The most significant difference between the two, in my view, is that HA
eliminates the problem of the NameNode being a single point of failure. For
detailed info I would suggest you visit the official web site. You might also
find this link useful,
which talks about HA in great detail.

For your second question you can follow Bejoy sir's answer.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com

On Tue, Jan 22, 2013 at 11:37 PM, bejoy.had...@gmail.com wrote:
Hi Savitha

HA is a new feature in Hadoop, introduced in the Hadoop 2.x releases. So it is
a new feature on top of a Hadoop cluster.

Ganglia is one of the widely used tools to monitor the cluster in detail. At a
basic HDFS and MapReduce level, the JobTracker and NameNode web UIs give you a
good consolidated view of the Hadoop cluster.
Regards
Bejoy KS

Sent from remote device, Please excuse typos

From: "Henjarappa, Savitha" 
mailto:savitha.henjara...@hp.com>>
Date: Tue, 22 Jan 2013 12:22:21 +
To: 
user@hadoop.apache.orgmailto:user@hadoop.apache.org>>
ReplyTo: user@hadoop.apache.org
Subject: Hadoop Cluster

Hi,

I am trying to understand the difference between an HA cluster and a Hadoop
cluster. Can this group help me understand how differently a Hadoop cluster
should be monitored?

Thanks,
Savitha