Re: Clickstream and video Analysis

2012-02-22 Thread Prashant Sharma
TubeMogul is one of them.

On Thu, Feb 23, 2012 at 11:00 AM, shreya@cognizant.com wrote:

 Hi,



 Could someone provide some links on clickstream and video analysis in
 Hadoop?



 Thanks and Regards,

 Shreya Pal







Re: working with SAS

2012-02-06 Thread Prashant Sharma
Also, you will not necessarily need vertically scaled systems to speed things
up (it depends entirely on your query). Consider commodity hardware (much
cheaper); Hadoop is well suited to it, so, I hope, your infrastructure can be
cheaper in terms of price-to-performance ratio. Having said that, I do not
mean you have to throw away your existing infrastructure, because it is ideal
for certain requirements.

Your solution could be to write a MapReduce job that does what the query is
supposed to do (a rough skeleton is sketched below) and run it on a cluster
whose size depends on how fast you want the results and how far you need to
scale. In case your query is ad hoc and has to be run frequently, you might
want to consider HBase and Hive as solutions, with a lot of expensive
vertical nodes ;).
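
To make that concrete, here is a rough, untested sketch of such a job against
the 0.20-era org.apache.hadoop.mapreduce API. The class names, the CSV field
positions and the per-key sum are placeholder assumptions; the real statistic
would go in the reducer.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StatQueryJob {

  // Map: parse one CSV row and emit (group key, numeric measure).
  public static class StatMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text row, Context context)
        throws IOException, InterruptedException {
      String[] fields = row.toString().split(",");
      // fields[0] = grouping key, fields[3] = measure (illustrative positions)
      context.write(new Text(fields[0]),
          new DoubleWritable(Double.parseDouble(fields[3])));
    }
  }

  // Reduce: aggregate all values seen for a key (a plain sum here).
  public static class StatReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values,
        Context context) throws IOException, InterruptedException {
      double sum = 0;
      for (DoubleWritable v : values) {
        sum += v.get();
      }
      context.write(key, new DoubleWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "stat-query");
    job.setJarByClass(StatQueryJob.class);
    job.setMapperClass(StatMapper.class);
    job.setReducerClass(StatReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

You would package this as a jar and launch it with hadoop jar, sizing the
cluster to the data volume and the deadline.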

By the way, is your query iterative? A few more details about the type of
query may attract people with more wisdom to help.

HTH


On Mon, Feb 6, 2012 at 1:46 PM, alo alt wget.n...@googlemail.com wrote:

 Hi,

 Hadoop runs on a Linux box (mostly) and can run as a standalone
 installation for testing only. If you decide to use Hadoop with Hive or
 HBase you have to face several more tasks:

 - installation (Whirr and Amazon EC2, for example)
 - writing your own MapReduce job, or using Hive / HBase
 - setting up Sqoop with the Teradata driver

 You can easily set up parts 1 and 2 with Amazon's EC2; I think you can also
 book Windows Server instances there. For a single query, that is the best
 option, I think, before you install a Hadoop cluster.

 best,
  Alex


 --
 Alexander Lorenz
 http://mapredit.blogspot.com

 On Feb 6, 2012, at 8:11 AM, Ali Jooan Rizvi wrote:

  Hi,
 
 
 
  I would like to know whether Hadoop will be of help to me. Let me explain
  my scenario:

  I have a single Windows-server-based machine with 16 cores and 48 GB of
  physical memory. In addition, I have 120 GB of virtual memory.

  I am running a query with statistical calculations over a large data set of
  more than 1 billion rows, on SAS. In this case, SAS is acting like a
  database on which both the source and target tables reside. For storage, I
  can keep the source and target data on Teradata as well, but the query,
  which contains patented logic, can only be run through the SAS interface.

  The problem is that SAS is taking many days (25 days) to run it (a single
  query with a statistical function), and not all cores were in use all the
  time; merely 5% CPU was utilized on average. Memory utilization, however,
  was very high, which is why so much virtual memory was used.

  Can I put a Hadoop interface in place to do all of this, so that I end up
  running the query in less time, say 1 or 2 days? Anything that squeezes my
  run time will be very helpful.
 
 
 
  Thanks
 
 
 
  Ali Jooan Rizvi
 




Re: Why $HADOOP_PREFIX ?

2012-02-01 Thread Prashant Sharma
I think you have misunderstood something. As far as I understand, these
variables are set automatically when you run a script; the name is obscure
for some strange reason ;).

The "Warning: $HADOOP_HOME is deprecated" message is always printed, whether
the variable is set or not. Why? Because hadoop-config.sh is sourced in all
the scripts, and all it does is set HADOOP_HOME from HADOOP_PREFIX. I think
this can be reported as a bug.

-P


On Wed, Feb 1, 2012 at 5:46 PM, praveenesh kumar praveen...@gmail.comwrote:

 Does anyone have any idea why $HADOOP_PREFIX was introduced instead of
 $HADOOP_HOME in Hadoop 0.20.205?

 I believe $HADOOP_HOME was not giving any trouble, or is there a reason or a
 new feature that requires $HADOOP_PREFIX?

 It's kind of funny, but I had become used to $HADOOP_HOME and am just curious
 about this change. Also, there are some old packages (I am not referring to
 Apache, Cloudera or any Hadoop distribution) that depend on Hadoop and still
 use $HADOOP_HOME internally. So it is kind of weird that when you use those
 packages you still get the warning messages, even though they are suppressed
 on the Hadoop side.


 Thanks,
 Praveenesh



Re: Why $HADOOP_PREFIX ?

2012-02-01 Thread Prashant Sharma
@Harsh, I sometimes have similar thoughts :P, but I wonder whether something
can be done about it.

@Bobby, thanks for elaborating on the strange reason. :)

@Praveenesh, Yes, you can do away with sourcing of hadoop-config.sh and set
all the necessary variables by hand.


On Wed, Feb 1, 2012 at 10:38 PM, Harsh J ha...@cloudera.com wrote:

 Personal opinion here: for branch-1, I do think the earlier tarball
 structure was better. I do not see why it had to change for this version,
 at least. It was possibly changed during all the work of adding
 packaging-related scripts for rpm/deb into Hadoop itself, but the tarball
 right now is not as usable as it was before, and the older format would
 still have worked today.

 On Wed, Feb 1, 2012 at 10:31 PM, Robert Evans ev...@yahoo-inc.com wrote:
  I think it comes down to a long history of splitting and then remerging
 the Hadoop project. I could be wrong about a lot of this, so take it with a
 grain of salt. Hadoop was originally, and on 1.0 still is, a single project:
 HDFS, MapReduce and Common are all compiled together into a single jar,
 hadoop-core. In that respect HADOOP_HOME made a lot of sense, because it was
 a single thing, with some dependencies that needed to be found by some shell
 scripts.

  Fast forward: the projects were split, HADOOP_HOME was deprecated, and
 HADOOP_COMMON_HOME, HADOOP_MAPRED_HOME, and HADOOP_HDFS_HOME were born.
 But if you install them all into a single tree, it is a pain to configure
 all of these to point to the same place, and HADOOP_HOME is deprecated, so
 HADOOP_PREFIX was born. NOTE: as was stated before, all of these are
 supposed to be hidden from the end user and are intended more for packaging
 and deploying Hadoop. Also, the process is not done and it is likely to
 change further.
 
  --Bobby Evans
 
  On 2/1/12 8:10 AM, praveenesh kumar praveen...@gmail.com wrote:
 
  Interesting and strange.
  But is there any reason for setting $HADOOP_HOME to $HADOOP_PREFIX in
  hadoop-config.sh and then checking in bin/hadoop whether $HADOOP_HOME is
  not equal to ...?

  I mean, if I comment out the export HADOOP_HOME=${HADOOP_PREFIX} in
  hadoop-config.sh, does it make any difference?
 
  Thanks,
  Praveenesh
 



 --
 Harsh J
 Customer Ops. Engineer
 Cloudera | http://tiny.cloudera.com/about



Re: Any info on R+Hadoop

2012-01-29 Thread Prashant Sharma
Praveenesh,
Well, it gives you more convenience :). If you have worked in R, you might
notice that with rmr you can write the mapper as an lapply. They have already
abstracted a lot of the plumbing for you, so you have less control over
things, but as far as convenience goes it is very slick. For example, you can
process data with Hadoop from inside R (no doubt it uses Hadoop Streaming
behind the scenes) and have the processed data easily loaded back into the R
command line from HDFS (using rhdfs). Generally, R developers do not like
being bogged down in the hassles that raw Hadoop Streaming can bring.

-P

P.S. I am not endorsing anyone. It's just my view.

On Sun, Jan 29, 2012 at 12:54 PM, praveenesh kumar praveen...@gmail.comwrote:

 Has anyone done any work with R + Hadoop?

 I know there are some flavors of R + Hadoop available, such as rmr, rhdfs,
 RHIPE and RHive.

 But as far as I know, submitting jobs using Hadoop Streaming is the best way
 available right now. Am I right?


 Any info on R on Hadoop ?

 Thanks,
 Praveenesh



Re: Hadoop Cluster Quick Setup Script

2011-12-03 Thread Prashant Sharma
Edmon,
  I made some effort on this but eventually got bored for lack of interest. I
think I made some progress, and perhaps you can take it forward from there in
MAPREDUCE-3131 https://issues.apache.org/jira/browse/MAPREDUCE-3131. I am
ready to help with anything I can. It also works perfectly for a single node,
and there is a wiki in it (which you can compile using mvn site:site).

Thanks,
Prashant

On Sat, Dec 3, 2011 at 9:10 PM, Edmon Begoli ebeg...@gmail.com wrote:

 Does anyone have or know of a simple (Apache) Hadoop cluster script
 that sets up Hadoop across the cluster, using
 some reasonable default values, across a set of IPs?

 I want to install a cluster of at least five virtual nodes and perhaps
 grow it larger. I would like to use some script
 that pulls the components, uses default settings and deploys Hadoop over a
 range of IP addresses.

 I am planning to write all of this myself, but I do want to check whether
 anyone here has something already,
 or whether there is something out there. (I like the Cloudera stuff, but I
 want something that is pure Apache.)

 Thank you in advance,
 Edmon



Re: Help with Hadoop Eclipse Plugin on Mac OS X Lion

2011-12-02 Thread Prashant Sharma
Why do you need a plugin at all?

You can do away with it by having a Maven project, i.e. a pom.xml with Hadoop
as one of the dependencies, and then using the regular Maven commands to
build, etc. For example, mvn eclipse:eclipse would be an interesting command.

On Fri, Dec 2, 2011 at 1:59 PM, Will L seventeen_reas...@hotmail.comwrote:



 Oops guess the formatting went away:
 I have tried the following combinations:
 * Hadoop 0.20.203, Eclipse 3.6.2 (32-bit),
 hadoop-eclipse-plugin-0.20.203.0.jar
 * Hadoop 0.20.203, Eclipse 3.6.2 (32-bit),
 hadoop-eclipse-plugin-0.20.3-SNAPSHOT.jar (from JIRA)
 * Hadoop 0.20.203 Eclipse 3.7.1 (32-bit),
 hadoop-eclipse-plugin-0.20.203.0.jar
 * Hadoop 0.20.203, Eclipse 3.7.1 (32-bit),
 hadoop-eclipse-plugin-0.20.3-SNAPSHOT.jar (from JIRA)
 * Hadoop 0.20.205, Eclipse 3.7.1 (32-bit),
 hadoop-eclipse-plugin-0.20.205.0.jar

  From: seventeen_reas...@hotmail.com
  To: common-user@hadoop.apache.org
  Subject: Help with Hadoop Eclipse Plugin on Mac OS X Lion
  Date: Fri, 2 Dec 2011 00:26:28 -0800
 
 
 
 
 
 
  Hello,
  I am having problems getting my hadoop eclipse plugin to work on Mac OS
 X Lion.
 
   I have tried the following combinations:
   * Hadoop 0.20.203, Eclipse 3.6.2 (32-bit), hadoop-eclipse-plugin-0.20.203.0.jar
   * Hadoop 0.20.203, Eclipse 3.6.2 (32-bit), hadoop-eclipse-plugin-0.20.3-SNAPSHOT.jar (from JIRA)
   * Hadoop 0.20.203, Eclipse 3.7.1 (32-bit), hadoop-eclipse-plugin-0.20.203.0.jar
   * Hadoop 0.20.203, Eclipse 3.7.1 (32-bit), hadoop-eclipse-plugin-0.20.3-SNAPSHOT.jar (from JIRA)
   * Hadoop 0.20.205, Eclipse 3.7.1 (32-bit), hadoop-eclipse-plugin-0.20.205.0.jar
 
  Has anyone gotten the hadoop eclipse plugin to work on Mac OS X Lion?
 
 
  Thank you for your time and help I greatly appreciate it!
 
 
  Sincerely,
 
 
  Will
 
 




Re: Help with Hadoop Eclipse Plugin on Mac OS X Lion

2011-12-02 Thread Prashant Sharma
Nice to know, Will. With the approach I described you have the same luxury,
as long as you are running in standalone mode, which is ideal for development.

On Fri, Dec 2, 2011 at 10:02 PM, Will L seventeen_reas...@hotmail.comwrote:



  I got the setup working on my laptop running OS X Snow Leopard without
  any problems, and I would like to use my new laptop running OS X Lion.

  The plugin is helpful in that I can see Hadoop output being dumped to the
  Eclipse console, and it used to integrate well with the Eclipse IDE, making
  my development life a little easier.

 Thank you for your time and help.

 Sincerely,

 Will Lieu

  




Re: [help]how to stop HDFS

2011-11-29 Thread Prashant Sharma
Try making $HADOOP_CONF point to the right classpath, including your
configuration folder.


On Tue, Nov 29, 2011 at 3:58 PM, cat fa boost.subscrib...@gmail.com wrote:

 I used the command:

 $HADOOP_PREFIX_HOME/bin/hdfs start namenode --config $HADOOP_CONF_DIR

 to start HDFS.

 This command is in the Hadoop documentation (here:
 http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html
 )

 However, I got this error:

 Exception in thread "main" java.lang.NoClassDefFoundError: start

 Could anyone tell me how to start and stop HDFS?

 By the way, how do I set Gmail so that it doesn't top-post my reply?



Re: Re: [help]how to stop HDFS

2011-11-29 Thread Prashant Sharma
I mean, you have to export the variables:

export HADOOP_CONF_DIR=/path/to/your/configdirectory

Also export HADOOP_HDFS_HOME and HADOOP_COMMON_HOME before you run your
command. I suppose this should fix the problem.
-P

On Tue, Nov 29, 2011 at 6:23 PM, cat fa boost.subscrib...@gmail.com wrote:

 it didn't work. It gave me the Usage information.

 2011/11/29 hailong.yang1115 hailong.yang1...@gmail.com

  Try $HADOOP_PREFIX_HOME/bin/hdfs namenode stop --config $HADOOP_CONF_DIR
  and $HADOOP_PREFIX_HOME/bin/hdfs datanode stop --config $HADOOP_CONF_DIR.
  It would stop namenode and datanode separately.
  The HADOOP_CONF_DIR is the directory where you store your configuration
  files.
  Hailong
 
 
 
 
  ***
  * Hailong Yang, PhD. Candidate
  * Sino-German Joint Software Institute,
  * School of Computer Science & Engineering, Beihang University
  * Phone: (86-010)82315908
  * Email: hailong.yang1...@gmail.com
  * Address: G413, New Main Building in Beihang University,
  *  No.37 XueYuan Road,HaiDian District,
  *  Beijing,P.R.China,100191
  ***
 
  From: cat fa
  Date: 2011-11-29 20:22
  To: common-user
  Subject: Re: [help]how to stop HDFS
   Use $HADOOP_CONF or $HADOOP_CONF_DIR? I'm using Hadoop 0.23.

   Which class do you mean? A Hadoop class or a Java class?
 



Re: Re: [help]how to stop HDFS

2011-11-29 Thread Prashant Sharma
I am sorry, I had no idea you had done an RPM install; my suggestion was
based on the assumption that you had done a tar-extract install, where all
three distributions have to be extracted and the variables then exported.
I also have no experience with RPM-based installs, so no comment on what went
wrong in your case.

Basically, from the error I can say that it is not able to find the needed
jars on the classpath, which the scripts locate through HADOOP_COMMON_HOME.
I would check the access permissions: which user was it installed as, and
which user is it running as?

On Tue, Nov 29, 2011 at 10:48 PM, cat fa boost.subscrib...@gmail.comwrote:

 Thank you for your help, but I'm still a little confused.
 Suppose I installed Hadoop in /usr/bin/hadoop/. Should I
 point HADOOP_COMMON_HOME to /usr/bin/hadoop? And where should I
 point HADOOP_HDFS_HOME? Also to /usr/bin/hadoop/?




Re: choices for deploying a small hadoop cluster on EC2

2011-11-29 Thread Prashant Sharma
Yes, the Pallet library: https://github.com/pallet/pallet-hadoop-example


On Wed, Nov 30, 2011 at 1:58 AM, Periya.Data periya.d...@gmail.com wrote:

 Hi All,
I am just beginning to learn how to deploy a small cluster (a 3
 node cluster) on EC2. After some quick Googling, I see the following
 approaches:

   1. Use Whirr for quick deployment and tearing down. Uses CDH3. Does it
   have features for persisting (EBS)?
   2. CDH Cloud Scripts - has EC2 AMI - again for temp Hadoop clusters/POC
   etc. Good stuff - I can persist using EBS snapshots. But, this uses CDH2.
   3. Install hadoop manually and related stuff like Hive...on each cluster
   node...on EC2 (or use some automation tool like Chef). I do not prefer
 it.
   4. Hadoop distribution comes with EC2 (under src/contrib) and there are
   several Hadoop EC2 AMIs available. I have not studied enough to know if
   that is easy for a beginner like me.
   5. Anything else??

  As a beginner, options 1 and 2 look promising to me. If any of you have any
  thoughts about this, I would like to hear them (what to keep in mind, what
  to take care of, caveats, etc.). I want my data and configuration to persist
  (using EBS) and to continue from where I left off (after a few days). Also,
  I want to have Hive and Sqoop installed. Can this be done using 1 or 2, or
  will they have to be installed manually after I set up the cluster?

 Thanks very much,

 PD.



Re: Issue : Hadoop mapreduce job to process S3 logs gets hung at INFO mapred.JobClient: map 0% reduce 0%

2011-11-28 Thread Prashant Sharma
Can you check your userlogs/xyz_attempt_xyz.log and also the jobtracker and
datanode logs?

-P

On Tue, Nov 29, 2011 at 4:17 AM, Nitika Gupta ngu...@rocketfuelinc.comwrote:

 Hi All,

 I am trying to run a mapreduce job to process the Amazon S3 logs.
 However, the code hangs at INFO mapred.JobClient: map 0% reduce 0% and
 does not even attempt to launch the tasks. The sample code for the job
 setup is given below:

 public int run(CommandLine cl) throws Exception {
   Configuration conf = getConf();
   String inputPath = "";
   String outputPath = "";
   try {
     Job job = new Job(conf, "Dummy");
     job.setNumReduceTasks(0);
     job.setMapperClass(Mapper.class);          // identity mapper
     inputPath = cl.getOptionValue("input");    // input is an s3n:// path
     outputPath = cl.getOptionValue("output");
     FileInputFormat.setInputPaths(job, inputPath);
     FileOutputFormat.setOutputPath(job, new Path(outputPath));
     _log.info("Input path set as " + inputPath);
     _log.info("Output path set as " + outputPath);
     job.waitForCompletion(true);
     return 0;
   } catch (Exception ex) {
     _log.error(ex);
     return 1;
   }
 }
  The above code works on the staging machine. However, it fails on the
  production machine, which is the same as the staging machine but with more
  capacity.

 Job Run:
 11/11/22 16:13:38 INFO Driver: Input path being processed is
 s3n://abc//mm/dd/*
 11/11/22 16:13:38 INFO Driver: Output path being processed is
 s3n://xyz//mm/dd/00/
 11/11/22 16:13:51 INFO mapred.FileInputFormat: Total input paths to
 process : 399
 11/11/22 16:13:53 INFO mapred.JobClient: Running job:
 job_20151645_14535
 11/11/22 16:13:54 INFO mapred.JobClient:  map 0% reduce 0%

  --- At this point, it hangs. The job submission goes fine and I can
  see messages in the jobtracker logs
  showing that the task assignment has happened fine. By that I mean the log
  says "Adding task (MAP) 'attempt_20262339_1974_r_40_1' to tip
  task_20262339_1974_r_40, for tracker
  'tracker_xx.xx.xx:localhost/127.0.0.1:47937'".
  But if I go to the logs of the tasktracker the task was assigned to, I do
  not see any mention of this attempt, which hints that the tasktracker did
  not pick up this task(?).
  We are using the fair scheduler, if that has something to do with it.

  I tried to check whether it is an issue with the connection to S3, so
  I ran a distcp from S3 to HDFS and it went fine, which suggests
  that there are no connectivity issues.

 Does anyone know what could be the possible reason for the error?

 Thanks in advance!

 Nitika



Re: Distributed sorting using Hadoop

2011-11-26 Thread Prashant Sharma
Please see my reply on common-dev.

Also, please do not send the same mail to all the mailing lists; be patient
and wait for people to reply.
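
That said, here is a rough, untested sketch of the usual approach. The
framework already merge-sorts map output by key before it reaches the
reducers, so a basic distributed sort needs only a mapper that emits the sort
key and a pass-through reducer; the class names are placeholders, and with
more than one reducer you would also want something like TotalOrderPartitioner
to get a single globally ordered result.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DistributedSort {

  // Map: emit the whole line as the key; the shuffle sorts the keys for us.
  public static class SortMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      context.write(line, NullWritable.get());
    }
  }

  // Reduce: keys arrive already sorted, so just write them back out,
  // once per occurrence to preserve duplicates.
  public static class SortReducer
      extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values,
        Context context) throws IOException, InterruptedException {
      for (NullWritable ignored : values) {
        context.write(key, NullWritable.get());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "distributed-sort");
    job.setJarByClass(DistributedSort.class);
    job.setMapperClass(SortMapper.class);
    job.setReducerClass(SortReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    job.setNumReduceTasks(1); // one reducer => one globally sorted output file
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}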

On Sat, Nov 26, 2011 at 6:35 PM, madhu_sushmi madhu_sus...@yahoo.comwrote:


 Hi,
  I need to implement distributed sorting using Hadoop. I am quite new to
  Hadoop and I am getting confused. If I want to implement merge sort, what
  should my map and reduce be doing? Should all the sorting happen on the
  reduce side?

 Please help. This is an urgent requirement. Please guide me.

 --
 View this message in context:
 http://old.nabble.com/Distributed-sorting-using-Hadoop-tp32876787p32876787.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: How to delete files older than X days in HDFS/Hadoop

2011-11-26 Thread Prashant Sharma
It won't be that easy, but it is possible to script. I did something like
this:

$HADOOP_HOME/bin/hadoop fs -rmr `$HADOOP_HOME/bin/hadoop fs -ls | grep
'.*2011.11.1[1-8].*' | cut -f 19 -d \ `

Note the space after -d (i.e. -d \SPACE).
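
If you would rather do it from Java than from the shell, here is a rough,
untested sketch of a small client that deletes by modification time via the
FileSystem API (the class name, argument layout and the non-recursive,
single-directory listing are placeholder assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Example usage (hypothetical jar name): hadoop jar deleter.jar DeleteOldFiles /data/folder 7
// Deletes every file directly under the given directory whose modification
// time is more than the given number of days in the past.
public class DeleteOldFiles {
  public static void main(String[] args) throws Exception {
    Path dir = new Path(args[0]);
    int days = Integer.parseInt(args[1]);
    long cutoff = System.currentTimeMillis() - days * 24L * 60 * 60 * 1000;

    FileSystem fs = FileSystem.get(new Configuration());
    for (FileStatus status : fs.listStatus(dir)) {
      if (!status.isDir() && status.getModificationTime() < cutoff) {
        System.out.println("Deleting " + status.getPath());
        fs.delete(status.getPath(), false); // false = do not recurse
      }
    }
    fs.close();
  }
}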

-P

On Sat, Nov 26, 2011 at 8:46 PM, Uma Maheswara Rao G
mahesw...@huawei.comwrote:

 AFAIK, there is no facility like this in HDFS through the command line.
 One option is to write a small client program that collects the files from
 the root based on your condition and invokes delete on them.

 Regards,
 Uma
 
 From: Raimon Bosch [raimon.bo...@gmail.com]
 Sent: Saturday, November 26, 2011 8:31 PM
 To: common-user@hadoop.apache.org
 Subject: How to delete files older than X days in HDFS/Hadoop

 Hi,

  I'm wondering how to delete files older than X days with HDFS/Hadoop. On
  Linux we can do it with the following command:

  find ~/datafolder/* -mtime +7 -exec rm {} \;

 Any ideas?



Re: how to compile package my hadoop?

2011-11-20 Thread Prashant Sharma
Some code in Hadoop, as in?

Well, you can read
http://svn.apache.org/repos/asf/hadoop/common/trunk/BUILDING.txt

Basically, to build the entire repo and make the distributions:

mvn clean package -Pdist -Dtar -DskipTests

You will find all the jars/tars etc. there.

On Thu, Nov 17, 2011 at 3:57 PM, seven garfee garfee.se...@gmail.comwrote:

 Hi all,
  I modified some code in Hadoop, but I'm not good at Ant or Maven. What
 command should I enter in a Linux shell to build a new hadoop-*.jar so I can
 test my code?



Re: No HADOOP COMMON HOME set.

2011-11-20 Thread Prashant Sharma
Jay,
If you are willing to work with the trunk version, you might want to compile
the documentation using mvn site:site and then follow that guide.

-P


On Fri, Nov 18, 2011 at 3:11 AM, GOEKE, MATTHEW (AG/1000) 
matthew.go...@monsanto.com wrote:

 Jay,

 Did you download stable (0.20.203.X) or 0.23? From what I can tell, after
 looking in the tarball for 0.23, it is a different setup than 0.20 (e.g.
 hadoop-env.sh doesn't exist anymore and is replaced by yarn-env.sh), and the
 documentation you referenced below is for setting up 0.20.

 I would suggest you go back and download stable and then the setup
 documentation you are following will make a lot more sense :)

 Matt

 -Original Message-
 From: Jay Vyas [mailto:jayunit...@gmail.com]
 Sent: Thursday, November 17, 2011 2:07 PM
 To: common-user@hadoop.apache.org
 Subject: No HADOOP COMMON HOME set.

 Hi guys: I followed the exact directions in the Hadoop installation guide
 for pseudo-distributed mode, here:

 http://hadoop.apache.org/common/docs/current/single_node_setup.html#Configuration

 However, I find that several environment variables are not set (for
 example, HADOOP_COMMON_HOME is not set).

 Also, Hadoop reported that HADOOP_CONF was not set as well.

 I'm wondering whether there is a resource on how to set the environment
 variables needed to run Hadoop?

 Thanks.

 --
 Jay Vyas
 MMSB/UCHC




Re: How to manage hadoop job submit?

2011-11-20 Thread Prashant Sharma
Richard and Ramon,

Yes, I think there should be a way. As you can see, there is a class named
JobClient in org.apache.hadoop.mapred, which is basically what the command
line invokes; if you open the hadoop shell script, my point will be clearer.
I also suggest you take a look at Oozie, where you can submit jobs to Hadoop
using Java APIs:

http://yahoo.github.com/oozie/releases/3.1.0/DG_Examples.html#Java_API_Example
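
For the simplest case, here is a rough, untested sketch of submitting a job
with a priority straight from Java using the old org.apache.hadoop.mapred API
mentioned above; the identity mapper/reducer and the HIGH priority are
placeholders, and only the shape of the JobConf/JobClient calls is the point:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobPriority;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Submit a job programmatically, set its priority, and return immediately
// instead of blocking until the job finishes.
public class SubmitWithPriority {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitWithPriority.class);
    conf.setJobName("example-job");
    conf.setJobPriority(JobPriority.HIGH); // VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient client = new JobClient(conf);
    RunningJob running = client.submitJob(conf); // non-blocking submit
    System.out.println("Submitted job " + running.getID());
  }
}

The configured scheduler (e.g. the fair or capacity scheduler) then decides
how that priority is actually honoured on your cluster.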

Thanks
-P

On Sun, Nov 20, 2011 at 11:12 PM, Richard Dixon rich.dixon2...@yahoo.comwrote:

 Ramon,

  You might issue ./hadoop job -list all to get the jobs and then
  -set-priority job-id priority. I know that someone from the Ocean Sync
  (http://www.oceansync.com) Hadoop management project is working on
  interacting with MapReduce jobs through a GUI, to set priorities that way,
  but they are still in beta.


  job
  Command to interact with MapReduce jobs.
  Usage: hadoop job [GENERIC_OPTIONS]
    [-submit job-file] | [-status job-id] |
    [-counter job-id group-name counter-name] | [-kill job-id] |
    [-events job-id from-event-# #-of-events] |
    [-history [all] jobOutputDir] | [-list [all]] |
    [-kill-task task-id] | [-fail-task task-id] |
    [-set-priority job-id priority]


 - Original Message -
 From: WangRamon ramon_w...@hotmail.com
 To: common-user@hadoop.apache.org
 Cc:
 Sent: Sunday, November 20, 2011 4:44 AM
 Subject: How to manage hadoop job submit?


  Hi All,

  I'm new to Hadoop. I know I can use hadoop jar to submit my M/R job, but
  we need to submit a lot of jobs in my real environment, and there is a
  priority requirement for each job. So is there any way to manage how jobs
  are submitted? Any Java API? Or can we only use the hadoop command line,
  with shell or Python, to do the job submission?

  Thanks,
  Ramon



Re: Hadoop MapReduce Poster

2011-10-31 Thread Prashant Sharma
Hi Mathias,

   I wrote a small introduction, a quick ramp-up for getting started with
Hadoop, while learning it at my institute:
http://functionalprograming.files.wordpress.com/2011/07/hadoop-2.pdf

Thanks,
-P

On Mon, Oct 31, 2011 at 6:44 PM, Mathias Herberts 
mathias.herbe...@gmail.com wrote:

 Hi,

  I'm in the process of putting together a 'Hadoop MapReduce Poster' so
  my students can better understand the various steps of a MapReduce job
  as run by Hadoop.

  I intend to release the poster under a CC-BY-NC-ND license.

  I'd be grateful if people could review the current draft (3) of the poster.

 It is available as a 200 dpi PNG here:

 http://www.flickr.com/photos/herberts/6298203371

 Any comment welcome.

 Mathias.