Re: Clickstream and video Analysis
TubeMogul is one of them.

On Thu, Feb 23, 2012 at 11:00 AM, shreya@cognizant.com wrote:

Hi, could someone provide some links on clickstream and video analysis in Hadoop?

Thanks and regards,
Shreya Pal
Re: working with SAS
Plus, you will not necessarily need vertical (scale-up) systems for speeding things up (it depends entirely on your query). Consider commodity hardware (much cheaper); since Hadoop is well suited to it, your infrastructure can, I hope, be cheaper in terms of price-to-performance ratio. Having said that, I do not mean you have to throw away your existing infrastructure, because it is ideal for certain requirements.

Your solution could be to write a MapReduce job that does what the query is supposed to do and run it on a cluster. Of what size? It depends on how fast you want things done, and on scale. In case your query is ad hoc and has to be run frequently, you might want to consider HBase and Hive as solutions, with a lot of expensive vertical nodes ;). By the way, is your query iterative? A few more details on the type of query may attract people with more wisdom to help. A rough sketch of such a job appears at the end of this thread. HTH

On Mon, Feb 6, 2012 at 1:46 PM, alo alt wget.n...@googlemail.com wrote:

Hi,

Hadoop runs on Linux boxes (mostly) and can run in a standalone installation for testing only. If you decide to use Hadoop with Hive or HBase you have to face a few more tasks:
- installation (Whirr and Amazon EC2, for example)
- write your own MapReduce job or use Hive / HBase
- set up Sqoop with the Teradata driver

You can easily set up parts 1 and 2 with Amazon's EC2; I think you can also book Windows Server there. For a single query that is, I think, the best option before you install a Hadoop cluster of your own.

best,
Alex

--
Alexander Lorenz
http://mapredit.blogspot.com

On Feb 6, 2012, at 8:11 AM, Ali Jooan Rizvi wrote:

Hi,

I would like to know if Hadoop will be of help to me. Let me explain my scenario:

I have a Windows-based single-machine server with 16 cores and 48 GB of physical memory. In addition, I have 120 GB of virtual memory. I am running a query with statistical calculations on a large dataset of over 1 billion rows, on SAS. In this case, SAS is acting like a database on which both the source and target tables reside. For storage, I can keep the source and target data on Teradata as well, but the query, which contains patented logic, can only be run through the SAS interface.

The problem is that SAS is taking many days (25 days) to run it (a single query with a statistical function), and not all cores were used all the time; merely 5% CPU was utilized on average. Memory utilization, however, was very high, which is why so much virtual memory was used.

Can I put a Hadoop interface in place to do all this, so that I may end up running the query in less time, say 1 or 2 days? Anything squeezing my run time will be very helpful.

Thanks
Ali Jooan Rizvi
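A minimal sketch of the kind of MapReduce job suggested above, with a simple per-key sum standing in for the actual (proprietary) statistic; the class names, CSV layout and paths are illustrative assumptions, not taken from the original query:

    // Hypothetical skeleton; the real statistic would replace the sum below.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class StatJob {
      public static class StatMapper
          extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text row, Context ctx)
            throws IOException, InterruptedException {
          // Assumes CSV rows of the form "groupKey,value".
          String[] cols = row.toString().split(",");
          ctx.write(new Text(cols[0]),
                    new DoubleWritable(Double.parseDouble(cols[1])));
        }
      }

      public static class StatReducer
          extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context ctx)
            throws IOException, InterruptedException {
          double sum = 0;                       // placeholder statistic
          for (DoubleWritable v : values) sum += v.get();
          ctx.write(key, new DoubleWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "stat-job");
        job.setJarByClass(StatJob.class);
        job.setMapperClass(StatMapper.class);
        job.setReducerClass(StatReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Run with something like "hadoop jar statjob.jar StatJob <input> <output>"; scaling then becomes a matter of adding nodes rather than buying a bigger box.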
Re: Why $HADOOP_PREFIX ?
I think you have misunderstood something. As far as I know or understand, these variables are set automatically when you run a script; the name is obscure for some strange reason ;). The "Warning: $HADOOP_HOME is deprecated" message is always there, whether the variable is set or not. Why? Because hadoop-config.sh is sourced in all scripts, and all it does is set HADOOP_HOME from HADOOP_PREFIX. I think this can be reported as a bug.

-P

On Wed, Feb 1, 2012 at 5:46 PM, praveenesh kumar praveen...@gmail.com wrote:

Does anyone have an idea why $HADOOP_PREFIX was introduced instead of $HADOOP_HOME in hadoop 0.20.205? I believe $HADOOP_HOME was not giving any trouble, or is there a reason or new feature that required $HADOOP_PREFIX to be added? It's kind of funny, but I was in the habit of using $HADOOP_HOME; just curious about this change.

Also, there are some old packages (I am not referring to Apache, Cloudera, or any Hadoop distribution) that depend on Hadoop and still use $HADOOP_HOME inside. So it's kind of weird that when you use those packages you still get warning messages, even though the warning is suppressed from Hadoop's side.

Thanks,
Praveenesh
Re: Why $HADOOP_PREFIX ?
@Harsh, I sometimes get similar thoughts :P. But I wonder if something can be done about it.
@Bobby, thanks for elaborating on the strange reason. :)
@Praveenesh, yes, you can do away with the sourcing of hadoop-config.sh and set all the necessary variables by hand.

On Wed, Feb 1, 2012 at 10:38 PM, Harsh J ha...@cloudera.com wrote:

Personal opinion here: for branch-1, I do think the earlier tarball structure was better. I do not see why it had to change for this version at least. It was possibly changed during all the work of adding packaging-related scripts for rpm/deb into Hadoop itself, but the tarball right now is not as usable as it was before, and the older format would have still worked today.

On Wed, Feb 1, 2012 at 10:31 PM, Robert Evans ev...@yahoo-inc.com wrote:

I think it comes down to a long history of splitting and then re-merging the Hadoop project. I could be wrong about a lot of this, so take it with a grain of salt. Hadoop originally was, and still is on 1.0, a single project: HDFS, MapReduce and common are all compiled together into a single jar, hadoop-core. In that respect HADOOP_HOME made a lot of sense, because it was a single thing, with some dependencies that needed to be found by some shell scripts. Fast forward: the projects were split, HADOOP_HOME was deprecated, and HADOOP_COMMON_HOME, HADOOP_MAPRED_HOME, and HADOOP_HDFS_HOME were born. But if we install them all into a single tree, it is a pain to configure all of these to point to the same place; HADOOP_HOME is deprecated, though, so HADOOP_PREFIX was born.

NOTE: as was stated before, all of these are supposed to be hidden from the end user and are intended more for packaging and deploying Hadoop. Also, the process is not done and it is likely to change further.

--Bobby Evans

On 2/1/12 8:10 AM, praveenesh kumar praveen...@gmail.com wrote:

Interesting and strange. But is there any reason for setting $HADOOP_HOME to $HADOOP_PREFIX in hadoop-config.sh and then checking in bin/hadoop whether $HADOOP_HOME is non-empty? I mean, if I comment out the export HADOOP_HOME=${HADOOP_PREFIX} in hadoop-config.sh, does it make any difference?

Thanks,
Praveenesh

--
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about
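For reference, the mechanics described in this thread amount to roughly the following; this is a paraphrase of the 0.20.205-era scripts from memory, not verbatim source, so check your own copy:

    # bin/hadoop-config.sh: sourced by every launcher script.
    # It derives HADOOP_PREFIX from the script location and then aliases it:
    export HADOOP_HOME=${HADOOP_PREFIX}

    # bin/hadoop: warns whenever HADOOP_HOME is set, which, because of the
    # export above, is always, unless the warning is explicitly suppressed:
    if [ "$HADOOP_HOME_WARN_SUPPRESS" = "" ] && [ "$HADOOP_HOME" != "" ]; then
      echo "Warning: \$HADOOP_HOME is deprecated." 1>&2
    fi

This is why commenting out the export, or setting HADOOP_HOME_WARN_SUPPRESS, makes the warning disappear only for Hadoop's own scripts and not for third-party packages that still export $HADOOP_HOME themselves.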
Re: Any info on R+Hadoop
Praveenesh,

Well, it gives you more convenience :). If you have worked in R, you might notice that with R you can write a mapper like an lapply (using rmr). They have already abstracted a lot of the stuff for you, so you have less control over things, but as far as convenience is concerned it's damn cool. For example, you can process data inside R using Hadoop (no doubt it uses Hadoop streaming behind the scenes) and have the processed data easily loaded back into the R command line from HDFS (using rhdfs). A tiny sketch follows at the end of this thread. Generally R developers do not like being engrossed in the hassles that Hadoop streaming can bring.

-P
P.S. I am not endorsing anyone. It's just my view.

On Sun, Jan 29, 2012 at 12:54 PM, praveenesh kumar praveen...@gmail.com wrote:

Has anyone done any work with R + Hadoop? I know there are some flavors of R + Hadoop available, such as rmr, rhdfs, RHIPE and R-hive, but as far as I know submitting jobs using Hadoop streaming is the best way right now available. Am I right? Any info on R on Hadoop?

Thanks,
Praveenesh
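A tiny sketch of that lapply-like style; this assumes the early rmr API (to.dfs, mapreduce, keyval, from.dfs), so check the function names against your installed version:

    # Hedged sketch of the style described above; API names may differ by rmr version.
    library(rmr)
    ints <- to.dfs(1:1000)                                 # push data into HDFS
    out  <- mapreduce(input = ints,
                      map = function(k, v) keyval(v, v^2)) # mapper as a per-record function
    head(from.dfs(out))                                    # load results back into the R session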
Re: Hadoop Cluster Quick Setup Script
Edmon,

I made some effort on this but eventually lost interest. I think I made some progress, and perhaps you can take it forward from there in MAPREDUCE-3131: https://issues.apache.org/jira/browse/MAPREDUCE-3131. I am ready to help with anything I can. Also, it works perfectly for a single node; there is a wiki in it (which you can compile using mvn site:site).

Thanks,
Prashant

On Sat, Dec 3, 2011 at 9:10 PM, Edmon Begoli ebeg...@gmail.com wrote:

Does anyone have, or know of, a simple (Apache) Hadoop cluster script that sets up Hadoop across the cluster using some reasonable default values and across a set of IPs? I want to install a minimum five-node virtual cluster and perhaps grow it larger. I would like to use a script that pulls the components, uses default settings and deploys Hadoop over a range of IP addresses.

I am planning to write all of this myself, but I do want to check if anyone here has something already, or if there is something out there. (I like the Cloudera stuff, but I want to have something that is pure Apache.)

Thank you in advance,
Edmon
Re: Help with Hadoop Eclipse Plugin on Mac OS X Lion
Why do you need a plugin at all? You can do away with it by having a Maven project, i.e. having a pom.xml with Hadoop as one of the dependencies (a sketch follows at the end of this thread). Then use regular Maven commands to build etc.; e.g. mvn eclipse:eclipse would be an interesting command.

On Fri, Dec 2, 2011 at 1:59 PM, Will L seventeen_reas...@hotmail.com wrote:

Oops, guess the formatting went away. I have tried the following combinations:

* Hadoop 0.20.203, Eclipse 3.6.2 (32-bit), hadoop-eclipse-plugin-0.20.203.0.jar
* Hadoop 0.20.203, Eclipse 3.6.2 (32-bit), hadoop-eclipse-plugin-0.20.3-SNAPSHOT.jar (from JIRA)
* Hadoop 0.20.203, Eclipse 3.7.1 (32-bit), hadoop-eclipse-plugin-0.20.203.0.jar
* Hadoop 0.20.203, Eclipse 3.7.1 (32-bit), hadoop-eclipse-plugin-0.20.3-SNAPSHOT.jar (from JIRA)
* Hadoop 0.20.205, Eclipse 3.7.1 (32-bit), hadoop-eclipse-plugin-0.20.205.0.jar

From: seventeen_reas...@hotmail.com
To: common-user@hadoop.apache.org
Subject: Help with Hadoop Eclipse Plugin on Mac OS X Lion
Date: Fri, 2 Dec 2011 00:26:28 -0800

Hello,

I am having problems getting my Hadoop Eclipse plugin to work on Mac OS X Lion. I have tried the combinations listed above. Has anyone gotten the Hadoop Eclipse plugin to work on Mac OS X Lion?

Thank you for your time and help, I greatly appreciate it!

Sincerely,
Will
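A minimal sketch of such a pom.xml dependency; the version shown matches the 0.20.203 artifacts in Maven Central, and the provided scope is just one reasonable choice, so adjust both to your cluster:

    <!-- Sketch: declare Hadoop as a dependency and let Maven
         (mvn eclipse:eclipse) generate the Eclipse project from it. -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>0.20.203.0</version>
      <scope>provided</scope>
    </dependency>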
Re: Help with Hadoop Eclipse Plugin on Mac OS X Lion
Nice to know, Will. Well, with the approach I described you have the same luxury, as long as you are running in stand-alone mode, which is ideal for development.

On Fri, Dec 2, 2011 at 10:02 PM, Will L seventeen_reas...@hotmail.com wrote:

I got the setup working under my laptop running OS X Snow Leopard without any problems, and I would like to use my new laptop running OS X Lion. The plugin is helpful in that I can see Hadoop output being dumped to the Eclipse console, and it used to integrate well with the Eclipse IDE, making my development life a little easier.

Thank you for your time and help.

Sincerely,
Will Lieu
Re: [help]how to stop HDFS
Try making $HADOOP_CONF point to the right classpath, including your configuration folder.

On Tue, Nov 29, 2011 at 3:58 PM, cat fa boost.subscrib...@gmail.com wrote:

I used the command

$HADOOP_PREFIX_HOME/bin/hdfs start namenode --config $HADOOP_CONF_DIR

to start HDFS. This command is in the Hadoop documentation (here: http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html). However, I got errors such as

Exception in thread "main" java.lang.NoClassDefFoundError: start

Could anyone tell me how to start and stop HDFS? By the way, how do I set Gmail so that it doesn't top-post my reply?
Re: Re: [help]how to stop HDFS
I mean, you have to export the variables:

export HADOOP_CONF_DIR=/path/to/your/configdirectory

and also export HADOOP_HDFS_HOME and HADOOP_COMMON_HOME before you run your command (see the sketch at the end of this thread). I suppose this should fix the problem.

-P

On Tue, Nov 29, 2011 at 6:23 PM, cat fa boost.subscrib...@gmail.com wrote:

It didn't work. It gave me the usage information.

2011/11/29 hailong.yang1115 hailong.yang1...@gmail.com

Try $HADOOP_PREFIX_HOME/bin/hdfs namenode stop --config $HADOOP_CONF_DIR and $HADOOP_PREFIX_HOME/bin/hdfs datanode stop --config $HADOOP_CONF_DIR. They would stop the namenode and datanode separately. HADOOP_CONF_DIR is the directory where you store your configuration files.

Hailong

***
* Hailong Yang, PhD. Candidate
* Sino-German Joint Software Institute,
* School of Computer Science & Engineering, Beihang University
* Phone: (86-010)82315908
* Email: hailong.yang1...@gmail.com
* Address: G413, New Main Building in Beihang University,
* No.37 XueYuan Road, HaiDian District,
* Beijing, P.R. China, 100191
***

From: cat fa
Date: 2011-11-29 20:22
To: common-user
Subject: Re: [help]how to stop HDFS

Do you mean $HADOOP_CONF or $HADOOP_CONF_DIR? I'm using hadoop 0.23. And which class do you mean, a Hadoop class or a Java class?
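Putting the suggestions together, a session for a 0.23-style tar install might look like the following; all paths are illustrative, and the daemon-script invocation is how I remember the 0.23 cluster-setup docs, so double-check it against your release:

    # Illustrative sketch; adjust paths to your extracted tarballs.
    export HADOOP_CONF_DIR=/path/to/your/configdirectory
    export HADOOP_COMMON_HOME=/path/to/extracted/hadoop-common
    export HADOOP_HDFS_HOME=/path/to/extracted/hadoop-hdfs

    # Daemons are started/stopped via the daemon script, not "hdfs start":
    $HADOOP_COMMON_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR \
        --script hdfs start namenode
    $HADOOP_COMMON_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR \
        --script hdfs stop namenode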
Re: Re: [help]how to stop HDFS
I am sorry, I had no idea you had done an rpm install; my suggestion was based on the assumption that you had done a tar-extract install, where all three distributions have to be extracted and the variables exported. I also have no experience with rpm-based installs, so no comment on what went wrong in your case. Basically, from the error I can say that it is not able to find the jars needed on the classpath, which the scripts locate through HADOOP_COMMON_HOME. I would say check the access permissions: which user was it installed with, and which user is it running as?

On Tue, Nov 29, 2011 at 10:48 PM, cat fa boost.subscrib...@gmail.com wrote:

Thank you for your help, but I'm still a little confused. Suppose I installed hadoop in /usr/bin/hadoop/. Should I point HADOOP_COMMON_HOME to /usr/bin/hadoop? Where should I point HADOOP_HDFS_HOME? Also to /usr/bin/hadoop/?
Re: choices for deploying a small hadoop cluster on EC2
Yes, the Pallet Hadoop library: https://github.com/pallet/pallet-hadoop-example

On Wed, Nov 30, 2011 at 1:58 AM, Periya.Data periya.d...@gmail.com wrote:

Hi All,

I am just beginning to learn how to deploy a small cluster (a 3-node cluster) on EC2. After some quick Googling, I see the following approaches:

1. Use Whirr for quick deployment and tearing down. Uses CDH3. Does it have features for persisting (EBS)?
2. CDH Cloud Scripts - has an EC2 AMI - again for temporary Hadoop clusters/POCs etc. Good stuff - I can persist using EBS snapshots. But this uses CDH2.
3. Install Hadoop and related stuff like Hive manually on each cluster node on EC2 (or use some automation tool like Chef). I do not prefer it.
4. The Hadoop distribution comes with EC2 support (under src/contrib) and there are several Hadoop EC2 AMIs available. I have not studied enough to know if that is easy for a beginner like me.
5. Anything else?

Options 1 and 2 look promising for a beginner. If any of you have any thoughts about this, I would like to know (like what to keep in mind, what to take care of, caveats etc.). I want my data/config to persist (using EBS) and to continue from where I left off (after a few days). Also, I want to have Hive and Sqoop installed. Can this be done using 1 or 2? Or will they have to be installed manually after I set up the cluster?

Thanks very much,
PD.
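For option 1, a Whirr deployment is essentially a small properties file plus two commands. A sketch, with the cluster name and node counts as illustrative assumptions (check the recipe syntax against your Whirr version's quickstart):

    # hadoop.properties - illustrative Whirr recipe
    whirr.cluster-name=myhadoopcluster
    whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,2 hadoop-datanode+hadoop-tasktracker
    whirr.provider=aws-ec2
    whirr.identity=${env:AWS_ACCESS_KEY_ID}
    whirr.credential=${env:AWS_SECRET_ACCESS_KEY}

    # launch, and later tear down, the cluster:
    bin/whirr launch-cluster --config hadoop.properties
    bin/whirr destroy-cluster --config hadoop.properties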
Re: Issue : Hadoop mapreduce job to process S3 logs gets hung at INFO mapred.JobClient: map 0% reduce 0%
Can you check your userlogs/xyz_attempt_xyz.log and also the jobtracker and datanode logs?

-P

On Tue, Nov 29, 2011 at 4:17 AM, Nitika Gupta ngu...@rocketfuelinc.com wrote:

Hi All,

I am trying to run a mapreduce job to process the Amazon S3 logs. However, the job hangs at "INFO mapred.JobClient: map 0% reduce 0%" and does not even attempt to launch the tasks. The sample code for the job setup is given below:

    public int run(CommandLine cl) throws Exception {
        Configuration conf = getConf();
        String inputPath = "";
        String outputPath = "";
        try {
            Job job = new Job(conf, "Dummy");
            job.setNumReduceTasks(0);
            job.setMapperClass(Mapper.class);
            inputPath = cl.getOptionValue("input");   // input is an s3n path
            outputPath = cl.getOptionValue("output");
            FileInputFormat.setInputPaths(job, inputPath);
            FileOutputFormat.setOutputPath(job, new Path(outputPath));
            _log.info("Input path set as " + inputPath);
            _log.info("Output path set as " + outputPath);
            job.waitForCompletion(true);
            return 0;
        } catch (Exception ex) {
            _log.error(ex);
            return 1;
        }
    }

The above code works on the staging machine. However, it fails on the production machine, which is the same as the staging machine but with more capacity.

Job run:

11/11/22 16:13:38 INFO Driver: Input path being processed is s3n://abc//mm/dd/*
11/11/22 16:13:38 INFO Driver: Output path being processed is s3n://xyz//mm/dd/00/
11/11/22 16:13:51 INFO mapred.FileInputFormat: Total input paths to process : 399
11/11/22 16:13:53 INFO mapred.JobClient: Running job: job_20151645_14535
11/11/22 16:13:54 INFO mapred.JobClient: map 0% reduce 0%

At this point, it hangs. The job submission goes fine and I can see messages in the jobtracker logs that task assignment has happened fine. By that I mean the log says

Adding task (MAP) 'attempt_20262339_1974_r_40_1' to tip task_20262339_1974_r_40, for tracker 'tracker_xx.xx.xx:localhost/127.0.0.1:47937'

But if I go to the logs of the tasktracker to which the task was assigned, I do not see any mention of this attempt, which hints that the tasktracker did not pick up this task(?). We are using the fair scheduler, if that has something to do with it. I also tried to verify whether it is an issue with the connection to S3: a distcp from S3 to HDFS went fine, which suggests there are no connectivity issues.

Does anyone know what could be the possible reason for the error?

Thanks in advance!
Nitika
Re: Distributed sorting using Hadoop
Please see my mail on common-dev. Also, please do not send the same mail to all the mailing lists; be patient and wait for people to reply.

On Sat, Nov 26, 2011 at 6:35 PM, madhu_sushmi madhu_sus...@yahoo.com wrote:

Hi,

I need to implement distributed sorting using Hadoop. I am quite new to Hadoop and I am getting confused. If I want to implement merge sort, what should my map and reduce be doing? Should all the sorting happen on the reduce side? Please help. This is an urgent requirement. Please guide me.
Re: How to delete files older than X days in HDFS/Hadoop
It won't be that easy, but it is possible to write. I did something like this:

$HADOOP_HOME/bin/hadoop fs -rmr `$HADOOP_HOME/bin/hadoop fs -ls | grep '.*2011.11.1[1-8].*' | cut -f 19 -d ' '`

Notice that the delimiter passed to cut is a single space.

-P

On Sat, Nov 26, 2011 at 8:46 PM, Uma Maheswara Rao G mahesw...@huawei.com wrote:

AFAIK, there is no facility like this in HDFS through the command line. One option is to write a small client program that collects the files from the root based on your condition and invokes delete on them (a sketch follows at the end of this thread).

Regards,
Uma

From: Raimon Bosch [raimon.bo...@gmail.com]
Sent: Saturday, November 26, 2011 8:31 PM
To: common-user@hadoop.apache.org
Subject: How to delete files older than X days in HDFS/Hadoop

Hi,

I'm wondering how to delete files older than X days with HDFS/Hadoop. On Linux we can do it with the following command:

find ~/datafolder/* -mtime +7 -exec rm {} \;

Any ideas?
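A sketch of the small client program Uma suggests, using the FileSystem API to delete anything directly under a directory whose modification time is older than N days; the class name and argument handling are illustrative:

    // Sketch: delete entries under a directory older than N days.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DeleteOldFiles {
      public static void main(String[] args) throws Exception {
        Path root = new Path(args[0]);              // e.g. /user/me/datafolder
        long days = Long.parseLong(args[1]);        // e.g. 7
        long cutoff = System.currentTimeMillis() - days * 24L * 60 * 60 * 1000;

        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus stat : fs.listStatus(root)) {
          if (stat.getModificationTime() < cutoff) {
            fs.delete(stat.getPath(), true);        // true = recursive
          }
        }
      }
    }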
Re: how to complie package my hadoop?
"Some code in hadoop" as in? Well, you can read http://svn.apache.org/repos/asf/hadoop/common/trunk/BUILDING.txt. Basically, to build the entire repo and make distributions:

mvn clean package -Pdist -Dtar -DskipTests

You will find all the jars/tars etc. there.

On Thu, Nov 17, 2011 at 3:57 PM, seven garfee garfee.se...@gmail.com wrote:

Hi all,

I modified some code in Hadoop, but I'm not good at Ant or Maven. What command should I enter in a Linux shell to build a new hadoop-*.jar to test my code?
Re: No HADOOP COMMON HOME set.
Jay,

And if you are willing to work on the trunk version, you might want to compile the documents using mvn site:site and then follow that guide.

-P

On Fri, Nov 18, 2011 at 3:11 AM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote:

Jay,

Did you download stable (0.20.203.X) or 0.23? From what I can tell after looking in the tarball for 0.23, it is a different setup than 0.20 (e.g. hadoop-env.sh doesn't exist anymore and is replaced by yarn-env.sh), and the documentation you referenced below is for setting up 0.20. I would suggest you go back and download stable; the setup documentation you are following will then make a lot more sense. :)

Matt

-----Original Message-----
From: Jay Vyas [mailto:jayunit...@gmail.com]
Sent: Thursday, November 17, 2011 2:07 PM
To: common-user@hadoop.apache.org
Subject: No HADOOP COMMON HOME set.

Hi guys:

I followed the exact directions in the Hadoop installation guide for pseudo-distributed mode here: http://hadoop.apache.org/common/docs/current/single_node_setup.html#Configuration

However, I get that several environment variables are not set (for example, HADOOP_COMMON_HOME is not set). Hadoop also reported that HADOOP_CONF was not set. I'm wondering whether there is a resource on how to set the environment variables needed to run Hadoop?

Thanks.

--
Jay Vyas
MMSB/UCHC
Re: How to manage hadoop job submit?
Richard and Ramon,

Yes, I think there should be a way. As you can see, there is a class named JobClient in org.apache.hadoop.mapred which is basically what is invoked from the command line; if you open the hadoop shell script my point will be clearer (a sketch follows at the end of this thread). I also suggest you take a look at Oozie; there, using Java APIs, you can submit jobs to Hadoop: http://yahoo.github.com/oozie/releases/3.1.0/DG_Examples.html#Java_API_Example

Thanks,
-P

On Sun, Nov 20, 2011 at 11:12 PM, Richard Dixon rich.dixon2...@yahoo.com wrote:

Ramon,

You might issue ./hadoop job -list all to get the jobs and then -set-priority <job-id> <priority>. I know that someone from the Ocean Sync (http://www.oceansync.com) Hadoop management project is working on interacting with MapReduce jobs through a GUI, to set priorities that way, but they are still in beta.

job: command to interact with MapReduce jobs. Usage:

hadoop job [GENERIC_OPTIONS] [-submit <job-file>] | [-status <job-id>] | [-counter <job-id> <group-name> <counter-name>] | [-kill <job-id>] | [-events <job-id> <from-event-#> <#-of-events>] | [-history [all] <jobOutputDir>] | [-list [all]] | [-kill-task <task-id>] | [-fail-task <task-id>] | [-set-priority <job-id> <priority>]

----- Original Message -----
From: WangRamon ramon_w...@hotmail.com
To: common-user@hadoop.apache.org
Sent: Sunday, November 20, 2011 4:44 AM
Subject: How to manage hadoop job submit?

Hi All,

I'm new to Hadoop. I know I can use "hadoop jar" to submit my M/R job, but we need to submit a lot of jobs in my real environment and there is a priority requirement for each job, so is there any way to manage how jobs are submitted? Any Java API? Or can we only use the hadoop command line with shell or Python to do the job submission?

Thanks,
Ramon
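A sketch of programmatic submission with a priority through the old-API JobClient; the class name and paths are illustrative, and the mapper/reducer are left at their identity defaults:

    // Sketch: programmatic submission with a priority via the mapred API.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobPriority;
    import org.apache.hadoop.mapred.RunningJob;

    public class PrioritySubmit {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PrioritySubmit.class);
        conf.setJobName("my-job");
        conf.setJobPriority(JobPriority.HIGH);    // VERY_HIGH .. VERY_LOW
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // set your own mapper/reducer classes here

        JobClient client = new JobClient(conf);
        RunningJob job = client.submitJob(conf);  // returns immediately, unlike runJob()
        System.out.println("Submitted " + job.getID());
      }
    }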
Re: Hadoop MapReduce Poster
Hi Mathias,

I wrote a small introduction, a quick ramp-up for starting out with Hadoop, while learning it at my institute: http://functionalprograming.files.wordpress.com/2011/07/hadoop-2.pdf

Thanks,
-P

On Mon, Oct 31, 2011 at 6:44 PM, Mathias Herberts mathias.herbe...@gmail.com wrote:

Hi,

I'm in the process of putting together a 'Hadoop MapReduce Poster' so my students can better understand the various steps of a MapReduce job as run by Hadoop. I intend to release the poster under a CC-BY-NC-ND license.

I'd be grateful if people could review the current draft (3) of the poster. It is available as a 200 dpi PNG here: http://www.flickr.com/photos/herberts/6298203371

Any comments welcome.

Mathias.