incremental loads into hadoop
Hi, I am relatively new to Hadoop and was wondering how to do incremental loads into HDFS. I have a continuous stream of data flowing into a service which writes to an OLTP store. Due to the high volume of data, we cannot do aggregations on the OLTP store, since this starts affecting write performance. We would like to offload this processing onto a Hadoop cluster, mainly for doing aggregations/analytics. The question is: how can this continuous stream of data be incrementally loaded into Hadoop and processed?

Thank you,
Sam
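One common answer for this is periodic incremental pulls from the OLTP store with Sqoop (or pushing the stream itself through a collector such as Flume), then running the aggregations in MapReduce over the newly landed files. A minimal Sqoop sketch -- assuming the OLTP store is reachable over JDBC; the connection string, table, and column names below are placeholders:

# Each run imports only rows whose updated_at is newer than --last-value,
# so HDFS accumulates the stream in increments rather than full reloads.
$ sqoop import \
    --connect jdbc:mysql://oltp-host/appdb \
    --table events \
    --incremental lastmodified \
    --check-column updated_at \
    --last-value "2011-10-01 00:00:00" \
    --target-dir /data/events/incremental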
October SF Hadoop Meetup
The October SF Hadoop users meetup will be held Wednesday, October 12, from 7pm to 9pm. This meetup will be hosted by Twitter at their office on Folsom St. *Please note that due to scheduling constraints, we will begin an hour later than usual this month.*

As usual, we will use the discussion-based "unconference" format. At the beginning of the meetup we will collaboratively construct an agenda consisting of several discussion breakout groups. All participants may propose a topic and volunteer to facilitate a discussion. All Hadoop-related topics are encouraged, and all members of the Hadoop community are welcome.

Event schedule:
- 7pm - Welcome
- 7:30pm - Introductions; start creating agenda
- Breakout sessions begin as soon as we're ready
- 9pm - Conclusion

Food and refreshments will be provided, courtesy of Twitter.

Please RSVP at http://www.meetup.com/hadoopsf/events/35650052/

Regards,
- Aaron Kimball
Re: error for deploying hadoop on macbook pro
Since you're only just beginning, and have unknowingly issued multiple "namenode -format" commands, simply run the following and restart the DN alone:

$ rm -r /private/tmp/hadoop-hadoop-user/dfs/data

(And please do not reformat the namenode, lest you go out of namespace-ID sync yet again -- you can instead `hadoop dfs -rmr /*` to rid yourself of all HDFS files.)

On Sat, Oct 1, 2011 at 2:13 AM, Jignesh Patel wrote:
> Now I am able to get the task tracker and job tracker running, but I still have the following problem with the datanode:
>
> ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /private/tmp/hadoop-hadoop-user/dfs/data: namenode namespaceID = 798142055; datanode namespaceID = 964022125
>
> On Sep 30, 2011, at 3:59 PM, Jignesh Patel wrote:
>
>> I am trying to set up a single-node cluster using hadoop-0.20.204.0, and while setting it up I found my job tracker and task tracker are not starting. I am attaching the exception. I also don't know why, while formatting the name node, my IP address still doesn't show 127.0.0.1, as follows:
>>
>> 11/09/30 15:50:36 INFO namenode.NameNode: STARTUP_MSG:
>> /************************************************************
>> STARTUP_MSG: Starting NameNode
>> STARTUP_MSG:   host = Jignesh-MacBookPro.local/192.168.1.120
>> STARTUP_MSG:   args = [-format]
>> STARTUP_MSG:   version = 0.20.204.0
>> STARTUP_MSG:   build = git://hrt8n35.cc1.ygridcore.net/ on branch branch-0.20-security-204 -r 65e258bf0813ac2b15bb4c954660eaf9e8fba141; compiled by 'hortonow' on Thu Aug 25 23:35:31 UTC 2011

--
Harsh J
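Spelled out as a full sequence for a single-node install (a sketch -- assumes a stock 0.20 layout; the data directory path comes from the error above):

$ bin/stop-dfs.sh                                  # stop the HDFS daemons (or stop just the datanode)
$ rm -r /private/tmp/hadoop-hadoop-user/dfs/data   # discard the stale datanode state
$ bin/start-dfs.sh                                 # datanode re-registers under the current namespaceID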
Re: error for deploying hadoop on macbook pro
Now I am able to get the task tracker and job tracker running, but I still have the following problem with the datanode:

ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /private/tmp/hadoop-hadoop-user/dfs/data: namenode namespaceID = 798142055; datanode namespaceID = 964022125

On Sep 30, 2011, at 3:59 PM, Jignesh Patel wrote:

> I am trying to set up a single-node cluster using hadoop-0.20.204.0, and while setting it up I found my job tracker and task tracker are not starting. I am attaching the exception. I also don't know why, while formatting the name node, my IP address still doesn't show 127.0.0.1, as follows:
>
> 11/09/30 15:50:36 INFO namenode.NameNode: STARTUP_MSG:
> /************************************************************
> STARTUP_MSG: Starting NameNode
> STARTUP_MSG:   host = Jignesh-MacBookPro.local/192.168.1.120
> STARTUP_MSG:   args = [-format]
> STARTUP_MSG:   version = 0.20.204.0
> STARTUP_MSG:   build = git://hrt8n35.cc1.ygridcore.net/ on branch branch-0.20-security-204 -r 65e258bf0813ac2b15bb4c954660eaf9e8fba141; compiled by 'hortonow' on Thu Aug 25 23:35:31 UTC 2011
Fwd: error for deploying hadoop on macbook pro
I am trying to set up a single-node cluster using hadoop-0.20.204.0, and while setting it up I found my job tracker and task tracker are not starting. I am attaching the exception. I also don't know why, while formatting the name node, my IP address still doesn't show 127.0.0.1, as follows:

11/09/30 15:50:36 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = Jignesh-MacBookPro.local/192.168.1.120
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.204.0
STARTUP_MSG:   build = git://hrt8n35.cc1.ygridcore.net/ on branch branch-0.20-security-204 -r 65e258bf0813ac2b15bb4c954660eaf9e8fba141; compiled by 'hortonow' on Thu Aug 25 23:35:31 UTC 2011

[Attachment: hadoop-hadoop-user-tasktracker-Jignesh-MacBookPro.local.out]
[Attachment: hadoop-hadoop-user-jobtracker-Jignesh-MacBookPro.local.log]
RE: Learning curve after MapReduce and HDFS
Are you learning for the sake of experimenting, or are there functional requirements driving you to dive into this space?

* If you are learning for the sake of adding new tools to your portfolio: look into high-level overviews of each of the projects and review architecture solutions that use them. Focus on how they interact, and target the ones that pique your curiosity the most.

* If you are learning the ecosystem to fulfill some customer requirements: just learn the pieces as you need them. Compare the high-level differences between the subprojects and let the requirements drive which pieces you focus on.

There are plenty of free training videos out there that cover quite a few of the pieces. I recently came across https://www.db2university.com/courses/auth/openid/login.php which has a basic set of reference materials reviewing a few of the subprojects within the ecosystem, with included labs. The Yahoo developer network and Cloudera have some great resources as well.

Any one of us could point you in a certain direction, but it is all a matter of opinion. Compare your needs with each of the subprojects and that should filter the list down to a manageable size.

Matt

-----Original Message-----
From: Varad Meru [mailto:meru.va...@gmail.com]
Sent: Friday, September 30, 2011 11:19 AM
To: common-user@hadoop.apache.org; Varad Meru
Subject: Learning curve after MapReduce and HDFS

Hi all, I have been working with Hadoop core, Hadoop HDFS and Hadoop MapReduce for the past 8 months. Now I want to learn other projects under Apache Hadoop, such as Pig, Hive, HBase ... Can you suggest a learning path for the Hadoop ecosystem in a structured manner? I am confused between so many alternatives, such as Hive vs. Jaql vs. Pig, HBase vs. Hypertable vs. Cassandra, and many other projects which are similar to each other.

Thanks in advance,
Varad

---
Varad Meru
Software Engineer
Persistent Systems and Solutions Ltd.
Re: linux containers with Hadoop
Thanks Edward. So Linux containers are mostly used with Hadoop to ensure isolation in terms of providing security across MapReduce jobs from different users (even Mesos seems to leverage the same), and not for resource fairness?

On Fri, Sep 30, 2011 at 1:39 PM, Edward Capriolo wrote:
> On Fri, Sep 30, 2011 at 9:03 AM, bikash sharma wrote:
> > Hi,
> > Does anyone know if Linux containers (which are a kernel-supported virtualization technique for providing resource isolation across processes/applications) have ever been used with Hadoop to provide resource isolation for map/reduce tasks? If yes, what would be the up/down sides of such an approach, and how feasible is it in the context of Hadoop? Any pointers, in terms of papers etc., would be useful.
> >
> > Thanks,
> > Bikash
>
> Previously Hadoop launched map/reduce tasks as a single user; now, with security, tasks can launch as different users in the same OS/VM. I would say the closest you can get to that isolation is the work done with Mesos: http://www.mesosproject.org/
hadoop monitoring
I am using Nagios to monitor a Hadoop cluster and would like to hear input from you guys. Questions:

1. Would there be any difference between monitoring TCP port 9000 versus curling port 50070 and grepping for "namenode"?
2. Job tracker: I will monitor TCP port 9001 -- any drawbacks?
3. Secondary namenode: what would be a good way to monitor it?
   - whether the process is up and running
   - whether the fsimage is outdated
4. Datanode/tasktracker: a TCP check on the port?

Input is more than welcome.

Thanks,
Silvian
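On question 1, the two NameNode checks differ in depth; a sketch of both, plus one way to catch a stale SecondaryNameNode checkpoint (hostnames and paths below are placeholders):

# TCP-level: only proves something is listening on the NameNode RPC port.
$ nc -z namenode-host 9000
# HTTP-level: proves the NameNode web UI is alive and actually rendering.
$ curl -s http://namenode-host:50070/dfshealth.jsp | grep -qi namenode
# SecondaryNameNode freshness: complain if no fsimage checkpoint landed in the last 2 hours.
$ find /var/hadoop/dfs/namesecondary/current/fsimage -mmin -120 | grep -q . || echo "checkpoint stale"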
Re: linux containers with Hadoop
On Fri, Sep 30, 2011 at 9:03 AM, bikash sharma wrote:
> Hi,
> Does anyone know if Linux containers (which are a kernel-supported virtualization technique for providing resource isolation across processes/applications) have ever been used with Hadoop to provide resource isolation for map/reduce tasks? If yes, what would be the up/down sides of such an approach, and how feasible is it in the context of Hadoop? Any pointers, in terms of papers etc., would be useful.
>
> Thanks,
> Bikash

Previously Hadoop launched map/reduce tasks as a single user; now, with security, tasks can launch as different users in the same OS/VM. I would say the closest you can get to that isolation is the work done with Mesos: http://www.mesosproject.org/
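To make the isolation idea concrete: stock Hadoop 0.20 does not set this up for you, but a hand-rolled cgroup (v1) around a task JVM would look roughly like the sketch below, assuming the memory controller is mounted at /cgroup/memory and $TASK_JVM_PID is a placeholder for a child task's process ID:

$ mkdir /cgroup/memory/mr-task
$ echo 1073741824 > /cgroup/memory/mr-task/memory.limit_in_bytes   # cap the group at 1 GB
$ echo $TASK_JVM_PID > /cgroup/memory/mr-task/tasks                # attach the task JVM to the group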
Learning curve after MapReduce and HDFS
Hi all, I have been working with Hadoop core, Hadoop HDFS and Hadoop MapReduce for the past 8 months. Now I want to learn other projects under Apache Hadoop, such as Pig, Hive, HBase ...

Can you suggest a learning path for the Hadoop ecosystem in a structured manner? I am confused between so many alternatives, such as:
- Hive vs. Jaql vs. Pig
- HBase vs. Hypertable vs. Cassandra
and many other projects which are similar to each other.

Thanks in advance,
Varad

---
Varad Meru
Software Engineer
Persistent Systems and Solutions Ltd.
Re: mapred example task failing with error 127
Thanks Harsh. I did look at the userlogs dir. Although it creates subdirs for each job/attempt, there are no files in those directories -- just the ACL xml file. I had also looked at the task tracker log, and all it has is this:

2011-09-30 15:50:05,344 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201109300014_0002_m_16_0 task's state:UNASSIGNED
2011-09-30 15:50:05,351 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201109300014_0002_m_16_0 which needs 1 slots
2011-09-30 15:50:05,351 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 2 and trying to launch attempt_201109300014_0002_m_16_0 which needs 1 slots
2011-09-30 15:50:05,478 INFO org.apache.hadoop.mapred.JobLocalizer: Initializing user ec2-user on this TT.
2011-09-30 15:50:05,846 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201109300014_0002_m_-684431586
2011-09-30 15:50:05,847 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201109300014_0002_m_-684431586 spawned.
2011-09-30 15:50:05,849 INFO org.apache.hadoop.mapred.TaskController: Writing commands to /media/ephemeral0/hadoop/mapred/local/ttprivate/taskTracker/ec2-user/jobcache/job_201109300014_0002/attempt_201109300014_0002_m_16_0/taskjvm.sh
2011-09-30 15:50:05,896 WARN org.apache.hadoop.mapred.DefaultTaskController: Exit code from task is : 127
2011-09-30 15:50:05,897 INFO org.apache.hadoop.mapred.DefaultTaskController: Output from DefaultTaskController's launchTask follows:
2011-09-30 15:50:05,897 INFO org.apache.hadoop.mapred.TaskController:
2011-09-30 15:50:05,910 INFO org.apache.hadoop.mapred.JvmManager: JVM Not killed jvm_201109300014_0002_m_-684431586 but just removed
2011-09-30 15:50:05,911 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201109300014_0002_m_-684431586 exited with exit code 127. Number of tasks it ran: 0
2011-09-30 15:50:05,913 WARN org.apache.hadoop.mapred.TaskRunner: attempt_201109300014_0002_m_16_0 : Child Error
java.io.IOException: Task process exit with nonzero status of 127.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
...

If you want the whole file, I can use pastebin -- let me know.

Thanks,
Vinod

On Thu, Sep 29, 2011 at 10:53 PM, Harsh J wrote:
> Vinod,
>
> There should be some stderr information in the task attempts' userlogs that should help point out why your task launching is failing. It is probably because of something related to the JVM launch parameters (as defined by mapred.child.java.opts).
>
> If it's not there, look into the TaskTracker logs instead to see if you can make some sense out of it. We'd be happy to look at it for you -- add it to your mail as well (paste directly or give a pastebin link - do not attach a file).
>
> On Fri, Sep 30, 2011 at 4:27 AM, Vinod Gupta Tankala wrote:
> > I just set up a pseudo-distributed hadoop installation, but when I run the example task, I get a failed child error. I see that this was posted earlier as well, but I didn't see the resolution.
> >
> > http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201108.mbox/%3cc30bf131a023ea4d976727cd4fc563fe0afbe...@corp-msg-01.pfshq.com%3E
> >
> > This is happening on an ec2 linux instance.
> > Here are the details:
> >
> > 11/09/29 22:41:02 INFO mapred.FileInputFormat: Total input paths to process : 15
> > 11/09/29 22:41:04 INFO mapred.JobClient: Running job: job_201109292240_0001
> > 11/09/29 22:41:05 INFO mapred.JobClient: map 0% reduce 0%
> > 11/09/29 22:41:13 INFO mapred.JobClient: Task Id : attempt_201109292240_0001_m_16_0, Status : FAILED
> > java.lang.Throwable: Child Error
> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
> > Caused by: java.io.IOException: Task process exit with nonzero status of 127.
> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
> >
> > 11/09/29 22:41:13 WARN mapred.JobClient: Error reading task output http://ip-10-32-61-60.ec2.internal:50060/tasklog?plaintext=true&attemptid=attempt_201109292240_0001_m_16_0&filter=stdout
> > 11/09/29 22:41:13 WARN mapred.JobClient: Error reading task output http://ip-10-32-61-60.ec2.internal:50060/tasklog?plaintext=true&attemptid=attempt_201109292240_0001_m_16_0&filter=stderr
> > 11/09/29 22:41:19 INFO mapred.JobClient: Task Id : attempt_201109292240_0001_m_16_1, Status : FAILED
> >
> > 11/09/29 22:41:55 INFO mapred.JobClient: Job complete: job_201109292240_0001
> > 11/09/29 22:41:55 INFO mapred.JobClient: Counters: 4
> > 11/09/29 22:41:55 INFO mapred.JobClient:   Job Counters
> > 11/09/29 22:41:55 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=24566
> > 11/09/29 22:41:55 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
> > 11/09/29 22:41:55 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
> > 11/09/29 22:41:55 INFO mapr
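For what it's worth, an exit status of 127 from a launcher script like taskjvm.sh conventionally means the shell could not find the command it tried to exec -- in this setting, usually a java binary that is missing or at the wrong path on the worker. A quick sanity check along these lines (the taskjvm.sh path is illustrative; take the real one from the TaskTracker log):

# Inspect the generated launcher and verify the JVM path it invokes actually exists.
$ sudo cat /media/ephemeral0/hadoop/mapred/local/ttprivate/taskTracker/ec2-user/jobcache/<job_id>/<attempt_id>/taskjvm.sh
$ which java        # is java on the PATH of the user the TaskTracker runs tasks as?
$ echo $JAVA_HOME   # does conf/hadoop-env.sh point JAVA_HOME at a real JDK?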
linux containers with Hadoop
Hi,
Does anyone know if Linux containers (which are a kernel-supported virtualization technique for providing resource isolation across processes/applications) have ever been used with Hadoop to provide resource isolation for map/reduce tasks? If yes, what would be the up/down sides of such an approach, and how feasible is it in the context of Hadoop? Any pointers, in terms of papers etc., would be useful.

Thanks,
Bikash
Re: getting the process id of mapreduce tasks
Thanks Varad.

On Wed, Sep 28, 2011 at 9:35 PM, Varad Meru wrote:
> The process IDs of each individual task can be seen using the jps and jconsole commands provided by Java.
>
> The jconsole command provides a GUI for monitoring running tasks within Java.
>
> The tasks are only visible as Java virtual machine instances in the OS's system-monitoring tools.
>
> Regards,
> Varad Meru
> ---
> Sent from my iPod
>
> On 29-Sep-2011, at 0:15, bikash sharma wrote:
> > Hi,
> > Is it possible to get the process ID of each task in a MapReduce job? When I run a MapReduce job and monitor it in Linux using ps, I just see the ID of the MapReduce job process, but not its constituent map/reduce tasks. The use case is to monitor the resource usage of each task, using the sar utility in Linux with the specific process ID of the task.
> >
> > Thanks,
> > Bikash
Re: getting the process id of mapreduce tasks
Thanks so much Harsh!

On Thu, Sep 29, 2011 at 12:42 AM, Harsh J wrote:
> Hello Bikash,
>
> The tasks run on the tasktracker, so that is where you'll need to look for the process ID -- not the JobTracker/client.
>
> Crudely speaking:
> $ ssh tasktracker01 # or whichever
> $ jps | grep Child | cut -d " " -f 1
> # And lo, PIDs to play with.
>
> On Thu, Sep 29, 2011 at 12:15 AM, bikash sharma wrote:
> > Hi,
> > Is it possible to get the process ID of each task in a MapReduce job? When I run a MapReduce job and monitor it in Linux using ps, I just see the ID of the MapReduce job process, but not its constituent map/reduce tasks. The use case is to monitor the resource usage of each task, using the sar utility in Linux with the specific process ID of the task.
> >
> > Thanks,
> > Bikash
>
> --
> Harsh J
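Building on that, for the sar-style use case in the original question: pidstat (from the same sysstat package as sar) accepts a PID where sar itself does not. A sketch, run on the tasktracker node:

# Sample CPU (-u) and memory (-r) once over 5 seconds for every child task JVM.
$ for pid in $(jps | grep Child | cut -d " " -f 1); do
>   pidstat -u -r -p "$pid" 5 1
> done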
Re: FileSystem closed
On 29/09/2011 18:02, Joey Echeverria wrote:
> Do you close your FileSystem instances at all? IIRC, the FileSystem instance you use is a singleton, and if you close it once, it's closed for everybody. My guess is you close it in your cleanup method and you have JVM reuse turned on.

I've hit this in the past. In 0.21+ you can ask for a new instance explicitly. For 0.20.20x, set "fs.hdfs.impl.disable.cache" to true in the conf, and new instances don't get cached.
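For the 0.20.20x route, the property can also be passed per job on the command line rather than in core-site.xml; a sketch, assuming the job driver goes through ToolRunner/GenericOptionsParser (jar and class names are placeholders):

# Disable the FileSystem cache for this job only, so each FileSystem.get(conf)
# hands back a private instance that is safe to close in cleanup():
$ hadoop jar my-job.jar com.example.MyDriver -D fs.hdfs.impl.disable.cache=true /input /output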