incremental loads into hadoop

2011-09-30 Thread Sam Seigal
Hi,

I am relatively new to Hadoop and was wondering how to do incremental
loads into HDFS.

I have a continuous stream of data flowing into a service which is
writing to an OLTP store. Due to the high volume of data, we cannot do
aggregations on the OLTP store, since this starts affecting the write
performance.

We would like to offload this processing into a Hadoop cluster, mainly
for doing aggregations/analytics.

The question is: how can this continuous stream of data be incrementally
loaded into Hadoop and processed?

Thank you,

Sam
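
As an aside not taken from this thread: tools such as Flume or Sqoop are often
used for exactly this kind of ingestion, but a minimal cron-style sketch of the
idea looks roughly like the following. The export script, watermark file and
HDFS paths are all placeholders, not anything mentioned in the thread.

#!/bin/sh
# Sketch: periodically export records newer than a stored watermark from the
# OLTP store, then load them into a time-partitioned HDFS directory for later
# MapReduce aggregation. export_new_rows.sh and the paths are hypothetical.
set -e
WATERMARK=$(cat /var/lib/etl/last_watermark 2>/dev/null || echo "1970-01-01 00:00:00")
NOW=$(date '+%Y-%m-%d %H:%M:%S')
PART=$(date '+%Y/%m/%d/%H')

./export_new_rows.sh "$WATERMARK" "$NOW" > /tmp/batch.tsv   # dump only new OLTP rows

hadoop fs -mkdir /data/events/$PART                         # creates parent dirs on older releases
hadoop fs -put /tmp/batch.tsv /data/events/$PART/batch-$(date +%s).tsv

echo "$NOW" > /var/lib/etl/last_watermark                   # advance the watermark only after a successful put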


October SF Hadoop Meetup

2011-09-30 Thread Aaron Kimball
The October SF Hadoop users meetup will be held Wednesday, October 12, from
7pm to 9pm. This meetup will be hosted by Twitter at their office on Folsom
St. *Please note that due to scheduling constraints, we will begin an hour
later than usual this month.*

As usual, we will use the discussion-based "unconference" format. At the
beginning of the meetup we will collaboratively construct an agenda
consisting of several discussion breakout groups. All participants may
propose a topic and volunteer to facilitate a discussion. All Hadoop-related
topics are encouraged, and all members of the Hadoop community are welcome.

Event schedule:

   - *7pm* - Welcome
   - 7:30pm - Introductions; start creating agenda
   - Breakout sessions begin as soon as we're ready
   - 9pm - Conclusion

Food and refreshments will be provided, courtesy of Twitter.

Please RSVP at http://www.meetup.com/hadoopsf/events/35650052/

Regards,

- Aaron Kimball


Re: error for deploying hadoop on macbook pro

2011-09-30 Thread Harsh J
Since you're only just beginning, and have unknowingly issued multiple
"namenode -format" commands, simply run the following and restart DN
alone:

$ rm -r /private/tmp/hadoop-hadoop-user/dfs/data

(And please do not reformat namenode, lest you go out of namespace ID
sync yet again -- You can instead `hadoop dfs -rmr /*` to rid yourself
of all HDFS files)
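
For anyone following along, the complete sequence described above is roughly
the following; it assumes the hadoop-daemon.sh script shipped with the 0.20
distribution and reuses the data-directory path quoted in this thread.

# Stop the DataNode, delete its out-of-sync storage directory, then start it
# again so it re-registers with the namenode's current namespaceID.
$ bin/hadoop-daemon.sh stop datanode
$ rm -r /private/tmp/hadoop-hadoop-user/dfs/data
$ bin/hadoop-daemon.sh start datanode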

On Sat, Oct 1, 2011 at 2:13 AM, Jignesh Patel  wrote:
> Now I am able to get the task tracker and job tracker running, but I still have
> the following problem with the datanode.
>
> ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: 
> Incompatible namespaceIDs in /private/tmp/hadoop-hadoop-user/dfs/data: 
> namenode namespaceID = 798142055; datanode namespaceID = 964022125
>
>
> On Sep 30, 2011, at 3:59 PM, Jignesh Patel wrote:
>
>>
>>
>>
>>>
>>>
>>>
>>> I am trying to set up a single-node cluster using hadoop-0.20.204.0 and while
>>> setting it up I found my job tracker and task tracker are not starting. I am
>>> attaching the exception. I also don't know why, while formatting the name
>>> node, my IP address still doesn't show 127.0.0.1, as follows.
>>>
>>> 1/09/30 15:50:36 INFO namenode.NameNode: STARTUP_MSG:
>>> /
>>> STARTUP_MSG: Starting NameNode
>>> STARTUP_MSG:   host = Jignesh-MacBookPro.local/192.168.1.120
>>> STARTUP_MSG:   args = [-format]
>>> STARTUP_MSG:   version = 0.20.204.0
>>> STARTUP_MSG:   build = git://hrt8n35.cc1.ygridcore.net/ on branch 
>>> branch-0.20-security-204 -r 65e258bf0813ac2b15bb4c954660eaf9e8fba141; 
>>> compiled by 'hortonow' on Thu Aug 25 23:35:31 UTC 2011
>>>
>> 
>> 
>>
>
>



-- 
Harsh J


Re: error for deploying hadoop on macbook pro

2011-09-30 Thread Jignesh Patel
Now I am able to get the task tracker and job tracker running, but I still have
the following problem with the datanode.

ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: 
Incompatible namespaceIDs in /private/tmp/hadoop-hadoop-user/dfs/data: namenode 
namespaceID = 798142055; datanode namespaceID = 964022125


On Sep 30, 2011, at 3:59 PM, Jignesh Patel wrote:

> 
> 
> 
>> 
>> 
>> 
>> I am trying to set up a single-node cluster using hadoop-0.20.204.0 and while
>> setting it up I found my job tracker and task tracker are not starting. I am
>> attaching the exception. I also don't know why, while formatting the name node,
>> my IP address still doesn't show 127.0.0.1, as follows.
>> 
>> 1/09/30 15:50:36 INFO namenode.NameNode: STARTUP_MSG: 
>> /
>> STARTUP_MSG: Starting NameNode
>> STARTUP_MSG:   host = Jignesh-MacBookPro.local/192.168.1.120
>> STARTUP_MSG:   args = [-format]
>> STARTUP_MSG:   version = 0.20.204.0
>> STARTUP_MSG:   build = git://hrt8n35.cc1.ygridcore.net/ on branch 
>> branch-0.20-security-204 -r 65e258bf0813ac2b15bb4c954660eaf9e8fba141; 
>> compiled by 'hortonow' on Thu Aug 25 23:35:31 UTC 2011
>> 
> 
> 
> 



Fwd: error for deploying hadoop on macbook pro

2011-09-30 Thread Jignesh Patel
I am trying to set up a single-node cluster using hadoop-0.20.204.0 and while setting it up I found my job tracker and task tracker are not starting. I am attaching the exception. I also don't know why, while formatting the name node, my IP address still doesn't show 127.0.0.1, as follows.

1/09/30 15:50:36 INFO namenode.NameNode: STARTUP_MSG:
/
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = Jignesh-MacBookPro.local/192.168.1.120
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.204.0
STARTUP_MSG:   build = git://hrt8n35.cc1.ygridcore.net/ on branch branch-0.20-security-204 -r 65e258bf0813ac2b15bb4c954660eaf9e8fba141; compiled by 'hortonow' on Thu Aug 25 23:35:31 UTC 2011

hadoop-hadoop-user-tasktracker-Jignesh-MacBookPro.local.out
Description: Binary data


hadoop-hadoop-user-jobtracker-Jignesh-MacBookPro.local.log
Description: Binary data


RE: Learning curve after MapReduce and HDFS

2011-09-30 Thread GOEKE, MATTHEW (AG/1000)
Are you learning for the sake of experimenting or are there functional 
requirements driving you to dive into this space?

*If you are learning for the sake of adding new tools to your portfolio: look
into high-level overviews of each of the projects and review architecture
solutions that use them. Focus on how they interact and target the ones that
pique your curiosity the most.

*If you are learning the ecosystem to fulfill some customer requirements then 
just learn the pieces as you need them. Compare the high level differences 
between the sub projects and let the requirements drive which pieces you focus 
on.

There are plenty of free training videos out there that go over quite a few of
the pieces. I recently came across
https://www.db2university.com/courses/auth/openid/login.php which has a basic
set of reference materials covering a few of the sub-projects within the
ecosystem, with labs included. The Yahoo developer network and Cloudera also
have great resources.

Any one of us could point you in a certain direction, but it is all a matter of
opinion. Compare your needs against each of the sub-projects and that should
filter the list down to a manageable size.

Matt
-Original Message-
From: Varad Meru [mailto:meru.va...@gmail.com] 
Sent: Friday, September 30, 2011 11:19 AM
To: common-user@hadoop.apache.org; Varad Meru
Subject: Learning curve after MapReduce and HDFS

Hi all,

I have been working with Hadoop core, Hadoop HDFS and Hadoop MapReduce for the 
past 8 months. 

Now I want to learn other projects under Apache Hadoop such as Pig, Hive, HBase 
...

Can you suggest a learning path for learning about the Hadoop ecosystem in a
structured manner?
I am confused by the many alternatives, such as
Hive vs Jaql vs Pig
HBase vs Hypertable vs Cassandra
and many other projects which are similar to each other.

Thanks in advance,
Varad


---
Varad Meru
Software Engineer
Persistent Systems and Solutions Ltd. 



Re: linux containers with Hadoop

2011-09-30 Thread bikash sharma
Thanks Edward. So Linux containers are mostly used with Hadoop to provide
security isolation between MapReduce jobs from different users (even Mesos
seems to leverage the same mechanism), rather than for resource fairness?

On Fri, Sep 30, 2011 at 1:39 PM, Edward Capriolo wrote:

> On Fri, Sep 30, 2011 at 9:03 AM, bikash sharma wrote:
>
> > Hi,
> > Does anyone know if Linux containers (which are a kernel-supported
> > virtualization technique for providing resource isolation across
> > processes/applications) have ever been used with Hadoop to provide resource
> > isolation for map/reduce tasks?
> > If yes, what could be the up/down sides of such an approach and how
> > feasible it is in the context of Hadoop?
> > Any pointers, in terms of papers etc., would be useful.
> >
> > Thanks,
> > Bikash
> >
>
> Previously Hadoop launched map/reduce tasks as a single user; now, with
> security enabled, tasks can launch as different users in the same OS/VM. I
> would say the closest you can get to that kind of isolation is the work done
> with Mesos: http://www.mesosproject.org/
>


hadoop monitoring

2011-09-30 Thread patrick sang
I am using Nagios to monitor a Hadoop cluster and would like to hear input
from you.

Questions

1. Is there any difference between monitoring TCP port 9000 versus curling
port 50070 and grepping for "namenode"?

2. For the JobTracker I will monitor TCP port 9001; any drawbacks?

3. SecondaryNameNode: what would be a good way to monitor it?
- whether the process is up and running
- whether the fsimage is out of date
Input is more than welcome.

4. DataNode/TaskTracker
- a TCP port check?


Thanks
Silvian
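
Not part of the original mail, but a rough sketch of the check in question 1:
a plain TCP probe of port 9000 only tells you the RPC port is bound, whereas
fetching the web UI on 50070 confirms the NameNode is actually serving status.
A minimal Nagios-style plugin might look like this (the host name and the
string matched are assumptions; adjust for your cluster):

#!/bin/sh
# Minimal Nagios-style check: fetch the NameNode web UI (port 50070) and
# grep the page for "namenode". HOST defaults to a placeholder name.
HOST=${1:-namenode.example.com}

if curl -s --max-time 10 "http://$HOST:50070/dfshealth.jsp" | grep -qi "namenode"; then
    echo "OK - NameNode web UI responding on $HOST:50070"
    exit 0
else
    echo "CRITICAL - no NameNode response from $HOST:50070"
    exit 2
fi

The same pattern works for the JobTracker UI on 50030 and the DataNode and
TaskTracker UIs on 50075 and 50060.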


Re: linux containers with Hadoop

2011-09-30 Thread Edward Capriolo
On Fri, Sep 30, 2011 at 9:03 AM, bikash sharma wrote:

> Hi,
> Does anyone knows if Linux containers (which are like kernel supported
> virtualization technique for providing resource isolation across
> process/appication) have ever been used with Hadoop to provide resource
> isolation for map/reduce tasks?
> If yes, what could be the up/down sides of such approach and how feasible
> it
> is in the context of Hadoop?
> Any pointers if any in terms of papers, etc would be useful.
>
> Thanks,
> Bikash
>

Previously Hadoop launched map/reduce tasks as a single user; now, with
security enabled, tasks can launch as different users in the same OS/VM. I
would say the closest you can get to that kind of isolation is the work done
with Mesos: http://www.mesosproject.org/


Learning curve after MapReduce and HDFS

2011-09-30 Thread Varad Meru
Hi all,

I have been working with Hadoop core, Hadoop HDFS and Hadoop MapReduce for the 
past 8 months. 

Now I want to learn other projects under Apache Hadoop such as Pig, Hive, HBase 
...

Can you suggest a learning path for learning about the Hadoop ecosystem in a
structured manner?
I am confused by the many alternatives, such as
Hive vs Jaql vs Pig
HBase vs Hypertable vs Cassandra
and many other projects which are similar to each other.

Thanks in advance,
Varad


---
Varad Meru
Software Engineer
Persistent Systems and Solutions Ltd. 

Re: mapred example task failing with error 127

2011-09-30 Thread Vinod Gupta Tankala
Thanks Harsh.
I did look at the userlogs dir. Although it creates subdirs for each
job/attempt, there are no files in those directories, just the ACL XML file.
I had also looked at the task tracker log, and all it has is this -

2011-09-30 15:50:05,344 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201109300014_0002_m_16_0 task's state:UNASSIGNED
2011-09-30 15:50:05,351 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201109300014_0002_m_16_0 which needs 1 slots
2011-09-30 15:50:05,351 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 2 and trying to launch attempt_201109300014_0002_m_16_0 which needs 1 slots
2011-09-30 15:50:05,478 INFO org.apache.hadoop.mapred.JobLocalizer: Initializing user ec2-user on this TT.
2011-09-30 15:50:05,846 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201109300014_0002_m_-684431586
2011-09-30 15:50:05,847 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201109300014_0002_m_-684431586 spawned.
2011-09-30 15:50:05,849 INFO org.apache.hadoop.mapred.TaskController: Writing commands to /media/ephemeral0/hadoop/mapred/local/ttprivate/taskTracker/ec2-user/jobcache/job_201109300014_0002/attempt_201109300014_0002_m_16_0/taskjvm.sh
2011-09-30 15:50:05,896 WARN org.apache.hadoop.mapred.DefaultTaskController: Exit code from task is : 127
2011-09-30 15:50:05,897 INFO org.apache.hadoop.mapred.DefaultTaskController: Output from DefaultTaskController's launchTask follows:
2011-09-30 15:50:05,897 INFO org.apache.hadoop.mapred.TaskController:
2011-09-30 15:50:05,910 INFO org.apache.hadoop.mapred.JvmManager: JVM Not killed jvm_201109300014_0002_m_-684431586 but just removed
2011-09-30 15:50:05,911 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201109300014_0002_m_-684431586 exited with exit code 127. Number of tasks it ran: 0
2011-09-30 15:50:05,913 WARN org.apache.hadoop.mapred.TaskRunner: attempt_201109300014_0002_m_16_0 : Child Error
java.io.IOException: Task process exit with nonzero status of 127.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
...

if you want the whole file, i can use pastebin. let me know

thanks
vinod
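
A general note, not from the thread itself: an exit status of 127 from a shell
conventionally means "command not found", so it is worth checking whether the
generated taskjvm.sh can actually find the java binary. A rough inspection
sketch, reusing the paths from the log above (the userlogs location assumes the
default hadoop.log.dir):

# Show the launch script the TaskTracker generated for the failing attempt
# (path copied from the log above) and see what command it tries to run.
$ cat /media/ephemeral0/hadoop/mapred/local/ttprivate/taskTracker/ec2-user/jobcache/job_201109300014_0002/attempt_201109300014_0002_m_16_0/taskjvm.sh

# Confirm the JVM the TaskTracker expects (set in conf/hadoop-env.sh) exists.
$ echo $JAVA_HOME
$ ls -l $JAVA_HOME/bin/java

# Any stderr the attempt managed to write ends up under the userlogs dir.
$ find $HADOOP_HOME/logs/userlogs -name stderr -exec ls -l {} \;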


On Thu, Sep 29, 2011 at 10:53 PM, Harsh J  wrote:

> Vinod,
>
> There should be some stderr information on the task attempts' userlogs
> that should help point out why your task launching is failing. It is
> probably cause of something related to the JVM launch parameters (as
> defined by mapred.child.java.opts).
>
> If not there, look into the TaskTracker logs instead to see if you can
> make some sense out of it. We'd be happy to look at it for you add it
> to your mail as well (paste direct or pastebin link - do not attach a
> file).
>
> On Fri, Sep 30, 2011 at 4:27 AM, Vinod Gupta Tankala
>  wrote:
> > I just setup a pseudo-distributed hadoop setup. but when i run the
> example
> > task, i get failed child error. I see that this was posted earlier as
> well
> > but I didn't see the resolution.
> >
> >
> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201108.mbox/%3cc30bf131a023ea4d976727cd4fc563fe0afbe...@corp-msg-01.pfshq.com%3E
> >
> > this is happening on a ec2 linux instance. here are the details -
> >
> > 11/09/29 22:41:02 INFO mapred.FileInputFormat: Total input paths to
> process
> > : 15
> > 11/09/29 22:41:04 INFO mapred.JobClient: Running job:
> job_201109292240_0001
> > 11/09/29 22:41:05 INFO mapred.JobClient:  map 0% reduce 0%
> > 11/09/29 22:41:13 INFO mapred.JobClient: Task Id :
> > attempt_201109292240_0001_m_16_0, Status : FAILED
> > java.lang.Throwable: Child Error
> >at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
> > Caused by: java.io.IOException: Task process exit with nonzero status of
> > 127.
> >at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
> >
> > 11/09/29 22:41:13 WARN mapred.JobClient: Error reading task
> >
> outputhttp://ip-10-32-61-60.ec2.internal:50060/tasklog?plaintext=true&attemptid=attempt_201109292240_0001_m_16_0&filter=stdout
> > 11/09/29 22:41:13 WARN mapred.JobClient: Error reading task
> >
> outputhttp://ip-10-32-61-60.ec2.internal:50060/tasklog?plaintext=true&attemptid=attempt_201109292240_0001_m_16_0&filter=stderr
> > 11/09/29 22:41:19 INFO mapred.JobClient: Task Id :
> > attempt_201109292240_0001_m_16_1, Status : FAILED
> > 
> > 11/09/29 22:41:55 INFO mapred.JobClient: Job complete:
> job_201109292240_0001
> > 11/09/29 22:41:55 INFO mapred.JobClient: Counters: 4
> > 11/09/29 22:41:55 INFO mapred.JobClient:   Job Counters
> > 11/09/29 22:41:55 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=24566
> > 11/09/29 22:41:55 INFO mapred.JobClient: Total time spent by all
> reduces
> > waiting after reserving slots (ms)=0
> > 11/09/29 22:41:55 INFO mapred.JobClient: Total time spent by all maps
> > waiting after reserving slots (ms)=0
> > 11/09/29 22:41:55 INFO mapr

linux containers with Hadoop

2011-09-30 Thread bikash sharma
Hi,
Does anyone know if Linux containers (which are a kernel-supported
virtualization technique for providing resource isolation across
processes/applications) have ever been used with Hadoop to provide resource
isolation for map/reduce tasks?
If yes, what could be the up/down sides of such an approach and how feasible it
is in the context of Hadoop?
Any pointers, in terms of papers etc., would be useful.

Thanks,
Bikash


Re: getting the process id of mapreduce tasks

2011-09-30 Thread bikash sharma
Thanks Varad.

On Wed, Sep 28, 2011 at 9:35 PM, Varad Meru  wrote:

> The process ids of each individual task can be seen using jps and jconsole
> commands provided by java.
>
> jconsole command on command-line interface provides a GUI screen for
> monitoring running tasks within java.
>
> The tasks are only visible as java virtual machine instance in the os
> system monitoring tool.
>
>
> Regards,
> Varad Meru
> ---
> Sent from my iPod
>
> On 29-Sep-2011, at 0:15, bikash sharma  wrote:
>
> > Hi,
> > Is it possible to get the process id of each task in a MapReduce job?
> > When I run a mapreduce job and do a monitoring in linux using ps, i just
> see
> > the id of the mapreduce job process but not its constituent map/reduce
> > tasks.
> > The use case is to monitor the resource usage of each task by using sar
> > utility in linux with specific process id of task.
> >
> > Thanks,
> > Bikash
>


Re: getting the process id of mapreduce tasks

2011-09-30 Thread bikash sharma
Thanks so much Harsh!

On Thu, Sep 29, 2011 at 12:42 AM, Harsh J  wrote:

> Hello Bikash,
>
> The tasks run on the tasktracker, so that is where you'll need to look
> for the process ID -- not the JobTracker/client.
>
> Crudely speaking,
> $ ssh tasktracker01 # or whichever.
> $ jps | grep Child | cut -d " " -f 1
> # And lo, PIDs to play with.
>
> On Thu, Sep 29, 2011 at 12:15 AM, bikash sharma 
> wrote:
> > Hi,
> > Is it possible to get the process id of each task in a MapReduce job?
> > When I run a mapreduce job and do a monitoring in linux using ps, i just
> see
> > the id of the mapreduce job process but not its constituent map/reduce
> > tasks.
> > The use case is to monitor the resource usage of each task by using sar
> > utility in linux with specific process id of task.
> >
> > Thanks,
> > Bikash
> >
>
>
>
> --
> Harsh J
>
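
As a small follow-up sketch for the sar use case mentioned earlier (not from
the thread): sar reports system-wide figures, so per-process sampling is easier
with pidstat from the same sysstat package, assuming it is installed on the
tasktracker node:

# On the tasktracker: sample CPU and memory for each running Child task JVM.
# Assumes the sysstat package provides pidstat; interval/count are arbitrary.
for pid in $(jps | grep Child | cut -d " " -f 1); do
  echo "=== task JVM $pid ==="
  pidstat -u -r -p "$pid" 5 3    # CPU (-u) and memory (-r): 3 samples, 5 seconds apart
done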


Re: FileSystem closed

2011-09-30 Thread Steve Loughran

On 29/09/2011 18:02, Joey Echeverria wrote:

Do you close your FileSystem instances at all? IIRC, the FileSystem
instance you use is a singleton and if you close it once, it's closed
for everybody. My guess is you close it in your cleanup method and you
have JVM reuse turned on.



I've hit this in the past. In 0.21+ you can ask for a new instance explicitly.


For 0.20.20x, set "fs.hdfs.impl.disable.cache" to true in the conf, and 
new instances don't get cached.
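
A footnote not from the thread: on 0.20.20x the same property can also be
passed per job from the command line, provided the job driver runs through
ToolRunner/GenericOptionsParser so that -D options are honoured. The jar and
driver class below are placeholders:

# Disable the client-side FileSystem cache for one job submission so that a
# close() in one task does not pull the cached instance out from under others.
# myjob.jar and com.example.MyDriver are hypothetical.
hadoop jar myjob.jar com.example.MyDriver \
  -Dfs.hdfs.impl.disable.cache=true \
  /input/path /output/path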