Files in Hadoop.

2013-03-04 Thread Sai Sai
Just wondering: after we put a file in Hadoop for running MR jobs and are done with it, is it standard practice to delete it or just leave it there? Just wondering what others do. Any input will be appreciated. Thanks, Sai

Re: Files in Hadoop.

2013-03-04 Thread Jagat Singh
Hi, it's up to what you want to do with the data. Most people keep the data there, and many others use it and run various types of queries on it. Many generate temporary data and delete it after running their analysis. There is no single answer to this; it's whatever you need to do. Regards, Jagat Singh On Mon, Mar 4,
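
For the delete-after-use case, removing a job's input is a single call on the Hadoop FileSystem API. A minimal sketch in Java; the path here is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CleanupInput {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Remove the (hypothetical) input directory once the job's
            // output has been verified; 'true' means delete recursively.
            boolean deleted = fs.delete(new Path("/user/sai/job-input"), true);
            System.out.println("deleted: " + deleted);
        }
    }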

Re: Unknown processes unable to terminate

2013-03-04 Thread Sai Sai
I have the list of processes given below. I am trying to kill process 13082 using: kill 13082. It is not terminating RunJar. I have run stop-all.sh hoping it would stop all the processes, but it only stopped the Hadoop-related processes. I am just wondering if it is necessary to stop

Re: Unknown processes unable to terminate

2013-03-04 Thread shashwat shriparv
You can use kill -9 13082. Is there an Eclipse or NetBeans project running? That may be this process. ∞ Shashwat Shriparv On Mon, Mar 4, 2013 at 3:12 PM, Sai Sai saigr...@yahoo.in wrote: I have a list of following processes given below, i am trying to kill the process 13082 using: kill

Re: can someone help me how to disable the log info in the terminal when typing the command bin/yarn node in YARN

2013-03-04 Thread Joshi, Rekha
I have almost never silenced the logs on the terminal, only tuned config for the path/retention period of logs. So, just off the top of my mind: -S/--silent for no logs and -V/--verbose for maximum logs usually works on executables, and --help will confirm if it is possible. If it doesn't work, well, it should :-) Thanks

Re: Unknown processes unable to terminate

2013-03-04 Thread Jean-Marc Spaggiari
Hi Sai, Are you fine with killing all those processes on this machine? If you need ALL those processes to be killed, and if they are all Java processes, you can use killall -9 java. That will kill ALL the Java processes under this user. JM 2013/3/4 shashwat shriparv dwivedishash...@gmail.com: You can you

Re: Accumulo and Mapreduce

2013-03-04 Thread Russell Jurney
http://svn.apache.org/repos/asf/accumulo/contrib/pig/trunk/src/main/java/org/apache/accumulo/pig/AccumuloStorage.java AccumuloStorage for Pig comes with Accumulo. The easiest way would be to try it. Russell Jurney http://datasyndrome.com On Mar 4, 2013, at 5:30 AM, Aji Janis aji1...@gmail.com

Re: mapper combiner and partitioner for particular dataset

2013-03-04 Thread Vikas Jadhav
Thank you for the reply. Can you please elaborate? I am not getting what the following means in a programming environment: you will need a custom written high-level partitioner and combiner that can create multiple instances of sub-partitioners/combiners and use the most likely one based on their
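
One possible reading of that advice, sketched in Java: a top-level Partitioner that holds sub-strategies and delegates per key. The key types and the routing rule below are invented purely for illustration:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

    public class DelegatingPartitioner extends Partitioner<Text, IntWritable> {
        // Sub-partitioner instance that handles the common case.
        private final HashPartitioner<Text, IntWritable> hash =
            new HashPartitioner<Text, IntWritable>();

        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Made-up rule: special keys go to a dedicated partition,
            // everything else falls through to ordinary hash partitioning.
            if (key.toString().startsWith("#")) {
                return numPartitions - 1;
            }
            return hash.getPartition(key, value, numPartitions);
        }
    }

It would be enabled with job.setPartitionerClass(DelegatingPartitioner.class); a combiner could delegate to sub-combiners analogously.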

Best Practice: How to start and shutdown a complete cluster or adding nodes when needed (Automated with Java API or Rest) (On EC2)

2013-03-04 Thread Christian Schneider
Hi, what is the best way to realize this? In our current scenario we need the cluster only for some overnight processing, so it would be good to shut down the cluster after that and store the results on S3. Could you suggest some libraries or services for that, like Whirr? Or is the
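
If Whirr is used, the usual pattern (to the best of my knowledge; all names and sizes below are illustrative, not a recommendation) is a properties file driving the whirr CLI:

    # hadoop.properties (illustrative values)
    whirr.cluster-name=overnight-batch
    whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,5 hadoop-datanode+hadoop-tasktracker
    whirr.provider=aws-ec2
    whirr.identity=${env:AWS_ACCESS_KEY_ID}
    whirr.credential=${env:AWS_SECRET_ACCESS_KEY}

    # launch, run the overnight jobs, copy results to S3, tear down
    whirr launch-cluster --config hadoop.properties
    whirr destroy-cluster --config hadoop.properties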

Re: Best Practice: How to start and shutdown a complete cluster or adding nodes when needed (Automated with Java API or Rest) (On EC2)

2013-03-04 Thread John Conwell
It depends on a couple of factors. First, are you developing a product where customers will need the freedom to choose which cloud provider to use, or something in-house where you can standardize on one cloud provider (like AWS)? And second, do you only need to spin up Hadoop resources? Or do you

Re: map reduce and sync

2013-03-04 Thread Lucas Bernardi
OK, so I found a workaround for this issue; I share it here for others. The key problem is that Hadoop won't update the file size until the file is closed, so FileInputFormat will see never-closed files as empty files and generate no splits for the MapReduce process. To fix this problem
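
One shape such a workaround can take, as a sketch only (not necessarily Lucas's exact code; the one-split-per-file simplification and the scan-to-EOF length check are assumptions): an input format that measures how much of each file is actually readable instead of trusting the reported length.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Sketch: use each file's *readable* length instead of the reported
    // length, so files still open for append/sync are not seen as empty.
    public class SyncAwareTextInputFormat extends TextInputFormat {

        @Override
        public List<InputSplit> getSplits(JobContext job) throws IOException {
            List<InputSplit> splits = new ArrayList<InputSplit>();
            for (FileStatus status : listStatus(job)) {
                Path path = status.getPath();
                FileSystem fs = path.getFileSystem(job.getConfiguration());
                long length = readableLength(fs, path);
                if (length > 0) {
                    // One split per file for simplicity; no block splitting.
                    splits.add(new FileSplit(path, 0, length, null));
                }
            }
            return splits;
        }

        // getLen() may report 0 for a file that was sync'ed but never
        // closed, so count how many bytes can actually be read.
        private static long readableLength(FileSystem fs, Path path)
                throws IOException {
            FSDataInputStream in = fs.open(path);
            try {
                byte[] buf = new byte[64 * 1024];
                long total = 0;
                int n;
                while ((n = in.read(buf)) > 0) {
                    total += n;
                }
                return total;
            } finally {
                in.close();
            }
        }
    }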

Re: Accumulo and Mapreduce

2013-03-04 Thread Aji Janis
Russell, thanks for the link. I am interested in finding a solution (if one is out there) where Mapper1 outputs a custom object and Mapper2 can use that as input. One way to do this, obviously, is by writing to Accumulo in my case. But is there another solution for this: List<MyObject> Input to Job

Re: Accumulo and Mapreduce

2013-03-04 Thread Justin Woody
Aji, Why don't you just chain the jobs together? http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining Justin On Mon, Mar 4, 2013 at 11:11 AM, Aji Janis aji1...@gmail.com wrote: Russell thanks for the link. I am interested in finding a solution (if out there) where Mapper1 outputs
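
The simplest form of chaining is just running the jobs back to back, with job 1's output directory serving as job 2's input. A minimal driver sketch; the class names and paths are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TwoStepDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path input = new Path("/data/in");          // placeholder
            Path intermediate = new Path("/data/tmp");  // placeholder
            Path output = new Path("/data/out");        // placeholder

            Job job1 = new Job(conf, "step 1");
            job1.setJarByClass(TwoStepDriver.class);
            // job1.setMapperClass(Mapper1.class); ... (hypothetical classes)
            FileInputFormat.addInputPath(job1, input);
            FileOutputFormat.setOutputPath(job1, intermediate);
            if (!job1.waitForCompletion(true)) {
                System.exit(1); // don't start step 2 if step 1 failed
            }

            Job job2 = new Job(conf, "step 2");
            job2.setJarByClass(TwoStepDriver.class);
            // job2.setMapperClass(Mapper2.class); ... (hypothetical classes)
            FileInputFormat.addInputPath(job2, intermediate); // step 1's output
            FileOutputFormat.setOutputPath(job2, output);
            System.exit(job2.waitForCompletion(true) ? 0 : 1);
        }
    }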

Re: Accumulo and Mapreduce

2013-03-04 Thread Sandy Ryza
Hi Aji, Oozie is a mature project for managing MapReduce workflows. http://oozie.apache.org/ -Sandy On Mon, Mar 4, 2013 at 8:17 AM, Justin Woody justin.wo...@gmail.com wrote: Aji, Why don't you just chain the jobs together? http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining

RE: Unknown processes unable to terminate

2013-03-04 Thread Leo Leung
Hi Sai, The RunJar process is normally the result of someone or something running "hadoop jar something" (i.e., org.apache.hadoop.util.RunJar something). You probably want to find out who/what is running it, with more detailed info, via ps -ef | grep RunJar. stop|start-all.sh deals with

Oozie Meetup on March 12

2013-03-04 Thread Robert Kanter
Hi everyone, I apologize for cross-posting this. Cloudera will be hosting an Oozie meetup on March 12 from 2:30pm to 5:00pm in our Palo Alto office. Please join us to meet fellow Oozie users and developers and have some free food. Interested users from other projects are welcome to join us too.

Re: Accumulo and Mapreduce

2013-03-04 Thread Russell Jurney
You can chain MR jobs with Oozie, but I would suggest using Cascading, Pig, or Hive. You can do this in a couple of lines of code, I suspect. Two MapReduce jobs should not pose any kind of challenge with the right tools. On Monday, March 4, 2013, Sandy Ryza wrote: Hi Aji, Oozie is a mature project

dfs.datanode.du.reserved

2013-03-04 Thread John Meza
The parameter dfs.datanode.du.reserved is used to reserve disk space PER datanode. Is it possible to reserve a different amount of disk space per DISK? Thanks, John
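
For reference, the per-datanode reservation goes in hdfs-site.xml; the value below (10 GB in bytes) is only an example:

    <property>
      <name>dfs.datanode.du.reserved</name>
      <value>10737418240</value>
      <description>Reserved space in bytes per volume; always leave this
      much space free for non-DFS use.</description>
    </property>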

Re: Accumulo and Mapreduce

2013-03-04 Thread Ted Dunning
Chaining the jobs is a fantastically inefficient solution. If you use Pig or Cascading, the optimizer will glue all of your map functions into a single mapper. The result is something like: (mapper1 -> mapper2 -> mapper3) => reducer Here the parentheses indicate that all of the map functions

Re: dfs.datanode.du.reserved

2013-03-04 Thread Ellis Miller
It is possible to reserve 0, from various testing I have done, yet that could cause the obvious side effect of achieving zero free disk space :) I have only tested in a development environment, however. Yet there are various tuning white papers and other benchmarks where the very same has been tested. Thanks. On

RE: dfs.datanode.du.reserved

2013-03-04 Thread John Meza
I'm probably not being clear. This seems to describe it: dfs.datanode.du.reserved configured per-volume: https://issues.apache.org/jira/browse/HDFS-1564 Thanks, John From: outlaw...@gmail.com Date: Mon, 4 Mar 2013 15:37:36 -0500 Subject: Re: dfs.datanode.du.reserved To: user@hadoop.apache.org

Re: Accumulo and Mapreduce

2013-03-04 Thread Aji Janis
Based on earlier discussions, I was considering using JobControl or ChainMapper to do this. But as a few of you mentioned, Pig, Cascading, or Oozie might be better. So what are the use cases for them? How do I decide which one works best for what? Thank you all for your feedback. On Mon, Mar

Re: Accumulo and Mapreduce

2013-03-04 Thread Nick Dimiduk
As Ted said, my first choice would be Cascading. Second choice would be ChainMapper. As you'll see in those search results [0], it's not available in the modern mapreduce API consistently across Hadoop releases. If you've already implemented this against the mapred API, go for ChainReducer. If you
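
For reference, the mapred-API chain setup looks roughly like this. AMap and BMap are hypothetical implementations of the old org.apache.hadoop.mapred.Mapper interface; the addMapper call is from org.apache.hadoop.mapred.lib.ChainMapper:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.ChainMapper;

    public class ChainDriver {
        // Sketch: run two mappers back to back inside a single map task.
        public static JobConf buildJob() {
            JobConf job = new JobConf(ChainDriver.class);
            ChainMapper.addMapper(job, AMap.class,
                LongWritable.class, Text.class, // input key/value of AMap
                Text.class, Text.class,         // output key/value of AMap
                true, new JobConf(false));      // by value; per-mapper conf
            ChainMapper.addMapper(job, BMap.class,
                Text.class, Text.class,         // must match AMap's output
                Text.class, Text.class,
                true, new JobConf(false));
            return job;
        }
    }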

Best Practices: mapred.job.tracker.handler.count, dfs.namenode.handler.count

2013-03-04 Thread Alex Bohr
Hi, I'm looking for some feedback on how to decide how many handler threads to assign to the NameNode and JobTracker. I currently have 24 datanodes (running CDH3) and am finding a lot of varying advice on how to set these properties and change them as the cluster grows. Some (older) documentation (*
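
For reference, the two properties from the subject line live in hdfs-site.xml and mapred-site.xml respectively; the values below are placeholders to show the shape, not a sizing recommendation:

    <!-- hdfs-site.xml -->
    <property>
      <name>dfs.namenode.handler.count</name>
      <value>32</value>
    </property>

    <!-- mapred-site.xml -->
    <property>
      <name>mapred.job.tracker.handler.count</name>
      <value>32</value>
    </property>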

Hadoop file system

2013-03-04 Thread AMARNATH, Balachandar
Hi, I am new to HDFS. In my Java application, I need to perform a 'similar operation' over a large number of files. I would like to store those files on distributed machines. I don't think I will need the MapReduce paradigm, but I would like to use HDFS for file storage and access. Is it
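
For reference, plain HDFS access from a Java application, with no MapReduce involved, looks roughly like this; the namenode URI and paths are placeholders:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsIoExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder URI; usually picked up from core-site.xml.
            FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode:8020"), conf);

            // Write a file.
            FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"));
            out.writeBytes("hello hdfs\n");
            out.close();

            // Read it back.
            FSDataInputStream in = fs.open(new Path("/tmp/example.txt"));
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            System.out.println(reader.readLine());
            reader.close();
        }
    }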

What is the difference between delegation token and Job token in Hadoop?

2013-03-04 Thread rohit sarewar
A delegation token and a job token seem similar to me. I need to understand the exact difference between them. What is the difference between a delegation token and a job token in Hadoop?

Re: Need help optimizing reducer

2013-03-04 Thread samir das mohapatra
Austin, I think you have to use a partitioner to spawn more than one reducer for a small data set. The default partitioner will allow you only one reducer; you have to override it and implement your own logic to spawn more than one reducer. On Tue, Mar 5, 2013 at 1:27 AM, Austin Chungath

Re: Need help optimizing reducer

2013-03-04 Thread Ajay Srivastava
Are you using a combiner? If not, that will be the first thing to do. On 05-Mar-2013, at 1:27 AM, Austin Chungath wrote: Hi all, I have 1 reducer and I have around 600 thousand unique keys coming to it. The total data is only around 30 MB. My logic doesn't allow me to have more than 1
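
Wiring in a combiner is one line on the job driver, provided the reduce logic is associative and commutative (a sum is; an average is not). A sketch with a hypothetical word-count-style reducer doubling as the combiner:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch: a summing reducer that is safe to reuse as a combiner,
    // because partial sums can themselves be summed.
    public class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // In the driver:
    //   job.setCombinerClass(SumReducer.class);
    //   job.setReducerClass(SumReducer.class);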