Re: Need help optimizing reducer

2013-03-04 Thread Ajay Srivastava
Are you using a combiner? If not, that will be the first thing to do. On 05-Mar-2013, at 1:27 AM, Austin Chungath wrote: > Hi all, > > I have 1 reducer and I have around 600 thousand unique keys coming to it. The > total data is only around 30 mb. > My logic doesn't allow me to have more than 1 red
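For illustration, a minimal sketch of wiring a combiner into a job with the new mapreduce API; TokenMapper is a hypothetical mapper emitting (Text, IntWritable) pairs, and reusing the reducer as the combiner is only safe when the reduce logic is associative and commutative, as a sum is:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombinerExample {
      // Sum-style reduce logic: associative and commutative, so the same
      // class is safe to run as a combiner on the map side.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
            Context ctx) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          ctx.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "combiner-example");
        job.setJarByClass(CombinerExample.class);
        job.setMapperClass(TokenMapper.class); // hypothetical mapper
        // The combiner pre-aggregates inside each map task, so the single
        // reducer sees far less than the full map output.
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }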

Re: Need help optimizing reducer

2013-03-04 Thread samir das mohapatra
Austin, I think you have to use a partitioner to spawn more than one reducer for a small data set. The default Partitioner will allow you only one reducer; you have to override it and implement your own logic to spawn more than one reducer. On Tue, Mar 5, 2013 at 1:27 AM, Austin Chungath wrote: > H
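As a sketch of the override described above (class name hypothetical); note that the reducer count itself comes from job.setNumReduceTasks(), and the partitioner decides which of those reducers each key goes to:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Spread keys across reducers by a hash of the key; the routing rule
    // below is where custom per-dataset logic would go.
    public class CustomPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

Wired up with job.setPartitionerClass(CustomPartitioner.class) and, for example, job.setNumReduceTasks(4).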

What is the difference between delegation token and Job token in Hadoop?

2013-03-04 Thread rohit sarewar
Delegation tokens and job tokens seem similar to me. I need to understand the exact difference between them. What is the difference between a delegation token and a job token in Hadoop?

Hadoop file system

2013-03-04 Thread AMARNATH, Balachandar
Hi, I am new to HDFS. In my Java application, I need to perform a 'similar operation' over a large number of files. I would like to store those files on distributed machines. I don't think I will need the MapReduce paradigm, but I would like to use HDFS for file storage and access. Is it pos
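HDFS works fine as plain storage through the Java FileSystem API, with no MapReduce involved; a minimal sketch, assuming core-site.xml is on the classpath and using a hypothetical path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsIoExample {
      public static void main(String[] args) throws Exception {
        // Picks up the namenode address from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a file into HDFS (path hypothetical).
        Path path = new Path("/data/example.txt");
        FSDataOutputStream out = fs.create(path);
        out.writeUTF("hello hdfs");
        out.close();

        // Read it back.
        FSDataInputStream in = fs.open(path);
        System.out.println(in.readUTF());
        in.close();
      }
    }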

Best Practices: mapred.job.tracker.handler.count, dfs.namenode.handler.count

2013-03-04 Thread Alex Bohr
Hi, I'm looking for some feedback on how to decide how many threads to assign to the Namenode and Jobtracker. I currently have 24 data nodes (running CDH3) and am finding a lot of varying advice on how to set these properties and change them as the cluster grows. Some (older) documentation (* http:/
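Both settings are server-side (hdfs-site.xml and mapred-site.xml respectively). One rule of thumb that circulates for the namenode, 20 x ln(cluster size), is an assumption to start tuning from, not an official formula; a quick sketch of the arithmetic:

    public class HandlerCountHeuristic {
      public static void main(String[] args) {
        int dataNodes = 24;
        // Commonly cited starting point (an assumption, not official
        // guidance): 20 * ln(cluster size). For 24 nodes this lands
        // around 64.
        long handlers = Math.round(20 * Math.log(dataNodes));
        System.out.println("dfs.namenode.handler.count ~= " + handlers);
      }
    }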

Re: Accumulo and Mapreduce

2013-03-04 Thread Nick Dimiduk
As Ted said, my first choice would be Cascading. Second choice would be ChainMapper. As you'll see in those search results [0], it's not available in the "modern" mapreduce API consistently across Hadoop releases. If you've already implemented this against the mapred API, go with ChainReducer. If yo
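A minimal sketch of the mapred-API chain mentioned here; AMap, BMap and CReduce are hypothetical classes implementing the old org.apache.hadoop.mapred.Mapper/Reducer interfaces with the listed key/value types:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.ChainMapper;
    import org.apache.hadoop.mapred.lib.ChainReducer;

    public class ChainExample {
      public static void main(String[] args) {
        JobConf job = new JobConf(ChainExample.class);

        // Two map steps run back to back inside a single map task.
        ChainMapper.addMapper(job, AMap.class,
            LongWritable.class, Text.class, Text.class, Text.class,
            true, new JobConf(false));
        ChainMapper.addMapper(job, BMap.class,
            Text.class, Text.class, Text.class, Text.class,
            true, new JobConf(false));

        // A single reduce step closes the chain.
        ChainReducer.setReducer(job, CReduce.class,
            Text.class, Text.class, Text.class, Text.class,
            true, new JobConf(false));
      }
    }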

Re: Accumulo and Mapreduce

2013-03-04 Thread Aji Janis
I was considering, based on earlier discussions, using a JobController or ChainMapper to do this. But as a few of you mentioned, Pig, Cascading or Oozie might be better. So what are the use cases for them? How do I decide which one works best for what? Thank you all for your feedback. On Mon, Mar

RE: dfs.datanode.du.reserved

2013-03-04 Thread John Meza
I'm probably not being clear. This seems to describe it: dfs.datanode.du.reserved configured per-volume. https://issues.apache.org/jira/browse/HDFS-1564 Thanks, John From: outlaw...@gmail.com Date: Mon, 4 Mar 2013 15:37:36 -0500 Subject: Re: dfs.datanode.du.reserved To: user@hadoop.apache.org Poss

Re: dfs.datanode.du.reserved

2013-03-04 Thread Ellis Miller
It is possible to reserve 0, from various testing I have done, though that could cause the obvious side effect of achieving zero free disk space :) I have only tested in a development environment, however. Yet there are various tuning white papers and other benchmarks where the very same has been tested. Thanks. On M

Re: Accumulo and Mapreduce

2013-03-04 Thread Ted Dunning
Chaining the jobs is a fantastically inefficient solution. If you use Pig or Cascading, the optimizer will glue all of your map functions into a single mapper. The result is something like: (mapper1 -> mapper2 -> mapper3) => reducer Here the parentheses indicate that all of the map function
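Written out by hand, a sketch of what that fusion amounts to: one map task applies all three transformations in memory, so nothing hits HDFS between the steps (the per-record step bodies are hypothetical):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Equivalent of (mapper1 -> mapper2 -> mapper3) before the reducer:
    // the three functions compose inside a single map() call.
    public class FusedMapper extends Mapper<LongWritable, Text, Text, Text> {

      // Hypothetical per-record transformations.
      private String step1(String s) { return s.trim(); }
      private String step2(String s) { return s.toLowerCase(); }
      private String step3(String s) { return s.replaceAll("\\s+", " "); }

      @Override
      protected void map(LongWritable key, Text value, Context ctx)
          throws IOException, InterruptedException {
        String composed = step3(step2(step1(value.toString())));
        ctx.write(new Text(composed), value);
      }
    }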

dfs.datanode.du.reserved

2013-03-04 Thread John Meza
The parameter dfs.datanode.du.reserved is used to reserve disk space PER datanode. Is it possible to reserve a different amount of disk space per DISK? Thanks, John

Re: Accumulo and Mapreduce

2013-03-04 Thread Russell Jurney
You can chain MR jobs with Oozie, but I would suggest using Cascading, Pig or Hive. You can do this in a couple of lines of code, I suspect. Two MapReduce jobs should not pose any kind of challenge with the right tools. On Monday, March 4, 2013, Sandy Ryza wrote: > Hi Aji, > > Oozie is a mature proje

Oozie Meetup on March 12

2013-03-04 Thread Robert Kanter
Hi everyone, I apologize for cross-posting this. Cloudera will be hosting an Oozie meetup on March 12 from 2:30pm to 5:00pm in our Palo Alto office. Please join us to meet fellow Oozie users and developers and have some free food. Interested users from other projects are welcome to join us too.

RE: Unknown processes unable to terminate

2013-03-04 Thread Leo Leung
Hi Sai, The RunJar process is normally the result of someone or something running “hadoop jar ” (i.e.: org.apache.hadoop.util.RunJar). You probably want to find out who/what is running with more detailed info via ps -ef | grep RunJar. stop-all.sh deals with HDFS / M/R specific processes on

Re: Accumulo and Mapreduce

2013-03-04 Thread Sandy Ryza
Hi Aji, Oozie is a mature project for managing MapReduce workflows. http://oozie.apache.org/ -Sandy On Mon, Mar 4, 2013 at 8:17 AM, Justin Woody wrote: > Aji, > > Why don't you just chain the jobs together? > http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining > > Justin > > On M

Re: Accumulo and Mapreduce

2013-03-04 Thread Justin Woody
Aji, Why don't you just chain the jobs together? http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining Justin On Mon, Mar 4, 2013 at 11:11 AM, Aji Janis wrote: > Russell thanks for the link. > > I am interested in finding a solution (if out there) where Mapper1 outputs a > custom obj

Re: Accumulo and Mapreduce

2013-03-04 Thread Aji Janis
Russell, thanks for the link. I am interested in finding a solution (if out there) where Mapper1 outputs a custom object and Mapper2 can use that as input. One way to do this is obviously by writing to Accumulo, in my case. But is there another solution for this: List > Input to Job MyObject -
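One conventional pattern for exactly this, offered as a sketch rather than anything confirmed in the thread: have Job 1 write its (key, MyObject) pairs to a SequenceFile and point Job 2 at it, so no store like Accumulo is needed in between. MyObject and the staging path are hypothetical, and MyObject must implement Writable:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class TwoJobWiring {
      static void wire(Job job1, Job job2) throws Exception {
        Path staging = new Path("/tmp/stage1"); // hypothetical path

        // Job 1 writes (Text, MyObject) pairs to a SequenceFile...
        job1.setOutputFormatClass(SequenceFileOutputFormat.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(MyObject.class); // must implement Writable
        SequenceFileOutputFormat.setOutputPath(job1, staging);

        // ...and Job 2 reads the same pairs back as its mapper input.
        job2.setInputFormatClass(SequenceFileInputFormat.class);
        SequenceFileInputFormat.addInputPath(job2, staging);
      }
    }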

Re: map reduce and sync

2013-03-04 Thread Lucas Bernardi
Ok, so I found a workaround for this issue, and I share it here for others. The key problem is that Hadoop won't update the file size until the file is closed, so FileInputFormat will see never-closed files as empty files and generate no splits for the map reduce process. To fix this problem
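Lucas's actual fix is cut off above, so purely as a sketch of the shape such a workaround could take (this is an assumption, not his code): subclass TextInputFormat and replace the zero length reported for unclosed files with what the open stream will actually serve. Note that available() returns an int, so this sketch under-reports files past 2 GB:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class UnclosedFileInputFormat extends TextInputFormat {
      @Override
      protected List<FileStatus> listStatus(JobContext job) throws IOException {
        List<FileStatus> patched = new ArrayList<FileStatus>();
        for (FileStatus stat : super.listStatus(job)) {
          if (stat.getLen() == 0 && !stat.isDir()) {
            // The namenode still reports length 0; ask the stream instead.
            FileSystem fs =
                stat.getPath().getFileSystem(job.getConfiguration());
            FSDataInputStream in = fs.open(stat.getPath());
            try {
              long visible = in.available();
              patched.add(new FileStatus(visible, false,
                  stat.getReplication(), stat.getBlockSize(),
                  stat.getModificationTime(), stat.getPath()));
            } finally {
              in.close();
            }
          } else {
            patched.add(stat);
          }
        }
        return patched;
      }
    }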

Re: Best Practice: How to start and shutdown a complete cluster or adding nodes when needed (Automated with Java API or Rest) (On EC2)

2013-03-04 Thread John Conwell
It depends on a couple of factors. First, are you developing a product where customers will need the freedom to choose which cloud provider to use, or something in-house where you can standardize on one cloud provider (like AWS)? And second, do you only need to spin up Hadoop resources? Or do you nee

Best Practice: How to start and shutdown a complete cluster or adding nodes when needed (Automated with Java API or Rest) (On EC2)

2013-03-04 Thread Christian Schneider
Hi, what is the best way to realize this? In our current scenario we need the cluster only for some overnight processing. Therefore it would be good to shut down the cluster overnight and store the results on S3. Could you suggest some libraries or services for that, like Whirr? Or is the Amazo

Re: mapper combiner and partitioner for particular dataset

2013-03-04 Thread Vikas Jadhav
Thank you for the reply. Can you please elaborate? I am not getting what the following means in a programming environment: you will need a custom written "high level" partitioner and combiner that can create multiple instances of sub-partitioners/combiners and use the most likely one based on their
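A guess at what the quoted advice means, sketched with hypothetical names: one top-level Partitioner that does no partitioning math itself, but inspects each record and delegates to whichever sub-partitioner fits (a combiner could be wrapped the same way):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

    public class DelegatingPartitioner extends Partitioner<Text, Text> {

      private final HashPartitioner<Text, Text> generic =
          new HashPartitioner<Text, Text>();
      private final DatasetAPartitioner datasetA =
          new DatasetAPartitioner(); // hypothetical sub-partitioner

      @Override
      public int getPartition(Text key, Text value, int numPartitions) {
        // Hypothetical routing rule: keys tagged "A|" belong to dataset A.
        if (key.toString().startsWith("A|")) {
          return datasetA.getPartition(key, value, numPartitions);
        }
        return generic.getPartition(key, value, numPartitions);
      }
    }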

Re: Accumulo and Mapreduce

2013-03-04 Thread Russell Jurney
http://svn.apache.org/repos/asf/accumulo/contrib/pig/trunk/src/main/java/org/apache/accumulo/pig/AccumuloStorage.java AccumuloStorage for Pig comes with Accumulo. Easiest way would be to try it. Russell Jurney http://datasyndrome.com On Mar 4, 2013, at 5:30 AM, Aji Janis wrote: Hello, I have

Re: Unknown processes unable to terminate

2013-03-04 Thread Jean-Marc Spaggiari
Hi Sai, are you fine with killing all those processes on this machine? If you need ALL those processes to be killed, and if they are all Java processes, you can use killall -9 java. That will kill ALL the Java processes under this user. JM 2013/3/4 shashwat shriparv : > You can you kill -9 13082 > > Is the

Re: can someone help me how to disable the log info in terminal when type command "bin/yarn node" in YARN

2013-03-04 Thread Joshi, Rekha
I have almost never silenced the logs on the terminal, only tuned config for the path/retention period of logs. So, just off the top of my mind: mostly -S/--silent for no logs and -V/--verbose for maximum logs works on executables; --help will confirm if it is possible. If it doesn't work, well, it should :-) Thanks Re

Re: Unknown processes unable to terminate

2013-03-04 Thread shashwat shriparv
You can use kill -9 13082. Is there an Eclipse or NetBeans project running? That may be creating this process. ∞ Shashwat Shriparv On Mon, Mar 4, 2013 at 3:12 PM, Sai Sai wrote: > I have a list of following processes given below, i am trying to kill the > process 13082 using: > > kill 13082 > > Its n

Re: Unknown processes unable to terminate

2013-03-04 Thread Sai Sai
I have a list of the following processes given below, and I am trying to kill process 13082 using: kill 13082 It's not terminating RunJar. I have done a stop-all.sh hoping it would stop all the processes, but it only stopped the Hadoop-related processes. I am just wondering if it is necessary to stop a

Re: Files in hadoop.

2013-03-04 Thread Jagat Singh
Hi, it's about what you want to do with the data. Most people keep the data there, and many others use it and run various types of queries on it. Many generate temporary data and delete it after running their analysis. There is no unique answer for this; it's what you need to do. Regards, Jagat Singh On Mon, Mar 4,

Files in hadoop.

2013-03-04 Thread Sai Sai
Just wondering: after we put a file in Hadoop for running MR jobs, and after we are done with it, is it standard to delete it or just leave it there? Just wondering what others do. Any input will be appreciated. Thanks, Sai