Just wondering: after we put a file in Hadoop for running MR jobs, what should happen once we are done with it?
Is it standard to delete it, or just leave it there?
Just wondering what others do.
Any input will be appreciated.
Thanks
Sai
Hi,
It depends on what you want to do with the data.
Most people keep the data there, and many run various types of
queries on it.
Many others generate temporary data and delete it after running their analysis.
There is no single answer to this; it depends on what you need to do.
Regards,
Jagat Singh
On Mon, Mar 4,
I have the list of processes given below, and I am trying to kill
process 13082 using:
kill 13082
It's not terminating RunJar.
I have run stop-all.sh hoping it would stop all the processes, but it only
stopped the Hadoop-related ones.
I am just wondering if it is necessary to stop
You can use kill -9 13082
Is there an Eclipse or NetBeans project running? That may be what started this process.
∞
Shashwat Shriparv
On Mon, Mar 4, 2013 at 3:12 PM, Sai Sai saigr...@yahoo.in wrote:
I have the list of processes given below, and I am trying to kill
process 13082 using:
kill
I've almost never silenced the logs on the terminal, only tuned the config for
the log path/retention period, so just off the top of my head: -S/--silent for no
logs and -V/--verbose for maximum logging work on most executables; --help will
confirm whether a given executable supports them.
If it doesn't work, well, it should :-)
Thanks
Hi Sai,
Are you fine with killing all those processes on this machine? If you need
ALL those processes to be killed, and they are all Java processes,
you can use killall -9 java. That will kill ALL the Java processes under
this user.
JM
2013/3/4 shashwat shriparv dwivedishash...@gmail.com:
You can use
http://svn.apache.org/repos/asf/accumulo/contrib/pig/trunk/src/main/java/org/apache/accumulo/pig/AccumuloStorage.java
AccumuloStorage for Pig comes with Accumulo. Easiest way would be to try it.
Russell Jurney http://datasyndrome.com
On Mar 4, 2013, at 5:30 AM, Aji Janis aji1...@gmail.com
Thank you for the reply.
Can you please elaborate? I am not getting what the following means in a
programming environment:
you will need a custom written high level partitioner and combiner that
can create multiple instances of sub-partitioners/combiners and use the
most likely one based on their
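For what it's worth, a minimal sketch of one possible reading of that quote: a partitioner that wraps sub-partitioners and delegates per key. The class name and the "hot-" key test are hypothetical illustrations, not an existing Hadoop API:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class DelegatingPartitioner extends Partitioner<Text, IntWritable> {
    // One possible sub-partitioner; a real version might hold several.
    private final HashPartitioner<Text, IntWritable> hash =
            new HashPartitioner<Text, IntWritable>();

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Pick the strategy "most likely" to suit this key; here, hypothetical
        // "hot-" keys get a dedicated partition and everything else hashes.
        if (key.toString().startsWith("hot-")) {
            return numPartitions - 1;
        }
        return hash.getPartition(key, value, numPartitions);
    }
}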
Hi,
what is the best way to realize this?
In our current scenario we need the cluster only for some overnight
processing.
Therefore it would be good to shut the cluster down after the overnight run
and store the results on S3.
Could you suggest me some libraries or services for that? Like Whirr?
Or is the
It depends on a couple of factors. First, are you developing a product where
customers will need the freedom to choose which cloud provider to use, or
something in-house where you can standardize on one cloud provider (like
AWS)? And second, do you only need to spin up Hadoop resources? Or do you
OK, so I found a workaround for this issue; I'm sharing it here for others.
The key problem is that Hadoop won't update the file size until the file
is closed, so FileInputFormat will see never-closed files as empty
files and generate no splits for the MapReduce job.
To fix this problem
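For context, a minimal sketch of the symptom itself, assuming a plain HDFS client (the path is a placeholder, and this illustrates the problem rather than the poster's actual fix):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenFileLengthDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/tmp/open-file-demo.txt"); // placeholder path

        FSDataOutputStream out = fs.create(p, true);
        out.writeBytes("data that is not yet visible to readers\n");
        // While the file is still open, the NameNode may report length 0,
        // which is the length FileInputFormat uses when computing splits.
        System.out.println("length while open:  " + fs.getFileStatus(p).getLen());

        out.close();
        // After close() the real length is published and splits are generated.
        System.out.println("length after close: " + fs.getFileStatus(p).getLen());
    }
}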
Russell thanks for the link.
I am interested in finding a solution (if one is out there) where Mapper1 outputs
a custom object and Mapper2 can use that as input. One way to do this,
obviously, is by writing to Accumulo in my case. But is there another
solution for this:
List<MyObject> Input to Job
Aji,
Why don't you just chain the jobs together?
http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
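For example, chaining two jobs by hand in the mapreduce API can look like this rough sketch (paths, job names and the intermediate directory are placeholders, and mapper/reducer setup is elided):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path intermediate = new Path("/tmp/stage1"); // placeholder intermediate dir

        Job first = new Job(conf, "job-1");
        // first.setMapperClass(...), first.setReducerClass(...), etc.
        FileInputFormat.addInputPath(first, new Path(args[0]));
        FileOutputFormat.setOutputPath(first, intermediate);
        if (!first.waitForCompletion(true)) System.exit(1);

        Job second = new Job(conf, "job-2");
        // The second job simply reads what the first one wrote.
        FileInputFormat.addInputPath(second, intermediate);
        FileOutputFormat.setOutputPath(second, new Path(args[1]));
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}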
Justin
On Mon, Mar 4, 2013 at 11:11 AM, Aji Janis aji1...@gmail.com wrote:
Russell thanks for the link.
I am interested in finding a solution (if out there) where Mapper1 outputs
Hi Aji,
Oozie is a mature project for managing MapReduce workflows.
http://oozie.apache.org/
-Sandy
On Mon, Mar 4, 2013 at 8:17 AM, Justin Woody justin.wo...@gmail.com wrote:
Aji,
Why don't you just chain the jobs together?
http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
Hi Sai,
The RunJar process is normally the result of someone or something running
"hadoop jar something"
(i.e., org.apache.hadoop.util.RunJar something).
You probably want to find out who/what is running it, with more detailed info,
via ps -ef | grep RunJar.
stop|start-all.sh deals with
Hi everyone,
I apologize for cross-posting this
Cloudera will be hosting an Oozie meetup on March 12 from 2:30pm to
5:00pm in our Palo Alto office.
Please join us to meet fellow Oozie users and developers and have some
free food. Interested users from other projects are welcome to join us
too.
You can chain MR jobs with Oozie, but I would suggest using Cascading, Pig or
Hive. You can do this in a couple of lines of code, I suspect. Two MapReduce
jobs should not pose any kind of challenge with the right tools.
On Monday, March 4, 2013, Sandy Ryza wrote:
Hi Aji,
Oozie is a mature project
The parameter dfs.datanode.du.reserved is used to reserve disk space PER
datanode. Is it possible to reserve a different amount of disk space per DISK?
Thanks,
John
Chaining the jobs is a fantastically inefficient solution. If you use Pig
or Cascading, the optimizer will glue all of your map functions into a
single mapper. The result is something like:
(mapper1 -> mapper2 -> mapper3) -> reducer
Here the parentheses indicate that all of the map functions
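In plain MapReduce you can get the same gluing by hand with ChainMapper from the mapred API; a minimal sketch with two toy map functions (class names and logic are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.ChainMapper;

public class GluedMappers {
    // First map function: lower-cases the line.
    public static class LowerCase extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text v, OutputCollector<Text, Text> out,
                        Reporter r) throws IOException {
            out.collect(new Text(v.toString().toLowerCase()), v);
        }
    }

    // Second map function: trims the key; runs in the SAME map task as the first.
    public static class Trim extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {
        public void map(Text k, Text v, OutputCollector<Text, Text> out,
                        Reporter r) throws IOException {
            out.collect(new Text(k.toString().trim()), v);
        }
    }

    public static void configure(JobConf job) {
        // Both mappers execute back to back inside one mapper, mirroring what
        // the Pig/Cascading optimizer does automatically.
        ChainMapper.addMapper(job, LowerCase.class, LongWritable.class, Text.class,
                Text.class, Text.class, true, new JobConf(false));
        ChainMapper.addMapper(job, Trim.class, Text.class, Text.class,
                Text.class, Text.class, true, new JobConf(false));
    }
}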
It is possible to reserve 0, from various testing I have done, though that has
the obvious side effect of reserving zero disk space :) I have only tested this
in a development environment, however. But there are various tuning white papers
and other benchmarks where the very same has been tested.
Thanks.
On
I'm probably not being clear. This seems to describe it:
dfs.datanode.du.reserved configured per-volume:
https://issues.apache.org/jira/browse/HDFS-1564
Thanks,
John
From: outlaw...@gmail.com
Date: Mon, 4 Mar 2013 15:37:36 -0500
Subject: Re: dfs.datanode.du.reserved
To: user@hadoop.apache.org
Based on earlier discussions, I was considering using JobControl or
ChainMapper to do this. But as a few of you mentioned, Pig, Cascading or
Oozie might be better. So what are the use cases for them? How do I decide
which one works best for what?
Thank you all for your feedback.
On Mon, Mar
As Ted said, my first choice would be Cascading. Second choice would be
ChainMapper. As you'll see in those search results [0], it's not available
in the modern mapreduce API consistently across Hadoop releases. If
you've already implemented this against the mapred API, go for
ChainReducer. If you
Hi,
I'm looking for some feedback on how to decide how many threads to assign
to the NameNode and JobTracker.
I currently have 24 datanodes (running CDH3) and am finding a lot of varying
advice on how to set these properties and change them as the cluster grows.
Some (older) documentation (*
Hi,
I am new to HDFS. In my Java application, I need to perform a 'similar operation'
over a large number of files. I would like to store those files on distributed
machines. I don't think I will need the MapReduce paradigm, but I would
like to use HDFS for file storage and access. Is it
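A minimal sketch of plain HDFS file storage and access from Java, assuming a core-site.xml on the classpath points fs.default.name (or fs.defaultFS) at the cluster; the path and contents are placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStorageDemo {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster address from the Hadoop config files.
        FileSystem fs = FileSystem.get(new Configuration());

        Path file = new Path("/data/example.txt"); // placeholder path
        FSDataOutputStream out = fs.create(file, true);
        out.writeBytes("hello hdfs\n");
        out.close();

        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
        System.out.println(in.readLine());
        in.close();
    }
}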
Delegation tokens and job tokens seem similar to me, and I need to understand
the exact difference between them.
What is the difference between a delegation token and a job token in Hadoop?
Austin,
I think you have to use a partitioner to spawn more than one reducer for
a small data set.
The default partitioner will give you only one reducer; you have to
override it and implement your own logic to spawn more than one reducer.
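A minimal sketch of such a custom partitioner (the class name is illustrative; note the driver must also request more than one reducer):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Spreads keys across however many reducers the job requests.
public class SpreadPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// In the driver:
//   job.setNumReduceTasks(4);                        // more than one reducer
//   job.setPartitionerClass(SpreadPartitioner.class);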
On Tue, Mar 5, 2013 at 1:27 AM, Austin Chungath
Are you using a combiner? If not, that would be the first thing to try.
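A combiner is just a Reducer run on map output before the shuffle; a minimal sketch, assuming a job with Text keys and IntWritable counts (names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        ctx.write(key, new IntWritable(sum)); // pre-aggregate before the shuffle
    }
}

// Enable it with job.setCombinerClass(SumCombiner.class);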
On 05-Mar-2013, at 1:27 AM, Austin Chungath wrote:
Hi all,
I have 1 reducer and around 600 thousand unique keys coming to it. The
total data is only around 30 MB.
My logic doesn't allow me to have more than 1