If you're looking at automated file/record/event collection, take a
look at Apache Flume: http://incubator.apache.org/flume/. It also handles
distributed collection well and is very configurable.
Otherwise, write a scheduled script to do the uploads every X period
(your choice). Consider usin
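For the script route, something like this could work as a starting point
(all the paths, the 15-minute period and the directory layout below are
only placeholders, not recommendations):

#!/usr/bin/env python
# upload_logs.py -- sketch only: push finished local files into HDFS,
# then move them aside so the next run does not upload them again.
# Assumes the 'hadoop' CLI is on the PATH of the cron user and that the
# target HDFS directory already exists.
import glob, os, shutil

SRC = '/var/log/myapp/done'        # where finished files appear locally
DONE = '/var/log/myapp/uploaded'   # local archive after a successful put
DST = '/user/hadoop/incoming'      # HDFS landing directory

for path in glob.glob(os.path.join(SRC, '*.log')):
    if os.system('hadoop fs -put %s %s/' % (path, DST)) == 0:
        shutil.move(path, DONE)

# example crontab entry, every 15 minutes:
# */15 * * * * /home/hadoop/bin/upload_logs.py >> /tmp/upload_logs.out 2>&1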
Hey Robert,
Any chance you can pastebin the JT logs, grepped for the bad job ID,
and send the link across? They shouldn't hang the way you describe.
On Fri, Jul 13, 2012 at 9:33 AM, Robert Dyer wrote:
> I'm using Hadoop 1.0.3 on a small cluster (1 namenode, 1 jobtracker, 2
> compute nodes). My
Hi,
Could you kindly provide the pros and cons of the MultiFile, CombineFile,
and SequenceFile input formats?
Thanks in Advance.
Cheers!
Manoj.
On Fri, Jul 13, 2012 at 10:15 AM, Bejoy KS wrote:
> **
> Hi Manoj
>
> If you are looking at a scheduler and a work flow manager to carry out
> this task you
Hi Manoj
If you are looking at a scheduler and a work flow manager to carry out this
task you can have a look at oozie.
If your XML files are small (smaller than the HDFS block size), then it is
definitely better practice to combine them to form larger files. Combining
into Sequence Files should
Hi Robert
It could be because there are no free slots available in your cluster at
job submission time to launch those tasks. Some other tasks may have
already occupied the map/reduce slots.
When you experience this random issue, please verify whether there are
free task slots available.
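A quick way to confirm it, assuming a stock Hadoop 1.x setup: the JobTracker
web UI (port 50030 by default) shows the cluster's Map/Reduce Task Capacity
next to the number of maps and reduces currently running, and

hadoop job -list

shows which jobs are occupying the cluster at that moment.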
I'm using Hadoop 1.0.3 on a small cluster (1 namenode, 1 jobtracker, 2
compute nodes). My input size is a sequence file of around 280mb.
Generally, my jobs run just fine and all finish in 2-5 minutes. However,
quite randomly the jobs refuse to run. They submit and appear when running
'hadoop jo
Hi Siv,
Moving this to mapreduce-user@; please use the developer lists only for
project development questions.
The class StreamXmlRecordReader is still present in 2.0.0 and will
continue to be present. Why do you say it has been removed? It now
resides in the hadoop-tools jar (which is what carries s
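For reference, a streaming invocation with that reader looks roughly like
the below; the jar path, the input/output paths and the begin/end tags are
only placeholders, and the exact reader spec string is worth checking
against the streaming docs of your release:

hadoop jar <path-to>/hadoop-streaming-*.jar \
  -input /user/siv/xml-in \
  -output /user/siv/xml-out \
  -mapper /bin/cat \
  -inputreader "StreamXmlRecordReader,begin=<record>,end=</record>"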
This works. Thank you.
-Original Message-
From: Harsh J [mailto:ha...@cloudera.com]
Sent: Thursday, July 12, 2012 10:58 AM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Extra output files from mapper ?
Chuck,
Note that the regular file opens from within an MR program (be it strea
Team,
I have written a MapReduce program. The scenario of my program is to emit
.
Total no. of users: 825
Total no. of seqids: 6583100
No. of map outputs the program will emit: 825 * 6583100
I have an HBase table called ObjectSequence, which consists of 6583100 rows.
I had used TableMapper and
You can ship the module along with a symlink and have Python
auto-import it, since "." is always on Python's module search path? I can imagine that helping
you get Pydoop on a cluster without Pydoop on all nodes (or other
libs).
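For example, with streaming (the file names and paths below are made up),
the -file option ships each listed file into every task's working
directory, so a plain "import my_helpers" inside my_mapper.py resolves
against the shipped copy:

hadoop jar <path-to>/hadoop-streaming-*.jar \
  -file my_mapper.py \
  -file my_helpers.py \
  -mapper my_mapper.py \
  -input /user/chuck/in \
  -output /user/chuck/out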
On Thu, Jul 12, 2012 at 11:08 PM, Connell, Chuck
wrote:
> Thanks yet again. Since my goal is
Thanks yet again. Since my goal is to run an existing Python program, as is,
under MR, it looks like I need the os.system(copy-local-to-hdfs) technique.
Chuck
-Original Message-
From: Harsh J [mailto:ha...@cloudera.com]
Sent: Thursday, July 12, 2012 1:15 PM
To: mapreduce-user@hadoop.
Unfortunately, Python does not recognize hdfs:// URIs. It isn't a
standard like HTTP is, so to speak, at least not yet :)
You can instead use Pydoop's HDFS APIs though
http://pydoop.sourceforge.net/docs/api_docs/hdfs_api.html#hdfs-api.
Pydoop authors are pretty active and do releases from time to tim
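A minimal sketch of that, assuming Pydoop is installed on the nodes that
run your tasks (the namenode URI and the path are placeholders):

import pydoop.hdfs as hdfs

# write straight into HDFS
f = hdfs.open('hdfs://localhost:8020/tmp/out1.txt', 'w')
f.write('written through pydoop\n')
f.close()

# and read it back the same way
f = hdfs.open('hdfs://localhost:8020/tmp/out1.txt')
data = f.read()
f.close()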
Yes Harsh, it was posted 2 months back with no response; if you google for
CombineFileInputFormat you will surely see it.
The Mapreduce user group rocks!
Thanks for the response.
On 12 Jul 2012 21:25, "Harsh J" wrote:
Thank you. I will try that.
A related question... Shouldn't I just be able to create HDFS files directly
from a Python open statement, when running within MR, like this? It does not
seem to work as intended.
outfile1 = open("hdfs://localhost/tmp/out1.txt", 'w')
Chuck
-Original Message-
Hey Manoj,
I find the asker name here quite strange, although it is the same
question, ha:
http://stackoverflow.com/questions/10380200/how-to-use-combinefileinputformat-in-hadoop
Anyhow, here's one example:
http://blog.yetitrails.com/2011/04/dealing-with-lots-of-small-files-in.html
On Thu, Jul
Gentles,
I want to use the CombineFileInputFormat of Hadoop 0.20.0 / 0.20.2 such
that it processes 1 file per record and also doesn't compromise on data
locality (which it normally takes care of).
It is mentioned in Tom White's Hadoop Definitive Guide but he has not shown
how to do it. Instead,
Chuck,
Note that the regular file opens from within an MR program (be it
streaming or be it Java), will create files on the local file system
of the node the task executed on.
Hence, at the end of your script, move them to HDFS after closing them.
Something like:
os.system("hadoop fs -put outfi
Here is a test case...
The Python code (file_io.py) that I want to run as a map-only job is below. It
takes one input file (not stdin) and creates two output files (not stdout).
#!/usr/bin/env python
import sys
infile = open(sys.argv[1], 'r')
outfile1 = open(sys.argv[2], 'w')
outfile2 = open(sys.argv[3], 'w')
Background of my problem: I am running Nutch 1.4 on Hadoop 0.20.203. There
is a series of MapReduce jobs that I am performing on Nutch segments to get
the final output. But waiting for the whole crawl to happen before running
MapReduce causes the solution to run for a longer time. I am now triggering
MapReduce jobs