function on xml string
Here's the code. If folks are interested, I can submit it as a patch as well.
Prasan Ary wrote:
Colin, is it possible for you to share some of the code with us? thx,
Prasan
Colin Evans <[EMAIL PROTECTED]> wrote: We ended up subclassi
John,
My meaning didn't come through.
If you encode binary data and treat it like any piece of text going through
Hadoop's default input format, at some point your binary data might contain a
byte that looks like 1010 in binary (0x0A in hex), and in ASCII, might it not
be interpreted as a newline?
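For illustration only (this is not necessarily the approach being discussed in
this thread): if each binary record is Base64-encoded before it is written out
as a line, the encoded text can never contain a raw 0x0A byte, so the only
newlines in the stream are the record separators added on purpose. A minimal
sketch:

import java.util.Base64;

public class BinaryLineEncoding {
    public static void main(String[] args) {
        byte[] record = {0x01, 0x0A, (byte) 0xFF, 0x00};   // raw bytes; 0x0A is '\n'

        // Base64 output only uses [A-Za-z0-9+/=], so the encoded record can never
        // contain a raw newline byte and is safe to treat as one streaming "line".
        String line = Base64.getEncoder().encodeToString(record);
        System.out.println(line);                             // prints "AQr/AA=="

        // The streaming mapper/reducer decodes the line back into the original bytes.
        byte[] decoded = Base64.getDecoder().decode(line);
        System.out.println(decoded.length == record.length);  // true
    }
}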
John,
That's an interesting approach, but isn't it possible that an equivalent \n
might get encoded in the binary data?
John Menzer <[EMAIL PROTECTED]> wrote:
So you mean you changed the Hadoop streaming source code?
Actually I am not really willing to change the source code if it's not
I have hundreds of files in an S3 bucket and I am trying to consolidate them
into a smaller number of large files.
I was working with the code for Hadoop's 'distcp', trying to see if it is at all
possible to consolidate files as they are being copied from S3 onto HDFS. So
far, I haven't had any luck.
Has
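In case it helps: distcp only copies file-for-file, so it won't merge anything
by itself. One way to actually concatenate many small files into one (short of
writing a MapReduce job with a single reducer) is FileUtil.copyMerge. This is
only a sketch; the bucket, credentials and paths below are placeholders, and
the exact API can differ between Hadoop versions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Path srcDir = new Path("s3://ID:SECRET@BUCKET/input/");      // many small files
        Path dstFile = new Path("hdfs:///consolidated/merged.txt");  // one large file

        FileSystem srcFs = srcDir.getFileSystem(conf);
        FileSystem dstFs = dstFile.getFileSystem(conf);

        // Copies every file under srcDir, in listing order, into the single
        // destination file.
        FileUtil.copyMerge(srcFs, srcDir, dstFs, dstFile,
                           false /* keep the source files */, conf,
                           null  /* no separator string between files */);
    }
}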
startup time. If your jobs are I/O
bound, they will be able to read 100 MB of data in just a few seconds at
most. Startup time for a Hadoop job is typically 10 seconds or more.
On 4/4/08 12:58 PM, "Prasan Ary" wrote:
> I have a question on how input files are split before they are giv
I have a question on how input files are split before they are given out to Map
functions.
Say I have an input directory containing 1000 files whose total size is 100
MB, I have 10 machines in my cluster, and I have set mapred.map.tasks to 10 in
hadoop-site.xml.
1. With this conf
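For what it's worth, mapred.map.tasks is only a hint: with the stock
FileInputFormat a split never combines more than one file, so 1000 small files
mean at least 1000 map tasks no matter what the hint says. A minimal sketch of
setting the hint programmatically (the paths are placeholders, and the exact
setup calls vary a bit between Hadoop versions):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SplitHintExample {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SplitHintExample.class);
        conf.setJobName("split-hint-demo");

        FileInputFormat.setInputPaths(conf, new Path("/input"));    // placeholder paths
        FileOutputFormat.setOutputPath(conf, new Path("/output"));

        // Same effect as mapred.map.tasks=10 in hadoop-site.xml; it is only a
        // hint, and the real number of maps is driven by the input splits.
        conf.setNumMapTasks(10);

        JobClient.runJob(conf);
    }
}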
files from
S3 to HDFS on EC2 without having to iterate through each file?
[EMAIL PROTECTED] wrote:
It might be a bug. Could you try the following?
bin/hadoop fs -ls s3://ID:[EMAIL PROTECTED]/InputFileFormat.xml
Nicholas
- Original Message
From: Prasan Ary
To: core-user@had
Anybody? Any thoughts on why this might be happening?
Here is what is happening directly from the EC2 screen. The ID and Secret Key
are the only things changed.
I'm running Hadoop 0.15.3 from the public AMI. I launched a 2-machine cluster
using the EC2 scripts in src/contrib/ec2/bin ...
The file I try to copy is 9 KB (I noticed previous d
That was a typo in my email. I do have s3:// in my command when it fails.
---
[EMAIL PROTECTED] wrote:
> bin/hadoop distcp s3//:@/fileone.txt /somefolder_on_hdfs/fileone.txt :
Fails - Input source doesn't exist.
Should "s3//..." be "s3://..."?
Nicholas
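Assuming that was the problem, the corrected command would look something like
the following (ID, SECRET and BUCKET stand in for the real credentials and
bucket name):

bin/hadoop distcp s3://ID:SECRET@BUCKET/fileone.txt /somefolder_on_hdfs/fileone.txt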
Hi,
I am running Hadoop 0.15.3 on 2 EC2 instances from a public AMI
(ami-381df851). Our input files are on S3.
When I try to do a distcp for an input file from S3 onto HDFS on EC2, the
copy fails with an error that the file does not exist. However, if I run
copyToLocal from S3 onto
I changed the configuration a little so that the MR jar file now runs on my
local Hadoop cluster, but takes input files from S3.
I get the following output:
08/03/26 17:32:39 INFO mapred.FileInputFormat: Total input paths to process : 1
08/03/26 17:32:44 INFO mapred.JobClient: Running
to image on EC2 and accessed from there.
--
Owen O'Malley <[EMAIL PROTECTED]> wrote:
On Mar 25, 2008, at 1:07 PM, Prasan Ary wrote:
> I am running hadoop on EC2. I want to run a jar MR application
I am running Hadoop on EC2. I want to run a jar MR application on EC2 such that
input and output files are on S3.
I configured hadoop-site.xml so that the fs.default.name property points to my
S3 bucket with all required identification (e.g. s3://:@). I created an input directory in this buc
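For reference, the hadoop-site.xml entry being described would look roughly
like this (ID, SECRET and BUCKET are placeholders; the credentials can also go
in the fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey properties instead of
the URI):

<property>
  <name>fs.default.name</name>
  <value>s3://ID:SECRET@BUCKET</value>
</property>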
oing on.
Also, setting up FoxyProxy on Firefox lets you browse your whole
cluster if you set up an SSH tunnel (SOCKS).
On Mar 20, 2008, at 10:15 AM, Prasan Ary wrote:
> Hi All,
> I have been trying to configure Hadoop on EC2 for a large number of
> clusters (100 plus). It seems that I have to
Hi All,
I have been trying to configure Hadoop on EC2 for a large number of clusters
(100 plus). It seems that I have to copy the EC2 private key to all the machines
in the cluster so that they can have SSH connections.
For now it seems I have to run a script to copy the key file to each of the
E
I have two Map/Reduce jobs and both of them output a file each. Is there a way
I can name these output files differently from the default "part-" names?
thanks.
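There may be better ways (a custom OutputFormat, for instance), but one simple
approach, assuming each job has a single reducer, is to rename the default
part-00000 file after the job finishes. A minimal sketch; all the paths are
placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenameJobOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Default name written by the (single) reducer of the first job.
        Path partFile = new Path("/job1/output/part-00000");
        Path renamed  = new Path("/job1/output/job1-result.txt");

        if (fs.exists(partFile)) {
            fs.rename(partFile, renamed);   // returns false if the rename fails
        }
    }
}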
ally sounds like you have taken a bit of an odd turn somewhere in porting
your algorithm to a parallel form.
On 3/12/08 9:24 AM, "Prasan Ary" wrote:
> I have a very large xml file as input and a couple of Map/Reduce functions.
> Input key/value pair to all of my map fun
I have a very large XML file as input and a couple of Map/Reduce functions.
The input key/value pair to all of my map functions is the same.
I was wondering if there is a way that I can read the input XML file only once,
then create the key/value pairs (also once) and give these k/v pairs as input to my
Hi All,
I am running a Map/Reduce job on a text file.
Map takes a (key, value) pair as input and outputs a (key, value) pair.
Reduce takes a (key, value) pair as input and outputs a (key, value) pair.
I am getting a type mismatch error.
Any suggestions?
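Without seeing the job it's only a guess, but the usual cause of a "Type
mismatch" error is that the mapper's actual output types don't match what the
JobConf declares: setOutputKeyClass/setOutputValueClass describe the reduce
output, and they are also assumed for the map output unless the map output
types are declared separately. A sketch with made-up types:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

public class TypeDeclarations {
    public static void configure(JobConf conf) {
        // What the mapper actually writes to output.collect(...)
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);

        // What the reducer actually writes. Without the two calls above these
        // would also be assumed for the map output and trigger a type mismatch
        // at runtime if the mapper emits something else.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
    }
}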
Hi All,
I am using Eclipse to write a map/reduce Java application that connects to
Hadoop on a remote cluster.
Is there a way I can display intermediate results of map (or reduce) much
the same way as I would use System.out.println(variable_name) if I were
running any application on a single
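One thing worth knowing: anything a map or reduce task writes to stdout or
stderr ends up in that task's log files on the node that ran it (reachable from
the task details page in the JobTracker web UI), not on the console of the
machine that submitted the job. A small hypothetical mapper illustrating this:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class DebugMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        System.out.println("map saw: " + value);          // lands in the task's stdout log
        reporter.setStatus("processing offset " + key);   // visible in the web UI
        output.collect(new Text("line"), value);
    }
}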
erence for this.
Prasan Ary wrote:
> Hi All,
> I am writing a Java implementation for my map/reduce function on Hadoop.
> Input to this is an XML file, and the map function has to process
> well-formed XML records. So far I have been unable to split the XML file at
> XML record bou
Hi All,
I am writing a Java implementation for my map/reduce function on Hadoop.
Input to this is an XML file, and the map function has to process well-formed
XML records. So far I have been unable to split the XML file at XML record
boundaries to feed into my map function.
Can anybody point
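For what it's worth, one common way to do this (and roughly what the streaming
contrib's StreamXmlRecordReader does) is a custom RecordReader that scans for a
start tag and an end tag and hands everything in between to the map function as
one record. The sketch below is only an illustration, not the code mentioned
earlier in the thread; the <record>/</record> tag names, the class name, and
the assumption of single-byte tag characters are all mine. It would be returned
from a FileInputFormat subclass's getRecordReader().

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;

public class XmlRecordReader implements RecordReader<LongWritable, Text> {
    private static final byte[] START_TAG = "<record>".getBytes();   // assumed tags
    private static final byte[] END_TAG = "</record>".getBytes();

    private final FSDataInputStream in;
    private final long start;
    private final long end;

    public XmlRecordReader(FileSplit split, JobConf job) throws IOException {
        FileSystem fs = split.getPath().getFileSystem(job);
        start = split.getStart();
        end = start + split.getLength();
        in = fs.open(split.getPath());
        in.seek(start);
    }

    public boolean next(LongWritable key, Text value) throws IOException {
        // Only begin records whose start tag lies inside this split; the record
        // itself may run past the split boundary, and the next split skips any
        // partial record at its front because it too searches for a start tag.
        if (in.getPos() < end && readUntilMatch(START_TAG, null)) {
            StringBuilder record = new StringBuilder(new String(START_TAG));
            if (readUntilMatch(END_TAG, record)) {
                key.set(in.getPos());
                value.set(record.toString());
                return true;
            }
        }
        return false;
    }

    // Read forward until the tag's bytes are seen; if buf is non-null, collect
    // everything read (including the tag). Assumes single-byte tag characters.
    private boolean readUntilMatch(byte[] tag, StringBuilder buf) throws IOException {
        int matched = 0;
        while (true) {
            int b = in.read();
            if (b == -1) return false;                      // ran off the end of the file
            if (buf != null) buf.append((char) b);
            if (b == tag[matched]) {
                if (++matched == tag.length) return true;   // whole tag seen
            } else {
                matched = (b == tag[0]) ? 1 : 0;
            }
            // While hunting for a start tag, give up once we pass the split's end.
            if (buf == null && matched == 0 && in.getPos() >= end) return false;
        }
    }

    public LongWritable createKey() { return new LongWritable(); }
    public Text createValue() { return new Text(); }
    public long getPos() throws IOException { return in.getPos(); }
    public float getProgress() throws IOException {
        if (end == start) return 1.0f;
        return Math.min(1.0f, (in.getPos() - start) / (float) (end - start));
    }
    public void close() throws IOException { in.close(); }
}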