One way of doing what you need is to extend MultipleTextOutputFormat and
override the following APIs:
- generateFileNameForKeyValue()
- generateActualKey()
- generateActualValue()
You will need to prefix the key/value with the directory and file name of
your choice, depending on your needs. Assum
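Something like the following untested sketch (here the reducer is assumed
to emit Text keys that already carry the directory prefix, e.g.
"200904/batch_20090429"; the key routes the record and is then dropped
from the written output):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class BatchOutputFormat extends MultipleTextOutputFormat<Text, Text> {

    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // Use the key as a subdirectory; "name" is the default leaf file
        // name, e.g. part-00000.
        return key.toString() + "/" + name;
    }

    @Override
    protected Text generateActualKey(Text key, Text value) {
        // Drop the key once it has done its routing job; TextOutputFormat
        // writes only the value when the key is null.
        return null;
    }

    @Override
    protected Text generateActualValue(Text key, Text value) {
        return value;
    }
}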
Hi Tom,
I have seen the tar-to-seq tool, but the person who made it says it is
very slow:
"It took about an hour and a half to convert a 615MB tar.bz2 file to an
868MB sequence file". To me that is not acceptable.
Normally, generating a tar file from 615MB of data takes less than
one minute. A
Yes, I am able to ping and ssh between the two virtual machines, and I
have even set the IP addresses of both virtual machines in their respective
/etc/hosts files ...
Thanks for the reply .. if you could suggest anything else I might
have missed, or any remedy
Regards,
Ashish
Make sure you can ping that datanode and ssh to it.
On Thu, May 28, 2009 at 12:02 PM, ashish pareek wrote:
> Hi,
> I am trying to set up a Hadoop cluster on 512 MB machines using
> Hadoop 0.18, and have followed the procedure given on the Apache Hadoop
> site for a Hadoop cluster.
> I included
Use the map-side join stuff; if I understand your problem, it provides a good
solution but requires getting over the learning hurdle.
Well described in chapter 8 of my book :)
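As a rough, untested sketch on the old mapred API (the paths and the input
format here are placeholders, not from the book):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class MapSideJoinSetup {
    public static JobConf configure() {
        JobConf conf = new JobConf(MapSideJoinSetup.class);
        // Both inputs must be sorted by key and identically partitioned.
        conf.setInputFormat(CompositeInputFormat.class);
        conf.set("mapred.join.expr", CompositeInputFormat.compose(
            "inner", SequenceFileInputFormat.class,
            new Path("/data/left"), new Path("/data/right")));
        // Each map call then sees the shared key plus a TupleWritable
        // holding one value from each joined source.
        return conf;
    }
}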
On Thu, May 28, 2009 at 8:29 AM, Chris K Wensel wrote:
> I believe PIG, and I know Cascading use a kind of 'spillable' li
Hi, can someone help me out?
On Thu, May 28, 2009 at 10:32 PM, ashish pareek wrote:
> Hi,
> I am trying to set up a Hadoop cluster on 512 MB machines using
> Hadoop 0.18, and have followed the procedure given on the Apache Hadoop
> site for a Hadoop cluster.
> I included in conf/slaves
At a minimum, enable map output compression (mapred.compress.map.output);
it may make some difference.
Sorting is very expensive when there are many keys and the values are large.
Are you quite certain your keys are unique?
Also, do you need them sorted by document id?
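For example (untested; the codec choice is just an illustration):

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

public class CompressedMapOutput {
    public static JobConf configure() {
        JobConf conf = new JobConf();
        conf.setCompressMapOutput(true);                    // mapred.compress.map.output=true
        conf.setMapOutputCompressorClass(GzipCodec.class);  // assumed codec; others work too
        return conf;
    }
}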
On Thu, May 28, 2009
Hi David,
If you go to JobTrackerHistory and then click on this job and then do
Analyse This Job, you should be able to get a breakdown of the timings for
the individual phases of the map and reduce tasks, including the average,
best and worst times. Could you provide those numbers so that we can get a
On Thu, May 28, 2009 at 9:50 AM, Owen O'Malley wrote:
>
> The update to the terasort example has an InputFormat that does exactly
> that. The key is 10 bytes and the value is the next 90 bytes. It is pretty
> easy to write, but I should upload it soon. The output types are Text, but
> they just h
I am trying to figure out the best way to split output into different
directories. My goal is to have a directory structure allowing me to add the
content from each batch into the right bucket, like this:
...
/content/200904/batch_20090429
/content/200904/batch_20090430
/content/200904/batch_20090
0.19 is considered unstable by us at Cloudera and by the Y! folks; they
never deployed it to their clusters. That said, we recommend 0.18.3 as the
most stable version of Hadoop right now. Y! has deployed (or will soon
deploy) 0.20, which implies that it's at least stable enough for them to
give it a g
Hi,
I am trying to understand the code of the index package to build a distributed
Lucene index. I have some very basic questions and would really appreciate it
if someone can help me understand this code:
1) If I already have a Lucene index (divided into shards), should I upload
these indexes into HDFS a
Hadoop noob here, just starting to learn it, as we're planning to start
using it heavily in our processing. Just wondering, though, which version
of the code I should start learning/working with.
It looks like the Hadoop API changed pretty significantly from 0.19 to
0.20 (e.g., org.apache.hadoop.
Hi,
How do I convert a DataInput to an array of Strings?
How do I convert a ResultSet to an array of Strings?
Thanks. Following is the code:
static class Record implements Writable, DBWritable {
    String[] aSAssoc;
    public void write(DataOutput arg0) throws IOException {
        throw new UnsupportedOperationException();
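A hedged sketch of the two missing readers (these would slot into the
Record class above; the on-disk layout, a count followed by Hadoop-encoded
strings, is an assumption about how the array was written):

// needs: java.io.*, java.sql.*, org.apache.hadoop.io.Text
public void readFields(DataInput in) throws IOException {
    int n = in.readInt();                 // assumes a count was written first
    aSAssoc = new String[n];
    for (int i = 0; i < n; i++) {
        aSAssoc[i] = Text.readString(in); // Hadoop's length-prefixed string helper
    }
}

public void readFields(ResultSet rs) throws SQLException {
    ResultSetMetaData md = rs.getMetaData();
    aSAssoc = new String[md.getColumnCount()];
    for (int i = 0; i < aSAssoc.length; i++) {
        aSAssoc[i] = rs.getString(i + 1); // JDBC columns are 1-indexed
    }
}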
Thanks Damien.
And can I update a file with Hadoop, or just create it and read it later?
Olivier
On Thu, May 28, 2009 at 1:31 PM, Damien Cooke wrote:
> Olivier,
> Append is not supported or recommended at this point. You can turn it on
> via dfs.support.append in hdfs-site.xml under 0.20.0. T
On Thu, May 28, 2009 at 6:02 AM, Steve Loughran wrote:
> That really depends on the work you are doing...the bytes in/out to CPU
> work, and the size of any memory structures that are built up over the run.
>
> With 1 core per physical disk, you get the bandwidth of a single disk per
> CPU; for s
On May 28, 2009, at 2:00 PM, Patrick Angeles wrote:
On Thu, May 28, 2009 at 10:24 AM, Brian Bockelman wrote:
We do both -- push the disk image out to NFS and have mirrored SAS hard
drives on the namenode. The SAS drives appear to be overkill.
This sounds like a nice approach, takin
On Thu, May 28, 2009 at 10:24 AM, Brian Bockelman wrote:
>
> We do both -- push the disk image out to NFS and have mirrored SAS hard
> drives on the namenode. The SAS drives appear to be overkill.
>
This sounds like a nice approach, taking into account hardware, labor and
downtime costs... $70
On Tue, May 26, 2009 at 7:50 PM, Malcolm Matalka <
mmata...@millennialmedia.com> wrote:
> I'm using EBS volumes to have a persistent HDFS on EC2. Do I need to keep
> the master updated on how to map the internal IPs, which change as I
> understand, to a known set of host names so it knows where t
Hi,
I am trying to set up a Hadoop cluster on 512 MB machines using
Hadoop 0.18, and have followed the procedure given on the Apache Hadoop
site for a Hadoop cluster.
I included in conf/slaves two datanodes, i.e. including the namenode
virtual machine and the other virtual machine. an
Hi everyone,
I'm processing XML files, around 500MB each, with several documents.
To the map() function I pass a document from the XML file, which
takes some time to process depending on its size; I'm applying NER to
the texts.
Each document has a unique identifier, so I'm using that identifier as
a
Olivier,
Append is not supported or recommended at this point. You can turn it
on via dfs.support.append in hdfs-site.xml under 0.20.0. There have
been some issues making it reliable. If this is not production code
or a production job then turning it on will probably have no
detrimental
On May 28, 2009, at 10:32 AM, Ian Soboroff wrote:
Brian Bockelman writes:
Despite my trying, I've never been able to come even close to pegging
the CPUs on our NN.
I'd recommend going for the fastest dual-cores which are affordable --
latency is king.
Clue?
Surely the latencies in Had
Brian Bockelman writes:
> Despite my trying, I've never been able to come even close to pegging
> the CPUs on our NN.
>
> I'd recommend going for the fastest dual-cores which are affordable --
> latency is king.
Clue?
Surely the latencies in Hadoop that dominate are not cured with faster
proce
I believe PIG, and I know Cascading use a kind of 'spillable' list
that can be re-iterated across. PIG's version is a bit more
sophisticated last I looked.
that said, if you were using either one of them, you wouldn't need to
write your own many-to-many join.
cheers,
ckw
On May 28, 2009,
One last possible trick to consider:
If you were to subclass SequenceFileRecordReader, you'd have access to its
seek method, allowing you to rewind the reducer input. You could then
implement a block hash join with something like the following pseudocode:
ahash = new HashMap();
while (i have ram
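The pseudocode above is cut off; a self-contained, untested sketch of the
same block hash join idea (names and types are illustrative, not the
poster's):

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Stand-in for reducer input made rewindable via the
// SequenceFileRecordReader.seek() trick described above; not a Hadoop API.
interface RewindableIterator<T> extends Iterator<T> {
    void rewind();
}

public class BlockHashJoin {
    // Joins [key, value] string pairs: blocks of A are hashed in RAM,
    // and all of B is streamed (then rewound) once per block.
    public static void join(Iterator<String[]> aSide,
                            RewindableIterator<String[]> bSide,
                            int blockSize) {
        Map<String, String> ahash = new HashMap<String, String>();
        while (aSide.hasNext()) {
            ahash.clear();
            // 1. Fill the hash with as many A records as memory allows.
            while (aSide.hasNext() && ahash.size() < blockSize) {
                String[] rec = aSide.next();
                ahash.put(rec[0], rec[1]);
            }
            // 2. Rewind B and stream it against this block.
            bSide.rewind();
            while (bSide.hasNext()) {
                String[] rec = bSide.next();
                String aValue = ahash.get(rec[0]);
                if (aValue != null) {
                    System.out.println(rec[0] + "\t" + aValue + "\t" + rec[1]);
                }
            }
        }
    }
}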
Hi Stuart,
It seems to me like you have a few options.
Option 1: Just use a lot of RAM. Unless you really expect many millions of
entries on both sides of the join, you might be able to get away with
buffering despite its inefficiency.
Option 2: Use LocalDirAllocator to find some local storage t
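For Option 2, a hedged fragment (the context string is the standard
task-local property; the file name is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.LocalDirAllocator;
import org.apache.hadoop.fs.Path;

public class SpillFileLocator {
    public static Path locate(Configuration conf) throws java.io.IOException {
        // Picks a directory from those listed under mapred.local.dir.
        LocalDirAllocator alloc = new LocalDirAllocator("mapred.local.dir");
        return alloc.getLocalPathForWrite("join-spill.tmp", conf);  // hypothetical name
    }
}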
Did you restart Hadoop? Sorry, I'm stuck in the middle of something so
I can't give this more attention. I can assure you, however, that we have
append working in our POC ... and the code isn't that much different
from what you have posted.
-sd
On Thu, May 28, 2009 at 3:31 PM, Olivier Smadja wrote:
>
On May 28, 2009, at 5:15 AM, Stuart White wrote:
I need to process a dataset that contains text records of fixed length
in bytes. For example, each record may be 100 bytes in length
The update to the terasort example has an InputFormat that does
exactly that. The key is 10 bytes and the val
Thanks Sasha,
My hdfs-site.xml now looks like this (the hadoop-site.xml seems to be
deprecated):
  <property>
    <name>dfs.support.append</name>
    <value>true</value>
  </property>
But I continue receiving the exception.
Checking the Hadoop source code, I saw:
public FSDataOutputStream append(Path f, int bufferSize,
Hi Walter,
On Thu, May 28, 2009 at 6:52 AM, walter steffe wrote:
> Hello
> I am a new user and I would like to use hadoop streaming with
> SequenceFile in both input and output side.
>
> -The first difficulty arises from the lack of a simple tool to generate
> a SequenceFile starting from a set
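For the quoted first difficulty, a small untested sketch of such a
generator (one key/value pair per input file; paths come from the command
line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DirToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path(args[1]), Text.class, BytesWritable.class);
        for (FileStatus stat : fs.listStatus(new Path(args[0]))) {
            if (stat.isDir()) continue;
            byte[] buf = new byte[(int) stat.getLen()];
            FSDataInputStream in = fs.open(stat.getPath());
            in.readFully(buf);
            in.close();
            // key = file name, value = raw file contents
            writer.append(new Text(stat.getPath().getName()), new BytesWritable(buf));
        }
        writer.close();
    }
}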
On May 28, 2009, at 5:02 AM, Steve Loughran wrote:
Patrick Angeles wrote:
Sorry for cross-posting, I realized I sent the following to the
hbase list
when it's really more a Hadoop question.
This is an interesting question. Obviously as an HP employee you
must assume that I'm biased when
http://www.mail-archive.com/core-user@hadoop.apache.org/msg10002.html
On Thu, May 28, 2009 at 3:03 PM, Olivier Smadja wrote:
> Hi Sasha!
>
> Thanks for the quick answer. Is there a simple way to search the mailing
> list, by text or by author?
>
> At http://mail-archives.apache.org/mod_mbox/hado
Hi Sasha!
Thanks for the quick answer. Is there a simple way to search the mailing
list, by text or by author?
At http://mail-archives.apache.org/mod_mbox/hadoop-core-user/ I only see a
browse per month...
Thanks,
Olivier
On Thu, May 28, 2009 at 10:57 AM, Sasha Dolgy wrote:
> append isn't su
Append isn't supported without modifying the configuration file for
Hadoop. Check out the mailing list threads ... I've sent a post in
the past explaining how to enable it.
On Thu, May 28, 2009 at 2:46 PM, Olivier Smadja wrote:
> Hello,
>
> I'm trying hadoop for the first time and I'm just tryi
Hi Stuart,
There isn't an InputFormat that comes with Hadoop to do this. Rather
than pre-processing the file, it would be better to implement your own
InputFormat. Subclass FileInputFormat and provide an implementation of
getRecordReader() that returns your implementation of RecordReader to
read f
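A hedged sketch of what that might look like on the old mapred API (record
length hard-coded to the 100 bytes from the question; splitting is avoided
by making files unsplittable):

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class FixedLengthInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

    private static final int RECORD_LENGTH = 100;

    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;  // keep it simple: one mapper per file
    }

    @Override
    public RecordReader<LongWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        final FileSplit fileSplit = (FileSplit) split;
        FileSystem fs = fileSplit.getPath().getFileSystem(job);
        final FSDataInputStream in = fs.open(fileSplit.getPath());
        final long length = fileSplit.getLength();

        return new RecordReader<LongWritable, BytesWritable>() {
            private long pos = 0;

            public boolean next(LongWritable key, BytesWritable value) throws IOException {
                if (pos + RECORD_LENGTH > length) return false;
                byte[] buf = new byte[RECORD_LENGTH];
                in.readFully(pos, buf);        // one fixed-length record
                key.set(pos / RECORD_LENGTH);  // record number as the key
                value.set(buf, 0, RECORD_LENGTH);
                pos += RECORD_LENGTH;
                return true;
            }

            public LongWritable createKey() { return new LongWritable(); }
            public BytesWritable createValue() { return new BytesWritable(); }
            public long getPos() { return pos; }
            public float getProgress() { return length == 0 ? 1.0f : (float) pos / length; }
            public void close() throws IOException { in.close(); }
        };
    }
}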
Hello,
I'm trying hadoop for the first time and I'm just trying to create a file
and append some text in it with the following code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
Thanks, Koji. This is the issue I am facing and I have been using
version 0.18.x.
-/Pankaj
Koji Noguchi wrote:
Maybe
https://issues.apache.org/jira/browse/HADOOP-3792 ?
Koji
-----Original Message-----
From: pankaj jairath [mailto:pjair...@yahoo-inc.com]
Sent: Thursday, May 28, 2009 4:49 AM
I need to process a dataset that contains text records of fixed length
in bytes. For example, each record may be 100 bytes in length, with
the first field being the first 10 bytes, the second field being the
second 10 bytes, etc. There are no newlines in the file. Field
values have been either
I need to do a reduce-side join of two datasets. It's a many-to-many
join; that is, each dataset can contain multiple records with any given
key.
Every description of a reduce-side join I've seen involves
constructing your keys in your mapper such that records from one
dataset will be presented t
Maybe
https://issues.apache.org/jira/browse/HADOOP-3792 ?
Koji
-----Original Message-----
From: pankaj jairath [mailto:pjair...@yahoo-inc.com]
Sent: Thursday, May 28, 2009 4:49 AM
To: core-user@hadoop.apache.org
Subject: Issue with usage of fs -test
Hello,
I am facing a strange issue, where i
Hello,
I am facing a strange issue where fs -test -e fails and fs -ls
succeeds in listing the file. Following is a grep of such a result:
bin]$ hadoop fs -ls /projects/myproject///.done
Found 1 items
-rw--- 3 user hdfs 0 2009-03-19 22:28
/projects/mypro
Patrick Angeles wrote:
Sorry for cross-posting, I realized I sent the following to the hbase list
when it's really more a Hadoop question.
This is an interesting question. Obviously as an HP employee you must
assume that I'm biased when I say HP DL160 servers are good value for
the workers,
If your reducer does not write anything, you could look at NullOutputFormat
as well.
Jothi
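On the old mapred API that is a one-line switch, e.g. (untested):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NullOutputFormat;

public class NoOutputJob {
    public static JobConf configure() {
        JobConf conf = new JobConf();
        conf.setOutputFormat(NullOutputFormat.class);  // discards anything collected
        return conf;
    }
}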
On 5/28/09 1:38 PM, "tim robertson" wrote:
> Yes you can do this.
>
> It is complaining because you are not declaring the output types in
> the method signature, but you will not use them anyway.
>
> S
Yes you can do this.
It is complaining because you are not declaring the output types in
the method signature, but you will not use them anyway.
So please try
private static class Reducer extends MapReduceBase implements
Reducer {
...
The output format will be a TextOutputFormat, but it will no
Hi,
I have maps that do most of the work, and they output the data into a
reducer; so basically the key is a constant, and the reducer combines all
the input from the maps into a file and does a "LOAD DATA" of the file into
a MySQL db. So there won't be any output.collect() in the reducer
function. But whe
Hi all,
I have a 50-node cluster and I am trying to write logs of size 1GB
each into HDFS. I need to write them in a temporal fashion: say, for every
15 minutes' worth of data, I close the previously opened file and create
a new file. The snippet of code is
if()
{
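For reference, a hedged sketch of the rolling-file pattern described above
(all names and paths are hypothetical):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RollingHdfsWriter {
    private static final long WINDOW_MS = 15 * 60 * 1000L;  // 15 minutes

    private final FileSystem fs;
    private FSDataOutputStream out;
    private long windowStart = -1;

    public RollingHdfsWriter(Configuration conf) throws IOException {
        this.fs = FileSystem.get(conf);
    }

    public void write(byte[] record, long now) throws IOException {
        if (out == null || now - windowStart >= WINDOW_MS) {
            if (out != null) {
                out.close();  // seal the previous 15-minute file
            }
            windowStart = now;
            out = fs.create(new Path("/logs/log-" + windowStart));  // hypothetical path
        }
        out.write(record);
    }
}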