Re: Are hadoop fs commands serial or parallel

2011-05-20 Thread Dieter Plaetinck
What do you mean clunky?
IMHO this is a quite elegant, simple, working solution.
Sure this spawns multiple processes, but it beats any
api-overcomplications, imho.

Dieter


On Wed, 18 May 2011 11:39:36 -0500
Patrick Angeles patr...@cloudera.com wrote:

 Kinda clunky, but you could do this via the shell:

 for FILE in $LIST_OF_FILES ; do
   hadoop fs -copyFromLocal $FILE $DEST_PATH &
 done

 If doing this via the Java API, then yes, you will have to use
 multiple threads.
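
A minimal sketch of the threaded Java route, assuming a shared FileSystem client and a fixed-size pool (the destination path and pool size are illustrative):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelCopy {
  public static void main(String[] args) throws Exception {
    final FileSystem fs = FileSystem.get(new Configuration());
    final Path dest = new Path("/user/jj/incoming");         // illustrative destination
    ExecutorService pool = Executors.newFixedThreadPool(4);  // number of concurrent copies
    for (final String localFile : args) {                    // each argument is a local file
      pool.submit(new Runnable() {
        public void run() {
          try {
            // Each call still streams its file serially; parallelism comes from the pool.
            fs.copyFromLocalFile(new Path(localFile), dest);
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }
}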
 
 On Wed, May 18, 2011 at 1:04 AM, Mapred Learn
 mapred.le...@gmail.comwrote:
 
   Thanks Harsh!
  That means basically both APIs as well as hadoop client commands
  allow only serial writes.
  I was wondering what could be other ways to write data in parallel
  to HDFS other than using multiple parallel threads.
 
  Thanks,
  JJ
 
  Sent from my iPhone
 
  On May 17, 2011, at 10:59 PM, Harsh J ha...@cloudera.com wrote:
 
   Hello,
  
    Adding to Joey's response, copyFromLocal's current implementation is serial
   given a list of files.
  
   On Wed, May 18, 2011 at 9:57 AM, Mapred Learn
   mapred.le...@gmail.com wrote:
    Thanks Joey!
    I will try to find out about copyFromLocal. Looks like the Hadoop APIs
    write serially, as you pointed out.
  
   Thanks,
   -JJ
  
   On May 17, 2011, at 8:32 PM, Joey Echeverria j...@cloudera.com
   wrote:
  
   The sequence file writer definitely does it serially as you can
   only ever write to the end of a file in Hadoop.
  
   Doing copyFromLocal could write multiple files in parallel (I'm
   not sure if it does or not), but a single file would be written
   serially.
  
   -Joey
  
   On Tue, May 17, 2011 at 5:44 PM, Mapred Learn
   mapred.le...@gmail.com
   wrote:
   Hi,
    My question is: when I run a command from the HDFS client, e.g. hadoop fs
    -copyFromLocal, or create a sequence file writer in Java code and append
    key/values to it through the Hadoop APIs, does it internally transfer/write
    data to HDFS serially or in parallel?
  
   Thanks in advance,
   -JJ
  
  
  
  
   --
   Joseph Echeverria
   Cloudera, Inc.
   443.305.9434
  
  
   --
   Harsh J
 



Re: outputCollector vs. Localfile

2011-05-20 Thread Harsh J
Mark,

On Fri, May 20, 2011 at 10:17 AM, Mark question markq2...@gmail.com wrote:
 This is puzzling me ...

  With a mapper producing output of size ~400 MB ... which one is supposed
 to be faster?

  1) the output collector, which will write to a local file and then copy to HDFS,
 since I don't have reducers.

A regular map-only job does not write to the local FS, it writes to
the HDFS directly (i.e., a local DN if one is found).

-- 
Harsh J


Why Only 1 Reducer is running ??

2011-05-20 Thread praveenesh kumar
Hello everyone,

I am using the wordcount application to test my Hadoop cluster of 5 nodes.
The file size is around 5 GB.
It's taking around 2 min 40 sec to execute.
But when I check the JobTracker web portal, I see that only one
reducer is running. Why is that?
How can I change the code so that it runs multiple reducers?

Thanks,
Praveenesh


Re: Why Only 1 Reducer is running ??

2011-05-20 Thread James Seigel Tynt
The job could be designed to use one reducer

On 2011-05-20, at 7:19 AM, praveenesh kumar praveen...@gmail.com wrote:

 Hello everyone,
 
 I am using wordcount application to test on my hadoop cluster of 5 nodes.
 The file size is around 5 GB.
 Its taking around 2 min - 40 sec for execution.
 But when I am checking the JobTracker web portal, I am seeing only one
 reducer is running. Why so  ??
 How can I change the code so that I will run multiple reducers also ??
 
 Thanks,
 Praveenesh


Re: Why Only 1 Reducer is running ??

2011-05-20 Thread praveenesh kumar
I am using the wordcount example that comes with Hadoop.
How can I configure it to use multiple reducers?
I guess multiple reducers will make it run faster .. will it?


On Fri, May 20, 2011 at 6:51 PM, James Seigel Tynt ja...@tynt.com wrote:

 The job could be designed to use one reducer

 On 2011-05-20, at 7:19 AM, praveenesh kumar praveen...@gmail.com wrote:

  Hello everyone,
 
  I am using wordcount application to test on my hadoop cluster of 5 nodes.
  The file size is around 5 GB.
  Its taking around 2 min - 40 sec for execution.
  But when I am checking the JobTracker web portal, I am seeing only one
  reducer is running. Why so  ??
  How can I change the code so that I will run multiple reducers also ??
 
  Thanks,
  Praveenesh



RE: Why Only 1 Reducer is running ??

2011-05-20 Thread Evert Lammerts
Hi Praveenesh,

* You can set the maximum number of reduce tasks per node in your mapred-site.xml 
using mapred.tasktracker.reduce.tasks.maximum (default set to 2).
* You can set the default number of reduce tasks with mapred.reduce.tasks 
(default set to 1 - this is what causes your single reducer).
* Your job can override this setting by calling Job.setNumReduceTasks(int)
(http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Job.html#setNumReduceTasks(int)),
as sketched below.
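
A minimal sketch of such a driver, assuming the mapper and reducer classes of the bundled WordCount example are reusable (the driver class name and argument handling are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.WordCount;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiReducerWordCount {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "wordcount");
    job.setJarByClass(MultiReducerWordCount.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);  // example's mapper, assumed public
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setNumReduceTasks(10);                             // override the default of 1
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

If you are running the stock example jar, passing -D mapred.reduce.tasks=10 on the command line should have the same effect, since the example driver parses the generic options.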

Cheers,
Evert


 -Original Message-
 From: modemide [mailto:modem...@gmail.com]
 Sent: vrijdag 20 mei 2011 15:26
 To: common-user@hadoop.apache.org
 Subject: Re: Why Only 1 Reducer is running ??

 what does your mapred-site.xml file say?

 I've used wordcount and had close to 12 reduces running on a 6
 datanode cluster on a 3 GB file.


 I have a configuration in there which says:
 mapred.reduce.tasks = 12

 The reason I chose 12 was because it was recommended that I choose 2x
 number of tasktrackers.





 On 5/20/11, praveenesh kumar praveen...@gmail.com wrote:
  Hello everyone,
 
  I am using wordcount application to test on my hadoop cluster of 5
 nodes.
  The file size is around 5 GB.
  Its taking around 2 min - 40 sec for execution.
  But when I am checking the JobTracker web portal, I am seeing only
 one
  reducer is running. Why so  ??
  How can I change the code so that I will run multiple reducers also
 ??
 
  Thanks,
  Praveenesh
 


Re: Are hadoop fs commands serial or parallel

2011-05-20 Thread Brian Bockelman

On May 20, 2011, at 6:10 AM, Dieter Plaetinck wrote:

 What do you mean clunky?
 IMHO this is a quite elegant, simple, working solution.

Try giving it to a user; watch them feed it a list of 10,000 files; watch the 
machine swap to death and the disks uselessly thrash.

 Sure this spawns multiple processes, but it beats any
 api-overcomplications, imho.
 

Simple doesn't imply scalable, unfortunately.

Brian





Configuring jvm metrics in hadoop-0.20.203.0

2011-05-20 Thread Matyas Markovics

Hi,
I am trying to get jvm metrics from the new verison of hadoop.
I have read the migration instructions and come up with the following
content for hadoop-metrics2.properties:

*.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
jvm.sink.file.period=2
jvm.sink.file.filename=/home/ec2-user/jvmmetrics.log

Any help would be appreciated even if you have a different approach to
get memory usage from reducers.

Thanks in advance.
--
Best Regards,
Matyas Markovics


Re: outputCollector vs. Localfile

2011-05-20 Thread Mark question
I thought it was, because of FileBytesWritten counter. Thanks for the
clarification.
Mark

On Fri, May 20, 2011 at 4:23 AM, Harsh J ha...@cloudera.com wrote:

 Mark,

 On Fri, May 20, 2011 at 10:17 AM, Mark question markq2...@gmail.com
 wrote:
  This is puzzling me ...
 
   With a mapper producing output of size ~ 400 MB ... which one is
 supposed
  to be faster?
 
   1) output collector: which will write to local file then copy to HDFS
 since
  I don't have reducers.

 A regular map-only job does not write to the local FS, it writes to
 the HDFS directly (i.e., a local DN if one is found).

 --
 Harsh J



What's the easiest way to count the number of Key, Value pairs in a directory?

2011-05-20 Thread W.P. McNeill
I've got a directory with a bunch of MapReduce data in it.  I want to know
how many <Key, Value> pairs it contains.  I could write a mapper-only
process that takes <Writable, Writable> pairs as input and updates a
counter, but it seems like this utility should already exist.  Does it, or
do I have to roll my own?

Bonus question: is there a way to count the number of <Key, Value> pairs
without deserializing the values?  This can be expensive for the data I'm
working with.


Can I number output results with a Counter?

2011-05-20 Thread Mark Kerzner
Hi, can I use a Counter to give each record in all reducers a consecutive
number? Currently I am using a single Reducer, but it is an anti-pattern.
But I need to assign consecutive numbers to all output records in all
reducers, and it does not matter how, as long as each gets its own number.

If it IS possible, then how do multiple processes access those counters
without creating race conditions?

Thank you,

Mark


Re: What's the easiest way to count the number of Key, Value pairs in a directory?

2011-05-20 Thread Joey Echeverria
What format is the input data in?

At first glance, I would run an identity mapper and use a
NullOutputFormat so you don't get any data written. The built in
counters already count the number of key, value pairs read in by the
mappers.
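
A minimal sketch of that job with the old mapred API, assuming sequence-file input (swap in whatever InputFormat matches how the data was actually written; the class name is made up):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.NullOutputFormat;

public class CountPairs {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CountPairs.class);
    conf.setJobName("count pairs");
    conf.setMapperClass(IdentityMapper.class);           // pass records straight through
    conf.setNumReduceTasks(0);                           // map-only
    conf.setInputFormat(SequenceFileInputFormat.class);  // assumption: match this to your data
    conf.setOutputFormat(NullOutputFormat.class);        // discard all output
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    RunningJob job = JobClient.runJob(conf);
    Counters counters = job.getCounters();
    System.out.println(counters);                        // "Map input records" is the pair count
  }
}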

-Joey

On Fri, May 20, 2011 at 9:34 AM, W.P. McNeill bill...@gmail.com wrote:
 I've got a directory with a bunch of MapReduce data in it.  I want to know
 how many Key, Value pairs it contains.  I could write a mapper-only
 process that takes Writeable, Writeable pairs as input and updates a
 counter, but it seems like this utility should already exist.  Does it, or
 do I have to roll my own?

 Bonus question, is there a way to count the number of Key, Value pairs
 without deserializing the values?  This can be expensive for the data I'm
 working with.




-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Re: What's the easiest way to count the number of Key, Value pairs in a directory?

2011-05-20 Thread James Seigel
The cheapest way would be to check the counters as you write them in
the first place and keep a running score. :)

Sent from my mobile. Please excuse the typos.

On 2011-05-20, at 10:35 AM, W.P. McNeill bill...@gmail.com wrote:

 I've got a directory with a bunch of MapReduce data in it.  I want to know
 how many Key, Value pairs it contains.  I could write a mapper-only
 process that takes Writeable, Writeable pairs as input and updates a
 counter, but it seems like this utility should already exist.  Does it, or
 do I have to roll my own?

 Bonus question, is there a way to count the number of Key, Value pairs
 without deserializing the values?  This can be expensive for the data I'm
 working with.


Re: Can I number output results with a Counter?

2011-05-20 Thread Joey Echeverria
To make sure I understand you correctly, you need a globally unique,
one-up counter for each output record?

If you have an upper bound on the number of records a single reducer
can output and you can afford to have gaps, you could just multiply the
task id by that maximum and count up from there.
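
A minimal sketch of that scheme with the new-API Reducer (the class name, key/value types and per-reducer bound are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Each reducer claims the range [taskId * MAX_PER_REDUCER, (taskId + 1) * MAX_PER_REDUCER):
// numbers are globally unique without coordination, but leave gaps.
public class NumberingReducer extends Reducer<Text, Text, LongWritable, Text> {

  private static final long MAX_PER_REDUCER = 1000000000L; // assumed upper bound per reducer
  private long next;

  @Override
  protected void setup(Context context) {
    int taskId = context.getTaskAttemptID().getTaskID().getId(); // 0-based reduce task number
    next = taskId * MAX_PER_REDUCER;
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      context.write(new LongWritable(next++), value);     // hand out the next number in this block
    }
  }
}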

If that doesn't work for you, then you'll need to use some kind of
central service for allocating numbers which could become a
bottleneck.

-Joey

On Fri, May 20, 2011 at 9:55 AM, Mark Kerzner markkerz...@gmail.com wrote:
 Hi, can I use a Counter to give each record in all reducers a consecutive
 number? Currently I am using a single Reducer, but it is an anti-pattern.
 But I need to assign consecutive numbers to all output records in all
 reducers, and it does not matter how, as long as each gets its own number.

 If it IS possible, then how are multiple processes accessing those counters
 without creating race conditions.

 Thank you,

 Mark




-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Re: Can I number output results with a Counter?

2011-05-20 Thread Mark Kerzner
Joey,

You understood me perfectly well. I see your first suggestion, but I am not
allowed to have gaps. A central service is something I may consider if the
single reducer becomes a worse bottleneck than the service would be.

But what are counters for? They seem to be exactly that.

Mark

On Fri, May 20, 2011 at 12:01 PM, Joey Echeverria j...@cloudera.com wrote:

 To make sure I understand you correctly, you need a globally unique
 one up counter for each output record?

 If you had an upper bound on the number of records a single reducer
 could output and you can afford to have gaps, you could just use the
 task id and multiply that by the max number of records and then one up
 from there.

 If that doesn't work for you, then you'll need to use some kind of
 central service for allocating numbers which could become a
 bottleneck.

 -Joey

 On Fri, May 20, 2011 at 9:55 AM, Mark Kerzner markkerz...@gmail.com
 wrote:
  Hi, can I use a Counter to give each record in all reducers a consecutive
  number? Currently I am using a single Reducer, but it is an anti-pattern.
  But I need to assign consecutive numbers to all output records in all
  reducers, and it does not matter how, as long as each gets its own
 number.
 
  If it IS possible, then how are multiple processes accessing those
 counters
  without creating race conditions.
 
  Thank you,
 
  Mark
 



 --
 Joseph Echeverria
 Cloudera, Inc.
 443.305.9434



Re: What's the easiest way to count the number of Key, Value pairs in a directory?

2011-05-20 Thread W.P. McNeill
The keys are Text and the values are large custom data structures serialized
with Avro.

I also have counters for the job that generates these files that gives me
this information but sometimes...Well, it's a long story.  Suffice to say
that it's nice to have a post-hoc method too.  :-)

The identity mapper sounds like the way to go.


Re: Can I number output results with a Counter?

2011-05-20 Thread Joey Echeverria
Counters are a way to get status from your running job. They don't
increment a global state. They locally save increments and
periodically report those increments to the central counter. That
means that the final count will be correct, but you can't use them to
coordinate counts while your job is running.

-Joey

On Fri, May 20, 2011 at 10:17 AM, Mark Kerzner markkerz...@gmail.com wrote:
 Joey,

 You understood me perfectly well. I see your first advice, but I am not
 allowed to have gaps. A central service is something I may consider if
 single reducer becomes a worse bottleneck than it.

 But what are counters for? They seem to be exactly that.

 Mark

 On Fri, May 20, 2011 at 12:01 PM, Joey Echeverria j...@cloudera.com wrote:

 To make sure I understand you correctly, you need a globally unique
 one up counter for each output record?

 If you had an upper bound on the number of records a single reducer
 could output and you can afford to have gaps, you could just use the
 task id and multiply that by the max number of records and then one up
 from there.

 If that doesn't work for you, then you'll need to use some kind of
 central service for allocating numbers which could become a
 bottleneck.

 -Joey

 On Fri, May 20, 2011 at 9:55 AM, Mark Kerzner markkerz...@gmail.com
 wrote:
  Hi, can I use a Counter to give each record in all reducers a consecutive
  number? Currently I am using a single Reducer, but it is an anti-pattern.
  But I need to assign consecutive numbers to all output records in all
  reducers, and it does not matter how, as long as each gets its own
 number.
 
  If it IS possible, then how are multiple processes accessing those
 counters
  without creating race conditions.
 
  Thank you,
 
  Mark
 



 --
 Joseph Echeverria
 Cloudera, Inc.
 443.305.9434





-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Re: What's the easiest way to count the number of Key, Value pairs in a directory?

2011-05-20 Thread Joey Echeverria
Are you storing the data in sequence files?

-Joey

On Fri, May 20, 2011 at 10:33 AM, W.P. McNeill bill...@gmail.com wrote:
 The keys are Text and the values are large custom data structures serialized
 with Avro.

 I also have counters for the job that generates these files that gives me
 this information but sometimes...Well, it's a long story.  Suffice to say
 that it's nice to have a post-hoc method too.  :-)

 The identity mapper sounds like the way to go.




-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Re: What's the easiest way to count the number of Key, Value pairs in a directory?

2011-05-20 Thread W.P. McNeill
No.


Re: Can I number output results with a Counter?

2011-05-20 Thread Kai Voigt
Also, with speculative execution enabled, you might see a higher count than you 
expect while the same task is running multiple times in parallel. When a task 
gets killed because another instance was quicker, its counters will be 
removed from the global count, though.

Kai

Am 20.05.2011 um 19:34 schrieb Joey Echeverria:

 Counters are a way to get status from your running job. They don't
 increment a global state. They locally save increments and
 periodically report those increments to the central counter. That
 means that the final count will be correct, but you can't use them to
 coordinate counts while your job is running.
 
 -Joey
 
 On Fri, May 20, 2011 at 10:17 AM, Mark Kerzner markkerz...@gmail.com wrote:
 Joey,
 
 You understood me perfectly well. I see your first advice, but I am not
 allowed to have gaps. A central service is something I may consider if
 single reducer becomes a worse bottleneck than it.
 
 But what are counters for? They seem to be exactly that.
 
 Mark
 
 On Fri, May 20, 2011 at 12:01 PM, Joey Echeverria j...@cloudera.com wrote:
 
 To make sure I understand you correctly, you need a globally unique
 one up counter for each output record?
 
 If you had an upper bound on the number of records a single reducer
 could output and you can afford to have gaps, you could just use the
 task id and multiply that by the max number of records and then one up
 from there.
 
 If that doesn't work for you, then you'll need to use some kind of
 central service for allocating numbers which could become a
 bottleneck.
 
 -Joey
 
 On Fri, May 20, 2011 at 9:55 AM, Mark Kerzner markkerz...@gmail.com
 wrote:
 Hi, can I use a Counter to give each record in all reducers a consecutive
 number? Currently I am using a single Reducer, but it is an anti-pattern.
 But I need to assign consecutive numbers to all output records in all
 reducers, and it does not matter how, as long as each gets its own
 number.
 
 If it IS possible, then how are multiple processes accessing those
 counters
 without creating race conditions.
 
 Thank you,
 
 Mark
 
 
 
 
 --
 Joseph Echeverria
 Cloudera, Inc.
 443.305.9434
 
 
 
 
 
 -- 
 Joseph Echeverria
 Cloudera, Inc.
 443.305.9434
 

-- 
Kai Voigt
k...@123.org






Problem: Unknown scheme hdfs. It should correspond to a JournalType enumeration value

2011-05-20 Thread Eduardo Dario Ricci
Hi people,

I'm starting out with Hadoop Common and ran into some problems trying to use a
cluster.

I'm following the steps on this page:
http://hadoop.apache.org/common/docs/r0.21.0/cluster_setup.html

I did everything, but when I format HDFS, this error happens:

I searched for something to help me, but didn't find anything.


 If someone could help me, I would be thankful.



Re-format filesystem in /fontes/cluster/namedir ? (Y or N) Y
11/05/20 16:41:40 ERROR namenode.NameNode: java.io.IOException: Unknown
scheme hdfs. It should correspond to a JournalType enumeration value
at
org.apache.hadoop.hdfs.server.namenode.FSImage.checkSchemeConsistency(FSImage.java:269)
at
org.apache.hadoop.hdfs.server.namenode.FSImage.setStorageDirectories(FSImage.java:222)
at
org.apache.hadoop.hdfs.server.namenode.FSImage.init(FSImage.java:178)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1240)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1348)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1368)

11/05/20 16:41:40 INFO namenode.NameNode: SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down NameNode at hadoop/192.168.217.134
/




-- 
 
 Eduardo Dario Ricci
   Cel: 14-81354813
 MSN: thenigma...@hotmail.com


Re: Problem: Unknown scheme hdfs. It should correspond to a JournalType enumeration value

2011-05-20 Thread Todd Lipcon
Hi Eduardo,

Sounds like you've configured your dfs.name.dir directories with hdfs:// URIs
instead of local file paths.
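
A minimal sketch of the relevant hdfs-site.xml entry, assuming dfs.name.dir is the misconfigured property (in 0.21 it may also appear as dfs.namenode.name.dir) and reusing the directory from the format prompt:

<property>
  <name>dfs.name.dir</name>
  <!-- must be a plain local path (or file:// URI), not an hdfs:// URI -->
  <value>/fontes/cluster/namedir</value>
</property>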

-Todd

On Fri, May 20, 2011 at 2:20 PM, Eduardo Dario Ricci duzas...@gmail.com wrote:
 Hy People

 I'm starting in hadoop commom.. and got some problem to try using a
 cluster..

 I'm following the steps of this page:
 http://hadoop.apache.org/common/docs/r0.21.0/cluster_setup.html

 I done everything, but when I will format the HDFS, this error happens:

 I searched for something to help-me, but didn't find nothing.


  If some guy could help-me, I will be thankfull.



 Re-format filesystem in /fontes/cluster/namedir ? (Y or N) Y
 11/05/20 16:41:40 ERROR namenode.NameNode: java.io.IOException: Unknown
 scheme hdfs. It should correspond to a JournalType enumeration value
        at
 org.apache.hadoop.hdfs.server.namenode.FSImage.checkSchemeConsistency(FSImage.java:269)
        at
 org.apache.hadoop.hdfs.server.namenode.FSImage.setStorageDirectories(FSImage.java:222)
        at
 org.apache.hadoop.hdfs.server.namenode.FSImage.init(FSImage.java:178)
        at
 org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1240)
        at
 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1348)
        at
 org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1368)

 11/05/20 16:41:40 INFO namenode.NameNode: SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down NameNode at hadoop/192.168.217.134
 /




 --
  
             Eduardo Dario Ricci
               Cel: 14-81354813
     MSN: thenigma...@hotmail.com




-- 
Todd Lipcon
Software Engineer, Cloudera


Using df instead of du to calculate datanode space

2011-05-20 Thread Joe Stein
I came up with a nice little hack to trick Hadoop into calculating disk
usage with df instead of du.

http://allthingshadoop.com/2011/05/20/faster-datanodes-with-less-wait-io-using-df-instead-of-du/

I am running this in production, works like a charm and already
seeing benefit, woot!

I hope it works well for others too.

/*
Joe Stein
http://www.twitter.com/allthingshadoop
*/


Re: Applications creates bigger output than input?

2011-05-20 Thread elton sky
Thanks Robert, Niels

Yes, I think text manipulation, especially n-grams, is a good application for
me.
Cheers

On Fri, May 20, 2011 at 12:57 AM, Robert Evans ev...@yahoo-inc.com wrote:

 I'm not sure if this has been mentioned or not, but in machine learning with
 text-based documents, the first stage is often a glorified word-count
 action, except that much of the time they will do n-grams.  So:

 Map Input:
 Hello this is a test

 Map Output:
 Hello
 This
 is
 a
 test
 Hello this
 this is
 is a
 a test
 ...


 You may also be extracting all kinds of other features from the text, but
 the tokenization/n-gram step is not that CPU intensive.
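
A minimal sketch of such a mapper, assuming plain text input and only 1- and 2-grams (the class name is made up):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits every unigram and bigram in a line, so the map output is roughly
// twice the size of the input before any counting starts.
public class NGramMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text gram = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tok = new StringTokenizer(value.toString());
    String prev = null;
    while (tok.hasMoreTokens()) {
      String word = tok.nextToken();
      gram.set(word);                    // unigram
      context.write(gram, ONE);
      if (prev != null) {
        gram.set(prev + " " + word);     // bigram
        context.write(gram, ONE);
      }
      prev = word;
    }
  }
}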

 --Bobby Evans

 On 5/19/11 3:06 AM, elton sky eltonsky9...@gmail.com wrote:

 Hello,
 I picked up this topic again because what I am looking for is something not
 CPU bound. Augmenting data for ETL and generating indexes are good examples.
 Neither of them requires much CPU time on the map side. The main bottleneck
 for them is shuffle and merge.

 Market basket analysis is CPU intensive in the map phase, since it samples all
 possible combinations of items.

 I am still looking for more applications, which creates bigger output and
 not CPU bound.
 Any further idea? I appreciate.


 On Tue, May 3, 2011 at 3:10 AM, Steve Loughran ste...@apache.org wrote:

  On 30/04/2011 05:31, elton sky wrote:
 
  Thank you for suggestions:
 
  Weblog analysis, market basket analysis and generating search index.
 
  I guess for these applications we need more reducers than maps, to handle
  the large intermediate output, don't we? Besides, the input split for each map
  should be smaller than usual, because the workload for spill and merge on the
  map's local disk is heavy.
 
 
  any form of rendering can generate very large images
 
  see: http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf