Re: Are SequenceFiles split? If so, how?

2009-04-20 Thread Jim Twensky
In addition to what Aaron mentioned, you can configure the minimum split
size in hadoop-site.xml to have smaller or larger input splits depending on
your application.

-Jim
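A hedged sketch of what that override could look like in hadoop-site.xml (the property name is the one used in the 0.18/0.19 era; the value, 128 MB in bytes, is only illustrative):

<property>
  <name>mapred.min.split.size</name>
  <!-- lower bound on split size in bytes; raise it to get fewer, larger splits -->
  <value>134217728</value>
</property>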

On Mon, Apr 20, 2009 at 12:18 AM, Aaron Kimball aa...@cloudera.com wrote:

 Yes, there can be more than one InputSplit per SequenceFile. The file will
 be split more-or-less along 64 MB boundaries. (the actual edges of the
 splits will be adjusted to hit the next block of key-value pairs, so it
 might be a few kilobytes off.)

 The SequenceFileInputFormat regards mapred.map.tasks
 (conf.setNumMapTasks())
 as a hint, not a set-in-stone metric. (The number of reduce tasks, though,
 is always 100% user-controlled.) If you need exact control over the number
 of map tasks, you'll need to subclass it and modify this behavior. That
 having been said -- are you sure you actually need to precisely control
 this
 value? Or is it enough to know how many splits were created?

 - Aaron
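If exact control really is needed, a minimal sketch of the kind of subclass Aaron mentions, assuming the old mapred API of the 0.19 era (the class name is illustrative); returning false from isSplitable gives exactly one map task per input file:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class NonSplittingSequenceFileInputFormat
    extends SequenceFileInputFormat<BytesWritable, BytesWritable> {

  // One split, and therefore one map task, per file regardless of its size.
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}

It would be selected with conf.setInputFormat(NonSplittingSequenceFileInputFormat.class).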

 On Sun, Apr 19, 2009 at 7:23 PM, Barnet Wagman b.wag...@comcast.net
 wrote:

  Suppose a SequenceFile (containing keys and values that are
 BytesWritable)
  is used as input. Will it be divided into InputSplits? If so, what are the
  criteria used for splitting?
 
  I'm interested in this because I need to control the number of map tasks
  used, which (if I understand it correctly), is equal to the number of
  InputSplits.
 
  thanks,
 
  bw
 



Re: getting DiskErrorException during map

2009-04-16 Thread Jim Twensky
Yes, here is how it looks:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/scratch/local/jim/hadoop-${user.name}</value>
</property>

so I don't know why it still writes to /tmp. As a temporary workaround, I
created a symbolic link from /tmp/hadoop-jim to /scratch/...
and it works fine now, but if you think this might be considered a bug,
I can report it.

Thanks,
Jim
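A small sketch for checking which values the daemons actually resolve (the class name is made up for illustration; it assumes the 0.19 API and that it is run with bin/hadoop CLASSNAME on each node so the same conf directory and classpath are used):

import org.apache.hadoop.mapred.JobConf;

public class PrintLocalDirs {
  public static void main(String[] args) {
    // Loads hadoop-default.xml and hadoop-site.xml from the classpath.
    JobConf conf = new JobConf();
    System.out.println("hadoop.tmp.dir   = " + conf.get("hadoop.tmp.dir"));
    System.out.println("mapred.local.dir = " + conf.get("mapred.local.dir"));
  }
}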


On Thu, Apr 16, 2009 at 12:44 PM, Alex Loddengaard a...@cloudera.com wrote:

 Have you set hadoop.tmp.dir away from /tmp as well?  If hadoop.tmp.dir is
 set somewhere in /scratch vs. /tmp, then I'm not sure why Hadoop would be
 writing to /tmp.

 Hope this helps!

 Alex

 On Wed, Apr 15, 2009 at 2:37 PM, Jim Twensky jim.twen...@gmail.com
 wrote:

  Alex,
 
  Yes, I bounced the Hadoop daemons after I changed the configuration
 files.
 
  I also tried setting  $HADOOP_CONF_DIR to the directory where my
  hadoop-site.xml file resides but it didn't work.
  However, I'm sure that HADOOP_CONF_DIR is not the issue because other
  properties that I changed in hadoop-site.xml
  seem to be properly set. Also, here is a section from my hadoop-site.xml
  file:
 
 <property>
   <name>hadoop.tmp.dir</name>
   <value>/scratch/local/jim/hadoop-${user.name}</value>
 </property>
 <property>
   <name>mapred.local.dir</name>
   <value>/scratch/local/jim/hadoop-${user.name}/mapred/local</value>
 </property>
 
  I also created /scratch/local/jim/hadoop-jim/mapred/local on each task
  tracker since I know
  directories that do not exist are ignored.
 
  When I manually ssh to the task trackers, I can see the directory
  /scratch/local/jim/hadoop-jim/dfs
  is automatically created, so it seems like hadoop.tmp.dir is set
  properly. However, hadoop still creates
  /tmp/hadoop-jim/mapred/local and uses that directory for the local
 storage.
 
  I'm starting to suspect that mapred.local.dir is overwritten to a default
  value of /tmp/hadoop-${user.name}
  somewhere inside the binaries.
 
  -jim
 
  On Tue, Apr 14, 2009 at 4:07 PM, Alex Loddengaard a...@cloudera.com
  wrote:
 
   First, did you bounce the Hadoop daemons after you changed the
   configuration
   files?  I think you'll have to do this.
  
   Second, I believe 0.19.1 has hadoop-default.xml baked into the jar.
  Try
   setting $HADOOP_CONF_DIR to the directory where hadoop-site.xml lives.
   For
   whatever reason your hadoop-site.xml (and the hadoop-default.xml you
  tried
   to change) are probably not being loaded.  $HADOOP_CONF_DIR should fix
   this.
  
   Good luck!
  
   Alex
  
   On Mon, Apr 13, 2009 at 11:25 AM, Jim Twensky jim.twen...@gmail.com
   wrote:
  
Thank you Alex, you are right. There are quotas on the systems that I'm
working on. However, I tried to change mapred.local.dir as follows:
   
--inside hadoop-site.xml:
   
    <property>
      <name>mapred.child.tmp</name>
      <value>/scratch/local/jim</value>
    </property>
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/scratch/local/jim</value>
    </property>
    <property>
      <name>mapred.local.dir</name>
      <value>/scratch/local/jim</value>
    </property>
   
 and observed that the intermediate map outputs are still being
 written
under /tmp/hadoop-jim/mapred/local
   
I'm confused at this point since I also tried setting these values
   directly
inside the hadoop-default.xml and that didn't work either. Is there
 any
other property that I'm supposed to change? I tried searching for
  /tmp
   in
the hadoop-default.xml file but couldn't find anything else.
   
Thanks,
Jim
   
   
On Tue, Apr 7, 2009 at 9:35 PM, Alex Loddengaard a...@cloudera.com
wrote:
   
 The getLocalPathForWrite function that throws this Exception
 assumes
   that
 you have space on the disks that mapred.local.dir is configured on.
Can
 you
 verify with `df` that those disks have space available?  You might
  also
try
 moving mapred.local.dir off of /tmp if it's configured to use /tmp
   right
 now; I believe some systems have quotas on /tmp.

 Hope this helps.

 Alex

 On Tue, Apr 7, 2009 at 7:22 PM, Jim Twensky jim.twen...@gmail.com
 
wrote:

  Hi,
 
  I'm using Hadoop 0.19.1 and I have a very small test cluster with
 9
 nodes,
  8
  of them being task trackers. I'm getting the following error and
 my
jobs
  keep failing when map processes start hitting 30%:
 
  org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any
  valid local directory for
  taskTracker/jobcache/job_200904072051_0001/attempt_200904072051_0001_m_00_1/output/file.out
      at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)

Re: Hadoop basic question

2009-04-16 Thread Jim Twensky
http://wiki.apache.org/hadoop/FAQ#7
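One mitigation commonly cited at the time was to have the namenode write its metadata to more than one directory, one of them on a remote mount; a hedged config sketch (property name as in that era, paths illustrative):

<property>
  <name>dfs.name.dir</name>
  <!-- comma-separated list; the image and edit log are written to every listed directory -->
  <value>/local/hadoop/name,/mnt/remote-nfs/hadoop/name</value>
</property>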

On Thu, Apr 16, 2009 at 6:52 PM, Jae Joo jaejo...@gmail.com wrote:

 Will anyone guide me on how to avoid the single point of failure of the master
 node?
 This is what I know: if the master node is down for some reason, the Hadoop
 system is down and there is no way to have a failover system for the master node.
 Please correct me if I am not understanding correctly.

 Jae



Re: getting DiskErrorException during map

2009-04-15 Thread Jim Twensky
Alex,

Yes, I bounced the Hadoop daemons after I changed the configuration files.

I also tried setting  $HADOOP_CONF_DIR to the directory where my
hadoop-site.xml file resides but it didn't work.
However, I'm sure that HADOOP_CONF_DIR is not the issue because other
properties that I changed in hadoop-site.xml
seem to be properly set. Also, here is a section from my hadoop-site.xml
file:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/scratch/local/jim/hadoop-${user.name}</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/scratch/local/jim/hadoop-${user.name}/mapred/local</value>
</property>

I also created /scratch/local/jim/hadoop-jim/mapred/local on each task
tracker since I know
directories that do not exist are ignored.

When I manually ssh to the task trackers, I can see the directory
/scratch/local/jim/hadoop-jim/dfs
is automatically created, so it seems like hadoop.tmp.dir is set
properly. However, hadoop still creates
/tmp/hadoop-jim/mapred/local and uses that directory for the local storage.

I'm starting to suspect that mapred.local.dir is overwritten to a default
value of /tmp/hadoop-${user.name}
somewhere inside the binaries.

-jim
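One way to rule that out is to mark the property as final in hadoop-site.xml so that nothing loaded later can override it; a hedged sketch (the <final> element was supported by Hadoop's Configuration in this era):

<property>
  <name>mapred.local.dir</name>
  <value>/scratch/local/jim/hadoop-${user.name}/mapred/local</value>
  <final>true</final>
</property>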

On Tue, Apr 14, 2009 at 4:07 PM, Alex Loddengaard a...@cloudera.com wrote:

 First, did you bounce the Hadoop daemons after you changed the
 configuration
 files?  I think you'll have to do this.

 Second, I believe 0.19.1 has hadoop-default.xml baked into the jar.  Try
 setting $HADOOP_CONF_DIR to the directory where hadoop-site.xml lives.  For
 whatever reason your hadoop-site.xml (and the hadoop-default.xml you tried
 to change) are probably not being loaded.  $HADOOP_CONF_DIR should fix
 this.

 Good luck!

 Alex

 On Mon, Apr 13, 2009 at 11:25 AM, Jim Twensky jim.twen...@gmail.com
 wrote:

  Thank you Alex, you are right. There are quotas on the systems that I'm
  working on. However, I tried to change mapred.local.dir as follows:
 
  --inside hadoop-site.xml:
 
 <property>
   <name>mapred.child.tmp</name>
   <value>/scratch/local/jim</value>
 </property>
 <property>
   <name>hadoop.tmp.dir</name>
   <value>/scratch/local/jim</value>
 </property>
 <property>
   <name>mapred.local.dir</name>
   <value>/scratch/local/jim</value>
 </property>
 
   and observed that the intermediate map outputs are still being written
  under /tmp/hadoop-jim/mapred/local
 
  I'm confused at this point since I also tried setting these values
 directly
  inside the hadoop-default.xml and that didn't work either. Is there any
  other property that I'm supposed to change? I tried searching for /tmp
 in
  the hadoop-default.xml file but couldn't find anything else.
 
  Thanks,
  Jim
 
 
  On Tue, Apr 7, 2009 at 9:35 PM, Alex Loddengaard a...@cloudera.com
  wrote:
 
   The getLocalPathForWrite function that throws this Exception assumes
 that
   you have space on the disks that mapred.local.dir is configured on.
  Can
   you
   verify with `df` that those disks have space available?  You might also
  try
   moving mapred.local.dir off of /tmp if it's configured to use /tmp
 right
   now; I believe some systems have quotas on /tmp.
  
   Hope this helps.
  
   Alex
  
   On Tue, Apr 7, 2009 at 7:22 PM, Jim Twensky jim.twen...@gmail.com
  wrote:
  
Hi,
   
I'm using Hadoop 0.19.1 and I have a very small test cluster with 9
   nodes,
8
of them being task trackers. I'm getting the following error and my
  jobs
keep failing when map processes start hitting 30%:
   
 org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any
 valid local directory for
 taskTracker/jobcache/job_200904072051_0001/attempt_200904072051_0001_m_00_1/output/file.out
     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
     at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1209)
     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:867)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
     at org.apache.hadoop.mapred.Child.main(Child.java:158)
   
   
I googled many blogs and web pages but I could neither understand why this
happens nor find a solution to it. What does that error message mean and
how can I avoid it? Any suggestions?
   
Thanks in advance,
-jim
   
  
 



Re: Total number of records processed in mapper

2009-04-14 Thread Jim Twensky
Hi Andy,

Take a look at this piece of code:

Counters counters = job.getCounters();
counters.findCounter("org.apache.hadoop.mapred.Task$Counter",
    "REDUCE_INPUT_RECORDS").getCounter()

This is for reduce input records but I believe there is also a counter for
reduce output records. You should dig into the source code to find out what
it is because unfortunately, the default counters associated with the
map/reduce jobs are not public yet.

-Jim
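To the original question of making the map-phase total visible to the reducers: a hedged sketch of the two-job pattern under the 0.19 API, reading a counter of the first job and handing the number to the second job through its configuration (the job classes and the property name are illustrative):

JobConf firstJob = new JobConf(FirstPass.class);
// ... configure input/output, mapper, reducer for the first pass ...
RunningJob running = JobClient.runJob(firstJob);
long totalDocs = running.getCounters()
    .findCounter("org.apache.hadoop.mapred.Task$Counter", "MAP_INPUT_RECORDS")
    .getCounter();

JobConf secondJob = new JobConf(SecondPass.class);
// Read it back with conf.getLong("my.total.docs", 1L) in the reducer's configure() method.
secondJob.setLong("my.total.docs", totalDocs);
JobClient.runJob(secondJob);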


On Tue, Apr 14, 2009 at 11:19 AM, Andy Liu andyliu1...@gmail.com wrote:

 Is there a way for all the reducers to have access to the total number of
 records that were processed in the Map phase?

 For example, I'm trying to perform a simple document frequency calculation.
 During the map phase, I emit <word, 1> pairs for every unique word in every
 document.  During the reduce phase, I sum the values for each word group.
 Then I want to divide that value by the total number of documents.

 I suppose I can create a whole separate m/r job whose sole purpose is to
 count all the records, then pass that number on.  Is there a more
 straightforward way of doing this?

 Andy



Re: Map-Reduce Slow Down

2009-04-13 Thread Jim Twensky
Mithila,

You said all the slaves were being utilized in the 3 node cluster. Which
application did you run to test that and what was your input size? If you
tried the word count application on a 516 MB input file on both cluster
setups, then some of your nodes in the 15 node cluster may not be running at
all. Generally, one map job is assigned to each input split and if you are
running your cluster with the defaults, the splits are 64 MB each. I got
confused when you said the Namenode seemed to do all the work. Can you check
conf/slaves and make sure you put the names of all task trackers there? I
also suggest comparing both clusters with a larger input size, say at least
5 GB, to really see a difference.

Jim

On Mon, Apr 13, 2009 at 4:17 PM, Aaron Kimball aa...@cloudera.com wrote:

 in hadoop-*-examples.jar, use randomwriter to generate the data and
 sort
 to sort it.
 - Aaron
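A hedged sketch of the command lines for that test, assuming the 0.19-era examples jar and illustrative HDFS paths:

bin/hadoop jar hadoop-0.19.1-examples.jar randomwriter /benchmark/random-data
bin/hadoop jar hadoop-0.19.1-examples.jar sort /benchmark/random-data /benchmark/sorted-data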

 On Sun, Apr 12, 2009 at 9:33 PM, Pankil Doshi forpan...@gmail.com wrote:

  Your data is too small, I guess, for a 15 node cluster. So it might be the
  overhead time of these nodes making your total MR jobs more time consuming.
  I guess you will have to try with a larger set of data.
 
  Pankil
  On Sun, Apr 12, 2009 at 6:54 PM, Mithila Nagendra mnage...@asu.edu
  wrote:
 
   Aaron
  
   That could be the issue, my data is just 516MB - wouldn't this see a
 bit
  of
   speed up?
   Could you guide me to the example? I ll run my cluster on it and see
 what
  I
   get. Also for my program I had a java timer running to record the time
   taken
   to complete execution. Does Hadoop have an inbuilt timer?
  
   Mithila
  
   On Mon, Apr 13, 2009 at 1:13 AM, Aaron Kimball aa...@cloudera.com
  wrote:
  
Virtually none of the examples that ship with Hadoop are designed to
showcase its speed. Hadoop's speedup comes from its ability to
 process
   very
large volumes of data (starting around, say, tens of GB per job, and
   going
up in orders of magnitude from there). So if you are timing the pi
calculator (or something like that), its results won't necessarily be
   very
consistent. If a job doesn't have enough fragments of data to
 allocate
   one
per each node, some of the nodes will also just go unused.
   
The best example for you to run is to use randomwriter to fill up
 your
cluster with several GB of random data and then run the sort program.
  If
that doesn't scale up performance from 3 nodes to 15, then you've
definitely
got something strange going on.
   
- Aaron
   
   
On Sun, Apr 12, 2009 at 8:39 AM, Mithila Nagendra mnage...@asu.edu
wrote:
   
 Hey all
 I recently setup a three node hadoop cluster and ran an examples on
  it.
It
 was pretty fast, and all the three nodes were being used (I checked
  the
log
 files to make sure that the slaves are utilized).

 Now I ve setup another cluster consisting of 15 nodes. I ran the
 same
 example, but instead of speeding up, the map-reduce task seems to
  take
 forever! The slaves are not being used for some reason. This second
cluster
 has a lower, per node processing power, but should that make any
 difference?
 How can I ensure that the data is being mapped to all the nodes?
Presently,
 the only node that seems to be doing all the work is the Master
 node.

 Does 15 nodes in a cluster increase the network cost? What can I do
  to
 setup
 the cluster to function more efficiently?

 Thanks!
 Mithila Nagendra
 Arizona State University

   
  
 



Re: Grouping Values for Reducer Input

2009-04-13 Thread Jim Twensky
I'm not sure if this is exactly what you want but, can you emit map records
as:

cat, doc5 -> 3
cat, doc1 -> 1
cat, doc5 -> 1
and so on..

This way, your reducers will get the intermediate <key, value> pairs as

cat, doc5 -> 3
cat, doc5 -> 1
cat, doc1 -> 1

then you can split the keys (cat, doc*) inside the reducer and perform your
additions.

-Jim

On Mon, Apr 13, 2009 at 4:53 PM, Streckfus, William [USA] 
streckfus_will...@bah.com wrote:

  Hi Everyone,

 I'm working on a relatively simple MapReduce job with a slight complication
 with regards to the ordering of my key/values heading into the reducer. The
 output from the mapper might be something like

 cat -> doc5, 1
 cat -> doc1, 1
 cat -> doc5, 3
 ...

 Here, 'cat' is my key and the value is the document ID and the count (my
 own WritableComparable.) Originally I was going to create a HashMap in the
 reduce method and add an entry for each document ID and sum the counts for
 each. I realized the method would be better if the values were in order like
 so:

  cat -> doc1, 1
 cat -> doc5, 1
 cat -> doc5, 3
 ...

 Using this style I can continue summing until I reach a new document ID and
 just collect the output at this point thus avoiding data structures and
 object creation costs. I tried setting
 JobConf.setOutputValueGroupingComparator() but this didn't seem to do
 anything. In fact, I threw an exception from the Comparator I supplied but
 this never showed up when running the job. My map output value consists of a
 UTF and a Long so perhaps the Comparator I'm using (identical to
 Text.Comparator) is incorrect:

 public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
   int n1 = WritableUtils.decodeVIntSize(b1[s1]);
   int n2 = WritableUtils.decodeVIntSize(b2[s2]);

   return compareBytes(b1, s1 + n1, l1 - n1, b2, s2 + n2, l2 - n2);
 }
 In my final output I'm basically running into the same word -> documentID
 being output multiple times. So for the above example I have multiple lines
 with cat -> doc5, X.

 Reducer method just in case:

 public void reduce(Text key, Iterator<TermFrequencyWritable> values,
     OutputCollector<Text, TermFrequencyWritable> output, Reporter reporter)
     throws IOException {
   long sum = 0;
   String lastDocID = null;

   // Iterate through all values
   while (values.hasNext()) {
     TermFrequencyWritable value = values.next();

     // Encountered new document ID => record and reset
     if (!value.getDocumentID().equals(lastDocID)) {
       // Ignore first go through
       if (sum != 0) {
         termFrequency.setDocumentID(lastDocID);
         termFrequency.setFrequency(sum);
         output.collect(key, termFrequency);
       }

       sum = 0;
       lastDocID = value.getDocumentID();
     }

     sum += value.getFrequency();
   }

   // Record last one
   termFrequency.setDocumentID(lastDocID);
   termFrequency.setFrequency(sum);
   output.collect(key, termFrequency);
 }

 Any ideas (Using Hadoop .19.1)?

 Thanks,
 - Bill



Re: Grouping Values for Reducer Input

2009-04-13 Thread Jim Twensky
Oh, I forgot to mention that you should change your partitioner to send all the
keys of the form cat,* to the same reducer, but it seems like Jeremy has
been much faster than me :)

-Jim
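For completeness, a hedged sketch of such a partitioner under the old mapred API; it assumes the composite key is a Text of the form word,docid and hashes only the word part (class name and value type are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class WordPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {}

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Partition on the word alone so cat,doc1 and cat,doc5 meet at the same reducer.
    String word = key.toString().split(",", 2)[0];
    return (word.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be registered with conf.setPartitionerClass(WordPartitioner.class).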

On Mon, Apr 13, 2009 at 5:24 PM, Jim Twensky jim.twen...@gmail.com wrote:

 I'm not sure if this is exactly what you want but, can you emit map records
 as:

 cat, doc5 -> 3
 cat, doc1 -> 1
 cat, doc5 -> 1
 and so on..

 This way, your reducers will get the intermediate <key, value> pairs as

 cat, doc5 -> 3
 cat, doc5 -> 1
 cat, doc1 -> 1

 then you can split the keys (cat, doc*) inside the reducer and perform your
 additions.

 -Jim


 On Mon, Apr 13, 2009 at 4:53 PM, Streckfus, William [USA] 
 streckfus_will...@bah.com wrote:

  Hi Everyone,

 I'm working on a relatively simple MapReduce job with a slight
 complication with regards to the ordering of my key/values heading into the
 reducer. The output from the mapper might be something like

 cat -> doc5, 1
 cat -> doc1, 1
 cat -> doc5, 3
 ...

 Here, 'cat' is my key and the value is the document ID and the count (my
 own WritableComparable.) Originally I was going to create a HashMap in the
 reduce method and add an entry for each document ID and sum the counts for
 each. I realized the method would be better if the values were in order like
 so:

  cat -> doc1, 1
 cat -> doc5, 1
 cat -> doc5, 3
 ...

 Using this style I can continue summing until I reach a new document ID
 and just collect the output at this point thus avoiding data structures and
 object creation costs. I tried setting
 JobConf.setOutputValueGroupingComparator() but this didn't seem to do
 anything. In fact, I threw an exception from the Comparator I supplied but
 this never showed up when running the job. My map output value consists of a
 UTF and a Long so perhaps the Comparator I'm using (identical to
 Text.Comparator) is incorrect:

 public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
   int n1 = WritableUtils.decodeVIntSize(b1[s1]);
   int n2 = WritableUtils.decodeVIntSize(b2[s2]);

   return compareBytes(b1, s1 + n1, l1 - n1, b2, s2 + n2, l2 - n2);
 }
 In my final output I'm basically running into the same word -> documentID
 being output multiple times. So for the above example I have multiple lines
 with cat -> doc5, X.

 Reducer method just in case:

 public void reduce(Text key, Iterator<TermFrequencyWritable> values,
     OutputCollector<Text, TermFrequencyWritable> output, Reporter reporter)
     throws IOException {
   long sum = 0;
   String lastDocID = null;

   // Iterate through all values
   while (values.hasNext()) {
     TermFrequencyWritable value = values.next();

     // Encountered new document ID => record and reset
     if (!value.getDocumentID().equals(lastDocID)) {
       // Ignore first go through
       if (sum != 0) {
         termFrequency.setDocumentID(lastDocID);
         termFrequency.setFrequency(sum);
         output.collect(key, termFrequency);
       }

       sum = 0;
       lastDocID = value.getDocumentID();
     }

     sum += value.getFrequency();
   }

   // Record last one
   termFrequency.setDocumentID(lastDocID);
   termFrequency.setFrequency(sum);
   output.collect(key, termFrequency);
 }

 Any ideas (Using Hadoop .19.1)?

 Thanks,
 - Bill





Re: Map-Reduce Slow Down

2009-04-13 Thread Jim Twensky
Can you ssh between the nodes?

-jim

On Mon, Apr 13, 2009 at 6:49 PM, Mithila Nagendra mnage...@asu.edu wrote:

 Thanks Aaron.
 Jim: The three clusters I setup had ubuntu running on them and the dfs was
 accessed at port 54310. The new cluster which I ve setup has Red Hat Linux
 release 7.2 (Enigma)running on it. Now when I try to access the dfs from
 one
 of the slaves i get the following response: dfs cannot be accessed. When I
 access the DFS throught the master there s no problem. So I feel there a
 problem with the port. Any ideas? I did check the list of slaves, it looks
 fine to me.

 Mithila




 On Mon, Apr 13, 2009 at 2:58 PM, Jim Twensky jim.twen...@gmail.com
 wrote:

  Mithila,
 
  You said all the slaves were being utilized in the 3 node cluster. Which
  application did you run to test that and what was your input size? If you
  tried the word count application on a 516 MB input file on both cluster
  setups, then some of your nodes in the 15 node cluster may not be running
  at
  all. Generally, one map job is assigned to each input split and if you
 are
  running your cluster with the defaults, the splits are 64 MB each. I got
  confused when you said the Namenode seemed to do all the work. Can you
  check
  conf/slaves and make sure you put the names of all task trackers there? I
  also suggest comparing both clusters with a larger input size, say at
 least
  5 GB, to really see a difference.
 
  Jim
 
  On Mon, Apr 13, 2009 at 4:17 PM, Aaron Kimball aa...@cloudera.com
 wrote:
 
   in hadoop-*-examples.jar, use randomwriter to generate the data and
   sort
   to sort it.
   - Aaron
  
   On Sun, Apr 12, 2009 at 9:33 PM, Pankil Doshi forpan...@gmail.com
  wrote:
  
Your data is too small I guess for 15 clusters ..So it might be
  overhead
time of these clusters making your total MR jobs more time consuming.
I guess you will have to try with larger set of data..
   
Pankil
On Sun, Apr 12, 2009 at 6:54 PM, Mithila Nagendra mnage...@asu.edu
wrote:
   
 Aaron

 That could be the issue, my data is just 516MB - wouldn't this see
 a
   bit
of
 speed up?
 Could you guide me to the example? I ll run my cluster on it and
 see
   what
I
 get. Also for my program I had a java timer running to record the
  time
 taken
 to complete execution. Does Hadoop have an inbuilt timer?

 Mithila

 On Mon, Apr 13, 2009 at 1:13 AM, Aaron Kimball aa...@cloudera.com
 
wrote:

  Virtually none of the examples that ship with Hadoop are designed
  to
  showcase its speed. Hadoop's speedup comes from its ability to
   process
 very
  large volumes of data (starting around, say, tens of GB per job,
  and
 going
  up in orders of magnitude from there). So if you are timing the
 pi
  calculator (or something like that), its results won't
 necessarily
  be
 very
  consistent. If a job doesn't have enough fragments of data to
   allocate
 one
  per each node, some of the nodes will also just go unused.
 
  The best example for you to run is to use randomwriter to fill up
   your
  cluster with several GB of random data and then run the sort
  program.
If
  that doesn't scale up performance from 3 nodes to 15, then you've
  definitely
  got something strange going on.
 
  - Aaron
 
 
  On Sun, Apr 12, 2009 at 8:39 AM, Mithila Nagendra 
  mnage...@asu.edu
  wrote:
 
   Hey all
   I recently setup a three node hadoop cluster and ran an
 examples
  on
it.
  It
   was pretty fast, and all the three nodes were being used (I
  checked
the
  log
   files to make sure that the slaves are utilized).
  
   Now I ve setup another cluster consisting of 15 nodes. I ran
 the
   same
   example, but instead of speeding up, the map-reduce task seems
 to
take
   forever! The slaves are not being used for some reason. This
  second
  cluster
   has a lower, per node processing power, but should that make
 any
   difference?
   How can I ensure that the data is being mapped to all the
 nodes?
  Presently,
   the only node that seems to be doing all the work is the Master
   node.
  
   Does 15 nodes in a cluster increase the network cost? What can
 I
  do
to
   setup
   the cluster to function more efficiently?
  
   Thanks!
   Mithila Nagendra
   Arizona State University
  
 

   
  
 



getting DiskErrorException during map

2009-04-07 Thread Jim Twensky
Hi,

I'm using Hadoop 0.19.1 and I have a very small test cluster with 9 nodes, 8
of them being task trackers. I'm getting the following error and my jobs
keep failing when map processes start hitting 30%:

org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any
valid local directory for
taskTracker/jobcache/job_200904072051_0001/attempt_200904072051_0001_m_00_1/output/file.out
at
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
at
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at
org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1209)
at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:867)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.Child.main(Child.java:158)


I googled many blogs and web pages but I could neither understand why this
happens nor find a solution to it. What does that error message mean and
how can I avoid it? Any suggestions?

Thanks in advance,
-jim


Re: Please help!

2009-03-31 Thread Jim Twensky
See the original Map Reduce paper by Google at
http://labs.google.com/papers/mapreduce.html and please don't spam the list.

-jim

On Tue, Mar 31, 2009 at 6:15 PM, Hadooper kusanagiyang.had...@gmail.com wrote:

 Dear developers,

 Is there any detailed example of how Hadoop processes input?
 Article
 http://hadoop.apache.org/core/docs/r0.19.1/mapred_tutorial.html gives
 a good idea, but I want to see input data being passed from class to
 class, and how each class manipulates data. The purpose is to analyze the
 time and space complexity of Hadoop as a generalized computational
 model/algorithm. I tried to search the web and could not find more detail.
 Any pointer/hint?
 Thanks a million.

 --
 Cheers! Hadoop core



Re: Using HDFS for common purpose

2009-01-27 Thread Jim Twensky
You may also want to have a look at this to reach a decision based on your
needs:

http://www.swaroopch.com/notes/Distributed_Storage_Systems

Jim

On Tue, Jan 27, 2009 at 1:22 PM, Jim Twensky jim.twen...@gmail.com wrote:

 Rasit,

 What kind of data will you be storing on Hbase or directly on HDFS? Do you
 aim to use it as a data source to do some key/value lookups for small
 strings/numbers or do you want to store larger files labeled with some sort
 of a key and retrieve them during a map reduce run?

 Jim


 On Tue, Jan 27, 2009 at 11:51 AM, Jonathan Gray jl...@streamy.com wrote:

 Perhaps what you are looking for is HBase?

 http://hbase.org

 HBase is a column-oriented, distributed store that sits on top of HDFS and
 provides random access.

 JG

  -Original Message-
  From: Rasit OZDAS [mailto:rasitoz...@gmail.com]
  Sent: Tuesday, January 27, 2009 1:20 AM
  To: core-user@hadoop.apache.org
  Cc: arif.yil...@uzay.tubitak.gov.tr; emre.gur...@uzay.tubitak.gov.tr;
  hilal.tara...@uzay.tubitak.gov.tr; serdar.ars...@uzay.tubitak.gov.tr;
  hakan.kocaku...@uzay.tubitak.gov.tr; caglar.bi...@uzay.tubitak.gov.tr
  Subject: Using HDFS for common purpose
 
  Hi,
  I wanted to ask, if HDFS is a good solution just as a distributed db
  (no
  running jobs, only get and put commands)
  A review says that HDFS is not designed for low latency and besides,
  it's
  implemented in Java.
  Do these disadvantages prevent us using it?
  Or could somebody suggest a better (faster) one?
 
  Thanks in advance..
  Rasit





Re: Suitable for Hadoop?

2009-01-21 Thread Jim Twensky
Ricky,

Hadoop is primarily optimized for large files, usually files larger than one
input split. However, there is an input format called MultiFileInputFormat
which can be used to make Hadoop work efficiently on smaller files. You can
also make the isSplitable method of an input format return false to ensure
that a file is not split into pieces but rather processed by only one mapper.

Jim

On Wed, Jan 21, 2009 at 9:14 AM, Ricky Ho r...@adobe.com wrote:

 Hmmm ...

 From a space efficiency perspective, given HDFS (with its large block size) is
 expecting large files, is Hadoop optimized for processing a large number of
 small files? Does each file take up at least 1 block, or can multiple files
 sit in the same block?

 Rgds,
 Ricky
 -Original Message-
 From: Zak, Richard [USA] [mailto:zak_rich...@bah.com]
 Sent: Wednesday, January 21, 2009 6:42 AM
 To: core-user@hadoop.apache.org
 Subject: RE: Suitable for Hadoop?

 You can do that.  I did a Map/Reduce job for about 6 GB of PDFs to
 concatenate them, and the New York times used Hadoop to process a few TB of
 PDFs.

 What I would do is this:
 - Use the iText library, a Java library for PDF manipulation (don't know
 what you would use for reading Word docs)
 - Don't use any Reducers
 - Have the input be a text file with the directory(ies) to process, since
 the mapper takes in file contents (and you don't want to read in one line
 of
 binary)
 - Have the map process all contents for that one given directory from the
 input text file
 - Break down the documents into more directories to go easier on the memory
 - Use Amazon's EC2, and the scripts in hadoop_dir/src/contrib/ec2/bin/
 (there is a script which passes environment variables to launched
 instances,
 modify the script to allow Hadoop to use more memory by setting the
 HADOOP_HEAPSIZE environment variable and having the variable properly
 passed)

 I realize this isn't the strong point of Map/Reduce or Hadoop, but it still
 uses the HDFS in a beneficial manner, and the distributed part is very
 helpful!


 Richard J. Zak

 -Original Message-
 From: Darren Govoni [mailto:dar...@ontrenet.com]
 Sent: Wednesday, January 21, 2009 08:08
 To: core-user@hadoop.apache.org
 Subject: Suitable for Hadoop?

 Hi,
  I have a task to process large quantities of files by converting them into
 other formats. Each file is processed as a whole and converted to a target
 format. Since there are 100's of GB of data I thought it suitable for
 Hadoop, but the problem is, I don't think the files can be broken apart and
 processed. For example, how would mapreduce work to convert a Word Document
 to PDF if the file is reduced to blocks? I'm not sure that's possible, or
 is
 it?

 thanks for any advice
 Darren




Re: Indexed Hashtables

2009-01-15 Thread Jim Twensky
Delip,

Why do you think Hbase will be an overkill? I do something similar to what
you're trying to do with Hbase and I haven't encountered any significant
problems so far. Can you give some more info on the size of the data you
have?

Jim

On Wed, Jan 14, 2009 at 8:47 PM, Delip Rao delip...@gmail.com wrote:

 Hi,

 I need to lookup a large number of key/value pairs in my map(). Is
 there any indexed hashtable available as a part of Hadoop I/O API?
 I find Hbase an overkill for my application; something on the lines of
 HashStore (www.cellspark.com/hashstore.html) should be fine.

 Thanks,
 Delip



Re: Merging reducer outputs into a single part-00000 file

2009-01-14 Thread Jim Twensky
Owen and Rasit,

Thank you for the responses. I've figured out that mapred.reduce.tasks was set
to 1 in my hadoop-default.xml and I didn't override it in my
hadoop-site.xml configuration file.

Jim
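For reference, a hedged sketch of the override that controls how many part-NNNNN files are produced (the count is illustrative); either in hadoop-site.xml:

<property>
  <name>mapred.reduce.tasks</name>
  <value>8</value>
</property>

or per job in code with conf.setNumReduceTasks(8).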

On Wed, Jan 14, 2009 at 11:23 AM, Owen O'Malley omal...@apache.org wrote:

 On Jan 14, 2009, at 12:46 AM, Rasit OZDAS wrote:

  Jim,

 As far as I know, there is no operation done after Reducer.


 Correct, other than output promotion, which moves the output file to the
 final filename.

  But if you  are a little experienced, you already know these.
 Ordered list means one final file, or am I missing something?


 There is no value and a lot of cost associated with creating a single file
 for the output. The question is how you want the keys divided between the
 reduces (and therefore output files). The default partitioner hashes the key
 and mods by the number of reduces, which stripes the keys across the
 output files. You can use the mapred.lib.InputSampler to generate good
 partition keys and mapred.lib.TotalOrderPartitioner to get completely sorted
 output based on the partition keys. With the total order partitioner, each
 reduce gets an increasing range of keys and thus has all of the nice
 properties of a single file without the costs.

 -- Owen
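A hedged sketch of the approach Owen describes, assuming the 0.19-era classes in org.apache.hadoop.mapred.lib and their signatures of that time (paths, sampling parameters, and the job class are illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.InputSampler;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

public class TotalOrderExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(TotalOrderExample.class);
    // ... input/output paths, input format, mapper/reducer and key/value classes go here ...
    conf.setNumReduceTasks(16);

    // Sample the input to pick 15 partition boundaries and point the partitioner at them.
    TotalOrderPartitioner.setPartitionFile(conf, new Path("/tmp/_partitions"));
    InputSampler.writePartitionFile(conf,
        new InputSampler.RandomSampler<Text, Text>(0.01, 1000, 10));
    conf.setPartitionerClass(TotalOrderPartitioner.class);

    JobClient.runJob(conf);
  }
}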



Merging reducer outputs into a single part-00000 file

2009-01-10 Thread Jim Twensky
Hello,

The original map-reduce paper states: After successful completion, the
output of the map-reduce execution is available in the R output files (one
per reduce task, with file names as specified by the user). However, when
using Hadoop's TextOutputFormat, all the reducer outputs are combined in a
single file called part-00000. I was wondering how and when this merging
process is done. When the reducer calls output.collect(key, value), is this
record written to a local temporary output file on the reducer's disk and
then these local files (a total of R) later merged into one single file
by a final thread, or is it directly written to the final output file
(part-00000)? I am asking this because I'd like to get an ordered sample of
the final output data, i.e. one record per every 1000 records or something
similar and I don't want to run a serial process that iterates on the final
output file.

Thanks,
Jim


Re: Combiner run specification and questions

2009-01-02 Thread Jim Twensky
Hello Saptarshi,

E.g. if there are only 10 values corresponding
to a key (as output by the mapper), will these 10 values go straight
to the reducer or to the reducer via the combiner?

It depends on whether or not you call JobConf.setCombinerClass().
If you don't, Hadoop does not run any combiners by default. If you
use your reducer class as the combiner, you must make sure that your mapper
and reducer outputs are of the same type, because otherwise you will get a
runtime error about types not matching. In your case, I strongly recommend
using a combiner to reduce the size of the intermediate data. My
understanding is that combiners are just local reducers that run right
after the completion of the map step.

Jim
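A hedged one-line illustration of wiring that up under the old API (MyReducer is a placeholder; reusing the reducer as the combiner only works when the map output and reduce output types match, as noted above):

conf.setCombinerClass(MyReducer.class);  // omit this call and no combiner runs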

On Fri, Jan 2, 2009 at 11:57 AM, Saptarshi Guha saptarshi.g...@gmail.comwrote:

 Hello,
 I would just like to confirm, when does the Combiner run(since it
 might not be run at all,see below). I read somewhere that it is run,
 if there is at least one reduce (which in my case i can be sure of).
 I also read, that the combiner is an optimization. However, it is also
 a chance for a function to transform the key/value (keeping the class
 the same i.e the combiner semantics are not changed) and deal with a
 smaller set ( this could be done in the reducer but the number of
 values for a key might be relatively large).

 However, I guess it would be a mistake for reducer to expect its input
 coming from a combiner? E.g. if there are only 10 values corresponding
 to a key (as output by the mapper), will these 10 values go straight
 to the reducer or to the reducer via the combiner?

 Here I am assuming my reduce operations does not need all the values
 for a key to work(so that a combiner can be used) i.e additive
 operations.

 Thank you
 Saptarshi


 On Sun, Nov 16, 2008 at 6:18 PM, Owen O'Malley omal...@apache.org wrote:
  The Combiner may be called 0, 1, or many times on each key between the
  mapper and reducer. Combiners are just an application specific
 optimization
  that compress the intermediate output. They should not have side effects
 or
  transform the types. Unfortunately, since there isn't a separate
 interface
  for Combiners, there isn't a great place to document this requirement.
  I've just filed HADOOP-4668 to improve the documentation.



 --
 Saptarshi Guha - saptarshi.g...@gmail.com



Re: Shared thread safe variables?

2009-01-01 Thread Jim Twensky
Aaron,

I actually do something different than word count. I count all possible
phrases for every sentence in my corpus. So for instance, if I have a
sentence like Hello world, my mappers emit:

Hello 1
World 1
Hello World 1

As you can easily realize, for longer sentences the number of intermediate
records grows much larger than the original input size.
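A hedged sketch of that kind of mapper under the 0.19 API (class name illustrative), emitting every contiguous phrase of each input line with a count of 1:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class PhraseMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);

  public void map(LongWritable offset, Text line,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String sentence = line.toString().trim();
    if (sentence.length() == 0) {
      return;
    }
    String[] words = sentence.split("\\s+");
    // "Hello World" emits "Hello", "World" and "Hello World", each with count 1.
    for (int i = 0; i < words.length; i++) {
      StringBuilder phrase = new StringBuilder();
      for (int j = i; j < words.length; j++) {
        if (j > i) {
          phrase.append(' ');
        }
        phrase.append(words[j]);
        output.collect(new Text(phrase.toString()), ONE);
      }
    }
  }
}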

Anyway, I did what I said last week based on your previous replies and it
worked well. Thank you for the advice.

Jim

On Wed, Dec 31, 2008 at 4:06 AM, Aaron Kimball aa...@cloudera.com wrote:

 Hmm. Check your math on the data set size. Your input corpus may be a few
 (dozen, hundred) TB, but how many distinct words are there? The output data
 set should be at least a thousand times smaller. If you've got the hardware
 to do that initial word count step on a few TB of data, the second pass
 will
 not be a major performance concern.

 MapReduce is, to borrow from a tired analogy, a lot like driving a freight
 train. The raw speed of any given algorithm on it might not sound
 impressive, but even if its got a much higher constant-factor of time
 associated with it, the ability to provide nearly-flat parallelism as your
 data set grows really large more than makes up for it in the long run.
 - Aaron

 On Thu, Dec 25, 2008 at 2:22 AM, Jim Twensky jim.twen...@gmail.com
 wrote:

  Hello again,
 
  I think I found an answer to my question. If I write a new
  WritableComparable object that extends IntWritable and then override the
  compareTo method, I can change the sorting order from ascending to
  descending. That will solve my problem for getting the top 100 most
  frequent
  words at each combiner/reducer.
 
  Jim
 
  On Wed, Dec 24, 2008 at 12:19 PM, Jim Twensky jim.twen...@gmail.com
  wrote:
 
   Hi Aaron,
  
   Thanks for the advice. I actually thought of using multiple combiners
 and
  a
   single reducer but I was worried about the key sorting phase to be a
  waste
   for my purpose. If the input is just a bunch of (word,count) pairs
 which
  is
   in the order of TeraBytes, wouldn't sorting be an overkill? That's why
 I
   thought a single serial program might perform better but I'm not sure
 how
   long it would take to sort the keys in such a case so probably it is
  nothing
   beyond speculation and I should go and give it a try to see how well it
   performs.
  
   Secondly, I didn't quite understand how I can take advantage of the
  sorted
   keys if I use an inverting mapper that transforms (k,v) --> (v,k)
 pairs.
  In
   both cases, the combiners and the single reducer will still have to
  iterate
   over all the (v,k) pairs to find the top 100 right? Or is there a way
 to
  say
   something like "Give me the last 100 keys at each reducer/combiner"?
  
   Thanks in advance,
   Jim
  
  
   On Wed, Dec 24, 2008 at 3:44 AM, Aaron Kimball aa...@cloudera.com
  wrote:
  
   (Addendum to my own post -- an identity mapper is probably not what
 you
   want. You'd actually want an inverting mapper that transforms (k, v)
 --
   (v,
   k), to take advantage of the key-based sorting.)
  
   - Aaron
  
   On Wed, Dec 24, 2008 at 4:43 AM, Aaron Kimball aa...@cloudera.com
   wrote:
  
Hi Jim,
   
The ability to perform locking of shared mutable state is a distinct
anti-goal of the MapReduce paradigm. One of the major benefits of
   writing
MapReduce programs is knowing that you don't have to worry about
   deadlock in
your code. If mappers could lock objects, then the failure and
 restart
semantics of individual tasks would be vastly more complicated.
 (What
happens if a map task crashes after it obtains a lock? Does it
   automatically
release the lock? Does some rollback mechanism undo everything that
   happened
after the lock was acquired? How would that work if--by
  definition--the
mapper node is no longer available?)
   
A word frequency histogram function can certainly be written in
   MapReduce
without such state. You've got the right intuition, but a serial
  program
   is
not necessarily the best answer. Take the existing word count
 program.
   This
converts bags of words into (word, count) pairs. Then pass this
  through
   a
second pass, via an identity mapper to a set of combiners that each
  emit
   the
100 most frequent words, to a single reducer that emits the 100 most
frequent words obtained by the combiners.
   
Many other more complicated problems which seem to require shared
  state,
   in
truth, only require a second (or n+1'th) MapReduce pass. Adding
  multiple
passes is a very valid technique for building more complex
 dataflows.
   
Cheers,
- Aaron
   
   
   
On Wed, Dec 24, 2008 at 3:28 AM, Jim Twensky jim.twen...@gmail.com
   wrote:
   
Hello,
   
I was wondering if Hadoop provides thread safe shared variables
 that
   can
be
accessed from individual mappers/reducers along with a proper
 locking
mechanism. To clarify things

Re: Shared thread safe variables?

2008-12-25 Thread Jim Twensky
Hello again,

I think I found an answer to my question. If I write a new
WritableComparable object that extends IntWritable and then override the
compareTo method, I can change the sorting order from ascending to
descending. That will solve my problem for getting the top 100 most frequent
words at each combiner/reducer.

Jim
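An alternative that reaches the same descending order without defining a new key type is to register a reversed raw comparator for IntWritable; a hedged sketch under the old mapred API (class name illustrative):

import org.apache.hadoop.io.IntWritable;

public class DescendingIntComparator extends IntWritable.Comparator {

  // Negate the byte-level comparison so that larger counts sort first.
  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    return -super.compare(b1, s1, l1, b2, s2, l2);
  }
}

It would be registered with conf.setOutputKeyComparatorClass(DescendingIntComparator.class).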

On Wed, Dec 24, 2008 at 12:19 PM, Jim Twensky jim.twen...@gmail.com wrote:

 Hi Aaron,

 Thanks for the advice. I actually thought of using multiple combiners and a
 single reducer but I was worried that the key sorting phase would be a waste
 for my purpose. If the input is just a bunch of (word,count) pairs which is
 in the order of TeraBytes, wouldn't sorting be an overkill? That's why I
 thought a single serial program might perform better but I'm not sure how
 long it would take to sort the keys in such a case so probably it is nothing
 beyond speculation and I should go and give it a try to see how well it
 performs.

 Secondly, I didn't quite understand how I can take advantage of the sorted
 keys if I use an inverting mapper that transforms (k,v) --> (v,k) pairs. In
 both cases, the combiners and the single reducer will still have to iterate
 over all the (v,k) pairs to find the top 100 right? Or is there a way to say
 something like "Give me the last 100 keys at each reducer/combiner"?

 Thanks in advance,
 Jim


 On Wed, Dec 24, 2008 at 3:44 AM, Aaron Kimball aa...@cloudera.com wrote:

 (Addendum to my own post -- an identity mapper is probably not what you
 want. You'd actually want an inverting mapper that transforms (k, v) -->
 (v,
 k), to take advantage of the key-based sorting.)

 - Aaron

 On Wed, Dec 24, 2008 at 4:43 AM, Aaron Kimball aa...@cloudera.com
 wrote:

  Hi Jim,
 
  The ability to perform locking of shared mutable state is a distinct
  anti-goal of the MapReduce paradigm. One of the major benefits of
 writing
  MapReduce programs is knowing that you don't have to worry about
 deadlock in
  your code. If mappers could lock objects, then the failure and restart
  semantics of individual tasks would be vastly more complicated. (What
  happens if a map task crashes after it obtains a lock? Does it
 automatically
  release the lock? Does some rollback mechanism undo everything that
 happened
  after the lock was acquired? How would that work if--by definition--the
  mapper node is no longer available?)
 
  A word frequency histogram function can certainly be written in
 MapReduce
  without such state. You've got the right intuition, but a serial program
 is
  not necessarily the best answer. Take the existing word count program.
 This
  converts bags of words into (word, count) pairs. Then pass this through
 a
  second pass, via an identity mapper to a set of combiners that each emit
 the
  100 most frequent words, to a single reducer that emits the 100 most
  frequent words obtained by the combiners.
 
  Many other more complicated problems which seem to require shared state,
 in
  truth, only require a second (or n+1'th) MapReduce pass. Adding multiple
  passes is a very valid technique for building more complex dataflows.
 
  Cheers,
  - Aaron
 
 
 
  On Wed, Dec 24, 2008 at 3:28 AM, Jim Twensky jim.twen...@gmail.com
 wrote:
 
  Hello,
 
  I was wondering if Hadoop provides thread safe shared variables that
 can
  be
  accessed from individual mappers/reducers along with a proper locking
  mechanism. To clarify things, let's say that in the word count example,
 I
  want to know the word that has the highest frequency and how many times
 it
  occurred. I believe that the latter can be done using the counters that
  come
  with the Hadoop framework but I don't know how to get the word itself
 as a
  String. Of course, the problem can be more complicated like the top 100
  words or so.
 
  I thought of writing a serial program which can go over the final
 output
  of
  the word count but this wouldn't be a good idea if the output file gets
  too
  large. However, if there is a way to define and use shared variables,
 this
  would be really easy to do on the fly during the word count's reduce
  phase.
 
  Thanks,
  Jim
 
 
 





Shared thread safe variables?

2008-12-24 Thread Jim Twensky
Hello,

I was wondering if Hadoop provides thread safe shared variables that can be
accessed from individual mappers/reducers along with a proper locking
mechanism. To clarify things, let's say that in the word count example, I
want to know the word that has the highest frequency and how many times it
occurred. I believe that the latter can be done using the counters that come
with the Hadoop framework but I don't know how to get the word itself as a
String. Of course, the problem can be more complicated like the top 100
words or so.

I thought of writing a serial program which can go over the final output of
the word count but this wouldn't be a good idea if the output file gets too
large. However, if there is a way to define and use shared variables, this
would be really easy to do on the fly during the word count's reduce phase.

Thanks,
Jim


Re: Shared thread safe variables?

2008-12-24 Thread Jim Twensky
Hi Aaron,

Thanks for the advice. I actually thought of using multiple combiners and a
single reducer but I was worried that the key sorting phase would be a waste
for my purpose. If the input is just a bunch of (word,count) pairs which is
in the order of TeraBytes, wouldn't sorting be an overkill? That's why I
thought a single serial program might perform better but I'm not sure how
long it would take to sort the keys in such a case so probably it is nothing
beyond speculation and I should go and give it a try to see how well it
performs.

Secondly, I didn't quite understand how I can take advantage of the sorted
keys if I use an inverting mapper that transforms (k,v) --> (v,k) pairs. In
both cases, the combiners and the single reducer will still have to iterate
over all the (v,k) pairs to find the top 100 right? Or is there a way to say
something like "Give me the last 100 keys at each reducer/combiner"?

Thanks in advance,
Jim

On Wed, Dec 24, 2008 at 3:44 AM, Aaron Kimball aa...@cloudera.com wrote:

 (Addendum to my own post -- an identity mapper is probably not what you
 want. You'd actually want an inverting mapper that transforms (k, v) -->
 (v,
 k), to take advantage of the key-based sorting.)

 - Aaron

 On Wed, Dec 24, 2008 at 4:43 AM, Aaron Kimball aa...@cloudera.com wrote:

  Hi Jim,
 
  The ability to perform locking of shared mutable state is a distinct
  anti-goal of the MapReduce paradigm. One of the major benefits of writing
  MapReduce programs is knowing that you don't have to worry about deadlock
 in
  your code. If mappers could lock objects, then the failure and restart
  semantics of individual tasks would be vastly more complicated. (What
  happens if a map task crashes after it obtains a lock? Does it
 automatically
  release the lock? Does some rollback mechanism undo everything that
 happened
  after the lock was acquired? How would that work if--by definition--the
  mapper node is no longer available?)
 
  A word frequency histogram function can certainly be written in MapReduce
  without such state. You've got the right intuition, but a serial program
 is
  not necessarily the best answer. Take the existing word count program.
 This
  converts bags of words into (word, count) pairs. Then pass this through a
  second pass, via an identity mapper to a set of combiners that each emit
 the
  100 most frequent words, to a single reducer that emits the 100 most
  frequent words obtained by the combiners.
 
  Many other more complicated problems which seem to require shared state,
 in
  truth, only require a second (or n+1'th) MapReduce pass. Adding multiple
  passes is a very valid technique for building more complex dataflows.
 
  Cheers,
  - Aaron
 
 
 
  On Wed, Dec 24, 2008 at 3:28 AM, Jim Twensky jim.twen...@gmail.com
 wrote:
 
  Hello,
 
  I was wondering if Hadoop provides thread safe shared variables that can
  be
  accessed from individual mappers/reducers along with a proper locking
  mechanism. To clarify things, let's say that in the word count example,
 I
  want to know the word that has the highest frequency and how many times
 it
  occurred. I believe that the latter can be done using the counters that
  come
  with the Hadoop framework but I don't know how to get the word itself as
 a
  String. Of course, the problem can be more complicated like the top 100
  words or so.
 
  I thought of writing a serial program which can go over the final output
  of
  the word count but this wouldn't be a good idea if the output file gets
  too
  large. However, if there is a way to define and use shared variables,
 this
  would be really easy to do on the fly during the word count's reduce
  phase.
 
  Thanks,
  Jim
 
 
 



Re: Predefined counters

2008-12-22 Thread Jim Twensky
Hello Tom,

Thanks for the swift response. That really worked, I'll vote for what you
suggested now.

Cheers,
Jim

On Mon, Dec 22, 2008 at 5:09 AM, Tom White t...@cloudera.com wrote:

 Hi Jim,

 Try something like:

 Counters counters = job.getCounters();
 counters.findCounter("org.apache.hadoop.mapred.Task$Counter",
     "REDUCE_INPUT_RECORDS").getCounter()

 The pre-defined counters are unfortunately not public and are not in
 one place in the source code, so you'll need to hunt to find them
 (search the source for the counter name you see in the web UI). I
 opened https://issues.apache.org/jira/browse/HADOOP-4043 a while back
 to address the fact they are not public. Please consider voting for it
 if you think it would be useful.

 Cheers,
 Tom

 On Mon, Dec 22, 2008 at 2:47 AM, Jim Twensky jim.twen...@gmail.com
 wrote:
  Hello,
  I need to collect some statistics using some of the counters defined by
 the
  Map/Reduce framework such as Reduce input records. I know I should use
  the  getCounter method from Counters.Counter but I couldn't figure how to
  use it. Can someone give me a two line example of how to read the values
 for
  those counters and where I can find the names/groups of the predefined
  counters?
 
  Thanks in advance,
  Jim
 



Predefined counters

2008-12-21 Thread Jim Twensky
Hello,
I need to collect some statistics using some of the counters defined by the
Map/Reduce framework, such as "Reduce input records". I know I should use
the getCounter method from Counters.Counter but I couldn't figure out how to
use it. Can someone give me a two line example of how to read the values for
those counters and where I can find the names/groups of the predefined
counters?

Thanks in advance,
Jim


Re: Can hadoop sort by values rather than keys?

2008-09-24 Thread Jim Twensky
Sorting according to keys is a requirement for the map/reduce algorithm. I'd
suggest running a second map/reduce phase on the output files of your
application and use the values as keys in that second phase. I know that
will increase the running time, but this is how I do it when I need to get
my output files sorted according to their values rather than keys.

Jim
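A hedged sketch of the second-pass mapper under the old API; the types assume the first pass wrote (Text word, IntWritable count) pairs to a SequenceFile that the second job reads back, so the shuffle then sorts records by the original value (class name illustrative):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SwapKeyValueMapper extends MapReduceBase
    implements Mapper<Text, IntWritable, IntWritable, Text> {

  // Emit (count, word) so the framework sorts by what used to be the value.
  public void map(Text word, IntWritable count,
      OutputCollector<IntWritable, Text> output, Reporter reporter)
      throws IOException {
    output.collect(count, word);
  }
}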

On Wed, Sep 24, 2008 at 9:28 PM, Qin Gao [EMAIL PROTECTED] wrote:

 Why not use the value as keys.

 On Wed, Sep 24, 2008 at 10:22 PM, Jeremy Chow [EMAIL PROTECTED] wrote:

  Hi list,
   The default way hadoop doing its sorting is by keys , can it sort by
  values rather than keys?
 
  Regards,
  Jeremy
  --
  My research interests are distributed systems, parallel computing and
  bytecode based virtual machine.
 
  http://coderplay.javaeye.com
 



Re: debugging hadoop application!

2008-09-24 Thread Jim Twensky
As far as I know, there is a Hadoop plug-in for Eclipse but it is not
possible to debug when running on a real cluster. If you want to add watches
and expressions to trace your programs or profile your code, I'd suggest
looking at the log files or using other tracing tools such as X-Trace (
http://www.x-trace.net/wiki/doku.php). Somebody please correct me if I'm
wrong.

Jim

On Wed, Sep 24, 2008 at 4:41 PM, Gerardo Velez [EMAIL PROTECTED] wrote:

 Hi everybody!

 I'm a newbie on Hadoop, and after following some Hadoop examples and studying
 them I will start my own application, but I have a question.

 Is there anyway I could debug my own hadoop application?

 Actually I've been working on IntelliJ IDE, but I'm feeling comfortable
 with
 netbeans and eclipse as well.

 Note: So far I've attached jboss server on IntelliJ using this:

 -Xdebug -Xnoagent -Djava.compiler=NONE
 -Xrunjdwp:transport=dt_shmem,server=y,suspend=n,address=javadebug


 I just attached last config line to jboss execution file (run.sh).


 I was wondering if I could do ahything similar to this for hadoop!


 Thanks in advance!



Re: installing hadoop on a OS X cluster

2008-09-10 Thread Jim Twensky
Apparently you have one node with 2 processors where each processor has 4
cores. What do you want to use Hadoop for? If you have a single disk drive
and multiple cores on one node then pseudo distributed environment seems
like the best approach to me as long as you are not dealing with large
amounts of data. If you have a single disk drive and huge amount of data to
process, then the disk drive might be a bottleneck for your applications.
Hadoop is usually used for data intensive applications whereas your hardware
seems more like to be designed for cpu intensive job considering 8 cores on
a single node.

Tim

On Wed, Sep 10, 2008 at 4:59 PM, Sandy [EMAIL PROTECTED] wrote:

 I am starting an install of hadoop on a new cluster. However, I am a little
 confused about which set of instructions I should follow, having only
 installed and played around with Hadoop on a single-node Ubuntu box with 2
 cores (on a single board) and 2 GB of RAM.
 The new machine has 2 internal nodes, each with 4 cores. I would like to run
 Hadoop in a distributed context over these 8 cores. One of my biggest
 issues is the definition of the word "node". From the Hadoop wiki and
 documentation, it seems that "node" means "machine", and not a board. So, by
 this definition, our cluster is really one node. Is this correct?

 If this is the case, then I shouldn't be using the cluster setup
 instructions, located here:
 http://hadoop.apache.org/core/docs/r0.17.2/cluster_setup.html

 But this one:
 http://hadoop.apache.org/core/docs/r0.17.2/quickstart.html

 Which is what I've been doing. But what should the mode of operation be? I
 don't think it should be standalone. Should it be pseudo-distributed? If so,
 how can I guarantee that work will be spread over all 8 processors? What is
 necessary for the hadoop-site.xml file?

 Here are the specs of the machine.
    -Mac Pro RAID Card  065-7214
    -Two 3.0GHz Quad-Core Intel Xeon (8-core)   065-7534
    -16GB RAM (4 x 4GB) 065-7179
    -1TB 7200-rpm Serial ATA 3Gb/s  065-7544
    -1TB 7200-rpm Serial ATA 3Gb/s  065-7546
    -1TB 7200-rpm Serial ATA 3Gb/s  065-7193
    -1TB 7200-rpm Serial ATA 3Gb/s  065-7548

 Could someone please point me to the correct mode of operation/instructions
 to install things correctly on this machine? I found some information on how
 to install on an OS X machine in the archives, but it is a touch outdated and
 seems to be missing a few things.

 Thank you very much for you time.

 -SM



Re: Hadoop Streaming and Multiline Input

2008-09-09 Thread Jim Twensky
If I understand your question correctly, you need to write your own
FileInputFormat. Please see
http://hadoop.apache.org/core/docs/r0.18.0/api/index.html for details.
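
As a rough, untested sketch against the 0.18-era API (WholeFileInputFormat and
WholeFileRecordReader are just names I made up for illustration), the skeleton
below reads each input file as a single record; you could point streaming at
it with -inputformat once the class is on the job's classpath:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WholeFileInputFormat extends FileInputFormat<LongWritable, Text> {

        protected boolean isSplitable(FileSystem fs, Path file) {
            return false;    // one file becomes one record, so never split it
        }

        public RecordReader<LongWritable, Text> getRecordReader(
                InputSplit split, JobConf job, Reporter reporter) throws IOException {
            return new WholeFileRecordReader((FileSplit) split, job);
        }

        static class WholeFileRecordReader implements RecordReader<LongWritable, Text> {
            private final FileSplit split;
            private final JobConf job;
            private boolean done = false;

            WholeFileRecordReader(FileSplit split, JobConf job) {
                this.split = split;
                this.job = job;
            }

            public boolean next(LongWritable key, Text value) throws IOException {
                if (done) return false;
                Path path = split.getPath();
                FileSystem fs = path.getFileSystem(job);
                // fine for modestly sized files; huge files would need chunking
                byte[] contents = new byte[(int) split.getLength()];
                FSDataInputStream in = fs.open(path);
                try {
                    in.readFully(0, contents);
                } finally {
                    in.close();
                }
                key.set(0);
                value.set(contents, 0, contents.length);
                done = true;
                return true;
            }

            public LongWritable createKey() { return new LongWritable(); }
            public Text createValue() { return new Text(); }
            public long getPos() { return done ? split.getLength() : 0; }
            public float getProgress() { return done ? 1.0f : 0.0f; }
            public void close() { }
        }
    }

Note that streaming still delivers records to your script one line at a time,
so you would probably also want to replace the newlines in the value (or
encode it) before the multiline regex sees it.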

Regards,
Jim

On Sat, Sep 6, 2008 at 9:20 PM, Dennis Kubes [EMAIL PROTECTED] wrote:

 Is it possible to set a multiline text input in streaming to be used as a
 single record? For example, say I wanted to scan a webpage for a specific
 regex that spans multiple lines; is this possible in streaming?

 Dennis



Question on Streaming

2008-09-09 Thread Jim Twensky
Hello, I need to use Hadoop Streaming to run several instances of a single
program on different files. Before doing it, I wrote a simple test
application as the mapper, which basically outputs the standard input
without doing anything useful. So it looks like the following:

---echo.sh--
echo Running mapper, input is $1
---echo.sh--

For the input, I created a single text file, input.txt, that has the numbers
from 1 to 10, one per line, so it goes like:

---input.txt---
1
2
..
10
---input.txt---

I uploaded input.txt to the /stream directory on HDFS and then ran the Hadoop
Streaming utility as follows:

bin/hadoop jar hadoop-0.18.0-streaming.jar  \
-input /stream \
-output /trash \
-mapper echo.sh \
-file echo.sh \
-jobconf mapred.reduce.tasks=0

and from what I understood in the streaming tutorial, I expected that each
mapper would run an instance of echo.sh with one of the lines in input.txt,
so I expected to get output of the form

Running mapper, input is 2
Running mapper, input is 5
...
and so on. Instead, I got only two output files, part-0 and part-1, which
contain the string "Running mapper, input is " with nothing after it. As far
as I can see, the mappers ran the mapper script echo.sh without the standard
input. I basically followed the tutorial and I'm confused now, so could you
please tell me what I'm missing here?
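
While re-reading the streaming docs just now, I noticed they say the mapper
gets its records on standard input rather than as command-line arguments, so I
suspect the script would need to look more like the following; please confirm
whether that is the right mental model:

---echo.sh--
#!/bin/sh
# read one record per line from standard input, as streaming delivers them
while read line; do
  echo "Running mapper, input is $line"
done
---echo.sh--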

Thanks in advance,
Jim


Different Map and Reduce output types - weird error message

2008-08-29 Thread Jim Twensky
Hello, I am working on a Hadoop application that produces different
(key,value) types after the map and reduce phases so I'm aware that I need
to use JobConf.setMapOutputKeyClass and JobConf.setMapOutputValueClass.
However, I still keep getting the following runtime error when I run my
application:

java.io.IOException: wrong value class: org.apache.hadoop.io.FloatWritable
is not class org.apache.hadoop.io.IntWritable
at
org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:938)
at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer$1.collect(MapTask.java:414)
at
test.DistributionCreator$Reduce.reduce(DistributionCreator.java:104)
at
test.DistributionCreator$Reduce.reduce(DistributionCreator.java:85)
at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:439)
at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:418)
at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:604)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:193)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1804)

My mapper class goes like:

  public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, IntWritable> {

(...)

public void map(LongWritable key, Text value,
            OutputCollector<IntWritable, IntWritable> output,
Reporter reporter) throws IOException {

 (...)
}

}

and my Reducer goes like:

  public static class Reduce extends MapReduceBase
    implements Reducer<IntWritable, IntWritable, IntWritable, FloatWritable> {

(...)

    public void reduce(IntWritable key, Iterator<IntWritable> values,
                       OutputCollector<IntWritable, FloatWritable> output,
   Reporter reporter) throws IOException {


float sum = 0;

(...)

output.collect(key, new FloatWritable(sum));

 }

   }

   and the corresponding part of my configuration goes as follows:

conf.setMapOutputValueClass(IntWritable.class);
conf.setMapOutputKeyClass(IntWritable.class);
conf.setOutputKeyClass(IntWritable.class);
conf.setOutputValueClass(FloatWritable.class);

   which I believe is consistent with the mapper and the reducer classes.
Can you please let me know what I'm missing here?

   Thanks in advance,

   Jim


Re: Different Map and Reduce output types - weird error message

2008-08-29 Thread Jim Twensky
Here is the relevant part of my mapper:

(...)

private final static IntWritable one = new IntWritable(1);
private IntWritable bound = new IntWritable();

(...)

while(...) {

output.collect(bound,one);

   }

   so I'm not sure why my mapper tries to output a FloatWritable.




On Fri, Aug 29, 2008 at 6:17 PM, Owen O'Malley [EMAIL PROTECTED] wrote:

 The error message is saying that your map tried to output a FloatWritable.



Re: Different Map and Reduce output types - weird error message

2008-08-29 Thread Jim Twensky
I think I've found the problem. When I removed the following line:

conf.setCombinerClass(Reduce.class);

everything worked fine. During the map phase, when Reduce.class is used as
the combiner, the final map (key,value) pairs are written with the Reducer's
output types, which contradict the declared Mapper output types. If I'm
correct, am I supposed to write a separate reducer class to use as the local
combiner in order to speed things up?
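
If so, I guess it would look something like the sketch below, where the
combiner's input and output types both match the map output types
(IntWritable, IntWritable); please correct me if I've misunderstood:

    public static class Combine extends MapReduceBase
        implements Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

        public void reduce(IntWritable key, Iterator<IntWritable> values,
                           OutputCollector<IntWritable, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();   // partial count on the map side
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    // conf.setCombinerClass(Combine.class);   (the real Reduce still emits FloatWritable)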

Jim


On Fri, Aug 29, 2008 at 6:30 PM, Jim Twensky [EMAIL PROTECTED] wrote:

 Here is the relevant part of my mapper:

 (...)

 private final static IntWritable one = new IntWritable(1);
 private IntWritable bound = new IntWritable();

 (...)

 while(...) {

 output.collect(bound,one);

}

so I'm not sure why my mapper tries to output a FloatWritable.




 On Fri, Aug 29, 2008 at 6:17 PM, Owen O'Malley [EMAIL PROTECTED] wrote:

 The error message is saying that your map tried to output a FloatWritable.