Intermittent "Already Being Created Exception"

2009-05-28 Thread Palleti, Pallavi
Hi all,

 

 I have a 50-node cluster and I am trying to write logs of about 1 GB
each into HDFS. I need to write them in a temporal fashion: for every
15 minutes' worth of data, I close the previously opened file and create
a new one. The snippet of code is:

if (...) {   // when a new 15-minute window starts
    if (out != null) {
        out.close();
    }
    String outFileStr = outputDir + File.separator + outDate +
            File.separator + outFileSuffix + "." + outputMinute;
    System.out.println("Creating outFileStr: " + outFileStr);
    Path outFile = new Path(outFileStr);
    out = fs.create(outFile);   // It throws the exception here, saying "Already Being Created"
}

 

When I run this code, I intermittently get "Already Being Created"
exceptions. I am not using any threads and this code runs sequentially.
I went through the previous mailing-list posts but couldn't find much
information. Can anyone please tell me why this is happening and how to
avoid it?

 

Thanks

Pallavi



Can I have Reducer with No Output?

2009-05-28 Thread dealmaker

Hi,
  I have maps that do most of the work and output their data to a reducer,
so basically the key is a constant, and the reducer combines all the input
from the maps into a file and does a "LOAD DATA" of that file into a MySQL db.
So there won't be any output.collect() in the reduce function.  But when I
write the Reducer class, the compiler keeps complaining about the two missing
types at the end of "private static class Reducer extends MapReduceBase
implements Reducer", which doesn't have output.collect().
What should I put in there?

Do I even need a Reducer at all?  I think that having only one reducer save
the file would be more efficient than having all the maps save data into the
same file individually.  Is there a better way to do this?

Thanks.
-- 
View this message in context: 
http://www.nabble.com/Can-I-have-Reducer-with-No-Output--tp23756950p23756950.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Can I have Reducer with No Output?

2009-05-28 Thread tim robertson
Yes you can do this.

It is complaining because you are not declaring the output types in
the method signature, but you will not use them anyway.

So please try

private static class Reducer extends MapReduceBase implements
Reducer {
...

The output format will be a TextOutputFormat, but it will not do anything.

Cheers

Tim
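
For example, something along these lines compiles (a sketch only: the map
output types are assumed here to be <Text, Text>, and NullWritable is used for
the never-emitted output types; adjust to whatever your job actually uses):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch: assumes the map output is <Text, Text>; the two output types are
// declared as NullWritable because nothing is ever collected.
public class NoOutputReducer extends MapReduceBase
    implements Reducer<Text, Text, NullWritable, NullWritable> {

  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<NullWritable, NullWritable> output, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      String line = values.next().toString();
      // Write 'line' wherever the LOAD DATA file is being built;
      // output.collect() is never called.
    }
  }
}

(The class is renamed here so it doesn't clash with the
org.apache.hadoop.mapred.Reducer interface it implements.)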



On Thu, May 28, 2009 at 9:57 AM, dealmaker  wrote:
>
> Hi,
>  I have maps that do most of the work, and they output the data into a
> reducer, so basically key is a constant, and the reducer combines all the
> input from maps into a file and it does "LOAD_DATA" the file into mysql db.
> So, there won't be any output.collect ( ) in reducer function.  But when I
> write the class Reducer, the compiler keeps complaining about the missing 2
> types at the end of in "private static class Reducer extends MapReduceBase
> implements Reducer " which doesn't have output.collect ( ).
> What should I put in there?
>
> Do I even need Reducer at all?  I think that only one reducer saves the file
> will be more efficient, rather than having all the map to save data into
> same file individually.  Any better way to do this?
>
> Thanks.
> --
> View this message in context: 
> http://www.nabble.com/Can-I-have-Reducer-with-No-Output--tp23756950p23756950.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


Re: Can I have Reducer with No Output?

2009-05-28 Thread Jothi Padmanabhan
If your reducer does not write anything, you could look at NullOutputFormat
as well.

Jothi
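
In the driver that would look roughly like the sketch below (the class name
and input path are placeholders; mapper/reducer setup is elided):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NullOutputFormat;

public class NoOutputJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(NoOutputJob.class);
    conf.setJobName("load-into-mysql");
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    // The reducer writes to MySQL itself, so discard the framework output
    // entirely instead of producing empty part-* files.
    conf.setOutputFormat(NullOutputFormat.class);
    // conf.setMapperClass(...); conf.setReducerClass(...); output key/value types, etc.
    JobClient.runJob(conf);
  }
}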


On 5/28/09 1:38 PM, "tim robertson"  wrote:

> Yes you can do this.
> 
> It is complaining because you are not declaring the output types in
> the method signature, but you will not use them anyway.
> 
> So please try
> 
> private static class Reducer extends MapReduceBase implements
> Reducer {
> ...
> 
> The output format will be a TextOutputFormat, but it will not do anything.
> 
> Cheers
> 
> Tim
> 
> 
> 
> On Thu, May 28, 2009 at 9:57 AM, dealmaker  wrote:
>> 
>> Hi,
>>  I have maps that do most of the work, and they output the data into a
>> reducer, so basically key is a constant, and the reducer combines all the
>> input from maps into a file and it does "LOAD_DATA" the file into mysql db.
>> So, there won't be any output.collect ( ) in reducer function.  But when I
>> write the class Reducer, the compiler keeps complaining about the missing 2
>> types at the end of in "private static class Reducer extends MapReduceBase
>> implements Reducer " which doesn't have output.collect ( ).
>> What should I put in there?
>> 
>> Do I even need Reducer at all?  I think that only one reducer saves the file
>> will be more efficient, rather than having all the map to save data into
>> same file individually.  Any better way to do this?
>> 
>> Thanks.
>> --
>> View this message in context:
>> http://www.nabble.com/Can-I-have-Reducer-with-No-Output--tp23756950p23756950.
>> html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>> 
>> 



Re: hadoop hardware configuration

2009-05-28 Thread Steve Loughran

Patrick Angeles wrote:

Sorry for cross-posting, I realized I sent the following to the hbase list
when it's really more a Hadoop question.


This is an interesting question. Obviously, as an HP employee, you must 
assume that I'm biased when I say HP DL160 servers are good value for 
the workers, though our blade systems are very good for high physical 
density, provided you have the power to fill up the rack.




2 x Hadoop Master (and Secondary NameNode)

   - 2 x 2.3Ghz Quad Core (Low Power Opteron -- 2376 HE @ 55W)
   - 16GB DDR2-800 Registered ECC Memory
   - 4 x 1TB 7200rpm SATA II Drives
   - Hardware RAID controller
   - Redundant Power Supply
   - Approx. 390W power draw (1.9amps 208V)
   - Approx. $4000 per unit


I do not know what the advantages of that many cores are on a NN. 
Someone needs to do some experiments. I do know you need enough RAM to 
hold the index in memory, and you may want to go for a bigger block size 
to keep the index size down.




6 x Hadoop Task Nodes

   - 1 x 2.3Ghz Quad Core (Opteron 1356)
   - 8GB DDR2-800 Registered ECC Memory
   - 4 x 1TB 7200rpm SATA II Drives
   - No RAID (JBOD)
   - Non-Redundant Power Supply
   - Approx. 210W power draw (1.0amps 208V)
   - Approx. $2000 per unit

I had some specific questions regarding this configuration...




   1. Is hardware RAID necessary for the master node?



You need a good story to ensure that loss of a disk on the master 
doesn't lose the filesystem. I like RAID there, but the alternative is 
to push the stuff out over the network to other storage you trust. That 
could be NFS-mounted RAID storage, or it could be NFS-mounted JBOD. 
Whatever your chosen design, test that it works before you go live: run 
the cluster, simulate different failures, and see how well the 
hardware/ops team handles them.


Keep an eye on where that data goes, because when the NN runs out of 
file storage the consequences can be pretty dramatic (i.e. the cluster 
doesn't come up unless you edit the edit log by hand).



   2. What is a good processor-to-storage ratio for a task node with 4TB of
   raw storage? (The config above has 1 core per 1TB of raw storage.)


That really depends on the work you are doing...the bytes in/out to CPU 
work, and the size of any memory structures that are built up over the run.


With 1 core per physical disk, you get the bandwidth of a single disk 
per CPU; for some IO-intensive work you can make the case for two 
disks/CPU -one in, one out, but then you are using more power, and 
if/when you want to add more storage, you have to pull out the disks to 
stick in new ones. If you go for more CPUs, you will probably need more 
RAM to go with it.



   3. Am I better off using dual quads for a task node, with a higher power
   draw? Dual quad task node with 16GB RAM and 4TB storage costs roughly $3200,
   but draws almost 2x as much power. The tradeoffs are:
  1. I will get more CPU per dollar and per watt.
  2. I will only be able to fit 1/2 as much dual quad machines into a
  rack.
  3. I will get 1/2 the storage capacity per watt.
  4. I will get less I/O throughput overall (less spindles per core)


First there is the algorithm itself, and whether you are IO or CPU 
bound. Most MR jobs that I've encountered are fairly IO bound: without 
indexes, every lookup has to stream through all the data, so it's power 
inefficient and IO limited. But if you are trying to do higher-level 
stuff than just lookups, then you will be doing more CPU work.


Then there is the question of where your electricity comes from, what 
the limits for the room are, whether you are billed on power drawn or 
quoted PSU draw, what the HVAC limits are, what the maximum allowed 
weight per rack is, etc, etc.


I'm a fan of low-Joule work, though we don't have any benchmarks yet of 
the power efficiency of different clusters; the number of MJ used to do 
a terasort. I'm debating doing some single-CPU tests for this on my 
laptop, as the battery knows how much gets used up by some work.



   4. In planning storage capacity, how much spare disk space should I take
   into account for 'scratch'? For now, I'm assuming 1x the input data size.


That you should probably be able to determine from experimental work on 
smaller datasets. Some maps can throw out a lot of data, and most reduces 
do actually reduce the final amount.



-Steve

(Disclaimer: I'm not making any official recommendations for hardware 
here, just making my opinions known. If you do want an official 
recommendation from HP, talk to your reseller or account manager, 
someone will look at your problem in more detail and make some 
suggestions. If you have any code/data that could be shared for 
benchmarking, that would help validate those suggestions)




Issue with usage of fs -test

2009-05-28 Thread pankaj jairath

Hello,

I am facing a strange issue where /fs -test -e/ fails but /fs 
-ls/ succeeds in listing the file. Following is the output of such a run:


bin]$ hadoop fs -ls /projects/myproject///.done
Found 1 items
-rw---   3 user hdfs  0 2009-03-19 22:28 
/projects/myproject///.done

[...@mymachine bin]$ echo $?
0
[...@mymachine bin]$ hadoop fs -test -e /projects/myproject///.done
[...@mymachine bin]$ echo $?
1


What is the cause of such behaviour? Any pointers would be much appreciated. 
(HADOOP_CONF_DIR and HADOOP_HOME are set correctly in the env vars.)


Thanks
Pankaj 






RE: Issue with usage of fs -test

2009-05-28 Thread Koji Noguchi
Maybe 
https://issues.apache.org/jira/browse/HADOOP-3792 ?

Koji

-Original Message-
From: pankaj jairath [mailto:pjair...@yahoo-inc.com] 
Sent: Thursday, May 28, 2009 4:49 AM
To: core-user@hadoop.apache.org
Subject: Issue with usage of fs -test

Hello,

I am facing a strange issue, where in the /fs -test -e/ fails and /fs 
-ls/ succeeds to list the file. Following is the grep of such a  result
:

bin]$ hadoop fs -ls /projects/myproject///.done
Found 1 items
-rw---   3 user hdfs  0 2009-03-19 22:28 
/projects/myproject///.done
[...@mymachine bin]$ echo $?
0
[...@mymachine bin]$ hadoop fs -test -e
/projects/myproject///.done
[...@mymachine bin]$ echo $?
1


What is the cause of such a behaviour, any pointers would much be
appreciated. (HADOOP_CONF_DIR and HADOOP_HOME are set correctly at env
vars)


Thanks
Pankaj 





Efficient algorithm for many-to-many reduce-side join?

2009-05-28 Thread Stuart White
I need to do a reduce-side join of two datasets.  It's a many-to-many
join; that is, each dataset can have multiple records with any given
key.

Every description of a reduce-side join I've seen involves
constructing your keys out of your mapper such that records from one
dataset will be presented to the reducers before records from the
second dataset.  I should "hold on" to the value from the one dataset
and remember it as I iterate across the values from the second
dataset.

This seems like it only works well for one-to-many joins (when one of
your datasets will only have a single record with any given key).
This scales well because you're only remembering one value.

In a many-to-many join, if you apply this same algorithm, you'll need
to remember all values from one dataset, which of course will be
problematic (and won't scale) when dealing with large datasets with
large numbers of records with the same keys.

Does an efficient algorithm exist for a many-to-many reduce-side join?


InputFormat for fixed-width records?

2009-05-28 Thread Stuart White
I need to process a dataset that contains text records of fixed length
in bytes.  For example, each record may be 100 bytes in length, with
the first field being the first 10 bytes, the second field being the
second 10 bytes, etc...  There are no newlines in the file.  Field
values have been either whitespace-padded or truncated to fit within
the specific locations in these fixed-width records.

Does Hadoop have an InputFormat to support processing of such files?
I looked but couldn't find one.

Of course, I could pre-process the file (outside of Hadoop) to put
newlines at the end of each record, but I'd prefer not to require such
a prep step.

Thanks.


Re: Issue with usage of fs -test

2009-05-28 Thread pankaj jairath
Thanks, Koji. This is the issue I am facing and I have been using 
version 0.18.x.

-/Pankaj

Koji Noguchi wrote:
Maybe 
https://issues.apache.org/jira/browse/HADOOP-3792 ?


Koji

-Original Message-
From: pankaj jairath [mailto:pjair...@yahoo-inc.com] 
Sent: Thursday, May 28, 2009 4:49 AM

To: core-user@hadoop.apache.org
Subject: Issue with usage of fs -test

Hello,

I am facing a strange issue, where in the /fs -test -e/ fails and /fs 
-ls/ succeeds to list the file. Following is the grep of such a  result

:

bin]$ hadoop fs -ls /projects/myproject///.done
Found 1 items
-rw---   3 user hdfs  0 2009-03-19 22:28 
/projects/myproject///.done

[...@mymachine bin]$ echo $?
0
[...@mymachine bin]$ hadoop fs -test -e
/projects/myproject///.done
[...@mymachine bin]$ echo $?
1


What is the cause of such a behaviour, any pointers would much be
appreciated. (HADOOP_CONF_DIR and HADOOP_HOME are set correctly at env
vars)


Thanks
Pankaj 




  





Re: InputFormat for fixed-width records?

2009-05-28 Thread Tom White
Hi Stuart,

There isn't an InputFormat that comes with Hadoop to do this. Rather
than pre-processing the file, it would be better to implement your own
InputFormat. Subclass FileInputFormat and provide an implementation of
getRecordReader() that returns your implementation of RecordReader to
read fixed width records. In the next() method you would do something
like:

byte[] buf = new byte[100];
IOUtils.readFully(in, buf, pos, 100);
pos += 100;

You would also need to check for the end of the stream. See
LineRecordReader for some ideas. You'll also have to handle finding
the start of records for a split, which you can do by looking at the
offset and seeking to the next multiple of 100.

If the RecordReader was a RecordReader
(no keys) then it would return each record as a byte array to the
mapper, which would then break it into fields. Alternatively, you
could do it in the RecordReader, and use your own type which
encapsulates the fields for the value.

Hope this helps.

Cheers,
Tom
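
A rough sketch of such a reader (untested; it assumes uncompressed input,
a file length that is a multiple of 100 bytes, and the old mapred API) might
look like:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;

// Returned from getRecordReader() in a FileInputFormat subclass.
public class FixedWidthRecordReader
    implements RecordReader<LongWritable, BytesWritable> {

  private static final int RECORD_LEN = 100;   // assumed record size
  private final FSDataInputStream in;
  private final long start;                    // first byte this split owns
  private final long end;                      // first byte past this split
  private long pos;

  public FixedWidthRecordReader(JobConf conf, FileSplit split) throws IOException {
    FileSystem fs = split.getPath().getFileSystem(conf);
    in = fs.open(split.getPath());
    // Round the split start up to the next multiple of the record length.
    start = ((split.getStart() + RECORD_LEN - 1) / RECORD_LEN) * RECORD_LEN;
    end = split.getStart() + split.getLength();
    pos = start;
    if (pos < end) {
      in.seek(pos);
    }
  }

  public boolean next(LongWritable key, BytesWritable value) throws IOException {
    if (pos >= end) {
      return false;                            // remaining records belong to the next split
    }
    byte[] buf = new byte[RECORD_LEN];
    IOUtils.readFully(in, buf, 0, RECORD_LEN); // file length assumed to be a multiple of 100
    key.set(pos);
    value.set(buf, 0, RECORD_LEN);
    pos += RECORD_LEN;
    return true;
  }

  public LongWritable createKey() { return new LongWritable(); }
  public BytesWritable createValue() { return new BytesWritable(); }
  public long getPos() throws IOException { return pos; }
  public float getProgress() throws IOException {
    return end == start ? 1.0f : Math.min(1.0f, (pos - start) / (float) (end - start));
  }
  public void close() throws IOException { in.close(); }
}

The matching FileInputFormat subclass would just return this from
getRecordReader(), and the mapper can then carve each 100-byte record into
its 10-byte fields.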

On Thu, May 28, 2009 at 1:15 PM, Stuart White  wrote:
> I need to process a dataset that contains text records of fixed length
> in bytes.  For example, each record may be 100 bytes in length, with
> the first field being the first 10 bytes, the second field being the
> second 10 bytes, etc...  There are no newlines on the file.  Field
> values have been either whitespace-padded or truncated to fit within
> the specific locations in these fixed-width records.
>
> Does Hadoop have an InputFormat to support processing of such files?
> I looked but couldn't find one.
>
> Of course, I could pre-process the file (outside of Hadoop) to put
> newlines at the end of each record, but I'd prefer not to require such
> a prep step.
>
> Thanks.
>


Re: Appending to a file / updating a file

2009-05-28 Thread Sasha Dolgy
append isn't supported without modifying the configuration file for
hadoop.  check out the mailing list threads ... i've sent a post in
the past explaining how to enable it.

On Thu, May 28, 2009 at 2:46 PM, Olivier Smadja  wrote:
> Hello,
>
> I'm trying hadoop for the first time and I'm just trying to create a file
> and append some text in it with the following code:
>
>
> import java.io.IOException;
>
> import org.apache.hadoop.conf. Configuration;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> /**
>  * @author olivier
>  *
>  */
> public class HadoopIO {
>
>    public static void main(String[] args) throws IOException {
>
>
>        String directory = "/Users/olivier/tmp/hadoop-data";
>        Configuration conf = new Configuration(true);
>        Path path = new Path(directory);
>        // Create the File system
>        FileSystem fs = path.getFileSystem(conf);
>        // Sets the working directory
>        fs.setWorkingDirectory(path);
>
>        System.out.println(fs.getWorkingDirectory());
>
>        // Creates a files
>        FSDataOutputStream out = fs.create(new Path("test.txt"));
>        out.writeBytes("Testing hadoop - first line");
>        out.close();
>        // then try to append something
>        out = fs.append(new Path("test.txt"));
>        out.writeBytes("Testing hadoop - second line");
>        out.close();
>
>        fs.close();
>
>
>    }
>
> }
>
>
> but I receive the following exception:
>
> Exception in thread "main" java.io.IOException: Not supported
>    at
> org.apache.hadoop.fs.ChecksumFileSystem.append(ChecksumFileSystem.java:290)
>    at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:525)
>    at com.neodatis.odb.hadoop.HadoopIO.main(HadoopIO.java:38)
>
>
> 1) Can someone tell me what am i doing wrong?
>
> 2) How can I update the file (for example, just update the first 10 bytes of
> the file)?
>
>
> Thanks,
> Olivier
>



-- 
Sasha Dolgy
sasha.do...@gmail.com


Re: Appending to a file / updating a file

2009-05-28 Thread Olivier Smadja
Hi Sacha!

Thanks for the quick answer. Is there a simple way to search the mailing
list? by text or by author.

At http://mail-archives.apache.org/mod_mbox/hadoop-core-user/ I only see a
browse per month...

Thanks,
Olivier


On Thu, May 28, 2009 at 10:57 AM, Sasha Dolgy  wrote:

> append isn't supported without modifying the configuration file for
> hadoop.  check out the mailling list threads ... i've sent a post in
> the past explaining how to enable it.
>
> On Thu, May 28, 2009 at 2:46 PM, Olivier Smadja  wrote:
> > Hello,
> >
> > I'm trying hadoop for the first time and I'm just trying to create a file
> > and append some text in it with the following code:
> >
> >
> > import java.io.IOException;
> >
> > import org.apache.hadoop.conf. Configuration;
> > import org.apache.hadoop.fs.FSDataOutputStream;
> > import org.apache.hadoop.fs.FileSystem;
> > import org.apache.hadoop.fs.Path;
> >
> > /**
> >  * @author olivier
> >  *
> >  */
> > public class HadoopIO {
> >
> >public static void main(String[] args) throws IOException {
> >
> >
> >String directory = "/Users/olivier/tmp/hadoop-data";
> >Configuration conf = new Configuration(true);
> >Path path = new Path(directory);
> >// Create the File system
> >FileSystem fs = path.getFileSystem(conf);
> >// Sets the working directory
> >fs.setWorkingDirectory(path);
> >
> >System.out.println(fs.getWorkingDirectory());
> >
> >// Creates a files
> >FSDataOutputStream out = fs.create(new Path("test.txt"));
> >out.writeBytes("Testing hadoop - first line");
> >out.close();
> >// then try to append something
> >out = fs.append(new Path("test.txt"));
> >out.writeBytes("Testing hadoop - second line");
> >out.close();
> >
> >fs.close();
> >
> >
> >}
> >
> > }
> >
> >
> > but I receive the following exception:
> >
> > Exception in thread "main" java.io.IOException: Not supported
> >at
> >
> org.apache.hadoop.fs.ChecksumFileSystem.append(ChecksumFileSystem.java:290)
> >at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:525)
> >at com.neodatis.odb.hadoop.HadoopIO.main(HadoopIO.java:38)
> >
> >
> > 1) Can someone tell me what am i doing wrong?
> >
> > 2) How can I update the file (for example, just update the first 10 bytes
> of
> > the file)?
> >
> >
> > Thanks,
> > Olivier
> >
>
>
>
> --
> Sasha Dolgy
> sasha.do...@gmail.com
>


Re: Appending to a file / updating a file

2009-05-28 Thread Sasha Dolgy
http://www.mail-archive.com/core-user@hadoop.apache.org/msg10002.html


On Thu, May 28, 2009 at 3:03 PM, Olivier Smadja  wrote:
> Hi Sacha!
>
> Thanks for the quick answer. Is there a simple way to search the mailing
> list? by text or by author.
>
> At http://mail-archives.apache.org/mod_mbox/hadoop-core-user/ I only see a
> browse per month...
>
> Thanks,
> Olivier
>
>
> On Thu, May 28, 2009 at 10:57 AM, Sasha Dolgy  wrote:
>
>> append isn't supported without modifying the configuration file for
>> hadoop.  check out the mailling list threads ... i've sent a post in
>> the past explaining how to enable it.
>>
>> On Thu, May 28, 2009 at 2:46 PM, Olivier Smadja  wrote:
>> > Hello,
>> >
>> > I'm trying hadoop for the first time and I'm just trying to create a file
>> > and append some text in it with the following code:
>> >
>> >
>> > import java.io.IOException;
>> >
>> > import org.apache.hadoop.conf. Configuration;
>> > import org.apache.hadoop.fs.FSDataOutputStream;
>> > import org.apache.hadoop.fs.FileSystem;
>> > import org.apache.hadoop.fs.Path;
>> >
>> > /**
>> >  * @author olivier
>> >  *
>> >  */
>> > public class HadoopIO {
>> >
>> >    public static void main(String[] args) throws IOException {
>> >
>> >
>> >        String directory = "/Users/olivier/tmp/hadoop-data";
>> >        Configuration conf = new Configuration(true);
>> >        Path path = new Path(directory);
>> >        // Create the File system
>> >        FileSystem fs = path.getFileSystem(conf);
>> >        // Sets the working directory
>> >        fs.setWorkingDirectory(path);
>> >
>> >        System.out.println(fs.getWorkingDirectory());
>> >
>> >        // Creates a files
>> >        FSDataOutputStream out = fs.create(new Path("test.txt"));
>> >        out.writeBytes("Testing hadoop - first line");
>> >        out.close();
>> >        // then try to append something
>> >        out = fs.append(new Path("test.txt"));
>> >        out.writeBytes("Testing hadoop - second line");
>> >        out.close();
>> >
>> >        fs.close();
>> >
>> >
>> >    }
>> >
>> > }
>> >
>> >
>> > but I receive the following exception:
>> >
>> > Exception in thread "main" java.io.IOException: Not supported
>> >    at
>> >
>> org.apache.hadoop.fs.ChecksumFileSystem.append(ChecksumFileSystem.java:290)
>> >    at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:525)
>> >    at com.neodatis.odb.hadoop.HadoopIO.main(HadoopIO.java:38)
>> >
>> >
>> > 1) Can someone tell me what am i doing wrong?
>> >
>> > 2) How can I update the file (for example, just update the first 10 bytes
>> of
>> > the file)?
>> >
>> >
>> > Thanks,
>> > Olivier
>> >
>>
>>
>>
>> --
>> Sasha Dolgy
>> sasha.do...@gmail.com
>>
>



-- 
Sasha Dolgy
sasha.do...@gmail.com


Appending to a file / updating a file

2009-05-28 Thread Olivier Smadja
Hello,

I'm trying hadoop for the first time and I'm just trying to create a file
and append some text in it with the following code:


import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * @author olivier
 *
 */
public class HadoopIO {

    public static void main(String[] args) throws IOException {

        String directory = "/Users/olivier/tmp/hadoop-data";
        Configuration conf = new Configuration(true);
        Path path = new Path(directory);
        // Create the File system
        FileSystem fs = path.getFileSystem(conf);
        // Sets the working directory
        fs.setWorkingDirectory(path);

        System.out.println(fs.getWorkingDirectory());

        // Creates a file
        FSDataOutputStream out = fs.create(new Path("test.txt"));
        out.writeBytes("Testing hadoop - first line");
        out.close();
        // then try to append something
        out = fs.append(new Path("test.txt"));
        out.writeBytes("Testing hadoop - second line");
        out.close();

        fs.close();
    }
}


but I receive the following exception:

Exception in thread "main" java.io.IOException: Not supported
at
org.apache.hadoop.fs.ChecksumFileSystem.append(ChecksumFileSystem.java:290)
at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:525)
at com.neodatis.odb.hadoop.HadoopIO.main(HadoopIO.java:38)


1) Can someone tell me what am i doing wrong?

2) How can I update the file (for example, just update the first 10 bytes of
the file)?


Thanks,
Olivier


Re: hadoop hardware configuration

2009-05-28 Thread Brian Bockelman


On May 28, 2009, at 5:02 AM, Steve Loughran wrote:


Patrick Angeles wrote:
Sorry for cross-posting, I realized I sent the following to the  
hbase list

when it's really more a Hadoop question.


This is an interesting question. Obviously as an HP employee you  
must assume that I'm biased when I say HP DL160 servers are good   
value for the workers, though our blade systems are very good for a  
high physical density -provided you have the power to fill up the  
rack.


:)

As an HP employee, can you provide us with coupons?




2 x Hadoop Master (and Secondary NameNode)
  - 2 x 2.3Ghz Quad Core (Low Power Opteron -- 2376 HE @ 55W)
  - 16GB DDR2-800 Registered ECC Memory
  - 4 x 1TB 7200rpm SATA II Drives
  - Hardware RAID controller
  - Redundant Power Supply
  - Approx. 390W power draw (1.9amps 208V)
  - Approx. $4000 per unit


I do not know the what the advantages of that many cores are on a  
NN. Someone needs to do some experiments. I do know you need enough  
RAM to hold the index in memory, and you may want to go for a bigger  
block size to keep the index size down.




Despite my trying, I've never been able to come even close to pegging  
the CPUs on our NN.


I'd recommend going for the fastest dual-cores which are affordable --  
latency is king.


Of course, with the size of your cluster, I'd spend a little less  
money here and get more disk space.





6 x Hadoop Task Nodes
  - 1 x 2.3Ghz Quad Core (Opteron 1356)
  - 8GB DDR2-800 Registered ECC Memory
  - 4 x 1TB 7200rpm SATA II Drives
  - No RAID (JBOD)
  - Non-Redundant Power Supply
  - Approx. 210W power draw (1.0amps 208V)
  - Approx. $2000 per unit
I had some specific questions regarding this configuration...




  1. Is hardware RAID necessary for the master node?



You need a good story to ensure that loss of a disk on the master  
doesn't lose the filesystem. I like RAID there, but the alternative  
is to push the stuff out over the network to other storage you  
trust. That could be NFS-mounted RAID storage, it could be NFS  
mounted JBOD. Whatever your chosen design, test it works before you  
go live by running the cluster then simulate different failures, see  
how well the hardware/ops team handles it.


Keep an eye on where that data goes, because when the NN runs out of  
file storage, the consequences can be pretty dramatic (i,e the  
cluster doesnt come up unless you edit the editlog by hand)


We do both -- push the disk image out to NFS and have a mirrored SAS  
hard drives on the namenode.  The SAS drives appear to be overkill.




  2. What is a good processor-to-storage ratio for a task node with  
4TB of

  raw storage? (The config above has 1 core per 1TB of raw storage.)



We're data hungry locally -- I'd put in bigger hard drives.  The 1.5TB  
Seagate drives seem to have passed their teething issues, and are at a  
pretty sweet price point.  They will only scale up to 60 IOPS, so make  
sure your workflows don't have lots of random I/O.


As Steve mentions below, the rest is really up to your algorithm.  Do  
you need 1 CPU second / byte?  If so, buy more CPUs.  Do you need .1  
CPU second / MB?  If so, buy more disks.


Brian



That really depends on the work you are doing...the bytes in/out to  
CPU work, and the size of any memory structures that are built up  
over the run.


With 1 core per physical disk, you get the bandwidth of a single  
disk per CPU; for some IO-intensive work you can make the case for  
two disks/CPU -one in, one out, but then you are using more power,  
and if/when you want to add more storage, you have to pull out the  
disks to stick in new ones. If you go for more CPUs, you will  
probably need more RAM to go with it.


  3. Am I better off using dual quads for a task node, with a  
higher power
  draw? Dual quad task node with 16GB RAM and 4TB storage costs  
roughly $3200,

  but draws almost 2x as much power. The tradeoffs are:
 1. I will get more CPU per dollar and per watt.
 2. I will only be able to fit 1/2 as much dual quad machines  
into a

 rack.
 3. I will get 1/2 the storage capacity per watt.
 4. I will get less I/O throughput overall (less spindles per  
core)


First there is the algorithm itself, and whether you are IO or CPU  
bound. Most MR jobs that I've encountered are fairly IO bound - 
without indexes, every lookup has to stream through all the data, so  
it's power inefficient and IO limited. but if you are trying to do  
higher level stuff than just lookup, then you will be doing more CPU- 
work


Then there is the question of where your electricity comes from,  
what the limits for the room are, whether you are billed on power  
drawn or quoted PSU draw, what the HVAC limits are, what the maximum  
allowed weight per rack is, etc, etc.


I'm a fan of low-Joule work, though we don't have any benchmarks yet  
of the power efficiency of different clusters; the number of MJ used  
to do a terasort. I'm debating doing some single-CPU tests for this on  
my laptop, as the battery knows how much gets used up by some work.

Re: SequenceFile and streaming

2009-05-28 Thread Tom White
Hi Walter,

On Thu, May 28, 2009 at 6:52 AM, walter steffe  wrote:
> Hello
>  I am a new user and I would like to use hadoop streaming with
> SequenceFile in both input and output side.
>
> -The first difficulty arises from the lack of a simple tool to generate
> a SequenceFile starting from a set of files in a given directory.
> I would like to have something similar to "tar -cvf file.tar foo/"
> This should work also in the opposite direction like "tar -xvf file.tar"

There's a tool for turning tar files into sequence files here:
http://stuartsierra.com/2008/04/24/a-million-little-files

>
> -An other important feature that I would like to see is the possibility
> to feed the mapper stdin with the whole content of a file (extracted
> from the file SequenceFile) disregarding the key.

Have a look at SequenceFileAsTextInputFormat which will do this for
you (except the key is the sequence file's key).

> Using each file as a tar archive, I would like to be able to do:
>
>  $HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
>                  -input "/user/me/inputSequenceFile"  \
>                  -output "/user/me/outputSequenceFile"  \
>                  -inputformat SequenceFile
>                  -outputformat SequenceFile
>                  -mapper myscript.sh
>                  -reducer NONE
>
>  myscript.sh should work as a filter which takes its input from
>  stdin and puts the output on stdout:
>
>  tar -x
>  "do something on the generated dir and create an outputfile"
>  cat outputfile
>
> The output file should (automatically) go into the outputSequenceFile.
>
> I think that this would be a very useful schema which fits well with
> the mapreduce requirements on one side and with the unix commands on the
> other side. It should not be too difficult to implement the tools
> needed for that.

I totally agree - having more tools to better integrate sequence files
and map files with unix tools would be very handy.

Tom

>
>
> Walter
>
>
>
>
>
>
>
>


Re: Appending to a file / updating a file

2009-05-28 Thread Olivier Smadja
Thanks Sacha,

My hdfs-site.xml now looks like this (hadoop-site.xml seems to be
deprecated):

<configuration>
  <property>
    <name>dfs.support.append</name>
    <value>true</value>
  </property>
</configuration>

But I continue receiving the exception.

Checking the hadoop source code, I saw

 public FSDataOutputStream append(Path f, int bufferSize,
  Progressable progress) throws IOException {
throw new IOException("Not supported");
  }

in the class org.apache.hadoop.fs.ChecksumFileSystem, and this is where the
exception is thrown. My fileSystem is a LocalFileSystem instance, which
extends ChecksumFileSystem, and the append method has not been
overridden.

Whereas in the DistributedFileSystem, the append method is defined like
this:

  /** This optional operation is not yet supported. */
  public FSDataOutputStream append(Path f, int bufferSize,
  Progressable progress) throws IOException {

DFSOutputStream op = (DFSOutputStream)dfs.append(getPathName(f),
bufferSize, progress);
return new FSDataOutputStream(op, statistics, op.getInitialLen());
  }

Even though the comment still says it is not supported, it seems to do
something.

So this makes me think that append is not supported on hadoop
LocalFileSystem.

Is it correct?

Thanks,
Olivier
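
For reference, a minimal sketch that appends against HDFS rather than the
local filesystem (assuming a namenode at hdfs://localhost:9000 and
dfs.support.append set to true; both are placeholders for your own setup)
would be:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Connect to HDFS (DistributedFileSystem), not the local ChecksumFileSystem.
    FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

    Path p = new Path("/tmp/test.txt");
    FSDataOutputStream out = fs.create(p, true);   // overwrite if it exists
    out.writeBytes("Testing hadoop - first line\n");
    out.close();

    out = fs.append(p);                            // requires dfs.support.append=true
    out.writeBytes("Testing hadoop - second line\n");
    out.close();

    fs.close();
  }
}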





On Thu, May 28, 2009 at 11:06 AM, Sasha Dolgy  wrote:

> http://www.mail-archive.com/core-user@hadoop.apache.org/msg10002.html
>
>
> On Thu, May 28, 2009 at 3:03 PM, Olivier Smadja  wrote:
> > Hi Sacha!
> >
> > Thanks for the quick answer. Is there a simple way to search the mailing
> > list? by text or by author.
> >
> > At http://mail-archives.apache.org/mod_mbox/hadoop-core-user/ I only see
> a
> > browse per month...
> >
> > Thanks,
> > Olivier
> >
> >
> > On Thu, May 28, 2009 at 10:57 AM, Sasha Dolgy  wrote:
> >
> >> append isn't supported without modifying the configuration file for
> >> hadoop.  check out the mailling list threads ... i've sent a post in
> >> the past explaining how to enable it.
> >>
> >> On Thu, May 28, 2009 at 2:46 PM, Olivier Smadja 
> wrote:
> >> > Hello,
> >> >
> >> > I'm trying hadoop for the first time and I'm just trying to create a
> file
> >> > and append some text in it with the following code:
> >> >
> >> >
> >> > import java.io.IOException;
> >> >
> >> > import org.apache.hadoop.conf. Configuration;
> >> > import org.apache.hadoop.fs.FSDataOutputStream;
> >> > import org.apache.hadoop.fs.FileSystem;
> >> > import org.apache.hadoop.fs.Path;
> >> >
> >> > /**
> >> >  * @author olivier
> >> >  *
> >> >  */
> >> > public class HadoopIO {
> >> >
> >> >public static void main(String[] args) throws IOException {
> >> >
> >> >
> >> >String directory = "/Users/olivier/tmp/hadoop-data";
> >> >Configuration conf = new Configuration(true);
> >> >Path path = new Path(directory);
> >> >// Create the File system
> >> >FileSystem fs = path.getFileSystem(conf);
> >> >// Sets the working directory
> >> >fs.setWorkingDirectory(path);
> >> >
> >> >System.out.println(fs.getWorkingDirectory());
> >> >
> >> >// Creates a files
> >> >FSDataOutputStream out = fs.create(new Path("test.txt"));
> >> >out.writeBytes("Testing hadoop - first line");
> >> >out.close();
> >> >// then try to append something
> >> >out = fs.append(new Path("test.txt"));
> >> >out.writeBytes("Testing hadoop - second line");
> >> >out.close();
> >> >
> >> >fs.close();
> >> >
> >> >
> >> >}
> >> >
> >> > }
> >> >
> >> >
> >> > but I receive the following exception:
> >> >
> >> > Exception in thread "main" java.io.IOException: Not supported
> >> >at
> >> >
> >>
> org.apache.hadoop.fs.ChecksumFileSystem.append(ChecksumFileSystem.java:290)
> >> >at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:525)
> >> >at com.neodatis.odb.hadoop.HadoopIO.main(HadoopIO.java:38)
> >> >
> >> >
> >> > 1) Can someone tell me what am i doing wrong?
> >> >
> >> > 2) How can I update the file (for example, just update the first 10
> bytes
> >> of
> >> > the file)?
> >> >
> >> >
> >> > Thanks,
> >> > Olivier
> >> >
> >>
> >>
> >>
> >> --
> >> Sasha Dolgy
> >> sasha.do...@gmail.com
> >>
> >
>
>
>
> --
> Sasha Dolgy
> sasha.do...@gmail.com
>


Re: InputFormat for fixed-width records?

2009-05-28 Thread Owen O'Malley

On May 28, 2009, at 5:15 AM, Stuart White wrote:


I need to process a dataset that contains text records of fixed length
in bytes.  For example, each record may be 100 bytes in length


The update to the terasort example has an InputFormat that does  
exactly that. The key is 10 bytes and the value is the next 90 bytes.  
It is pretty easy to write, but I should upload it soon. The output  
types are Text, but they just have the binary data in them.


-- Owen


Re: Appending to a file / updating a file

2009-05-28 Thread Sasha Dolgy
did you restart hadoop?  sorry i'm stuck in the middle of something so
can't give this more attention.  i can assure you however that we have
append working in our POC ... and the code isn't that much different
to what you have posted.

-sd

On Thu, May 28, 2009 at 3:31 PM, Olivier Smadja  wrote:
> Thanks Sacha,
>
> I have now my hdfs-site.xml like that : (as the hadoop-site.xml seems to be
> deprecated)
>
> <configuration>
>    <property>
>        <name>dfs.support.append</name>
>        <value>true</value>
>    </property>
> </configuration>
>
> But I continue receiving the exception.
>
> Checking the hadoop source code, I saw
>
>  public FSDataOutputStream append(Path f, int bufferSize,
>      Progressable progress) throws IOException {
>    throw new IOException("Not supported");
>  }
>
> in the class org.apache.hadoop.fs.ChecksumFileSystem and this is where the
> exception is thrown. my fileSystem is a LocalFileSystem instance that
> inherits the ChecksumFileSystem and the append method has not beem
> overriden.
>
> Whereas in the DistributedFileSystem, the append method is defined like
> this:
>
>  /** This optional operation is not yet supported. */
>  public FSDataOutputStream append(Path f, int bufferSize,
>      Progressable progress) throws IOException {
>
>    DFSOutputStream op = (DFSOutputStream)dfs.append(getPathName(f),
> bufferSize, progress);
>    return new FSDataOutputStream(op, statistics, op.getInitialLen());
>  }
>
> even if the comment still say it is not supported, it seems to do
> something
>
> So this makes me think that append is not supported on hadoop
> LocalFileSystem.
>
> Is it correct?
>
> Thanks,
> Olivier
>
>
>
>
>
> On Thu, May 28, 2009 at 11:06 AM, Sasha Dolgy  wrote:
>
>> http://www.mail-archive.com/core-user@hadoop.apache.org/msg10002.html
>>
>>
>> On Thu, May 28, 2009 at 3:03 PM, Olivier Smadja  wrote:
>> > Hi Sacha!
>> >
>> > Thanks for the quick answer. Is there a simple way to search the mailing
>> > list? by text or by author.
>> >
>> > At http://mail-archives.apache.org/mod_mbox/hadoop-core-user/ I only see
>> a
>> > browse per month...
>> >
>> > Thanks,
>> > Olivier
>> >
>> >
>> > On Thu, May 28, 2009 at 10:57 AM, Sasha Dolgy  wrote:
>> >
>> >> append isn't supported without modifying the configuration file for
>> >> hadoop.  check out the mailling list threads ... i've sent a post in
>> >> the past explaining how to enable it.
>> >>
>> >> On Thu, May 28, 2009 at 2:46 PM, Olivier Smadja 
>> wrote:
>> >> > Hello,
>> >> >
>> >> > I'm trying hadoop for the first time and I'm just trying to create a
>> file
>> >> > and append some text in it with the following code:
>> >> >
>> >> >
>> >> > import java.io.IOException;
>> >> >
>> >> > import org.apache.hadoop.conf. Configuration;
>> >> > import org.apache.hadoop.fs.FSDataOutputStream;
>> >> > import org.apache.hadoop.fs.FileSystem;
>> >> > import org.apache.hadoop.fs.Path;
>> >> >
>> >> > /**
>> >> >  * @author olivier
>> >> >  *
>> >> >  */
>> >> > public class HadoopIO {
>> >> >
>> >> >    public static void main(String[] args) throws IOException {
>> >> >
>> >> >
>> >> >        String directory = "/Users/olivier/tmp/hadoop-data";
>> >> >        Configuration conf = new Configuration(true);
>> >> >        Path path = new Path(directory);
>> >> >        // Create the File system
>> >> >        FileSystem fs = path.getFileSystem(conf);
>> >> >        // Sets the working directory
>> >> >        fs.setWorkingDirectory(path);
>> >> >
>> >> >        System.out.println(fs.getWorkingDirectory());
>> >> >
>> >> >        // Creates a files
>> >> >        FSDataOutputStream out = fs.create(new Path("test.txt"));
>> >> >        out.writeBytes("Testing hadoop - first line");
>> >> >        out.close();
>> >> >        // then try to append something
>> >> >        out = fs.append(new Path("test.txt"));
>> >> >        out.writeBytes("Testing hadoop - second line");
>> >> >        out.close();
>> >> >
>> >> >        fs.close();
>> >> >
>> >> >
>> >> >    }
>> >> >
>> >> > }
>> >> >
>> >> >
>> >> > but I receive the following exception:
>> >> >
>> >> > Exception in thread "main" java.io.IOException: Not supported
>> >> >    at
>> >> >
>> >>
>> org.apache.hadoop.fs.ChecksumFileSystem.append(ChecksumFileSystem.java:290)
>> >> >    at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:525)
>> >> >    at com.neodatis.odb.hadoop.HadoopIO.main(HadoopIO.java:38)
>> >> >
>> >> >
>> >> > 1) Can someone tell me what am i doing wrong?
>> >> >
>> >> > 2) How can I update the file (for example, just update the first 10
>> bytes
>> >> of
>> >> > the file)?
>> >> >
>> >> >
>> >> > Thanks,
>> >> > Olivier
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Sasha Dolgy
>> >> sasha.do...@gmail.com
>> >>
>> >
>>
>>
>>
>> --
>> Sasha Dolgy
>> sasha.do...@gmail.com
>>
>



-- 
Sasha Dolgy
sasha.do...@gmail.com


Re: Efficient algorithm for many-to-many reduce-side join?

2009-05-28 Thread Todd Lipcon
Hi Stuart,

It seems to me like you have a few options.

Option 1: Just use a lot of RAM. Unless you really expect many millions of
entries on both sides of the join, you might be able to get away with
buffering despite its inefficiency.

Option 2: Use LocalDirAllocator to find some local storage to spill all of
the left table's records to disk in a MapFile format. Then as you iterate
over the right table, do lookups in the MapFile. This is really the same as
option 1, except that you're using disk as an extension of RAM.

Option 3: Convert this to a map-side merge join. Basically what you need to
do is sort both tables by the join key, and partition them with the same
partitioner into the same number of columns. This way you have an equal
number of part-N files for both tables, and within each part-N file
they're ordered by join key. In each map task, you open both tableA/part-N
and tableB/part-N and do a sequential merge to perform the join. I believe
the CompositeInputFormat class helps with this, though I've never used it.

Option 4: Perform the join in several passes. Whichever table is smaller,
break into pieces that fit in RAM. Unless my relational algebra is off, A
JOIN B = A JOIN (B1 UNION B2) = (A JOIN B1 UNION A JOIN B2) if B = B1 UNION
B2.

Hope that helps
-Todd
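
For what it's worth, the driver-side wiring for option 3 looks roughly like
the sketch below (untested; it assumes both tables are SequenceFiles already
sorted by the join key and partitioned identically, and the class/path names
are placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class MergeJoinDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MergeJoinDriver.class);
    Path tableA = new Path(args[0]);
    Path tableB = new Path(args[1]);

    // Both inputs must already be sorted by the join key and partitioned
    // identically (same number of part files, same partitioner).
    conf.setInputFormat(CompositeInputFormat.class);
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", SequenceFileInputFormat.class, tableA, tableB));

    // Each map call then receives (key, TupleWritable), where the tuple holds
    // the matching values from tableA and tableB for that key.
    // conf.setMapperClass(...); set output key/value classes here.
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));
    JobClient.runJob(conf);
  }
}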

On Thu, May 28, 2009 at 5:02 AM, Stuart White wrote:

> I need to do a reduce-side join of two datasets.  It's a many-to-many
> join; that is, each dataset can can multiple records with any given
> key.
>
> Every description of a reduce-side join I've seen involves
> constructing your keys out of your mapper such that records from one
> dataset will be presented to the reducers before records from the
> second dataset.  I should "hold on" to the value from the one dataset
> and remember it as I iterate across the values from the second
> dataset.
>
> This seems like it only works well for one-to-many joins (when one of
> your datasets will only have a single record with any given key).
> This scales well because you're only remembering one value.
>
> In a many-to-many join, if you apply this same algorithm, you'll need
> to remember all values from one dataset, which of course will be
> problematic (and won't scale) when dealing with large datasets with
> large numbers of records with the same keys.
>
> Does an efficient algorithm exist for a many-to-many reduce-side join?
>


Re: Efficient algorithm for many-to-many reduce-side join?

2009-05-28 Thread Todd Lipcon
One last possible trick to consider:

If you were to subclass SequenceFileRecordReader, you'd have access to its
seek method, allowing you to rewind the reducer input. You could then
implement a block hash join with something like the following pseudocode:

ahash = new HashMap();
while (i have ram available) {
  read a record
  if the record is from table B, break
  put the record into ahash
}
nextAPos = reader.getPos()

while (current record is an A record) {
  skip to next record
}
firstBPos = reader.getPos()

while (current record has current key) {
  read and join against ahash
  process joined result
}

if firstBPos > nextAPos {
  seek(nextAPos)
  go back to top
}


On Thu, May 28, 2009 at 8:05 AM, Todd Lipcon  wrote:

> Hi Stuart,
>
> It seems to me like you have a few options.
>
> Option 1: Just use a lot of RAM. Unless you really expect many millions of
> entries on both sides of the join, you might be able to get away with
> buffering despite its inefficiency.
>
> Option 2: Use LocalDirAllocator to find some local storage to spill all of
> the left table's records to disk in a MapFile format. Then as you iterate
> over the right table, do lookups in the MapFile. This is really the same as
> option 1, except that you're using disk as an extension of RAM.
>
> Option 3: Convert this to a map-side merge join. Basically what you need to
> do is sort both tables by the join key, and partition them with the same
> partitioner into the same number of columns. This way you have an equal
> number of part-N files for both tables, and within each part-N file
> they're ordered by join key. In each map task, you open both tableA/part-N
> and tableB/part-N and do a sequential merge to perform the join. I believe
> the CompositeInputFormat class helps with this, though I've never used it.
>
> Option 4: Perform the join in several passes. Whichever table is smaller,
> break into pieces that fit in RAM. Unless my relational algebra is off, A
> JOIN B = A JOIN (B1 UNION B2) = (A JOIN B1 UNION A JOIN B2) if B = B1 UNION
> B2.
>
> Hope that helps
> -Todd
>
>
> On Thu, May 28, 2009 at 5:02 AM, Stuart White wrote:
>
>> I need to do a reduce-side join of two datasets.  It's a many-to-many
>> join; that is, each dataset can can multiple records with any given
>> key.
>>
>> Every description of a reduce-side join I've seen involves
>> constructing your keys out of your mapper such that records from one
>> dataset will be presented to the reducers before records from the
>> second dataset.  I should "hold on" to the value from the one dataset
>> and remember it as I iterate across the values from the second
>> dataset.
>>
>> This seems like it only works well for one-to-many joins (when one of
>> your datasets will only have a single record with any given key).
>> This scales well because you're only remembering one value.
>>
>> In a many-to-many join, if you apply this same algorithm, you'll need
>> to remember all values from one dataset, which of course will be
>> problematic (and won't scale) when dealing with large datasets with
>> large numbers of records with the same keys.
>>
>> Does an efficient algorithm exist for a many-to-many reduce-side join?
>>
>
>


Re: Efficient algorithm for many-to-many reduce-side join?

2009-05-28 Thread Chris K Wensel
I believe PIG, and I know Cascading use a kind of 'spillable' list  
that can be re-iterated across. PIG's version is a bit more  
sophisticated last I looked.


that said, if you were using either one of them, you wouldn't need to  
write your own many-to-many join.


cheers,
ckw

On May 28, 2009, at 8:14 AM, Todd Lipcon wrote:


One last possible trick to consider:

If you were to subclass SequenceFileRecordReader, you'd have access  
to its

seek method, allowing you to rewind the reducer input. You could then
implement a block hash join with something like the following  
pseudocode:


ahash = new HashMap();
while (i have ram available) {
 read a record
 if the record is from table B, break
 put the record into ahash
}
nextAPos = reader.getPos()

while (current record is an A record) {
 skip to next record
}
firstBPos = reader.getPos()

while (current record has current key) {
 read and join against ahash
 process joined result
}

if firstBPos > nextAPos {
 seek(nextAPos)
 go back to top
}


On Thu, May 28, 2009 at 8:05 AM, Todd Lipcon   
wrote:



Hi Stuart,

It seems to me like you have a few options.

Option 1: Just use a lot of RAM. Unless you really expect many  
millions of

entries on both sides of the join, you might be able to get away with
buffering despite its inefficiency.

Option 2: Use LocalDirAllocator to find some local storage to spill  
all of
the left table's records to disk in a MapFile format. Then as you  
iterate
over the right table, do lookups in the MapFile. This is really the  
same as

option 1, except that you're using disk as an extension of RAM.

Option 3: Convert this to a map-side merge join. Basically what you  
need to
do is sort both tables by the join key, and partition them with the  
same
partitioner into the same number of columns. This way you have an  
equal
number of part-N files for both tables, and within each part- 
N file
they're ordered by join key. In each map task, you open both tableA/ 
part-N
and tableB/part-N and do a sequential merge to perform the join. I  
believe
the CompositeInputFormat class helps with this, though I've never  
used it.


Option 4: Perform the join in several passes. Whichever table is  
smaller,
break into pieces that fit in RAM. Unless my relational algebra is  
off, A
JOIN B = A JOIN (B1 UNION B2) = (A JOIN B1 UNION A JOIN B2) if B =  
B1 UNION

B2.

Hope that helps
-Todd


On Thu, May 28, 2009 at 5:02 AM, Stuart White >wrote:


I need to do a reduce-side join of two datasets.  It's a many-to- 
many

join; that is, each dataset can can multiple records with any given
key.

Every description of a reduce-side join I've seen involves
constructing your keys out of your mapper such that records from one
dataset will be presented to the reducers before records from the
second dataset.  I should "hold on" to the value from the one  
dataset

and remember it as I iterate across the values from the second
dataset.

This seems like it only works well for one-to-many joins (when one  
of

your datasets will only have a single record with any given key).
This scales well because you're only remembering one value.

In a many-to-many join, if you apply this same algorithm, you'll  
need

to remember all values from one dataset, which of course will be
problematic (and won't scale) when dealing with large datasets with
large numbers of records with the same keys.

Does an efficient algorithm exist for a many-to-many reduce-side  
join?







--
Chris K Wensel
ch...@concurrentinc.com
http://www.concurrentinc.com



Re: hadoop hardware configuration

2009-05-28 Thread Ian Soboroff
Brian Bockelman  writes:

> Despite my trying, I've never been able to come even close to pegging
> the CPUs on our NN.
>
> I'd recommend going for the fastest dual-cores which are affordable -- 
> latency is king.

Clue?

Surely the latencies in Hadoop that dominate are not cured with faster
processors, but with more RAM and faster disks?

I've followed your posts for a while, so I know you are very experienced
with this stuff... help me out here.

Ian


Re: hadoop hardware configuration

2009-05-28 Thread Brian Bockelman


On May 28, 2009, at 10:32 AM, Ian Soboroff wrote:


Brian Bockelman  writes:


Despite my trying, I've never been able to come even close to pegging
the CPUs on our NN.

I'd recommend going for the fastest dual-cores which are affordable  
--

latency is king.


Clue?

Surely the latencies in Hadoop that dominate are not cured with faster
processors, but with more RAM and faster disks?

I've followed your posts for a while, so I know you are very  
experienced

with this stuff... help me out here.


Actually, that's more of a gut feeling than an informed decision.  
Because the locking is rather coarse-grained, having many CPUs isn't  
going to win anything; I'd rather have any CPU-related portions go as  
fast as possible.  Under the highest load, I think we've been able to  
get up to 25% CPU utilization: thus, I'm guessing any CPU-related  
improvements will come from faster cores, not more of them.


For my cluster, if I had a lot of money, I'd spend it on a hot-spare  
machine.  Then, I'd spend it on upgrading the RAM, followed by disks,  
followed by CPU.


Then again, for the cluster in the original email, I'd save money on  
the namenode and buy more datanodes.  We've got about 200 nodes and  
probably have a comparable NN.


Brian


Re: Appending to a file / updating a file

2009-05-28 Thread Damien Cooke

Olivier,
Append is not supported or recommended at this point.  You can turn it  
on via dfs.support.append in hdfs-site.xml under 0.20.0, but there have  
been some issues making it reliable.  If this is not production code or  
a production job, then turning it on will probably have no detrimental  
effect; just be aware that it might destroy your data.


Regards
Damien
On 28/05/2009, at 6:46 AM, Olivier Smadja wrote:


Hello,

I'm trying hadoop for the first time and I'm just trying to create a  
file

and append some text in it with the following code:


import java.io.IOException;

import org.apache.hadoop.conf. Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
* @author olivier
*
*/
public class HadoopIO {

   public static void main(String[] args) throws IOException {


   String directory = "/Users/olivier/tmp/hadoop-data";
   Configuration conf = new Configuration(true);
   Path path = new Path(directory);
   // Create the File system
   FileSystem fs = path.getFileSystem(conf);
   // Sets the working directory
   fs.setWorkingDirectory(path);

   System.out.println(fs.getWorkingDirectory());

   // Creates a files
   FSDataOutputStream out = fs.create(new Path("test.txt"));
   out.writeBytes("Testing hadoop - first line");
   out.close();
   // then try to append something
   out = fs.append(new Path("test.txt"));
   out.writeBytes("Testing hadoop - second line");
   out.close();

   fs.close();


   }

}


but I receive the following exception:

Exception in thread "main" java.io.IOException: Not supported
   at
org 
.apache.hadoop.fs.ChecksumFileSystem.append(ChecksumFileSystem.java: 
290)

   at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:525)
   at com.neodatis.odb.hadoop.HadoopIO.main(HadoopIO.java:38)


1) Can someone tell me what am i doing wrong?

2) How can I update the file (for example, just update the first 10  
bytes of

the file)?


Thanks,
Olivier



Damien Cooke
Open Scalable Solutions Performance
Performance & Applications Engineering

Sun Microsystems
Level 2, East Wing 50 Grenfell Street, Adelaide
SA 5000 Australia
Phone x58315 (x7058315 US callers)
Email damien.co...@sun.com



Reduce() time takes ~4x Map()

2009-05-28 Thread David Batista
Hi everyone,

I'm processing XML files, around 500MB each, containing several documents.
To the map() function I pass a document from the XML file, which takes some
time to process depending on its size, since I'm applying NER to the texts.

Each document has a unique identifier, so I'm using that identifier as the
key and the result of parsing the document, as one string, as the
output:

so at the end of the map() function:
output.collect(new Text(identifier), new Text(outputString));

usually the outputString is around 1k-5k in size

reduce():

public void reduce(Text key, Iterator<Text> values,
    OutputCollector<Text, Text> output, Reporter reporter) {
  while (values.hasNext()) {
    Text text = values.next();
    try {
      output.collect(key, text);
    } catch (IOException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
    }
  }
}

I did a test using only 1 machine with 8 cores and only 1 XML file; it
took around 3 hours to process all the maps and ~12 hours for the
reduces!

the XML file has 139 945 documents

I set the jobconf for 1000 maps() and 200 reduces()

I took a look at the graphs on the web interface during the reduce
phase, and indeed it's the copy phase that's taking most of the time;
the sort and reduce phases finish almost instantly.

Why does the copy phase take so long? I understand that the copies are
made using HTTP, and the data is in really small chunks of 1k-5k in
size, but even so, with everything on the same physical machine it
should have been faster, no?

Any suggestions on what might be causing the copies in reduce to take so long?
--
./david


org.apache.hadoop.ipc.client : trying connect to server failed

2009-05-28 Thread ashish pareek
Hi,
 I am trying to set up a Hadoop cluster on a 512 MB machine, using
Hadoop 0.18, and have followed the procedure given on the Apache Hadoop
site for a Hadoop cluster.
 In conf/slaves I included two datanodes, i.e. the namenode virtual
machine and the other virtual machine, and have set up passwordless ssh
between both virtual machines. But now the problem is, when I run the
command:

bin/hadoop start-all.sh

It starts only one datanode, on the namenode virtual machine itself, but it
doesn't start the datanode on the other machine.

in logs/hadoop-datanode I get the message:


 INFO org.apache.hadoop.ipc.Client: Retrying
 connect to server: hadoop1/192.168.1.28:9000. Already
  tried 1 time(s).
 2009-05-09 18:35:14,266 INFO org.apache.hadoop.ipc.Client: Retrying
 connect to server: hadoop1/192.168.1.28:9000. Already tried 2 time(s).
 2009-05-09 18:35:14,266 INFO org.apache.hadoop.ipc.Client: Retrying
 connect to server: hadoop1/192.168.1.28:9000. Already tried 3 time(s).

...

So can anyone help in solving this problem? :)

Thanks

Regards
Ashish Pareek


Re: Persistent storage on EC2

2009-05-28 Thread Kevin Peterson
On Tue, May 26, 2009 at 7:50 PM, Malcolm Matalka <
mmata...@millennialmedia.com> wrote:

> I'm using EBS volumes to have a persistent HDFS on EC2.  Do I need to keep
> the master updated on how to map the internal IPs, which change as I
> understand, to a known set of host names so it knows where the blocks are
> located each time I bring a cluster up?  If so, is keeping a mapping up to
> date in /etc/hosts sufficient?
>

I can't answer your first question of whether it's necessary. The namenode
might be able to figure it out when the DNs report their blocks.

Our staging cluster uses the setup you describe, with /etc/hosts pushed out
to all the machines, and the EBS volumes always mounted on the same
hostname. This works great.
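
For illustration, the pushed-out /etc/hosts is just a static mapping along
these lines (the addresses and hostnames below are made up):

10.251.27.4    hdfs-node01
10.251.30.12   hdfs-node02
10.251.42.7    hdfs-node03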


Re: hadoop hardware configuration

2009-05-28 Thread Patrick Angeles
On Thu, May 28, 2009 at 10:24 AM, Brian Bockelman wrote:

>
> We do both -- push the disk image out to NFS and have a mirrored SAS hard
> drives on the namenode.  The SAS drives appear to be overkill.
>

This sounds like a nice approach, taking into account hardware, labor and
downtime costs... $700 for a RAID controller seems reasonable to minimize
maintenance due to a disk failure. Alex's suggestion to go JBOD and write to
all volumes would work as well, but slightly more labor intensive.


>>   2. What is a good processor-to-storage ratio for a task node with 4TB of
>>   raw storage? (The config above has 1 core per 1TB of raw storage.)
>
> We're data hungry locally -- I'd put in bigger hard drives.  The 1.5TB
> Seagate drives seem to have passed their teething issues, and are at a
> pretty sweet price point.  They only will scale up to 60 IOPS, so make sure
> your workflows don't have lots of random I/O.
>

I haven't seen too many vendors offering the 1.5TB option. What type of data
are you working with? At what volumes? I sense that at 50GB/day, we are
higher than average in terms of data volume over time.


> As Steve mentions below, the rest is really up to your algorithm.  Do you
> need 1 CPU second / byte?  If so, buy more CPUs.  Do you need .1 CPU second
> / MB?  If so, buy more disks.
>

Unfortunately, we won't know until we have a cluster to test on. Classic
catch-22. We are going to experiment with a small cluster and a small data
set, with plans to buy more appropriately sized slave nodes based on what we
learn.

- P


Re: hadoop hardware configuration

2009-05-28 Thread Brian Bockelman


On May 28, 2009, at 2:00 PM, Patrick Angeles wrote:

> On Thu, May 28, 2009 at 10:24 AM, Brian Bockelman wrote:
>
>> We do both -- push the disk image out to NFS and have a mirrored SAS hard
>> drives on the namenode.  The SAS drives appear to be overkill.
>
> This sounds like a nice approach, taking into account hardware, labor and
> downtime costs... $700 for a RAID controller seems reasonable to minimize
> maintenance due to a disk failure. Alex's suggestion to go JBOD and write to
> all volumes would work as well, but slightly more labor intensive.

Remember though that disk failure downtime is actually rather rare.
The question is "how tight is your hardware budget": if $700 is worth
the extra 1 day of uptime a year, then spend it.  I come from an
academic background where (a) we don't lose money if things go down
and (b) jobs move to another site in the US if things are down.  That
perhaps gives you a reading into my somewhat relaxed attitude.

I'm not a hardware guy anymore, but I'd personally prefer a software
RAID.  I've seen mirrored disks go down because the RAID controller
decided to puke.

>>> 2. What is a good processor-to-storage ratio for a task node with 4TB of
>>> raw storage? (The config above has 1 core per 1TB of raw storage.)
>>
>> We're data hungry locally -- I'd put in bigger hard drives.  The 1.5TB
>> Seagate drives seem to have passed their teething issues, and are at a
>> pretty sweet price point.  They only will scale up to 60 IOPS, so make sure
>> your workflows don't have lots of random I/O.
>
> I haven't seen too many vendors offering the 1.5TB option. What type of data
> are you working with? At what volumes? I sense that at 50GB/day, we are
> higher than average in terms of data volume over time.

We have just short of 300TB of raw disk; our daily downloads range
from a few GB to 10TB.

We bought 1.5TB drives separately from the nodes and sent students
with screwdrivers at the cluster.

>> As Steve mentions below, the rest is really up to your algorithm.  Do you
>> need 1 CPU second / byte?  If so, buy more CPUs.  Do you need .1 CPU second
>> / MB?  If so, buy more disks.
>
> Unfortunately, we won't know until we have a cluster to test on. Classic
> catch-22. We are going to experiment with a small cluster and a small data
> set, with plans to buy more appropriately sized slave nodes based on what we
> learn.

In that case, you're probably good!  24TB probably formats out to
20TB.  With 2x replication at 50GB a day, you've got enough room for
about half a year of data.  Hope your procurement process isn't too
slow!
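
(Back-of-the-envelope: 20TB formatted / 2 replicas leaves ~10TB of unique
data, and 10TB at 50GB/day is about 200 days -- roughly half a year.)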


Brian



Re: hadoop hardware configuration

2009-05-28 Thread Patrick Angeles
On Thu, May 28, 2009 at 6:02 AM, Steve Loughran  wrote:

> That really depends on the work you are doing...the bytes in/out to CPU
> work, and the size of any memory structures that are built up over the run.
>
> With 1 core per physical disk, you get the bandwidth of a single disk per
> CPU; for some IO-intensive work you can make the case for two disks/CPU -one
> in, one out, but then you are using more power, and if/when you want to add
> more storage, you have to pull out the disks to stick in new ones. If you go
> for more CPUs, you will probably need more RAM to go with it.
>

Just to throw a wrench in the works, Intel's Nehalem architecture takes DDR3
memory, which is installed in sets of three. So for a dual quad-core rig, you
can get either 6 x 2GB (12GB) or 6 x 4GB (24GB) for an extra $500. That's a big
step up in price for extra memory in a slave node. 12GB probably won't be
enough, because the mid-range Nehalems support hyper-threading, so you
actually get up to 16 threads running on a dual quad setup.


> Then there is the question of where your electricity comes from, what the
> limits for the room are, whether you are billed on power drawn or quoted PSU
> draw, what the HVAC limits are, what the maximum allowed weight per rack is,
> etc, etc.


We're going to start with cabinets in a co-location facility. Most can provide 40 amps
per cabinet (with up to 80% load), so you could fit around 30 single-socket
servers, or 15 dual-socket servers in a single rack.


>
> I'm a fan of low Joule work, though we don't have any benchmarks yet of the
> power efficiency of different clusters; the number of MJ used to do a a
> terasort. I'm debating doing some single-cpu tests for this on my laptop, as
> the battery knows how much gets used up by some work.
>
>4. In planning storage capacity, how much spare disk space should I take
>>   into account for 'scratch'? For now, I'm assuming 1x the input data
>> size.
>>
>
> That you should probably be able to determine on experimental work on
> smaller datasets. Some maps can throw out a lot of data, most reduces do
> actually reduce the final amount.
>
>
> -Steve
>
> (Disclaimer: I'm not making any official recommendations for hardware here,
> just making my opinions known. If you do want an official recommendation
> from HP, talk to your reseller or account manager, someone will look at your
> problem in more detail and make some suggestions. If you have any code/data
> that could be shared for benchmarking, that would help validate those
> suggestions)
>
>


Re: Appending to a file / updating a file

2009-05-28 Thread Olivier Smadja
Thanks Damien.

And can I update a file with Hadoop, or just create it and read it later?

Olivier

On Thu, May 28, 2009 at 1:31 PM, Damien Cooke  wrote:

> Olivier,
> Append is not supported or recommended at this point.  You can turn it on
> via dfs.support.append in hdfs-site.xml under 0.20.0.  There have been some
> issues making it reliable.  If this is not production code or a production
> job then turning it on will probably have no detrimental effect, but be
> aware it might destroy your data as it is not recommended at this point.
>
> Regards
> Damien
>
> On 28/05/2009, at 6:46 AM, Olivier Smadja wrote:
>
>  Hello,
>>
>> I'm trying hadoop for the first time and I'm just trying to create a file
>> and append some text in it with the following code:
>>
>>
>> import java.io.IOException;
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FSDataOutputStream;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>>
>> /**
>> * @author olivier
>> *
>> */
>> public class HadoopIO {
>>
>>   public static void main(String[] args) throws IOException {
>>
>>
>>   String directory = "/Users/olivier/tmp/hadoop-data";
>>   Configuration conf = new Configuration(true);
>>   Path path = new Path(directory);
>>   // Create the File system
>>   FileSystem fs = path.getFileSystem(conf);
>>   // Sets the working directory
>>   fs.setWorkingDirectory(path);
>>
>>   System.out.println(fs.getWorkingDirectory());
>>
>>   // Creates a files
>>   FSDataOutputStream out = fs.create(new Path("test.txt"));
>>   out.writeBytes("Testing hadoop - first line");
>>   out.close();
>>   // then try to append something
>>   out = fs.append(new Path("test.txt"));
>>   out.writeBytes("Testing hadoop - second line");
>>   out.close();
>>
>>   fs.close();
>>
>>
>>   }
>>
>> }
>>
>>
>> but I receive the following exception:
>>
>> Exception in thread "main" java.io.IOException: Not supported
>>   at
>>
>> org.apache.hadoop.fs.ChecksumFileSystem.append(ChecksumFileSystem.java:290)
>>   at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:525)
>>   at com.neodatis.odb.hadoop.HadoopIO.main(HadoopIO.java:38)
>>
>>
>> 1) Can someone tell me what am i doing wrong?
>>
>> 2) How can I update the file (for example, just update the first 10 bytes
>> of
>> the file)?
>>
>>
>> Thanks,
>> Olivier
>>
>
>
> Damien Cooke
> Open Scalable Solutions Performance
> Performance & Applications Engineering
>
> Sun Microsystems
> Level 2, East Wing 50 Grenfell Street, Adelaide
> SA 5000 Australia
> Phone x58315 (x7058315 US callers)
> Email damien.co...@sun.com
>
>


How do I convert DataInput and ResultSet to array of String?

2009-05-28 Thread dealmaker

Hi,
  How do I convert DataInput to array of String?
  How do I convert ResultSet to array of String?
Thanks.  Following is the code:

  static class Record implements Writable, DBWritable {
String [] aSAssoc;

public void write(DataOutput arg0) throws IOException {
  throw new UnsupportedOperationException("Not supported yet.");
}

public void readFields(DataInput in) throws IOException {
  this.aSAssoc = // How to convert DataInput to String Array?
}

public void write(PreparedStatement arg0) throws SQLException {
  throw new UnsupportedOperationException("Not supported yet.");
}

public void readFields(ResultSet rs) throws SQLException {
  this.aSAssoc = // How to convert ResultSet to String Array?
}
  }

-- 
View this message in context: 
http://www.nabble.com/How-do-I-convert-DataInput-and-ResultSet-to-array-of-String--tp23770747p23770747.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



New version/API stable?

2009-05-28 Thread David Rosenstrauch
Hadoop noob here, just starting to learn it, as we're planning to start
using it heavily in our processing.  Just wondering, though, which version
of the code I should start learning/working with.

It looks like the Hadoop API changed pretty significantly from 0.19 to
0.20 (e.g., org.apache.hadoop.mapred -> org.apache.hadoop.mapreduce),
which leads me to think I should start with the new API.  OTOH, since the
new release is a ".0" release after some of these major API overhauls, I'm
wondering if it's stable enough for us to start using in production.

Where'd be best for me to start learning?

TIA,

DR



Question: index package in contrib (lucene index)

2009-05-28 Thread Tenaali Ram
Hi,

I am trying to understand the code of index package to build a distributed
Lucene index. I have some very basic questions and would really appreciate
if someone can help me understand this code-

1) If I already have a Lucene index (divided into shards), should I upload
these indexes into HDFS and provide their location, or will the code pick these
shards up from the local file system?

2) How does the code add a document to the Lucene index? I can see there is
an index selection policy. Assuming the round-robin policy is chosen, how does
the code add a document to the chosen index? This is related to the first
question - is the index where the new document is to be added in HDFS or on
the local file system? I read in the README that the index is first created on
the local file system, then copied back to HDFS. Can someone please point me to
the code that is doing this?

3) After the map reduce job finishes, where are the final indexes? In HDFS?

4) Correct me if I am wrong: the code builds multiple indexes, where each
index is an instance of a Lucene index holding a disjoint subset of documents
from the corpus. So, if I have to search for a term, I have to search each
index and then merge the results. If this is correct, then how is the IDF of a
term, which is a global statistic, computed and updated in each index? I mean,
each index can compute the IDF with respect to the subset of documents that it
has, but cannot compute the global IDF of a term (since it knows nothing about
other indexes, which might have the same term in other documents).

Thanks,
-T


Re: New version/API stable?

2009-05-28 Thread Alex Loddengaard
0.19 is considered unstable by us at Cloudera and by the Y! folks; they
never deployed it to their clusters.  That said, we recommend 0.18.3 as the
most stable version of Hadoop right now.  Y! has (or will soon) deploy(ed)
0.20, which implies that it's at least stable enough for them to give it a
go.  Cloudera plans to support 0.20 as soon as a few more bugs get flushed
out, which will probably happen in its next minor release.

So anyway, that said, it might make sense for you to start with 0.20.0, as
long as you understand that the first major release usually is pretty buggy,
and is basically considered a beta.  If you're not willing to take the
stability risk, then I'd recommend going with 0.18.3, though the upgrade
from 0.18.3 to 0.20.X is going to be a headache (APIs changed, configuration
files changed, etc.).

Hope this is insightful.

Alex

On Thu, May 28, 2009 at 2:59 PM, David Rosenstrauch wrote:

> Hadoop noob here, just starting to learn it, as we're planning to start
> using it heavily in our processing.  Just wondering, though, which version
> of the code I should start learning/working with.
>
> It looks like the Hadoop API changed pretty significantly from 0.19 to
> 0.20 (e.g., org.apache.hadoop.mapred -> org.apache.hadoop.mapreduce),
> which leads me to think I should start with the new API.  OTOH, since the
> new release is a ".0" release after some of these major API overhauls, I'm
> wondering if it's stable enough for us to start using in production.
>
> Where'd be best for me to start learning?
>
> TIA,
>
> DR
>
>


MultipleOutputs or MultipleTextOutputFormat?

2009-05-28 Thread Kevin Peterson
I am trying to figure out the best way to split output into different
directories. My goal is to have a directory structure allowing me to add the
content from each batch into the right bucket, like this:

...
/content/200904/batch_20090429
/content/200904/batch_20090430
/content/200904/batch_20090501
/content/200904/batch_20090502
/content/200905/batch_20090430
/content/200905/batch_20090501
/content/200905/batch_20090502
...

I would then run my nightly jobs to build the index on /content/200904/* for
the April index and /content/200905/* for the May index.

I'm not sure whether I would be better off using MultipleOutputs or
MultipleTextOutputFormat. I'm having trouble understanding how I set the
output path for these two classes. It seems like MultipleTextOutputFormat is
about partitioning data to different files within the same directory on the
key, rather than into different directories as I need. Could I get the
behavior I want by specifying date/batch as my filename, setting the output
path to some temporary work directory, and then moving /work/* to /content?

MultipleOutputs seems to be more about outputting all the data in different
formats, but it's supposed to be simpler to use. Reading it, it seems to be
better documented and the API makes more sense (choosing the output
explicitly in the map or reduce, rather than hiding this decision in the
output format), but I don't see any way to set a file name. If am using
textoutputformat, I see no way to put these into different directories.


Re: InputFormat for fixed-width records?

2009-05-28 Thread Stuart White
On Thu, May 28, 2009 at 9:50 AM, Owen O'Malley  wrote:

>
> The update to the terasort example has an InputFormat that does exactly
> that. The key is 10 bytes and the value is the next 90 bytes. It is pretty
> easy to write, but I should upload it soon. The output types are Text, but
> they just have the binary data in them.
>

Would you mind uploading it or sending it to the list?


Re: Reduce() time takes ~4x Map()

2009-05-28 Thread Jothi Padmanabhan
Hi David,

If you go to JobTrackerHistory and then click on this job and then do
Analyse This Job, you should be able to get the split up timings for the
individual phases of the map and reduce tasks, including the average, best
and worst times. Could you provide those numbers so that we can get a better
idea of how the job progressed.

Jothi


On 5/28/09 10:11 PM, "David Batista"  wrote:

> Hi everyone,
> 
> I'm processing XML files, around 500MB each with several documents,
> for the map() function I pass a document from the XML file, which
> takes some time to process depending on the size - I'm applying NER to
> texts.
> 
> Each document has a unique identifier, so I'm using that identifier as
> a key and the results of parsing the document in one string as the
> output:
> 
> so at the end of  the map function():
> output.collect( new Text(identifier), new Text(outputString));
> 
> usually the outputString is around 1k-5k size
> 
> reduce():
> public void reduce(Text key, Iterator values,
> OutputCollector output, Reporter reporter) {
> while (values.hasNext()) {
> Text text = values.next();
> try {
> output.collect(key, text);
> } catch (IOException e) {
> // TODO Auto-generated catch block
> e.printStackTrace();
> }
> }
> }
> 
> I did a test using only 1 machine with 8 cores, and only 1 XML file,
> it took around 3 hours to process all maps and  ~12hours for the
> reduces!
> 
> the XML file has 139 945 documents
> 
> I set the jobconf for 1000 maps() and 200 reduces()
> 
> I did took a look at graphs on the web interface during the reduce
> phase, and indeed its the copy phase that's taking much of the time,
> the sort and reduce phase are done almost instantly.
> 
> Why does the copy phase takes so long? I understand that the copies
> are made using HTTP, and the data was in really small chunks 1k-5k
> size, but even so, being everything in the same physical machine
> should have been faster, no?
> 
> Any suggestions on what might be causing the copies in reduce to take so long?
> --
> ./david



Re: Reduce() time takes ~4x Map()

2009-05-28 Thread jason hadoop
At a minimum, enable map output compression (mapred.compress.map.output); it
may make some difference.
Sorting is very expensive when there are many keys and the values are large.
Are you quite certain your keys are unique?
Also, do you need them sorted by document id?
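
For example, with the old JobConf API this can be switched on at job-setup
time. A minimal sketch (the wrapper class and the gzip codec choice here are
just placeholders, not anything from this thread):

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

public class CompressMapOutputExample {
  // Returns a JobConf with map output compression enabled.
  public static JobConf configure(Class<?> jobClass) {
    JobConf conf = new JobConf(jobClass);
    conf.setCompressMapOutput(true);                   // mapred.compress.map.output = true
    conf.setMapOutputCompressorClass(GzipCodec.class); // optional: pick a codec
    return conf;
  }
}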


On Thu, May 28, 2009 at 8:51 PM, Jothi Padmanabhan wrote:

> Hi David,
>
> If you go to JobTrackerHistory and then click on this job and then do
> Analyse This Job, you should be able to get the split up timings for the
> individual phases of the map and reduce tasks, including the average, best
> and worst times. Could you provide those numbers so that we can get a
> better
> idea of how the job progressed.
>
> Jothi
>
>
> On 5/28/09 10:11 PM, "David Batista"  wrote:
>
> > Hi everyone,
> >
> > I'm processing XML files, around 500MB each with several documents,
> > for the map() function I pass a document from the XML file, which
> > takes some time to process depending on the size - I'm applying NER to
> > texts.
> >
> > Each document has a unique identifier, so I'm using that identifier as
> > a key and the results of parsing the document in one string as the
> > output:
> >
> > so at the end of  the map function():
> > output.collect( new Text(identifier), new Text(outputString));
> >
> > usually the outputString is around 1k-5k size
> >
> > reduce():
> > public void reduce(Text key, Iterator values,
> > OutputCollector output, Reporter reporter) {
> > while (values.hasNext()) {
> > Text text = values.next();
> > try {
> > output.collect(key, text);
> > } catch (IOException e) {
> > // TODO Auto-generated catch block
> > e.printStackTrace();
> > }
> > }
> > }
> >
> > I did a test using only 1 machine with 8 cores, and only 1 XML file,
> > it took around 3 hours to process all maps and  ~12hours for the
> > reduces!
> >
> > the XML file has 139 945 documents
> >
> > I set the jobconf for 1000 maps() and 200 reduces()
> >
> > I did took a look at graphs on the web interface during the reduce
> > phase, and indeed its the copy phase that's taking much of the time,
> > the sort and reduce phase are done almost instantly.
> >
> > Why does the copy phase takes so long? I understand that the copies
> > are made using HTTP, and the data was in really small chunks 1k-5k
> > size, but even so, being everything in the same physical machine
> > should have been faster, no?
> >
> > Any suggestions on what might be causing the copies in reduce to take so
> long?
> > --
> > ./david
>
>


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: org.apache.hadoop.ipc.client : trying connect to server failed

2009-05-28 Thread ashish pareek
Hi, can someone please help me out?



On Thu, May 28, 2009 at 10:32 PM, ashish pareek  wrote:

> HI ,
>  I am trying to step up a hadoop cluster on 512 MB machine and using
> hadoop 0.18 and  have followed procedure given in  apache hadoop site for
> hadoop cluster.
>  I included  in conf/slaves two datanode i.e including the namenode
> vitrual machine and other machine virtual machine  . and have set up
> passwordless ssh between both virtual machines. But now problem is when
> is run command >>
>
> bin/hadoop start-all.sh
>
> It start only one datanode on the same namenode vitrual machine but it
> doesn't start the datanode on other machine.
>
> in logs/hadoop-datanode  i get message
>
>
>  INFO org.apache.hadoop.ipc.Client: Retrying
>  connect to server: hadoop1/192.168.1.28:9000. Already
>   tried 1 time(s).
>
>  2009-05-09 18:35:14,266 INFO org.apache.hadoop.ipc.Client: Retrying
>  connect to server: hadoop1/192.168.1.28:9000. Already tried 2 time(s).
>  2009-05-09 18:35:14,266 INFO org.apache.hadoop.ipc.Client: Retrying
>  connect to server: hadoop1/192.168.1.28:9000. Already tried 3 time(s).
>
> ...
>
>
> So can any one help in solving this problem. :)
>
> Thanks
>
> Regards
>
> Ashish Pareek
>
>
>


Re: Efficient algorithm for many-to-many reduce-side join?

2009-05-28 Thread jason hadoop
Use the map-side join stuff; if I understand your problem, it provides a good
solution but requires getting over the learning hurdle.
It's well described in chapter 8 of my book :)
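
As a rough sketch of what the setup looks like with the old API (an
illustration only: the paths are made up, and it assumes both inputs are
SequenceFiles that are already sorted and identically partitioned on the
join key):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class MapSideJoinSetup {
  // Configures a job whose maps see pre-joined records from the two inputs.
  public static JobConf configure() {
    JobConf conf = new JobConf(MapSideJoinSetup.class);
    conf.setInputFormat(CompositeInputFormat.class);
    // "inner" asks for an inner join over the two co-partitioned inputs.
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", SequenceFileInputFormat.class,
        new Path("/data/tableA"), new Path("/data/tableB")));  // made-up paths
    // Each map() call then receives the join key plus a TupleWritable holding
    // the matching records from the two inputs.
    return conf;
  }
}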


On Thu, May 28, 2009 at 8:29 AM, Chris K Wensel  wrote:

> I believe PIG, and I know Cascading use a kind of 'spillable' list that can
> be re-iterated across. PIG's version is a bit more sophisticated last I
> looked.
>
> that said, if you were using either one of them, you wouldn't need to write
> your own many-to-many join.
>
> cheers,
> ckw
>
>
> On May 28, 2009, at 8:14 AM, Todd Lipcon wrote:
>
>  One last possible trick to consider:
>>
>> If you were to subclass SequenceFileRecordReader, you'd have access to its
>> seek method, allowing you to rewind the reducer input. You could then
>> implement a block hash join with something like the following pseudocode:
>>
>> ahash = new HashMap();
>> while (i have ram available) {
>>  read a record
>>  if the record is from table B, break
>>  put the record into ahash
>> }
>> nextAPos = reader.getPos()
>>
>> while (current record is an A record) {
>>  skip to next record
>> }
>> firstBPos = reader.getPos()
>>
>> while (current record has current key) {
>>  read and join against ahash
>>  process joined result
>> }
>>
>> if firstBPos > nextAPos {
>>  seek(nextAPos)
>>  go back to top
>> }
>>
>>
>> On Thu, May 28, 2009 at 8:05 AM, Todd Lipcon  wrote:
>>
>>  Hi Stuart,
>>>
>>> It seems to me like you have a few options.
>>>
>>> Option 1: Just use a lot of RAM. Unless you really expect many millions
>>> of
>>> entries on both sides of the join, you might be able to get away with
>>> buffering despite its inefficiency.
>>>
>>> Option 2: Use LocalDirAllocator to find some local storage to spill all
>>> of
>>> the left table's records to disk in a MapFile format. Then as you iterate
>>> over the right table, do lookups in the MapFile. This is really the same
>>> as
>>> option 1, except that you're using disk as an extension of RAM.
>>>
>>> Option 3: Convert this to a map-side merge join. Basically what you need
>>> to
>>> do is sort both tables by the join key, and partition them with the same
>>> partitioner into the same number of columns. This way you have an equal
>>> number of part-N files for both tables, and within each part-N
>>> file
>>> they're ordered by join key. In each map task, you open both
>>> tableA/part-N
>>> and tableB/part-N and do a sequential merge to perform the join. I
>>> believe
>>> the CompositeInputFormat class helps with this, though I've never used
>>> it.
>>>
>>> Option 4: Perform the join in several passes. Whichever table is smaller,
>>> break into pieces that fit in RAM. Unless my relational algebra is off, A
>>> JOIN B = A JOIN (B1 UNION B2) = (A JOIN B1 UNION A JOIN B2) if B = B1
>>> UNION
>>> B2.
>>>
>>> Hope that helps
>>> -Todd
>>>
>>>
>>> On Thu, May 28, 2009 at 5:02 AM, Stuart White >> >wrote:
>>>
>>>  I need to do a reduce-side join of two datasets.  It's a many-to-many
 join; that is, each dataset can can multiple records with any given
 key.

 Every description of a reduce-side join I've seen involves
 constructing your keys out of your mapper such that records from one
 dataset will be presented to the reducers before records from the
 second dataset.  I should "hold on" to the value from the one dataset
 and remember it as I iterate across the values from the second
 dataset.

 This seems like it only works well for one-to-many joins (when one of
 your datasets will only have a single record with any given key).
 This scales well because you're only remembering one value.

 In a many-to-many join, if you apply this same algorithm, you'll need
 to remember all values from one dataset, which of course will be
 problematic (and won't scale) when dealing with large datasets with
 large numbers of records with the same keys.

 Does an efficient algorithm exist for a many-to-many reduce-side join?


>>>
>>>
> --
> Chris K Wensel
> ch...@concurrentinc.com
> http://www.concurrentinc.com
>
>


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: org.apache.hadoop.ipc.client : trying connect to server failed

2009-05-28 Thread Pankil Doshi
Make sure you can ping that datanode and ssh to it.


On Thu, May 28, 2009 at 12:02 PM, ashish pareek  wrote:

> HI ,
> I am trying to step up a hadoop cluster on 512 MB machine and using
> hadoop 0.18 and  have followed procedure given in  apache hadoop site for
> hadoop cluster.
> I included  in conf/slaves two datanode i.e including the namenode
> vitrual machine and other machine virtual machine  . and have set up
> passwordless ssh between both virtual machines. But now problem is when
> is run command >>
>
> bin/hadoop start-all.sh
>
> It start only one datanode on the same namenode vitrual machine but it
> doesn't start the datanode on other machine.
>
> in logs/hadoop-datanode  i get message
>
>
>  INFO org.apache.hadoop.ipc.Client: Retrying
>  connect to server: hadoop1/192.168.1.28:9000. Already
>  tried 1 time(s).
>  2009-05-09 18:35:14,266 INFO org.apache.hadoop.ipc.Client: Retrying
>  connect to server: hadoop1/192.168.1.28:9000. Already tried 2 time(s).
>  2009-05-09 18:35:14,266 INFO org.apache.hadoop.ipc.Client: Retrying
>  connect to server: hadoop1/192.168.1.28:9000. Already tried 3 time(s).
>
> ...
>
>
> So can any one help in solving this problem. :)
>
> Thanks
>
> Regards
> Ashish Pareek
>


Re: org.apache.hadoop.ipc.client : trying connect to server failed

2009-05-28 Thread ashish pareek
Yes, I am able to ping and ssh between the two virtual machines, and I
have even set the IP addresses of both virtual machines in their
respective /etc/hosts files.

Thanks for the reply. Could you suggest anything else I might have
missed, or any other remedy?

Regards,
Ashish Pareek.


On Fri, May 29, 2009 at 10:04 AM, Pankil Doshi  wrote:

> make sure u can ping that data node and ssh it.
>
>
> On Thu, May 28, 2009 at 12:02 PM, ashish pareek 
> wrote:
>
> > HI ,
> > I am trying to step up a hadoop cluster on 512 MB machine and using
> > hadoop 0.18 and  have followed procedure given in  apache hadoop site for
> > hadoop cluster.
> > I included  in conf/slaves two datanode i.e including the namenode
> > vitrual machine and other machine virtual machine  . and have set up
> > passwordless ssh between both virtual machines. But now problem is
> when
> > is run command >>
> >
> > bin/hadoop start-all.sh
> >
> > It start only one datanode on the same namenode vitrual machine but it
> > doesn't start the datanode on other machine.
> >
> > in logs/hadoop-datanode  i get message
> >
> >
> >  INFO org.apache.hadoop.ipc.Client: Retrying
> >  connect to server: hadoop1/192.168.1.28:9000. Already
> >  tried 1 time(s).
> >  2009-05-09 18:35:14,266 INFO org.apache.hadoop.ipc.Client: Retrying
> >  connect to server: hadoop1/192.168.1.28:9000. Already tried 2 time(s).
> >  2009-05-09 18:35:14,266 INFO org.apache.hadoop.ipc.Client: Retrying
> >  connect to server: hadoop1/192.168.1.28:9000. Already tried 3 time(s).
> >
> > ...
> >
> >
> > So can any one help in solving this problem. :)
> >
> > Thanks
> >
> > Regards
> > Ashish Pareek
> >
>


RE:SequenceFile and streaming

2009-05-28 Thread walter steffe
Hi Tom,

  I have seen the tar-to-seq tool, but the person who made it says it is
very slow:
"It took about an hour and a half to convert a 615MB tar.bz2 file to an
868MB sequence file". To me that is not acceptable.
Normally, generating a tar file from 615MB of data takes less than one
minute, and in my view generating a sequence file should be even simpler:
you just have to append files and headers, without worrying about
hierarchy.

Regarding SequenceFileAsTextInputFormat, I am not sure it will do the
job I am looking for.
The Hadoop documentation says: SequenceFileAsTextInputFormat generates
SequenceFileAsTextRecordReader, which converts the input keys and values
to their String forms by calling the toString() method.
Let us suppose that the keys and values were generated using tar-to-seq
on a tar archive. Each value is a byte array that stores the content of a
file, which can be any kind of data (for example a jpeg picture). It
doesn't make sense to convert this data into a string.

What is needed is a tool to simply extract a file, as with
tar -xf archive.tar filename. The Hadoop framework lets you do the
extraction through its Java classes, but you have to do that within a
Java program. The streaming package is meant to be used from a unix
shell without the need for Java programming, but I think it is not very
useful if the sequence file (which is the principal data structure of
Hadoop) is not accessible from a shell command.
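
For reference, the Java route boils down to something like this minimal
sketch (it simply iterates the records, instantiating whatever key/value
classes the file declares; writing each value back out to a local file is
left as a comment):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqDump {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);                 // the sequence file to read
    FileSystem fs = path.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    // Instantiate whatever key/value classes the file was written with.
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    while (reader.next(key, value)) {
      System.out.println(key);   // typically the original file name
      // 'value' holds that entry's contents; write it out to a local file as needed
    }
    reader.close();
  }
}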


Walter




Re: MultipleOutputs or MultipleTextOutputFormat?

2009-05-28 Thread Ankur Goel
One way of doing what you need is to extend MultipleTextOutputFormat and 
override the following APIs

- generateFileNameForKeyValue()
- generateActualKey()
- generateActualValue()

You will need to prefix the directory and file-name of your choice to the 
key/value depending upon your needs. Assuming the key and value types are Text,
here is some sample code for reference:

 public String generateFileNameForKeyValue(Text key, Text v, String name) {
/*
 * split the default name (e.g. part-0 into ['part', '0'])
 */
String[] nameparts = name.split("-");

String keyStr = key.toString();
/**
 * assuming desired filename is prefixed to the key and separated from the
 * actual key contents by '\t'
 */
int idx = keyStr.indexOf("\t");

/*
 * get the file name
 */
name = keyStr.substring(0, idx);

/**
 * return a path of the form 'fileName/fileName-<partNumber>'
 * This makes sure that fileName dir is created under job's output dir
 * and all the keys with that prefix go into reducer-specific files under
 * that dir.
 */
return new Path(name, name + "-" + nameparts[1]).toString();
  }

  public Text generateActualKey(Text key, Text value) {
String keyStr = key.toString();
int idx = keyStr.indexOf("\t") + 1;
return new Text(keyStr.substring(idx));
  }
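
As a hypothetical usage from the reduce side (the batch name, originalKey and
value below are made up), the emit would then look something like:

  output.collect(new Text("batch_20090430" + "\t" + originalKey), value);

so that every record for that batch ends up in a 'batch_20090430' directory
under the job's output directory.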

 Hope that helps.

-Ankur

- Original Message -
From: "Kevin Peterson" 
To: core-user@hadoop.apache.org
Sent: Friday, May 29, 2009 4:55:22 AM GMT +05:30 Chennai, Kolkata, Mumbai, New 
Delhi
Subject: MultipleOutputs or MultipleTextOutputFormat?

I am trying to figure out the best way to split output into different
directories. My goal is to have a directory structure allowing me to add the
content from each batch into the right bucket, like this:

...
/content/200904/batch_20090429
/content/200904/batch_20090430
/content/200904/batch_20090501
/content/200904/batch_20090502
/content/200905/batch_20090430
/content/200905/batch_20090501
/content/200905/batch_20090502
...

I would then run my nightly jobs to build the index on /content/200904/* for
the April index and /content/200905/* for the May index.

I'm not sure whether I would be better off using MultipleOutputs or
MultipleTextOutputFormat. I'm having trouble understanding how I set the
output path for these two classes. It seems like MultipleTextOutputFormat is
about partitioning data to different files within the same directory on the
key, rather than into different directories as I need. Could I get the
behavior I want by specifying date/batch as my filename, set output path to
some temporary work directory, then move /work/* to /content?

MultipleOutputs seems to be more about outputting all the data in different
formats, but it's supposed to be simpler to use. Reading it, it seems to be
better documented and the API makes more sense (choosing the output
explicitly in the map or reduce, rather than hiding this decision in the
output format), but I don't see any way to set a file name. If am using
textoutputformat, I see no way to put these into different directories.