Re: Reverse Indexing Programming Help

2011-03-31 Thread Ted Dunning
It would help to get a good book.  There are several.

For your program, there are several things that will trip you up:

a) Lots of little files are going to be slow.  You want input that is >100MB
per file if you want speed.

b) That file format is a bit cheesy since it is hard to tell URLs from text
if you concatenate lots of files.  Better to use a format like protobufs or
Avro or even sequence files to separate the key and the data unambiguously.

c) I suspect that what you are asking for is to run a mapper so that each
invocation of map gets the URL as key and the text as data.  That map
invocation can then tokenize the data and emit records with the URL as key
and each word as data.  That isn't much use, since the reducer will get the
URL and all the words that were emitted for that URL; if each URL appears
exactly once, the input already had that.  Perhaps you mean to emit the
word as key and the URL as data.  Then the reducer will see the word as key
and an iterator over all the URLs that mentioned the word (sketch below).
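
As a minimal sketch of that second shape (untested, using the new
org.apache.hadoop.mapreduce API; class names are just illustrative, and it
assumes an InputFormat that hands each map() call the URL as key and the page
text as value, e.g. from a sequence file as in (b)):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {

  // Assumes the InputFormat delivers (URL, page text) pairs.
  public static class IndexMapper extends Mapper<Text, Text, Text, Text> {
    private final Text word = new Text();

    @Override
    protected void map(Text url, Text pageText, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(pageText.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, url);   // emit word -> URL
      }
    }
  }

  // Gets each word together with an iterator over all URLs that mentioned it.
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text word, Iterable<Text> urls, Context context)
        throws IOException, InterruptedException {
      StringBuilder joined = new StringBuilder();
      for (Text url : urls) {
        if (joined.length() > 0) {
          joined.append(',');
        }
        joined.append(url.toString());
      }
      context.write(word, new Text(joined.toString()));
    }
  }
}

The reducer output is then one line per word listing every URL that contained
it, which is the reverse index you described.  Note that the mapper never
"returns" the map; it writes each (key, value) pair through the context.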

On Thu, Mar 31, 2011 at 9:48 PM, DoomUs  wrote:

>
> I'm just starting out using Hadoop.  I've looked through the java examples,
> and have an idea about what's going on, but don't really get it.
>
> I'd like to write a program that takes a directory of files.  Each of
> those files contains a URL to a website on the first line, and the second
> line is the TEXT from that website.
>
> The mapper should create a map for each word in the text to that URL, so
> every word found on the website would map to the URL.
>
> The reducer then, would collect all of the URLs that are mapped to via a
> given word.
>
> Each Word->URL is then written to a file.
>
> So, it's "simple" as a program designed to run on a single system, but I
> want to be able to distribute the computation and whatnot using Hadoop.
>
> I'm extremely new to Hadoop; I'm not even sure how to ask all of the
> questions I'd like answers for. I have zero experience in MapReduce, and
> limited experience in functional programming at all.  Any programming tips,
> or corrections if I have my "Mapper" or "Reducer" defined incorrectly, etc.,
> would be greatly appreciated.
>
> Questions:
> How do I read (and write) files from HDFS?
> Once I've read them, how do I distribute the files to be mapped?
> I know I need a class to implement the mapper, and one to implement the
> reducer, but how does the class have a return type to output the map?
>
> Thanks a lot for your help.


Reverse Indexing Programming Help

2011-03-31 Thread DoomUs

I'm just starting out using Hadoop.  I've looked through the java examples,
and have an idea about what's going on, but don't really get it.

I'd like to write a program that takes a directory of files.  Each of those
files contains a URL to a website on the first line, and the second line is
the TEXT from that website.

The mapper should create a map for each word in the text to that URL, so
every word found on the website would map to the URL.

The reducer then, would collect all of the URLs that are mapped to via a
given word.

Each Word->URL is then written to a file.

So, it's "simple" as a program designed to run on a single system, but I
want to be able to distribute the computation and whatnot using Hadoop.

I'm extremely new to Hadoop; I'm not even sure how to ask all of the
questions I'd like answers for. I have zero experience in MapReduce, and
limited experience in functional programming at all.  Any programming tips,
or corrections if I have my "Mapper" or "Reducer" defined incorrectly, etc.,
would be greatly appreciated.

Questions:
How do I read (and write) files from HDFS?
Once I've read them, how do I distribute the files to be mapped?
I know I need a class to implement the mapper, and one to implement the
reducer, but how does the class have a return type to output the map?

Thanks a lot for your help.



Re: Is anyone running Hadoop 0.21.0 on Solaris 10 X64?

2011-03-31 Thread Edward Capriolo
On Thu, Mar 31, 2011 at 10:43 AM, XiaoboGu  wrote:
> I have trouble browsing the file system via the namenode web interface; the
> namenode says in its log file that the -G option is invalid for getting the
> groups for the user.
>
>

I thought this was no longer the case, but hadoop forks to the 'id'
command to figure out the groups for a user. You need to make sure its
output is what hadoop is expecting.


Reading Records from a Sequence File

2011-03-31 Thread maha
Hello Everyone,

As far as I know, when my Java program opens a sequence file from HDFS for map 
calculations, SequenceFile.Reader(key, value) will actually read the file in 
dfs.block.size-sized chunks and then grab record by record from memory.

  Is that right? 

I tried a simple program with about 6 MB of input, but the memory allocated was 
13 MB, which might be a fragmentation problem, but I doubt it.
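
For reference, my read loop is roughly the sketch below (Text keys and values
assumed; the path is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up the cluster config
    FileSystem fs = FileSystem.get(conf);          // the configured default FS (HDFS here)
    Path path = new Path("/user/maha/input.seq");  // example path

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    Text key = new Text();
    Text value = new Text();
    try {
      // next() deserializes one record at a time into the reusable key/value objects.
      while (reader.next(key, value)) {
        // ... per-record calculation here ...
      }
    } finally {
      reader.close();
    }
  }
}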

 Thank you,
 Maha

Re: What does "Too many fetch-failures" mean? How do I debug it?

2011-03-31 Thread David Rosenstrauch

On 03/31/2011 05:13 PM, W.P. McNeill wrote:

I'm running a big job on my cluster and a handful of attempts are failing
with a "Too many fetch-failures" error message. They're all on the same
node, but that node doesn't appear to be down. Subsequent attempts succeed,
so this looks like a transient stress issue rather than a problem with my
code. I'm guessing it's something like HDFS not being able to keep up, but
I'm not sure, and Googling only turns up people just as confused as I am.

What does this error mean and how do I dig into it more?

Thanks.


We've seen that happen in a number of situations, and it's a bit tricky 
to debug.


In the general sense it means that a machine wasn't able to fetch a 
block from HDFS - i.e., there was a network problem that prevented the 
machine from communicating with the other machine and fetching the block. 
The reasons why this could happen, though, are numerous.  We've seen this 
in at least 2 situations:  1) the HDFS machine was having a huge load 
spike and so didn't respond, and 2) we accidentally gave several nodes 
the same name, so Hadoop wasn't able to correctly contact the "real" 
node for that name.


Your specific issue may be different, though, so you'll need to debug 
the network error yourself.


HTH,

DR


What does "Too many fetch-failures" mean? How do I debug it?

2011-03-31 Thread W.P. McNeill
I'm running a big job on my cluster and a handful of attempts are failing
with a "Too many fetch-failures" error message. They're all on the same
node, but that node doesn't appear to be down. Subsequent attempts succeed,
so this looks like a transient stress issue rather than a problem with my
code. I'm guessing it's something like HDFS not being able to keep up, but
I'm not sure, and Googling only turns up people just as confused as I am.

What does this error mean and how do I dig into it more?

Thanks.


Re: Is anyone running Hadoop 0.21.0 on Solaris 10 X64?

2011-03-31 Thread Allen Wittenauer

On Mar 31, 2011, at 7:43 AM, XiaoboGu wrote:

> I have trouble browsing the file system via the namenode web interface; the
> namenode says in its log file that the -G option is invalid for getting the
> groups for the user.
> 


I don't, but I suspect you'll need to enable one of the POSIX 
personalities before launching the namenode.  In particular, this means putting 
/usr/xpg4/bin or /usr/xpg6/bin in the PATH prior to the SysV /usr/bin entry.




questions on map-side spills

2011-03-31 Thread Shrinivas Joshi
I am trying TeraSort with the Apache 0.21.0 build.  io.sort.mb is 360M,
map.sort.spill.percent is 0.8, and dfs.blocksize is 256M.  I am having some
difficulty understanding spill-related decisions from the log files.  Here
are the relevant log lines:

2011-03-30 13:46:51,591 INFO org.apache.hadoop.mapred.MapTask: (EQUATOR) 0
kvi 94371836(377487344)
2011-03-30 13:46:51,592 INFO org.apache.hadoop.mapred.MapTask:
mapreduce.task.io.sort.mb: 360
2011-03-30 13:46:51,592 INFO org.apache.hadoop.mapred.MapTask: soft limit at
301989888
2011-03-30 13:46:51,592 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0;
bufvoid = 377487360
2011-03-30 13:46:51,592 INFO org.apache.hadoop.mapred.MapTask: kvstart =
94371836; length = 23592960
2011-03-30 13:47:05,528 INFO org.apache.hadoop.mapred.MapTask: Spilling map
output
2011-03-30 13:47:05,528 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0;
bufend = 261042174; bufvoid = 377487360
2011-03-30 13:47:05,528 INFO org.apache.hadoop.mapred.MapTask: kvstart =
94371836(377487344); kvend = 84134892(336539568); length = 10236945/23592960
2011-03-30 13:47:05,529 INFO org.apache.hadoop.mapred.MapTask: (EQUATOR)
271279102 kvi 67819768(271279072)
2011-03-30 13:47:06,355 INFO org.apache.hadoop.mapred.MapTask: Starting
flush of map output
2011-03-30 13:47:20,822 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded
the native-hadoop library
2011-03-30 13:47:20,824 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory:
Successfully loaded & initialized native-zlib library
2011-03-30 13:47:20,825 INFO org.apache.hadoop.io.compress.CodecPool: Got
brand-new compressor
2011-03-30 13:47:54,317 INFO org.apache.hadoop.mapred.MapTask: *Finished
spill 0*
2011-03-30 13:47:54,318 INFO org.apache.hadoop.mapred.MapTask: (RESET)
equator 271279102 kv 67819768(271279072) kvi 66442776(265771104)
2011-03-30 13:47:54,318 INFO org.apache.hadoop.mapred.MapTask: Spilling map
output
2011-03-30 13:47:54,318 INFO org.apache.hadoop.mapred.MapTask: bufstart =
271279102; bufend = 306392398; bufvoid = 377487360
2011-03-30 13:47:54,318 INFO org.apache.hadoop.mapred.MapTask: kvstart =
67819768(271279072); kvend = 66442780(265771120); length = 1376989/23592960
2011-03-30 13:48:00,198 INFO org.apache.hadoop.mapred.MapTask: *Finished
spill 1*

A couple of questions:

   - It says length = 23592960 for records. Does it mean it is setting aside
   23592960 * 4 bytes (90M) for storing spilled-record metadata? Or is it
   23592960/(1024*1024) = 22.5M?
   - Why is it triggering 2 spills? By the first spill it looks like 248.94M
   (bufend = 261042174) of intermediate map output has been generated. If 90M is
   reserved for record metadata, then (360M - 90M) * 0.8 = 216M is less than the
   map output size and the spill should have been triggered earlier. If 22.5M
   is reserved for record metadata, then (360M - 22.5M) * 0.8 = 270M still leaves
   more room in the io.sort buffer. Maybe the changes in
   https://issues.apache.org/jira/browse/MAPREDUCE-64 rely on dynamic info
   and the straightforward calculations that I am using here are incorrect?
   - Is there any value in simplifying the spill-related debug output for a
   general user who might not necessarily have insight into the Hadoop source
   code?

Thanks,
-Shrinivas


Is anyone running Hadoop 0.21.0 on Solaris 10 X64?

2011-03-31 Thread XiaoboGu
I have trouble browsing the file system via the namenode web interface; the namenode 
says in its log file that the -G option is invalid for getting the groups for the user.



How to avoid receiving threads sent by other people.

2011-03-31 Thread XiaoboGu
Hi,

I have subscribed to the digest mode, but I still get all the messages 
instantly from other people on the list.  Other mailing lists don't do this; 
they send all the messages from a given time frame in one mail. 

How can I achieve this with the Apache mailing lists?

 

Regards,

 

Xiaobo Gu

 



sorting reducer input numerically in hadoop streaming

2011-03-31 Thread Dieter Plaetinck
hi,
I use hadoop 0.20.2, more specifically hadoop-streaming, on Debian 6.0
(squeeze) nodes.

My question is: how do I make sure input keys being fed to the reducer
are sorted numerically rather than alphabetically?

example:
- standard behavior:
#1 some-value1
#10 some-value10
#100 some-value100
#2 some-value2
#3 some-value3

- what I want:
#1 some-value1
#2 some-value2
#3 some-value3
#10 some-value10
#100 some-value100

I found
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/KeyFieldBasedComparator.html,
which supposedly supports GNU sort-like numeric sorting.
There are also some examples of jobconf parameters at
http://hadoop.apache.org/common/docs/r0.15.2/streaming.html,
but those seem to be meant for key-value configuration flags,
whereas I somehow need to instruct the streaming jar to use that specific
java class with that specific option for numeric sorting, and I
couldn't find how I should do that.
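
For reference, in plain (non-streaming) Java I gather the setup would look
roughly like the sketch below (untested; the property name is the one given in
the comparator's documentation), but I don't see how to express the same thing
on the hadoop-streaming command line:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.KeyFieldBasedComparator;

public class NumericKeySort {
  // Untested sketch: sort map output keys numerically (like sort -n) instead of lexically.
  public static void configure(JobConf conf) {
    conf.setOutputKeyClass(Text.class);
    conf.setOutputKeyComparatorClass(KeyFieldBasedComparator.class);
    // "-n" asks for numeric comparison of the first key field.
    conf.set("mapred.text.key.comparator.options", "-n");
  }
}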

Thanks,
Dieter


DFSIO benchmark

2011-03-31 Thread Matthew John
Can someone provide pointers/links for DFSIO benchmarks to check the I/O
performance of HDFS?

Thanks,
Matthew John


Re: Hadoop Pipes Error

2011-03-31 Thread Adarsh Sharma

Thanks Amareshwari, I found it, and I'm sorry to say it results in another error:

bash-3.2$ bin/hadoop pipes -D hadoop.pipes.java.recordreader=true -D 
hadoop.pipes.java.recordwriter=true -libjars 
/home/hadoop/project/hadoop-0.20.2/hadoop-0.20.2-test.jar -inputformat 
org.apache.hadoop.mapred.pipes.WordCountInputFormat -input gutenberg 
-output gutenberg-out101  -program bin/wordcount-nopipe
11/03/31 16:36:26 WARN mapred.JobClient: No job jar file set.  User 
classes may not be found. See JobConf(Class) or JobConf#setJar(String).
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
hdfs://ws-test:54310/user/hadoop/gutenberg, expected: file:

   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310)
   at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47)
   at 
org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:273)

   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:721)
   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:746)
   at 
org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:465)
   at 
org.apache.hadoop.mapred.pipes.WordCountInputFormat.getSplits(WordCountInputFormat.java:57)
   at 
org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
   at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)

   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
   at 
org.apache.hadoop.mapred.pipes.Submitter.runJob(Submitter.java:248)

   at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:479)
   at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:494)

Best regards,  Adarsh


Amareshwari Sri Ramadasu wrote:

Adarsh,

The inputformat is present in the test jar. So, pass -libjars  to your command. The libjars option should be passed before program-specific 
options, so it should be just after your -D parameters.

-Amareshwari

On 3/31/11 3:45 PM, "Adarsh Sharma"  wrote:

Amareshwari Sri Ramadasu wrote:
You cannot run it with TextInputFormat. You should run 
it with org.apache.hadoop.mapred.pipes.WordCountInputFormat. You can pass the 
input format by passing it in the -inputformat option.
I did not try it myself, but it should work.




Here is the command that I am trying, and it results in an exception:

bash-3.2$ bin/hadoop pipes -D hadoop.pipes.java.recordreader=true -D 
hadoop.pipes.java.recordwriter=true  -inputformat 
org.apache.hadoop.mapred.pipes.WordCountInputFormat -input gutenberg -output 
gutenberg-out101 -program bin/wordcount-nopipe
Exception in thread "main" java.lang.ClassNotFoundException: 
org.apache.hadoop.mapred.pipes.WordCountInputFormat
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:762)
at org.apache.hadoop.mapred.pipes.Submitter.getClass(Submitter.java:372)
at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:421)
at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:494)


Thanks , Adarsh


  




Re: Hadoop Pipes Error

2011-03-31 Thread Amareshwari Sri Ramadasu
Also see TestPipes.java for more details.


On 3/31/11 4:29 PM, "Amareshwari Sriramadasu"  wrote:

Adarsh,

The inputformat is present in the test jar. So, pass -libjars  to your command. The libjars option should be passed before program-specific 
options, so it should be just after your -D parameters.

-Amareshwari

On 3/31/11 3:45 PM, "Adarsh Sharma"  wrote:

Amareshwari Sri Ramadasu wrote:
You cannot run it with TextInputFormat. You should run 
it with org.apache.hadoop.mapred.pipes.WordCountInputFormat. You can pass the 
input format by passing it in the -inputformat option.
I did not try it myself, but it should work.




Here is the command that I am trying, and it results in an exception:

bash-3.2$ bin/hadoop pipes -D hadoop.pipes.java.recordreader=true -D 
hadoop.pipes.java.recordwriter=true  -inputformat 
org.apache.hadoop.mapred.pipes.WordCountInputFormat -input gutenberg -output 
gutenberg-out101 -program bin/wordcount-nopipe
Exception in thread "main" java.lang.ClassNotFoundException: 
org.apache.hadoop.mapred.pipes.WordCountInputFormat
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:762)
at org.apache.hadoop.mapred.pipes.Submitter.getClass(Submitter.java:372)
at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:421)
at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:494)


Thanks , Adarsh



Re: Hadoop Pipes Error

2011-03-31 Thread Amareshwari Sri Ramadasu
Adarsh,

The inputformat is present in the test jar. So, pass -libjars  to your command. The libjars option should be passed before program-specific 
options, so it should be just after your -D parameters.

-Amareshwari

On 3/31/11 3:45 PM, "Adarsh Sharma"  wrote:

Amareshwari Sri Ramadasu wrote:
You cannot run it with TextInputFormat. You should run 
it with org.apache.hadoop.mapred.pipes.WordCountInputFormat. You can pass the 
input format by passing it in the -inputformat option.
I did not try it myself, but it should work.




Here is the command that I am trying, and it results in an exception:

bash-3.2$ bin/hadoop pipes -D hadoop.pipes.java.recordreader=true -D 
hadoop.pipes.java.recordwriter=true  -inputformat 
org.apache.hadoop.mapred.pipes.WordCountInputFormat -input gutenberg -output 
gutenberg-out101 -program bin/wordcount-nopipe
Exception in thread "main" java.lang.ClassNotFoundException: 
org.apache.hadoop.mapred.pipes.WordCountInputFormat
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:762)
at org.apache.hadoop.mapred.pipes.Submitter.getClass(Submitter.java:372)
at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:421)
at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:494)


Thanks , Adarsh



hadoop streaming shebang line for python and mappers jumping to 100% completion right away

2011-03-31 Thread Dieter Plaetinck
Hi,
I use 0.20.2 on Debian 6.0 (squeeze) nodes.
I have 2 problems with my streaming jobs:
1) I start the job like so:
hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
-file /proj/Search/wall/experiment/ \
-mapper './nolog.sh mapper' \
-reducer './nolog.sh reducer' \
-input sim-input -output sim-output

nolog.sh is just a simple wrapper for my python program,
it calls build-models.py with --mapper or --reducer, depending on which 
argument it got,
and it removes any bogus logging output using grep.
it looks like this:

#!/bin/sh
python $(dirname $0)/build-models.py --$1 | egrep -v 'INFO|DEBUG|WARN'

build-models.py is a python 2 program containing all mapper/reducer/etc logic, 
it has the executable flag set for owner/group/other.
(I even added `chmod +x` on it in nolog.sh to be really sure)

The problems:
When I use this shebang for build-models.py: "#!/usr/bin/python" or 
"#!/usr/bin/env python" (I would expect the latter to work for sure),
and invoke
$(dirname $0)/build-models.py in nolog.sh,
I get this error: 
/tmp/hadoop-dplaetin/mapred/local/taskTracker/jobcache/job_201103311017_0008/attempt_201103311017_0008_m_00_0/work/././nolog.sh:
9: 
/tmp/hadoop-dplaetin/mapred/local/taskTracker/jobcache/job_201103311017_0008/attempt_201103311017_0008_m_00_0/work/././build-models.py:
Permission denied
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess
failed with code 1


So, despite not understanding why it's needed (python is installed correctly, 
executable flags set, etc.), I can "solve" this by using the invocation in 
nolog.sh as shown above (`python `).
Since, if you invoke a python program like that, you can just as well remove 
the shebang because it's not needed (I verified this manually).
However, when running it in hadoop it tries to execute the python file as a bash 
file, and yields a bunch of "command not found" errors.
What is going on? Why can't I just execute the file and rely on the shebang? 
And if I pass the file as an argument to the python program, why is the shebang 
still needed?


2) The second problem is somewhat related: I notice my mappers jump to "100% 
completion" right away - but they take about an hour to complete, so I see them 
running for an hour in 'RUNNING' at 100% completion, and then they really finish.
This is probably an issue with the reading of stdin, as python uses
buffering by default (see
http://stackoverflow.com/questions/3670323/setting-smaller-buffer-size-for-sys-stdin
).
In my code I iterate over stdin like this: `for line in sys.stdin:`, so I 
process line by line, but apparently python reads the entire stdin right away; 
my hdfs blocksize is 20KiB (which, according to the thread above, happens to be 
pretty much the size of the python buffer).

Now, why is this related? -> Because I can invoke python in a different way to 
keep it from doing the buffering.
Apparently using the -u flag should do the trick, or setting the environment 
variable PYTHONUNBUFFERED to a nonempty string.
However:
- putting `python -u` in nolog.sh doesn't do it, why?
- neither does putting `export PYTHONUNBUFFERED=true` in nolog.sh before the 
invocation, why?
- in build-models.py shebang:
  putting `/usr/bin/env python -u` or '/usr/bin/env 'python -u'` gives:
  /usr/bin/env: python -u: No such file or directory, why?
I did find a working variant: I can use this shebang:
`#!/usr/bin/env PYTHONUNBUFFERED=true python2`.  However, since I use the same 
file for multiple things, this made i/o for a bunch of other things way too 
slow, so I tried solving this in the python code (as per the tip in the above 
link), but to no avail. (I know, my final question is a bit less related.)

So I tried remapping sys.stdin (before iterating over it) with these two attempts:
( see http://docs.python.org/library/os.html#os.fdopen )
newin = os.fdopen(sys.stdin.fileno(), 'r', 100) # should make buffersize +- 
100bytes
newin = os.fdopen(sys.stdin.fileno(), 'r', 1) # should make python buffer line 
by line

However, neither of those worked.

Any help/input is welcome.
I'm usually pretty good at figuring out these kinds of invocation issues, 
but this one blows my mind :/

Dieter


Re: Hadoop Pipes Error

2011-03-31 Thread Adarsh Sharma

Amareshwari Sri Ramadasu wrote:
You cannot run it with TextInputFormat. You should run it with 
org.apache.hadoop.mapred.pipes.WordCountInputFormat. You can pass 
the input format by passing it in the -inputformat option.

I did not try it myself, but it should work.




Here is the command that I am trying, and it results in an exception:

bash-3.2$ bin/hadoop pipes -D hadoop.pipes.java.recordreader=true -D 
hadoop.pipes.java.recordwriter=true  -inputformat 
org.apache.hadoop.mapred.pipes.WordCountInputFormat -input gutenberg 
-output gutenberg-out101 -program bin/wordcount-nopipe
Exception in thread "main" java.lang.ClassNotFoundException: 
org.apache.hadoop.mapred.pipes.WordCountInputFormat

   at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
   at java.lang.Class.forName0(Native Method)
   at java.lang.Class.forName(Class.java:247)
   at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:762)
   at 
org.apache.hadoop.mapred.pipes.Submitter.getClass(Submitter.java:372)

   at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:421)
   at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:494)


Thanks , Adarsh


Re: Hadoop Pipes Error

2011-03-31 Thread Amareshwari Sri Ramadasu
You cannot run it with TextInputFormat. You should run it with 
org.apache.hadoop.mapred.pipes.WordCountInputFormat. You can pass the input 
format by passing it in the -inputformat option.
I did not try it myself, but it should work.

-Amareshwari

On 3/31/11 12:23 PM, "Adarsh Sharma"  wrote:

Thanks Amareshwari,

here is the posting :
The nopipe example needs more documentation.  It assumes that it is
run with the InputFormat from src/test/org/apache/hadoop/mapred/pipes/
WordCountInputFormat.java, which has a very specific input split
format. By running with a TextInputFormat, it will send binary bytes
as the input split and won't work right. The nopipe example should
probably be recoded to use libhdfs too, but that is more complicated
to get running as a unit test. Also note that since the C++ example
is using local file reads, it will only work on a cluster if you have
nfs or something working across the cluster.

Please correct me if I'm wrong.

I need to run it with TextInputFormat.

If possible, please explain the above post more clearly.


Thanks & best Regards,
Adarsh Sharma



Amareshwari Sri Ramadasu wrote:

Here is an answer for your question in old mail archive:
http://lucene.472066.n3.nabble.com/pipe-application-error-td650185.html

On 3/31/11 10:15 AM, "Adarsh Sharma"  
  wrote:

Any update on the below error.

Please guide.


Thanks & best Regards,
Adarsh Sharma



Adarsh Sharma wrote:



Dear all,

Today I faced a problem while running a map-reduce job in C++. I am
not able to understand to find the reason of the below error :


11/03/30 12:09:02 INFO mapred.JobClient: Task Id :
attempt_201103301130_0011_m_00_0, Status : FAILED
java.io.IOException: pipe child exception
at
org.apache.hadoop.mapred.pipes.Application.abort(Application.java:151)
at
org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:101)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:250)
at
org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
at
org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
at
org.apache.hadoop.mapred.pipes.BinaryProtocol$UplinkReaderThread.run(BinaryProtocol.java:114)

attempt_201103301130_0011_m_00_0: Hadoop Pipes Exception: failed
to open  at wordcount-nopipe.cc:82 in
WordCountReader::WordCountReader(HadoopPipes::MapContext&)
11/03/30 12:09:02 INFO mapred.JobClient: Task Id :
attempt_201103301130_0011_m_01_0, Status : FAILED
java.io.IOException: pipe child exception
at
org.apache.hadoop.mapred.pipes.Application.abort(Application.java:151)
at
org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:101)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:250)
at
org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
at
org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
at
org.apache.hadoop.mapred.pipes.BinaryProtocol$UplinkReaderThread.run(BinaryProtocol.java:114)

attempt_201103301130_0011_m_01_0: Hadoop Pipes Exception: failed
to open  at wordcount-nopipe.cc:82 in
WordCountReader::WordCountReader(HadoopPipes::MapContext&)
11/03/30 12:09:02 INFO mapred.JobClient: Task Id :
attempt_201103301130_0011_m_02_0, Status : FAILED
java.io.IOException: pipe child exception
at
org.apache.hadoop.mapred.pipes.Application.abort(Application.java:151)
at
org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:101)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:250)
at
org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
at
org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
at
org.apache.hadoop.mapred.pipes.BinaryProtocol$UplinkReaderThread.run(BinaryProtocol.java:114)
attempt_201103301130_0011_m_02_1: Hadoop Pipes Exception: failed
to open  at wordcount-nopipe.cc:82 in
WordCountReader::WordCountReader(HadoopPipes::MapContext&)
11/03/30 12:09:15 INFO mapred.JobClient: Task Id :
attempt_201103301130_0011_m_00_2, Status : FAILED
java.io.IOException: pipe child exception
at
org.apache.hadoop.mapred.pipes.Application.abort(Application.java:151)
at
org.apac

Re: Hadoop Pipes Error

2011-03-31 Thread Steve Loughran

On 31/03/11 07:53, Adarsh Sharma wrote:

Thanks Amareshwari,

here is the posting :
The nopipe example needs more documentation. It assumes that it is run
with the InputFormat from src/test/org/apache/hadoop/mapred/pipes/
WordCountInputFormat.java, which has a very specific input split
format. By running with a TextInputFormat, it will send binary bytes as
the input split and won't work right. The nopipe example should
probably be recoded to use libhdfs too, but that is more complicated
to get running as a unit test. Also note that since the C++ example is
using local file reads, it will only work on a cluster if you have nfs
or something working across the cluster.

Please correct me if I'm wrong.

I need to run it with TextInputFormat.

If possible, please explain the above post more clearly.



Here goes.

1.
> The nopipe example needs more documentation. It assumes that it is run
> with the InputFormat from src/test/org/apache/hadoop/mapred/pipes/
> WordCountInputFormat.java, which has a very specific input split
> format. By running with a TextInputFormat, it will send binary bytes as
> the input split and won't work right.

The input for the pipe is the content generated by
src/test/org/apache/hadoop/mapred/pipes/WordCountInputFormat.java

This is covered here.
http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Example%3A+WordCount+v1.0

I would recommend following the tutorial here, or either of the books 
"Hadoop the definitive guide" or "Hadoop in Action". Both authors earn 
their money by explaining how to use Hadoop, which is why both books are 
good explanations of it.


2.
> The nopipe example should
> probably be recoded to use libhdfs too, but that is more complicated
> to get running as a unit test.

Ignore that - it's irrelevant for your problem, as Owen is discussing 
automated testing.


3.

> Also note that since the C++ example is
> using local file reads, it will only work on a cluster if you have nfs
> or something working across the cluster.

Unless your cluster has a shared filesystem at the OS level, it won't 
work. Either have a shared filesystem like NFS, or run it on a single 
machine.


-Steve






Re: How to apply Patch

2011-03-31 Thread Adarsh Sharma

Thanks Steve, you have helped me clear my doubts several times.

Let me explain what my problem is:

I am trying to run the wordcount-nopipe.cc program in the 
/home/hadoop/project/hadoop-0.20.2/src/examples/pipes/impl directory.
I am able to run a simple wordcount.cpp program on the Hadoop cluster, but 
when I go to run this program, I face the below exception:


bash-3.2$ bin/hadoop pipes -D hadoop.pipes.java.recordreader=true -D 
hadoop.pipes.java.recordwriter=true -input gutenberg -output 
gutenberg-out1101 -program bin/wordcount-nopipe2
11/03/31 14:59:07 WARN mapred.JobClient: No job jar file set.  User 
classes may not be found. See JobConf(Class) or JobConf#setJar(String).
11/03/31 14:59:07 INFO mapred.FileInputFormat: Total input paths to 
process : 3

11/03/31 14:59:08 INFO mapred.JobClient: Running job: job_201103310903_0007
11/03/31 14:59:09 INFO mapred.JobClient:  map 0% reduce 0%
11/03/31 14:59:18 INFO mapred.JobClient: Task Id : 
attempt_201103310903_0007_m_00_0, Status : FAILED

java.io.IOException: pipe child exception
   at 
org.apache.hadoop.mapred.pipes.Application.abort(Application.java:151)
   at 
org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:101)

   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
   at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.net.SocketException: Broken pipe
   at java.net.SocketOutputStream.socketWrite0(Native Method)
   at 
java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)

   at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
   at 
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)

   at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
   at java.io.DataOutputStream.flush(DataOutputStream.java:106)
   at 
org.apache.hadoop.mapred.pipes.BinaryProtocol.flush(BinaryProtocol.java:316)
   at 
org.apache.hadoop.mapred.pipes.Application.waitForFinish(Application.java:129)
   at 
org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:99)

   ... 3 more

attempt_201103310903_0007_m_00_0: Hadoop Pipes Exception: failed to 
open  at wordcount-nopipe2.cc:86 in 
WordCountReader::WordCountReader(HadoopPipes::MapContext&)
11/03/31 14:59:18 INFO mapred.JobClient: Task Id : 
attempt_201103310903_0007_m_01_0, Status : FAILED

java.io.IOException: pipe child exception
   at 
org.apache.hadoop.mapred.pipes.Application.abort(Application.java:151)
   at 
org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:101)

   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
   at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.net.SocketException: Broken pipe
   at java.net.SocketOutputStream.socketWrite0(Native Method)
   at 
java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)

   at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
   at 
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)

   at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
   at java.io.DataOutputStream.flush(DataOutputStream.java:106)
   at 
org.apache.hadoop.mapred.pipes.BinaryProtocol.flush(BinaryProtocol.java:316)
   at 
org.apache.hadoop.mapred.pipes.Application.waitForFinish(Application.java:129)
   at 
org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:99)

   ... 3 more

After some R&D, I found the below links quite useful:

http://lucene.472066.n3.nabble.com/pipe-application-error-td650185.html
http://stackoverflow.com/questions/4395140/eofexception-thrown-by-a-hadoop-pipes-program

But I don't know how to resolve this. I think my program tries to open the 
file as file://gutenberg but it requires hdfs://.


Here is the contents of my Makefile :

CC = g++
HADOOP_INSTALL =/home/hadoop/project/hadoop-0.20.2
PLATFORM = Linux-amd64-64
CPPFLAGS = -m64 
-I/home/hadoop/project/hadoop-0.20.2/c++/Linux-amd64-64/include 
-I/usr/local/cuda/include


wordcount-nopipe2 : wordcount-nopipe2.cc
   $(CC) $(CPPFLAGS) $< -Wall 
-L/home/hadoop/project/hadoop-0.20.2/c++/Linux-amd64-64/lib 
-L/usr/local/cuda/lib64 -lhadooppipes \
   -lhadooputils -lpthread -g -O2 -o $@

Could it be a bug in hadoop-0.20.2? If not, please guide me on how to 
debug it.




Thanks & best Regards,
Adarsh Sharma












Steve Loughran wrote:

On 31/03/11 07:37, Adarsh Sharma wrote:

Thanks a lot for such deep explanation :

I have done it now, but it doesn't help me in my original problem for
which I'm doing this.

Please if you have some idea comment on it. I attached the problem.



Sadly. Matt's deep explanation is what you need, low-level that it is

-patches are designed to be applied to source, so you need the a

Re: How to apply Patch

2011-03-31 Thread Steve Loughran

On 31/03/11 07:37, Adarsh Sharma wrote:

Thanks a lot for such deep explanation :

I have done it now, but it doesn't help me in my original problem for
which I'm doing this.

Please if you have some idea comment on it. I attached the problem.



Sadly. Matt's deep explanation is what you need, low-level that it is

-patches are designed to be applied to source, so you need the apache 
source tree, not any binary installations.


-you need to be sure that the source version you have matches that the 
patch is designed to be applied against, unless you want to get into the 
problem of understanding the source enough to fix inconsistencies.


-you need to rebuild hadoop afterwards.

Because Apache code is open source, patches and the like are all bits of 
source. This is not Windows, where the source is secret and all OS 
updates are bits of binary code. The view is that if you want to apply 
patches, then yes, you do have to play at the source level.


The good news is once you can do that for one patch, you can apply 
others, and you will be in a position to find and fix bugs yourself.


-steve


Re: Hadoop Pipes Error

2011-03-31 Thread Adarsh Sharma
What are the steps needed to debug the error and get wordcount-nopipe.cc 
running properly?


If possible, please guide me through the steps.

Thanks & best  Regards,
Adarsh Sharma


Amareshwari Sri Ramadasu wrote:

Here is an answer for your question in old mail archive:
http://lucene.472066.n3.nabble.com/pipe-application-error-td650185.html

On 3/31/11 10:15 AM, "Adarsh Sharma"  wrote:

Any update on the below error.

Please guide.


Thanks & best Regards,
Adarsh Sharma



Adarsh Sharma wrote:
  

Dear all,

Today I faced a problem while running a map-reduce job in C++. I am
not able to understand to find the reason of the below error :


11/03/30 12:09:02 INFO mapred.JobClient: Task Id :
attempt_201103301130_0011_m_00_0, Status : FAILED
java.io.IOException: pipe child exception
at
org.apache.hadoop.mapred.pipes.Application.abort(Application.java:151)
at
org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:101)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:250)
at
org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
at
org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
at
org.apache.hadoop.mapred.pipes.BinaryProtocol$UplinkReaderThread.run(BinaryProtocol.java:114)

attempt_201103301130_0011_m_00_0: Hadoop Pipes Exception: failed
to open  at wordcount-nopipe.cc:82 in
WordCountReader::WordCountReader(HadoopPipes::MapContext&)
11/03/30 12:09:02 INFO mapred.JobClient: Task Id :
attempt_201103301130_0011_m_01_0, Status : FAILED
java.io.IOException: pipe child exception
at
org.apache.hadoop.mapred.pipes.Application.abort(Application.java:151)
at
org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:101)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:250)
at
org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
at
org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
at
org.apache.hadoop.mapred.pipes.BinaryProtocol$UplinkReaderThread.run(BinaryProtocol.java:114)

attempt_201103301130_0011_m_01_0: Hadoop Pipes Exception: failed
to open  at wordcount-nopipe.cc:82 in
WordCountReader::WordCountReader(HadoopPipes::MapContext&)
11/03/30 12:09:02 INFO mapred.JobClient: Task Id :
attempt_201103301130_0011_m_02_0, Status : FAILED
java.io.IOException: pipe child exception
at
org.apache.hadoop.mapred.pipes.Application.abort(Application.java:151)
at
org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:101)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:250)
at
org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
at
org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
at
org.apache.hadoop.mapred.pipes.BinaryProtocol$UplinkReaderThread.run(BinaryProtocol.java:114)
attempt_201103301130_0011_m_02_1: Hadoop Pipes Exception: failed
to open  at wordcount-nopipe.cc:82 in
WordCountReader::WordCountReader(HadoopPipes::MapContext&)
11/03/30 12:09:15 INFO mapred.JobClient: Task Id :
attempt_201103301130_0011_m_00_2, Status : FAILED
java.io.IOException: pipe child exception
at
org.apache.hadoop.mapred.pipes.Application.abort(Application.java:151)
at
org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:101)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:35

I tried to run the wordcount-nopipe.cc program in the
/home/hadoop/project/hadoop-0.20.2/src/examples/pipes/impl directory.


make  wordcount-nopipe
bin/hadoop fs -put wordcount-nopipe   bin/wordcount-nopipe
bin/hadoop pipes -D hadoop.pipes.java.recordreader=true -D
hadoop.pipes.java.recordwriter=true -input gutenberg -output
gutenberg-out11 -program bin/wordcount-nopipe

 or
bin/hadoop pipes -D hadoop.pipes.java.recordreader=false -D
hadoop.pipes.java.recordwriter=false -input gutenberg -output
gutenberg-out11 -program bin/wordcount-nopipe

but the error remains the same. I attached my Makefile as well;
please have a look and comment on it.

I am able to run a simple wordcount.cpp program on the Hadoop cluster but
don't know why this program fails with a Broken Pipe error.



Thanks & best regards
Ada

Re: # of keys per reducer invocation (streaming api)

2011-03-31 Thread Dieter Plaetinck
On Tue, 29 Mar 2011 23:17:13 +0530
Harsh J  wrote:

> Hello,
> 
> On Tue, Mar 29, 2011 at 8:25 PM, Dieter Plaetinck
>  wrote:
> > Hi, I'm using the streaming API and I notice my reducer gets - in
> > the same invocation - a bunch of different keys, and I wonder why.
> > I would expect to get one key per reducer run, as with the "normal"
> > hadoop.
> >
> > Is this to limit the amount of spawned processes, assuming creating
> > and destroying processes is usually expensive compared to the
> > amount of work they'll need to do (not much, if you have many keys
> > with each a handful of values)?
> >
> > OTOH if you have a high number of values over a small number of
> > keys, I would rather stick to one-key-per-reducer-invocation, then
> > I don't need to worry about supporting (and allocating memory for)
> > multiple input keys.  Is there a config setting to enable such
> > behavior?
> >
> > Maybe I'm missing something, but this seems like a big difference in
> > comparison to the default way of working, and should maybe be added
> > to the FAQ at
> > http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Frequently+Asked+Questions
> >
> > thanks,
> > Dieter
> >
> 
> I think it would make more sense to think of streaming programs as
> complete map/reduce 'tasks', instead of trying to apply the Map/Reduce
> functional concept. Both of the programs need to be written from the
> reading level onwards, which in map's case each line is record input
> and in reduce's case it is one uniquely grouped key and all values
> associated to it. One would need to handle the reading-loop
> themselves.
> 
> Some non-Java libraries that provide abstractions atop the
> streaming/etc. layer allow for more fluent representations of the
> map() and reduce() functions, hiding away the other fine details (like
> the Java API). Dumbo[1] is such a library for Python Hadoop Map/Reduce
> programs, for example.
> 
> A FAQ entry on this should do good too! You can file a ticket for an
> addition of this observation to the streaming docs' FAQ.
> 
> [1] - https://github.com/klbostee/dumbo/wiki/Short-tutorial
> 

Thanks,
this makes it a little clearer.
I made a ticket @ https://issues.apache.org/jira/browse/MAPREDUCE-2410

Dieter


RE: Hadoop for Bioinformatics

2011-03-31 Thread Evert Lammerts
> The short answer is yes!  At CRS4 we are working on this very problem.
>
> We have implemented a Hadoop-based workflow to perform short read
> alignment to
> support DNA sequencing activities in our lab.  Its alignment operation
> is
> based on (and therefore equivalent to) BWA.  We have written a paper
> about it
> which will appear in the coming months, and we are working on an open
> source
> release, but alas we haven't completed that task yet.
>
> We have also implemented a Hadoop-based distributed blast alignment
> program,
> in case you're working with long fragments.  It's currently being used
> by our
> collaborators to align viral DNA segments.
>
>
> In either case, if you're interested we can let you have an advance
> release of
> either program so you can try them out.

Hi Luca,

Could you send me an advance release of your software? I work for the Dutch 
national center for scientific computing, and I will give a workshop on Hadoop 
for bioinformatics at a large BI conference 
(http://www.nbic.nl/about-nbic/nbic-conferences/nbic-conference-2011/). Lots of 
people there work with BWA- and BLAST-type applications (among others in the 
BBMRI project, which I think CRS4 is involved in as well), so BWA on Hadoop 
could be a great case study.

Let me know!
Cheers,
Evert

>
>
> --
> Luca Pireddu
> CRS4 - Distributed Computing Group
> Loc. Pixina Manna Edificio 1
> Pula 09010 (CA), Italy
> Tel:  +39 0709250452