Re: Issue distcp'ing from 0.19.2 to 0.18.3

2009-04-08 Thread Todd Lipcon
Hey Bryan,

Any chance you can get a tshark trace on the 0.19 namenode? Maybe:
tshark -s 10 -w nndump.pcap port 7276

Also, are the clocks synced on the two machines? The failure of your distcp
is at 23:32:39, but the namenode log message you posted was 23:29:09. Did
those messages actually pop out at the same time?

Thanks
-Todd

On Wed, Apr 8, 2009 at 11:39 PM, Bryan Duxbury  wrote:

> Hey all,
>
> I was trying to copy some data from our cluster on 0.19.2 to a new cluster
> on 0.18.3 by using distcp and the hftp:// filesystem. Everything seemed to
> be going fine for a few hours, but then a few tasks failed because a few
> files got 500 errors when trying to be read from the 19 cluster. As a result
> the job died. Now that I'm trying to restart it, I get this error:
>
> [rapl...@ds-nn2 ~]$ hadoop distcp hftp://ds-nn1:7276/
> hdfs://ds-nn2:7276/cluster-a
> 09/04/08 23:32:39 INFO tools.DistCp: srcPaths=[hftp://ds-nn1:7276/]
> 09/04/08 23:32:39 INFO tools.DistCp: destPath=hdfs://ds-nn2:7276/cluster-a
> With failures, global counters are inaccurate; consider running with -i
> Copy failed: java.net.SocketException: Unexpected end of file from server
>at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:769)
>at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
>at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:766)
>at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
>at
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1000)
>at
> org.apache.hadoop.dfs.HftpFileSystem$LsParser.fetchList(HftpFileSystem.java:183)
>at
> org.apache.hadoop.dfs.HftpFileSystem$LsParser.getFileStatus(HftpFileSystem.java:193)
>at
> org.apache.hadoop.dfs.HftpFileSystem.getFileStatus(HftpFileSystem.java:222)
>at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
>at org.apache.hadoop.tools.DistCp.checkSrcPath(DistCp.java:588)
>at org.apache.hadoop.tools.DistCp.copy(DistCp.java:609)
>at org.apache.hadoop.tools.DistCp.run(DistCp.java:768)
>at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>at org.apache.hadoop.tools.DistCp.main(DistCp.java:788)
>
> I changed nothing at all between the first attempt and the subsequent
> failed attempts. The only clues in the namenode log for the 19 cluster are:
>
> 2009-04-08 23:29:09,786 WARN org.apache.hadoop.ipc.Server: Incorrect header
> or version mismatch from 10.100.50.252:47733 got version 47 expected
> version 2
>
> Anyone have any ideas?
>
> -Bryan
>


Issue distcp'ing from 0.19.2 to 0.18.3

2009-04-08 Thread Bryan Duxbury

Hey all,

I was trying to copy some data from our cluster on 0.19.2 to a new
cluster on 0.18.3 by using distcp and the hftp:// filesystem.
Everything seemed to be going fine for a few hours, but then a few
tasks failed because a few files got 500 errors when being read from
the 0.19 cluster. As a result the job died. Now that I'm trying to
restart it, I get this error:


[rapl...@ds-nn2 ~]$ hadoop distcp hftp://ds-nn1:7276/ hdfs://ds-nn2:7276/cluster-a

09/04/08 23:32:39 INFO tools.DistCp: srcPaths=[hftp://ds-nn1:7276/]
09/04/08 23:32:39 INFO tools.DistCp: destPath=hdfs://ds-nn2:7276/cluster-a

With failures, global counters are inaccurate; consider running with -i
Copy failed: java.net.SocketException: Unexpected end of file from server
        at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:769)
        at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
        at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:766)
        at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1000)
        at org.apache.hadoop.dfs.HftpFileSystem$LsParser.fetchList(HftpFileSystem.java:183)
        at org.apache.hadoop.dfs.HftpFileSystem$LsParser.getFileStatus(HftpFileSystem.java:193)
        at org.apache.hadoop.dfs.HftpFileSystem.getFileStatus(HftpFileSystem.java:222)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
        at org.apache.hadoop.tools.DistCp.checkSrcPath(DistCp.java:588)
        at org.apache.hadoop.tools.DistCp.copy(DistCp.java:609)
        at org.apache.hadoop.tools.DistCp.run(DistCp.java:768)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.tools.DistCp.main(DistCp.java:788)

I changed nothing at all between the first attempt and the subsequent
failed attempts. The only clue in the namenode log for the 0.19 cluster is:

2009-04-08 23:29:09,786 WARN org.apache.hadoop.ipc.Server: Incorrect header or version mismatch from 10.100.50.252:47733 got version 47 expected version 2


Anyone have any ideas?

-Bryan


Re: Polymorphic behavior of Maps in One Job?

2009-04-08 Thread Sharad Agarwal

>
>
> MultipleInputs.addInputPath(JobConf conf, Path path, Class<? extends InputFormat> inputFormatClass, Class<? extends Mapper> mapperClass)
> to add the mappers and my I/P format.
Right, and then you can use DelegatingInputFormat and DelegatingMapper.
>
> And use MultipleOutputs class to configure the O/P from the mappers.
> IF this is right where do i add the multiple implementations for the
> reducers in the JobConf??
Unlike mappers, multiple reducers can't be set. Multiple mappers are chosen
based on the different input paths, but the same doesn't hold for reducers.
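
For reference, a minimal sketch of that setup might look like the following
(the driver, mapper, reducer and path names are made-up placeholders, and the
usual org.apache.hadoop.mapred and mapred.lib imports are assumed):

    JobConf conf = new JobConf(MyDriver.class);

    // One mapper (and input format) per input path; under the hood this is
    // dispatched through DelegatingInputFormat/DelegatingMapper.
    MultipleInputs.addInputPath(conf, new Path("/data/typeA"),
        TextInputFormat.class, TypeAMapper.class);
    MultipleInputs.addInputPath(conf, new Path("/data/typeB"),
        TextInputFormat.class, TypeBMapper.class);

    // Only a single reducer implementation can be set for the whole job.
    conf.setReducerClass(SharedReducer.class);
    FileOutputFormat.setOutputPath(conf, new Path("/data/out"));

    JobClient.runJob(conf);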

- Sharad


RE: CloudBurst: Hadoop for DNA Sequence Analysis

2009-04-08 Thread Dmitry Pushkarev
As a matter of fact it is nowhere near data intensive: it does take gigabytes
of input data to process, but it is mostly RAM and CPU intensive. Although
post-processing of alignment files is exactly where Hadoop excels. At least as
far as I understand, the majority of the time is spent on DP alignment, whereas
navigation in seed space and the N*log(N) sort require only a fraction of that
time - that was my experience applying a Hadoop cluster to sequencing human
genomes.



---
Dmitry Pushkarev
+1-650-644-8988

-Original Message-
From: michael.sch...@gmail.com [mailto:michael.sch...@gmail.com] On Behalf
Of Michael Schatz
Sent: Wednesday, April 08, 2009 9:19 PM
To: core-user@hadoop.apache.org
Subject: CloudBurst: Hadoop for DNA Sequence Analysis

Hadoop Users,

I just wanted to announce my Hadoop application 'CloudBurst' is available
open source at:
http://cloudburst-bio.sourceforge.net

In a nutshell, it is an application for mapping millions of short DNA
sequences to a reference genome to, for example, map out differences in one
individual's genome compared to the reference genome. As you might imagine,
this is a very data-intensive problem, but Hadoop enables the application to
scale up linearly to large clusters.

A full description of the program is available in the journal
Bioinformatics:
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btp236

I also wanted to take this opportunity to thank everyone on this mailing
list. The discussions posted were essential for navigating the ins and outs
of hadoop during the development of CloudBurst.

Thanks everyone!

Michael Schatz

http://www.cbcb.umd.edu/~mschatz



CloudBurst: Hadoop for DNA Sequence Analysis

2009-04-08 Thread Michael Schatz
Hadoop Users,

I just wanted to announce my Hadoop application 'CloudBurst' is available
open source at:
http://cloudburst-bio.sourceforge.net

In a nutshell, it is an application for mapping millions of short DNA
sequences to a reference genome to, for example, map out differences in one
individual's genome compared to the reference genome. As you might imagine,
this is a very data-intensive problem, but Hadoop enables the application to
scale up linearly to large clusters.

A full description of the program is available in the journal
Bioinformatics:
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btp236

I also wanted to take this opportunity to thank everyone on this mailing
list. The discussions posted were essential for navigating the ins and outs
of hadoop during the development of CloudBurst.

Thanks everyone!

Michael Schatz

http://www.cbcb.umd.edu/~mschatz


Re: Chaining Multiple Map reduce jobs.

2009-04-08 Thread jason hadoop
Chapter 8 of my book covers this in detail, the alpha chapter should be
available at the apress web site
Chain mapping rules!
http://www.apress.com/book/view/1430219424

On Wed, Apr 8, 2009 at 3:30 PM, Nathan Marz  wrote:

> You can also try decreasing the replication factor for the intermediate
> files between jobs. This will make writing those files faster.
>
>
> On Apr 8, 2009, at 3:14 PM, Lukáš Vlček wrote:
>
>  Hi,
>> by far I am not an Hadoop expert but I think you can not start Map task
>> until the previous Reduce is finished. Saying this it means that you
>> probably have to store the Map output to the disk first (because a] it may
>> not fit into memory and b] you would risk data loss if the system
>> crashes).
>> As for the job chaining you can check JobControl class (
>>
>> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/jobcontrol/JobControl.html
>> )
>>
>> Also you can look at https://issues.apache.org/jira/browse/HADOOP-3702
>>
>> Regards,
>> Lukas
>>
>> On Wed, Apr 8, 2009 at 11:30 PM, asif md  wrote:
>>
>>  hi everyone,
>>>
>>> i have to chain multiple map reduce jobs < actually 2 to 4 jobs >, each
>>> of
>>> the jobs depends on the o/p of preceding job. In the reducer of each job
>>> I'm
>>> doing very little < just grouping by key from the maps>. I want to give
>>> the
>>> output of one MapReduce job to the next job without having to go to the
>>> disk. Does anyone have any ideas on how to do this?
>>>
>>> Thanx.
>>>
>>>
>>
>>
>> --
>> http://blog.lukas-vlcek.com/
>>
>
>


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Re: BytesWritable get() returns more bytes then what's stored

2009-04-08 Thread Alex Loddengaard
FYI: this (open) JIRA might be interesting to you:



Alex

On Wed, Apr 8, 2009 at 7:18 PM, Todd Lipcon  wrote:

> On Wed, Apr 8, 2009 at 7:14 PM, bzheng  wrote:
>
> >
> > Thanks for the clarification.  Though I still find it strange why not
> have
> > the get() method return what's actually stored regardless of buffer size.
> > Is there any reason why you'd want to use/examine what's in the buffer?
> >
>
> Because doing so requires an array copy. It's important for hadoop
> performance to avoid needless copies of data when they're unnecessary. Most
> APIs that take byte[] arrays have a version that includes an offset and
> length.
>
> -Todd
>
>
>
> >
> >
> > Todd Lipcon-4 wrote:
> > >
> > > Hi Bing,
> > >
> > > The issue here is that BytesWritable uses an internal buffer which is
> > > grown
> > > but not shrunk. The cause of this is that Writables in general are
> single
> > > instances that are shared across multiple input records. If you look at
> > > the
> > > internals of the input reader, you'll see that a single BytesWritable
> is
> > > instantiated, and then each time a record is read, it's read into that
> > > same
> > > instance. The purpose here is to avoid the allocation cost for each
> row.
> > >
> > > The end result is, as you've seen, that getBytes() returns an array
> which
> > > may be larger than the actual amount of data. In fact, the extra bytes
> > > (between .getSize() and .get().length) have undefined contents, not
> zero.
> > >
> > > Unfortunately, if the protobuffer API doesn't allow you to deserialize
> > out
> > > of a smaller portion of a byte array, you're out of luck and will have
> to
> > > do
> > > the copy like you've mentioned. I imagine, though, that there's some
> way
> > > around this in the protobuffer API - perhaps you can use a
> > > ByteArrayInputStream here to your advantage.
> > >
> > > Hope that helps
> > > -Todd
> > >
> > > On Wed, Apr 8, 2009 at 4:59 PM, bzheng  wrote:
> > >
> > >>
> > >> I tried to store protocolbuffer as BytesWritable in a sequence file
> > >> <Text, BytesWritable>.  It's stored using SequenceFile.Writer(new Text(key),
> > new
> > >> BytesWritable(protobuf.convertToBytes())).  When reading the values
> from
> > >> key/value pairs using value.get(), it returns more then what's stored.
> > >> However, value.getSize() returns the correct number.  This means in
> > order
> > >> to
> > >> convert the byte[] to protocol buffer again, I have to do
> > >> Arrays.copyOf(value.get(), value.getSize()).  This happens on both
> > >> version
> > >> 0.17.2 and 0.18.3.  Does anyone know why this happens?  Sample sizes
> for
> > >> a
> > >> few entries in the sequence file below.  The extra bytes in
> value.get()
> > >> all
> > >> have values of zero.
> > >>
> > >> value.getSize(): 7066   value.get().length: 10599
> > >> value.getSize(): 36456  value.get().length: 54684
> > >> value.getSize(): 32275  value.get().length: 54684
> > >> value.getSize(): 40561  value.get().length: 54684
> > >> value.getSize(): 16855  value.get().length: 54684
> > >> value.getSize(): 66304  value.get().length: 99456
> > >> value.getSize(): 26488  value.get().length: 99456
> > >> value.getSize(): 59327  value.get().length: 99456
> > >> value.getSize(): 36865  value.get().length: 99456
> > >>
> > >> --
> > >> View this message in context:
> > >>
> >
> http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22962146.html
> > >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> > >>
> > >>
> > >
> > >
> >
> > --
> > View this message in context:
> >
> http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22963309.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >
> >
>


Re: BytesWritable get() returns more bytes then what's stored

2009-04-08 Thread Todd Lipcon
On Wed, Apr 8, 2009 at 7:14 PM, bzheng  wrote:

>
> Thanks for the clarification.  Though I still find it strange why not have
> the get() method return what's actually stored regardless of buffer size.
> Is there any reason why you'd want to use/examine what's in the buffer?
>

Because doing so requires an array copy. It's important for hadoop
performance to avoid needless copies of data when they're unnecessary. Most
APIs that take byte[] arrays have a version that includes an offset and
length.

-Todd



>
>
> Todd Lipcon-4 wrote:
> >
> > Hi Bing,
> >
> > The issue here is that BytesWritable uses an internal buffer which is
> > grown
> > but not shrunk. The cause of this is that Writables in general are single
> > instances that are shared across multiple input records. If you look at
> > the
> > internals of the input reader, you'll see that a single BytesWritable is
> > instantiated, and then each time a record is read, it's read into that
> > same
> > instance. The purpose here is to avoid the allocation cost for each row.
> >
> > The end result is, as you've seen, that getBytes() returns an array which
> > may be larger than the actual amount of data. In fact, the extra bytes
> > (between .getSize() and .get().length) have undefined contents, not zero.
> >
> > Unfortunately, if the protobuffer API doesn't allow you to deserialize
> out
> > of a smaller portion of a byte array, you're out of luck and will have to
> > do
> > the copy like you've mentioned. I imagine, though, that there's some way
> > around this in the protobuffer API - perhaps you can use a
> > ByteArrayInputStream here to your advantage.
> >
> > Hope that helps
> > -Todd
> >
> > On Wed, Apr 8, 2009 at 4:59 PM, bzheng  wrote:
> >
> >>
> >> I tried to store protocolbuffer as BytesWritable in a sequence file
> >> <Text, BytesWritable>.  It's stored using SequenceFile.Writer(new Text(key), new
> >> BytesWritable(protobuf.convertToBytes())).  When reading the values from
> >> key/value pairs using value.get(), it returns more then what's stored.
> >> However, value.getSize() returns the correct number.  This means in
> order
> >> to
> >> convert the byte[] to protocol buffer again, I have to do
> >> Arrays.copyOf(value.get(), value.getSize()).  This happens on both
> >> version
> >> 0.17.2 and 0.18.3.  Does anyone know why this happens?  Sample sizes for
> >> a
> >> few entries in the sequence file below.  The extra bytes in value.get()
> >> all
> >> have values of zero.
> >>
> >> value.getSize(): 7066   value.get().length: 10599
> >> value.getSize(): 36456  value.get().length: 54684
> >> value.getSize(): 32275  value.get().length: 54684
> >> value.getSize(): 40561  value.get().length: 54684
> >> value.getSize(): 16855  value.get().length: 54684
> >> value.getSize(): 66304  value.get().length: 99456
> >> value.getSize(): 26488  value.get().length: 99456
> >> value.getSize(): 59327  value.get().length: 99456
> >> value.getSize(): 36865  value.get().length: 99456
> >>
> >> --
> >> View this message in context:
> >>
> http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22962146.html
> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22963309.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


Re: BytesWritable get() returns more bytes then what's stored

2009-04-08 Thread bzheng

Thanks for the clarification.  Though I still find it strange that the get()
method doesn't return what's actually stored regardless of buffer size.
Is there any reason why you'd want to use/examine what's in the buffer?


Todd Lipcon-4 wrote:
> 
> Hi Bing,
> 
> The issue here is that BytesWritable uses an internal buffer which is
> grown
> but not shrunk. The cause of this is that Writables in general are single
> instances that are shared across multiple input records. If you look at
> the
> internals of the input reader, you'll see that a single BytesWritable is
> instantiated, and then each time a record is read, it's read into that
> same
> instance. The purpose here is to avoid the allocation cost for each row.
> 
> The end result is, as you've seen, that getBytes() returns an array which
> may be larger than the actual amount of data. In fact, the extra bytes
> (between .getSize() and .get().length) have undefined contents, not zero.
> 
> Unfortunately, if the protobuffer API doesn't allow you to deserialize out
> of a smaller portion of a byte array, you're out of luck and will have to
> do
> the copy like you've mentioned. I imagine, though, that there's some way
> around this in the protobuffer API - perhaps you can use a
> ByteArrayInputStream here to your advantage.
> 
> Hope that helps
> -Todd
> 
> On Wed, Apr 8, 2009 at 4:59 PM, bzheng  wrote:
> 
>>
>> I tried to store protocolbuffer as BytesWritable in a sequence file
>> <Text, BytesWritable>.  It's stored using SequenceFile.Writer(new Text(key), new
>> BytesWritable(protobuf.convertToBytes())).  When reading the values from
>> key/value pairs using value.get(), it returns more then what's stored.
>> However, value.getSize() returns the correct number.  This means in order
>> to
>> convert the byte[] to protocol buffer again, I have to do
>> Arrays.copyOf(value.get(), value.getSize()).  This happens on both
>> version
>> 0.17.2 and 0.18.3.  Does anyone know why this happens?  Sample sizes for
>> a
>> few entries in the sequence file below.  The extra bytes in value.get()
>> all
>> have values of zero.
>>
>> value.getSize(): 7066   value.get().length: 10599
>> value.getSize(): 36456  value.get().length: 54684
>> value.getSize(): 32275  value.get().length: 54684
>> value.getSize(): 40561  value.get().length: 54684
>> value.getSize(): 16855  value.get().length: 54684
>> value.getSize(): 66304  value.get().length: 99456
>> value.getSize(): 26488  value.get().length: 99456
>> value.getSize(): 59327  value.get().length: 99456
>> value.getSize(): 36865  value.get().length: 99456
>>
>> --
>> View this message in context:
>> http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22962146.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22963309.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: BytesWritable get() returns more bytes then what's stored

2009-04-08 Thread Gaurav Chandalia
Arrays.copyOf isn't required; protocol buffers have a method to merge
from bytes. You can do:


protobuf.newBuilder().mergeFrom(value.getBytes(), 0, value.getLength())

The above is for Hadoop 0.19.1; the corresponding BytesWritable method names
in earlier versions of Hadoop might be slightly different.
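
A slightly fuller sketch of that pattern, assuming a generated protobuf
message class named MyRecord (a placeholder name) and the 0.19 BytesWritable
accessors:

    Text key = new Text();
    BytesWritable value = new BytesWritable();
    // ... after reader.next(key, value) ...

    // Parse only the valid region of the backing buffer; no Arrays.copyOf needed.
    // (mergeFrom may throw InvalidProtocolBufferException - handle or rethrow.)
    MyRecord record = MyRecord.newBuilder()
        .mergeFrom(value.getBytes(), 0, value.getLength())
        .build();

On 0.17/0.18 the accessors are value.get() and value.getSize(), as discussed
elsewhere in this thread.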


--
gaurav



On Apr 8, 2009, at 7:13 PM, Todd Lipcon wrote:


Hi Bing,

The issue here is that BytesWritable uses an internal buffer which is grown
but not shrunk. The cause of this is that Writables in general are single
instances that are shared across multiple input records. If you look at the
internals of the input reader, you'll see that a single BytesWritable is
instantiated, and then each time a record is read, it's read into that same
instance. The purpose here is to avoid the allocation cost for each row.

The end result is, as you've seen, that getBytes() returns an array which
may be larger than the actual amount of data. In fact, the extra bytes
(between .getSize() and .get().length) have undefined contents, not zero.

Unfortunately, if the protobuffer API doesn't allow you to deserialize out
of a smaller portion of a byte array, you're out of luck and will have to do
the copy like you've mentioned. I imagine, though, that there's some way
around this in the protobuffer API - perhaps you can use a
ByteArrayInputStream here to your advantage.

Hope that helps
-Todd

On Wed, Apr 8, 2009 at 4:59 PM, bzheng  wrote:

I tried to store protocolbuffer as BytesWritable in a sequence file
<Text, BytesWritable>.  It's stored using SequenceFile.Writer(new Text(key), new
BytesWritable(protobuf.convertToBytes())).  When reading the values from
key/value pairs using value.get(), it returns more then what's stored.
However, value.getSize() returns the correct number.  This means in order to
convert the byte[] to protocol buffer again, I have to do
Arrays.copyOf(value.get(), value.getSize()).  This happens on both version
0.17.2 and 0.18.3.  Does anyone know why this happens?  Sample sizes for a
few entries in the sequence file below.  The extra bytes in value.get() all
have values of zero.

value.getSize(): 7066   value.get().length: 10599
value.getSize(): 36456  value.get().length: 54684
value.getSize(): 32275  value.get().length: 54684
value.getSize(): 40561  value.get().length: 54684
value.getSize(): 16855  value.get().length: 54684
value.getSize(): 66304  value.get().length: 99456
value.getSize(): 26488  value.get().length: 99456
value.getSize(): 59327  value.get().length: 99456
value.getSize(): 36865  value.get().length: 99456

--
View this message in context:
http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22962146.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.






Re: BytesWritable get() returns more bytes then what's stored

2009-04-08 Thread Todd Lipcon
Hi Bing,

The issue here is that BytesWritable uses an internal buffer which is grown
but not shrunk. The cause of this is that Writables in general are single
instances that are shared across multiple input records. If you look at the
internals of the input reader, you'll see that a single BytesWritable is
instantiated, and then each time a record is read, it's read into that same
instance. The purpose here is to avoid the allocation cost for each row.

The end result is, as you've seen, that getBytes() returns an array which
may be larger than the actual amount of data. In fact, the extra bytes
(between .getSize() and .get().length) have undefined contents, not zero.

Unfortunately, if the protobuffer API doesn't allow you to deserialize out
of a smaller portion of a byte array, you're out of luck and will have to do
the copy like you've mentioned. I imagine, though, that there's some way
around this in the protobuffer API - perhaps you can use a
ByteArrayInputStream here to your advantage.
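
A hedged sketch of that last idea, sticking to the 0.17/0.18 accessors
mentioned in this thread (MyRecord is a placeholder for the generated
protobuffer class, and it assumes the protobuffer API can parse from an
InputStream, which is worth double-checking for your version):

    byte[] buf = value.get();       // backing buffer, may be longer than the data
    int len = value.getSize();      // number of valid bytes

    // Wrap only the valid region so the parser never sees the padding.
    InputStream in = new ByteArrayInputStream(buf, 0, len);
    MyRecord record = MyRecord.parseFrom(in);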

Hope that helps
-Todd

On Wed, Apr 8, 2009 at 4:59 PM, bzheng  wrote:

>
> I tried to store protocolbuffer as BytesWritable in a sequence file <Text, BytesWritable>.  It's stored using SequenceFile.Writer(new Text(key), new
> BytesWritable(protobuf.convertToBytes())).  When reading the values from
> key/value pairs using value.get(), it returns more then what's stored.
> However, value.getSize() returns the correct number.  This means in order
> to
> convert the byte[] to protocol buffer again, I have to do
> Arrays.copyOf(value.get(), value.getSize()).  This happens on both version
> 0.17.2 and 0.18.3.  Does anyone know why this happens?  Sample sizes for a
> few entries in the sequence file below.  The extra bytes in value.get() all
> have values of zero.
>
> value.getSize(): 7066   value.get().length: 10599
> value.getSize(): 36456  value.get().length: 54684
> value.getSize(): 32275  value.get().length: 54684
> value.getSize(): 40561  value.get().length: 54684
> value.getSize(): 16855  value.get().length: 54684
> value.getSize(): 66304  value.get().length: 99456
> value.getSize(): 26488  value.get().length: 99456
> value.getSize(): 59327  value.get().length: 99456
> value.getSize(): 36865  value.get().length: 99456
>
> --
> View this message in context:
> http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22962146.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


BytesWritable get() returns more bytes then what's stored

2009-04-08 Thread bzheng

I tried to store protocolbuffer as BytesWritable in a sequence file <Text, BytesWritable>.  It's stored using SequenceFile.Writer(new Text(key), new
BytesWritable(protobuf.convertToBytes())).  When reading the values from
key/value pairs using value.get(), it returns more than what's stored.
However, value.getSize() returns the correct number.  This means in order to
convert the byte[] to protocol buffer again, I have to do
Arrays.copyOf(value.get(), value.getSize()).  This happens on both version
0.17.2 and 0.18.3.  Does anyone know why this happens?  Sample sizes for a
few entries in the sequence file below.  The extra bytes in value.get() all
have values of zero.  

value.getSize(): 7066   value.get().length: 10599
value.getSize(): 36456  value.get().length: 54684
value.getSize(): 32275  value.get().length: 54684
value.getSize(): 40561  value.get().length: 54684
value.getSize(): 16855  value.get().length: 54684
value.getSize(): 66304  value.get().length: 99456
value.getSize(): 26488  value.get().length: 99456
value.getSize(): 59327  value.get().length: 99456
value.getSize(): 36865  value.get().length: 99456

-- 
View this message in context: 
http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22962146.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Chaining Multiple Map reduce jobs.

2009-04-08 Thread Nathan Marz
You can also try decreasing the replication factor for the  
intermediate files between jobs. This will make writing those files  
faster.
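
A hedged sketch of one way to do that for the job that writes the
intermediate files (the property name is the standard HDFS one, and everything
else is a placeholder; keep in mind fewer replicas also means less protection
if a node dies before the next job reads the data):

    JobConf stage1 = new JobConf(MyDriver.class);
    FileOutputFormat.setOutputPath(stage1, new Path("/tmp/stage1-out"));

    // Intermediate output gets written with 1 replica instead of the default 3.
    stage1.setInt("dfs.replication", 1);

    JobClient.runJob(stage1);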


On Apr 8, 2009, at 3:14 PM, Lukáš Vlček wrote:


Hi,
by far I am not an Hadoop expert but I think you can not start Map task
until the previous Reduce is finished. Saying this it means that you
probably have to store the Map output to the disk first (because a] it may
not fit into memory and b] you would risk data loss if the system crashes).

As for the job chaining you can check JobControl class (
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/jobcontrol/JobControl.html)



Also you can look at https://issues.apache.org/jira/browse/HADOOP-3702

Regards,
Lukas

On Wed, Apr 8, 2009 at 11:30 PM, asif md  wrote:


hi everyone,

i have to chain multiple map reduce jobs < actually 2 to 4 jobs >, each of
the jobs depends on the o/p of preceding job. In the reducer of each job I'm
doing very little < just grouping by key from the maps>. I want to give the
output of one MapReduce job to the next job without having to go to the
disk. Does anyone have any ideas on how to do this?

Thanx.





--
http://blog.lukas-vlcek.com/




Re: Chaining Multiple Map reduce jobs.

2009-04-08 Thread Lukáš Vlček
Hi,
by far I am not a Hadoop expert, but I think you cannot start a Map task
until the previous Reduce is finished. This means that you probably have to
store the Map output to disk first (because a] it may not fit into memory and
b] you would risk data loss if the system crashes).
As for the job chaining you can check JobControl class (
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/jobcontrol/JobControl.html)

Also you can look at https://issues.apache.org/jira/browse/HADOOP-3702
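
For reference, a minimal sketch of chaining two dependent jobs with
JobControl (the two JobConf objects are assumed to be fully configured
elsewhere, with the second job's input path pointing at the first job's
output path; uses org.apache.hadoop.mapred.jobcontrol.Job and JobControl):

    // Job wraps a JobConf plus its dependencies.
    Job first  = new Job(firstConf);
    Job second = new Job(secondConf);
    second.addDependingJob(first);          // second starts only after first succeeds

    JobControl control = new JobControl("chained-jobs");
    control.addJob(first);
    control.addJob(second);

    // JobControl is a Runnable: run it in a thread and poll until everything is done.
    new Thread(control).start();
    while (!control.allFinished()) {
      try { Thread.sleep(5000); } catch (InterruptedException e) { }
    }
    control.stop();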

Regards,
Lukas

On Wed, Apr 8, 2009 at 11:30 PM, asif md  wrote:

> hi everyone,
>
> i have to chain multiple map reduce jobs < actually 2 to 4 jobs >, each of
> the jobs depends on the o/p of preceding job. In the reducer of each job
> I'm
> doing very little < just grouping by key from the maps>. I want to give the
> output of one MapReduce job to the next job without having to go to the
> disk. Does anyone have any ideas on how to do this?
>
> Thanx.
>



-- 
http://blog.lukas-vlcek.com/


Re: Example of deploying jars through DistributedCache?

2009-04-08 Thread Aaron Kimball
Ooh. The other DCache-based operations assume that you're dcaching files
already resident in HDFS. I guess this assumes that the filenames are on the
local filesystem.

- Aaron

On Wed, Apr 8, 2009 at 8:32 AM, Brian MacKay wrote:

>
> I use addArchiveToClassPath, and it works for me.
>
> DistributedCache.addArchiveToClassPath(new Path(path), conf);
>
> I was curious about this block of code.  Why are you copying to tmp?
>
> >FileSystem fs = FileSystem.get(conf);
> >fs.copyFromLocalFile(new Path("aaronTest2.jar"), new
> > Path("tmp/aaronTest2.jar"));
>
> -Original Message-
> From: Tom White [mailto:t...@cloudera.com]
> Sent: Wednesday, April 08, 2009 9:36 AM
> To: core-user@hadoop.apache.org
> Subject: Re: Example of deploying jars through DistributedCache?
>
> Does it work if you use addArchiveToClassPath()?
>
> Also, it may be more convenient to use GenericOptionsParser's -libjars
> option.
>
> Tom
>
> On Mon, Mar 2, 2009 at 7:42 AM, Aaron Kimball  wrote:
> > Hi all,
> >
> > I'm stumped as to how to use the distributed cache's classpath feature. I
> > have a library of Java classes I'd like to distribute to jobs and use in
> my
> > mapper; I figured the DCache's addFileToClassPath() method was the
> correct
> > means, given the example at
> >
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html
> .
> >
> >
> > I've boiled it down to the following non-working example:
> >
> > in TestDriver.java:
> >
> >
> >  private void runJob() throws IOException {
> >JobConf conf = new JobConf(getConf(), TestDriver.class);
> >
> >// do standard job configuration.
> >FileInputFormat.addInputPath(conf, new Path("input"));
> >FileOutputFormat.setOutputPath(conf, new Path("output"));
> >
> >conf.setMapperClass(TestMapper.class);
> >conf.setNumReduceTasks(0);
> >
> >// load aaronTest2.jar into the dcache; this contains the class
> > ValueProvider
> >FileSystem fs = FileSystem.get(conf);
> >fs.copyFromLocalFile(new Path("aaronTest2.jar"), new
> > Path("tmp/aaronTest2.jar"));
> >DistributedCache.addFileToClassPath(new Path("tmp/aaronTest2.jar"),
> > conf);
> >
> >// run the job.
> >JobClient.runJob(conf);
> >  }
> >
> >
> >  and then in TestMapper:
> >
> >  public void map(LongWritable key, Text value,
> > OutputCollector output,
> >  Reporter reporter) throws IOException {
> >
> >try {
> >  ValueProvider vp = (ValueProvider)
> > Class.forName("ValueProvider").newInstance();
> >  Text val = vp.getValue();
> >  output.collect(new LongWritable(1), val);
> >} catch (ClassNotFoundException e) {
> >  throw new IOException("not found: " + e.toString()); //
> newInstance()
> > throws to here.
> >} catch (Exception e) {
> >  throw new IOException("Exception:" + e.toString());
> >}
> >  }
> >
> >
> > The class "ValueProvider" is to be loaded from aaronTest2.jar. I can
> verify
> > that this code works if I put ValueProvider into the main jar I deploy. I
> > can verify that aaronTest2.jar makes it into the
> > ${mapred.local.dir}/taskTracker/archive/
> >
> > But when run with ValueProvider in aaronTest2.jar, the job fails with:
> >
> > $ bin/hadoop jar aaronTest1.jar TestDriver
> > 09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to
> process
> > : 10
> > 09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to
> process
> > : 10
> > 09/03/01 22:36:04 INFO mapred.JobClient: Running job:
> job_200903012210_0005
> > 09/03/01 22:36:05 INFO mapred.JobClient:  map 0% reduce 0%
> > 09/03/01 22:36:14 INFO mapred.JobClient: Task Id :
> > attempt_200903012210_0005_m_00_0, Status : FAILED
> > java.io.IOException: not found: java.lang.ClassNotFoundException:
> > ValueProvider
> >at TestMapper.map(Unknown Source)
> >at TestMapper.map(Unknown Source)
> >at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
> >at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
> >at
> > org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
> >
> >
> > Do I need to do something else (maybe in Mapper.configure()?) to actually
> > classload the jar? The documentation makes me believe it should already
> be
> > in the classpath by doing only what I've done above. I'm on Hadoop
> 0.18.3.
> >
> > Thanks,
> > - Aaron
> >
>
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>
> The information transmitted is intended only for the person or entity to
> which it is addressed and may contain confidential and/or privileged
> material. Any review, retransmission, dissemination or other use of, or
> taking of any action in reliance upon, this information by persons or
> entities other than the intended recipient is prohibited. If you received
> this message in error, please contact the sender and delete the material
> from any computer.
>
>
>


Re: How many people is using Hadoop Streaming ?

2009-04-08 Thread Aaron Kimball
Excellent. Thanks
- A

On Tue, Apr 7, 2009 at 2:16 PM, Owen O'Malley  wrote:

>
> On Apr 7, 2009, at 11:41 AM, Aaron Kimball wrote:
>
>  Owen,
>>
>> Is binary streaming actually readily available?
>>
>
> https://issues.apache.org/jira/browse/HADOOP-1722
>
>


Re: Getting free and used space

2009-04-08 Thread Aaron Kimball
You can insert this property into the jobconf, or specify it on the command
line, e.g.: -D hadoop.job.ugi=username,group,group,group.
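
A hedged sketch of the configuration-level route (the user and groups below
are placeholders and would have to match whatever your namenode considers the
superuser for getUsed() to stop failing):

    Configuration conf = new Configuration();
    // Pre-security Hadoop trusts this value; format is "user,group1,group2,...".
    conf.set("hadoop.job.ugi", "hadoop,supergroup");

    FileSystem fs = FileSystem.get(conf);
    long used = fs.getUsed();   // should no longer be blocked by the superuser check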

- Aaron

On Wed, Apr 8, 2009 at 7:04 AM, Brian Bockelman wrote:

> Hey Stas,
>
> What we do locally is apply the latest patch for this issue:
> https://issues.apache.org/jira/browse/HADOOP-4368
>
> This makes getUsed (actually, it switches to FileSystem.getStatus) not a
> privileged action.
>
> As far as specifying the user ... gee, I can't think of it off the top of
> my head.  It's a variable you can insert into the JobConf, but I'd have to
> poke around google or the code to remember which one (I try to not override
> it if possible).
>
> Brian
>
>
> On Apr 8, 2009, at 8:51 AM, Stas Oskin wrote:
>
>  Hi.
>>
>> Thanks for the explanation.
>>
>> Now for the easier part - how do I specify the user when connecting? :)
>>
>> Is it a config file level, or run-time level setting?
>>
>> Regards.
>>
>> 2009/4/8 Brian Bockelman 
>>
>>  Hey Stas,
>>>
>>> Did you try this as a privileged user?  There might be some permission
>>> errors... in most of the released versions, getUsed() is only available
>>> to
>>> the Hadoop superuser.  It may be that the exception isn't propagating
>>> correctly.
>>>
>>> Brian
>>>
>>>
>>> On Apr 8, 2009, at 3:13 AM, Stas Oskin wrote:
>>>
>>> Hi.
>>>

 I'm trying to use the API to get the overall used and free spaces.

 I tried this function getUsed(), but it always returns 0.

 Any idea?

 Thanks.


>>>
>>>
>


Chaining Multiple Map reduce jobs.

2009-04-08 Thread asif md
hi everyone,

i have to chain multiple map reduce jobs < actually 2 to 4 jobs >, each of
the jobs depends on the o/p of preceding job. In the reducer of each job I'm
doing very little < just grouping by key from the maps>. I want to give the
output of one MapReduce job to the next job without having to go to the
disk. Does anyone have any ideas on how to do this?

Thanx.


RE: Hadoop data nodes failing to start

2009-04-08 Thread Kevin Eppinger
Unfortunately not.  I don't have much leeway to experiment with this cluster.

-kevin

-Original Message-
From: jdcry...@gmail.com [mailto:jdcry...@gmail.com] On Behalf Of Jean-Daniel 
Cryans
Sent: Wednesday, April 08, 2009 8:30 AM
To: core-user@hadoop.apache.org
Subject: Re: Hadoop data nodes failing to start

Kevin,

I'm glad it worked for you.

We talked a bit about 5114 yesterday, any chance of trying 0.18 branch
on that same cluster without the socket timeout thing?

Thx,

J-D

On Wed, Apr 8, 2009 at 9:24 AM, Kevin Eppinger
 wrote:
> FYI:  Problem fixed.  It was apparently a timeout condition present in 0.18.3 
> that only popped up when the additional nodes were added.  The solution was 
> to put the following entry in hadoop-site.xml:
>
> 
> <property>
>   <name>dfs.datanode.socket.write.timeout</name>
>   <value>0</value>
> </property>
> 
>
> Thanks to 'jdcryans' and 'digarok' from IRC for the help.
>
> -kevin
>
> -Original Message-
> From: Kevin Eppinger [mailto:keppin...@adknowledge.com]
> Sent: Tuesday, April 07, 2009 1:05 PM
> To: core-user@hadoop.apache.org
> Subject: Hadoop data nodes failing to start
>
> Hello everyone-
>
> So I have a 5 node cluster that I've been running for a few weeks with no 
> problems.  Today I decided to add nodes and double its size to 10.  After 
> doing all the setup and starting the cluster, I discovered that four out of 
> the 10 nodes had failed to startup.  Specifically, the data nodes didn't 
> start.  The task trackers seemed to start fine.  Thinking it was something I 
> did incorrectly with the expansion, I then reverted back to the 5 node 
> configuration but I'm experiencing the same problem...with only 2 of 5 nodes 
> starting correctly.  Here is what I'm seeing in the hadoop-*-datanode*.log 
> files:
>
> 2009-04-07 12:35:40,628 INFO org.apache.hadoop.dfs.DataNode: Starting 
> Periodic block scanner.
> 2009-04-07 12:35:45,548 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 
> 9269 blocks got processed in 1128 msecs
> 2009-04-07 12:35:45,584 ERROR org.apache.hadoop.dfs.DataNode: 
> DatanodeRegistration(10.254.165.223:50010, storageID=DS-202528624-10.254.13
> 1.244-50010-1238604807366, infoPort=50075, ipcPort=50020):DataXceiveServer: 
> Exiting due to:java.nio.channels.ClosedSelectorException
>        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:66)
>        at sun.nio.ch.SelectorImpl.selectNow(SelectorImpl.java:88)
>        at sun.nio.ch.Util.releaseTemporarySelector(Util.java:135)
>        at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:120)
>        at 
> org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997)
>        at java.lang.Thread.run(Thread.java:619)
>
> After this the data node shuts down.  This same message is appearing on all 
> the failed nodes.  Help!
>
> -kevin
>


Re: Web ui

2009-04-08 Thread Rasit OZDAS
@Nick, I'm using Ajax very often and have previously done projects with ZK
and jQuery; I can easily say that GWT was the easiest of them.
JavaScript is only needed where the core features aren't enough. I can
safely assume that we won't need any inline JavaScript.

@Philip,
Thanks for the point. That is a better solution than I imagine,
actually, and I won't have to wait since it's a resolved issue.

-- 
M. Raşit ÖZDAŞ


RE: Example of deploying jars through DistributedCache?

2009-04-08 Thread Brian MacKay

I use addArchiveToClassPath, and it works for me.

DistributedCache.addArchiveToClassPath(new Path(path), conf);

I was curious about this block of code.  Why are you copying to tmp?

>FileSystem fs = FileSystem.get(conf);
>fs.copyFromLocalFile(new Path("aaronTest2.jar"), new
> Path("tmp/aaronTest2.jar"));

-Original Message-
From: Tom White [mailto:t...@cloudera.com] 
Sent: Wednesday, April 08, 2009 9:36 AM
To: core-user@hadoop.apache.org
Subject: Re: Example of deploying jars through DistributedCache?

Does it work if you use addArchiveToClassPath()?

Also, it may be more convenient to use GenericOptionsParser's -libjars option.

Tom

On Mon, Mar 2, 2009 at 7:42 AM, Aaron Kimball  wrote:
> Hi all,
>
> I'm stumped as to how to use the distributed cache's classpath feature. I
> have a library of Java classes I'd like to distribute to jobs and use in my
> mapper; I figured the DCache's addFileToClassPath() method was the correct
> means, given the example at
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html.
>
>
> I've boiled it down to the following non-working example:
>
> in TestDriver.java:
>
>
>  private void runJob() throws IOException {
>    JobConf conf = new JobConf(getConf(), TestDriver.class);
>
>    // do standard job configuration.
>    FileInputFormat.addInputPath(conf, new Path("input"));
>    FileOutputFormat.setOutputPath(conf, new Path("output"));
>
>    conf.setMapperClass(TestMapper.class);
>    conf.setNumReduceTasks(0);
>
>    // load aaronTest2.jar into the dcache; this contains the class
> ValueProvider
>    FileSystem fs = FileSystem.get(conf);
>    fs.copyFromLocalFile(new Path("aaronTest2.jar"), new
> Path("tmp/aaronTest2.jar"));
>    DistributedCache.addFileToClassPath(new Path("tmp/aaronTest2.jar"),
> conf);
>
>    // run the job.
>    JobClient.runJob(conf);
>  }
>
>
>  and then in TestMapper:
>
>  public void map(LongWritable key, Text value,
> OutputCollector output,
>      Reporter reporter) throws IOException {
>
>    try {
>      ValueProvider vp = (ValueProvider)
> Class.forName("ValueProvider").newInstance();
>      Text val = vp.getValue();
>      output.collect(new LongWritable(1), val);
>    } catch (ClassNotFoundException e) {
>      throw new IOException("not found: " + e.toString()); // newInstance()
> throws to here.
>    } catch (Exception e) {
>      throw new IOException("Exception:" + e.toString());
>    }
>  }
>
>
> The class "ValueProvider" is to be loaded from aaronTest2.jar. I can verify
> that this code works if I put ValueProvider into the main jar I deploy. I
> can verify that aaronTest2.jar makes it into the
> ${mapred.local.dir}/taskTracker/archive/
>
> But when run with ValueProvider in aaronTest2.jar, the job fails with:
>
> $ bin/hadoop jar aaronTest1.jar TestDriver
> 09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to process
> : 10
> 09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to process
> : 10
> 09/03/01 22:36:04 INFO mapred.JobClient: Running job: job_200903012210_0005
> 09/03/01 22:36:05 INFO mapred.JobClient:  map 0% reduce 0%
> 09/03/01 22:36:14 INFO mapred.JobClient: Task Id :
> attempt_200903012210_0005_m_00_0, Status : FAILED
> java.io.IOException: not found: java.lang.ClassNotFoundException:
> ValueProvider
>    at TestMapper.map(Unknown Source)
>    at TestMapper.map(Unknown Source)
>    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
>    at
> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
>
>
> Do I need to do something else (maybe in Mapper.configure()?) to actually
> classload the jar? The documentation makes me believe it should already be
> in the classpath by doing only what I've done above. I'm on Hadoop 0.18.3.
>
> Thanks,
> - Aaron
>

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

The information transmitted is intended only for the person or entity to 
which it is addressed and may contain confidential and/or privileged 
material. Any review, retransmission, dissemination or other use of, or 
taking of any action in reliance upon, this information by persons or 
entities other than the intended recipient is prohibited. If you received 
this message in error, please contact the sender and delete the material 
from any computer.




run the pipes wordcount example with nopipe

2009-04-08 Thread Jianmin Woo
Hi, 

After several days of investigation, the wordcount-nopipe example in the
hadoop-0.19.1 package can finally be run. However, I made some changes and am
not sure whether this is the proper/correct way. Could anyone please help verify?

1. Start the job with the -inputformat argument set to
"org.apache.hadoop.mapred.pipes.WordCountInputFormat".
2. Since the C++ RecordReader/Writer is used, no Java-based RecordReader/Writer
should be needed. However, I got an error like "RecordReader defined while not
needed" from pipes. After checking org.apache.hadoop.mapred.pipes.Submitter.java,
I found this code snippet:

if (results.hasOption("-inputformat")) {
  setIsJavaRecordReader(job, true);
  job.setInputFormat(getClass(results, "-inputformat", job,
                              InputFormat.class));
}

So it seems that with -inputformat specified, the Java RecordReader will be
enabled, which caused the error above. I then commented out the line
"setIsJavaRecordReader(job, true);", as sketched below, and the example runs.
Is this the proper way to make the wordcount-nopipe example work? I see from
the code that there is a commented-out line "//cli.addArgument("javareader",
false, "is the RecordReader in Java");". Should this line be uncommented to
support disabling the Java RecordReader through a command-line option?

3. It seems that wordcount-nopipe only works for input/output with a local
URI (file:///home/...). Is this true?
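
For concreteness, the local change described in point 2 above would make the
Submitter snippet look roughly like this (a sketch of a local workaround, not
an official fix):

    if (results.hasOption("-inputformat")) {
      // setIsJavaRecordReader(job, true);   // commented out so the C++ RecordReader is used
      job.setInputFormat(getClass(results, "-inputformat", job,
                                  InputFormat.class));
    }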

Thanks,
Jianmin


  

Re: Getting free and used space

2009-04-08 Thread Brian Bockelman

Hey Stas,

What we do locally is apply the latest patch for this issue: 
https://issues.apache.org/jira/browse/HADOOP-4368

This makes getUsed (actually, it switches to FileSystem.getStatus) not  
a privileged action.
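
With that patch applied, the unprivileged call would look roughly like the
sketch below (method names follow the HADOOP-4368 patch, so treat them as
version-dependent):

    FileSystem fs = FileSystem.get(new Configuration());

    // FsStatus exposes capacity/used/remaining without requiring superuser rights.
    FsStatus status = fs.getStatus();
    long capacity  = status.getCapacity();
    long used      = status.getUsed();
    long remaining = status.getRemaining();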


As far as specifying the user ... gee, I can't think of it off the top  
of my head.  It's a variable you can insert into the JobConf, but I'd  
have to poke around google or the code to remember which one (I try to  
not override it if possible).


Brian

On Apr 8, 2009, at 8:51 AM, Stas Oskin wrote:


Hi.

Thanks for the explanation.

Now for the easier part - how do I specify the user when  
connecting? :)


Is it a config file level, or run-time level setting?

Regards.

2009/4/8 Brian Bockelman 


Hey Stas,

Did you try this as a privileged user?  There might be some  
permission
errors... in most of the released versions, getUsed() is only  
available to

the Hadoop superuser.  It may be that the exception isn't propagating
correctly.

Brian


On Apr 8, 2009, at 3:13 AM, Stas Oskin wrote:

Hi.


I'm trying to use the API to get the overall used and free spaces.

I tried this function getUsed(), but it always returns 0.

Any idea?

Thanks.








Re: Getting free and used space

2009-04-08 Thread Stas Oskin
Hi.

Thanks for the explanation.

Now for the easier part - how do I specify the user when connecting? :)

Is it a config file level, or run-time level setting?

Regards.

2009/4/8 Brian Bockelman 

> Hey Stas,
>
> Did you try this as a privileged user?  There might be some permission
> errors... in most of the released versions, getUsed() is only available to
> the Hadoop superuser.  It may be that the exception isn't propagating
> correctly.
>
> Brian
>
>
> On Apr 8, 2009, at 3:13 AM, Stas Oskin wrote:
>
>  Hi.
>>
>> I'm trying to use the API to get the overall used and free spaces.
>>
>> I tried this function getUsed(), but it always returns 0.
>>
>> Any idea?
>>
>> Thanks.
>>
>
>


Re: using cascading fro map-reduce

2009-04-08 Thread Erik Holstad
Hi!
If you are interested in Cascading I recommend you to ask on the Cascading
mailing list or come ask in the irc channel.
The mailing list can be found at the bottom left corner of www.cascading.org
.

Regards Erik


Re: Example of deploying jars through DistributedCache?

2009-04-08 Thread Tom White
Does it work if you use addArchiveToClassPath()?

Also, it may be more convenient to use GenericOptionsParser's -libjars option.
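
For reference, a hedged sketch of the -libjars route: the driver has to go
through ToolRunner so that GenericOptionsParser actually consumes the option
(the class names below are the ones from the example in this thread; the rest
is a placeholder):

    public class TestDriver extends Configured implements Tool {
      public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), TestDriver.class);
        FileInputFormat.addInputPath(conf, new Path("input"));
        FileOutputFormat.setOutputPath(conf, new Path("output"));
        conf.setMapperClass(TestMapper.class);   // TestMapper uses classes from the extra jar
        conf.setNumReduceTasks(0);
        JobClient.runJob(conf);
        return 0;
      }

      public static void main(String[] args) throws Exception {
        // ToolRunner runs GenericOptionsParser first, so "-libjars aaronTest2.jar"
        // is stripped from args and shipped to the task classpath automatically.
        System.exit(ToolRunner.run(new TestDriver(), args));
      }
    }

Invocation would then be something like: hadoop jar aaronTest1.jar TestDriver -libjars aaronTest2.jar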

Tom

On Mon, Mar 2, 2009 at 7:42 AM, Aaron Kimball  wrote:
> Hi all,
>
> I'm stumped as to how to use the distributed cache's classpath feature. I
> have a library of Java classes I'd like to distribute to jobs and use in my
> mapper; I figured the DCache's addFileToClassPath() method was the correct
> means, given the example at
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html.
>
>
> I've boiled it down to the following non-working example:
>
> in TestDriver.java:
>
>
>  private void runJob() throws IOException {
>    JobConf conf = new JobConf(getConf(), TestDriver.class);
>
>    // do standard job configuration.
>    FileInputFormat.addInputPath(conf, new Path("input"));
>    FileOutputFormat.setOutputPath(conf, new Path("output"));
>
>    conf.setMapperClass(TestMapper.class);
>    conf.setNumReduceTasks(0);
>
>    // load aaronTest2.jar into the dcache; this contains the class
> ValueProvider
>    FileSystem fs = FileSystem.get(conf);
>    fs.copyFromLocalFile(new Path("aaronTest2.jar"), new
> Path("tmp/aaronTest2.jar"));
>    DistributedCache.addFileToClassPath(new Path("tmp/aaronTest2.jar"),
> conf);
>
>    // run the job.
>    JobClient.runJob(conf);
>  }
>
>
>  and then in TestMapper:
>
>  public void map(LongWritable key, Text value,
> OutputCollector output,
>      Reporter reporter) throws IOException {
>
>    try {
>      ValueProvider vp = (ValueProvider)
> Class.forName("ValueProvider").newInstance();
>      Text val = vp.getValue();
>      output.collect(new LongWritable(1), val);
>    } catch (ClassNotFoundException e) {
>      throw new IOException("not found: " + e.toString()); // newInstance()
> throws to here.
>    } catch (Exception e) {
>      throw new IOException("Exception:" + e.toString());
>    }
>  }
>
>
> The class "ValueProvider" is to be loaded from aaronTest2.jar. I can verify
> that this code works if I put ValueProvider into the main jar I deploy. I
> can verify that aaronTest2.jar makes it into the
> ${mapred.local.dir}/taskTracker/archive/
>
> But when run with ValueProvider in aaronTest2.jar, the job fails with:
>
> $ bin/hadoop jar aaronTest1.jar TestDriver
> 09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to process
> : 10
> 09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to process
> : 10
> 09/03/01 22:36:04 INFO mapred.JobClient: Running job: job_200903012210_0005
> 09/03/01 22:36:05 INFO mapred.JobClient:  map 0% reduce 0%
> 09/03/01 22:36:14 INFO mapred.JobClient: Task Id :
> attempt_200903012210_0005_m_00_0, Status : FAILED
> java.io.IOException: not found: java.lang.ClassNotFoundException:
> ValueProvider
>    at TestMapper.map(Unknown Source)
>    at TestMapper.map(Unknown Source)
>    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
>    at
> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
>
>
> Do I need to do something else (maybe in Mapper.configure()?) to actually
> classload the jar? The documentation makes me believe it should already be
> in the classpath by doing only what I've done above. I'm on Hadoop 0.18.3.
>
> Thanks,
> - Aaron
>


Re: Hadoop data nodes failing to start

2009-04-08 Thread Jean-Daniel Cryans
Kevin,

I'm glad it worked for you.

We talked a bit about 5114 yesterday, any chance of trying 0.18 branch
on that same cluster without the socket timeout thing?

Thx,

J-D

On Wed, Apr 8, 2009 at 9:24 AM, Kevin Eppinger
 wrote:
> FYI:  Problem fixed.  It was apparently a timeout condition present in 0.18.3 
> that only popped up when the additional nodes were added.  The solution was 
> to put the following entry in hadoop-site.xml:
>
> 
> <property>
>   <name>dfs.datanode.socket.write.timeout</name>
>   <value>0</value>
> </property>
> 
>
> Thanks to 'jdcryans' and 'digarok' from IRC for the help.
>
> -kevin
>
> -Original Message-
> From: Kevin Eppinger [mailto:keppin...@adknowledge.com]
> Sent: Tuesday, April 07, 2009 1:05 PM
> To: core-user@hadoop.apache.org
> Subject: Hadoop data nodes failing to start
>
> Hello everyone-
>
> So I have a 5 node cluster that I've been running for a few weeks with no 
> problems.  Today I decided to add nodes and double its size to 10.  After 
> doing all the setup and starting the cluster, I discovered that four out of 
> the 10 nodes had failed to startup.  Specifically, the data nodes didn't 
> start.  The task trackers seemed to start fine.  Thinking it was something I 
> did incorrectly with the expansion, I then reverted back to the 5 node 
> configuration but I'm experiencing the same problem...with only 2 of 5 nodes 
> starting correctly.  Here is what I'm seeing in the hadoop-*-datanode*.log 
> files:
>
> 2009-04-07 12:35:40,628 INFO org.apache.hadoop.dfs.DataNode: Starting 
> Periodic block scanner.
> 2009-04-07 12:35:45,548 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 
> 9269 blocks got processed in 1128 msecs
> 2009-04-07 12:35:45,584 ERROR org.apache.hadoop.dfs.DataNode: 
> DatanodeRegistration(10.254.165.223:50010, storageID=DS-202528624-10.254.13
> 1.244-50010-1238604807366, infoPort=50075, ipcPort=50020):DataXceiveServer: 
> Exiting due to:java.nio.channels.ClosedSelectorException
>        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:66)
>        at sun.nio.ch.SelectorImpl.selectNow(SelectorImpl.java:88)
>        at sun.nio.ch.Util.releaseTemporarySelector(Util.java:135)
>        at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:120)
>        at 
> org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997)
>        at java.lang.Thread.run(Thread.java:619)
>
> After this the data node shuts down.  This same message is appearing on all 
> the failed nodes.  Help!
>
> -kevin
>


RE: Hadoop data nodes failing to start

2009-04-08 Thread Kevin Eppinger
FYI:  Problem fixed.  It was apparently a timeout condition present in 0.18.3 
that only popped up when the additional nodes were added.  The solution was to 
put the following entry in hadoop-site.xml:


<property>
   <name>dfs.datanode.socket.write.timeout</name>
   <value>0</value>
</property>


Thanks to 'jdcryans' and 'digarok' from IRC for the help.

-kevin

-Original Message-
From: Kevin Eppinger [mailto:keppin...@adknowledge.com] 
Sent: Tuesday, April 07, 2009 1:05 PM
To: core-user@hadoop.apache.org
Subject: Hadoop data nodes failing to start

Hello everyone-

So I have a 5 node cluster that I've been running for a few weeks with no 
problems.  Today I decided to add nodes and double its size to 10.  After doing 
all the setup and starting the cluster, I discovered that four out of the 10 
nodes had failed to startup.  Specifically, the data nodes didn't start.  The 
task trackers seemed to start fine.  Thinking it was something I did 
incorrectly with the expansion, I then reverted back to the 5 node 
configuration but I'm experiencing the same problem...with only 2 of 5 nodes 
starting correctly.  Here is what I'm seeing in the hadoop-*-datanode*.log 
files:

2009-04-07 12:35:40,628 INFO org.apache.hadoop.dfs.DataNode: Starting Periodic 
block scanner.
2009-04-07 12:35:45,548 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 
9269 blocks got processed in 1128 msecs
2009-04-07 12:35:45,584 ERROR org.apache.hadoop.dfs.DataNode: 
DatanodeRegistration(10.254.165.223:50010, storageID=DS-202528624-10.254.13
1.244-50010-1238604807366, infoPort=50075, ipcPort=50020):DataXceiveServer: 
Exiting due to:java.nio.channels.ClosedSelectorException
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:66)
at sun.nio.ch.SelectorImpl.selectNow(SelectorImpl.java:88)
at sun.nio.ch.Util.releaseTemporarySelector(Util.java:135)
at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:120)
at 
org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997)
at java.lang.Thread.run(Thread.java:619)

After this the data node shuts down.  This same message is appearing on all the 
failed nodes.  Help!

-kevin


using cascading

2009-04-08 Thread vishal s. ghawate
hi,
I am new to the Cascading concept. Can anybody help me with how to run the 
Cascading example?
I followed the http://cascading.org link, but it is not helping me. Could 
anybody please explain the steps needed to run the Cascading example?
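
In case a concrete starting point helps, below is a rough sketch of the 
classic Cascading 1.x word-count flow, pieced together from the Cascading 
documentation of that era. Treat it as a sketch rather than tested code: class 
names, constructors, and the regex may differ slightly in your Cascading 
version. Build it into a jar together with the Cascading jars and run it with 
"hadoop jar".

// Rough sketch of a Cascading 1.x word count; assumes the cascading.* API
// roughly as documented for 1.0 -- adjust to match the jar you actually have.
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class WordCount {
    public static void main(String[] args) {
        String inputPath = args[0];   // HDFS directory of text files
        String outputPath = args[1];  // output directory, will be replaced

        Tap source = new Hfs(new TextLine(new Fields("line")), inputPath);
        Tap sink = new Hfs(new TextLine(), outputPath, SinkMode.REPLACE);

        Pipe assembly = new Pipe("wordcount");
        // split each line into words
        assembly = new Each(assembly, new Fields("line"),
                new RegexGenerator(new Fields("word"), "\\S+"));
        // group by word and count occurrences
        assembly = new GroupBy(assembly, new Fields("word"));
        assembly = new Every(assembly, new Count());

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, WordCount.class);

        Flow flow = new FlowConnector(properties)
                .connect("wordcount", source, sink, assembly);
        flow.complete();
    }
}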



using cascading for map-reduce

2009-04-08 Thread vishal s. ghawate
hi,
I am trying to use the Cascading API in place of plain map-reduce.
Can anybody give me a detailed description of how to run it?

I have followed the instructions given on http://www.cascading.org/ but 
couldn't get it to work, so if anybody has run it, please guide me on this.



Re: Too many fetch errors

2009-04-08 Thread xiaolin guo
Fixed the problem.
The problem was that one of the nodes could not resolve the hostname of the
other node. Even if I use IP addresses in the masters and slaves files, Hadoop
will use the node's hostname instead of the IP address ...
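
In case it helps anyone else hitting this: the usual fix is to make sure every
node can resolve every other node's hostname, for example with matching
entries in /etc/hosts on all nodes. A tiny sketch for checking resolution from
each machine (the hostnames "master" and "slave1" below are placeholders for
your own node names):

// Minimal resolution check -- run on each node; substitute the hostnames
// from your masters/slaves files for the placeholders below.
import java.net.InetAddress;

public class ResolveCheck {
    public static void main(String[] args) throws Exception {
        String[] hosts = args.length > 0 ? args : new String[] { "master", "slave1" };
        for (String host : hosts) {
            // getByName throws UnknownHostException if the name cannot be resolved
            System.out.println(host + " -> " + InetAddress.getByName(host).getHostAddress());
        }
    }
}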

On Wed, Apr 8, 2009 at 7:26 PM, xiaolin guo  wrote:

> I have checked the log and found that for each map task there are 3
> failures, which look like machine1(failed) -> machine2(failed) ->
> machine1(failed) -> machine2(succeeded). All failures are "Too many fetch
> failures". And I am sure there is no firewall between the two nodes; at
> least port 50060 can be accessed from a web browser.
>
> How can I check whether two nodes can fetch mapper outputs from one
> another?  I have no idea how reducers fetch these data ...
>
> Thanks!
>
>
> On Wed, Apr 8, 2009 at 2:21 AM, Aaron Kimball  wrote:
>
>> Xiaolin,
>>
>> Are you certain that the two nodes can fetch mapper outputs from one
>> another? If it's taking that long to complete, it might be the case that
>> what makes it "complete" is just that eventually it abandons one of your
>> two
>> nodes and runs everything on a single node where it succeeds -- defeating
>> the point, of course.
>>
>> Might there be a firewall between the two nodes that blocks the port used
>> by
>> the reducer to fetch the mapper outputs? (I think this is on 50060 by
>> default.)
>>
>> - Aaron
>>
>> On Tue, Apr 7, 2009 at 8:08 AM, xiaolin guo  wrote:
>>
>> > This simple map-reduce application takes nearly 1 hour to finish running
>> > on the two-node cluster, due to lots of Failed/Killed task attempts, while
>> > in the single-node cluster this application only takes 1 minute ... I am
>> > quite confused about why there are so many Failed/Killed attempts.
>> >
>> > On Tue, Apr 7, 2009 at 10:40 PM, xiaolin guo  wrote:
>> >
>> > > I am trying to set up a small Hadoop cluster; everything was OK before I
>> > > moved from a single-node cluster to a two-node cluster. I followed the article
>> > > http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
>> > > to configure the master and slaves. However, when I tried to run the example
>> > > wordcount map-reduce application, the reduce task got stuck at 19% for a
>> > > long time. Then I got a notice: "INFO mapred.JobClient: TaskId :
>> > > attempt_200904072219_0001_m_02_0, Status : FAILED too many fetch
>> > > errors" and an error message: Error reading task outputslave.
>> > >
>> > > All map tasks on both task nodes had finished, which could be verified
>> > > on the task tracker pages.
>> > >
>> > > Both nodes work well in single-node mode, and the Hadoop file system
>> > > seems to be healthy in multi-node mode.
>> > >
>> > > Can anyone help me with this issue? I have been entangled in this
>> > > issue for a long time ...
>> > >
>> > > Thanks very much!
>> > >
>> >
>>
>
>


Re: Getting free and used space

2009-04-08 Thread Brian Bockelman

Hey Stas,

Did you try this as a privileged user?  There might be some permission  
errors... in most of the released versions, getUsed() is only  
available to the Hadoop superuser.  It may be that the exception isn't  
propagating correctly.
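
For reference, a minimal sketch of the call I'd expect to work when run as the  
superuser (assuming the 0.18/0.19-era FileSystem API; the class name UsedSpace  
is only for illustration):

// Sketch: print the total used space reported by the default filesystem.
// Run as the HDFS superuser -- getUsed() may return 0 or fail for other users.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class UsedSpace {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);  // uses fs.default.name from the config
        System.out.println("Used bytes: " + fs.getUsed());
        fs.close();
    }
}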


Brian

On Apr 8, 2009, at 3:13 AM, Stas Oskin wrote:


Hi.

I'm trying to use the API to get the overall used and free spaces.

I tried this function getUsed(), but it always returns 0.

Any idea?

Thanks.




Re: Too many fetch errors

2009-04-08 Thread xiaolin guo
I have checked the log and found that for each map task there are 3
failures, which look like machine1(failed) -> machine2(failed) ->
machine1(failed) -> machine2(succeeded). All failures are "Too many fetch
failures". And I am sure there is no firewall between the two nodes; at
least port 50060 can be accessed from a web browser.

How can I check whether two nodes can fetch mapper outputs from one
another?  I have no idea how reducers fetch these data ...

Thanks!

On Wed, Apr 8, 2009 at 2:21 AM, Aaron Kimball  wrote:

> Xiaolin,
>
> Are you certain that the two nodes can fetch mapper outputs from one
> another? If it's taking that long to complete, it might be the case that
> what makes it "complete" is just that eventually it abandons one of your
> two
> nodes and runs everything on a single node where it succeeds -- defeating
> the point, of course.
>
> Might there be a firewall between the two nodes that blocks the port used
> by
> the reducer to fetch the mapper outputs? (I think this is on 50060 by
> default.)
>
> - Aaron
>
> On Tue, Apr 7, 2009 at 8:08 AM, xiaolin guo  wrote:
>
> > This simple map-reduce application takes nearly 1 hour to finish running
> > on the two-node cluster, due to lots of Failed/Killed task attempts, while
> > in the single-node cluster this application only takes 1 minute ... I am
> > quite confused about why there are so many Failed/Killed attempts.
> >
> > On Tue, Apr 7, 2009 at 10:40 PM, xiaolin guo  wrote:
> >
> > > I am trying to set up a small Hadoop cluster; everything was OK before I
> > > moved from a single-node cluster to a two-node cluster. I followed the article
> > > http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
> > > to configure the master and slaves. However, when I tried to run the example
> > > wordcount map-reduce application, the reduce task got stuck at 19% for a
> > > long time. Then I got a notice: "INFO mapred.JobClient: TaskId :
> > > attempt_200904072219_0001_m_02_0, Status : FAILED too many fetch
> > > errors" and an error message: Error reading task outputslave.
> > >
> > > All map tasks on both task nodes had finished, which could be verified
> > > on the task tracker pages.
> > >
> > > Both nodes work well in single-node mode, and the Hadoop file system
> > > seems to be healthy in multi-node mode.
> > >
> > > Can anyone help me with this issue? I have been entangled in this
> > > issue for a long time ...
> > >
> > > Thanks very much!
> > >
> >
>


Getting free and used space

2009-04-08 Thread Stas Oskin
Hi.

I'm trying to use the API to get the overall used and free spaces.

I tried this function getUsed(), but it always returns 0.

Any idea?

Thanks.