Re: How often Name node fails?

2008-04-14 Thread Michael Bieniosek
From my experience, the namenode almost never crashes.  It does use a lot of
RAM, but not nearly as much as it used to.  You might have problems if you
have many, many small files (maybe 100,000s?).

I have had more problems with datanode failures, particularly when there is 
mapreduce-generated load on the machines running the datanodes.

-Michael

On 4/14/08 3:14 PM, "Cagdas Gerede" <[EMAIL PROTECTED]> wrote:

From your experience with the Hadoop Distributed File System,
how reliable is the namenode? How often does it fail? Are there mechanisms
developed outside of Hadoop to make the namenode more fault tolerant?

Thanks for your feedback,



--

Best Regards, Cagdas Evren Gerede
Home Page: http://cagdasgerede.info




Hadoop Pipes Question

2008-04-14 Thread Rui Shi
Hi,

I just started using Pipes to port my C++ application to Hadoop.

It is fine for me to run the wordcount-simple example in the 0.16.2
distribution. But when I try to run 'wordcount-nopipe', I always get the
"failed to open " error. It looks like 'context.getInputSplit()' always
returns an empty string in WordCountReader(context).

Does anyone have ideas on how to correctly run 'wordcount-nopipe' with its own 
RecordReader?

Thanks,

Rui 



  


FW: streaming + binary input/output data?

2008-04-14 Thread Runping Qi
 

Observing a few emails on this list, I think the following email
exchange between John and me may be of interest to a broader audience.

 

Runping

 

 



From: Runping Qi 
Sent: Sunday, April 13, 2008 8:58 AM
To: 'JJ'
Subject: RE: streaming + binary input/output data?

 

 

 

That is basically what I envisioned originally.

 

One issue is the data format of the streaming mapper output and the format
of the streaming reducer output.

Those data are parsed by the streaming framework into key/value pairs.
The framework assumes that the key and the value are separated by a tab
character, and that key/value pairs are separated by the newline "\n".

That means the keys and values cannot contain those two characters. If the
mapper and the reducer can encode those characters, then it will be fine.

Encoding the values with base64 will do it. Things related to keys are a
bit trickier, since the framework will apply a compare function to them in
order to do the sorting (and partitioning).

However, in most cases, it will be acceptable to avoid binary data for
keys.

 

Another issue is how to read binary input data and write binary data to DFS.

This issue can be addressed by implementing custom InputFormat and
OutputFormat classes (only the user knows how to parse a specific binary
data format).

For each input key/value pair, the streaming framework basically writes
the following to the stdin of the streaming mapper:

key.toString() + "\t" + value.toString() + "\n"

 

As long as you implement the toString methods to ensure proper base64
encoding for the value (and the key if necessary), then you will be
fine.
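
As a concrete illustration of that last point, here is a minimal sketch of
such a value type (not code from this thread; the class name is made up, and
it assumes Apache commons-codec is on the classpath):

import org.apache.commons.codec.binary.Base64;
import org.apache.hadoop.io.BytesWritable;

/**
 * Hypothetical value class for streaming jobs: it stores raw bytes, but its
 * toString() emits base64, so the tab/newline framing used by streaming can
 * never be corrupted by the payload.
 */
public class Base64BytesWritable extends BytesWritable {

    public Base64BytesWritable() {
    }

    public Base64BytesWritable(byte[] bytes) {
        super(bytes);
    }

    @Override
    public String toString() {
        // get() returns the backing array; getSize() is the valid length.
        byte[] valid = new byte[getSize()];
        System.arraycopy(get(), 0, valid, 0, getSize());
        // Base64 output never contains '\t' or '\n'.
        return new String(Base64.encodeBase64(valid));
    }
}

A custom OutputFormat (or the mapper itself) would decode the base64 again on
the way out.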

 

So, in summary, all these issues can be addressed by the user's code.

Initially, I was wondering whether the framework could be extended somehow
so that the user would only need to set some configuration variables to
handle binary data.

However, it is still unclear what that extension should look like for a
broad class of applications.

Maybe the best approach is for each user to do something like what I
outlined above to address his/her specific problem.

 

Hope this helps.

 

Runping

 

 

 



From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf
Of JJ
Sent: Sunday, April 13, 2008 8:18 AM
To: Runping Qi
Subject: Re: streaming + binary input/output data?

 

thanks for the info,
what do you think about the idea of encoding the binary data to text with
base64 before streaming it with hadoop?

John

2008/4/13, Runping Qi <[EMAIL PROTECTED]>:


No implementation/solution yet.
If there are more real use cases/user interests, then somebody may have
enough interest to provide a patch.

Runping


> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Sunday, April 13, 2008 7:30 AM
> To: Runping Qi
> Subject: RE: streaming + binary input/output data?
>
> i just read the jira. these are interesting suggestions, but how do they
> translate into a solution for my problem/question? has all or at least
> some of this been implemented or not?
>
> thx
> John
>
> Runping Qi wrote:
> >
> >
> > Actually, there is an old jira about the same issue:
> > https://issues.apache.org/jira/browse/HADOOP-1722
> >
> > Runping
> >
> >
> >> -Original Message-
> >> From: John Menzer [mailto:[EMAIL PROTECTED]
> >> Sent: Saturday, April 12, 2008 2:45 PM
> >> To: core-user@hadoop.apache.org
> >> Subject: RE: streaming + binary input/output data?
> >>
> >>
> >> so you mean you changed the hadoop streaming source code?
> >> actually i am not really willing to change the source code if it's not
> >> necessary.
> >>
> >> so i thought about simply encoding the input binary data to txt (e.g.
> >> with base64) and then adding a '\n' after each line to make it
> >> splittable for streaming.
> >> after reading from stdin my C program would just have to decode it
> >> map/reduce it and then encode it back to base64 so write to stdout.
> >>
> >> what do you think about that? worth a try?
> >>
> >>
> >>
> >> Joydeep Sen Sarma wrote:
> >> >
> >> > actually - this is possible - but changes to streaming are required.
> >> >
> >> > at one point - we had gotten rid of the '\n' and '\t' separators
> >> > between the keys and the values in the streaming code and streamed
> >> > byte arrays directly to scripts (and then decoded them in the
> >> > script). it worked perfectly fine. (in fact we were streaming thrift
> >> > generated byte streams - encoded in java land and decoded in python
> >> > land :-))
> >> >
> >> > the binary data on hdfs is best stored as sequencefiles (if u store
> >> > binary data in (what looks to hadoop as) a text file - then bad
> >> > things will happen). if stored this way - hadoop doesn't care about
> >> > newlines and tabs - those are purely artifacts of streaming.
> >> >
> >> > also - the streaming code (for unknown reasons) doesn't allow a
> >> > SequencefileInputFormat. there were minor tweaks we had to make to
> >> > the streaming driver to allow this stuff ..

RE: streaming + binary input/output data?

2008-04-14 Thread Prasan Ary
John,
   
  My meaning didn't come through. 
   
  If you encode binary data and treat it like any piece of text going through 
hadoop's default input format, at some point your binary data might have a 
piece that looks like 00001010, in hex it might be 0x0A, and in ASCII, might 
it not be interpreted as \n?
   
  Wouldn't you need to ensure that, throughout all of your binary data, you 
don't have a piece of data that might be interpreted as a \n?
   
  You may need to define your own input format for this to work.
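
  For what it's worth, a small Java check (purely illustrative, not from this
thread) shows why that worry applies to raw bytes but not to base64: the bytes
of an IEEE-754 float can happen to be 0x0A ('\n') or 0x09 ('\t'), while the
base64 alphabet contains neither.

import java.nio.ByteBuffer;

public class NewlineInFloats {
    public static void main(String[] args) {
        // Arbitrary sample values; any float whose bit pattern contains the
        // byte 0x0A would break a line-oriented text input format.
        float[] samples = {1.0000012f, 3.14159f, -2.5e-8f, 42.0f};
        for (float f : samples) {
            byte[] raw = ByteBuffer.allocate(4).putFloat(f).array();
            for (byte b : raw) {
                if (b == '\n' || b == '\t') {
                    System.out.println(f + " contains a newline or tab byte when written raw");
                }
            }
        }
        // A base64-encoded record is built only from [A-Za-z0-9+/=], so it
        // can never be mistaken for a record or field separator by streaming.
    }
}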
   
  

John Menzer <[EMAIL PROTECTED]> wrote:
  
Sure! Some equivalent should be possible. 
And like Runping already posted, there have been some ideas about
implementing binary data processing in hadoop streaming:
https://issues.apache.org/jira/browse/HADOOP-1722
However this hasn't happened yet.

That's why I am looking for a minimum-effort workaround.

Reading binary data (in my case the data are floats being processed by a
C-coded mapper) just seems to be much faster than parsing them from txt (to
float). 

I am going to implement a base64 version to find out whether it's still
faster than text-parsing.

John



Pra wrote:
> 
> John,
> 
> That's an interesting approach, but isn't it possible that an equivalent
> \n might get encoded in the binary data?
> 
> John Menzer wrote:
> 
> so you mean you changed the hadoop streaming source code?
> actually i am not really willing to change the source code if it's not
> necessary.
> 
> so i thought about simply encoding the input binary data to txt (e.g. with
> base64) and then adding a '\n' after each line to make it splittable for
> streaming.
> after reading from stdin my C program would just have to decode it
> map/reduce it and then encode it back to base64 so write to stdout.
> 
> what do you think about that? worth a try?
> 
> 
> 
> Joydeep Sen Sarma wrote:
>> 
>> actually - this is possible - but changes to streaming are required.
>> 
>> at one point - we had gotten rid of the '\n' and '\t' separators between
>> the keys and the values in the streaming code and streamed byte arrays
>> directly to scripts (and then decoded them in the script). it worked
>> perfectly fine. (in fact we were streaming thrift generated byte streams
>> -
>> encoded in java land and decoded in python land :-))
>> 
>> the binary data on hdfs is best stored as sequencefiles (if u store
>> binary
>> data in (what looks to hadoop as) a text file - then bad things will
>> happen). if stored this way - hadoop doesn't care about newlines and tabs
>> - those are purely artifacts of streaming.
>> 
>> also - the streaming code (for unknown reasons) doesn't allow a
>> SequencefileInputFormat. there were minor tweaks we had to make to the
>> streaming driver to allow this stuff ..
>> 
>> 
>> -Original Message-
>> From: Ted Dunning [mailto:[EMAIL PROTECTED]
>> Sent: Mon 4/7/2008 7:43 AM
>> To: core-user@hadoop.apache.org
>> Subject: Re: streaming + binary input/output data?
>> 
>> 
>> I don't think that binary input works with streaming because of the
>> assumption of one record per line.
>> 
>> If you want to script map-reduce programs, would you be open to a Groovy
>> implementation that avoids these problems?
>> 
>> 
>> On 4/7/08 6:42 AM, "John Menzer" wrote:
>> 
>>> 
>>> hi,
>>> 
>>> i would like to use binary input and output data in combination with
>>> hadoop
>>> streaming.
>>> 
>>> the reason why i want to use binary data is, that parsing text to float
>>> seems to consume a big lot of time compared to directly reading the
>>> binary
>>> floats.
>>> 
>>> i am using a C-coded mapper (getting streaming data from stdin and
>>> writing
>>> to stdout) and no reducer.
>>> 
>>> so my question is: how do i implement binary input output in this
>>> context?
>>> as far as i understand i need to put an '\n' char at the end of each
>>> binary-'line'. so hadoop knows how to split/distribute the input data
>>> among
>>> the nodes and how to collect it for output(??)
>>> 
>>> is this approach reasonable?
>>> 
>>> thanks,
>>> john
>> 
>> 
>> 
>> 
> 
> -- 
> View this message in context:
> http://www.nabble.com/streaming-%2B-binary-input-output-data--tp16537427p16656661.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> 
> 
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/streaming-%2B-binary-input-output-data--tp16537427p16691343.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.





RE: streaming + binary input/output data?

2008-04-14 Thread John Menzer

Sure! Some equivalent should be possible. 
And like Runping already posted, there have been some ideas about
implementing binary data processing in hadoop streaming:
https://issues.apache.org/jira/browse/HADOOP-1722
However this hasn't happened yet.

That's why I am looking for a minimum-effort workaround.

Reading binary data (in my case the data are floats being processed by a
C-coded mapper) just seems to be much faster than parsing them from txt (to
float). 

I am going to implement a base64 version to find out whether it's still
faster than text-parsing.

John



Pra wrote:
> 
> John,
>
>   That's an interesting approach, but isn't it possible that an equivalent
> \n might get encoded in the binary data?
> 
> John Menzer <[EMAIL PROTECTED]> wrote:
>   
> so you mean you changed the hadoop streaming source code?
> actually i am not really willing to change the source code if it's not
> necessary.
> 
> so i thought about simply encoding the input binary data to txt (e.g. with
> base64) and then adding a '\n' after each line to make it splittable for
> streaming.
> after reading from stdin my C program would just have to decode it
> map/reduce it and then encode it back to base64 so write to stdout.
> 
> what do you think about that? worth a try?
> 
> 
> 
> Joydeep Sen Sarma wrote:
>> 
>> actually - this is possible - but changes to streaming are required.
>> 
>> at one point - we had gotten rid of the '\n' and '\t' separators between
>> the keys and the values in the streaming code and streamed byte arrays
>> directly to scripts (and then decoded them in the script). it worked
>> perfectly fine. (in fact we were streaming thrift generated byte streams
>> -
>> encoded in java land and decoded in python land :-))
>> 
>> the binary data on hdfs is best stored as sequencefiles (if u store
>> binary
>> data in (what looks to hadoop as) a text file - then bad things will
>> happen). if stored this way - hadoop doesn't care about newlines and tabs
>> - those are purely artifacts of streaming.
>> 
>> also - the streaming code (for unknown reasons) doesn't allow a
>> SequencefileInputFormat. there were minor tweaks we had to make to the
>> streaming driver to allow this stuff ..
>> 
>> 
>> -Original Message-
>> From: Ted Dunning [mailto:[EMAIL PROTECTED]
>> Sent: Mon 4/7/2008 7:43 AM
>> To: core-user@hadoop.apache.org
>> Subject: Re: streaming + binary input/output data?
>> 
>> 
>> I don't think that binary input works with streaming because of the
>> assumption of one record per line.
>> 
>> If you want to script map-reduce programs, would you be open to a Groovy
>> implementation that avoids these problems?
>> 
>> 
>> On 4/7/08 6:42 AM, "John Menzer" wrote:
>> 
>>> 
>>> hi,
>>> 
>>> i would like to use binary input and output data in combination with
>>> hadoop
>>> streaming.
>>> 
>>> the reason why i want to use binary data is, that parsing text to float
>>> seems to consume a big lot of time compared to directly reading the
>>> binary
>>> floats.
>>> 
>>> i am using a C-coded mapper (getting streaming data from stdin and
>>> writing
>>> to stdout) and no reducer.
>>> 
>>> so my question is: how do i implement binary input output in this
>>> context?
>>> as far as i understand i need to put an '\n' char at the end of each
>>> binary-'line'. so hadoop knows how to split/distribute the input data
>>> among
>>> the nodes and how to collect it for output(??)
>>> 
>>> is this approach reasonable?
>>> 
>>> thanks,
>>> john
>> 
>> 
>> 
>> 
> 
> -- 
> View this message in context:
> http://www.nabble.com/streaming-%2B-binary-input-output-data--tp16537427p16656661.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> 
> 
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/streaming-%2B-binary-input-output-data--tp16537427p16691343.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



RE: streaming + binary input/output data?

2008-04-14 Thread Prasan Ary
John,
   
  That's an interesting approach, but isn't it possible that an equivalent \n 
might get encoded in the binary data?

John Menzer <[EMAIL PROTECTED]> wrote:
  
so you mean you changed the hadoop streaming source code?
actually i am not really willing to change the source code if it's not
necessary.

so i thought about simply encoding the input binary data to txt (e.g. with
base64) and then adding a '\n' after each line to make it splittable for
streaming.
after reading from stdin my C program would just have to decode it
map/reduce it and then encode it back to base64 so write to stdout.

what do you think about that? worth a try?



Joydeep Sen Sarma wrote:
> 
> actually - this is possible - but changes to streaming are required.
> 
> at one point - we had gotten rid of the '\n' and '\t' separators between
> the keys and the values in the streaming code and streamed byte arrays
> directly to scripts (and then decoded them in the script). it worked
> perfectly fine. (in fact we were streaming thrift generated byte streams -
> encoded in java land and decoded in python land :-))
> 
> the binary data on hdfs is best stored as sequencefiles (if u store binary
> data in (what looks to hadoop as) a text file - then bad things will
> happen). if stored this way - hadoop doesn't care about newlines and tabs
> - those are purely artifacts of streaming.
> 
> also - the streaming code (for unknown reasons) doesn't allow a
> SequencefileInputFormat. there were minor tweaks we had to make to the
> streaming driver to allow this stuff ..
> 
> 
> -Original Message-
> From: Ted Dunning [mailto:[EMAIL PROTECTED]
> Sent: Mon 4/7/2008 7:43 AM
> To: core-user@hadoop.apache.org
> Subject: Re: streaming + binary input/output data?
> 
> 
> I don't think that binary input works with streaming because of the
> assumption of one record per line.
> 
> If you want to script map-reduce programs, would you be open to a Groovy
> implementation that avoids these problems?
> 
> 
> On 4/7/08 6:42 AM, "John Menzer" wrote:
> 
>> 
>> hi,
>> 
>> i would like to use binary input and output data in combination with
>> hadoop
>> streaming.
>> 
>> the reason why i want to use binary data is, that parsing text to float
>> seems to consume a big lot of time compared to directly reading the
>> binary
>> floats.
>> 
>> i am using a C-coded mapper (getting streaming data from stdin and
>> writing
>> to stdout) and no reducer.
>> 
>> so my question is: how do i implement binary input output in this
>> context?
>> as far as i understand i need to put an '\n' char at the end of each
>> binary-'line'. so hadoop knows how to split/distribute the input data
>> among
>> the nodes and how to collect it for output(??)
>> 
>> is this approach reasonable?
>> 
>> thanks,
>> john
> 
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/streaming-%2B-binary-input-output-data--tp16537427p16656661.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.





Increasing Heap Size libhdfs

2008-04-14 Thread Anisio Mendes Lacerda
Hi,

I would like to know how I can increase the heap size of the JVM, given
that I am using libhdfs.

Thanks in advance.

-- 
[]s

Anisio Mendes Lacerda


Re: Reduce Output

2008-04-14 Thread Ted Dunning


Try using Text, Text as the output type and use something like a
StringBuffer or Formatter to construct a tab-separated list.
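
A rough sketch of that idea (assuming the old JobConf-based mapred API of this
era; the class name is made up, not from this thread):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

/**
 * Emits the key and a space-separated list of its values as Text,
 * instead of a single summed IntWritable.
 */
public class ListValuesReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, Text> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        StringBuilder list = new StringBuilder();
        while (values.hasNext()) {
            if (list.length() > 0) {
                list.append(' ');          // separator between values
            }
            list.append(values.next().get());
        }
        output.collect(key, new Text(list.toString()));
    }
}

The job configuration would then also need setOutputValueClass(Text.class) to
match the new value type.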


On 4/14/08 11:13 AM, "Natarajan, Senthil" <[EMAIL PROTECTED]> wrote:

> Could you please let me know or point out how to store the output of reduce in
> this format
> K1  v1 v2
> K2v1 v2 v3 v4
> K3  v1
> K4  v1 v2
> 
> Right now I am getting this format
> K1  v1v2
> K2v1v2v3v4
> K3  v1
> K4  v1v2
> 
> Here is the Reduce class, what needs to be changed here?
> 
> public static class Reduce extends MapReduceBase implements Reducer<Text,
> IntWritable, Text, IntWritable> {
>   public void reduce(Text key, Iterator<IntWritable> values,
> OutputCollector<Text, IntWritable> output, Reporter reporter) throws
> IOException {
>  int sum = 0;
> while (values.hasNext()) {
>   sum += values.next().get() ;
> }
> output.collect(key, new IntWritable(sum));
>   }
> 
> }
> 
> 
> 
> -Original Message-
> From: Ted Dunning [mailto:[EMAIL PROTECTED]
> Sent: Monday, April 14, 2008 1:49 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Reduce Output
> 
> 
> The format of the reduce output is the responsibility of the reducer.  You
> can store the output any way you like.
> 
> 
> On 4/14/08 10:17 AM, "Natarajan, Senthil" <[EMAIL PROTECTED]> wrote:
> 
>> Thanks Ted.
>> 
>> Actually I was trying to do the third option by myself before posting this
>> question.
>> Problem is I couldn't get the Reduce output like this
>> 
>> 1.0.2.92206 475
>> 1.0.2.9 316 475 847
>> 
>> If the values separated by space or something so that I can use sequential
>> script to iterate.
>> 
>> But the problem is the values are like this in the reduce output
>> 1.0.2.92206475
>> 1.0.2.9 316475847
>> 
>> So do you know any class or method that I can use to have the values
>> separated
>> by space or any other separator.
>> 
>> Thanks,
>> Senthil
>> 
>> -Original Message-
>> From: Ted Dunning [mailto:[EMAIL PROTECTED]
>> Sent: Monday, April 14, 2008 12:47 PM
>> To: core-user@hadoop.apache.org
>> Subject: Re: Reduce Output
>> 
>> 
>> Write an additional map-reduce step to join the data items together by
>> treating different input files differently.
>> 
>> OR
>> 
>> Write an additional map-reduce step that reads in your string values in the
>> map configuration method and keeps them in memory for looking up as you pass
>> over the output of your previous reduce step.  You won't need a reducer for
>> this approach, but your conversion table will have to fit into memory.
>> 
>> OR
>> 
>> Write a sequential script to read your string values and iterate over the
>> reduce output using conventional methods.  This works very well if you can
>> process your data in less time than hadoop takes to start your job.
>> 
>> 
>> 
>> 
>> On 4/14/08 9:42 AM, "Natarajan, Senthil" <[EMAIL PROTECTED]> wrote:
>> 
>>> Hi,
>>> 
>>> I have the reduce output like this.
>>> 
>>> 1.0.2.92206475
>>> 
>>> 1.0.2.9   316475847
>>> 
>>> 1.0.3.933846495
>>> 
>>> 1.0.4.93316975
>>> 
>>> 
>>> 
>>> But I want to display like this...
>>> 
>>> 1.0.2.92206 475
>>> 
>>> 1.0.2.9   316 475 847
>>> 
>>> 1.0.3.93384 6495
>>> 
>>> 1.0.4.93316 975
>>> 
>>> 
>>> 
>>> And each value has description associated with it something like this
>>> 
>>> 
>>> 
>>> 206 ->TextDesp206
>>> 
>>> 475 ->TextDesp475
>>> 
>>> 316 ->TextDesp316
>>> 
>>> 847 ->TextDesp847
>>> 
>>> 
>>> 
>>> So eventually I would like to see my output look like this
>>> 
>>> 
>>> 
>>> 1.0.2.92TextDesp206 -> TextDesp475
>>> 1.0.2.9   TextDesp316 -> TextDesp475 -> TextDesp847
>>> 
>>> How to do this, I tried different ways, but no luck.
>>> 
>>> public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
>>> 
>>>   public void reduce(Text key, Iterator values,
>>> OutputCollector output, Reporter reporter) throws
>>> IOException {
>>> 
>>>  Text word = new Text();
>>> 
>>> String sum = "";
>>> 
>>> while (values.hasNext()) {
>>> 
>>>sum += values.next().get() + " ";
>>> 
>>> }
>>> 
>>> //output.collect(key, new IntWritable(Integer.parseInt(sum)));
>>> 
>>> word.set(sum);
>>> 
>>> output.collect(word, new
>>> IntWritable(Integer.parseInt(key.toString())));
>>> 
>>>   }
>>> 
>>> 
>>> 
>>> }
>>> 
>>> 
>>> 
>>> Is there any way to use Reducer and OutputCollector or any other classes to
>>> output like this
>>> 
>>> 
>>> 
>>> 1.0.2.92TextDesp206 -> TextDesp475
>>> 
>>> 1.0.2.9   TextDesp316 -> TextDesp475 -> TextDesp847
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Thanks,
>>> Senthil
>> 
> 



RE: Reduce Output

2008-04-14 Thread Natarajan, Senthil
Could you please let me know or point out how to store the output of reduce in 
this format
K1  v1 v2
K2  v1 v2 v3 v4
K3  v1
K4  v1 v2

Right now I am getting this format
K1  v1v2
K2  v1v2v3v4
K3  v1
K4  v1v2

Here is the Reduce class, what needs to be changed here?

public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values, 
OutputCollector<Text, IntWritable> output, Reporter reporter) throws 
IOException {
 int sum = 0;
while (values.hasNext()) {
  sum += values.next().get() ;
}
output.collect(key, new IntWritable(sum));
  }

}



-Original Message-
From: Ted Dunning [mailto:[EMAIL PROTECTED]
Sent: Monday, April 14, 2008 1:49 PM
To: core-user@hadoop.apache.org
Subject: Re: Reduce Output


The format of the reduce output is the responsibility of the reducer.  You
can store the output any way you like.


On 4/14/08 10:17 AM, "Natarajan, Senthil" <[EMAIL PROTECTED]> wrote:

> Thanks Ted.
>
> Actually I was trying to do the third option by myself before posting this
> question.
> Problem is I couldn't get the Reduce output like this
>
> 1.0.2.92206 475
> 1.0.2.9 316 475 847
>
> If the values separated by space or something so that I can use sequential
> script to iterate.
>
> But the problem is the values are like this in the reduce output
> 1.0.2.92206475
> 1.0.2.9 316475847
>
> So do you know any class or method that I can use to have the values separated
> by space or any other separator.
>
> Thanks,
> Senthil
>
> -Original Message-
> From: Ted Dunning [mailto:[EMAIL PROTECTED]
> Sent: Monday, April 14, 2008 12:47 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Reduce Output
>
>
> Write an additional map-reduce step to join the data items together by
> treating different input files differently.
>
> OR
>
> Write an additional map-reduce step that reads in your string values in the
> map configuration method and keeps them in memory for looking up as you pass
> over the output of your previous reduce step.  You won't need a reducer for
> this approach, but your conversion table will have to fit into memory.
>
> OR
>
> Write a sequential script to read your string values and iterate over the
> reduce output using conventional methods.  This works very well if you can
> process your data in less time than hadoop takes to start your job.
>
>
>
>
> On 4/14/08 9:42 AM, "Natarajan, Senthil" <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> I have the reduce output like this.
>>
>> 1.0.2.92206475
>>
>> 1.0.2.9   316475847
>>
>> 1.0.3.933846495
>>
>> 1.0.4.93316975
>>
>>
>>
>> But I want to display like this...
>>
>> 1.0.2.92206 475
>>
>> 1.0.2.9   316 475 847
>>
>> 1.0.3.93384 6495
>>
>> 1.0.4.93316 975
>>
>>
>>
>> And each value has description associated with it something like this
>>
>>
>>
>> 206 ->TextDesp206
>>
>> 475 ->TextDesp475
>>
>> 316 ->TextDesp316
>>
>> 847 ->TextDesp847
>>
>>
>>
>> So eventually I would like to see my output look like this
>>
>>
>>
>> 1.0.2.92TextDesp206 -> TextDesp475
>> 1.0.2.9   TextDesp316 -> TextDesp475 -> TextDesp847
>>
>> How to do this, I tried different ways, but no luck.
>>
>> public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
>>
>>   public void reduce(Text key, Iterator values,
>> OutputCollector output, Reporter reporter) throws
>> IOException {
>>
>>  Text word = new Text();
>>
>> String sum = "";
>>
>> while (values.hasNext()) {
>>
>>sum += values.next().get() + " ";
>>
>> }
>>
>> //output.collect(key, new IntWritable(Integer.parseInt(sum)));
>>
>> word.set(sum);
>>
>> output.collect(word, new
>> IntWritable(Integer.parseInt(key.toString())));
>>
>>   }
>>
>>
>>
>> }
>>
>>
>>
>> Is there any way to use Reducer and OutputCollector or any other classes to
>> output like this
>>
>>
>>
>> 1.0.2.92TextDesp206 -> TextDesp475
>>
>> 1.0.2.9   TextDesp316 -> TextDesp475 -> TextDesp847
>>
>>
>>
>>
>>
>> Thanks,
>> Senthil
>



Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

2008-04-14 Thread Ted Dunning

You don't really need a custom input format, I don't think.

You should be able to just add multiple inputs, one at a time after
filtering them outside hadoop.
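
A rough sketch of that "filter outside Hadoop, then add each matching path"
approach (the class, method, and prefix are illustrative; the FileSystem and
JobConf calls are from memory of the 0.16/0.17-era API and may need adjusting
for your release):

import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class FilteredInputs {
    /** Adds only the files under inputDir whose names start with prefix. */
    public static void addMatchingInputs(JobConf conf, Path inputDir,
                                         String prefix) throws IOException {
        FileSystem fs = inputDir.getFileSystem(conf);
        for (FileStatus stat : fs.listStatus(inputDir)) {
            if (!stat.isDir() && stat.getPath().getName().startsWith(prefix)) {
                conf.addInputPath(stat.getPath());   // one input path per file
            }
        }
    }
}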


On 4/14/08 10:59 AM, "Alfonso Olias Sanz" <[EMAIL PROTECTED]>
wrote:

> ok thanks for the info :)
> 
> On 11/04/2008, Arun C Murthy <[EMAIL PROTECTED]> wrote:
>> 
>>  On Apr 11, 2008, at 10:21 AM, Amar Kamat wrote:
>> 
>> 
>>> A simpler way is to use
>> FileInputFormat.setInputPathFilter(JobConf, PathFilter).
>> Look at org.apache.hadoop.fs.PathFilter for details on PathFilter interface.
>>> 
>> 
>>  +1, although FileInputFormat.setInputPathFilter is
>> available only in hadoop-0.17 and above... like Amar mentioned previously,
>> you'd have to have a custom InputFormat prior to hadoop-0.17.
>> 
>>  Arun
>> 
>> 
>> 
>>> Amar
>>> Alfonso Olias Sanz wrote:
>>> 
 Hi
 I have a general purpose input folder that it is used as input in a
 Map/Reduce task. That folder contains files grouped by names.
 
 I want to configure the JobConf in a way I can filter the files that
 have to be processed from that pass (ie  files which name starts by
 Elementary, or Source etc)  So the task function will only process
 those files.  So if the folder contains 1000 files and only 50 start
 by Elementary. Only those 50 will be processed by my task.
 
 I could set up different input folders and those containing the
 different files, but I cannot do that.
 
 
 Any idea?
 
 thanks
 
 
>>> 
>>> 
>> 
>> 



Re: Where are passed the JobConf?

2008-04-14 Thread Hairong Kuang
JobConf gets passed to a mapper in Mapper.configure(JobConf job). Check
http://hadoop.apache.org/core/docs/r0.16.1/api/org/apache/hadoop/mapred/MapReduceBase.html#configure(org.apache.hadoop.mapred.JobConf)
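
For example, a minimal sketch against the old JobConf-based mapred API (the
property name "my.example.separator" is made up for illustration):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ConfiguredMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private String separator;

    // The framework calls configure(JobConf) on each mapper instance before
    // any map() calls, so this is where per-job settings are read.
    @Override
    public void configure(JobConf job) {
        separator = job.get("my.example.separator", "\t");
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String[] parts = value.toString().split(separator, 2);
        if (parts.length == 2) {
            output.collect(new Text(parts[0]), new Text(parts[1]));
        }
    }
}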

Hairong


On 4/13/08 11:44 PM, "Steve Han" <[EMAIL PROTECTED]> wrote:

> I am reading the Map/Reduce tutorial on the official Hadoop Core site. It
> says that "Overall, Mapper implementations are passed the JobConf for the
> job via the JobConfigurable.configure(JobConf) method and override it to
> initialize themselves". Where in the code (in WordCount v1.0 or v2.0) is
> the JobConf passed to the Mapper implementation? Any idea? Thanks a lot.



Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

2008-04-14 Thread Alfonso Olias Sanz
ok thanks for the info :)

On 11/04/2008, Arun C Murthy <[EMAIL PROTECTED]> wrote:
>
>  On Apr 11, 2008, at 10:21 AM, Amar Kamat wrote:
>
>
> > A simpler way is to use
> FileInputFormat.setInputPathFilter(JobConf, PathFilter).
> Look at org.apache.hadoop.fs.PathFilter for details on PathFilter interface.
> >
>
>  +1, although FileInputFormat.setInputPathFilter is
> available only in hadoop-0.17 and above... like Amar mentioned previously,
> you'd have to have a custom InputFormat prior to hadoop-0.17.
>
>  Arun
>
>
>
> > Amar
> > Alfonso Olias Sanz wrote:
> >
> > > Hi
> > > I have a general purpose input folder that it is used as input in a
> > > Map/Reduce task. That folder contains files grouped by names.
> > >
> > > I want to configure the JobConf in a way I can filter the files that
> > > have to be processed from that pass (ie  files which name starts by
> > > Elementary, or Source etc)  So the task function will only process
> > > those files.  So if the folder contains 1000 files and only 50 start
> > > by Elementary. Only those 50 will be processed by my task.
> > >
> > > I could set up different input folders and those containing the
> > > different files, but I cannot do that.
> > >
> > >
> > > Any idea?
> > >
> > > thanks
> > >
> > >
> >
> >
>
>


Re: Reduce Output

2008-04-14 Thread Ted Dunning

The format of the reduce output is the responsibility of the reducer.  You
can store the output any way you like.


On 4/14/08 10:17 AM, "Natarajan, Senthil" <[EMAIL PROTECTED]> wrote:

> Thanks Ted.
> 
> Actually I was trying to do the third option by myself before posting this
> question.
> Problem is I couldn't get the Reduce output like this
> 
> 1.0.2.92206 475
> 1.0.2.9 316 475 847
> 
> If the values separated by space or something so that I can use sequential
> script to iterate.
> 
> But the problem is the values are like this in the reduce output
> 1.0.2.92206475
> 1.0.2.9 316475847
> 
> So do you know any class or method that I can use to have the values separated
> by space or any other separator.
> 
> Thanks,
> Senthil
> 
> -Original Message-
> From: Ted Dunning [mailto:[EMAIL PROTECTED]
> Sent: Monday, April 14, 2008 12:47 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Reduce Output
> 
> 
> Write an additional map-reduce step to join the data items together by
> treating different input files differently.
> 
> OR
> 
> Write an additional map-reduce step that reads in your string values in the
> map configuration method and keeps them in memory for looking up as you pass
> over the output of your previous reduce step.  You won't need a reducer for
> this approach, but your conversion table will have to fit into memory.
> 
> OR
> 
> Write a sequential script to read your string values and iterate over the
> reduce output using conventional methods.  This works very well if you can
> process your data in less time than hadoop takes to start your job.
> 
> 
> 
> 
> On 4/14/08 9:42 AM, "Natarajan, Senthil" <[EMAIL PROTECTED]> wrote:
> 
>> Hi,
>> 
>> I have the reduce output like this.
>> 
>> 1.0.2.92206475
>> 
>> 1.0.2.9   316475847
>> 
>> 1.0.3.933846495
>> 
>> 1.0.4.93316975
>> 
>> 
>> 
>> But I want to display like this...
>> 
>> 1.0.2.92206 475
>> 
>> 1.0.2.9   316 475 847
>> 
>> 1.0.3.93384 6495
>> 
>> 1.0.4.93316 975
>> 
>> 
>> 
>> And each value has description associated with it something like this
>> 
>> 
>> 
>> 206 ->TextDesp206
>> 
>> 475 ->TextDesp475
>> 
>> 316 ->TextDesp316
>> 
>> 847 ->TextDesp847
>> 
>> 
>> 
>> So eventually I would like to see my output look like this
>> 
>> 
>> 
>> 1.0.2.92TextDesp206 -> TextDesp475
>> 1.0.2.9   TextDesp316 -> TextDesp475 -> TextDesp847
>> 
>> How to do this, I tried different ways, but no luck.
>> 
>> public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
>> 
>>   public void reduce(Text key, Iterator values,
>> OutputCollector output, Reporter reporter) throws
>> IOException {
>> 
>>  Text word = new Text();
>> 
>> String sum = "";
>> 
>> while (values.hasNext()) {
>> 
>>sum += values.next().get() + " ";
>> 
>> }
>> 
>> //output.collect(key, new IntWritable(Integer.parseInt(sum)));
>> 
>> word.set(sum);
>> 
>> output.collect(word, new
>> IntWritable(Integer.parseInt(key.toString())));
>> 
>>   }
>> 
>> 
>> 
>> }
>> 
>> 
>> 
>> Is there any way to use Reducer and OutputCollector or any other classes to
>> output like this
>> 
>> 
>> 
>> 1.0.2.92TextDesp206 -> TextDesp475
>> 
>> 1.0.2.9   TextDesp316 -> TextDesp475 -> TextDesp847
>> 
>> 
>> 
>> 
>> 
>> Thanks,
>> Senthil
> 



RE: Reduce Output

2008-04-14 Thread Natarajan, Senthil
Thanks Ted.

Actually I was trying to do the third option by myself before posting this 
question.
Problem is I couldn't get the Reduce output like this

1.0.2.92   206 475
1.0.2.9    316 475 847

If the values were separated by a space or something, then I could use a 
sequential script to iterate.

But the problem is the values are like this in the reduce output
1.0.2.92   206475
1.0.2.9    316475847

So do you know any class or method that I can use to have the values 
separated by a space or any other separator?

Thanks,
Senthil

-Original Message-
From: Ted Dunning [mailto:[EMAIL PROTECTED]
Sent: Monday, April 14, 2008 12:47 PM
To: core-user@hadoop.apache.org
Subject: Re: Reduce Output


Write an additional map-reduce step to join the data items together by
treating different input files differently.

OR

Write an additional map-reduce step that reads in your string values in the
map configuration method and keeps them in memory for looking up as you pass
over the output of your previous reduce step.  You won't need a reducer for
this approach, but your conversion table will have to fit into memory.

OR

Write a sequential script to read your string values and iterate over the
reduce output using conventional methods.  This works very well if you can
process your data in less time than hadoop takes to start your job.




On 4/14/08 9:42 AM, "Natarajan, Senthil" <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I have the reduce output like this.
>
> 1.0.2.92206475
>
> 1.0.2.9   316475847
>
> 1.0.3.933846495
>
> 1.0.4.93316975
>
>
>
> But I want to display like this...
>
> 1.0.2.92206 475
>
> 1.0.2.9   316 475 847
>
> 1.0.3.93384 6495
>
> 1.0.4.93316 975
>
>
>
> And each value has description associated with it something like this
>
>
>
> 206 ->TextDesp206
>
> 475 ->TextDesp475
>
> 316 ->TextDesp316
>
> 847 ->TextDesp847
>
>
>
> So eventually I would like to see my output look like this
>
>
>
> 1.0.2.92TextDesp206 -> TextDesp475
> 1.0.2.9   TextDesp316 -> TextDesp475 -> TextDesp847
>
> How to do this, I tried different ways, but no luck.
>
> public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
>
>   public void reduce(Text key, Iterator values,
> OutputCollector output, Reporter reporter) throws
> IOException {
>
>  Text word = new Text();
>
> String sum = "";
>
> while (values.hasNext()) {
>
>sum += values.next().get() + " ";
>
> }
>
> //output.collect(key, new IntWritable(Integer.parseInt(sum)));
>
> word.set(sum);
>
> output.collect(word, new
> IntWritable(Integer.parseInt(key.toString())));
>
>   }
>
>
>
> }
>
>
>
> Is there any way to use Reducer and OutputCollector or any other classes to
> output like this
>
>
>
> 1.0.2.92TextDesp206 -> TextDesp475
>
> 1.0.2.9   TextDesp316 -> TextDesp475 -> TextDesp847
>
>
>
>
>
> Thanks,
> Senthil



changing master node?

2008-04-14 Thread Colin Freas
i changed the master node on my cluster because the original crashed hard.

my nodes share an nfs mounted /conf.  i changed all the ip's appropriately,
starting and stopping seems to work fine.

when i do a bin/hadoop dfs -ls i get this message repeating itself over and
over:

08/04/14 06:01:10 INFO ipc.Client: Retrying connect to server: /
10.0.2.13:54310. Already tried 1 time(s).

is there something more i need to do to reconfigure the system?  do i need
to reformat hdfs, with all the accompanying headaches or is there a simpler
solution?

-colin


Re: Reduce Output

2008-04-14 Thread Ted Dunning

Write an additional map-reduce step to join the data items together by
treating different input files differently.

OR

Write an additional map-reduce step that reads in your string values in the
map configuration method and keeps them in memory for looking up as you pass
over the output of your previous reduce step.  You won't need a reducer for
this approach, but your conversion table will have to fit into memory.

OR

Write a sequential script to read your string values and iterate over the
reduce output using conventional methods.  This works very well if you can
process your data in less time than hadoop takes to start your job.
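
A rough sketch of the second option (not from this thread; it assumes the old
JobConf-based mapred API, and the "desc.table.path" property and the file
layouts are hypothetical; the lookup table must fit in memory):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class DescriptionJoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> descriptions = new HashMap<String, String>();

    @Override
    public void configure(JobConf job) {
        // Load the small "value<TAB>description" table once per mapper.
        try {
            Path table = new Path(job.get("desc.table.path"));
            FileSystem fs = table.getFileSystem(job);
            BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(table)));
            String line;
            while ((line = in.readLine()) != null) {
                String[] kv = line.split("\t");
                if (kv.length == 2) {
                    descriptions.put(kv[0], kv[1]);
                }
            }
            in.close();
        } catch (IOException e) {
            throw new RuntimeException("could not load description table", e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Previous reduce output, assumed as "<key>\t<v1> <v2> ..." per line.
        String[] parts = value.toString().split("\t", 2);
        if (parts.length != 2) {
            return;
        }
        StringBuilder joined = new StringBuilder();
        for (String v : parts[1].split(" ")) {
            if (joined.length() > 0) {
                joined.append(" -> ");
            }
            String desc = descriptions.get(v);
            joined.append(desc != null ? desc : v);
        }
        output.collect(new Text(parts[0]), new Text(joined.toString()));
    }
}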




On 4/14/08 9:42 AM, "Natarajan, Senthil" <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> I have the reduce output like this.
> 
> 1.0.2.92206475
> 
> 1.0.2.9   316475847
> 
> 1.0.3.933846495
> 
> 1.0.4.93316975
> 
> 
> 
> But I want to display like this...
> 
> 1.0.2.92206 475
> 
> 1.0.2.9   316 475 847
> 
> 1.0.3.93384 6495
> 
> 1.0.4.93316 975
> 
> 
> 
> And each value has description associated with it something like this
> 
> 
> 
> 206 ->TextDesp206
> 
> 475 ->TextDesp475
> 
> 316 ->TextDesp316
> 
> 847 ->TextDesp847
> 
> 
> 
> So eventually I would like to see my output look like this
> 
> 
> 
> 1.0.2.92TextDesp206 -> TextDesp475
> 1.0.2.9   TextDesp316 -> TextDesp475 -> TextDesp847
> 
> How to do this, I tried different ways, but no luck.
> 
> public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
> 
>   public void reduce(Text key, Iterator values,
> OutputCollector output, Reporter reporter) throws
> IOException {
> 
>  Text word = new Text();
> 
> String sum = "";
> 
> while (values.hasNext()) {
> 
>sum += values.next().get() + " ";
> 
> }
> 
> //output.collect(key, new IntWritable(Integer.parseInt(sum)));
> 
> word.set(sum);
> 
> output.collect(word, new
> IntWritable(Integer.parseInt(key.toString())));
> 
>   }
> 
> 
> 
> }
> 
> 
> 
> Is there any way to use Reducer and OutputCollector or any other classes to
> output like this
> 
> 
> 
> 1.0.2.92TextDesp206 -> TextDesp475
> 
> 1.0.2.9   TextDesp316 -> TextDesp475 -> TextDesp847
> 
> 
> 
> 
> 
> Thanks,
> Senthil



Reduce Output

2008-04-14 Thread Natarajan, Senthil
Hi,

I have the reduce output like this.

1.0.2.92   206475

1.0.2.9    316475847

1.0.3.93   3846495

1.0.4.93   316975



But I want to display like this...

1.0.2.92   206 475

1.0.2.9    316 475 847

1.0.3.93   384 6495

1.0.4.93   316 975



And each value has a description associated with it, something like this



206 -> TextDesp206

475 -> TextDesp475

316 -> TextDesp316

847 -> TextDesp847



So eventually I would like to see my output look like this



1.0.2.92   TextDesp206 -> TextDesp475
1.0.2.9    TextDesp316 -> TextDesp475 -> TextDesp847

How to do this, I tried different ways, but no luck.

public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values, 
OutputCollector<Text, IntWritable> output, Reporter reporter) throws 
IOException {

 Text word = new Text();

String sum = "";

while (values.hasNext()) {

   sum += values.next().get() + " ";

}

//output.collect(key, new IntWritable(Integer.parseInt(sum)));

word.set(sum);

output.collect(word, new IntWritable(Integer.parseInt(key.toString())));

  }



}



Is there any way to use Reducer and OutputCollector or any other classes to 
output like this



1.0.2.92   TextDesp206 -> TextDesp475

1.0.2.9   TextDesp316 -> TextDesp475 -> TextDesp847





Thanks,
Senthil


Re: hadoop 0.16.2 hangs

2008-04-14 Thread Nigel Daley
I'm guessing this is http://issues.apache.org/jira/browse/HADOOP-3139  
which is fixed in Hadoop 0.16.3 (currently being voted on for release  
by the developer community).


Nige

On Apr 14, 2008, at 3:07 AM, Andreas Kostyrka wrote:

As another item, the submitting Java process hangs in a futex call:

[EMAIL PROTECTED]:~# strace -p 3810
Process 3810 attached - interrupt to quit
futex(0xb7d6ebd8, FUTEX_WAIT, 3832, NULL

and hangs, hangs, hangs, ...

Andreas


On Monday, 14.04.2008, 11:46 +0200, Andreas Kostyrka wrote:
Ok, a short grep in the sources suggests that the exceptions happen just
in the closeAll method of FileSystem. So no indication what hadoop is
working on :(


On Monday, 14.04.2008, 07:26 +0200, Andreas Kostyrka wrote:

Hi!

I'm getting the following hang, when trying to run a streaming command:

[EMAIL PROTECTED]:~/hadoop-0.16.2$ time bin/hadoop jar
contrib/streaming/hadoop-0.16.2-streaming.jar -mapper
'/home/hadoop/bin/llfp -f [EMAIL PROTECTED] -t [EMAIL PROTECTED] -s
heaven.kostyrka.org -d gen_dailysites -d fb_memberfind ' -reducer NONE
-input run-0003-input/* -output run-0003-output
additionalConfSpec_:null
null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [/tmp/hadoop-hadoop/hadoop-unjar60050/] []
/tmp/streamjob60051.jar tmpDir=null
08/04/14 05:22:10 INFO fs.FileSystem: FileSystem.closeAll() threw an exception:
org.apache.hadoop.io.MultipleIOException: 2 exceptions [java.io.IOException:
S3FileSystem([EMAIL PROTECTED]) and Key([EMAIL PROTECTED]://lokgad) do not
match., java.io.IOException: LocalFileSystem([EMAIL PROTECTED]) and
Key([EMAIL PROTECTED]://null) do not match.]

The exception happens only when I press Control-C after many minutes.

The cluster uses an HDFS/S3 filesystem, and concurrently to this I'm
issuing many many "relatively small" hadoop fs -put requests, copying in
data to a different folder.

The above hang happened multiple times, so I wonder what could it be?

Andreas




Re: hadoop 0.16.2 hangs

2008-04-14 Thread Andreas Kostyrka
As another item, the submitting Java process hangs in a futex call:

[EMAIL PROTECTED]:~# strace -p 3810
Process 3810 attached - interrupt to quit
futex(0xb7d6ebd8, FUTEX_WAIT, 3832, NULL

and hangs, hangs, hangs, ...

Andreas


On Monday, 14.04.2008, 11:46 +0200, Andreas Kostyrka wrote:
> Ok, a short grep in the sources suggests that the exceptions happen just
> in the closeAll method of FileSystem. So no indication what hadoop is
> working on :(
> 
> 
On Monday, 14.04.2008, 07:26 +0200, Andreas Kostyrka wrote: 
> > Hi!
> > 
> > I'm getting the following hang, when trying to run a streaming command:
> > 
> > [EMAIL PROTECTED]:~/hadoop-0.16.2$ time bin/hadoop jar 
> > contrib/streaming/hadoop-0.16.2-streaming.jar -mapper 
> > '/home/hadoop/bin/llfp -f [EMAIL PROTECTED] -t [EMAIL PROTECTED] -s 
> > heaven.kostyrka.org -d gen_dailysites -d fb_memberfind ' -reducer NONE 
> > -input run-0003-input/* -output run-0003-output
> > additionalConfSpec_:null
> > null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
> > packageJobJar: [/tmp/hadoop-hadoop/hadoop-unjar60050/] [] 
> > /tmp/streamjob60051.jar tmpDir=null
> > 08/04/14 05:22:10 INFO fs.FileSystem: FileSystem.closeAll() threw an 
> > exception:
> > org.apache.hadoop.io.MultipleIOException: 2 exceptions 
> > [java.io.IOException: S3FileSystem([EMAIL PROTECTED]) and Key([EMAIL 
> > PROTECTED]://lokgad) do not match., java.io.IOException: 
> > LocalFileSystem([EMAIL PROTECTED]) and Key([EMAIL PROTECTED]://null) do not 
> > match.]
> > 
> > 
> > The exception happens only when I press Control-C after many minutes.
> > 
> > The cluster uses an HDFS/S3 filesystem, and concurrently to this I'm
> > issuing many many "relatively small" hadoop fs -put requests, copying in
> > data to a different folder.
> > 
> > The above hang happened multiple times, so I wonder what could it be?
> > 
> > Andreas


signature.asc
Description: This is a digitally signed message part


Re: hadoop 0.16.2 hangs

2008-04-14 Thread Andreas Kostyrka
Ok, a short grep in the sources suggests that the exceptions happen just
in the closeAll method of FileSystem. So no indication what hadoop is
working on :(


On Monday, 14.04.2008, 07:26 +0200, Andreas Kostyrka wrote: 
> Hi!
> 
> I'm getting the following hang, when trying to run a streaming command:
> 
> [EMAIL PROTECTED]:~/hadoop-0.16.2$ time bin/hadoop jar 
> contrib/streaming/hadoop-0.16.2-streaming.jar -mapper '/home/hadoop/bin/llfp 
> -f [EMAIL PROTECTED] -t [EMAIL PROTECTED] -s heaven.kostyrka.org -d 
> gen_dailysites -d fb_memberfind ' -reducer NONE -input run-0003-input/* 
> -output run-0003-output
> additionalConfSpec_:null
> null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
> packageJobJar: [/tmp/hadoop-hadoop/hadoop-unjar60050/] [] 
> /tmp/streamjob60051.jar tmpDir=null
> 08/04/14 05:22:10 INFO fs.FileSystem: FileSystem.closeAll() threw an 
> exception:
> org.apache.hadoop.io.MultipleIOException: 2 exceptions [java.io.IOException: 
> S3FileSystem([EMAIL PROTECTED]) and Key([EMAIL PROTECTED]://lokgad) do not 
> match., java.io.IOException: LocalFileSystem([EMAIL PROTECTED]) and 
> Key([EMAIL PROTECTED]://null) do not match.]
> 
> 
> The exception happens only when I press Control-C after many minutes.
> 
> The cluster uses an HDFS/S3 filesystem, and concurrently to this I'm
> issuing many many "relatively small" hadoop fs -put requests, copying in
> data to a different folder.
> 
> The above hang happened multiple times, so I wonder what could it be?
> 
> Andreas


signature.asc
Description: This is a digitally signed message part


Getting a DataNode files list

2008-04-14 Thread Shimi K
Is there a way to get a list of the files on a specific DataNode in a
programmatic way?