Re: Easy Question

2010-10-05 Thread maha
Hi Neil,

   Thanks for responding. Basically formatting removes all my files; is there
a way not to? I didn't think about checking the log. Thanks,

   Maha

On Oct 4, 2010, at 10:54 PM, Neil Ghosh wrote:

 Maha,
 
 Is there any specific reason you don't want to format the name node?
 
 Did you check the log to see why the name node is not starting?
 
 Thanks
 Neil
 
 On Tue, Oct 5, 2010 at 10:54 AM, maha m...@umail.ucsb.edu wrote:
 
 Hi Folks,
 
  I'm sure this is easy for you guys, so please let me know. What's the
 solution when the NameNode doesn't start, other than formatting it?
 I also tried stop-dfs.sh and starting it again over and over, with no
 luck until I format it :(
 
  Please help and thank you,
 
 Maha
 
 
 
 
 -- 
 Thanks and Regards
 Neil
 http://neilghosh.com



Fwd: Easy Question

2010-10-05 Thread maha

 Sorry Harsh, and thanks for the advice. I'm new to Hadoop and didn't think
 of reading the logs. But you're right. Now the DataNode is not starting:
 
 2010-10-04 23:09:22,812 ERROR 
 org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: 
 Incompatible namespaceIDs in /private/tmp/hadoop-Hadoop/dfs/data: namenode 
 namespaceID = 200395975; datanode namespaceID = 1970823831
   at 
 org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233)
   at 
 org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148)
   at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298)
   at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.&lt;init&gt;(DataNode.java:216)
   at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
   at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
   at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
   at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
 
 But again, what is incompatible here? I think it's because I set
 dfs.data.dir to /user/Hadoop/hdfs/data and dfs.name.dir to
 user/Hadoop/hdfs/name in core-site.xml.
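 For reference, these two properties are usually declared like this (the paths
 are the ones quoted above; in 0.20-era releases they conventionally live in
 hdfs-site.xml rather than core-site.xml, and dfs.name.dir should normally be
 an absolute path):
 
 <property>
   <name>dfs.name.dir</name>
   <value>/user/Hadoop/hdfs/name</value>
 </property>
 <property>
   <name>dfs.data.dir</name>
   <value>/user/Hadoop/hdfs/data</value>
 </property>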
 
Maha




 
 On Oct 4, 2010, at 11:08 PM, Harsh J wrote:
 
 The logs tell you precisely what the problem is 99% of the time. Formatting
 is not the only solution. How and when does your node go down? Give the list
 some more information so we can help you better :)
 



Re: Easy Question

2010-10-05 Thread Matthew John
hi Maha,

  try the following:

go to your dfs.data.dir/current directory.

You will find a file named VERSION; modify the namespaceID in it to match the
namenode's namespaceID found in the log (from your previous post, 200395975),
then restart Hadoop
(bin/start-all.sh) ...

and see if all the daemons are up.
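
Roughly, the steps look like this (assuming dfs.data.dir is the
/private/tmp/hadoop-Hadoop/dfs/data directory shown in the log above;
substitute your own dfs.data.dir):

cd /private/tmp/hadoop-Hadoop/dfs/data/current
cat VERSION        # shows namespaceID=1970823831 (the datanode's current ID)
vi VERSION         # change that line to namespaceID=200395975 (the namenode's ID from the log)
cd $HADOOP_HOME
bin/start-all.sh
jps                # check that NameNode, DataNode, JobTracker, TaskTracker are all running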


regards,
Matthew


Help!! The problem about Hadoop

2010-10-05 Thread Jander
Hi, all
I built an application using Hadoop.
With 1 GB of text data as input, the results are as follows:
(1) on a cluster of 3 PCs: the time consumed is 1020 seconds.
(2) on a cluster of 4 PCs: the time is about 680 seconds.
But before I used Hadoop the application took about 280 seconds, so at this
rate I would need 8 PCs just to match the original speed. My question: is
this behavior expected?

Jander,
Thanks.

  


Re: Re: Help!! The problem about Hadoop

2010-10-05 Thread Jander
Hi Jeff,

Thank you very much for your reply.

I know Hadoop has overhead, but is it too large in my case?

The 1 GB text input produces about 500 map tasks because the input is composed
of many small text files. The time each map takes ranges from 8 to 20 seconds.
I use compression via conf.setCompressMapOutput(true).

Thanks,
Jander




At 2010-10-05 16:28:55,Jeff Zhang zjf...@gmail.com wrote:

Hi Jander,

Hadoop has overhead compared to a single-machine solution. How many tasks
do you get when you run your Hadoop job? And how much time does each map
and reduce task take?

There are lots of tips for performance tuning of Hadoop, such as
compression and JVM reuse.
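
For example, those two tips map to job settings roughly like this (property
names as used in 0.20-era releases; the values are placeholders):

<!-- compress intermediate map output -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<!-- reuse task JVMs; -1 means reuse without limit -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>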








-- 
Best Regards

Jeff Zhang


Re: is there no streaming.jar file in hadoop-0.21.0??

2010-10-05 Thread Alejandro Abdelnur
Edward,

Yep, you should use the one from contrib/

Alejandro

On Tue, Oct 5, 2010 at 1:55 PM, edward choi mp2...@gmail.com wrote:
 Thanks, Tom.
 I didn't expect the author of THE BOOK to answer my question. Very
 surprised and honored :-)
 Just one more question, if you don't mind.
 I read on the Internet that in order to use Hadoop Streaming in
 Hadoop-0.21.0 you should run
 $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar args
 (of course, I don't see any hadoop-streaming.jar in $HADOOP_HOME).
 But according to your reply I should run
 $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-*-streaming.jar args
 I suppose the latter is the way to go?
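
 For what it's worth, a minimal sketch of such an invocation (the input/output
 paths and the cat/wc commands are just placeholders):

 $HADOOP_HOME/bin/hadoop jar \
     $HADOOP_HOME/mapred/contrib/streaming/hadoop-*-streaming.jar \
     -input  /user/ed/input \
     -output /user/ed/output \
     -mapper  /bin/cat \
     -reducer /usr/bin/wc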

 Ed.

 2010/10/5 Tom White t...@cloudera.com

 Hi Ed,

 The directory structure moved around as a result of the project
 splitting into three subprojects (Common, HDFS, MapReduce). The
 streaming jar is in mapred/contrib/streaming in the distribution.

 Cheers,
 Tom

 On Mon, Oct 4, 2010 at 8:03 PM, edward choi mp2...@gmail.com wrote:
  Hi,
  I've recently downloaded Hadoop-0.21.0.
  After the installation, I've noticed that there is no contrib directory
  that used to exist in Hadoop-0.20.2.
  So I was wondering if there is no hadoop-0.21.0-streaming.jar file in
  Hadoop-0.21.0.
  Anyone had any luck finding it?
  If the way to use streaming has changed in Hadoop-0.21.0, then please tell me how.
  Appreciate the help, thx.
 
  Ed.
 




Re: Re: Help!! The problem about Hadoop

2010-10-05 Thread Alejandro Abdelnur
Or you could try using MultiFileInputFormat for your MR job.

http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapred/MultiFileInputFormat.html

Alejandro

On Tue, Oct 5, 2010 at 4:55 PM, Harsh J qwertyman...@gmail.com wrote:
 500 small files comprising one gigabyte? Perhaps you should try
 concatenating them all into one big file and trying again; a mapper
 should optimally run for at least a minute, and small files don't
 make good use of HDFS blocks.

 Have a read: http://www.cloudera.com/blog/2009/02/the-small-files-problem/
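
 A rough sketch of that approach (the file names and HDFS paths below are
 just placeholders):

 # concatenate the ~500 small files locally, then upload a single big file
 cat small-files/*.txt > combined-input.txt
 hadoop fs -put combined-input.txt /user/jander/input/

 # or concatenate files that are already in HDFS
 hadoop fs -cat /user/jander/small-input/* | hadoop fs -put - /user/jander/input/combined.txt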





 --
 Harsh J
 www.harshj.com



how to set different VM parameters for mappers and reducers?

2010-10-05 Thread Vitaliy Semochkin
Hello,


I have mappers that do not need much RAM, but my combiners and reducers need a lot.
Is it possible to set different VM parameters for mappers and reducers?


PS: I often face an odd problem: on the same set of data I sometimes get
java.lang.OutOfMemoryError: Java heap space in the combiner, but not every time.
What could be the cause of such behavior? My personal opinion is that I have

mapred.job.reuse.jvm.num.tasks=-1 and the JVM GC doesn't always start when
it should.

Thanks in Advance,
Vitaliy S


RE: how to set different VM parameters for mappers and reducers?

2010-10-05 Thread Michael Segel

Hi,

You don't say which version of Hadoop you are using.
Going from memory, I believe the CDH3 release from Cloudera has some specific
OPTs you can set in hadoop-env.sh.

HTH

-Mike



Re: how to set different VM parameters for mappers and reducers?

2010-10-05 Thread Jeff Zhang
You can set mapred.child.java.opts in mapred-site.xml.

BTW, the combiner can be run on both the map side and the reduce side.








-- 
Best Regards

Jeff Zhang


Re: how to set different VM parameters for mappers and reducers?

2010-10-05 Thread Alejandro Abdelnur
The following 2 properties should work:

mapred.map.child.java.opts
mapred.reduce.child.java.opts
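
For example, in mapred-site.xml (a sketch; the heap sizes are placeholders,
and these per-map/per-reduce variants may not be recognized by plain 0.20.2,
which only reads mapred.child.java.opts):

<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
<property>
  <name>mapred.reduce.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>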

Alejandro




Re: how to set different VM parameters for mappers and reducers?

2010-10-05 Thread Vitaliy Semochkin
I'm using Apache hadoop-0.20.2, the most recent version I found in the Maven
central repo.

Regards,
Vitaliy S



Set number of Reducers per machine.

2010-10-05 Thread Pramy Bhats
Hi,

I am trying to run a job on my Hadoop cluster where I consistently get a
heap space error.

I increased the heap space to 4 GB in hadoop-env.sh and rebooted the cluster.
However, I still get the heap space error.


One of the things I want to try is to reduce the number of map/reduce processes
per machine. Currently each machine can run 2 map and 2 reduce processes.


I want to configure Hadoop to run 1 map and 1 reduce per machine to give
more heap space to each process.

How can I configure the number of maps and number of reducers per node?


thanks in advance,
-- Pramod


Re: Set number of Reducers per machine.

2010-10-05 Thread Marcos Medrado Rubinelli
You can set the mapred.tasktracker.map.tasks.maximum and 
mapred.tasktracker.reduce.tasks.maximum properties in your 
mapred-site.xml file, but you may also want to check your current 
mapred.child.java.opts and mapred.child.ulimit values to make sure they 
aren't overriding the 4GB you set globally.
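
For example, a minimal mapred-site.xml sketch of that setup (the -Xmx value is
a placeholder for whatever heap you intend per task, and the tasktrackers need
a restart for the *.tasks.maximum values to take effect):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx4096m</value>
</property>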


Cheers,
Marcos



Re: Problem with DistributedCache after upgrading to CDH3b2

2010-10-05 Thread Kim Vogt
I'm experiencing the same problem. I was hoping there would be a reply to
this. Anyone? Bueller?

-Kim

On Fri, Jul 16, 2010 at 1:58 AM, Jamie Cockrill jamie.cockr...@gmail.comwrote:

 Dear All,

 We recently upgraded from CDH3b1 to b2 and ever since, all our
 mapreduce jobs that use the DistributedCache have failed. Typically,
 we add files to the cache prior to job startup, using
 addCacheFile(URI, conf) and then get them on the other side, using
 getLocalCacheFiles(conf). I believe the hadoop-core versions for these
 are 0.20.2+228 and +320 respectively.
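
 For context, the pattern described above is roughly the following sketch
 (the class names and the cached file path are hypothetical):

 import java.io.BufferedReader;
 import java.io.FileReader;
 import java.io.IOException;
 import java.net.URI;
 import org.apache.hadoop.filecache.DistributedCache;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapred.JobConf;
 import org.apache.hadoop.mapred.MapReduceBase;

 public class CacheSketch {

   // Driver side: register the file before submitting the job.
   public static void setup(JobConf conf) throws Exception {
     DistributedCache.addCacheFile(new URI("/user/jamie/lookup.txt"), conf);
   }

   // Task side: resolve the local copy in configure() and open it.
   public static class CachingMapper extends MapReduceBase {
     private BufferedReader lookup;

     public void configure(JobConf conf) {
       try {
         Path[] cached = DistributedCache.getLocalCacheFiles(conf);
         lookup = new BufferedReader(new FileReader(cached[0].toString()));
       } catch (IOException e) {
         throw new RuntimeException("could not open cached file", e);
       }
     }
   }
 }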

 We then open the files and read them in using a standard FileReader,
 using the toString on the path object as the constructor parameter,
 which has worked fine up to now. However, we're now getting
 FileNotFound exceptions when the file reader tries to open the file.

 Unfortunately the cluster is on an airgapped network, but the
 FileNotFound line comes out like:

 java.io.FileNotFoundException:

 /tmp/hadoop-hadoop/mapred/local/taskTracker/archive/master/path/to/my/file/filename.txt/filename.txt

 Note, the duplication of filename.txt is deliberate. I'm not sure if
 that's strange or not as this has previously worked absolutely fine.
 Has anyone else experienced this? Apologies if this is known, I've
 only just joined the list.

 Many thanks,

 Jamie



Re: Problem with DistributedCache after upgrading to CDH3b2

2010-10-05 Thread Jamie Cockrill
Hi Kim,

We didn't fix it in the end. I just ended up manually writing the
files to the cluster using the FileSystem class, and then reading them
back out again on the other side. Not terribly efficient, as I guess
the point of DistributedCache is that the files get distributed to
every node, whereas I'm only writing to two or three nodes, and every
map task then reads back from those two or three nodes where the data
are stored.
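
Roughly, the workaround looks like this (the class name and the HDFS/local
paths are placeholders):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SideFileSketch {

  // Driver side: push the side file into HDFS.
  public static void upload(Configuration conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path("/local/lookup.txt"),
                         new Path("/user/jamie/side-data/lookup.txt"));
  }

  // Task side: each task opens the same HDFS file directly.
  public static BufferedReader open(Configuration conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    return new BufferedReader(new InputStreamReader(
        fs.open(new Path("/user/jamie/side-data/lookup.txt"))));
  }
}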

Unfortunately I didn't have the will or inclination to investigate it
any further as I had some pretty tight deadlines to keep to and it
hasn't caused me any significant problems yet...

Thanks,

Jamie





Re: Datanode Registration DataXceiver java.io.EOFException

2010-10-05 Thread Sudhir Vallamkondu
We use Ganglia for monitoring our cluster, plus a Nagios plugin that
interfaces with the gmetad node to set up various rules around the number of
datanodes, missing/corrupted blocks, etc.

http://www.cloudera.com/blog/2009/03/hadoop-metrics/

http://exchange.nagios.org/directory/Plugins/Network-and-Systems-Management/Others/check_ganglia/details




 From: Arthur Caranta art...@caranta.com
 Date: Mon, 04 Oct 2010 15:46:19 +0200
 To: common-user@hadoop.apache.org
 Subject: Re: Datanode Registration DataXceiver java.io.EOFException
 
  On 04/10/10 15:42, Steve Loughran wrote:
 On 04/10/10 14:30, Arthur Caranta wrote:
  Damn, I found the answer to this problem, thanks to someone on the
 #hadoop IRC channel ...
 
 It was a network check I added for our monitoring: every 5 minutes the
 monitoring system connects to the datanode port to check that it is
 alive and then disconnects ...
 
 
 Why not just GET the various local pages and let your HTTP monitoring
 tools do the work?
 
 
 True ... however the TCP method was the fastest to implement and script
 with our current monitoring system, but I think I might switch
 monitoring methods.
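
 For example, a check along the lines Steve suggests might look like this
 (assuming the default datanode HTTP port 50075, i.e. dfs.datanode.http.address;
 adjust the host and port for your setup):

 # poll the datanode's embedded web server instead of its data-transfer port
 curl -s --max-time 5 -o /dev/null http://datanode-host:50075/ && echo OK || echo DOWN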






Re: conf.setCombinerClass in Map/Reduce

2010-10-05 Thread Shi Yu

Hi, thanks for the answer, Antonio.

I have found one of the main problems. It was because I used
MultipleOutputs in the Reduce class, so when I set the same class as both
the Combiner and the Reducer, the Combiner did not provide the normal data
flow to the Reducer. Therefore, the program stopped at the Combiner and no
Reducer actually ran. To solve this, I have to use both outputs:


// named-output collector from MultipleOutputs (writes the separate files)
OutputCollector collector =
    multipleOutputs.getCollector(stringlabel, keyText, reporter);

collector.collect(keyText, value);   // goes to the named output
output.collect(key, value);          // goes to the normal reduce output

The collector generates the separate output files, while output makes
sure the data flow is passed on to the Reducer. After this change,
both the Combiner and the Reducer work.


The remaining question is: if I want to use both a Combiner and a Reducer,
must the input and output of the Reduce class be the same <K2, V2>?
Otherwise, how do I do it? I find the use case very limited here; for
example, what if the Reducer class is a little more complicated, having
<K2, V2> as input and <K3, V3> as output?


Thanks again.

Shi


On 2010-10-5 23:48, Antonio Piccolboni wrote:

On Tue, Oct 5, 2010 at 4:32 PM, Shi Yush...@uchicago.edu  wrote:

   

Hi,

I am still confused about the effect of using a Combiner in Hadoop
Map/Reduce. The performance tips
(http://www.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/)
suggest writing a combiner to do initial aggregation before the data
hits the reducer, for performance advantages. But in most of the example code
or books I have seen, the same reduce class is set as both the reducer and the
combiner, such as

conf.setCombinerClass(Reduce.class);

conf.setReducerClass(Reduce.class);


I don't know the specific reason for doing it like this. In my own code,
based on Hadoop 0.19.2, if I set the combiner class to the reduce class
while using MultipleOutputs, the output files are named xxx-m-0. And if
there are multiple input paths, the number of output files is the same
as the number of input paths. conf.setNumReduceTasks(int) then has no
effect on the number of output files. I wonder where the reducer-generated
outputs are in this case, because I cannot see them. To see the reducer
output, I have to remove the combiner class

//conf.setCombinerClass(Reduce.class);

conf.setReducerClass(Reduce.class);


and then I get the output files named xxx-r-0. I can then control
the number of output files using conf.setNumReduceTasks(int).

So my question is: what is the main advantage of setting the combiner class
and the reducer class to the same reduce class?
 


When the calculation performed by the reducer is commutative and
associative, with a combiner you get more work done before the shuffle, less
sorting and shuffling, and less work in the reducer. Like in the word count
app, the mapper emits <the, 1> a billion times, but with a combiner equal
to the reducer only <the, 10^9> has to travel to the reducer. If you
couldn't use the combiner, not only would the shuffle phase be as heavy as
if you had a billion distinct words, but the poor reducer that gets the
"the" key would also be very slow. So you would have to go through multiple
MapReduce phases to aggregate the data anyway.



   

How to merge the output files in this case?
 


While I am not sure what you mean, there is no difference to you. The output
is the same.



   

And where to find any real example using different Combiner/Reducer classes
to improve the map/reduce performance?

 

If you want to compute an average, the combiner needs to do only the sums;
the reducer does the sums and the final division. It would not be OK to
divide in the combiner. See also
http://philippeadjiman.com/blog/2010/01/14/hadoop-tutorial-series-issue-4-to-use-or-not-to-use-a-combiner/

The interface of reducer and combiner are the same, but they need not be the
same class.
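
To make that concrete, here is a minimal sketch in the old (0.19/0.20) mapred
API of an averaging job whose combiner only sums while the reducer also divides;
the class names, the "sum,count" Text encoding, and the tab-separated
key/value input lines are all just assumptions for the example:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class AverageSketch {

  // Mapper: each input line is "group<TAB>number"; emit (group, "number,1").
  public static class AvgMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter r) throws IOException {
      String[] parts = line.toString().split("\t");
      out.collect(new Text(parts[0]), new Text(parts[1] + ",1"));
    }
  }

  // Combiner: adds up partial sums and counts, but never divides.
  public static class AvgCombiner extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter r) throws IOException {
      double sum = 0; long count = 0;
      while (values.hasNext()) {
        String[] parts = values.next().toString().split(",");
        sum += Double.parseDouble(parts[0]);
        count += Long.parseLong(parts[1]);
      }
      out.collect(key, new Text(sum + "," + count));
    }
  }

  // Reducer: same summing, plus the final division the combiner must not do.
  public static class AvgReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, DoubleWritable> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, DoubleWritable> out, Reporter r) throws IOException {
      double sum = 0; long count = 0;
      while (values.hasNext()) {
        String[] parts = values.next().toString().split(",");
        sum += Double.parseDouble(parts[0]);
        count += Long.parseLong(parts[1]);
      }
      out.collect(key, new DoubleWritable(sum / count));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(AverageSketch.class);
    conf.setJobName("average-sketch");
    conf.setMapperClass(AvgMapper.class);
    conf.setCombinerClass(AvgCombiner.class);   // a different class from the reducer
    conf.setReducerClass(AvgReducer.class);
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(Text.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}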


Antonio



   

Thanks.

Shi







 
   



--
Postdoctoral Scholar
Institute for Genomics and Systems Biology
Department of Medicine, the University of Chicago
Knapp Center for Biomedical Discovery
900 E. 57th St. Room 10148
Chicago, IL 60637, US
Tel: 773-702-6799



Re: Efficient query to directory-num-files?

2010-10-05 Thread Keith Wiley


On 2010, Oct 04, at 11:38 AM, Harsh J wrote:

On Mon, Oct 4, 2010 at 11:11 PM, Keith Wiley kwi...@keithwiley.com wrote:

- I want to know how many files are in a directory.
- Well, actually, I want to know how many files are in a few thousand directories.
- I anticipate the answer to be approximately four million.
- If I were to pipe hadoop fs -ls | wc, I estimate a return of about 360 MB of textual ls data to my client (each Hadoop ls entry is about 90 B, since it is always ls -l style), when all I really want is the file count.

Is there a smarter way to do this?

Thanks.


There's FileSystem.listStatus(...).length you could use, in Java.

(Cook up a utility for it if you need it on the command line. It's what the
FsShell does anyway when you use it via 'hadoop fs/dfs'.)

But I do not know if this will actually reduce the query time, as it seems
to create an array of all the entries under a path. I could not find a
direct counting command; even the count given by the FsShell seems to work
this way. Trying it on some 50,000 items I created for testing, it seemed
quick enough. I wouldn't know about 4 million though! Try it out and wait
for better answers, if any! :)
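
A quick sketch of such a utility (the class name is made up; it just prints
listStatus().length for each directory argument and a grand total):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirFileCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    long total = 0;
    for (String dir : args) {
      FileStatus[] entries = fs.listStatus(new Path(dir));  // one array per directory
      int n = (entries == null) ? 0 : entries.length;
      System.out.println(dir + "\t" + n);
      total += n;
    }
    System.out.println("TOTAL\t" + total);
  }
}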




Thanks, I'll take a look at that and see what I can do with it.

Cheers!


Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance-to-knowledge ratio
than when I entered.
   --  Keith Wiley