Re: is 12 minutes ok for dfs chown -R on 45000 files ?

2008-10-03 Thread Raghu Angadi


This is mostly disk bound on the NameNode. I think this ends up being one 
fsync for each file. If you have multiple directories, you could start 
multiple commands in parallel. Because of the way the NameNode syncs, having 
multiple clients helps.


Raghu.
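
For illustration, a rough Java sketch of that parallel approach (not from this
thread; the root path, thread count, and target owner are made up): each
top-level directory is walked in its own thread, issuing FileSystem.setOwner()
calls, so several clients talk to the NameNode concurrently, much like running
several chown -R commands at once.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelChown {

  // Recursively set owner/group on a path and everything below it.
  static void chownRecursive(FileSystem fs, Path p, String user, String group)
      throws Exception {
    fs.setOwner(p, user, group);
    if (fs.getFileStatus(p).isDir()) {
      for (FileStatus child : fs.listStatus(p)) {
        chownRecursive(fs, child.getPath(), user, group);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    final FileSystem fs = FileSystem.get(new Configuration());
    Path root = new Path("/home/frank/proj100");      // example root
    ExecutorService pool = Executors.newFixedThreadPool(4);
    // One task per top-level directory, so several clients hit the NameNode
    // at the same time and its edit-log syncs serve more than one request.
    for (final FileStatus top : fs.listStatus(root)) {
      pool.submit(new Runnable() {
        public void run() {
          try {
            chownRecursive(fs, top.getPath(), "hadoop", "hadoop");
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
  }
}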

Frank Singleton wrote:


Hi,

Did a test on recursive chown on a Fedora 9 box here (2x quad core, 16GB RAM).
It took about 12.5 minutes to complete for 45000 files (hmm, approx 60 files/sec).

This was the namenode that I executed the command on.

Q1. Is this rate (60 files/sec) typical of what other folks are seeing?
Q2. Are there any dfs/jvm parameters I should look at to see if I can improve
this?

time /home/hadoop/hadoop-0.18.1/bin/hadoop dfs -chown -R frank:frank 
/home/frank/proj100

real    12m38.631s
user    1m54.662s
sys     0m33.124s

time /home/hadoop/hadoop-0.18.1/bin/hadoop dfs -count /home/frank/proj100
22045891 3965996260 hdfs://namenode:9000/home/frank/proj100

real    0m1.579s
user    0m0.686s
sys     0m0.129s


cheers / frank




Re: is 12 minutes ok for dfs chown -R on 45000 files ?

2008-10-03 Thread Frank Singleton

Frank Singleton wrote:
 Hi,
 
 Did a test on recursive chown on a fedora 9 box here (2xquad core,16Gram)
 Took about 12.5 minutes to complete for 45000 files. (hmm approx 60 files/sec)
 
 This was the namenode that I executed the command on
 
 Q1. Is this rate (60 files/sec) typical of what other folks are seeing ?
 Q2. Are there any dfs/jvm parameters I should look at to see if I can improve 
 this
 
 time /home/hadoop/hadoop-0.18.1/bin/hadoop dfs -chown -R frank:frank 
 /home/frank/proj100
 
 real  12m38.631s
 user  1m54.662s
 sys   0m33.124s
 
 time /home/hadoop/hadoop-0.18.1/bin/hadoop dfs -count /home/frank/proj100
  22045891 3965996260 
 hdfs://namenode:9000/home/frank/proj100
 
 real  0m1.579s
 user  0m0.686s
 sys   0m0.129s
 
 
 cheers / frank

Just to clarify, this is for the case where the chown actually modifies the
files' owner attributes,

e.g. toggling all files from frank:frank to hadoop:hadoop (see below).

For chown -R from frank:frank to frank:frank, the result is only 5 or 6
seconds.

At this point, all files under /home/frank/proj100 are frank:frank, and the
command executes in 6 seconds or so.

[EMAIL PROTECTED] ~]$ time /home/hadoop/hadoop-0.18.1/bin/hadoop dfs -chown -R frank:frank /home/frank/proj100

real    0m5.624s
user    0m6.744s
sys     0m0.402s

# now let's change all to hadoop:hadoop

[EMAIL PROTECTED] ~]$ time /home/hadoop/hadoop-0.18.1/bin/hadoop dfs -chown -R hadoop:hadoop /home/frank/proj100

real    12m43.732s
user    0m53.781s
sys     0m10.655s

# now toggle back to frank:frank

[EMAIL PROTECTED] ~]$ time /home/hadoop/hadoop-0.18.1/bin/hadoop dfs -chown -R frank:frank /home/frank/proj100

real    12m40.700s
user    0m45.757s
sys     0m8.173s

# now frank:frank to frank:frank

[EMAIL PROTECTED] ~]$ time /home/hadoop/hadoop-0.18.1/bin/hadoop dfs -chown -R frank:frank /home/frank/proj100

real    0m5.648s
user    0m6.734s
sys     0m0.593s
[EMAIL PROTECTED] ~]$


cheers / frank



Re: Sharing an object across mappers

2008-10-03 Thread Devajyoti Sarkar
Hi Alan,

Thanks for your message.

The object can be read-only once it is initialized - I do not need to modify
it. Essentially it is an object that allows me to analyze/modify the data that
I am mapping/reducing. It comes to about 3-4GB of RAM. The problem I have is
that if I run multiple mappers, this object gets replicated in the different
VMs and I run out of memory on my node. I pretty much need to have the full
object in memory to do my processing. It is possible (though quite
difficult) to keep it partially on disk and query it (like a Lucene store
implementation), but there is a significant performance hit. For example, say
I use the extra-large CPU instance at Amazon (8 CPUs, 8GB RAM). In this
scenario, I can really only have 1 mapper per node whereas there are 8 CPUs.
But if the overhead of sharing the object (e.g. via RMI) or persisting it
(e.g. in Lucene) slows access down by more than a factor of 8, then it is
cheaper to just run 1 mapper per node. I tried sharing with Terracotta and I
was seeing roughly a 600x slowdown versus in-memory access.

So ideally, if I could have all the mappers in the same VM, then I could
create a singleton and still have multiple mappers access it at memory
speed.

Please do let me know if I am looking at this correctly and if the above is
possible.

Thanks a lot for all your help.

Cheers,
Dev




On Fri, Oct 3, 2008 at 12:49 PM, Alan Ho [EMAIL PROTECTED] wrote:

 It really depends on what type of data you are sharing, how you are looking
 up the data, whether the data is Read-write, and whether you care about
 consistency. If you don't care about consistency, I suggest that you shove
 the data into a BDB store (for key-value lookup) or a lucene store, and copy
 the data to all the nodes. That way all data access will be in-process, no
 gc problems, and you will get very fast results. BDB and lucene both have
 easy replication strategies.

 If the data is RW, and you need consistency, you should probably forget
 about MapReduce and just run everything on big-iron.

 Regards,
 Alan Ho




 - Original Message 
 From: Devajyoti Sarkar [EMAIL PROTECTED]
 To: core-user@hadoop.apache.org
 Sent: Thursday, October 2, 2008 8:41:04 PM
 Subject: Sharing an object across mappers

 I think each mapper/reducer runs in its own JVM which makes it impossible
 to share objects. I need to share a large object so that I can access it at
 memory speed across all the mappers. Is it possible to have all the mappers
 run in the same VM? Or is there a way to do this across VMs at high speed?
 I guess RMI and other such methods will be just too slow.

 Thanks,
 Dev






Re: 1 file per record

2008-10-03 Thread chandravadana


Suppose I use TextInputFormat, I set isSplitable to false, and there are 5
files. So what happens to numSplits now? Will that be set to 0?

S.Chandravadana


owen.omalley wrote:
 
 On Oct 2, 2008, at 1:50 AM, chandravadana wrote:
 
 If we dont specify numSplits in getsplits(), then what is the default
 number of splits taken...
 
 The getSplits() is either library or user code, so it depends which  
 class you are using as your InputFormat. The FileInputFormats  
 (TextInputFormat and SequenceFileInputFormat) basically divide input  
 files by blocks, unless the requested number of mappers is really high.
 
 -- Owen
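
For reference, a small hypothetical sketch against the 0.18 mapred API:
setting isSplitable to return false does not drive the number of splits to 0;
as far as I recall, FileInputFormat.getSplits() still produces one split per
non-splittable file, so 5 files give 5 map tasks regardless of the numSplits
hint (which is only a hint). This assumes TextInputFormat is subclassable in
your version.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// One split (and hence one map task) per input file: never split a file.
public class WholeFileTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;   // ignore block boundaries and the numSplits hint
  }
}

It would be wired in with something like
conf.setInputFormat(WholeFileTextInputFormat.class) on the JobConf.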
 
 

-- 
View this message in context: 
http://www.nabble.com/1-file-per-record-tp19644985p19794194.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Sharing an object across mappers

2008-10-03 Thread Arun C Murthy


On Oct 3, 2008, at 1:10 AM, Devajyoti Sarkar wrote:


Hi Alan,

Thanks for your message.

The object can be read-only once it is initialized - I do not need to modify


Please take a look at DistributedCache:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache

An example:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v2.0

Arun
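
A minimal sketch of the submit-side call, in case it helps (the file path and
job class are placeholders): the cached file is localized once per node, not
once per task.

import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetup {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CacheSetup.class);
    // Ship the large read-only object (already serialized to HDFS) to every
    // node that runs tasks of this job. Path is hypothetical.
    DistributedCache.addCacheFile(new URI("/user/dev/model.bin"), conf);
    // ... set mapper/reducer classes, input/output paths, etc. ...
    JobClient.runJob(conf);
  }
}

Tasks then look up the node-local copy via
DistributedCache.getLocalCacheFiles(conf) in their configure() method.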



it. Essentially it is an object that allows me to analyze/modify data that I
am mapping/reducing. It comes to about 3-4GB of RAM. The problem I have is
that if I run multiple mappers, this object gets replicated in the different
VMs and I run out of memory on my node. I pretty much need to have the full
object in memory to do my processing. It is possible (though quite
difficult) to have it partially on disk and query it (like a lucene store
implementation) but there is a significant performance hit. As an e.g., let
us say I use the xlarge CPU instance at Amazon (8CPUs, 8GB RAM). In this
scenario, I can really only have 1 mapper per node whereas there are 8 CPUs.
But if the overhead of sharing the object (e.g. RMI) or persisting the
object (e.g. lucene) is greater than 8 times the memory speed, then it is
cheaper to run 1 mapper/node. I tried sharing with Terracotta and I was
getting a roughly 600 times decrease in performance versus in-memory access.

So ideally, if I could have all the mappers in the same VM, then I can
create a singleton and still have multiple mappers access it at memory
speeds.

Please do let me know if I am looking at this correctly and if the above is
possible.

Thanks a lot for all your help.

Cheers,
Dev


On Fri, Oct 3, 2008 at 12:49 PM, Alan Ho [EMAIL PROTECTED] wrote:

It really depends on what type of data you are sharing, how you are looking
up the data, whether the data is Read-write, and whether you care about
consistency. If you don't care about consistency, I suggest that you shove
the data into a BDB store (for key-value lookup) or a lucene store, and copy
the data to all the nodes. That way all data access will be in-process, no
gc problems, and you will get very fast results. BDB and lucene both have
easy replication strategies.

If the data is RW, and you need consistency, you should probably forget
about MapReduce and just run everything on big-iron.

Regards,
Alan Ho


- Original Message 
From: Devajyoti Sarkar [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Thursday, October 2, 2008 8:41:04 PM
Subject: Sharing an object across mappers

I think each mapper/reducer runs in its own JVM which makes it impossible
to share objects. I need to share a large object so that I can access it at
memory speeds across all the mappers. Is it possible to have all the
mappers run in the same VM? Or is there a way to do this across VMs at high
speed? I guess JMI and others such methods will be just too slow.

Thanks,
Dev
  





Re: Sharing an object across mappers

2008-10-03 Thread Devajyoti Sarkar
Hi Arun,

Briefly going through the DistributedCache information, it seems to be a way
to distribute files to mappers/reducers. One still needs to read the
contents into each map/reduce task VM. Therefore, the data gets replicated
across the VMs in a single node. It seems it does not address my basic
problem which is to have a large shared object across multiple map/reduce
tasks at a given node without having to replicate it across the VMs.

Is there a setting in Hadoop where one can tell Hadoop to create the
individual map/reduce tasks in the same JVM?

Thanks,
Dev


On Fri, Oct 3, 2008 at 10:32 PM, Arun C Murthy [EMAIL PROTECTED] wrote:


 On Oct 3, 2008, at 1:10 AM, Devajyoti Sarkar wrote:

  Hi Alan,

 Thanks for your message.

 The object can be read-only once it is initialized - I do not need to
 modify


 Please take a look at DistributedCache:

 http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache

 An example:

 http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v2.0

 Arun



 it. Essentially it is an object that allows me to analyze/modify data that
 I
 am mapping/reducing. It comes to about 3-4GB of RAM. The problem I have is
 that if I run multiple mappers, this object gets replicated in the
 different
 VMs and I run out of memory on my node. I pretty much need to have the
 full
 object in memory to do my processing. It is possible (though quite
 difficult) to have it partially on disk and query it (like a lucene store
 implementation) but there is a significant performance hit. As an e.g.,
 let
 us say I use the xlarge CPU instance at Amazon (8CPUs, 8GB RAM). In this
 scenario, I can really only have 1 mapper per node whereas there are 8
 CPUs.
 But if the overhead of sharing the object (e.g. RMI) or persisting the
 object (e.g. lucene) is greater than 8 times the memory speed, then it is
 cheaper to run 1 mapper/node. I tried sharing with Terracotta and I was
 getting a roughly 600 times decrease in performance versus in-memory
 access.

 So ideally, if I could have all the mappers in the same VM, then I can
 create a singleton and still have multiple mappers access it at memory
 speeds.

 Please do let me know if I am looking at this correctly and if the above
 is
 possible.

 Thanks a lot for all your help.

 Cheers,
 Dev




 On Fri, Oct 3, 2008 at 12:49 PM, Alan Ho [EMAIL PROTECTED] wrote:

  It really depends on what type of data you are sharing, how you are
 looking
 up the data, whether the data is Read-write, and whether you care about
 consistency. If you don't care about consistency, I suggest that you
 shove
 the data into a BDB store (for key-value lookup) or a lucene store, and
 copy
 the data to all the nodes. That way all data access will be in-process,
 no
 gc problems, and you will get very fast results. BDB and lucene both have
 easy replication strategies.

 If the data is RW, and you need consistency, you should probably forget
 about MapReduce and just run everything on big-iron.

 Regards,
 Alan Ho




 - Original Message 
 From: Devajyoti Sarkar [EMAIL PROTECTED]
 To: core-user@hadoop.apache.org
 Sent: Thursday, October 2, 2008 8:41:04 PM
 Subject: Sharing an object across mappers

 I think each mapper/reducer runs in its own JVM which makes it impossible
 to
 share objects. I need to share a large object so that I can access it at
 memory speeds across all the mappers. Is it possible to have all the
 mappers
 run in the same VM? Or is there a way to do this across VMs at high
 speed?
 I
 guess JMI and others such methods will be just too slow.

 Thanks,
 Dev








Re: architecture diagram

2008-10-03 Thread Alex Loddengaard
Can you confirm that the example you've presented is accurate?  I think you
may have made some typos, because the letter G isn't in the final result;
I also think your first pivot accidentally swapped C and G.  I'm having a
hard time understanding what you want to do, because it seems like your
operations differ from your example.

With that said, at first glance, this problem may not fit well into the
MapReduce paradigm.  The reason I'm making this claim is because in order to
do the pivot operation you must know about every row.  Your input files will
be split at semi-arbitrary places, essentially making it impossible for each
mapper to know every single row.  There may be a way to do this by
collecting, in your map step, key = column number (0, 1, 2, etc) and value
= (A, B, C, etc), though you may run into problems when you try to pivot
back.  I say this because when you pivot back, you need to have each column,
which means you'll need one reduce step.  There may be a way to put the
pivot-back operation in a second iteration, though I don't think that would
help you.

Terrence, please confirm that you've defined your example correctly.  In the
meantime, can someone else confirm that this problem does not fit well into
the MapReduce paradigm?

Alex
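
To make the keying idea above concrete, here is a rough, hypothetical sketch
of just that map step against the old mapred API (a '|' delimiter is assumed
from the example): every cell is emitted under its column index, so each
reduce call sees one whole column and can shuffle it; pivoting the shuffled
columns back into rows is the part that still needs a separate pass.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emit (column index, cell value) so a reducer sees all values of one column.
public class ColumnMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, Text> {

  public void map(LongWritable offset, Text line,
                  OutputCollector<IntWritable, Text> out, Reporter reporter)
      throws IOException {
    String[] cells = line.toString().split("\\|");
    for (int col = 0; col < cells.length; col++) {
      out.collect(new IntWritable(col), new Text(cells[col]));
    }
  }
}

A reducer keyed on the column index could then buffer that column's values,
Collections.shuffle() them, and emit the result.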

On Thu, Oct 2, 2008 at 10:48 AM, Terrence A. Pietrondi 
[EMAIL PROTECTED] wrote:

 I am trying to write a map reduce implementation to do the following:

 1) read tabular data delimited in some fashion
 2) pivot that data, so the rows are columns and the columns are rows
 3) shuffle the rows (that were the columns) to randomize the data
 4) pivot the data back

 For example.

 A|B|C
 D|E|G

 pivots too...

 D|A
 E|B
 C|G

 Then for each row, shuffle the contents around randomly...

 D|A
 B|E
 G|C

 Then pivot the data back...

 A|E|C
 D|B|C

 You can reference my progress so far...

 http://svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/

 Terrence A. Pietrondi


 --- On Thu, 10/2/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

 From: Alex Loddengaard [EMAIL PROTECTED]
 Subject: Re: architecture diagram
 To: core-user@hadoop.apache.org
 Date: Thursday, October 2, 2008, 1:36 PM

 I think it really depends on the job as to where logic goes.  Sometimes your
 reduce step is as simple as an identify function, and sometimes it can be
 more complex than your map step.  It all depends on your data and the
 operation(s) you're trying to perform.

 Perhaps we should step out of the abstract.  Do you have a specific problem
 you're trying to solve?  Can you describe it?

 Alex

 On Thu, Oct 2, 2008 at 4:55 AM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote:

 I am sorry for the confusion. I meant distributed data.

 So help me out here. For example, if I am reducing to a single file, then
 my main transformation logic would be in my mapping step since I am
 reducing away from the data?

 Terrence A. Pietrondi
 http://del.icio.us/tepietrondi

 --- On Wed, 10/1/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

 From: Alex Loddengaard [EMAIL PROTECTED]
 Subject: Re: architecture diagram
 To: core-user@hadoop.apache.org
 Date: Wednesday, October 1, 2008, 7:44 PM

 I'm not sure what you mean by disconnected parts of data, but Hadoop is
 implemented to try and perform map tasks on machines that have input data.
 This is to lower the amount of network traffic, hence making the entire job
 run faster.  Hadoop does all this for you under the hood.  From a user's
 point of view, all you need to do is store data in HDFS (the distributed
 filesystem), and run MapReduce jobs on that data.  Take a look here:

 http://wiki.apache.org/hadoop/WordCount

 Alex

 On Wed, Oct 1, 2008 at 1:11 PM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote:

 So to be distributed in a sense, you would want to do your computation on
 the disconnected parts of data in the map phase I would guess?

 Terrence A. Pietrondi
 http://del.icio.us/tepietrondi

 --- On Wed, 10/1/08, Arun C Murthy [EMAIL PROTECTED] wrote:

 From: Arun C Murthy [EMAIL PROTECTED]
 Subject: Re: architecture diagram
 To: core-user@hadoop.apache.org
 Date: Wednesday, October 1, 2008, 2:16 PM

 On Oct 1, 2008, at 10:17 AM, Terrence A. Pietrondi wrote:

 I am trying to plan out my map-reduce implementation and I have some
 questions of where computation should be split in order to take advantage
 of the distributed nodes.

 Looking at the architecture diagram
 (http://hadoop.apache.org/core/images/architecture.gif), are the map boxes
 the major computation areas or is the reduce the major computation area?

 Usually the maps perform the 'embarrassingly

Re: architecture diagram

2008-10-03 Thread Terrence A. Pietrondi
Sorry for the confusion, I did make some typos. My example should have looked 
like... 

 A|B|C
 D|E|G

 pivots too...

 D|A
 E|B
 G|C

 Then for each row, shuffle the contents around randomly...

 D|A
 B|E
 C|G

 Then pivot the data back...

 A|E|G
 D|B|C

The general goal is to shuffle the elements in each column of the input data,
meaning the ordering of the elements in each column will not be the same as in
the input.

If you look at the initial input and compare it to the final output, you'll see
that during the shuffling, B and E are swapped, and G and C are swapped, while
A and D were shuffled back into their originating positions in the column.

Once again, sorry for the typos and confusion.

Terrence A. Pietrondi

--- On Fri, 10/3/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

 From: Alex Loddengaard [EMAIL PROTECTED]
 Subject: Re: architecture diagram
 To: core-user@hadoop.apache.org
 Date: Friday, October 3, 2008, 11:01 AM
 Can you confirm that the example you've presented is accurate?  I think you
 may have made some typos, because the letter G isn't in the final result;
 I also think your first pivot accidentally swapped C and G.  I'm having a
 hard time understanding what you want to do, because it seems like your
 operations differ from your example.

 With that said, at first glance, this problem may not fit well in to the
 MapReduce paradigm.  The reason I'm making this claim is because in order to
 do the pivot operation you must know about every row.  Your input files will
 be split at semi-arbitrary places, essentially making it impossible for each
 mapper to know every single row.  There may be a way to do this by
 collecting, in your map step, key = column number (0, 1, 2, etc) and value
 = (A, B, C, etc), though you may run in to problems when you try to pivot
 back.  I say this because when you pivot back, you need to have each column,
 which means you'll need one reduce step.  There may be a way to put the
 pivot-back operation in a second iteration, though I don't think that would
 help you.

 Terrence, please confirm that you've defined your example correctly.  In the
 meantime, can someone else confirm that this problem does not fit will in to
 the MapReduce paradigm?

 Alex

 On Thu, Oct 2, 2008 at 10:48 AM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote:

 I am trying to write a map reduce implementation to do the following:

 1) read tabular data delimited in some fashion
 2) pivot that data, so the rows are columns and the columns are rows
 3) shuffle the rows (that were the columns) to randomize the data
 4) pivot the data back

 For example.

 A|B|C
 D|E|G

 pivots too...

 D|A
 E|B
 C|G

 Then for each row, shuffle the contents around randomly...

 D|A
 B|E
 G|C

 Then pivot the data back...

 A|E|C
 D|B|C

 You can reference my progress so far...

 http://svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/

 Terrence A. Pietrondi

 --- On Thu, 10/2/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

 From: Alex Loddengaard [EMAIL PROTECTED]
 Subject: Re: architecture diagram
 To: core-user@hadoop.apache.org
 Date: Thursday, October 2, 2008, 1:36 PM

 I think it really depends on the job as to where logic goes.  Sometimes your
 reduce step is as simple as an identify function, and sometimes it can be
 more complex than your map step.  It all depends on your data and the
 operation(s) you're trying to perform.

 Perhaps we should step out of the abstract.  Do you have a specific problem
 you're trying to solve?  Can you describe it?

 Alex

 On Thu, Oct 2, 2008 at 4:55 AM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote:

 I am sorry for the confusion. I meant distributed data.

 So help me out here. For example, if I am reducing to a single file, then
 my main transformation logic would be in my mapping step since I am
 reducing away from the data?

 Terrence A. Pietrondi
 http://del.icio.us/tepietrondi

 --- On Wed, 10/1/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

 From: Alex Loddengaard [EMAIL PROTECTED]
 Subject: Re: architecture diagram
 To: core-user@hadoop.apache.org
 Date: Wednesday, October 1, 2008, 7:44 PM

 I'm not sure what you mean by disconnected parts of data, but Hadoop is
 implemented to try and perform map tasks on machines that have input data.
 This is to lower the amount of network traffic, hence making the entire job
 run faster.  Hadoop does all this for you under the hood.  From a user's
 point of view, all you need to do is store data in HDFS (the distributed
 filesystem), and run MapReduce jobs on that data.  Take a look here:

 http://wiki.apache.org/hadoop/WordCount

 Alex

 On Wed, Oct 1, 2008 at 1:11 PM, Terrence A. Pietrondi

Unable to retrieve filename using mapred.input.file

2008-10-03 Thread Yair Even-Zohar
I'm running map reduce and have the following lines of code:

public void configure(JobConf job) {
    mapTaskId = job.get("mapred.task.id");
    inputFile = job.get("mapred.input.file");

 

The problem I'm facing is that the inputFile I'm getting is null (the
mapTaskId works fine).

 

The input files are all the files in a given directory and they are all
gzipped. Something like .../blah/*.gz

 

Any suggestion on how to get the name of the file being processed to the
map task?

 

Thanks

-Yair
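
For what it's worth, a hedged sketch: in this era of Hadoop the per-task input
path is usually exposed under the key map.input.file (if memory serves, that
is the key the tutorial's WordCount v2.0 example reads), which would explain
why mapred.input.file comes back null. Worth verifying against your version.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class FileAwareMapperBase extends MapReduceBase {
  private String mapTaskId;
  private String inputFile;

  @Override
  public void configure(JobConf job) {
    mapTaskId = job.get("mapred.task.id");
    // "map.input.file" (rather than "mapred.input.file") is the property
    // that holds the path of the file the current map task is reading.
    inputFile = job.get("map.input.file");
  }
}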



Re: Maps running after reducers complete successfully?

2008-10-03 Thread pvvpr
thanks Owen,
  So this may be an enhancement?

- Prasad.

On Thursday 02 October 2008 09:58:03 pm Owen O'Malley wrote:
 It isn't optimal, but it is the expected behavior. In general when we
 lose a TaskTracker, we want the map outputs regenerated so that any
 reduces that need to re-run (including speculative execution). We
 could handle it as a special case if:
1. We didn't lose any running reduces.
2. All of the reduces (including speculative tasks) are done with
 shuffling.
3. We don't plan on launching any more speculative reduces.
 If all 3 hold, we don't need to re-run the map tasks. Actually doing
 so, would be a pretty involved patch to the JobTracker/Schedulers.

 -- Owen







Re: Sharing an object across mappers

2008-10-03 Thread Owen O'Malley


On Oct 3, 2008, at 7:49 AM, Devajyoti Sarkar wrote:

Briefly going through the DistributedCache information, it seems to be a way
to distribute files to mappers/reducers.


Sure, but it handles the distribution problem for you.


One still needs to read the
contents into each map/reduce task VM.


If the data is straight binary data, you could just mmap it from the  
various tasks. It would be pretty efficient.
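
A rough sketch of that mmap route, assuming the file was shipped with
DistributedCache.addCacheFile() at submit time (names are placeholders):
mapping it read-only means the OS page cache backs every task JVM on the node,
so the data is held once per machine rather than once per VM. One caveat: a
single MappedByteBuffer tops out at 2GB, so a 3-4GB object would have to be
mapped in chunks.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class MmapMapperBase extends MapReduceBase {
  protected MappedByteBuffer data;   // read-only view of the shared object

  @Override
  public void configure(JobConf job) {
    try {
      // First (and, in this sketch, only) file added to the DistributedCache.
      Path[] local = DistributedCache.getLocalCacheFiles(job);
      RandomAccessFile raf = new RandomAccessFile(local[0].toString(), "r");
      FileChannel channel = raf.getChannel();
      // Read-only mapping: pages are shared across task JVMs via the OS cache.
      data = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    } catch (IOException e) {
      throw new RuntimeException("failed to mmap cached file", e);
    }
  }
}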


The other direction is to use the MultiThreadedMapRunner and run  
multiple maps as threads in the same VM. But unless your maps are CPU  
heavy or contacting external servers, it probably won't help as much  
as you'd like.


-- Owen


Re: Sharing an object across mappers

2008-10-03 Thread Devajyoti Sarkar
Hi Owen,

Thanks a lot for the pointers.

In order to use the MultiThreadedMapRunner, if I set the map runner class via
setMapRunnerClass() on the JobConf, does the rest of my code remain the same
(apart from making it thread-safe)?

Thanks in advance,
Dev
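
As a sketch of the wiring (in this vintage the class is
org.apache.hadoop.mapred.lib.MultithreadedMapRunner; the thread-count property
name below is from memory, so treat it as an assumption to verify):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

public class MultithreadedJobSetup {
  public static void main(String[] args) {
    JobConf conf = new JobConf(MultithreadedJobSetup.class);
    // Run several map() threads inside one task JVM instead of one per JVM.
    conf.setMapRunnerClass(MultithreadedMapRunner.class);
    // Number of concurrent map threads per task (property name assumed).
    conf.setInt("mapred.map.multithreadedrunner.threads", 8);
    // ... mapper class, input/output formats and paths as usual ...
  }
}

The Mapper itself stays the same apart from needing to be thread-safe, which
matches Owen's caveat that this mainly pays off for CPU-heavy maps or maps
that wait on external servers.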


On Sat, Oct 4, 2008 at 12:29 AM, Owen O'Malley [EMAIL PROTECTED] wrote:


 On Oct 3, 2008, at 7:49 AM, Devajyoti Sarkar wrote:

  Briefly going through the DistributedCache information, it seems to be a
 way
 to distribute files to mappers/reducers.


 Sure, but it handles the distribution problem for you.

  One still needs to read the
 contents into each map/reduce task VM.


 If the data is straight binary data, you could just mmap it from the
 various tasks. It would be pretty efficient.

 The other direction is to use the MultiThreadedMapRunner and run multiple
 maps as threads in the same VM. But unless your maps are CPU heavy or
 contacting external servers, it probably won't help as much as you'd like.

 -- Owen



mapreduce input file question

2008-10-03 Thread Ski Gh3
Hi all,

I have a maybe naive question on providing input to a mapreduce program:
   how can I specify the input with respect to the hdfs path?

right now I can specify an input file from my local directory, say, the hadoop
trunk. I can also specify an absolute path for a dfs file by using the place
where it is actually stored on my local node, e.g. /usr/username/tmp/x

How can I do something like hdfs://inputdata/myinputdata.txt? I always get a
"cannot find file" kind of error.
Furthermore, maybe the input files can already be some sharded outputs from
another mapreduce, e.g., myinputdata-0001.txt, myinputdata-0002.txt?

Thanks a lot!


Re: Maps running after reducers complete successfully?

2008-10-03 Thread Billy Pearson

Do we not have an option to store the map results in hdfs?

Billy

Owen O'Malley [EMAIL PROTECTED] wrote in 
message news:[EMAIL PROTECTED]
It isn't optimal, but it is the expected behavior. In general when we
lose a TaskTracker, we want the map outputs regenerated so that any
reduces that need to re-run (including speculative execution). We could
handle it as a special case if:

  1. We didn't lose any running reduces.
  2. All of the reduces (including speculative tasks) are done with shuffling.
  3. We don't plan on launching any more speculative reduces.

If all 3 hold, we don't need to re-run the map tasks. Actually doing so,
would be a pretty involved patch to the JobTracker/Schedulers.


-- Owen






Re: mapreduce input file question

2008-10-03 Thread Alex Loddengaard
First, you need to point a MapReduce job at a directory, not an individual
file.  Second, when you specify a path in your job conf using the Path
object, the path you supply is an HDFS path, not a local path.

Yes, you can use the output files of another MapReduce job as input for a
second job, but again you want to point your second job's input at the
directory that the first job outputted to.

Hope this helps.

Alex

On Fri, Oct 3, 2008 at 11:15 AM, Ski Gh3 [EMAIL PROTECTED] wrote:

 Hi all,

 I have a maybe naive question on providing input to a mapreduce program:
   how can I specify the input with respect to the hdfs path?

 right now I can specify a input file from my local directory, say, hadoop
 trunk
 I can also specify an absolute path for a dfs file using where it is
 actually stored on my local node, eg/, /usr/username/tmp/x

 How can I do something like hdfs://inputdata/myinputdata.txt? I always got
 a
 cannot find file kind of error
 Furthermore, maybe the input files can already be some sharded outputs from
 another mapreduce, e.g., myinputdata-0001.txt, myinputdata-0002.txt?

 Thanks a lot!



Re: architecture diagram

2008-10-03 Thread Alex Loddengaard
The approach that you've described does not fit well into the MapReduce
paradigm.  You may want to consider randomizing your data in a different
way.

Unfortunately some things can't be solved well with MapReduce, and I think
this is one of them.

Can someone else say more?

Alex

On Fri, Oct 3, 2008 at 8:15 AM, Terrence A. Pietrondi [EMAIL PROTECTED]
 wrote:

 Sorry for the confusion, I did make some typos. My example should have
 looked like...

  A|B|C
  D|E|G
 
  pivots too...
 
  D|A
  E|B
  G|C
 
  Then for each row, shuffle the contents around randomly...
 
  D|A
  B|E
  C|G
 
  Then pivot the data back...
 
  A|E|G
  D|B|C

 The general goal is to shuffle the elements in each column in the input
 data. Meaning, the ordering of the elements in each column will not be the
 same as in input.

 If you look at the initial input and compare to the final output, you'll
 see that during the shuffling, B and E are swapped, and G and C are swapped,
 while A and D were shuffled back into their originating positions in the
 column.

 Once again, sorry for the typos and confusion.

 Terrence A. Pietrondi

 --- On Fri, 10/3/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

  From: Alex Loddengaard [EMAIL PROTECTED]
  Subject: Re: architecture diagram
  To: core-user@hadoop.apache.org
  Date: Friday, October 3, 2008, 11:01 AM
 Can you confirm that the example you've presented is accurate?  I think you
 may have made some typos, because the letter G isn't in the final result;
 I also think your first pivot accidentally swapped C and G.  I'm having a
 hard time understanding what you want to do, because it seems like your
 operations differ from your example.

 With that said, at first glance, this problem may not fit well in to the
 MapReduce paradigm.  The reason I'm making this claim is because in order to
 do the pivot operation you must know about every row.  Your input files will
 be split at semi-arbitrary places, essentially making it impossible for each
 mapper to know every single row.  There may be a way to do this by
 collecting, in your map step, key = column number (0, 1, 2, etc) and value
 = (A, B, C, etc), though you may run in to problems when you try to pivot
 back.  I say this because when you pivot back, you need to have each column,
 which means you'll need one reduce step.  There may be a way to put the
 pivot-back operation in a second iteration, though I don't think that would
 help you.

 Terrence, please confirm that you've defined your example correctly.  In the
 meantime, can someone else confirm that this problem does not fit will in to
 the MapReduce paradigm?

 Alex

 On Thu, Oct 2, 2008 at 10:48 AM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote:

 I am trying to write a map reduce implementation to do the following:

 1) read tabular data delimited in some fashion
 2) pivot that data, so the rows are columns and the columns are rows
 3) shuffle the rows (that were the columns) to randomize the data
 4) pivot the data back

 For example.

 A|B|C
 D|E|G

 pivots too...

 D|A
 E|B
 C|G

 Then for each row, shuffle the contents around randomly...

 D|A
 B|E
 G|C

 Then pivot the data back...

 A|E|C
 D|B|C

 You can reference my progress so far...

 http://svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/

 Terrence A. Pietrondi

 --- On Thu, 10/2/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

 From: Alex Loddengaard [EMAIL PROTECTED]
 Subject: Re: architecture diagram
 To: core-user@hadoop.apache.org
 Date: Thursday, October 2, 2008, 1:36 PM

 I think it really depends on the job as to where logic goes.  Sometimes your
 reduce step is as simple as an identify function, and sometimes it can be
 more complex than your map step.  It all depends on your data and the
 operation(s) you're trying to perform.

 Perhaps we should step out of the abstract.  Do you have a specific problem
 you're trying to solve?  Can you describe it?

 Alex

 On Thu, Oct 2, 2008 at 4:55 AM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote:

 I am sorry for the confusion. I meant distributed data.

 So help me out here. For example, if I am reducing to a single file, then
 my main transformation logic would be in my mapping step since I am
 reducing away from the data?

 Terrence A. Pietrondi
 http://del.icio.us/tepietrondi

 --- On Wed, 10/1/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

 From: Alex Loddengaard [EMAIL PROTECTED]
 Subject: Re: architecture diagram
 To: core-user@hadoop.apache.org
 Date: Wednesday, October 1, 2008, 7:44 PM

 I'm not sure what you mean by disconnected parts of data, but Hadoop is
 implemented to try and perform map tasks

Re: mapreduce input file question

2008-10-03 Thread Ski Gh3
I wonder if I am missing something.

I have a .txt file for input, and I placed it under the "input" directory in
HDFS. Then I called

FileInputFormat.setInputPaths(c, new Path("input"));

and I got an error:

Exception in thread "main"
org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist :
file:/C:/workspace/MyHBase/input

The input directory has been interpreted as a local directory from where the
program was initiated...

Can you please tell me what I am doing wrong?
Thanks a lot in advance!
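
For what it's worth, a sketch of the two usual fixes (the namenode address is
only an example, borrowed from the hdfs://namenode:9000 URI that appears
earlier in this digest): either make HDFS the default filesystem so relative
paths resolve there, or pass a fully qualified hdfs:// path.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class InputPathSetup {
  public static void main(String[] args) {
    JobConf conf = new JobConf(InputPathSetup.class);

    // Option 1: make HDFS the default filesystem (normally picked up from
    // the cluster's hadoop-site.xml), so "input" resolves to the HDFS home
    // directory instead of a local Windows path.
    conf.set("fs.default.name", "hdfs://namenode:9000");
    FileInputFormat.setInputPaths(conf, new Path("input"));

    // Option 2: spell out the fully qualified HDFS path explicitly.
    // FileInputFormat.setInputPaths(conf,
    //     new Path("hdfs://namenode:9000/user/username/input"));
  }
}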



On Fri, Oct 3, 2008 at 2:15 PM, Alex Loddengaard
[EMAIL PROTECTED]wrote:

 First, you need to point a MapReduce job at a directory, not an individual
 file.  Second, when you specify a path in your job conf, using the Path
 object, that path you supply is a HDFS path, not a local path.

 Yes, you can use the output files of another MapReduce job as input for a
 second job, but again you want to point your second job's input at the
 directory that the first job outputted to.

 Hope this helps.

 Alex

 On Fri, Oct 3, 2008 at 11:15 AM, Ski Gh3 [EMAIL PROTECTED] wrote:

  Hi all,
 
  I have a maybe naive question on providing input to a mapreduce program:
how can I specify the input with respect to the hdfs path?
 
  right now I can specify a input file from my local directory, say, hadoop
  trunk
  I can also specify an absolute path for a dfs file using where it is
  actually stored on my local node, eg/, /usr/username/tmp/x
 
  How can I do something like hdfs://inputdata/myinputdata.txt? I always
 got
  a
  cannot find file kind of error
  Furthermore, maybe the input files can already be some sharded outputs
 from
  another mapreduce, e.g., myinputdata-0001.txt, myinputdata-0002.txt?
 
  Thanks a lot!
 



Turning off FileSystem statistics during MapReduce

2008-10-03 Thread Nathan Marz

Hello,

We have been doing some profiling of our MapReduce jobs, and we are seeing
that about 20% of our jobs' time is spent calling
FileSystem$Statistics.incrementBytesRead when we interact with the FileSystem.
Is there a way to turn this stats-collection off?


Thanks,
Nathan Marz
Rapleaf



Re: Turning off FileSystem statistics during MapReduce

2008-10-03 Thread Arun C Murthy

Nathan,

On Oct 3, 2008, at 5:18 PM, Nathan Marz wrote:


Hello,

We have been doing some profiling of our MapReduce jobs, and we are
seeing about 20% of the time of our jobs is spent calling
FileSystem$Statistics.incrementBytesRead when we interact with the
FileSystem. Is there a way to turn this stats-collection off?




This is interesting... could you provide more details? Are you seeing
this on Maps or Reduces? Which FileSystem exhibited this, i.e. HDFS or
LocalFS? Any details about your application?

To answer your original question - no, there isn't a way to disable
this. However, if this turns out to be a systemic problem we
definitely should consider having an option to allow users to switch
it off.

So any information you can provide helps - thanks!

Arun



Thanks,
Nathan Marz
Rapleaf





A question about Mapper

2008-10-03 Thread Zhou, Yunqing
the input is as follows.
flag
a
b
flag
c
d
e
flag
f

Then I used a mapper to first store values and then emit them all when it
meets a line containing "flag". But when the file reaches its end, I have no
chance to emit the last record (in this case, f). So how can I detect the end
of the mapper's life, or how can I emit a last record before the mapper exits?

Thanks


[Hadoop NY User Group Meetup] HIVE: Data Warehousing using Hadoop 10/9

2008-10-03 Thread Alex Dorman
Next NY Hadoop meetup will take place on Thursday, 10/9 at 6:30 pm.

Jeff Hammerbacher will present HIVE: Data Warehousing using Hadoop.

About HIVE:
- Data Organization into Tables with logical and hash partitioning 
- A Metastore to store metadata about Tables/Partitions etc 
- A SQL like query language over object data stored in Tables 
- DDL commands to define and load external data into tables

About the speaker:
Jeff Hammerbacher conceived, built, and led the Data team at Facebook.
The Data team was responsible for driving many of the applications of
statistics and machine learning at Facebook, as well as building out the
infrastructure to support these tasks for massive data sets. The team
produced two open source projects: Hive, a system for offline analysis
built above Hadoop, and Cassandra, a structured storage system on a P2P
network. Before joining Facebook, Jeff wore a suit on Wall Street and studied
Mathematics at Harvard.
Currently Jeff is an Entrepreneur in Residence at Accel Partners.

Location 
ContextWeb, 9th floor  
22 Cortlandt Street
New York, NY 10007 

If you are interested, RSVP here:
http://softwaredev.meetup.com/110/calendar/8881385/

-Alex


Re: A question about Mapper

2008-10-03 Thread Joman Chu
Hello,

Does MapReduceBase.close() fit your needs? Take a look at 
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/MapReduceBase.html#close()
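
To make that concrete, a minimal sketch of the usual pattern for this input
(old mapred API; the output key and value formatting are arbitrary): remember
the OutputCollector handed to map(), buffer lines until the next flag, and
flush whatever is left in close() so the final record (f) is not lost.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FlagRecordMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final List<String> buffer = new ArrayList<String>();
  private OutputCollector<Text, Text> out;   // remembered for close()

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    out = output;
    String value = line.toString();
    if (value.equals("flag")) {
      emitBuffered();          // a "flag" line closes the previous record
    } else {
      buffer.add(value);
    }
  }

  @Override
  public void close() throws IOException {
    emitBuffered();            // flush the final record (e.g. "f") at end of input
  }

  private void emitBuffered() throws IOException {
    if (out != null && !buffer.isEmpty()) {
      // Key and value formatting here are placeholders.
      out.collect(new Text("record"), new Text(buffer.toString()));
      buffer.clear();
    }
  }
}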

On Fri, October 3, 2008 11:36 pm, Zhou, Yunqing said:
 the input is as follows. flag a b flag c d e flag f
 
 then I used a mapper to first store values and then emit them all when
 met with a line contains flag but when the file reached its end, I have
 no chance to emit the last record.(in this case ,f) so how can I detect the
 mapper's end of its life , or how can I emit a last record before a mapper
 exits.
 
 Thanks
 

Have a good one,
-- 
Joman Chu
Carnegie Mellon University
School of Computer Science 2011
AIM: ARcanUSNUMquam



Seeking Hadoop Guru

2008-10-03 Thread howard23

Appreciate any assist on this oppty in New York City... if you or someone
you know might be interested in a F/T gig... pls contact me ASAP!

Software Engineer-Hadoop Guru   NYC F/T 

2-5yrs experience   130K+

Responsibilities

   * Develop and support a secure and flexible large-scale data
processing infrastructure for research and development within the company.
   * As a core member of a small and deeply talented team, you will be
responsible for many technical aspects of helping to deliver the
results of our R&D as a world-class platform for partners and customers.


Qualifications

   * Bachelor's Degree in Engineering, Computer Science, or related
technical field.
   * Required: real world experience building data solutions using Hadoop.
   * Strong design/admin experience with relational database systems,
esp. MySQL and/or PostgreSQL.
   * At least 4 years software engineering experience designing and
developing modern web-based consumer-facing server solutions in rapid
development cycles.
   * Expert in Java (C++, Python, a plus) development and debugging on
a Linux platform.
   * A deep and powerful need to create useful, readable and accurate
documentation as you work.

Regds,
Howard Berger
Beacon Staffing
[EMAIL PROTECTED]
-- 
View this message in context: 
http://www.nabble.com/Seeking-Hadoop-Guru-tp19809079p19809079.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.



hadoop under windows.

2008-10-03 Thread Dmitry Pushkarev
Hi.

 

I have a strange problem with hadoop when I run jobs under windows (my
laptop runs XP, but all cluster machines including the namenode run Ubuntu). I
run a job (which runs perfectly under linux, and all configs and Java versions
are the same), all mappers finish successfully, and so does the reducer, but
when it tries to copy the resulting file to the output directory I get things
like:

 

 

03.10.2008 21:47:24 *INFO * audit: ugi=Dmitry,mkpasswd,root,None,Administrators,Users ip=/171.65.102.211 cmd=rename
src=/user/public/tmp/streaming-job12345/out48/_temporary/_attempt_200810032005_0013_r_00_0/part-0
dst=/user/public/tmp/streaming-job12345/out48/_temporary/_attempt_200810032005_0013_r_00_0/part-0 perm=Dmitry:supergroup:rw-r--r--
(FSNamesystem.java, line 94)

 

And then it deletes the file, and I get no output.

Why does it rename the file onto itself, and does it have anything to do
with Path.getParent()?

 

Thanks.