Re: is 12 minutes ok for dfs chown -R on 45000 files ?
This is mostly disk bound on the NameNode. I think this ends up being one fsync for each file. If you have multiple directories, you could start multiple commands in parallel; because of the way the NameNode syncs, having multiple clients helps.

Raghu.

Frank Singleton wrote:

Hi,

Did a test on recursive chown on a Fedora 9 box here (2x quad core, 16G RAM). Took about 12.5 minutes to complete for 45000 files (hmm, approx 60 files/sec). This was the namenode that I executed the command on.

Q1. Is this rate (60 files/sec) typical of what other folks are seeing?
Q2. Are there any dfs/jvm parameters I should look at to see if I can improve this time?

time /home/hadoop/hadoop-0.18.1/bin/hadoop dfs -chown -R frank:frank /home/frank/proj100

real    12m38.631s
user    1m54.662s
sys     0m33.124s

time /home/hadoop/hadoop-0.18.1/bin/hadoop dfs -count /home/frank/proj100
22045891 3965996260 hdfs://namenode:9000/home/frank/proj100

real    0m1.579s
user    0m0.686s
sys     0m0.129s

cheers / frank
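As an illustration of the parallel approach Raghu suggests, here is a minimal, hypothetical sketch against the 0.18-era FileSystem API - the path, owner names, and one-worker-per-top-level-directory split are assumptions for illustration, not anything from Frank's setup:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ParallelChown {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();  // fs.default.name from hadoop-site.xml
      final FileSystem fs = FileSystem.get(conf);
      FileStatus[] dirs = fs.listStatus(new Path("/home/frank/proj100"));
      Thread[] workers = new Thread[dirs.length];
      for (int i = 0; i < dirs.length; i++) {
        final Path dir = dirs[i].getPath();
        workers[i] = new Thread() {
          public void run() {
            try {
              chownRecursive(fs, dir, "frank", "frank");
            } catch (IOException e) {
              e.printStackTrace();
            }
          }
        };
        workers[i].start();
      }
      for (int i = 0; i < workers.length; i++) {
        workers[i].join();
      }
    }

    // Depth-first setOwner over one subtree. Concurrent requests give the
    // NameNode a chance to batch its edit-log syncs across clients.
    static void chownRecursive(FileSystem fs, Path p, String user, String group)
        throws IOException {
      fs.setOwner(p, user, group);
      if (fs.getFileStatus(p).isDir()) {
        FileStatus[] children = fs.listStatus(p);
        if (children != null) {
          for (int i = 0; i < children.length; i++) {
            chownRecursive(fs, children[i].getPath(), user, group);
          }
        }
      }
    }
  }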
Re: is 12 minutes ok for dfs chown -R on 45000 files ?
Frank Singleton wrote:
> Did a test on recursive chown on a Fedora 9 box here (2x quad core, 16G RAM).
> Took about 12.5 minutes to complete for 45000 files. [...]

Just to clarify, this is for when the chown will modify all files' owner attributes, eg: toggle all from frank:frank to hadoop:hadoop (see below). For chown -R from frank:frank to frank:frank, the result is only 5 or 6 seconds. At this point, all files under /home/frank/proj100 are frank:frank, and the command executes in 6 seconds or so.

[EMAIL PROTECTED] ~]$ time /home/hadoop/hadoop-0.18.1/bin/hadoop dfs -chown -R frank:frank /home/frank/proj100

real    0m5.624s
user    0m6.744s
sys     0m0.402s

# now lets change all to hadoop:hadoop
[EMAIL PROTECTED] ~]$ time /home/hadoop/hadoop-0.18.1/bin/hadoop dfs -chown -R hadoop:hadoop /home/frank/proj100

real    12m43.732s
user    0m53.781s
sys     0m10.655s

# now toggle back to frank:frank
[EMAIL PROTECTED] ~]$ time /home/hadoop/hadoop-0.18.1/bin/hadoop dfs -chown -R frank:frank /home/frank/proj100

real    12m40.700s
user    0m45.757s
sys     0m8.173s

# now frank:frank to frank:frank
[EMAIL PROTECTED] ~]$ time /home/hadoop/hadoop-0.18.1/bin/hadoop dfs -chown -R frank:frank /home/frank/proj100

real    0m5.648s
user    0m6.734s
sys     0m0.593s

[EMAIL PROTECTED] ~]$

cheers / frank
Re: Sharing an object across mappers
Hi Alan,

Thanks for your message. The object can be read-only once it is initialized - I do not need to modify it. Essentially it is an object that allows me to analyze/modify data that I am mapping/reducing. It comes to about 3-4GB of RAM. The problem I have is that if I run multiple mappers, this object gets replicated in the different VMs and I run out of memory on my node. I pretty much need to have the full object in memory to do my processing. It is possible (though quite difficult) to have it partially on disk and query it (like a lucene store implementation), but there is a significant performance hit.

As an example, let us say I use the xlarge CPU instance at Amazon (8 CPUs, 8GB RAM). In this scenario, I can really only have 1 mapper per node even though there are 8 CPUs. But if the overhead of sharing the object (e.g. RMI) or persisting the object (e.g. lucene) makes access more than 8 times slower than memory speed, then it is cheaper to run 1 mapper/node. I tried sharing with Terracotta and I was getting roughly a 600-fold decrease in performance versus in-memory access.

So ideally, if I could have all the mappers in the same VM, then I could create a singleton and still have multiple mappers access it at memory speeds. Please do let me know if I am looking at this correctly and if the above is possible.

Thanks a lot for all your help.

Cheers,
Dev

On Fri, Oct 3, 2008 at 12:49 PM, Alan Ho [EMAIL PROTECTED] wrote:

It really depends on what type of data you are sharing, how you are looking up the data, whether the data is read-write, and whether you care about consistency. If you don't care about consistency, I suggest that you shove the data into a BDB store (for key-value lookup) or a lucene store, and copy the data to all the nodes. That way all data access will be in-process, no GC problems, and you will get very fast results. BDB and lucene both have easy replication strategies. If the data is RW, and you need consistency, you should probably forget about MapReduce and just run everything on big-iron.

Regards,
Alan Ho

----- Original Message ----
From: Devajyoti Sarkar [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Thursday, October 2, 2008 8:41:04 PM
Subject: Sharing an object across mappers

I think each mapper/reducer runs in its own JVM, which makes it impossible to share objects. I need to share a large object so that I can access it at memory speeds across all the mappers. Is it possible to have all the mappers run in the same VM? Or is there a way to do this across VMs at high speed? I guess RMI and other such methods will be just too slow.

Thanks,
Dev
Re: 1 file per record
Suppose I use TextInputFormat, I set isSplitable false, and there are 5 files. What happens to numSplits now - will that be set to 0?

S.Chandravadana

owen.omalley wrote:

On Oct 2, 2008, at 1:50 AM, chandravadana wrote:

> If we dont specify numSplits in getSplits(), then what is the default number of splits taken...

The getSplits() is either library or user code, so it depends which class you are using as your InputFormat. The FileInputFormats (TextInputFormat and SequenceFileInputFormat) basically divide input files by blocks, unless the requested number of mappers is really high.

-- Owen
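For reference, a minimal sketch of the pattern being asked about (the class name is illustrative): when isSplitable returns false, FileInputFormat makes each whole file one split regardless of the numSplits hint, so 5 files yield 5 splits (not 0).

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.TextInputFormat;

  // Each whole file becomes exactly one split, so with 5 input files
  // you get 5 map tasks no matter what numSplits is requested.
  public class WholeFileTextInputFormat extends TextInputFormat {
    protected boolean isSplitable(FileSystem fs, Path file) {
      return false;
    }
  }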
Re: Sharing an object across mappers
On Oct 3, 2008, at 1:10 AM, Devajyoti Sarkar wrote:

> Hi Alan,
> Thanks for your message. The object can be read-only once it is initialized - I do not need to modify it. [...]

Please take a look at DistributedCache:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache

An example:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v2.0

Arun
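For context, a rough sketch of the usual DistributedCache pattern Arun points to (the file path and class names are illustrative assumptions):

  import java.io.IOException;
  import java.net.URI;
  import java.net.URISyntaxException;
  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;

  public class CacheExample {
    // Driver side: register an HDFS file with the cache.
    public static void setup(JobConf conf) throws URISyntaxException {
      DistributedCache.addCacheFile(new URI("/data/shared-object.bin"), conf);
    }

    // Task side: each task finds the node-local copy in configure().
    public static class CachingMapper extends MapReduceBase {
      private Path localCopy;
      public void configure(JobConf job) {
        try {
          Path[] cached = DistributedCache.getLocalCacheFiles(job);
          localCopy = cached[0];  // read (or mmap) this local file
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }
    }
  }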
Re: Sharing an object across mappers
Hi Arun,

Briefly going through the DistributedCache information, it seems to be a way to distribute files to mappers/reducers. One still needs to read the contents into each map/reduce task VM. Therefore, the data gets replicated across the VMs in a single node. It seems it does not address my basic problem, which is to have a large shared object across multiple map/reduce tasks at a given node without having to replicate it across the VMs. Is there a setting in Hadoop where one can tell Hadoop to create the individual map/reduce tasks in the same JVM?

Thanks,
Dev

On Fri, Oct 3, 2008 at 10:32 PM, Arun C Murthy [EMAIL PROTECTED] wrote:

> Please take a look at DistributedCache:
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache
>
> An example:
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v2.0
>
> Arun
Re: architecture diagram
Can you confirm that the example you've presented is accurate? I think you may have made some typos, because the letter G isn't in the final result; I also think your first pivot accidentally swapped C and G. I'm having a hard time understanding what you want to do, because it seems like your operations differ from your example.

With that said, at first glance, this problem may not fit well into the MapReduce paradigm. The reason I'm making this claim is that in order to do the pivot operation you must know about every row. Your input files will be split at semi-arbitrary places, essentially making it impossible for each mapper to know every single row. There may be a way to do this by collecting, in your map step, key = column number (0, 1, 2, etc.) and value = (A, B, C, etc.), though you may run into problems when you try to pivot back. I say this because when you pivot back, you need to have each column, which means you'll need one reduce step. There may be a way to put the pivot-back operation in a second iteration, though I don't think that would help you. A sketch of that map step appears after this message.

Terrence, please confirm that you've defined your example correctly. In the meantime, can someone else confirm that this problem does not fit well into the MapReduce paradigm?

Alex

On Thu, Oct 2, 2008 at 10:48 AM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote:

I am trying to write a map reduce implementation to do the following:

1) read tabular data delimited in some fashion
2) pivot that data, so the rows are columns and the columns are rows
3) shuffle the rows (that were the columns) to randomize the data
4) pivot the data back

For example:

A|B|C
D|E|G

pivots to...

D|A
E|B
C|G

Then for each row, shuffle the contents around randomly...

D|A
B|E
G|C

Then pivot the data back...

A|E|C
D|B|C

You can reference my progress so far...
http://svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/

Terrence A. Pietrondi

--- On Thu, 10/2/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

I think it really depends on the job as to where logic goes. Sometimes your reduce step is as simple as an identity function, and sometimes it can be more complex than your map step. It all depends on your data and the operation(s) you're trying to perform. Perhaps we should step out of the abstract. Do you have a specific problem you're trying to solve? Can you describe it?

Alex

On Thu, Oct 2, 2008 at 4:55 AM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote:

I am sorry for the confusion. I meant distributed data. So help me out here. For example, if I am reducing to a single file, then my main transformation logic would be in my mapping step since I am reducing away from the data?

Terrence A. Pietrondi
http://del.icio.us/tepietrondi

--- On Wed, 10/1/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

I'm not sure what you mean by disconnected parts of data, but Hadoop is implemented to try and perform map tasks on machines that have input data. This is to lower the amount of network traffic, hence making the entire job run faster. Hadoop does all this for you under the hood. From a user's point of view, all you need to do is store data in HDFS (the distributed filesystem) and run MapReduce jobs on that data. Take a look here: http://wiki.apache.org/hadoop/WordCount

Alex

On Wed, Oct 1, 2008 at 1:11 PM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote:

So to be distributed in a sense, you would want to do your computation on the disconnected parts of data in the map phase I would guess?

Terrence A. Pietrondi
http://del.icio.us/tepietrondi

--- On Wed, 10/1/08, Arun C Murthy [EMAIL PROTECTED] wrote:

On Oct 1, 2008, at 10:17 AM, Terrence A. Pietrondi wrote:

I am trying to plan out my map-reduce implementation and I have some questions of where computation should be split in order to take advantage of the distributed nodes. Looking at the architecture diagram (http://hadoop.apache.org/core/images/architecture.gif), are the map boxes the major computation areas or is the reduce the major computation area?

Usually the maps perform the 'embarrassingly
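As referenced above, a minimal sketch of the map step Alex describes - emitting key = column number so each reduce group receives one whole column. The '|' delimiter and the types are assumptions based on the example data:

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  // Emit (column index, cell value) so each reduce group holds one
  // full column, which can then be shuffled and written back out.
  public class PivotMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, IntWritable, Text> {
    public void map(LongWritable key, Text row,
                    OutputCollector<IntWritable, Text> out, Reporter reporter)
        throws IOException {
      String[] cells = row.toString().split("\\|");
      for (int col = 0; col < cells.length; col++) {
        out.collect(new IntWritable(col), new Text(cells[col]));
      }
    }
  }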
Re: architecture diagram
Sorry for the confusion, I did make some typos. My example should have looked like...

A|B|C
D|E|G

pivots to...

D|A
E|B
G|C

Then for each row, shuffle the contents around randomly...

D|A
B|E
C|G

Then pivot the data back...

A|E|G
D|B|C

The general goal is to shuffle the elements in each column in the input data. Meaning, the ordering of the elements in each column will not be the same as in the input. If you look at the initial input and compare it to the final output, you'll see that during the shuffling, B and E are swapped, and G and C are swapped, while A and D were shuffled back into their originating positions in the column.

Once again, sorry for the typos and confusion.

Terrence A. Pietrondi

--- On Fri, 10/3/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

> Can you confirm that the example you've presented is accurate? I think you may have made some typos, because the letter G isn't in the final result; I also think your first pivot accidentally swapped C and G. [...]
Unable to retrieve filename using mapred.input.file
I'm running map reduce and have the following lines of code:

  public void configure(JobConf job) {
    mapTaskId = job.get("mapred.task.id");
    inputFile = job.get("mapred.input.file");
  }

The problem I'm facing is that the inputFile I'm getting is null (the mapTaskId works fine). The input files are all the files in a given directory, and they are all gzipped - something like .../blah/*.gz. Any suggestion on how to get the name of the file being processed to the map task?

Thanks,
-Yair
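One thing worth checking, as a sketch: in this era of Hadoop the framework sets the per-split file name under the key map.input.file (not mapred.input.file) for file-based input splits, so something like the following may work:

  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;

  public class FileNameMapper extends MapReduceBase {
    private String mapTaskId;
    private String inputFile;

    public void configure(JobConf job) {
      mapTaskId = job.get("mapred.task.id");
      // The framework sets map.input.file per split for file-based inputs.
      inputFile = job.get("map.input.file");
    }
  }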
Re: Maps running after reducers complete successfully?
Thanks Owen. So this may be an enhancement?

- Prasad.

On Thursday 02 October 2008 09:58:03 pm Owen O'Malley wrote:

It isn't optimal, but it is the expected behavior. In general when we lose a TaskTracker, we want the map outputs regenerated so that any reduces that need to re-run (including speculative execution) can fetch them. We could handle it as a special case if:

1. We didn't lose any running reduces.
2. All of the reduces (including speculative tasks) are done with shuffling.
3. We don't plan on launching any more speculative reduces.

If all 3 hold, we don't need to re-run the map tasks. Actually doing so would be a pretty involved patch to the JobTracker/Schedulers.

-- Owen
Re: Sharing an object across mappers
On Oct 3, 2008, at 7:49 AM, Devajyoti Sarkar wrote:

> Briefly going through the DistributedCache information, it seems to be a way to distribute files to mappers/reducers.

Sure, but it handles the distribution problem for you.

> One still needs to read the contents into each map/reduce task VM.

If the data is straight binary data, you could just mmap it from the various tasks. It would be pretty efficient. The other direction is to use the MultiThreadedMapRunner and run multiple maps as threads in the same VM. But unless your maps are CPU-heavy or contacting external servers, it probably won't help as much as you'd like.

-- Owen
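As an illustration of the mmap idea, a sketch using java.nio - the path would be the cache-local copy of the file; note that a single MappedByteBuffer is limited to 2 GB, so a 3-4 GB object would need several mapped regions:

  import java.io.RandomAccessFile;
  import java.nio.MappedByteBuffer;
  import java.nio.channels.FileChannel;

  public class SharedMmap {
    // Map a local file read-only; the OS page cache backs every task
    // JVM on the node with the same physical pages.
    // Works for files up to 2 GB; larger files need multiple map()
    // calls over sub-ranges of the channel.
    public static MappedByteBuffer map(String localPath) throws Exception {
      RandomAccessFile raf = new RandomAccessFile(localPath, "r");
      FileChannel ch = raf.getChannel();
      return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
    }
  }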
Re: Sharing an object across mappers
Hi Owen,

Thanks a lot for the pointers. In order to use the MultiThreadedMapRunner, if I change the setMapRunnerClass() method in the JobConf, does the rest of my code remain the same (apart from making it thread-safe)?

Thanks in advance,
Dev

On Sat, Oct 4, 2008 at 12:29 AM, Owen O'Malley [EMAIL PROTECTED] wrote:

> If the data is straight binary data, you could just mmap it from the various tasks. It would be pretty efficient. The other direction is to use the MultiThreadedMapRunner and run multiple maps as threads in the same VM. [...]
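For reference, the setup change is small; a minimal sketch, assuming org.apache.hadoop.mapred.lib.MultithreadedMapRunner and its 0.18-era thread-count property (the thread count here is illustrative):

  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

  public class SharedObjectJobSetup {
    public static void configure(JobConf conf) {
      // Run maps as threads inside one task JVM instead of the default
      // single-threaded MapRunner loop.
      conf.setMapRunnerClass(MultithreadedMapRunner.class);
      // Number of concurrent map threads per task JVM (illustrative).
      conf.setInt("mapred.map.multithreadedrunner.threads", 8);
      // All map threads in that JVM could then share one in-memory
      // singleton, e.g. a static field initialized lazily in configure().
    }
  }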
mapreduce input file question
Hi all,

I have a maybe naive question on providing input to a mapreduce program: how can I specify the input with respect to the HDFS path? Right now I can specify an input file from my local directory (say, the hadoop trunk). I can also specify an absolute path for a DFS file using where it is actually stored on my local node, e.g. /usr/username/tmp/x. How can I do something like hdfs://inputdata/myinputdata.txt? I always get a "cannot find file" kind of error.

Furthermore, maybe the input files can already be some sharded outputs from another mapreduce, e.g., myinputdata-0001.txt, myinputdata-0002.txt?

Thanks a lot!
Re: Maps running after reducers complete successfully?
Do we not have an option to store the map results in HDFS?

Billy

Owen O'Malley [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

> It isn't optimal, but it is the expected behavior. In general when we lose a TaskTracker, we want the map outputs regenerated so that any reduces that need to re-run (including speculative execution) can fetch them. [...]
Re: mapreduce input file question
First, you need to point a MapReduce job at a directory, not an individual file. Second, when you specify a path in your job conf using the Path object, the path you supply is an HDFS path, not a local path. Yes, you can use the output files of another MapReduce job as input for a second job, but again you want to point your second job's input at the directory that the first job outputted to.

Hope this helps.

Alex

On Fri, Oct 3, 2008 at 11:15 AM, Ski Gh3 [EMAIL PROTECTED] wrote:

> I have a maybe naive question on providing input to a mapreduce program: how can I specify the input with respect to the HDFS path? [...]
Re: architecture diagram
The approach that you've described does not fit well into the MapReduce paradigm. You may want to consider randomizing your data in a different way. Unfortunately some things can't be solved well with MapReduce, and I think this is one of them. Can someone else say more?

Alex

On Fri, Oct 3, 2008 at 8:15 AM, Terrence A. Pietrondi [EMAIL PROTECTED] wrote:

> Sorry for the confusion, I did make some typos. My example should have looked like...
>
> A|B|C
> D|E|G
>
> pivots to...
>
> D|A
> E|B
> G|C
>
> [...] The general goal is to shuffle the elements in each column in the input data. Meaning, the ordering of the elements in each column will not be the same as in the input. [...]
Re: mapreduce input file question
I wonder if I am missing something. I have a .txt file for input, and I placed it under the input directory of HDFS. Then I called:

  FileInputFormat.setInputPaths(c, new Path("input"));

and I got an error:

  Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
  Input path doesnt exist : file:/C:/workspace/MyHBase/input

The input directory has been interpreted as a local directory from where the program was initiated... Can you please tell me what I am doing wrong?

Thanks a lot in advance!

On Fri, Oct 3, 2008 at 2:15 PM, Alex Loddengaard [EMAIL PROTECTED] wrote:

> First, you need to point a MapReduce job at a directory, not an individual file. Second, when you specify a path in your job conf using the Path object, the path you supply is an HDFS path, not a local path. [...]
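The file:/C: prefix in that error suggests the default filesystem is the local one (no hadoop-site.xml on the classpath). A minimal sketch of two ways around that - the namenode URI is a hypothetical placeholder:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.JobConf;

  public class InputSetup {
    public static void main(String[] args) {
      JobConf conf = new JobConf(InputSetup.class);
      // Without this (or a loaded hadoop-site.xml), a relative path like
      // "input" resolves against the local filesystem, as in the error above.
      conf.set("fs.default.name", "hdfs://namenode:9000");
      FileInputFormat.setInputPaths(conf, new Path("input"));
      // A fully qualified URI also works regardless of the default:
      // FileInputFormat.setInputPaths(conf,
      //     new Path("hdfs://namenode:9000/user/me/input"));
    }
  }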
Turning off FileSystem statistics during MapReduce
Hello,

We have been doing some profiling of our MapReduce jobs, and we are seeing about 20% of the time of our jobs spent calling FileSystem$Statistics.incrementBytesRead when we interact with the FileSystem. Is there a way to turn this stats-collection off?

Thanks,
Nathan Marz
Rapleaf
Re: Turning off FileSystem statistics during MapReduce
Nathan,

On Oct 3, 2008, at 5:18 PM, Nathan Marz wrote:

> We have been doing some profiling of our MapReduce jobs, and we are seeing about 20% of the time of our jobs spent calling FileSystem$Statistics.incrementBytesRead when we interact with the FileSystem. Is there a way to turn this stats-collection off?

This is interesting... could you provide more details? Are you seeing this in maps or reduces? Which FileSystem exhibited this, i.e. HDFS or LocalFS? Any details about your application?

To answer your original question - no, there isn't a way to disable this. However, if this turns out to be a systemic problem we definitely should consider having an option to allow users to switch it off. So any information you can provide helps - thanks!

Arun
A question about Mapper
The input is as follows:

flag
a
b
flag
c
d
e
flag
f

Then I used a mapper to first store values and then emit them all when it met a line containing "flag". But when the file reached its end, I had no chance to emit the last record (in this case, f). So how can I detect the end of the mapper's life, or how can I emit a last record before a mapper exits?

Thanks
[Hadoop NY User Group Meetup] HIVE: Data Warehousing using Hadoop 10/9
Next NY Hadoop meetup will take place on Thursday, 10/9 at 6:30 pm. Jeff Hammerbacher will present HIVE: Data Warehousing using Hadoop.

About HIVE:
- Data organization into Tables with logical and hash partitioning
- A Metastore to store metadata about Tables/Partitions etc.
- A SQL-like query language over object data stored in Tables
- DDL commands to define and load external data into tables

About the speaker:
Jeff Hammerbacher conceived, built, and led the Data team at Facebook. The Data team was responsible for driving many of the applications of statistics and machine learning at Facebook, as well as building out the infrastructure to support these tasks for massive data sets. The team produced two open source projects: Hive, a system for offline analysis built above Hadoop, and Cassandra, a structured storage system on a P2P network. Before joining Facebook, Jeff wore a suit on Wall Street and did Mathematics at Harvard. Currently Jeff is an Entrepreneur in Residence at Accel Partners.

Location:
ContextWeb, 9th floor
22 Cortlandt Street
New York, NY 10007

If you are interested, RSVP here: http://softwaredev.meetup.com/110/calendar/8881385/

-Alex
Re: A question about Mapper
Hello,

Does MapReduceBase.close() fit your needs? Take a look at
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/MapReduceBase.html#close()

On Fri, October 3, 2008 11:36 pm, Zhou, Yunqing said:

> Then I used a mapper to first store values and then emit them all when it met a line containing "flag". But when the file reached its end, I had no chance to emit the last record (in this case, f). So how can I detect the end of the mapper's life, or how can I emit a last record before a mapper exits? [...]

Have a good one,

--
Joman Chu
Carnegie Mellon University
School of Computer Science 2011
AIM: ARcanUSNUMquam
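To illustrate the close() approach, a minimal sketch: buffer lines between "flag" markers, keep a reference to the OutputCollector, and flush the final buffered record from close(). The key and output formatting are illustrative.

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class RecordGroupMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private final List<String> buffer = new ArrayList<String>();
    private OutputCollector<Text, Text> out;

    public void map(LongWritable key, Text line,
                    OutputCollector<Text, Text> collector, Reporter reporter)
        throws IOException {
      out = collector;  // remember the collector for use in close()
      if (line.toString().startsWith("flag")) {
        flush();  // a marker line ends the previous record
      } else {
        buffer.add(line.toString());
      }
    }

    public void close() throws IOException {
      flush();  // emit the record still buffered at end of input
    }

    private void flush() throws IOException {
      if (out != null && !buffer.isEmpty()) {
        out.collect(new Text("record"), new Text(buffer.toString()));
        buffer.clear();
      }
    }
  }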
Seeking Hadoop Guru
Appreciate any assist on this oppty in New York City... if you or someone you know might be interested in a F/T gig... pls contact me ASAP!

Software Engineer - Hadoop Guru
NYC, F/T, 2-5 yrs experience, 130K+

Responsibilities
* Develop and support a secure and flexible large-scale data processing infrastructure for research and development within the company.
* As a core member of a small and deeply talented team, you will be responsible across many technical aspects of helping to deliver the results of our R&D as a world-class platform for partners and customers.

Qualifications
* Bachelor's Degree in Engineering, Computer Science, or related technical field.
* Required: real-world experience building data solutions using Hadoop.
* Strong design/admin experience with relational database systems, esp. MySQL and/or PostgreSQL.
* At least 4 years software engineering experience designing and developing modern web-based consumer-facing server solutions in rapid development cycles.
* Expert in Java (C++, Python a plus) development and debugging on a Linux platform.
* A deep and powerful need to create useful, readable and accurate documentation as you work.

Regds,
Howard Berger
Beacon Staffing
[EMAIL PROTECTED]
hadoop under windows.
Hi.

I have a strange problem with hadoop when I run jobs under Windows (my laptop runs XP, but all cluster machines including the namenode run Ubuntu). When I run a job (which runs perfectly under Linux, and all configs and Java versions are the same), all mappers finish successfully, and so does the reducer, but when it tries to copy the resulting file to the output directory I get things like:

  03.10.2008 21:47:24 *INFO* audit: ugi=Dmitry,mkpasswd,root,None,Administrators,Users
  ip=/171.65.102.211 cmd=rename
  src=/user/public/tmp/streaming-job12345/out48/_temporary/_attempt_200810032005_0013_r_00_0/part-0
  dst=/user/public/tmp/streaming-job12345/out48/_temporary/_attempt_200810032005_0013_r_00_0/part-0
  perm=Dmitry:supergroup:rw-r--r-- (FSNamesystem.java, line 94)

And then it deletes the file, and I get no output. Why does it rename the files onto themselves, and does it have anything to do with Path.getParent()?

Thanks.