Re: hadoop jobs take long time to setup
Although I've never done it, I believe you could manually copy your jar files out to your cluster somewhere in hadoop's classpath, and that would remove the need for you to copy them to your cluster at the start of each job.

On Sun, Jun 28, 2009 at 4:08 PM, Marcus Herou wrote:
> Hi.
>
> Running without a jobtracker makes the job start almost instantly.
> I think it is due to something with the classloader. I use a huge number of
> jarfiles (jobConf.set("tmpjars", "jar1.jar,jar2.jar")...) which need to be
> loaded every time, I guess.
>
> By issuing conf.setNumTasksToExecutePerJvm(-1), will the TaskTracker child
> live forever then?
>
> Cheers
>
> //Marcus
>
> On Sun, Jun 28, 2009 at 9:54 PM, tim robertson wrote:
>
> > How long does it take to start the code locally in a single thread?
> >
> > Can you reuse the JVM so it only starts once per node per job?
> > conf.setNumTasksToExecutePerJvm(-1)
> >
> > Cheers,
> > Tim
> >
> > On Sun, Jun 28, 2009 at 9:43 PM, Marcus Herou wrote:
> > > Hi.
> > >
> > > I wonder how one should improve the startup times of a Hadoop job. Some of my
> > > jobs, which have a lot of dependencies in terms of many jar files, take a long
> > > time to start in Hadoop, up to 2 minutes sometimes.
> > > The data input amounts in these cases are negligible, so it seems that Hadoop
> > > has a really high setup cost, which I can live with, but this seems too much.
> > >
> > > Let's say a job takes 10 minutes to complete; then it is bad if it takes 2
> > > mins to set it up... 20-30 sec max would be a lot more reasonable.
> > >
> > > Hints?
> > >
> > > //Marcus
> > >
> > > --
> > > Marcus Herou CTO and co-founder Tailsweep AB
> > > +46702561312
> > > marcus.he...@tailsweep.com
> > > http://www.tailsweep.com/
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.he...@tailsweep.com
> http://www.tailsweep.com/
Re: Confused about partitioning and reducers
Please disregard this question. I think I'm mistaken.

On Sat, Jun 27, 2009 at 10:25 AM, Stuart White wrote:
> If I call HashPartitioner.getPartition(), passing a key of 4 and a
> numPartitions of 5, it returns a partition of 4. (Which is what I would
> expect.)
>
> However, if I have a mapred job, and in my mapper I emit a record with key
> 4, I'm configured to use the HashPartitioner, I have 5 Reducers configured,
> and I'm using the IdentityReducer, the record with key 4 gets handled by
> Reducer #0 (because it gets written out to part-0).
>
> I would have expected a record with key 4 to be handled by reducer #4 (and
> therefore written to part-4) because the HashPartitioner returns 4 for a
> key of 4 and a numPartitions of 5.
>
> Obviously I'm missing something here. What is the logic for deciding which
> partition of records is handled by which reducer instance?
>
> It can't be random, otherwise mapside join wouldn't work.
>
> Thanks.
Confused about partitioning and reducers
If I call HashPartitioner.getPartition(), passing a key of 4 and a numPartitions of 5, it returns a partition of 4. (Which is what I would expect.)

However, if I have a mapred job, and in my mapper I emit a record with key 4, I'm configured to use the HashPartitioner, I have 5 Reducers configured, and I'm using the IdentityReducer, the record with key 4 gets handled by Reducer #0 (because it gets written out to part-0).

I would have expected a record with key 4 to be handled by reducer #4 (and therefore written to part-4) because the HashPartitioner returns 4 for a key of 4 and a numPartitions of 5.

Obviously I'm missing something here. What is the logic for deciding which partition of records is handled by which reducer instance? It can't be random, otherwise mapside join wouldn't work. Thanks.
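For reference, the dispatch being asked about is deterministic: reducer i consumes partition i, and HashPartitioner computes the partition as (key.hashCode() & Integer.MAX_VALUE) % numPartitions. A self-contained sketch of that arithmetic (the class and method names here are mine, not Hadoop's):

```java
public class HashPartitionDemo {
    // Mirrors HashPartitioner: mask off the sign bit, then mod by the partition count.
    public static int partition(int keyHash, int numPartitions) {
        return (keyHash & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // IntWritable.hashCode() is the wrapped int itself, so key 4 hashes to 4.
        System.out.println(partition(4, 5));              // -> 4
        // A Text key "4" uses String-style hashing and lands elsewhere.
        System.out.println(partition("4".hashCode(), 5)); // -> 2
    }
}
```

So an IntWritable key of 4 really does land in partition 4; a Text key "4" hashes to a different partition entirely, which is one common source of this kind of surprise.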
Does balancer ensure a file's replication is satisfied?
In my Hadoop cluster, I've had several drives fail lately (and they've been replaced). Each time a new empty drive is placed in the cluster, I run the balancer. I understand that the balancer will redistribute the load of file blocks across the nodes.

My question is: will balancer also look at the desired replication of a file, and if the actual replication of a file is less than the desired (because the file had blocks stored on the lost drive), will balancer re-replicate those lost blocks? If not, is there another tool that will ensure the desired replication factor of files is satisfied?

If this functionality doesn't exist, I'm concerned that I'm slowly, silently losing my files as I replace drives, and I may not even realize it. Thoughts?
Re: which Java version for hadoop-0.19.1 ?
http://hadoop.apache.org/core/docs/r0.19.1/quickstart.html#Required+Software

On Wed, Jun 10, 2009 at 12:02 PM, Roldano Cattoni wrote:
> It works, many thanks.
>
> Last question: is this information documented somewhere in the package? I
> was not able to find it.
>
> Roldano
>
> On Wed, Jun 10, 2009 at 06:37:08PM +0200, Stuart White wrote:
> > Java 1.6.
> >
> > On Wed, Jun 10, 2009 at 11:33 AM, Roldano Cattoni wrote:
> > > A very basic question: which Java version is required for hadoop-0.19.1?
> > >
> > > With jre1.5.0_06 I get the error:
> > >   java.lang.UnsupportedClassVersionError: Bad version number in .class file
> > >   at java.lang.ClassLoader.defineClass1(Native Method)
> > >   (..)
> > >
> > > By the way, hadoop-0.17.2.1 was running successfully with jre1.5.0_06.
> > >
> > > Thanks in advance for your kind help
> > >
> > > Roldano
Re: which Java version for hadoop-0.19.1 ?
Java 1.6. On Wed, Jun 10, 2009 at 11:33 AM, Roldano Cattoni wrote: > A very basic question: which Java version is required for hadoop-0.19.1? > > With jre1.5.0_06 I get the error: > java.lang.UnsupportedClassVersionError: Bad version number in .class file > at java.lang.ClassLoader.defineClass1(Native Method) > (..) > > By the way hadoop-0.17.2.1 was running successfully with jre1.5.0_06 > > > Thanks in advance for your kind help > > Roldano >
Renaming all nodes in Hadoop cluster
Is it possible to rename all nodes in a Hadoop cluster and not lose the data stored on hdfs? Of course I'll need to update the "master" and "slaves" files, but I'm not familiar with how hdfs tracks where it has written all the splits of the files. Is it possible to retain the data written to hdfs when renaming all nodes in the cluster, and if so, what additional configuration changes, if any, are required?
Re: InputFormat for fixed-width records?
On Thu, May 28, 2009 at 9:50 AM, Owen O'Malley wrote: > > The update to the terasort example has an InputFormat that does exactly > that. The key is 10 bytes and the value is the next 90 bytes. It is pretty > easy to write, but I should upload it soon. The output types are Text, but > they just have the binary data in them. > Would you mind uploading it or sending it to the list?
InputFormat for fixed-width records?
I need to process a dataset that contains text records of fixed length in bytes. For example, each record may be 100 bytes in length, with the first field being the first 10 bytes, the second field being the second 10 bytes, etc. There are no newlines in the file. Field values have been either whitespace-padded or truncated to fit within their fixed positions in these fixed-width records.

Does Hadoop have an InputFormat to support processing of such files? I looked but couldn't find one. Of course, I could pre-process the file (outside of Hadoop) to put newlines at the end of each record, but I'd prefer not to require such a prep step. Thanks.
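Hadoop of this era shipped no stock fixed-length InputFormat (the terasort reader mentioned in the reply above is the closest thing), so a custom RecordReader is needed. The Hadoop-independent core of one, the record and field slicing, might look like the following sketch; the class name, field widths, and padding rules are illustrative assumptions, not anything from the Hadoop API:

```java
import java.nio.charset.StandardCharsets;

public class FixedWidthRecord {
    static final int RECORD_LEN = 100;
    static final int FIELD_LEN = 10;

    // Slice one 100-byte record into ten 10-byte fields, trimming pad spaces.
    public static String[] parse(byte[] record) {
        if (record.length != RECORD_LEN)
            throw new IllegalArgumentException("wrong record length: " + record.length);
        String[] fields = new String[RECORD_LEN / FIELD_LEN];
        for (int i = 0; i < fields.length; i++) {
            String raw = new String(record, i * FIELD_LEN, FIELD_LEN, StandardCharsets.US_ASCII);
            fields[i] = raw.trim();
        }
        return fields;
    }

    public static void main(String[] args) {
        // Build a sample 100-byte record: two padded fields plus 80 bytes of filler.
        byte[] rec = String.format("%-10s%-10s%-80s", "ACME", "42", "x")
                           .getBytes(StandardCharsets.US_ASCII);
        String[] f = parse(rec);
        System.out.println(f[0] + "|" + f[1]); // ACME|42
    }
}
```

A real RecordReader wrapping this would also align each split's start offset to a multiple of the record length before reading, so that no split begins mid-record.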
Efficient algorithm for many-to-many reduce-side join?
I need to do a reduce-side join of two datasets. It's a many-to-many join; that is, each dataset can contain multiple records with any given key.

Every description of a reduce-side join I've seen involves constructing your keys in your mapper such that records from one dataset will be presented to the reducers before records from the second dataset. You "hold on" to the value from the one dataset and remember it as you iterate across the values from the second dataset. This seems like it only works well for one-to-many joins (when one of your datasets will only have a single record with any given key). This scales well because you're only remembering one value.

In a many-to-many join, if you apply this same algorithm, you'll need to remember all values from one dataset, which of course will be problematic (and won't scale) when dealing with large datasets with large numbers of records with the same keys.

Does an efficient algorithm exist for a many-to-many reduce-side join?
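The usual compromise is exactly the buffering scheme described, applied to whichever side is smaller per key: tag each value with its source dataset, sort so one side's records arrive first, buffer that side, and stream the other, emitting the cross product. The sketch below sits outside the Hadoop API entirely; the tags, names, and comma-joined output are illustrative:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ManyToManyJoin {
    // Values for one key arrive tagged by source ("L" records sort first).
    // Buffer the (hopefully smaller) left side, then stream the right side
    // and emit the cross product.
    public static List<String> joinOneKey(Iterator<String[]> taggedValues) {
        List<String> left = new ArrayList<>();   // must fit in memory
        List<String> out = new ArrayList<>();
        while (taggedValues.hasNext()) {
            String[] tv = taggedValues.next();   // [tag, value]
            if (tv[0].equals("L")) {
                left.add(tv[1]);
            } else {
                for (String l : left) out.add(l + "," + tv[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> vals = Arrays.asList(
            new String[]{"L", "a1"}, new String[]{"L", "a2"},
            new String[]{"R", "b1"}, new String[]{"R", "b2"});
        System.out.println(joinOneKey(vals.iterator())); // 2 x 2 = 4 joined rows
    }
}
```

When even the buffered side can outgrow memory, the usual fallbacks are spilling it to local disk and doing a block-nested-loop pass over the streamed side, or salting the key so that one hot key is split across several reduce calls.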
Re: Map-side join: Sort order preserved?
On Thu, May 14, 2009 at 10:25 AM, jason hadoop wrote: > If you put up a discussion question on www.prohadoopbook.com, I will fill in > the example on how to do this. Done. Thanks! http://www.prohadoopbook.com/forum/topics/preserving-partition-file
Map-side join: Sort order preserved?
I'm implementing a map-side join as described in chapter 8 of "Pro Hadoop". I have two files that have been partitioned using the TotalOrderPartitioner on the same key into the same number of partitions. I've set mapred.min.split.size to Long.MAX_VALUE so that one Mapper will handle an entire partition.

I want the output to be written in the same partitioned, total sort order. If possible, I want to accomplish this by setting my NumReducers to 0 and having the output of my Mappers written directly to HDFS, thereby skipping the partition/sort step.

My question is this: Am I guaranteed that the Mapper that processes part-0 will have its output written to the output file named part-0, the Mapper that processes part-1 will have its output written to part-1, etc...? If so, then I can preserve the partitioning/sort order of my input files without re-partitioning and re-sorting. Thanks.
Access counters from within Reducer#configure() ?
I'd like to be able to access my job's counters from within my Reducer's configure() method (so I can know how many records were output from my mappers). Is this possible? Thanks!
Re: Hadoop Summit 2009 - Open for registration
Any chance these presentations (as well as the Cloudera ones on the following day) will be recorded and uploaded to YouTube?

On Tue, May 5, 2009 at 4:10 PM, Ajay Anand wrote:
> This year's Hadoop Summit
> (http://developer.yahoo.com/events/hadoopsummit09/) is confirmed for
> June 10th at the Santa Clara Marriott, and is now open for registration.
>
> We have a packed agenda, with three tracks - for developers,
> administrators, and one focused on new and innovative applications using
> Hadoop. The presentations include talks from Amazon, IBM, Sun, Cloudera,
> Facebook, HP, Microsoft, and the Yahoo! team, as well as leading
> universities including UC Berkeley, CMU, Cornell, U of Maryland, U of
> Nebraska and SUNY.
>
> From our experience last year with the rush for seats, I would encourage
> people to register early at http://hadoopsummit09.eventbrite.com/
>
> Looking forward to seeing you at the summit!
>
> Ajay
Re: Multiple outputs and getmerge?
On Tue, Apr 21, 2009 at 1:00 PM, Koji Noguchi wrote: > > I once used MultipleOutputFormat and created > (mapred.work.output.dir)/type1/part-_ > (mapred.work.output.dir)/type2/part-_ > ... > > And JobTracker took care of the renaming to > (mapred.output.dir)/type{1,2}/part-__ > > Would that work for you? Can you please explain this in more detail? It looks like you're using MultipleOutputFormat for *both* of your outputs? So, you simply don't use the OutputCollector passed as a parm to Mapper#map()?
Re: Multiple outputs and getmerge?
On Tue, Apr 21, 2009 at 12:06 PM, Todd Lipcon wrote: > Would dfs -cat do what you need? e.g: > > ./bin/hdfs dfs -cat /path/to/output/ExceptionDocuments-m-\* > > /tmp/exceptions-merged Yes, that would work. Thanks for the suggestion.
Re: Put computation in Map or in Reduce
Unless you need the hashing/sorting provided by the reduce phase, I'd recommend placing your logic in your mapper and, when setting up your job, calling JobConf#setNumReduceTasks(0), so that the reduce phase won't be executed. In that case, any records emitted by your mapper will be written to the output.

http://hadoop.apache.org/core/docs/r0.19.1/api/org/apache/hadoop/mapred/JobConf.html#setNumReduceTasks(int)

On Mon, Apr 20, 2009 at 10:25 PM, Mark Kerzner wrote:
> Hi,
>
> in an MR step, I need to extract text from various files (using Tika). I
> have put text extraction into reduce(), because I am writing the extracted
> text to the output on HDFS. But now it occurs to me that I might as well
> have put it into map() and have default reduce() which will write every
> map() result out. Is that true?
>
> Thank you,
> Mark
Multiple outputs and getmerge?
I've written a MR job with multiple outputs. The "normal" output goes to files named part-X, and my secondary output records go to files I've chosen to name "ExceptionDocuments" (and which are therefore named "ExceptionDocuments-m-X").

I'd like to pull merged copies of these files to my local filesystem (two separate merged files, one containing the "normal" output and one containing the ExceptionDocuments output). But, since hadoop lands both of these outputs to files residing in the same directory, when I issue "hadoop dfs -getmerge", what I get is a file that contains both outputs.

To get around this, I have to move files around on HDFS so that my different outputs are in different directories. Is this the best/only way to deal with this? It would be better if hadoop offered the option of writing different outputs to different output directories, or if getmerge offered the ability to specify a file prefix for files desired to be merged. Thanks!
Re: Can we somehow read from the HDFS without converting it to local?
Not sure if this is what you're looking for...

http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample

On Thu, Apr 9, 2009 at 10:56 PM, Sid123 wrote:
> I need to reuse the O/P of my DFS file without copying to local. Is there a
> way?
Re: Coordination between Mapper tasks
> You might want to look at a memcached solution some students and I worked > out for exactly this problem. Thanks, Jimmy! This paper does exactly describe my problem. I started working to implement the memcached solution you describe, and I've run into a small problem. I've described it on the spymemcached forum: http://groups.google.com/group/spymemcached/browse_thread/thread/7b4d82bca469ed20 Essentially, it seems the keys are being hashed inconsistently by spymemcached across runs. This, of course, will result in inconsistent/invalid results. Did you guys run into this? Since I'm new to memcached, I'm hoping that this is simply something I don't understand or am overlooking.
Re: Coordination between Mapper tasks
Thanks to everyone for your feedback. I'm unfamiliar with many of the technologies you've mentioned, so it may take me some time to digest all your responses. The first thing I'm going to look at is Ted's suggestion of a pure map-reduce solution by pre-joining my data with my lookup values.

On Fri, Mar 20, 2009 at 9:55 AM, Owen O'Malley wrote:
> On Thu, Mar 19, 2009 at 6:42 PM, Stuart White wrote:
>
> > My process requires a large dictionary of terms (~ 2GB when loaded
> > into RAM). The terms are looked-up very frequently, so I want the
> > terms memory-resident.
> >
> > So, the problem is, I want 3 processes (to utilize CPU), but each
> > process requires ~2GB, but my nodes don't have enough memory to each
> > have their own copy of the 2GB of data. So, I need to somehow share
> > the 2GB between the processes.
>
> I would recommend using the multi-threaded map runner. Have 1 map/node and
> just use 3 worker threads that all consume the input. The only disadvantage
> is that it works best for cpu-heavy loads (or maps that are doing crawling,
> etc.), since you only have one record reader for all three of the map
> threads.
>
> In the longer term, it might make sense to enable parallel jvm reuse in
> addition to serial jvm reuse.
>
> -- Owen
Re: Coordination between Mapper tasks
The nodes in my cluster have 4 cores & 4 GB RAM. So, I've set mapred.tasktracker.map.tasks.maximum to 3 (leaving 1 core for "breathing room").

My process requires a large dictionary of terms (~2GB when loaded into RAM). The terms are looked-up very frequently, so I want the terms memory-resident.

So, the problem is, I want 3 processes (to utilize CPU), but each process requires ~2GB, but my nodes don't have enough memory to each have their own copy of the 2GB of data. So, I need to somehow share the 2GB between the processes.

What I have currently implemented is a standalone RMI service that, during startup, loads the 2GB dictionaries. My mappers are simply RMI clients that call this RMI service. This works just fine. The only problem is that my standalone RMI service is totally "outside" Hadoop. I have to ssh onto each of the nodes, start/stop/reconfigure the services manually, etc.

So, I was thinking that, at job startup, the processes on each node would (using ZooKeeper) elect a leader responsible for hosting the 2GB dictionaries. This process would load the dictionaries and share them via RMI. The other processes would recognize that another process on the box is the leader, and they would act as RMI clients to that process. To make this work, I'm calling conf.setNumTasksToExecutePerJvm(-1) so that Hadoop does not create new JVMs for each task.

Also note that the processes are "grouped" by node; that is, the ZooKeeper path that I'll use for coordination will include the hostname, so that only processes on the same node will compete for leadership.

Anyway, in short, I was looking for a way to elect a leader process per node responsible for hosting/sharing a large amount of memory-resident data via RMI. Hopefully that made sense...
Coordination between Mapper tasks
I'd like to implement some coordination between Mapper tasks running on the same node. I was thinking of using ZooKeeper to provide this coordination. I think I remember hearing that MapReduce and/or HDFS use ZooKeeper under-the-covers. So, I'm wondering... in my Mappers, if I want distributed coordination, can I "piggy-back" onto the ZooKeeper instance being used by the underlying MapRed/HDFS? The benefit being that I don't need to create/configure/run my own ZooKeeper instance.
Re: Release batched-up output records at end-of-job?
Yeah, I thought of that, but I was concerned that, even if it did work, if it wasn't guaranteed behavior, it might stop working in a future release. I'll go ahead and give that a try. Can anybody provide details on this new API? Thanks for the response!

On Tue, Mar 17, 2009 at 7:29 AM, Jingkei Ly wrote:
> You should be able to keep a reference to the OutputCollector provided
> to the #map() method, and then use it in the #close() method.
>
> I believe that there's a new API that will actually provide the output
> collector to the close() method via a context object, but in the mean
> time I think the above should work.
>
> -----Original Message-----
> From: Stuart White [mailto:stuart.whi...@gmail.com]
> Sent: 17 March 2009 12:13
> To: core-user@hadoop.apache.org
> Subject: Release batched-up output records at end-of-job?
>
> I have a mapred job that simply performs data transformations in its
> Mapper. I don't need sorting or reduction, so I don't use a Reducer.
>
> Without getting too detailed, the nature of my processing is such that
> it is much more efficient if I can process blocks of records
> at-a-time. So, what I'd like to do is, in my Mapper, in the map()
> function, simply add the incoming record to a list, and once that list
> reaches a certain size, process the batched-up records, and then call
> output.collect() multiple times to release the output records, each
> corresponding to one of the input records.
>
> At the end of the job, my Mappers will have partially full blocks of
> records. I'd like to go ahead and process these blocks at end-of-job,
> regardless of their sizes, and release the corresponding output
> records.
>
> How can I accomplish this? In my Mapper#map(), I have no way of
> knowing whether a record is the final record. The only end-of-job
> hook that I'm aware of is for my Mapper to override
> MapReduceBase#close(), but when in that method, there is no
> OutputCollector available.
>
> Is it possible to batch-up records, and at end-of-job, process and
> release any final partial blocks?
>
> Thanks!
Release batched-up output records at end-of-job?
I have a mapred job that simply performs data transformations in its Mapper. I don't need sorting or reduction, so I don't use a Reducer.

Without getting too detailed, the nature of my processing is such that it is much more efficient if I can process blocks of records at-a-time. So, what I'd like to do is, in my Mapper, in the map() function, simply add the incoming record to a list, and once that list reaches a certain size, process the batched-up records, and then call output.collect() multiple times to release the output records, each corresponding to one of the input records.

At the end of the job, my Mappers will have partially full blocks of records. I'd like to go ahead and process these blocks at end-of-job, regardless of their sizes, and release the corresponding output records.

How can I accomplish this? In my Mapper#map(), I have no way of knowing whether a record is the final record. The only end-of-job hook that I'm aware of is for my Mapper to override MapReduceBase#close(), but when in that method, there is no OutputCollector available.

Is it possible to batch-up records, and at end-of-job, process and release any final partial blocks? Thanks!
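The pattern suggested in the reply, keeping the OutputCollector reference handed to map() and flushing the final partial batch from close(), can be sketched outside the Hadoop API with a plain Consumer standing in for the OutputCollector. The batch size and the uppercase "processing" below are placeholders:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchingMapper {
    private static final int BATCH_SIZE = 3;
    private final List<String> batch = new ArrayList<>();
    private Consumer<String> collector;   // stands in for the OutputCollector

    // map(): remember the collector, buffer the record, flush on a full batch.
    public void map(String record, Consumer<String> output) {
        collector = output;
        batch.add(record);
        if (batch.size() >= BATCH_SIZE) flush();
    }

    // close(): flush whatever partial batch remains at end-of-input.
    public void close() {
        if (!batch.isEmpty()) flush();
    }

    private void flush() {
        for (String r : batch) collector.accept(r.toUpperCase()); // "process" the block
        batch.clear();
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        BatchingMapper m = new BatchingMapper();
        for (String s : new String[]{"a", "b", "c", "d", "e"}) m.map(s, out::add);
        m.close();
        System.out.println(out); // all five records, including the partial final batch
    }
}
```

As the thread notes, the old API never documents this as guaranteed behavior, so it is a pragmatic workaround rather than a contract; the newer context-object API makes the collector available in cleanup explicitly.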
Controlling maximum # of tasks per node on per-job basis?
My cluster nodes have 2 dual-core processors, so, in general, I want to configure my nodes with a maximum of 3 task processes executed per node at a time. But, for some jobs, my tasks load large amounts of memory, and I cannot fit 3 such tasks on a single node. For these jobs, I'd like to enforce running a maximum of 1 task process per node at a time.

I've tried to enforce this by setting mapred.tasktracker.map.tasks.maximum at runtime, but I see it has no effect, because this is a configuration for the TaskTracker, which is of course already running before my job starts.

Is there no way to configure a maximum # of map tasks per node on a per-job basis? Thanks!
Not a host:port pair when running balancer
I've been running hadoop-0.19.0 for several weeks successfully. Today, for the first time, I tried to run the balancer, and I'm receiving:

  java.lang.RuntimeException: Not a host:port pair: hvcwydev0601

In my hadoop-site.xml, I have this:

  <property>
    <name>fs.default.name</name>
    <value>hdfs://hvcwydev0601/</value>
  </property>

What do I need to change to get the balancer to work? It seems I need to add a port to fs.default.name. If so, what port? Can I just pick any port? If I specify a port, do I need to specify any other parms accordingly?

I searched the forum and found a few posts on this topic, but it seems that the configuration parms have changed over time, so I'm not sure what the current correct configuration is.

Also, if fs.default.name is supposed to have a port, I'll point out that the docs don't say so: http://hadoop.apache.org/core/docs/r0.19.1/cluster_setup.html. The example given for fs.default.name is "hdfs://hostname/". Thanks!
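One hedged guess at a fix: give fs.default.name an explicit port and use that same URI everywhere (clients, daemons, and the balancer). 8020 was a commonly used NameNode IPC port in this era, but any consistently used free port should do:

```xml
<!-- hadoop-site.xml: give fs.default.name an explicit port so tools that
     expect host:port (like the balancer) can parse it. The port shown is
     an assumption; any free port works if used consistently everywhere. -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://hvcwydev0601:8020/</value>
</property>
```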
OT: How to search mailing list archives?
This is slightly off-topic, and I realize this question is not specific to Hadoop, but what is the best way to search the mailing list archives? Here's where I'm looking: http://mail-archives.apache.org/mod_mbox/hadoop-core-user/ I don't see any way to search the archives. Am I missing something? Is there another archive site I should be looking at? Thanks!
Re: Best way to write multiple files from a MR job?
On Tue, Mar 3, 2009 at 9:16 PM, Nick Cen wrote: > have you try the MultipleOutputFormat and it is subclass? Nope (didn't know it existed). I'll take a look at it. Both of these suggestions sound great. Thanks for the tips!
Best way to write multiple files from a MR job?
I have a large amount of data, from which I'd like to extract multiple different types of data, writing each type of data to different sets of output files. What's the best way to accomplish this? (I should mention, I'm only using a mapper. I have no need for sorting or reduction.)

Of course, if I only wanted 1 output file, I can just .collect() the output from my mapper and let mapreduce write the output for me. But, to get multiple output files, the only way I can see is to manually write the files myself from within my mapper.

If that's the correct way, then how can I get a unique filename for each mapper instance? Obviously hadoop has solved this problem, because it writes out its partition files (part-0, etc...) with unique numbers. Is there a way for my mappers to get this unique number being used so they can use it to ensure a unique filename? Thanks!
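On the unique-number question: Hadoop of this era exposed the task's index within the job through the job configuration (my recollection is the "mapred.task.partition" property, which should be verified against your release), and the framework's own part files are just that index zero-padded to five digits. A small sketch of the naming, with the configure()-time read shown only as a comment:

```java
public class PartFileNames {
    // The framework's own output files are "<prefix>-<task index>" with the
    // index zero-padded to five digits.
    public static String partName(String prefix, int taskPartition) {
        return String.format("%s-%05d", prefix, taskPartition);
    }

    public static void main(String[] args) {
        // In a real mapper you'd read the index in configure(), roughly:
        //   int part = job.getInt("mapred.task.partition", -1);  // assumption, verify
        System.out.println(partName("part", 0));       // part-00000
        System.out.println(partName("exceptions", 3)); // exceptions-00003
    }
}
```

Using that index as a filename suffix gives each mapper a collision-free output file, the same way the framework names its own.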
MapReduce jobs with expensive initialization
I have a mapreduce job that requires expensive initialization (loading of some large dictionaries before processing). I want to avoid executing this initialization more than necessary. I understand that I need to set setNumTasksToExecutePerJvm to -1 to force mapreduce to reuse JVMs when executing tasks.

The way I've been performing my initialization is: in my mapper, I override MapReduceBase#configure, read my parms from the JobConf, and load my dictionaries. It appears, from the tests I've run, that even though NumTasksToExecutePerJvm is set to -1, new instances of my Mapper class are being created for each task, and therefore I'm still re-running this expensive initialization for each task.

So, my question is: how can I avoid re-executing this expensive initialization per-task? Should I move my initialization code out of my mapper class and into my "main" class? If so, how do I pass references to the loaded dictionaries from my main class to my mapper? Thanks!
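A common resolution for this situation: a new Mapper instance is indeed constructed per task even with JVM reuse, so instance fields reload every time, but a static, lazily initialized holder survives for the life of the reused JVM and can be shared by every task it runs. A minimal sketch, with a tiny map standing in for the real multi-gigabyte load (class and field names are mine):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class DictionaryHolder {
    public static final AtomicInteger loads = new AtomicInteger();
    private static Map<String, String> dictionary;

    // Lazily load once per JVM; every mapper instance in this JVM shares it.
    public static synchronized Map<String, String> get() {
        if (dictionary == null) {
            dictionary = new HashMap<>();
            dictionary.put("hadoop", "elephant");   // stand-in for the expensive load
            loads.incrementAndGet();
        }
        return dictionary;
    }

    public static void main(String[] args) {
        // Two "mapper instances" (two configure() calls) share one load.
        Map<String, String> a = DictionaryHolder.get();
        Map<String, String> b = DictionaryHolder.get();
        System.out.println(a == b);      // true: same shared reference
        System.out.println(loads.get()); // 1
    }
}
```

Each mapper's configure() would call DictionaryHolder.get() instead of loading into an instance field; only the first task in a reused JVM pays the initialization cost.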
Re: General questions about Map-Reduce
> On Sun, Jan 11, 2009 at 9:05 PM, tienduc_dinh wrote: > > Is there any article which describes it ? > I'd also recommend Google's MapReduce whitepaper: http://labs.google.com/papers/mapreduce.html
Re: -libjars with multiple jars broken when client and cluster reside on different OSs?
I agree. Using a List seems to make more sense. FYI, I opened a jira for this: https://issues.apache.org/jira/browse/HADOOP-4864

On Tue, Dec 30, 2008 at 3:53 PM, Jason Venner wrote:
> The path separator is a major issue with a number of items in the
> configuration data set that are multiple items packed together via the path
> separator:
>   the class path
>   the distributed cache
>   the input path set
>
> All suffer from the path.separator issue for 2 reasons:
>   1. the difference across jvms as indicated in the previous email item
>      (I had missed this!)
>   2. separator characters that happen to be embedded in the individual
>      elements are not escaped before the item is added to the existing set.
>
> For all of the pain we have with these packed items, it may be simpler to
> serialize a List for multi-element items rather than packing them
> with the path.separator system property item.
>
> Aaron Kimball wrote:
>> Hi Stuart,
>>
>> Good sleuthing out that problem :) The correct way to submit patches is to
>> file a ticket on JIRA (https://issues.apache.org/jira/browse/HADOOP). Create
>> an account, create a new issue describing the bug, and then attach the patch
>> file. There'll be a discussion there and others can review your patch and
>> include it in the codebase.
>>
>> Cheers,
>> - Aaron
>>
>> On Fri, Dec 12, 2008 at 12:14 PM, Stuart White wrote:
>>
>>> Ok, I'll answer my own question.
>>>
>>> This is caused by the fact that hadoop uses
>>> System.getProperty("path.separator") as the delimiter in the list of
>>> jar files passed via -libjars.
>>>
>>> If your job spans platforms, System.getProperty("path.separator")
>>> returns a different delimiter on the different platforms.
>>>
>>> My solution is to use a comma as the delimiter, rather than the
>>> path.separator.
>>>
>>> I realize comma is, perhaps, a poor choice for a delimiter because it
>>> is valid in filenames on both Windows and Linux, but -libjars uses
>>> it as the delimiter when listing the additional required jars. So, I
>>> figured if it's already being used as a delimiter, then it's
>>> reasonable to use it internally as well.
>>>
>>> I've attached a patch (against 0.19.0) that applies this change.
>>>
>>> Now, with this change, I can submit hadoop jobs (requiring multiple
>>> supporting jars) from my Windows laptop (via cygwin) to my 10-node
>>> Linux hadoop cluster.
>>>
>>> Any chance this change could be applied to the hadoop codebase?
Simple data transformations in Hadoop?
(I'm quite new to hadoop and map/reduce, so some of these questions might not make complete sense.)

I want to perform simple data transforms on large datasets, and it seems Hadoop is an appropriate tool. As a simple example, let's say I want to read every line of a text file, uppercase it, and write it out.

First question: would Hadoop be an appropriate tool for something like this? What is the best way to model this type of work in Hadoop?

I'm thinking my mappers will accept a Long key that represents the byte offset into the input file, and a Text value that represents the line in the file. I *could* simply uppercase the text lines and write them to an output file directly in the mapper (and not use any reducers). So, there's a question: is it considered bad practice to write output files directly from mappers?

Assuming it's advisable in this example to write a file directly in the mapper: how should the mapper create a unique output partition file name? Is there a way for a mapper to know its index in the total # of mappers?

Assuming it's inadvisable to write a file directly in the mapper: I can output the records to the reducers using the same key and using the uppercased data as the value. Then, in my reducer, should I write a file? Or should I collect() the records in the reducers and let hadoop write the output?

If I let hadoop write the output, is there a way to prevent hadoop from writing the key to the output file? I may want to perform several transformations, one-after-another, on a set of data, and I don't want to place a superfluous key at the front of every record for each pass of the data.

I appreciate any feedback anyone has to offer.
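On the last question: a map-only job fits this well, and my understanding of this era's API is that emitting a NullWritable key with setNumReduceTasks(0) keeps the byte-offset key out of the output, since TextOutputFormat writes only the value for null/NullWritable keys. The per-record transform itself, separated from the Hadoop API, is just:

```java
import java.util.Locale;

public class UppercaseTransform {
    // The per-record logic of the map-only job: value in, transformed value out.
    // In the mapper you'd emit (NullWritable.get(), new Text(transform(line)))
    // so no key is prepended to each output record.
    public static String transform(String line) {
        return line.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(transform("hello, hadoop")); // HELLO, HADOOP
    }
}
```

Keeping the transform in a plain static method like this also makes it trivial to unit-test without standing up a cluster.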
Re: -libjars with multiple jars broken when client and cluster reside on different OSs?
Ok, I'll answer my own question.

This is caused by the fact that hadoop uses System.getProperty("path.separator") as the delimiter in the list of jar files passed via -libjars.

If your job spans platforms, System.getProperty("path.separator") returns a different delimiter on the different platforms.

My solution is to use a comma as the delimiter, rather than the path.separator.

I realize comma is, perhaps, a poor choice for a delimiter because it is valid in filenames on both Windows and Linux, but -libjars uses it as the delimiter when listing the additional required jars. So, I figured if it's already being used as a delimiter, then it's reasonable to use it internally as well.

I've attached a patch (against 0.19.0) that applies this change.

Now, with this change, I can submit hadoop jobs (requiring multiple supporting jars) from my Windows laptop (via cygwin) to my 10-node Linux hadoop cluster.

Any chance this change could be applied to the hadoop codebase?

diff -ur src/core/org/apache/hadoop/filecache/DistributedCache.java src_working/core/org/apache/hadoop/filecache/DistributedCache.java
--- src/core/org/apache/hadoop/filecache/DistributedCache.java	2008-11-13 21:09:36.0 -0600
+++ src_working/core/org/apache/hadoop/filecache/DistributedCache.java	2008-12-12 14:07:48.865460800 -0600
@@ -710,7 +710,7 @@
       throws IOException {
     String classpath = conf.get("mapred.job.classpath.archives");
     conf.set("mapred.job.classpath.archives", classpath == null ? archive
-        .toString() : classpath + System.getProperty("path.separator") +
+        .toString() : classpath + "," +
         archive.toString());
     FileSystem fs = FileSystem.get(conf);
     URI uri = fs.makeQualified(archive).toUri();
@@ -727,8 +727,7 @@
     String classpath = conf.get("mapred.job.classpath.archives");
     if (classpath == null)
       return null;
-    ArrayList list = Collections.list(new StringTokenizer(classpath, System
-        .getProperty("path.separator")));
+    ArrayList list = Collections.list(new StringTokenizer(classpath, ","));
     Path[] paths = new Path[list.size()];
     for (int i = 0; i < list.size(); i++) {
       paths[i] = new Path((String) list.get(i));
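The cross-platform failure mode the patch addresses can be reproduced without Hadoop at all: the client packs the jar list with its own path.separator (';' under a Windows JVM, ':' under Linux), and the other side splits with a different one, so the list never comes apart. An illustrative demonstration (class and jar names are mine):

```java
import java.util.regex.Pattern;

public class DelimiterDemo {
    // Unpack a separator-joined list the way a receiver with the given
    // separator would see it.
    public static String[] unpack(String packed, String separator) {
        return packed.split(Pattern.quote(separator));
    }

    public static void main(String[] args) {
        String packedOnWindows = "Foo.jar;Bar.jar";  // ';' is path.separator on a Windows JVM
        // Unpacking on Linux (':') finds one element, so Bar.jar is never seen:
        System.out.println(unpack(packedOnWindows, ":").length); // 1
        // A fixed delimiter like ',' behaves the same on every platform:
        System.out.println(unpack("Foo.jar,Bar.jar", ",").length); // 2
    }
}
```

This is exactly why a single -libjars jar works across platforms (no splitting needed) while two or more jars fail with ClassNotFoundException, matching the test matrix in the report below.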
-libjars with multiple jars broken when client and cluster reside on different OSs?
I've written a simple map/reduce job that demonstrates a problem I'm having. Please see the attached example.

Environment:
- hadoop 0.19.0
- cluster resides across linux nodes
- client resides on cygwin

To recreate the problem I'm seeing, do the following:

- Setup a hadoop cluster on linux.
- Perform the remaining steps on cygwin, with a hadoop installation configured to point to the linux cluster (set fs.default.name and mapred.job.tracker).
- Extract the tarball and change into the created directory:
    tar xvfz Example.tar.gz
    cd Example
- Edit build.properties, set your hadoop.home appropriately, then build the example:
    ant
- Load the file Example.in into your dfs:
    hadoop dfs -copyFromLocal Example.in Example.in
- Execute the provided shell script, passing it testID 1:
    ./Example.sh 1
  This test does not use -libjars, and it completes successfully.
- Next, execute testID 2:
    ./Example.sh 2
  This test uses -libjars with 1 jarfile (Foo.jar), and it completes successfully.
- Next, execute testID 3:
    ./Example.sh 3
  This test uses -libjars with 1 jarfile (Bar.jar), and it completes successfully.
- Next, execute testID 4:
    ./Example.sh 4
  This test uses -libjars with 2 jarfiles (Foo.jar and Bar.jar), and it fails with a ClassNotFoundException.

This behavior only occurs when calling from cygwin to linux or vice versa. If both the cluster and the client reside on either linux or cygwin, the problem does not occur.

I'm continuing to dig to see what I can figure out, but since I'm very new to hadoop (started using it this week), I thought I'd go ahead and throw this out there to see if anyone can help. Thanks!

Example.tar.gz
Description: GNU Zip compressed data