Creating MapFile.Reader instance in reducer setup
Hello, I'm trying to use a MapFile (stored on HDFS) in my reduce task, which processes some text data. When I try to initialize MapFile.Reader in the reducer's configure() method, the application throws a NullPointerException. When the same approach is used inside each reduce() method call with the same parameters, everything works fine. But creating a Reader instance for every reduce() call causes a big slowdown. Do you have any idea what I am doing wrong? Thanks Ondrej Klimpera
Re: Creating MapFile.Reader instance in reducer setup
Hello, sorry, my mistake. Problem solved. On 06/19/2012 03:40 PM, Devaraj k wrote: Can you share the exception stack trace and the piece of code where you are trying to create it? Thanks Devaraj From: Ondřej Klimpera [klimp...@fit.cvut.cz] Sent: Tuesday, June 19, 2012 6:03 PM To: common-user@hadoop.apache.org Subject: Creating MapFile.Reader instance in reducer setup Hello, I'm trying to use a MapFile (stored on HDFS) in my reduce task, which processes some text data. When I try to initialize MapFile.Reader in the reducer's configure() method, the application throws a NullPointerException. When the same approach is used inside each reduce() method call with the same parameters, everything works fine. But creating a Reader instance for every reduce() call causes a big slowdown. Do you have any idea what I am doing wrong? Thanks Ondrej Klimpera
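For readers who hit the same NullPointerException, a minimal sketch of the open-once pattern discussed in this thread (old 0.20 API). The HDFS path and class names are placeholders, not taken from the thread; a common cause of the NPE is using a null or uninitialized JobConf when obtaining the FileSystem.

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class LookupReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private MapFile.Reader reader;

  @Override
  public void configure(JobConf job) {
    try {
      // Open the MapFile once per task; a null conf here is a common NPE cause.
      FileSystem fs = FileSystem.get(job);
      reader = new MapFile.Reader(fs, "/data/lookup.map", job); // placeholder path
    } catch (IOException e) {
      throw new RuntimeException("Cannot open MapFile", e);
    }
  }

  @Override
  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    // Reuse the single Reader instead of reopening it per reduce() call.
    Text match = new Text();
    if (reader.get(key, match) != null) {
      out.collect(key, match);
    }
  }

  @Override
  public void close() throws IOException {
    if (reader != null) reader.close();
  }
}
```

The reader is opened in configure() and closed in close(), so the per-call slowdown described above disappears.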
Re: Setting number of mappers according to number of TextInput lines
Hi, I made some progress: a combination of NLineInputFormat and mapred.max.split.size seems to work, but it is hard to set the byte value exactly. Input lines are roughly 64 to 1024 bytes each. What I need is to have as many mappers as possible (to use the full potential of the cluster), where each receives N input lines. On 06/17/2012 05:02 AM, Harsh J wrote: Ondřej, While NLineInputFormat will indeed give you N lines per task, it does not guarantee that the N map tasks that come out for a file from it will all be sent to different nodes. Which one is your need exactly - simply having N lines per map task, or N widely distributed maps? On Sat, Jun 16, 2012 at 3:01 PM, Ondřej Klimpera klimp...@fit.cvut.cz wrote: I tried this approach, but the job is not distributed among 10 mapper nodes. It seems Hadoop ignores this property :( My first thought is that the small file size is the problem and Hadoop doesn't split it properly. Thanks for any ideas. On 06/16/2012 11:27 AM, Bejoy KS wrote: Hi Ondrej You can use NLineInputFormat with n set to 10. --Original Message-- From: Ondřej Klimpera To: common-user@hadoop.apache.org ReplyTo: common-user@hadoop.apache.org Subject: Setting number of mappers according to number of TextInput lines Sent: Jun 16, 2012 14:31 Hello, I have a very small input size (kB), but processing it to produce output takes several minutes. Is there a way to say: the file has 100 lines, I need 10 mappers, where each mapper node processes 10 lines of the input file? Thanks for advice. Ondrej Klimpera Regards Bejoy KS Sent from handheld, please excuse typos.
Setting number of mappers according to number of TextInput lines
Hello, I have a very small input size (kB), but processing it to produce output takes several minutes. Is there a way to say: the file has 100 lines, I need 10 mappers, where each mapper node processes 10 lines of the input file? Thanks for advice. Ondrej Klimpera
Re: Setting number of mappers according to number of TextInput lines
I tried this approach, but the job is not distributed among 10 mapper nodes. It seems Hadoop ignores this property :( My first thought is that the small file size is the problem and Hadoop doesn't split it properly. Thanks for any ideas. On 06/16/2012 11:27 AM, Bejoy KS wrote: Hi Ondrej You can use NLineInputFormat with n set to 10. --Original Message-- From: Ondřej Klimpera To: common-user@hadoop.apache.org ReplyTo: common-user@hadoop.apache.org Subject: Setting number of mappers according to number of TextInput lines Sent: Jun 16, 2012 14:31 Hello, I have a very small input size (kB), but processing it to produce output takes several minutes. Is there a way to say: the file has 100 lines, I need 10 mappers, where each mapper node processes 10 lines of the input file? Thanks for advice. Ondrej Klimpera Regards Bejoy KS Sent from handheld, please excuse typos.
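A driver-side sketch of the NLineInputFormat suggestion above (old 0.20 API). The job class and paths are placeholder names for illustration; the property name is the one the old-API NLineInputFormat reads.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class NLineDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(NLineDriver.class);
    // Each input split (and therefore each map task) gets 10 lines,
    // so a 100-line file yields 10 map tasks.
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 10);
    FileInputFormat.setInputPaths(conf, new Path("input"));    // placeholder
    FileOutputFormat.setOutputPath(conf, new Path("output"));  // placeholder
    JobClient.runJob(conf);
  }
}
```

Note the caveat from the reply above: 10 map tasks does not guarantee 10 distinct nodes; the scheduler decides task placement.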
Dealing with low space cluster
Hello, we're testing an application on 8 nodes, where each node has 20 GB of local storage available. What we are trying to achieve is to get more than 20 GB of data processed on this cluster. Is there a way to distribute the data across the cluster? There is also one shared NFS storage disk with 1 TB of available space, which is currently unused. Thanks for your reply. Ondrej Klimpera
Re: HADOOP_HOME depracated
Thanks for your reply. It would be great to mention this in the tutorial on your web site. Is the name of the HADOOP_PREFIX/HOME/INSTALL variable crucial to Hadoop, or is setting it just for the user's benefit? Thanks for the reply. On 06/14/2012 07:46 AM, Harsh J wrote: Hi Ondřej, Due to a new packaging format, Apache Hadoop 1.x has deprecated the HADOOP_HOME env-var in favor of a new env-var called 'HADOOP_PREFIX'. You can set HADOOP_PREFIX, or set HADOOP_HOME_WARN_SUPPRESS in your environment to a non-empty value to suppress the warning. On Thu, Jun 14, 2012 at 11:11 AM, Ondřej Klimpera klimp...@fit.cvut.cz wrote: Hello, why, when running Hadoop, is the HADOOP_HOME shell variable always reported as deprecated? How do I set the installation directory on cluster nodes, and which variable is correct? Thanks Ondrej Klimpera
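The two options from the reply above as shell lines (the install path is an example, adjust to your setup):

```shell
# Option 1: switch to the new variable name used by Hadoop 1.x.
export HADOOP_PREFIX=/usr/local/hadoop   # example path

# Option 2: keep HADOOP_HOME and silence the deprecation warning.
export HADOOP_HOME=/usr/local/hadoop     # example path
export HADOOP_HOME_WARN_SUPPRESS=1       # any non-empty value works
```

Put these in the shell profile (or hadoop-env.sh) of each cluster node so all daemons see them.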
Re: Dealing with low space cluster
Hello, you're right, that's exactly what I meant. And your answer is exactly what I thought. I was just wondering if Hadoop can distribute the data to other nodes' local storage if a node's own local space is full. Thanks On 06/14/2012 03:38 PM, Harsh J wrote: Ondřej, If by processing you mean trying to write out (map outputs) 20 GB of data per map task, that may not be possible, as the outputs need to be materialized and the disk space is the constraint there. Or did I not understand you correctly (in thinking you are asking about MapReduce)? Because you otherwise have ~50 GB of space available for HDFS consumption (assuming replication = 3 for proper reliability). On Thu, Jun 14, 2012 at 1:25 PM, Ondřej Klimpera klimp...@fit.cvut.cz wrote: Hello, we're testing an application on 8 nodes, where each node has 20 GB of local storage available. What we are trying to achieve is to get more than 20 GB of data processed on this cluster. Is there a way to distribute the data across the cluster? There is also one shared NFS storage disk with 1 TB of available space, which is currently unused. Thanks for your reply. Ondrej Klimpera
Re: Dealing with low space cluster
Thanks, I'll try. One more question: I've got a few more nodes which can be added to the cluster. But how do I do that? If I understand it (according to Hadoop's wiki pages): 1. On the master node - edit the slaves file and add the IP addresses of the new nodes (everything clear). 2. Log in to each newly added node and run (clear to me too): $ hadoop-daemon.sh start datanode $ hadoop-daemon.sh start tasktracker 3. Now I'm not sure - I'm not using dfs.include/mapred.include, so do I have to run the commands: $ hadoop dfsadmin -refreshNodes $ hadoop mradmin -refreshNodes If yes, must they be run on the master node, or on the new slave nodes? Ondrej On 06/14/2012 04:03 PM, Harsh J wrote: Ondřej, That isn't currently possible with local storage FS. Your 1 TB NFS mount can help, but I suspect it may act as a slow-down point if nodes use it in parallel. Perhaps mount it only on 3-4 machines (or fewer), instead of all, to avoid that? On Thu, Jun 14, 2012 at 7:28 PM, Ondřej Klimpera klimp...@fit.cvut.cz wrote: Hello, you're right, that's exactly what I meant. And your answer is exactly what I thought. I was just wondering if Hadoop can distribute the data to other nodes' local storage if a node's own local space is full. Thanks On 06/14/2012 03:38 PM, Harsh J wrote: Ondřej, If by processing you mean trying to write out (map outputs) 20 GB of data per map task, that may not be possible, as the outputs need to be materialized and the disk space is the constraint there. Or did I not understand you correctly (in thinking you are asking about MapReduce)? Because you otherwise have ~50 GB of space available for HDFS consumption (assuming replication = 3 for proper reliability). On Thu, Jun 14, 2012 at 1:25 PM, Ondřej Klimpera klimp...@fit.cvut.cz wrote: Hello, we're testing an application on 8 nodes, where each node has 20 GB of local storage available. What we are trying to achieve is to get more than 20 GB of data processed on this cluster. Is there a way to distribute the data across the cluster?
There is also one shared NFS storage disk with 1 TB of available space, which is currently unused. Thanks for your reply. Ondrej Klimpera
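A sketch of step 3 as it is usually done (my understanding, not confirmed in this thread): the refresh commands are run on the master node, and they only matter when dfs.hosts/mapred.hosts include files are configured.

```shell
# Run on the master node, after the new hostnames are in conf/slaves
# (and in the dfs.hosts/mapred.hosts include files, if you use them):
hadoop dfsadmin -refreshNodes   # NameNode re-reads its include/exclude lists
hadoop mradmin  -refreshNodes   # JobTracker does the same

# Without include files, the new datanode/tasktracker daemons simply
# register with the master when started on each new slave, and the
# refresh step is not required.
```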
How Hadoop splits TextInput?
Hello, I'd like to ask how Hadoop splits text input if its size is smaller than the HDFS block size. I'm testing an application which creates large outputs from small inputs. When using the NInputSplits input format and setting the number of splits in mapred-conf.xml, some results are lost while writing output. When the app runs with the default TextInput format, everything goes OK. Do you have an idea where the problem might be? Thanks for your answer.
HADOOP_HOME depracated
Hello, why, when running Hadoop, is the HADOOP_HOME shell variable always reported as deprecated? How do I set the installation directory on cluster nodes, and which variable is correct? Thanks Ondrej Klimpera
Re: Getting job progress in java application
Thanks a lot, I checked the docs and the submitJob() method did the job. Two more questions please :) [1] My app is running on Hadoop 0.20.203; if I upgrade the libraries to 1.0.X, will the old API work, or is it necessary to rewrite the map() and reduce() functions for the new API? [2] Does the new API support MultipleOutputs? Thanks again. On 04/30/2012 12:32 AM, Bill Graham wrote: Take a look at the JobClient API. You can use that to get the current progress of a running job. On Sunday, April 29, 2012, Ondřej Klimpera wrote: Hello, I'd like to ask what is the preferred way of getting a running job's progress from the Java application that has executed it. I'm using Hadoop 0.20.203 and tried the job.end.notification.url property, which works well, but as the property name says, it sends only job-end notifications. What I need is to get updates on map() and reduce() progress. Please help me with how to do this. Thanks. Ondrej Klimpera
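A sketch of the JobClient polling approach mentioned above (old 0.20 API): submitJob() returns immediately, unlike the blocking runJob(), so the driver can poll the RunningJob handle. The JobConf setup is assumed to exist elsewhere.

```java
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class ProgressWatcher {
  // Submit the configured job and print progress until completion.
  public static void watch(JobConf conf) throws Exception {
    JobClient client = new JobClient(conf);
    RunningJob running = client.submitJob(conf); // non-blocking, unlike runJob()
    while (!running.isComplete()) {
      // mapProgress()/reduceProgress() return fractions in [0, 1].
      System.out.printf("map %.0f%%  reduce %.0f%%%n",
          running.mapProgress() * 100, running.reduceProgress() * 100);
      Thread.sleep(2000); // poll interval, an arbitrary choice
    }
    System.out.println("Job " + (running.isSuccessful() ? "succeeded" : "failed"));
  }
}
```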
Getting job progress in java application
Hello, I'd like to ask what is the preferred way of getting a running job's progress from the Java application that has executed it. I'm using Hadoop 0.20.203 and tried the job.end.notification.url property, which works well, but as the property name says, it sends only job-end notifications. What I need is to get updates on map() and reduce() progress. Please help me with how to do this. Thanks. Ondrej Klimpera
Setting a timeout for one Map() input processing
Hello, I'd like to ask if there is a possibility of setting a timeout for processing one line of text input in the mapper function. The idea is that if processing of one line takes too long, Hadoop will cut this process off and continue with the next input line. Thank you for your answer. Ondrej Klimpera
Re: Setting a timeout for one Map() input processing
Thanks, I'll try to implement it and let you know if it worked. On 04/18/2012 04:07 PM, Harsh J wrote: Since you're looking for per-line (and not per-task/file) monitoring, this is best done by your own application code (a monitoring thread, etc.). On Wed, Apr 18, 2012 at 6:09 PM, Ondřej Klimpera klimp...@fit.cvut.cz wrote: Hello, I'd like to ask if there is a possibility of setting a timeout for processing one line of text input in the mapper function. The idea is that if processing of one line takes too long, Hadoop will cut this process off and continue with the next input line. Thank you for your answer. Ondrej Klimpera
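One way to implement the suggested application-side monitoring is to run each line's work under a Future with a deadline. This is a self-contained sketch in plain Java (the process() body simulating a slow line is hypothetical); inside a real mapper the same pattern would wrap the per-line work in map().

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PerLineTimeout {
  // Stand-in for the per-line work a mapper would do.
  static String process(String line) throws InterruptedException {
    if (line.contains("slow")) Thread.sleep(5000); // simulate a pathological line
    return line.toUpperCase();
  }

  // Run process() under a deadline; return null when the line times out.
  static String processWithTimeout(ExecutorService pool, String line, long millis) {
    Future<String> f = pool.submit(() -> process(line));
    try {
      return f.get(millis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      f.cancel(true); // interrupt the worker and skip this line
      return null;
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    for (String line : List.of("fast line", "slow line", "another fast line")) {
      String result = processWithTimeout(pool, line, 500);
      System.out.println(result == null ? "SKIPPED: " + line : result);
    }
    pool.shutdownNow();
  }
}
```

In a long-running mapper, remember to also call reporter.progress() periodically so the framework's own task timeout (mapred.task.timeout) does not kill the task.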
Re: Creating and working with temporary file in a map() function
Thanks for your advice, File.createTempFile() works great, at least in pseudo-distributed mode; I hope the cluster setup will do the same. You saved me hours of trying... On 04/07/2012 11:29 PM, Harsh J wrote: MapReduce sets mapred.child.tmp for all tasks to be the task attempt's WorkingDir/tmp automatically. This also sets the -Djava.io.tmpdir prop for each task at JVM boot. Hence you may use the regular Java API to create a temporary file: http://docs.oracle.com/javase/6/docs/api/java/io/File.html#createTempFile(java.lang.String,%20java.lang.String) These files would also be automatically deleted after the task attempt is done. On Sun, Apr 8, 2012 at 2:14 AM, Ondřej Klimpera klimp...@fit.cvut.cz wrote: Hello, I would like to ask if it is possible to create and work with a temporary file while in a map function. I suppose the map function is running on a single node in the Hadoop cluster. So what is a safe way to create a temporary file and read from it in one map() run? If it is possible, is there a size limit for the file? The file cannot be created before the Hadoop job is created; I need to create and process the file inside map(). Thanks for your answer. Ondrej Klimpera.
Re: Creating and working with temporary file in a map() function
I will, but deploying the application on a cluster is still far away; I'm just finishing the raw implementation. Cluster tuning is planned for the end of this month. Thanks. On 04/08/2012 09:06 PM, Harsh J wrote: It will work. Pseudo-distributed mode shouldn't be all that different from a fully distributed mode. Do let us know if it does not work as intended. On Sun, Apr 8, 2012 at 11:40 PM, Ondřej Klimpera klimp...@fit.cvut.cz wrote: Thanks for your advice, File.createTempFile() works great, at least in pseudo-distributed mode; I hope the cluster setup will do the same. You saved me hours of trying... On 04/07/2012 11:29 PM, Harsh J wrote: MapReduce sets mapred.child.tmp for all tasks to be the task attempt's WorkingDir/tmp automatically. This also sets the -Djava.io.tmpdir prop for each task at JVM boot. Hence you may use the regular Java API to create a temporary file: http://docs.oracle.com/javase/6/docs/api/java/io/File.html#createTempFile(java.lang.String,%20java.lang.String) These files would also be automatically deleted after the task attempt is done. On Sun, Apr 8, 2012 at 2:14 AM, Ondřej Klimpera klimp...@fit.cvut.cz wrote: Hello, I would like to ask if it is possible to create and work with a temporary file while in a map function. I suppose the map function is running on a single node in the Hadoop cluster. So what is a safe way to create a temporary file and read from it in one map() run? If it is possible, is there a size limit for the file? The file cannot be created before the Hadoop job is created; I need to create and process the file inside map(). Thanks for your answer. Ondrej Klimpera.
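The createTempFile() pattern from this thread as a small self-contained round-trip (plain Java, runs anywhere). Inside a map() task the same code is safe because mapred.child.tmp points java.io.tmpdir at the task attempt's working directory; the file-size limit is simply the node's local disk space.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;

public class TempFileDemo {
  // Create a scratch file, write a payload, read it back, then delete it.
  static String roundTrip(String payload) throws IOException {
    File tmp = File.createTempFile("map-scratch", ".tmp"); // lands in java.io.tmpdir
    tmp.deleteOnExit(); // belt-and-braces; in a task the attempt dir is cleaned anyway
    try (Writer w = new FileWriter(tmp)) {
      w.write(payload);
    }
    String back = new String(Files.readAllBytes(tmp.toPath()));
    tmp.delete();
    return back;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(roundTrip("intermediate data")); // prints "intermediate data"
  }
}
```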
Re: Working with MapFiles
Ok, thanks. I missed the setup() method because I am using an older version of Hadoop, so I suppose the configure() method does the same in Hadoop 0.20.203. Now I'm able to load a MapFile inside the configure() method into a MapFile.Reader instance held as a private class variable. All works fine; I'm just wondering if the MapFile is replicated on HDFS and the data are read locally, or if reading from this file will increase network bandwidth by fetching its data from another compute node in the Hadoop cluster. Hopefully my last question to bother you with: is reading files from the DistributedCache (a normal text file) limited to the particular job? Before running a job I add a file to the DistributedCache. When getting the file in the Reducer implementation, can it access DistributedCache files from other jobs? In other words, what will this code list: //Reducer impl. public void configure(JobConf job) { URI[] distCacheFileUris = DistributedCache.getCacheFiles(job); } Will the distCacheFileUris variable contain only URIs for this job, or for any job running on the Hadoop cluster? Hope it's understandable. Thanks. On 04/02/2012 11:34 AM, Ioan Eugen Stan wrote: Hi Ondrej, Pe 30.03.2012 14:30, Ondřej Klimpera a scris: And one more question, is it even possible to add a MapFile (as it consists of an index and a data file) to the Distributed cache? Thanks Should be no problem, they are just two files. On 03/30/2012 01:15 PM, Ondřej Klimpera wrote: Hello, I'm not sure what you mean by using the map reduce setup()? If the file is that small you could load it all in memory to avoid network IO. Do that in the setup() method of the map reduce job. Can you please explain a little bit more? Check the javadocs [1]: setup is called once per task, so you can read the file from HDFS then or perform other initializations. [1] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Mapper.html Reading 20 MB into RAM should not be a problem and is preferred if you need to make many requests against that data. It really depends on your use case, so think carefully or just go ahead and test it. Thanks On 03/30/2012 12:49 PM, Ioan Eugen Stan wrote: Hello Ondrej, Pe 29.03.2012 18:05, Ondřej Klimpera a scris: Hello, I have a MapFile as a product of a MapReduce job, and what I need to do is: 1. If MapReduce produced more splits as output, merge them into a single file. 2. Copy this merged MapFile to another HDFS location and use it as a Distributed cache file for another MapReduce job. I'm wondering if it is even possible to merge MapFiles, given their nature, and use them as a Distributed cache file. A MapFile is actually two files [1]: one SequenceFile (with sorted keys) and a small index for that file. The MapFile does a version of binary search to find your key and performs seek() to go to the byte offset in the file. What I'm trying to achieve is repeated fast search in this file during another MapReduce job. If my idea is absolutely wrong, can you give me any tip on how to do it? The file is supposed to be 20 MB large. I'm using Hadoop 0.20.203. If the file is that small you could load it all in memory to avoid network IO. Do that in the setup() method of the map reduce job. The distributed cache will also use HDFS [2] and I don't think it will provide you with any benefits. Thanks for your reply :) Ondrej Klimpera [1] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html [2] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html
Re: Working with MapFiles
Hello, I've got one more question: how is the seek() (or get()) method implemented in MapFile.Reader? Does it use hashCode(), compareTo(), or another mechanism to find a match in the MapFile's index? Thanks for your reply. Ondrej Klimpera On 03/29/2012 08:26 PM, Ondřej Klimpera wrote: Thanks for your fast reply, I'll try this approach :) On 03/29/2012 05:43 PM, Deniz Demir wrote: Not sure if this helps in your use case, but you can put all output files into the distributed cache and then access them in the subsequent map-reduce job (in driver code): // previous mr-job's output String pstr = "hdfs://output_path/"; FileStatus[] files = fs.listStatus(new Path(pstr)); for (FileStatus f : files) { if (!f.isDir()) { DistributedCache.addCacheFile(f.getPath().toUri(), job.getConfiguration()); } } I think you can also copy these files to a different location in dfs and then put them into the distributed cache. Deniz On Mar 29, 2012, at 8:05 AM, Ondřej Klimpera wrote: Hello, I have a MapFile as a product of a MapReduce job, and what I need to do is: 1. If MapReduce produced more splits as output, merge them into a single file. 2. Copy this merged MapFile to another HDFS location and use it as a Distributed cache file for another MapReduce job. I'm wondering if it is even possible to merge MapFiles, given their nature, and use them as a Distributed cache file. What I'm trying to achieve is repeated fast search in this file during another MapReduce job. If my idea is absolutely wrong, can you give me any tip on how to do it? The file is supposed to be 20 MB large. I'm using Hadoop 0.20.203. Thanks for your reply :) Ondrej Klimpera
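To answer the lookup question in miniature: MapFile.Reader does not hash keys. Keys are WritableComparables, and get() uses their compareTo() ordering: a binary search over the sparse in-memory index, then a short sequential scan of the sorted data file. A pure-Java analogy of the ordering-based lookup (not Hadoop code, just an illustration):

```java
import java.util.Arrays;

public class MapFileLookupAnalogy {
  // Like MapFile.Reader.get(), this relies on compareTo() over sorted keys,
  // not on hashCode(): Arrays.binarySearch uses the natural ordering.
  static int lookup(String[] sortedKeys, String wanted) {
    return Arrays.binarySearch(sortedKeys, wanted); // >= 0 iff found
  }

  public static void main(String[] args) {
    String[] keys = {"apple", "banana", "cherry"}; // must be sorted, as in a MapFile
    System.out.println(lookup(keys, "banana"));        // prints 1
    System.out.println(lookup(keys, "durian") >= 0);   // prints false
  }
}
```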
Re: Working with MapFiles
Hello, I'm not sure what you mean by using the map reduce setup()? If the file is that small you could load it all in memory to avoid network IO. Do that in the setup() method of the map reduce job. Can you please explain a little bit more? Thanks On 03/30/2012 12:49 PM, Ioan Eugen Stan wrote: Hello Ondrej, Pe 29.03.2012 18:05, Ondřej Klimpera a scris: Hello, I have a MapFile as a product of a MapReduce job, and what I need to do is: 1. If MapReduce produced more splits as output, merge them into a single file. 2. Copy this merged MapFile to another HDFS location and use it as a Distributed cache file for another MapReduce job. I'm wondering if it is even possible to merge MapFiles, given their nature, and use them as a Distributed cache file. A MapFile is actually two files [1]: one SequenceFile (with sorted keys) and a small index for that file. The MapFile does a version of binary search to find your key and performs seek() to go to the byte offset in the file. What I'm trying to achieve is repeated fast search in this file during another MapReduce job. If my idea is absolutely wrong, can you give me any tip on how to do it? The file is supposed to be 20 MB large. I'm using Hadoop 0.20.203. If the file is that small you could load it all in memory to avoid network IO. Do that in the setup() method of the map reduce job. The distributed cache will also use HDFS [2] and I don't think it will provide you with any benefits. Thanks for your reply :) Ondrej Klimpera [1] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html [2] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html
Re: Working with MapFiles
And one more question: is it even possible to add a MapFile (as it consists of an index and a data file) to the Distributed cache? Thanks On 03/30/2012 01:15 PM, Ondřej Klimpera wrote: Hello, I'm not sure what you mean by using the map reduce setup()? If the file is that small you could load it all in memory to avoid network IO. Do that in the setup() method of the map reduce job. Can you please explain a little bit more? Thanks On 03/30/2012 12:49 PM, Ioan Eugen Stan wrote: Hello Ondrej, Pe 29.03.2012 18:05, Ondřej Klimpera a scris: Hello, I have a MapFile as a product of a MapReduce job, and what I need to do is: 1. If MapReduce produced more splits as output, merge them into a single file. 2. Copy this merged MapFile to another HDFS location and use it as a Distributed cache file for another MapReduce job. I'm wondering if it is even possible to merge MapFiles, given their nature, and use them as a Distributed cache file. A MapFile is actually two files [1]: one SequenceFile (with sorted keys) and a small index for that file. The MapFile does a version of binary search to find your key and performs seek() to go to the byte offset in the file. What I'm trying to achieve is repeated fast search in this file during another MapReduce job. If my idea is absolutely wrong, can you give me any tip on how to do it? The file is supposed to be 20 MB large. I'm using Hadoop 0.20.203. If the file is that small you could load it all in memory to avoid network IO. Do that in the setup() method of the map reduce job. The distributed cache will also use HDFS [2] and I don't think it will provide you with any benefits. Thanks for your reply :) Ondrej Klimpera [1] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html [2] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html
Working with MapFiles
Hello, I have a MapFile as a product of a MapReduce job, and what I need to do is: 1. If MapReduce produced more splits as output, merge them into a single file. 2. Copy this merged MapFile to another HDFS location and use it as a Distributed cache file for another MapReduce job. I'm wondering if it is even possible to merge MapFiles, given their nature, and use them as a Distributed cache file. What I'm trying to achieve is repeated fast search in this file during another MapReduce job. If my idea is absolutely wrong, can you give me any tip on how to do it? The file is supposed to be 20 MB large. I'm using Hadoop 0.20.203. Thanks for your reply :) Ondrej Klimpera
Re: Working with MapFiles
Thanks for your fast reply, I'll try this approach :) On 03/29/2012 05:43 PM, Deniz Demir wrote: Not sure if this helps in your use case, but you can put all output files into the distributed cache and then access them in the subsequent map-reduce job (in driver code): // previous mr-job's output String pstr = "hdfs://output_path/"; FileStatus[] files = fs.listStatus(new Path(pstr)); for (FileStatus f : files) { if (!f.isDir()) { DistributedCache.addCacheFile(f.getPath().toUri(), job.getConfiguration()); } } I think you can also copy these files to a different location in dfs and then put them into the distributed cache. Deniz On Mar 29, 2012, at 8:05 AM, Ondřej Klimpera wrote: Hello, I have a MapFile as a product of a MapReduce job, and what I need to do is: 1. If MapReduce produced more splits as output, merge them into a single file. 2. Copy this merged MapFile to another HDFS location and use it as a Distributed cache file for another MapReduce job. I'm wondering if it is even possible to merge MapFiles, given their nature, and use them as a Distributed cache file. What I'm trying to achieve is repeated fast search in this file during another MapReduce job. If my idea is absolutely wrong, can you give me any tip on how to do it? The file is supposed to be 20 MB large. I'm using Hadoop 0.20.203. Thanks for your reply :) Ondrej Klimpera
Using MultipleOutputs with new API (v1.0)
Hello, I'm trying to develop an application where the Reducer has to produce multiple outputs. In detail, I need the Reducer to produce two types of files, each file with different output. I found in Hadoop: The Definitive Guide that the new API uses only MultipleOutputs, but working with MultipleOutputs requires a JobConf instance, which is @deprecated (I'm using an org.apache.hadoop.mapreduce.Job instance to handle job configuration). So I'm wondering how to get MultipleOutputs working. Can you please provide me a short example or explanation? Thanks for your reply. Regards Ondrej Klimpera
Re: Using MultipleOutputs with new API (v1.0)
I'm using 1.0.0 beta; I suppose it was a wrong decision to use a beta version. So do you recommend using 0.20.203.X and sticking to the Hadoop: The Definitive Guide approaches? Thanks for your reply On 01/25/2012 01:41 PM, Harsh J wrote: Oh and btw, do not fear the @deprecated 'old' API. We have undeprecated it in the recent stable releases, and will continue to support it for a long time. I'd recommend using the older API, as that is more feature complete and test covered in the version you use. On Wed, Jan 25, 2012 at 6:09 PM, Harsh J ha...@cloudera.com wrote: What version/release/distro of Hadoop are you using? Apache releases got the new (unstable) API MultipleOutputs only in 0.21+, and it was only very recently backported to branch-1. That said, the next release in 1.x (1.1.0, out soon) will carry the new API MultipleOutputs, but presently no release in 0.20.xxx/1.x has it. I'd still recommend sticking to the stable API if you are using a 0.20.x/1.x stable Apache release. On Wed, Jan 25, 2012 at 5:13 PM, Ondřej Klimpera klimp...@fit.cvut.cz wrote: Hello, I'm trying to develop an application where the Reducer has to produce multiple outputs. In detail, I need the Reducer to produce two types of files, each file with different output. I found in Hadoop: The Definitive Guide that the new API uses only MultipleOutputs, but working with MultipleOutputs requires a JobConf instance, which is @deprecated (I'm using an org.apache.hadoop.mapreduce.Job instance to handle job configuration). So I'm wondering how to get MultipleOutputs working. Can you please provide me a short example or explanation? Thanks for your reply. Regards Ondrej Klimpera -- Harsh J Customer Ops. Engineer, Cloudera
Re: Using MultipleOutputs with new API (v1.0)
One more question. I just downloaded Hadoop 0.20.203.0, considered to be the last stable release. What about the JobConf vs. Configuration classes? Which should I use to avoid wrong approaches, since JobConf seems to be deprecated? Sorry for bothering you with these questions; I'm just not used to having deprecated things in my projects. Thanks. On 01/25/2012 01:46 PM, Ondřej Klimpera wrote: I'm using 1.0.0 beta; I suppose it was a wrong decision to use a beta version. So do you recommend using 0.20.203.X and sticking to the Hadoop: The Definitive Guide approaches? Thanks for your reply On 01/25/2012 01:41 PM, Harsh J wrote: Oh and btw, do not fear the @deprecated 'old' API. We have undeprecated it in the recent stable releases, and will continue to support it for a long time. I'd recommend using the older API, as that is more feature complete and test covered in the version you use. On Wed, Jan 25, 2012 at 6:09 PM, Harsh J ha...@cloudera.com wrote: What version/release/distro of Hadoop are you using? Apache releases got the new (unstable) API MultipleOutputs only in 0.21+, and it was only very recently backported to branch-1. That said, the next release in 1.x (1.1.0, out soon) will carry the new API MultipleOutputs, but presently no release in 0.20.xxx/1.x has it. I'd still recommend sticking to the stable API if you are using a 0.20.x/1.x stable Apache release. On Wed, Jan 25, 2012 at 5:13 PM, Ondřej Klimpera klimp...@fit.cvut.cz wrote: Hello, I'm trying to develop an application where the Reducer has to produce multiple outputs. In detail, I need the Reducer to produce two types of files, each file with different output. I found in Hadoop: The Definitive Guide that the new API uses only MultipleOutputs, but working with MultipleOutputs requires a JobConf instance, which is @deprecated (I'm using an org.apache.hadoop.mapreduce.Job instance to handle job configuration). So I'm wondering how to get MultipleOutputs working. Can you please provide me a short example or explanation? Thanks for your reply.
Regards Ondrej Klimpera -- Harsh J Customer Ops. Engineer, Cloudera
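Following the recommendation above to stay on the old API, a sketch of old-API MultipleOutputs on 0.20.203: declare named outputs on the JobConf in the driver, then write to them from the reducer. The output names "text" and "stats" and the reducer class are illustrative placeholders.

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class TwoFileReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  // In the driver, before submitting the job:
  //   MultipleOutputs.addNamedOutput(conf, "text", TextOutputFormat.class,
  //       Text.class, Text.class);
  //   MultipleOutputs.addNamedOutput(conf, "stats", TextOutputFormat.class,
  //       Text.class, LongWritable.class);

  private MultipleOutputs mos;

  @Override
  public void configure(JobConf job) {
    mos = new MultipleOutputs(job);
  }

  @Override
  @SuppressWarnings("unchecked")
  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    long count = 0;
    while (values.hasNext()) {
      // First output type: the values themselves.
      mos.getCollector("text", reporter).collect(key, values.next());
      count++;
    }
    // Second output type: per-key statistics.
    mos.getCollector("stats", reporter).collect(key, new LongWritable(count));
  }

  @Override
  public void close() throws IOException {
    mos.close(); // flushes the named output files
  }
}
```

Each named output lands in its own part files (e.g. text-r-00000, stats-r-00000) alongside the job's regular output.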