Hadoop Presentation at Ankara / Turkey
Hi all, I will be giving a presentation on Hadoop at "1. Ulusal Yüksek Başarım ve Grid Konferansı" tomorrow (Apr 17, 13:10). The conference location is KKM, ODTU/Ankara/Turkey. The presentation will be in Turkish. All the Hadoop users and wanna-be users in the area are welcome to attend. More info can be found at: http://basarim09.ceng.metu.edu.tr/ Cheers, Enis Söztutar
Re: what change to be done in OutputCollector to print custom writable object
Deepak Diwakar wrote: Hi, I am learning how to get a custom Writable working, so I have implemented a simple MyWritable class, and I can use MyWritable objects within the map and reduce functions. But suppose the reduce values are of type MyWritable and I put them into the OutputCollector to get the final output. Since the value is a custom object, what ends up in the output file is only an object reference, not the contents. What and where do I have to make changes/additions so that the custom Writable is printed to the file properly? Thanks & regards. Just implement toString() in your MyWritable class.
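A minimal sketch of such a class (the fields here are invented for illustration). TextOutputFormat simply calls toString() on the value when it writes the final records, so overriding it is all that is needed:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class MyWritable implements Writable {
  private int counter;
  private long timestamp;

  public void write(DataOutput out) throws IOException {
    out.writeInt(counter);
    out.writeLong(timestamp);
  }

  public void readFields(DataInput in) throws IOException {
    counter = in.readInt();
    timestamp = in.readLong();
  }

  // Without this, the default Object.toString() is used and the output file
  // ends up with something like MyWritable@1a2b3c, i.e. just a reference.
  public String toString() {
    return counter + "\t" + timestamp;
  }
}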
Re: Small Test Data Sets
Patterson, Josh wrote: I want to confirm something with the list that I'm seeing; I needed to confirm that my Reader was reading our file format correctly, so I created a MR job that simply output each K/V pair to the reducer, which then just wrote out each one to the output file. This allows me to check by hand that all K/V points of data from our file format are getting pulled out of the file correctly. I have set up our InputFormat, RecordReader, and Reader subclasses for our specific file format. While running some basic tests on a small (1 MB) single file I noticed something odd --- I was getting 2 copies of each data point in the output file. Initially I thought my Reader was just somehow reading the data point and not moving the read head, but I verified that was not the case through a series of tests. I then went on to reason that since I had 2 mappers by default on my job, and only 1 input file, that each mapper must be reading the file independently. I then set the -m flag to 1, and I got the proper output; Is it safe to assume in testing on a file that is smaller than the block size that I should always use -m 1 in order to get proper block->mapper mapping? Also, should I assume that if you have more mappers than disk blocks involved that you will get duplicate values? I may have set something wrong, I just wanted to check. Thanks Josh Patterson TVA If you have developed your own InputFormat, then the problem might be there. The job of the InputFormat is to create input splits and readers. For one file and two mappers, the input format should return two splits, each representing half of the file. In your case, I assume you return two splits each containing the whole file. Is this the case? Enis
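If the custom format cannot safely be split in the middle of a file, one common alternative to always passing -m 1 is to make the format non-splittable, so getSplits() produces exactly one split per file no matter how many map tasks are requested. A rough, untested sketch against the old mapred API (the class name is hypothetical; LineRecordReader is only a stand-in for your own Reader-backed RecordReader):

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class CustomFileInputFormat extends FileInputFormat<LongWritable, Text> {

  // One split per file: a 1 MB test file is then read by exactly one mapper,
  // regardless of the -m setting, so no duplicate records can appear.
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }

  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    // Return your own RecordReader here; LineRecordReader just keeps the
    // sketch compilable.
    return new LineRecordReader(job, (FileSplit) split);
  }
}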
Re: merging files
Use MultipleInputs and use two different mappers for the inputs. map1 should be the IdentityMapper; mapper 2 should output key/value pairs where the value is a pseudo marker value (the same for all keys), which marks that the value is null/empty. In the reducer, just output the key/value pairs whose values do not include the marker value. In your example, suppose we use -1 as the marker value. Then in mapper2 the output will be 4,-1 and 2,-1, and the reducer will get: 2, {1,3,5,-1} 3, {1,2} 4, {7,9,-1} 6, {3}. The reducer will then output: 3,1 3,2 6,3. Nir Zohar wrote: Hi, I would like your help with the below question. I have 2 files: file1 (key, value), file2 (only key), and I need to exclude from file1 all records whose keys appear in file2. 1. The output format is key-value, not only keys. 2. The key is not a primary key; hence it's not possible to simply join at the end. Can you assist? Thanks, Nir. Example: file1: 2,1 2,3 2,5 3,1 3,2 4,7 4,9 6,3 file2: 4 2 Output: 3,1 3,2 6,3
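A rough, untested sketch of that job against the 0.19-era API (MultipleInputs is in org.apache.hadoop.mapred.lib). Instead of the IdentityMapper plus a KeyValueTextInputFormat configured with ',' as the separator, both mappers here parse the lines explicitly; the marker string and the class names are just placeholders:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class ExcludeKeysJob {
  private static final String MARKER = "-1";   // pseudo marker value

  // file1 lines look like "key,value": emit (key, value)
  public static class File1Mapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter r) throws IOException {
      String[] parts = line.toString().split(",", 2);
      out.collect(new Text(parts[0]), new Text(parts[1]));
    }
  }

  // file2 lines are bare keys: emit (key, MARKER)
  public static class File2Mapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter r) throws IOException {
      out.collect(new Text(line.toString().trim()), new Text(MARKER));
    }
  }

  // Drop every key whose value list contains the marker; otherwise re-emit.
  public static class ExcludeReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter r) throws IOException {
      List<String> buffered = new ArrayList<String>();
      boolean sawMarker = false;
      while (values.hasNext()) {
        String v = values.next().toString();
        if (MARKER.equals(v)) {
          sawMarker = true;
        } else {
          buffered.add(v);
        }
      }
      if (!sawMarker) {
        for (String v : buffered) {
          out.collect(key, new Text(v));
        }
      }
    }
  }

  public static void configure(JobConf conf, Path file1, Path file2) {
    MultipleInputs.addInputPath(conf, file1, TextInputFormat.class, File1Mapper.class);
    MultipleInputs.addInputPath(conf, file2, TextInputFormat.class, File2Mapper.class);
    conf.setReducerClass(ExcludeReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
  }
}

As in the message above, this assumes the marker value never appears as a real value in file1.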
Re: how to optimize mapreduce procedure??
ZhiHong Fu wrote: Hello, I'm writing a program which will do Lucene searching over about 12 index directories, all of which are stored in HDFS. It is done like this: 1. We get about 12 index directories through the Lucene index functionality, each of them about 100M in size. 2. We store these 12 index directories on Hadoop HDFS, and this Hadoop cluster is made up of one namenode and five datanodes, 6 computers in total. 3. And then I will do Lucene searching over these 12 index directories. The mapreduce methods are as follows: Map procedure: the 12 index directories will be split into numOfMapTasks; for example, if numOfMapTasks=3, then each map will get 4 indexDirs and store them in an intermediate result. Combine procedure: for an intermediate result locally, we do the real Lucene search in its index directories, and then store the hit results in the intermediate result. Reduce procedure: reduce the intermediate results' hit results and get the search result. But when I implement it like this, I have a performance problem. I have set numOfMapTasks and numOfReduceTasks to various values, such as numOfMapTasks=12, numOfReduceTasks=5, but a simple search spends about 28 seconds, which is obviously unacceptable. So I'm confused whether I did the map-reduce procedure wrong or set a wrong number of map or reduce tasks, and generally where the overhead of a mapreduce procedure comes from. Any suggestion will be appreciated. Thanks. Keeping the indexes in HDFS is not the best choice. Moreover, mapreduce does not fit the problem of distributed search over several nodes; the overhead of starting a new job for every search is not acceptable. You can use Nutch distributed search or Katta (not sure about the name) for this.
Re: HADOOP-2536 supports Oracle too?
There is nothing special about the jdbc driver library. I guess that you have added the jar from the IDE(netbeans), but did not include the necessary libraries(jdbc driver in this case) in the TableAccess.jar. The standard way is to include the dependent jars in the project's jar under the lib directory. For example: example.jar -> META-INF -> com/... -> lib/postgres.jar -> lib/abc.jar If your classpath is correct, check whether you call DBConfiguration.configureDB() with the correct driver class and url. sandhiya wrote: Hi, I'm using postgresql and the driver is not getting detected. How do you run it in the first place? I just typed bin/hadoop jar /root/sandy/netbeans/TableAccess/dist/TableAccess.jar at the terminal without the quotes. I didnt copy any files from my local drives into the Hadoop file system. I get an error like this : java.lang.RuntimeException: java.lang.ClassNotFoundException: org.postgresql.Driver and then the complete stack trace Am i doing something wrong? I downloaded a jar file for postgresql jdbc support and included it in my Libraries folder (I'm using NetBeans). please help Fredrik Hedberg-3 wrote: Hi, Although it's not MySQL; this might be of use: http://svn.apache.org/repos/asf/hadoop/core/trunk/src/examples/org/apache/hadoop/examples/DBCountPageView.java Fredrik On Feb 16, 2009, at 8:33 AM, sandhiya wrote: @Amandeep Hi, I'm new to Hadoop and am trying to run a simple database connectivity program on it. Could you please tell me how u went about it?? my mail id is "sandys_cr...@yahoo.com" . A copy of your code that successfully connected to MySQL will also be helpful. Thanks, Sandhiya Enis Soztutar-2 wrote: From the exception : java.io.IOException: ORA-00933: SQL command not properly ended I would broadly guess that Oracle JDBC driver might be complaining that the statement does not end with ";", or something similar. you can 1. download the latest source code of hadoop 2. add a print statement printing the query (probably in DBInputFormat:119) 3. build hadoop jar 4. use the new hadoop jar to see the actual SQL query 5. run the query on Oracle if is gives an error. Enis Amandeep Khurana wrote: Ok. I created the same database in a MySQL database and ran the same hadoop job against it. It worked. So, that means there is some Oracle specific issue. It cant be an issue with the JDBC drivers since I am using the same drivers in a simple JDBC client. What could it be? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Wed, Feb 4, 2009 at 10:26 AM, Amandeep Khurana wrote: Ok. I'm not sure if I got it correct. Are you saying, I should test the statement that hadoop creates directly with the database? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Wed, Feb 4, 2009 at 7:13 AM, Enis Soztutar wrote: Hadoop-2536 connects to the db via JDBC, so in theory it should work with proper jdbc drivers. It has been tested against MySQL, Hsqldb, and PostreSQL, but not Oracle. To answer your earlier question, the actual SQL statements might not be recognized by Oracle, so I suggest the best way to test this is to insert print statements, and run the actual SQL statements against Oracle to see if the syntax is accepted. We would appreciate if you publish your results. Enis Amandeep Khurana wrote: Does the patch HADOOP-2536 support connecting to Oracle databases as well? Or is it just limited to MySQL? 
Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz -- View this message in context: http://www.nabble.com/HADOOP-2536-supports-Oracle-too--tp21823199p22032715.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
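For reference, the configureDB()/setInput() calls mentioned above look roughly like this. The driver class, URL, table, and column names below are placeholders, and MyRecord is only a minimal stand-in for whatever DBWritable class is actually used:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

public class DbJobSetup {

  // Minimal record class; one column for illustration.
  public static class MyRecord implements Writable, DBWritable {
    long id;
    public void readFields(ResultSet rs) throws SQLException { id = rs.getLong(1); }
    public void write(PreparedStatement ps) throws SQLException { ps.setLong(1, id); }
    public void readFields(DataInput in) throws IOException { id = in.readLong(); }
    public void write(DataOutput out) throws IOException { out.writeLong(id); }
  }

  public static void configure(JobConf conf) {
    conf.setInputFormat(DBInputFormat.class);
    // Driver class and URL must match the database in use, and the driver jar
    // has to be on the task classpath (e.g. under lib/ inside the job jar).
    DBConfiguration.configureDB(conf,
        "org.postgresql.Driver",
        "jdbc:postgresql://dbhost:5432/mydb",   // placeholder URL
        "user", "password");
    // Table and column names are placeholders.
    DBInputFormat.setInput(conf, MyRecord.class, "my_table",
        null /* conditions */, "id" /* orderBy */, "id");
  }
}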
Re: re : How to use MapFile in C++ program
There is currently no way to read MapFiles in any language other than Java. You can write a JNI wrapper similar to libhdfs. Alternatively, you can also write the complete stack from scratch, however this might prove very difficult or impossible. You might want to check the ObjectFile/TFile specifications for which binary compatible reader/writers can be developed in any language : https://issues.apache.org/jira/browse/HADOOP-3315 Enis Anh Vũ Nguyễn wrote: Hi, everybody. I am writing a project in C++ and want to use the power of MapFile class(which belongs to org.apache.hadoop.io) of hadoop. Can you please tell me how can I write code in C++ using MapFile or there is no way to use API org.apache.hadoop.io in c++ (libhdfs only helps with org.apache.hadoop.fs). Thanks in advance!
Re: How to use DBInputFormat?
Please see below, Stefan Podkowinski wrote: As far as i understand the main problem is that you need to create splits from streaming data with an unknown number of records and offsets. Its just the same problem as with externally compressed data (.gz). You need to go through the complete stream (or do a table scan) to create logical splits. Afterwards each map task needs to seek to the appropriate offset on a new stream over again. Very expansive. As with compressed files, no wonder only one map task is started for each .gz file and will consume the complete file. I cannot see an easy way to split the JDBC stream and pass them to nodes. IMHO the DBInputFormat should follow this behavior and just create 1 split whatsoever. Why would we want to limit to 1 splits, which effectively resolves to sequential computation? Maybe a future version of hadoop will allow to create splits/map tasks on the fly dynamically? It is obvious that input residing in one database is not optimal for hadoop, and in any case(even with sharding) DB I/O would be the bottleneck. I guess DBInput/Output formats should be used when data is small but computation is costly. Stefan On Thu, Feb 5, 2009 at 3:28 PM, Fredrik Hedberg wrote: Indeed sir. The implementation was designed like you describe for two reasons. First and foremost to make is as simple as possible for the user to use a JDBC database as input and output for Hadoop. Secondly because of the specific requirements the MapReduce framework brings to the table (split distribution, split reproducibility etc). This design will, as you note, never handle the same amount of data as HBase (or HDFS), and was never intended to. That being said, there are a couple of ways that the current design could be augmented to perform better (and, as in its current form, tweaked, depending on you data and computational requirements). Shard awareness is one way, which would let each database/tasktracker-node execute mappers on data where each split is a single database server for example. If you have any ideas on how the current design can be improved, please do share. Fredrik On Feb 5, 2009, at 11:37 AM, Stefan Podkowinski wrote: The 0.19 DBInputFormat class implementation is IMHO only suitable for very simple queries working on only few datasets. Thats due to the fact that it tries to create splits from the query by 1) getting a count of all rows using the specified count query (huge performance impact on large tables) 2) creating splits by issuing an individual query for each split with a "limit" and "offset" parameter appended to the input sql query Effectively your input query "select * from orders" would become "select * from orders limit offset " and executed until count has been reached. I guess this is not working sql syntax for oracle. Stefan 2009/2/4 Amandeep Khurana : Adding a semicolon gives me the error "ORA-00911: Invalid character" Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Wed, Feb 4, 2009 at 6:46 AM, Rasit OZDAS wrote: Amandeep, "SQL command not properly ended" I get this error whenever I forget the semicolon at the end. I know, it doesn't make sense, but I recommend giving it a try Rasit 2009/2/4 Amandeep Khurana : The same query is working if I write a simple JDBC client and query the database. So, I'm probably doing something wrong in the connection settings. But the error looks to be on the query side more than the connection side. 
Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Tue, Feb 3, 2009 at 7:25 PM, Amandeep Khurana wrote: Thanks Kevin I couldnt get it work. Here's the error I get: bin/hadoop jar ~/dbload.jar LoadTable1 09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 09/02/03 19:21:20 INFO mapred.JobClient: Running job: job_local_0001 09/02/03 19:21:21 INFO mapred.JobClient: map 0% reduce 0% 09/02/03 19:21:22 INFO mapred.MapTask: numReduceTasks: 0 09/02/03 19:21:24 WARN mapred.LocalJobRunner: job_local_0001 java.io.IOException: ORA-00933: SQL command not properly ended at org.apache.hadoop.mapred.lib.db.DBInputFormat.getRecordReader(DBInputFormat.java:289) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138) java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217) at LoadTable1.run(LoadTable1.java:130) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at LoadTable1.main(LoadTable1.java:107) at sun.reflect.NativeMethodAccessorImpl.invoke0(Nativ
Re: HADOOP-2536 supports Oracle too?
From the exception: java.io.IOException: ORA-00933: SQL command not properly ended I would broadly guess that the Oracle JDBC driver might be complaining that the statement does not end with ";", or something similar. You can 1. download the latest source code of hadoop 2. add a print statement printing the query (probably in DBInputFormat:119) 3. build the hadoop jar 4. use the new hadoop jar to see the actual SQL query 5. run the query on Oracle to see if it gives an error. Enis Amandeep Khurana wrote: Ok. I created the same database in a MySQL database and ran the same hadoop job against it. It worked. So, that means there is some Oracle specific issue. It can't be an issue with the JDBC drivers since I am using the same drivers in a simple JDBC client. What could it be? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Wed, Feb 4, 2009 at 10:26 AM, Amandeep Khurana wrote: Ok. I'm not sure if I got it correct. Are you saying, I should test the statement that hadoop creates directly with the database? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Wed, Feb 4, 2009 at 7:13 AM, Enis Soztutar wrote: Hadoop-2536 connects to the db via JDBC, so in theory it should work with proper jdbc drivers. It has been tested against MySQL, Hsqldb, and PostgreSQL, but not Oracle. To answer your earlier question, the actual SQL statements might not be recognized by Oracle, so I suggest the best way to test this is to insert print statements, and run the actual SQL statements against Oracle to see if the syntax is accepted. We would appreciate it if you publish your results. Enis Amandeep Khurana wrote: Does the patch HADOOP-2536 support connecting to Oracle databases as well? Or is it just limited to MySQL? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
Re: HADOOP-2536 supports Oracle too?
Hadoop-2536 connects to the db via JDBC, so in theory it should work with proper jdbc drivers. It has been tested against MySQL, Hsqldb, and PostgreSQL, but not Oracle. To answer your earlier question, the actual SQL statements might not be recognized by Oracle, so I suggest the best way to test this is to insert print statements, and run the actual SQL statements against Oracle to see if the syntax is accepted. We would appreciate it if you publish your results. Enis Amandeep Khurana wrote: Does the patch HADOOP-2536 support connecting to Oracle databases as well? Or is it just limited to MySQL? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
Re: how to pass an object to mapper
There are several ways you can pass static information to tasks in Hadoop. The first is to store it in conf via DefaultStringifier, which needs the object to be serialized either through Writable or Serializable interfaces. Second way would be to save/serialize the data to a file and send it via DistributedCache. Another way would be to save the file in the jar, and read from there. forbbs forbbs wrote: It seems that JobConf doesn't help. Do I have to write the object into DFS?
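A small, untested sketch of the first two options (the key name, file path, and the Text payload are just illustrative; DefaultStringifier exists from 0.17 on):

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.io.DefaultStringifier;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class StaticDataExamples {

  // Option 1: small objects go straight into the configuration. The object
  // must be Writable or Serializable; Text is used as a stand-in here.
  public static void storeInConf(JobConf conf) throws IOException {
    DefaultStringifier.store(conf, new Text("small static data"), "my.side.data");
  }

  // In the task, read it back in configure(). A real mapper would also
  // implement Mapper; only the relevant part is shown.
  public static class MyTask extends MapReduceBase {
    private Text sideData;
    public void configure(JobConf conf) {
      try {
        sideData = DefaultStringifier.load(conf, "my.side.data", Text.class);
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }
  }

  // Option 2: larger data is shipped as a file through the DistributedCache;
  // tasks then read the local copy via DistributedCache.getLocalCacheFiles().
  public static void storeInCache(JobConf conf) {
    DistributedCache.addCacheFile(URI.create("/data/side-file.seq"), conf);
  }
}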
Re: Anyone have a Lucene index InputFormat for Hadoop?
I recommend you check nutch's src, which includes classes for Index input/output from mapred. Anthony Urso wrote: Anyone have a Lucene index InputFormat already implemented? Failing that, how about a Writable for the Lucene Document class? Cheers, Anthony
Re: Hadoop with image processing
From my understanding of the problem, you can
- keep the image binary data in sequence files
- copy the image whose similar images will be searched to DFS with high replication
- in each map, calculate the similarity to that image
- output only the similar images from the map
- skip the reduce step; it is not needed.
I am not sure whether splitting the image into 4 and analyzing the parts individually will make any change, since the above algorithm already distributes the computation to all nodes. (A rough sketch of the map step follows below.) Raşit Özdaş wrote: Hi to all, I'm a new subscriber of the group, I started to work on a hadoop-based project. In our application, there are a huge number of images with a regular pattern, differing in 4 parts/blocks. The system takes an image as input and looks for a similar image, considering whether all these 4 parts match. (The system finds all the matches, even after finding one). Each of these parts is independent; the result of each part is computed separately, these are printed on the screen and then an average matching percentage is calculated from them. (I can write more detailed information if needed) Could you suggest a structure? Or any ideas to get a better result? Images can be divided into 4 parts, I see that. But the folder structure of the images is important and I have no idea about that. Images are kept in a DB (can be changed, if a folder structure is better). Are two stages of map-reduce operations better? First, one map-reduce for each image, then a second map-reduce for every part of one image. But as far as I know, the slowest computation slows down the whole operation. This is where I am now. Thanks in advance..
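As a rough illustration of the map-only step sketched above (input as a SequenceFile of image id to image bytes, zero reducers), with the similarity computation and the way the query image reaches the tasks deliberately left as placeholders:

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Runs with conf.setNumReduceTasks(0): whatever the map emits is the result.
public class ImageMatchMapper extends MapReduceBase
    implements Mapper<Text, BytesWritable, Text, Text> {

  private byte[] queryImage;
  private float threshold;

  public void configure(JobConf conf) {
    threshold = conf.getFloat("match.threshold", 0.9f);   // made-up property name
    queryImage = loadQueryImage(conf);                     // e.g. from the DistributedCache
  }

  public void map(Text imageId, BytesWritable imageBytes,
                  OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    float score = similarity(queryImage, imageBytes.getBytes());
    if (score >= threshold) {
      out.collect(imageId, new Text(Float.toString(score)));
    }
  }

  // Placeholders for the real 4-part comparison logic and image loading.
  private byte[] loadQueryImage(JobConf conf) { return new byte[0]; }
  private float similarity(byte[] a, byte[] b) { return 0f; }
}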
Re: 1 file per record
Nope, not right now. But this has come up before. Perhaps you will contribute one? chandravadana wrote: Thanks. Is there any built-in record reader which performs this function? Enis Soztutar wrote: Yes, you can use MultiFileInputFormat. You can extend the MultiFileInputFormat to return a RecordReader, which reads a record for each file in the MultiFileSplit. Enis chandra wrote: hi.. By setting isSplitable false, we can send 1 file with n records to 1 mapper. Is there any way to set 1 complete file per record? Thanks in advance Chandravadana S
Re: Questions about Hadoop
Arijit Mukherjee wrote: Thanx Enis. By workflow, I was trying to mean something like a chain of MapReduce jobs - the first one will extract a certain amount of data from the original set and do some computation resulting in a smaller summary, which will then be the input to a further MR job, and so on...somewhat similar to a workflow as in the SOA world. Yes, you can always chain jobs together to form a final summary. o.a.h.mapred.jobcontrol.JobControl might be interesting for you. Is it possible to use statistical analysis tools such as R (or say PL/R) within MapReduce on Hadoop? As far as I've heard, Greenplum is working on a custom MapReduce engine over their Greenplum database which will also support PL/R procedures. Using R on Hadoop might include some level of custom coding. If you are looking for an ad-hoc tool for data mining, then check Pig and Hive. Enis Arijit Dr. Arijit Mukherjee Principal Member of Technical Staff, Level-II Connectiva Systems (I) Pvt. Ltd. J-2, Block GP, Sector V, Salt Lake Kolkata 700 091, India Phone: +91 (0)33 23577531/32 x 107 http://www.connectivasystems.com -Original Message- From: Enis Soztutar [mailto:[EMAIL PROTECTED] Sent: Wednesday, September 24, 2008 2:57 PM To: core-user@hadoop.apache.org Subject: Re: Questions about Hadoop Hi, Arijit Mukherjee wrote: Hi We've been thinking of using Hadoop for a decision making system which will analyze telecom-related data from various sources to take certain decisions. The data can be huge, of the order of terabytes, and can be stored as CSV files, which I understand will fit into Hadoop as Tom White mentions in the Rough Cut Guide that Hadoop is well suited for records. The question I want to ask is whether it is possible to perform statistical analysis on the data using Hadoop and MapReduce. If anyone has done such a thing, we'd be very interested to know about it. Is it also possible to create a workflow like functionality with MapReduce? Hadoop can handle TB data sizes, and statistical data analysis is one of the perfect things that fit into the mapreduce computation model. You can check what people are doing with Hadoop at http://wiki.apache.org/hadoop/PoweredBy. I think the best way to see if your requirements can be met by Hadoop/mapreduce is to read the Mapreduce paper by Dean et al. Also you might be interested in checking out Mahout, which is a subproject of Lucene. They are doing ML on top of Hadoop. Hadoop is mostly suitable for batch jobs, however these jobs can be chained together to form a workflow. I will try to be more helpful if you could extend what you mean by workflow. Enis Soztutar Regards Arijit Dr. Arijit Mukherjee Principal Member of Technical Staff, Level-II Connectiva Systems (I) Pvt. Ltd. J-2, Block GP, Sector V, Salt Lake Kolkata 700 091, India Phone: +91 (0)33 23577531/32 x 107 http://www.connectivasystems.com
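Since JobControl came up: a minimal, untested sketch of chaining two jobs so that the second one (which consumes the summary) only starts after the first finishes successfully. All job configuration details are omitted and the names are invented:

import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class SummaryWorkflow {
  public static void main(String[] args) throws IOException, InterruptedException {
    JobConf extractConf = new JobConf();   // first pass: extract and summarize
    JobConf reportConf = new JobConf();    // second pass: consume the summary
    // ... set mappers, reducers, input/output paths on both confs ...

    Job extract = new Job(extractConf, new ArrayList<Job>());
    Job report = new Job(reportConf, new ArrayList<Job>());
    report.addDependingJob(extract);       // report waits for extract

    JobControl control = new JobControl("summary-workflow");
    control.addJob(extract);
    control.addJob(report);

    // JobControl is a Runnable; run it in its own thread and poll.
    Thread runner = new Thread(control);
    runner.start();
    while (!control.allFinished()) {
      Thread.sleep(1000);
    }
    control.stop();
  }
}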
Re: 1 file per record
Yes, you can use MultiFileInputFormat. You can extend the MultiFileInputFormat to return a RecordReader, which reads a record for each file in the MultiFileSplit. Enis chandra wrote: hi.. By setting isSplitable false, we can send 1 file with n records to 1 mapper. Is there any way to set 1 complete file per record? Thanks in advance Chandravadana S
Re: Questions about Hadoop
Hi, Arijit Mukherjee wrote: Hi We've been thinking of using Hadoop for a decision making system which will analyze telecom-related data from various sources to take certain decisions. The data can be huge, of the order of terabytes, and can be stored as CSV files, which I understand will fit into Hadoop as Tom White mentions in the Rough Cut Guide that Hadoop is well suited for records. The question I want to ask is whether it is possible to perform statistical analysis on the data using Hadoop and MapReduce. If anyone has done such a thing, we'd be very interested to know about it. Is it also possible to create a workflow like functionality with MapReduce? Hadoop can handle TB data sizes, and statistical data analysis is one of the perfect things that fit into the mapreduce computation model. You can check what people are doing with Hadoop at http://wiki.apache.org/hadoop/PoweredBy. I think the best way to see if your requirements can be met by Hadoop/mapreduce is to read the Mapreduce paper by Dean et.al. Also you might be interested in checking out Mahout, which is a subproject of Lucene. They are doing ML on top of Hadoop. Hadoop is mostly suitable for batch jobs, however these jobs can be chained together to form a workflow. I will try to be more helpful if you could extend what you mean by workflow. Enis Soztutar Regards Arijit Dr. Arijit Mukherjee Principal Member of Technical Staff, Level-II Connectiva Systems (I) Pvt. Ltd. J-2, Block GP, Sector V, Salt Lake Kolkata 700 091, India Phone: +91 (0)33 23577531/32 x 107 http://www.connectivasystems.com
Re: Any InputFormat class implementation for Database records
There is a patch you can try it out and share your xp : https://issues.apache.org/jira/browse/HADOOP-2536 ruchir wrote: Hi, I want to know whether is there any implementation of InputFormat class in Hadoop which can read data from Database instead of from HDFS while processing any hadoop Job. We have application which gets data from user, stores it into HDFS and runs hadoop job on that data. Now, we have users whose data is already stored in database. Is there any such implementation which we can use to read data from database directly? I am aware of one such implementation, HBase, but heard that it is not production ready. Thanks in advance, Ruchir
Re: How I should use hadoop to analyze my logs?
You can use Chukwa, which is a contrib in the trunk, for collecting log entries from web servers. You can run adaptors on the web servers, and a collector on the log server. The log entries may not be analyzed in real time, but it should be close to real time. I suggest you use Pig for the log data analysis. Juho Mäkinen wrote: Hello, I'm looking at how Hadoop could solve our datamining needs and I've come up with a few questions which I haven't found any answer to yet. Our setup contains multiple diskless webserver frontends which generate log data. Each webserver hit generates a UDP packet which contains basically the same info as a normal apache access log line (url, return code, client ip, timestamp etc). The UDP packet is received by a log server. I would want to run map/reduce processes on the log data at the same time as the servers are generating new data. I was planning that each day would have its own file in HDFS which contains all log entries for that day. How should I use hadoop and HDFS to write each log entry to a file? I was planning to create a class which contains the request attributes (url, return code, client ip etc) and use this as the value. I did not find any info on how this could be done with HDFS. The API seems to support arbitrary objects as both key and value, but there was no example of how to do this. How will Hadoop handle the concurrency with the writes and the reads? The servers will generate log entries around the clock. I also want to analyse the log entries at the same time as the servers are generating new data. How can I do this? The HDFS architecture page tells that the client writes the data first into a local file and once the file has reached the block size, the file will be transferred to the HDFS storage nodes and the client writes the following data to another local file. Is it possible to read the blocks already transferred to HDFS using the map/reduce processes and write new blocks to the same file at the same time? Thanks in advance, - Juho Mäkinen
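If you end up writing the collector yourself instead of using Chukwa, appending log records to a per-day SequenceFile with a custom value class could look roughly like this (the field set and paths are invented, and the usual caveat applies that a file still being written was not readable by other clients in HDFS versions of that time):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class LogWriter {

  // Value type holding one access-log entry; trimmed to three fields.
  public static class LogEntry implements Writable {
    Text url = new Text();
    Text clientIp = new Text();
    int returnCode;

    public void write(DataOutput out) throws IOException {
      url.write(out);
      clientIp.write(out);
      out.writeInt(returnCode);
    }
    public void readFields(DataInput in) throws IOException {
      url.readFields(in);
      clientIp.readFields(in);
      returnCode = in.readInt();
    }
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/logs/2009-04-17.seq");   // one file per day, as planned

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, file, LongWritable.class, LogEntry.class);

    LogEntry entry = new LogEntry();
    entry.url.set("/index.html");
    entry.clientIp.set("10.0.0.1");
    entry.returnCode = 200;
    writer.append(new LongWritable(System.currentTimeMillis()), entry);
    writer.close();
  }
}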
Re: MultiFileInputFormat and gzipped files
MultiFileWordCount uses its own RecordReader, namely MultiFileLineRecordReader. This is different from the LineRecordReader, which automatically detects the file's codec and decodes it. You can write a custom RecordReader similar to LineRecordReader and MultiFileLineRecordReader, or just add codec handling to MultiFileLineRecordReader. Michele Catasta wrote: Hi all, I'm writing some Hadoop jobs that should run on a collection of gzipped files. Everything is already working correctly with MultiFileInputFormat and an initial step of gunzip extraction. Considering that Hadoop recognizes and handles .gz files correctly (at least with a single file input), I was wondering if it's able to do the same with file collections, such that I avoid the overhead of sequential file extraction. I tried to run the multi file WordCount example with a bunch of gzipped text files (0.17.1 installation), and I get a wrong output (neither correct nor empty). With my own InputFormat (not really different from the one in multifilewc), I got no output at all (map input record counter = 0). Is it a desired behavior? Are there some technical reasons why it's not working in a multi file scenario? Thanks in advance for the help. Regards, Michele Catasta
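The per-file open step inside such a record reader can detect the codec from the file name the same way LineRecordReader does. A small sketch, independent of the actual record format:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Open one file of a MultiFileSplit: look up the codec by file name
// (.gz etc.) and wrap the stream so records come out decompressed.
public class CodecAwareOpen {
  public static BufferedReader open(Configuration conf, Path file) throws IOException {
    FileSystem fs = file.getFileSystem(conf);
    InputStream in = fs.open(file);
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
    if (codec != null) {
      in = codec.createInputStream(in);   // decompress transparently
    }
    return new BufferedReader(new InputStreamReader(in));
  }
}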
Re: MultiFileInputFormat - Not enough mappers
Yes, please open a jira for this. We should ensure that avgLengthPerSplit in MultiFileInputFormat does not exceed the default file block size. However, unlike FileInputFormat, the files in a split may each come from a different block. Goel, Ankur wrote: In this case I have to compute the number of map tasks in the application - (totalSize / blockSize), which is what I am doing as a work-around. I think this should be the default behaviour in MultiFileInputFormat. Should a JIRA be opened for the same? -Ankur -Original Message- From: Enis Soztutar [mailto:[EMAIL PROTECTED] Sent: Friday, July 11, 2008 7:21 PM To: core-user@hadoop.apache.org Subject: Re: MultiFileInputFormat - Not enough mappers MultiFileSplit currently does not support automatic map task count computation. You can manually set the number of maps via jobConf#setNumMapTasks() or via command line arg -D mapred.map.tasks= Goel, Ankur wrote: Hi Folks, I am using hadoop to process some temporal data which is split into a lot of small files (~ 3 - 4 MB) Using TextInputFormat resulted in too many mappers (1 per file) creating a lot of overhead so I switched to MultiFileInputFormat - (MultiFileWordCount.MyInputFormat) which resulted in just 1 mapper. I was hoping to set the no of mappers to 1 so that hadoop automatically takes care of generating the right number of map tasks. Looks like when using MultiFileInputFormat one has to rely on the application to specify the right number of mappers or am I missing something? Please advise. Thanks -Ankur
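The work-around mentioned above, sketched out (it assumes a flat input directory and FileSystem.listStatus(), so treat it as illustrative only):

import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class MapCount {
  // Size the map count from total input bytes / block size instead of
  // relying on MultiFileInputFormat's default.
  public static void setNumMaps(JobConf conf, Path inputDir) throws IOException {
    FileSystem fs = inputDir.getFileSystem(conf);
    long totalSize = 0;
    for (FileStatus stat : fs.listStatus(inputDir)) {
      totalSize += stat.getLen();
    }
    long blockSize = fs.getDefaultBlockSize();
    conf.setNumMapTasks((int) Math.max(1, totalSize / blockSize));
  }
}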
Re: MultiFileInputFormat - Not enough mappers
MultiFileSplit currently does not support automatic map task count computation. You can manually set the number of maps via jobConf#setNumMapTasks() or via command line arg -D mapred.map.tasks= Goel, Ankur wrote: Hi Folks, I am using hadoop to process some temporal data which is split in lot of small files (~ 3 - 4 MB) Using TextInputFormat resulted in too many mappers (1 per file) creating a lot of overhead so I switched to MultiFileInputFormat - (MutiFileWordCount.MyInputFormat) which resulted in just 1 mapper. I was hoping to set the no of mappers to 1 so that hadoop automatically takes care of generating the right number of map tasks. Looks like when using MultiFileInputFormat one has to rely on the application to specify the right number of mappers or am I missing something ? Please advise. Thanks -Ankur
Re: edge count question
Cam Bazz wrote: Hello, When I use an custom input format, as in the nutch project - do I have to keep my index in DFS, or regular file system? You have to ensure that your indexes are accessible by the map/reduce tasks, ie. by using hdfs, s3, nfs, kfs, etc. By the way, are there any alternatives to nutch? yes, of course. There are all sorts of open source crawlers / indexers. Best Regards -C.B. On Fri, Jun 27, 2008 at 10:08 AM, Enis Soztutar <[EMAIL PROTECTED]> wrote: Cam Bazz wrote: hello, I have a lucene index storing documents which holds src and dst words. word pairs may repeat. (it is a multigraph). I want to use hadoop to count how many of the same word pairs there are. I have looked at the aggregateword count example, and I understand that if I make a txt file such as src1>dst2 src2>dst2 src1>dst2 .. and use something similar to the aggregate word count example, I will get the result desired. Now questions. how can I hookup my lucene index to hadoop. is there a better way then dumping the index to a text file with >'s, copying this to dfs and getting the results back? Yes, you can implement an InputFormat to read from the lucene index. You can use the implementation in the nutch project, the classes DeleteDuplicates$InputFormat, DeleteDuplicates$DDRecordReader. how can I make incremental runs? (once the index processed and I got the results, how can I dump more data onto it so it does not start from beginning) As far as i know, there is no easy way for this. Why do you keep your data as a lucene index? Best regards, -C.B.
Re: edge count question
Cam Bazz wrote: hello, I have a lucene index storing documents which holds src and dst words. word pairs may repeat. (it is a multigraph). I want to use hadoop to count how many of the same word pairs there are. I have looked at the aggregateword count example, and I understand that if I make a txt file such as src1>dst2 src2>dst2 src1>dst2 .. and use something similar to the aggregate word count example, I will get the result desired. Now questions. how can I hookup my lucene index to hadoop. is there a better way then dumping the index to a text file with >'s, copying this to dfs and getting the results back? Yes, you can implement an InputFormat to read from the lucene index. You can use the implementation in the nutch project, the classes DeleteDuplicates$InputFormat, DeleteDuplicates$DDRecordReader. how can I make incremental runs? (once the index processed and I got the results, how can I dump more data onto it so it does not start from beginning) As far as i know, there is no easy way for this. Why do you keep your data as a lucene index? Best regards, -C.B.
Re: Hadoop supports RDBMS?
Yes, there is a way to use DBMS over JDBC. The feature is not realeased yet, but you can try it out, and give valuable feedback to us. You can find the patch and the jira issue at : https://issues.apache.org/jira/browse/HADOOP-2536 Lakshmi Narayanan wrote: Has anyone tried using any RDBMS with the hadoop? If the data is stored in the database is there any way we can use the mapreduce with the database instead of the HDFS?
Re: How long is Hadoop full unit test suit expected to run?
Just standard box like yours (ubuntu). I suspect something must be wrong. Could you determine which tests take long time. On hudson, it seems that tests take 1:30 on average. http://hudson.zones.apache.org/hudson/view/Hadoop/ Lukas Vlcek wrote: Hi, I am aware of HowToContrib wiki page but in my case [ant test] takes more then one hour. I can not tell you how much time it takes because I always stopped it after 4-5 hours... I was running these test on notebook Dell, dual core 1GB of RAM, Windows XP. I haven't tried it now after switching to Ubuntu but I think this should not make a big difference. What kind of HW are you using for Hadoop testing? I would definitely appreciate if [ant test] runs under an hour. Regards, Lukas On Tue, Jun 17, 2008 at 10:26 AM, Enis Soztutar <[EMAIL PROTECTED]> wrote: Lukas Vlcek wrote: Hi, How long is Hadoop full unit test suit expected to run? How do you go about running Hadoop tests? I found that it can take hours for [ant test] target to run which does not seem to be very efficient for development. Is there anything I can do to speed up tests (like running Hadoop in a real cluster)? Hi, Yes, ant test can take up to an hour. On my machine it completes in less than an hour. Previously work has been done to reduce the time that tests take, but some of the tests take a long time by nature such as testing dfs balance, etc. There is an issue to implement ant test-core as a mapred job so that it can be submitted to a cluster. That would help a lot. Say I would like to fix a bug in Hadoop in ABC.java. Is it OK if I execute just ABCTest.java (if available) for the development phase before the patch is attached to JIRA ticket? I don't expect the answer to this question is positive but I can not think of better workaround for now... This very much depends on the patch. If you are "convinced" by running TestABC, that the patch would be OK, then you can go ahead and submit patch to hudson for QA testing. However, since the resources at Hudson is limited, please do not use it for "regular" tests. As a side note, please run ant test-patch to check your patch. http://wiki.apache.org/hadoop/HowToContribute http://wiki.apache.org/hadoop/CodeReviewChecklist Enis Regards, Lukas
Re: How long is Hadoop full unit test suit expected to run?
Lukas Vlcek wrote: Hi, How long is Hadoop full unit test suit expected to run? How do you go about running Hadoop tests? I found that it can take hours for [ant test] target to run which does not seem to be very efficient for development. Is there anything I can do to speed up tests (like running Hadoop in a real cluster)? Hi, Yes, ant test can take up to an hour. On my machine it completes in less than an hour. Previously work has been done to reduce the time that tests take, but some of the tests take a long time by nature such as testing dfs balance, etc. There is an issue to implement ant test-core as a mapred job so that it can be submitted to a cluster. That would help a lot. Say I would like to fix a bug in Hadoop in ABC.java. Is it OK if I execute just ABCTest.java (if available) for the development phase before the patch is attached to JIRA ticket? I don't expect the answer to this question is positive but I can not think of better workaround for now... This very much depends on the patch. If you are "convinced" by running TestABC, that the patch would be OK, then you can go ahead and submit patch to hudson for QA testing. However, since the resources at Hudson is limited, please do not use it for "regular" tests. As a side note, please run ant test-patch to check your patch. http://wiki.apache.org/hadoop/HowToContribute http://wiki.apache.org/hadoop/CodeReviewChecklist Enis Regards, Lukas
Re: non-static map or reduce classes?
Hi, Static inner classes and static fields are different things in Java. Hadoop needs to instantiate the Mapper and Reducer classes from their class names, so if they are defined as inner classes, they need to be static. You can either declare the inner classes to be static, and use the class normally (accessing non-static data), or refactor the Mapper and Reducer classes into their own files. Deyaa Adranale wrote: hello. why do we have to set the map and reduce classes as static? I need to access some data inside them which is not static. what should I do? Non-static map or reduce classes generate the following exception: java.lang.RuntimeException: java.lang.NoSuchMethodException: hadoop.examples.Simple$MyMapper.<init>() at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:45) at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:32) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:53) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:209) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1210) Caused by: java.lang.NoSuchMethodException: hadoop.examples.Simple$MyMapper.<init>() at java.lang.Class.getConstructor0(Class.java:2705) at java.lang.Class.getDeclaredConstructor(Class.java:1984) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:41) ... 4 more any ideas? thanks in advance, Deyaa
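A sketch of the first option: a static nested mapper that still uses per-job, non-static data by pulling it from the JobConf in configure() (the property name is invented):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class Simple {
  // Static nested class: the framework can instantiate it by class name
  // because it does not need an enclosing Simple instance.
  public static class MyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {

    private String sideData;   // non-static field, set per task

    public void configure(JobConf conf) {
      // Pull the "non-static data" from the job configuration instead of
      // from the enclosing class.
      sideData = conf.get("simple.side.data", "");
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      out.collect(new Text(sideData + ":" + value.toString()), key);
    }
  }
}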
Re: Why ComparableWritable does not take a template?
Hi, WritableComparable uses generics in trunk, but if you use 0.16.x you cannot use that version. WritableComparable is not generified yet due to legacy reasons, but the work is in progress. The problem with your code arises from WritableComparator.newKey(). It seems your object cannot be created by keyClass.newInstance() (no default constructor, I guess). You can implement your own WritableComparator implementation. steph wrote: The example in the javadoc shows that the compareTo() method uses the type of the class instead of the Object type. However the ComparableWritable class does not take any template and therefore it cannot set the template for the class Comparable. Is that a mistake? * public class MyWritableComparable implements WritableComparable { * // Some data * private int counter; * private long timestamp; * * public void write(DataOutput out) throws IOException { * out.writeInt(counter); * out.writeLong(timestamp); * } * * public void readFields(DataInput in) throws IOException { * counter = in.readInt(); * timestamp = in.readLong(); * } * * public int compareTo(MyWritableComparable w) { * int thisValue = this.value; * int thatValue = ((IntWritable)o).value; * return (thisValue < thatValue ? -1 : (thisValue==thatValue ? 0 : 1)); * } * } If I try to do what the example shows it does not compile: [javac] /Users/steph/Work/Rinera/TRUNK/vma/hadoop/parser-hadoop/apps/src/com/rinera/hadoop/weblogs/SummarySQLKey.java:13: com.rinera.hadoop.weblogs.SummarySQLKey is not abstract and does not override abstract method compareTo(java.lang.Object) in java.lang.Comparable [javac] public class SummarySQLKey If I don't use the type but instead use the Object type for compareTo() I get a RuntimeException: java.lang.RuntimeException: java.lang.InstantiationException: com.rinera.hadoop.weblogs.SummarySQLKey at org.apache.hadoop.io.WritableComparator.newKey(WritableComparator.java:75) at org.apache.hadoop.io.WritableComparator.<init>(WritableComparator.java:63) at org.apache.hadoop.io.WritableComparator.get(WritableComparator.java:42) at org.apache.hadoop.mapred.JobConf.getOutputKeyComparator(JobConf.java:645) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:313) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:174) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157) Caused by: java.lang.InstantiationException: com.rinera.hadoop.weblogs.SummarySQLKey at java.lang.Class.newInstance0(Class.java:335) at java.lang.Class.newInstance(Class.java:303) at org.apache.hadoop.io.WritableComparator.newKey(WritableComparator.java:73) ... 6 more
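To make the failure concrete: WritableComparator.newKey() creates the key reflectively, so the key class needs a public no-argument constructor, and on 0.16.x compareTo() has to accept Object because WritableComparable is not generic there. A sketch of a working key, with an optional raw comparator registered for it (the single timestamp field is invented):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class SummaryKey implements WritableComparable {
  private long timestamp;

  // WritableComparator.newKey() calls keyClass.newInstance(), so a public
  // no-argument constructor is required.
  public SummaryKey() {}
  public SummaryKey(long timestamp) { this.timestamp = timestamp; }

  public void write(DataOutput out) throws IOException { out.writeLong(timestamp); }
  public void readFields(DataInput in) throws IOException { timestamp = in.readLong(); }

  // On 0.16.x WritableComparable is not generic, so compareTo takes Object.
  public int compareTo(Object o) {
    long other = ((SummaryKey) o).timestamp;
    return timestamp < other ? -1 : (timestamp == other ? 0 : 1);
  }

  public int hashCode() { return (int) (timestamp ^ (timestamp >>> 32)); }
  public boolean equals(Object o) {
    return o instanceof SummaryKey && ((SummaryKey) o).timestamp == timestamp;
  }

  // Optional: a raw comparator so keys can be compared on the serialized
  // bytes without going through newKey() at all.
  public static class Comparator extends WritableComparator {
    public Comparator() { super(SummaryKey.class); }
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      long t1 = readLong(b1, s1);
      long t2 = readLong(b2, s2);
      return t1 < t2 ? -1 : (t1 == t2 ? 0 : 1);
    }
  }
  static {
    WritableComparator.define(SummaryKey.class, new Comparator());
  }
}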
Re: JobConf: How to pass List/Map
It is exactly what DefaultStringifier does, ugly but useful *smile*. Jason Venner wrote: We have been serializing to a bytearrayoutput stream then base64 encoding the underlying byte array and passing that string in the conf. It is ugly but it works well until 0.17 Enis Soztutar wrote: Yes Stringifier was committed in 0.17. What you can do in 0.16 is to simulate DefaultStringifier. The key feature of the Stringifier is that it can convert/restore any object to string using base64 encoding on the binary form of the object. If your objects can be easily converted to and from strings, then you can directly store them in conf. The other obvious alternative would be to switch to 0.17, once it is out. Tarandeep Singh wrote: On Wed, Apr 30, 2008 at 5:11 AM, Enis Soztutar <[EMAIL PROTECTED]> wrote: Hi, There are many ways which you can pass objects using configuration. Possibly the easiest way would be to use Stringifier interface. you can for example : DefaultStringifier.store(conf, variable ,"mykey"); variable = DefaultStringifier.load(conf, "mykey", variableClass ); thanks... but I am using Hadoop-0.16 and Stringifier is a fix for 0.17 version - https://issues.apache.org/jira/browse/HADOOP-3048 Any thoughts on how to do this in 0.16 version ? thanks, Taran you should take into account that the variable you pass to configuration should be serializable by the framework. That means it must implement Writable of Serializable interfaces. In your particular case, you might want to look at ArrayWritable and MapWritable classes. That said, you should however not pass large objects via configuration, since it can seriously effect job overhead. If the data you want to pass is large, then you should use other alternatives(such as DistributedCache, HDFS, etc). Tarandeep Singh wrote: Hi, How can I set a list or map to JobConf that I can access in Mapper/Reducer class ? The get/setObject method from Configuration has been deprecated and the documentation says - "A side map of Configuration to Object should be used instead." I could not follow this :( Can someone please explain to me how to do this ? Thanks, Taran
Re: JobConf: How to pass List/Map
Yes Stringifier was committed in 0.17. What you can do in 0.16 is to simulate DefaultStringifier. The key feature of the Stringifier is that it can convert/restore any object to string using base64 encoding on the binary form of the object. If your objects can be easily converted to and from strings, then you can directly store them in conf. The other obvious alternative would be to switch to 0.17, once it is out. Tarandeep Singh wrote: On Wed, Apr 30, 2008 at 5:11 AM, Enis Soztutar <[EMAIL PROTECTED]> wrote: Hi, There are many ways which you can pass objects using configuration. Possibly the easiest way would be to use Stringifier interface. you can for example : DefaultStringifier.store(conf, variable ,"mykey"); variable = DefaultStringifier.load(conf, "mykey", variableClass ); thanks... but I am using Hadoop-0.16 and Stringifier is a fix for 0.17 version - https://issues.apache.org/jira/browse/HADOOP-3048 Any thoughts on how to do this in 0.16 version ? thanks, Taran you should take into account that the variable you pass to configuration should be serializable by the framework. That means it must implement Writable of Serializable interfaces. In your particular case, you might want to look at ArrayWritable and MapWritable classes. That said, you should however not pass large objects via configuration, since it can seriously effect job overhead. If the data you want to pass is large, then you should use other alternatives(such as DistributedCache, HDFS, etc). Tarandeep Singh wrote: Hi, How can I set a list or map to JobConf that I can access in Mapper/Reducer class ? The get/setObject method from Configuration has been deprecated and the documentation says - "A side map of Configuration to Object should be used instead." I could not follow this :( Can someone please explain to me how to do this ? Thanks, Taran
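A rough sketch of that simulation for 0.16. It assumes the commons-codec Base64 class already present on Hadoop's classpath; the class and method names are made up:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.commons.codec.binary.Base64;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;

// Poor man's Stringifier for 0.16: serialize a Writable, base64-encode the
// bytes, and put the resulting string into the job configuration.
public class ConfSerializer {

  public static void store(Configuration conf, Writable value, String key)
      throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    value.write(new DataOutputStream(bytes));
    conf.set(key, new String(Base64.encodeBase64(bytes.toByteArray()), "UTF-8"));
  }

  public static void load(Configuration conf, String key, Writable into)
      throws IOException {
    byte[] bytes = Base64.decodeBase64(conf.get(key).getBytes("UTF-8"));
    into.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
  }
}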
Re: JobConf: How to pass List/Map
Hi, There are many ways in which you can pass objects using the configuration. Possibly the easiest way would be to use the Stringifier interface. You can, for example: DefaultStringifier.store(conf, variable, "mykey"); variable = DefaultStringifier.load(conf, "mykey", variableClass); You should take into account that the variable you pass to the configuration should be serializable by the framework. That means it must implement the Writable or Serializable interface. In your particular case, you might want to look at the ArrayWritable and MapWritable classes. That said, you should not pass large objects via the configuration, since it can seriously affect job overhead. If the data you want to pass is large, then you should use other alternatives (such as DistributedCache, HDFS, etc). Tarandeep Singh wrote: Hi, How can I set a list or map to JobConf that I can access in Mapper/Reducer class ? The get/setObject method from Configuration has been deprecated and the documentation says - "A side map of Configuration to Object should be used instead." I could not follow this :( Can someone please explain to me how to do this ? Thanks, Taran
Re: Best practices for handling many small files
A shameless attempt to defend MultiFileInputFormat : A concrete implementation of MultiFileInputFormat is not needed, since every InputFormat relying on MultiFileInputFormat is expected to have its custom RecordReader implementation, thus they need to override getRecordReader(). An implementation which returns (sort of) LineRecordReader is under src/examples/.../MultiFileWordCount. However we may include it if any generic (for example returning SequenceFileRecordReader) implementation pops up. An InputFormat returns many Splits from getSplits(JobConf job, int numSplits), which is the number of maps, not the number of machines in the cluster. Last of all, MultiFileSplit class implements getLocations() method, which returns the files' locations. Thus it's the JT's job to assign tasks to leverage local processing. Coming to the original question, I think #2 is better, if the construction of the sequence file is not a bottleneck. You may, for example, create several sequence files in parallel and use all of them as input w/o merging. Joydeep Sen Sarma wrote: million map processes are horrible. aside from overhead - don't do it if u share the cluster with other jobs (all other jobs will get killed whenever the million map job is finished - see https://issues.apache.org/jira/browse/HADOOP-2393) well - even for #2 - it begs the question of how the packing itself will be parallelized .. There's a MultiFileInputFormat that can be extended - that allows processing of multiple files in a single map job. it needs improvement. For one - it's an abstract class - and a concrete implementation for (at least) text files would help. also - the splitting logic is not very smart (from what i last saw). ideally - it should take the million files and form it into N groups (say N is size of your cluster) where each group has files local to the Nth machine and then process them on that machine. currently it doesn't do this (the groups are arbitrary). But it's still the way to go .. -Original Message- From: [EMAIL PROTECTED] on behalf of Stuart Sierra Sent: Wed 4/23/2008 8:55 AM To: core-user@hadoop.apache.org Subject: Best practices for handling many small files Hello all, Hadoop newbie here, asking: what's the preferred way to handle large (~1 million) collections of small files (10 to 100KB) in which each file is a single "record"? 1. Ignore it, let Hadoop create a million Map processes; 2. Pack all the files into a single SequenceFile; or 3. Something else? I started writing code to do #2, transforming a big tar.bz2 into a BLOCK-compressed SequenceFile, with the file names as keys. Will that work? Thanks, -Stuart, altlaw.org
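For option #2, a minimal packing tool keyed by file name might look like the following (reading from a local directory for simplicity; as noted above, several such files can be built in parallel and all used as input):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Pack a local directory of small files into one BLOCK-compressed
// SequenceFile, keyed by file name, so a single map no longer means a
// single small file.
public class SmallFilePacker {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[1]);

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);

    for (File f : new File(args[0]).listFiles()) {
      byte[] data = new byte[(int) f.length()];
      FileInputStream in = new FileInputStream(f);
      try {
        int off = 0;
        while (off < data.length) {
          int n = in.read(data, off, data.length - off);
          if (n < 0) break;
          off += n;
        }
      } finally {
        in.close();
      }
      writer.append(new Text(f.getName()), new BytesWritable(data));
    }
    writer.close();
  }
}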
Re: I need your help sincerely!
Hi, The number of map tasks is supposed to be greater than the number of machines, so in your configuration, 6 map tasks is ok. However there should be another problem. Have you changed the code for word count? Please ensure that the example code is unchanged and your configuration is right. Also you may want to try the latest stable release, which is 0.16.3 at this point. wangxiaowei wrote: hello, I am a Chinese user.I am using hadoop-0.15.3 now.The problem is that:I install hadoop with three nodes.One is taken as NameNode and JobTracker.The three are all taken as slaves. It dfs runs normally.I use your example of wordcount,the instruction is:bin/hadoop jar hadoop-0.15.3-examples.jar wordcount -m 6 -r 2 inputfile outoutdir then I submit it.when it runs,the map tasks are all completes,but the reduce task do not run at all! I checked the log of JobTracker,finding the error: at java.util.Hashtable.get(Hashtable.java:334) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.fetchOutput(ReduceTask.java:966) at org.apache.hadoop.mapred.ReduceTask.run at org.apache.hadoop.mapred.TaskTracker$Child.main() the machine just stoped there,I have to ctrl+c to stop the program. but when I use this instructon:bin/hadoop jar hadoop-0.15.3-examples.jar wordcount -m 2 -r 1 inputfile outoutdir it runs successfully! I do not konw why! I think the numbers of machines should be greater than mapTasks,so i set it 6.my machines number is 3. Looking forward for your reply! 4.22,2008 xiaowei
Re: small sized files - how to use MultiInputFileFormat
Hi, An example extracting one record per file would be :

public class FooInputFormat extends MultiFileInputFormat {
  @Override
  public RecordReader getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new FooRecordReader(job, (MultiFileSplit) split);
  }
}

public static class FooRecordReader implements RecordReader {
  private MultiFileSplit split;
  private long offset;
  private long totLength;
  private FileSystem fs;
  private int count = 0;
  private Path[] paths;

  public FooRecordReader(Configuration conf, MultiFileSplit split) throws IOException {
    this.split = split;
    fs = FileSystem.get(conf);
    this.paths = split.getPaths();
    this.totLength = split.getLength();
    this.offset = 0;
  }

  public WritableComparable createKey() { .. }

  public Writable createValue() { .. }

  public void close() throws IOException { }

  public long getPos() throws IOException { return offset; }

  public float getProgress() throws IOException {
    return ((float) offset) / split.getLength();
  }

  public boolean next(Writable key, Writable value) throws IOException {
    if (offset >= totLength) return false;
    if (count >= split.numPaths()) return false;

    Path file = paths[count];
    FSDataInputStream stream = fs.open(file);
    BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
    Scanner scanner = new Scanner(reader.readLine());
    // read from file, fill in key and value
    reader.close();
    stream.close();

    offset += split.getLength(count);
    count++;
    return true;
  }
}

I guess, I should add an example code to the mapred tutorial, and examples directory. Jason Curtes wrote: Hello, I have been trying to run Hadoop on a set of small text files, not larger than 10k each. The total input size is 15MB. If I try to run the example word count application, it takes about 2000 seconds, more than half an hour to complete. However, if I merge all the files into one large file, it takes much less than a minute. I think using MultiInputFileFormat can be helpful at this point. However, the API documentation is not really helpful. I wonder if MultiInputFileFormat can really solve my problem, and if so, can you suggest me a reference on how to use it, or a few lines to be added to the word count example to make things more clear? Thanks in advance. Regards, Jason Curtes
Re: Hadoop summit video capture?
+1 Otis Gospodnetic wrote: Hi, Wasn't there going to be a live stream from the Hadoop summit? I couldn't find any references on the event site/page, and searches on veoh, youtube and google video yielded nothing. Is an archived version of the video (going to be) available? Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: [core-user] Processing binary files Howto??
Hi, please see below, Ted Dunning wrote: This sounds very different from your earlier questions. If you have a moderate (10's to 1000's) number of binary files, then it is very easy to write a special purpose InputFormat that tells hadoop that the file is not splittable. @ Ted, actually we have MultiFileInputFormat and MultiFileSplit for exactly this :) @ Alfonso, The core of the hadoop does not care about the source of the data(such as files, database, etc). The map and reduce functions operate on records which are just key value pairs. The job of the InputFormat/InputSplit/RecordReader interfaces is to map the actual data source to records. So, if a file contains a few records and no records is split among two files and the total number of files is in the order of ten thousands, you can extend MultiFileInputFormat to return a Records reader which extracts records from these binary files. If the above does not apply, you can concatenate all the files into a smaller number of files, then use FileInputFormat. Then your RecordReader implementation is responsible for finding the record boundaries and extracting the records. In both options, storing the files in DFS and using map-red is a wise choice, since mapred over dfs already has locality optimizations. But if you must you can distribute the files to the nodes manually, and implement an ad-hock Partitioner which ensures the map task is executed on the node that has the relevant files. This allows you to add all of the files as inputs to the map step and you will get the locality that you want. The files should be large enough so that you take at least 10 seconds or more processing them to get good performance relative to startup costs. If they are not, then you may want to package them up in a form that can be read sequentially. This need not be splittable, but it would be nice if it were. If you are producing a single file per hour, then this style works pretty well. In my own work, we have a few compressed and encrypted files each hour that are map-reduced into a more congenial and splittable form each hour. Then subsequent steps are used to aggregate or process the data as needed. This gives you all of the locality that you were looking for. On 3/17/08 6:19 AM, "Alfonso Olias Sanz" <[EMAIL PROTECTED]> wrote: Hi there. After reading a bit of the hadoop framework and trying the WordCount example. I have several doubts about how to use map /reduce with binary files. In my case binary files are generated in a time line basis. Let's say 1 file per hour. The size of each file is different (briefly we are getting pictures from space and the stars density is different between observations). The mappers, rather than receiving the file content. They have to receive the file name. I read that if the input files are big (several blocks), they are split among several tasks in same/different node/s (block sizes?). But we want each map task processes a file rather than a block (or a line of a file as in the WordCount sample). In a previous post I did to this forum. I was recommended to use an input file with all the file names, so the mappers would receive the file name. But there is a drawback related with data location (also was mentioned this), because data then has to be moved from one node to another. Data is not going to be replicated to all the nodes. So a task taskA that has to process fileB on nodeN, it has to be executed on nodeN. How can we achive that??? What if a task requires a file that is on other node. 
Does the framework move the logic to that node? We need to define a URI file map on each node (hostname/path/filename) for all the files. Tasks would access the local URI file map in order to process the files.

Another approach we have thought of is to use the distributed file system to load-balance the data among the nodes, and to have our own processes running on every node (without using the map/reduce framework). Then each process has to access the local node to process the data, using the dfs API (or checking the local URI file map). This approach would be more flexible for us, because depending on the machine (quad-core, dual-core) we know how many Java threads we can run to get the maximum performance out of the machine. Using the framework we can only specify a number of tasks to be executed on every node, and then all the nodes have to be the same.

About the URI file map: once the files are copied to the distributed file system, we need to create this map. Or is there a way to ask a datanode for the files it handles, rather than listing all the files on all the nodes? I.e.

NodeA /tmp/.../mytask/input/fileA-1 /tmp/.../mytask/input/fileA-2
NodeB /tmp/.../mytask/input/fileB

A process at nodeB listing the /tmp/.../input directory would get only fileB.

Any ideas? Thanks Alfonso.
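[Editor's sketch] A minimal illustration of the non-splittable approach mentioned above, written against the old org.apache.hadoop.mapred API of that era. The class names WholeFileInputFormat and WholeFileRecordReader are made up for this example (they are not part of Hadoop or of this thread); the sketch simply marks each input file as unsplittable and emits one (file name, file bytes) record per file.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path filename) {
    return false; // each binary file becomes exactly one split
  }

  @Override
  public RecordReader<Text, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new WholeFileRecordReader((FileSplit) split, job);
  }

  /** Emits a single (file name, file contents) record per split. */
  static class WholeFileRecordReader implements RecordReader<Text, BytesWritable> {
    private final FileSplit split;
    private final Configuration conf;
    private boolean done = false;

    WholeFileRecordReader(FileSplit split, Configuration conf) {
      this.split = split;
      this.conf = conf;
    }

    public boolean next(Text key, BytesWritable value) throws IOException {
      if (done) {
        return false;
      }
      Path path = split.getPath();
      FileSystem fs = path.getFileSystem(conf);
      byte[] contents = new byte[(int) split.getLength()];
      FSDataInputStream in = fs.open(path);
      try {
        in.readFully(0, contents); // read the whole file into memory
      } finally {
        in.close();
      }
      key.set(path.toString());
      value.set(contents, 0, contents.length);
      done = true;
      return true;
    }

    public Text createKey() { return new Text(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return done ? split.getLength() : 0; }
    public float getProgress() { return done ? 1.0f : 0.0f; }
    public void close() { }
  }
}

The map function then receives the file name as the key and the file bytes as the value, which matches the "each map task processes one file" requirement, at the cost of loading each file fully into memory.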
Re: displaying intermediate results of map/reduce
You can also run the job in local mode with zero reducers, so that the map output is the output of the job.

Prasan Ary wrote: Hi All, I am using eclipse to write a map/reduce java application that connects to hadoop on a remote cluster. Is there a way I can display intermediate results of map (or reduce), much the same way as I would use System.out.println(variable_name) if I were running the application on a single machine? thx, Prasan.
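[Editor's sketch] A minimal illustration of that suggestion, using the old org.apache.hadoop.mapred API. The class name MapOnlyDebugJob is made up for this example and the exact configuration keys and helper methods varied a bit between releases; the idea is simply to run with the LocalJobRunner, use the stock IdentityMapper, and set the number of reduces to zero, so the raw map output lands directly in the output directory where it can be inspected by hand.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MapOnlyDebugJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapOnlyDebugJob.class);
    conf.set("mapred.job.tracker", "local");  // run with the LocalJobRunner
    conf.set("fs.default.name", "file:///");  // read and write the local file system
    conf.setNumReduceTasks(0);                // map output becomes the job output
    conf.setMapperClass(IdentityMapper.class);
    conf.setOutputKeyClass(LongWritable.class); // key/value types of the default TextInputFormat
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

Swapping IdentityMapper for your own mapper lets you dump its output verbatim; System.out.println in the mapper also goes to the local console in this mode.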
Re: Nutch Extensions to MapReduce
Naama Kraus wrote: OK. Let me try an example. Say my map maps a person's name to a child's name. If a person "Dan" has more than one child, a bunch of <person, child> pairs will be produced, right? Now say I have two different information needs: 1. Get a list of all children's names for each person. 2. Get the number of children of each person. I could run two different MapReduce jobs, with the same map but different reducers: 1. emits <p, lc> pairs, where p is the person and lc is a concatenation of his children's names. 2. emits <p, n> pairs, where p is the person and n is the number of children.

No, you cannot have more than one type of reducer in one job. But yes, you can write more than one file as the result of the reduce phase, which is what I wanted to explain by pointing to ParseOutputFormat, which writes ParseText and ParseData to different MapFiles at the end of the reduce step. So this is done by implementing OutputFormat + RecordWriter (given a resulting record from the reduce, write separate parts of it to different files). Does that make any sense by now?

Now, my question is whether I can save the two jobs and have a single one which emits both types of pairs - <p, lc> and <p, n>. In separate files, probably. This way I gain one pass over the input files instead of two (or more, if I had more output types ...).

Actually, for this scenario you do not even need two different files with <p, lc> and <p, n>. You can just compute <p, children> records which also contain the number of the children (the value is a list, for example an ArrayWritable, containing the children's names); see the reducer sketch after this message.

If not, that's also fine, I was just curious :-) Naama

On Thu, Mar 6, 2008 at 3:58 PM, Enis Soztutar <[EMAIL PROTECTED]> wrote: Let me explain this more technically :) An MR job takes <k1, v1> pairs. Each map(k1, v1) may result in <k2, v2>* pairs. So at the end of the map stage, the output will be of the form <k2, list(v2)> pairs. The reduce takes <k2, list(v2)> pairs and emits <k3, v3>* pairs, where k1, k2, k3, v1, v2, v3 are all types. I cannot understand what you meant by "if a MapReduce job could output multiple files each holding different pairs". The resulting segment directories after a crawl contain subdirectories (like crawl_generate, content, etc.), but these are generated one by one in several jobs running sequentially (and sometimes by the same job, see ParseOutputFormat in nutch). You can refer further to the OutputFormat and RecordWriter interfaces for specific needs. For each split in the reduce phase a different output file will be generated, but all the records in the files have the same type. However, in some cases, using GenericWritable or ObjectWritable, you can wrap different types of keys and values. Hope it helps, Enis

Naama Kraus wrote: Well, I was not actually thinking of using Nutch. To be concrete, I was interested in whether a MapReduce job could output multiple files, each holding different pairs. I got the impression this is done in Nutch from slide 15 of http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf but maybe I was misunderstanding. Is it Nutch specific, or achievable using the Hadoop API? Would multiple different reducers do the trick? Thanks for offering to help; I might have more concrete details of what I am trying to implement later on, now I am basically learning. Naama

On Thu, Mar 6, 2008 at 3:13 PM, Enis Soztutar <[EMAIL PROTECTED]> wrote: Hi, Currently nutch is a fairly complex application that *uses* hadoop as a base for distributed computing and storage. In this regard there is no part in nutch that "extends" hadoop.
The core of MapReduce indeed does work with key/value pairs, and nutch uses its own specific pair types. So, long story short, it depends on what you want to build. If you are working on something that is not related to nutch, you do not need it. You can give further info about your project if you want extended help. Best wishes, Enis

Naama Kraus wrote: Hi, I've seen in http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf (slide 12) that Nutch has extensions to MapReduce. I wanted to ask whether these are part of the Hadoop API or inside Nutch only. More specifically, I saw in http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf (slide 15) that MapReduce outputs two files, each holding different pairs. I'd be curious to know if I can achieve that using the standard API. Thanks, Naama
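[Editor's sketch] To illustrate the point about computing both answers in one pass, here is a small sketch against the old org.apache.hadoop.mapred API; the class name ChildrenReducer and the tab-separated output layout are made up for this example, not something from the thread. The reducer emits, for each person, a single record that carries both the child count and the concatenated child names.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ChildrenReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text person, Iterator<Text> children,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    StringBuilder names = new StringBuilder();
    int count = 0;
    while (children.hasNext()) {
      if (count > 0) {
        names.append(',');
      }
      names.append(children.next().toString());
      count++;
    }
    // One record per person: "<count>\t<child1,child2,...>", so both the
    // list and the number of children come out of a single job.
    output.collect(person, new Text(count + "\t" + names.toString()));
  }
}

A consumer of the output can read the count directly or split the second field to recover the list, so no second job or second output file is needed.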
Re: Nutch Extensions to MapReduce
Let me explain this more technically :) An MR job takes <k1, v1> pairs. Each map(k1, v1) may result in <k2, v2>* pairs. So at the end of the map stage, the output will be of the form <k2, list(v2)> pairs. The reduce takes <k2, list(v2)> pairs and emits <k3, v3>* pairs, where k1, k2, k3, v1, v2, v3 are all types. I cannot understand what you meant by "if a MapReduce job could output multiple files each holding different pairs". The resulting segment directories after a crawl contain subdirectories (like crawl_generate, content, etc.), but these are generated one by one in several jobs running sequentially (and sometimes by the same job, see ParseOutputFormat in nutch). You can refer further to the OutputFormat and RecordWriter interfaces for specific needs. For each split in the reduce phase a different output file will be generated, but all the records in the files have the same type. However, in some cases, using GenericWritable or ObjectWritable, you can wrap different types of keys and values. Hope it helps, Enis

Naama Kraus wrote: Well, I was not actually thinking of using Nutch. To be concrete, I was interested in whether a MapReduce job could output multiple files, each holding different pairs. I got the impression this is done in Nutch from slide 15 of http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf but maybe I was misunderstanding. Is it Nutch specific, or achievable using the Hadoop API? Would multiple different reducers do the trick? Thanks for offering to help; I might have more concrete details of what I am trying to implement later on, now I am basically learning. Naama

On Thu, Mar 6, 2008 at 3:13 PM, Enis Soztutar <[EMAIL PROTECTED]> wrote: Hi, Currently nutch is a fairly complex application that *uses* hadoop as a base for distributed computing and storage. In this regard there is no part in nutch that "extends" hadoop. The core of MapReduce indeed does work with key/value pairs, and nutch uses its own specific pair types. So, long story short, it depends on what you want to build. If you are working on something that is not related to nutch, you do not need it. You can give further info about your project if you want extended help. Best wishes, Enis

Naama Kraus wrote: Hi, I've seen in http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf (slide 12) that Nutch has extensions to MapReduce. I wanted to ask whether these are part of the Hadoop API or inside Nutch only. More specifically, I saw in http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf (slide 15) that MapReduce outputs two files, each holding different pairs. I'd be curious to know if I can achieve that using the standard API. Thanks, Naama
Re: Difference between local mode and distributed mode
Hi, LocalJobRunner uses just 0 or 1 reduce. This is because running in local mode is only supported for testing purposes. However, you can simulate distributed mode locally by using MiniMRCluster and MiniDFSCluster under src/test. Best wishes, Enis

Naama Kraus wrote: Hi, I ran a simple MapReduce job which defines 3 reducers: *conf.setNumReduceTasks(3);* When running on top of HDFS (distributed mode), I got 3 output files as I expected. When running on top of a local file system (local mode), I got 1 file and not 3. My question is whether the behavior in local mode is the expected one, a bug, or whether something needs to be configured in local mode in order to get the 3 files as well? Thanks for any insight, Naama
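[Editor's sketch] A rough illustration of that suggestion follows. Caveat: MiniDFSCluster and MiniMRCluster live in the test sources, their packages and constructor signatures changed across Hadoop versions, and the argument lists below are an assumption rather than a quote from any particular release, so adjust them to the version at hand.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.dfs.MiniDFSCluster; // org.apache.hadoop.hdfs in later versions
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MiniMRCluster;

public class MiniClusterExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Start a 2-datanode DFS and a 2-tasktracker MapReduce cluster in one JVM.
    MiniDFSCluster dfs = new MiniDFSCluster(conf, 2, true, null);
    FileSystem fs = dfs.getFileSystem();
    MiniMRCluster mr = new MiniMRCluster(2, fs.getUri().toString(), 1);
    try {
      JobConf job = mr.createJobConf();
      job.setNumReduceTasks(3); // honored here, unlike with the LocalJobRunner
      // ... set mapper/reducer/input/output paths and submit with JobClient.runJob(job)
    } finally {
      mr.shutdown();
      dfs.shutdown();
    }
  }
}

With a mini cluster the job really runs three reduces, so three part-0000N files appear, matching the behavior seen on a real HDFS cluster.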
Re: Nutch Extensions to MapReduce
Hi, Currently nutch is a fairly complex application that *uses* hadoop as a base for distributed computing and storage. In this regard there is no part in nutch that "extends" hadoop. The core of MapReduce indeed does work with key/value pairs, and nutch uses its own specific pair types. So, long story short, it depends on what you want to build. If you are working on something that is not related to nutch, you do not need it. You can give further info about your project if you want extended help. Best wishes, Enis

Naama Kraus wrote: Hi, I've seen in http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf (slide 12) that Nutch has extensions to MapReduce. I wanted to ask whether these are part of the Hadoop API or inside Nutch only. More specifically, I saw in http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf (slide 15) that MapReduce outputs two files, each holding different pairs. I'd be curious to know if I can achieve that using the standard API. Thanks, Naama
Re: Why no DoubleWritable?
Hi, The reason may be that nobody has so far needed the extra precision of a double enough to justify the extra space, compared to FloatWritable. If you really need DoubleWritable you may write the class, which will be straightforward, and then attach it to a JIRA issue so that we can add it to the core. That's the way open source works, after all. *smile*

Jimmy Lin wrote: Hi guys, What was the design decision behind not implementing a DoubleWritable type that implements WritableComparable? I noticed that there are classes corresponding to all the Java primitives except for double. Thanks in advance, Jimmy
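[Editor's sketch] For illustration, a minimal version of the kind of class being suggested (an assumption of what such a contribution could look like, not the class that may eventually land in the core), mirroring FloatWritable on the old non-generic WritableComparable interface:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

/** A WritableComparable wrapper for a double, in the style of FloatWritable. */
public class DoubleWritable implements WritableComparable {
  private double value;

  public DoubleWritable() { }

  public DoubleWritable(double value) { this.value = value; }

  public void set(double value) { this.value = value; }

  public double get() { return value; }

  public void readFields(DataInput in) throws IOException {
    value = in.readDouble();
  }

  public void write(DataOutput out) throws IOException {
    out.writeDouble(value);
  }

  public int compareTo(Object o) {
    return Double.compare(value, ((DoubleWritable) o).value);
  }

  public boolean equals(Object o) {
    return o instanceof DoubleWritable && ((DoubleWritable) o).value == value;
  }

  public int hashCode() {
    long bits = Double.doubleToLongBits(value);
    return (int) (bits ^ (bits >>> 32));
  }

  public String toString() {
    return Double.toString(value);
  }
}

Attaching such a class (plus a raw-bytes Comparator and tests) to a JIRA issue is exactly the contribution path described above.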
Re: hadoop file system browser
Yes, you can solve the bottleneck by starting a WebDAV server on each client. But that would add the burden of managing those servers, etc., and it may not be the intended use case for WebDAV. We can discuss the architecture further in the relevant issue.

Alban Chevignard wrote: Thanks for the clarification. I agree that running a single WebDAV server for all clients would make it a bottleneck. But I can't see anything in the current WebDAV server implementation that precludes running an instance of it on each client. It seems to me that would solve any bottleneck issue. -Alban

On Jan 23, 2008 2:53 AM, Enis Soztutar <[EMAIL PROTECTED]> wrote: As you know, the DFS client connects to the individual datanodes to read/write data and has minimal interaction with the Namenode, which lets the I/O rate scale linearly (theoretically 1:1). However, the current implementation of the WebDAV interface is just a server running on a single machine, which translates WebDAV requests to the namenode. Thus all the traffic passes through this WebDAV server, which makes it a bottleneck. I was planning to integrate the WebDAV server with the namenode/datanodes and forward requests to the other datanodes, so that we can do I/O in parallel, but my focus on WebDAV has faded for now.

Alban Chevignard wrote: What are the scalability issues associated with the current WebDAV interface? Thanks, -Alban

On Jan 22, 2008 7:27 AM, Enis Soztutar <[EMAIL PROTECTED]> wrote: The WebDAV interface for hadoop works as it is, but it needs a major redesign to be scalable; it is still useful, however. It has even been used with Windows Explorer, defining the WebDAV server as a remote service.

Ted Dunning wrote: There has been significant work on building a WebDAV interface for HDFS. I haven't heard any news for some time, however.

On 1/21/08 11:32 AM, "Dawid Weiss" <[EMAIL PROTECTED]> wrote: The Eclipse plug-in also features a DFS browser. Yep. That's all true. I don't mean to self-promote, because there really isn't that much to advertise ;) I was just quite attached to the file manager-like user interface; the mucommander clone I posted served me as a browser, but also for rudimentary file operations (copying to/from, deleting folders, etc.). In my experience it's been quite handy. It would probably be a good idea to implement a commons-vfs plugin for Hadoop so that the HDFS filesystem is transparent to use for other apps. Dawid
Re: hadoop file system browser
As you know, the DFS client connects to the individual datanodes to read/write data and has minimal interaction with the Namenode, which lets the I/O rate scale linearly (theoretically 1:1). However, the current implementation of the WebDAV interface is just a server running on a single machine, which translates WebDAV requests to the namenode. Thus all the traffic passes through this WebDAV server, which makes it a bottleneck. I was planning to integrate the WebDAV server with the namenode/datanodes and forward requests to the other datanodes, so that we can do I/O in parallel, but my focus on WebDAV has faded for now.

Alban Chevignard wrote: What are the scalability issues associated with the current WebDAV interface? Thanks, -Alban

On Jan 22, 2008 7:27 AM, Enis Soztutar <[EMAIL PROTECTED]> wrote: The WebDAV interface for hadoop works as it is, but it needs a major redesign to be scalable; it is still useful, however. It has even been used with Windows Explorer, defining the WebDAV server as a remote service.

Ted Dunning wrote: There has been significant work on building a WebDAV interface for HDFS. I haven't heard any news for some time, however.

On 1/21/08 11:32 AM, "Dawid Weiss" <[EMAIL PROTECTED]> wrote: The Eclipse plug-in also features a DFS browser. Yep. That's all true. I don't mean to self-promote, because there really isn't that much to advertise ;) I was just quite attached to the file manager-like user interface; the mucommander clone I posted served me as a browser, but also for rudimentary file operations (copying to/from, deleting folders, etc.). In my experience it's been quite handy. It would probably be a good idea to implement a commons-vfs plugin for Hadoop so that the HDFS filesystem is transparent to use for other apps. Dawid
Re: hadoop file system browser
The WebDAV interface for hadoop works as it is, but it needs a major redesign to be scalable; it is still useful, however. It has even been used with Windows Explorer, defining the WebDAV server as a remote service.

Ted Dunning wrote: There has been significant work on building a WebDAV interface for HDFS. I haven't heard any news for some time, however.

On 1/21/08 11:32 AM, "Dawid Weiss" <[EMAIL PROTECTED]> wrote: The Eclipse plug-in also features a DFS browser. Yep. That's all true. I don't mean to self-promote, because there really isn't that much to advertise ;) I was just quite attached to the file manager-like user interface; the mucommander clone I posted served me as a browser, but also for rudimentary file operations (copying to/from, deleting folders, etc.). In my experience it's been quite handy. It would probably be a good idea to implement a commons-vfs plugin for Hadoop so that the HDFS filesystem is transparent to use for other apps. Dawid