Re: Changing the Java heap
HDFS doesn't care about the contents of the file; the file simply gets divided into 64 MB blocks. For example, if your input file contains data in a custom format (like paragraphs) and you want it split per paragraph, HDFS isn't responsible for that - and rightly so. The application developer needs to use a custom InputFormat, which internally uses a RecordReader and InputSplits. The default TextInputFormat makes sure that your mappers get each line as an input. Lines that span two blocks are handled by the InputSplit, which makes sure the necessary bytes from both blocks are made available, and the RecordReader actually converts that byte view into (key, value) pairs.

On Thu, Apr 26, 2012 at 4:59 PM, Barry, Sean F sean.f.ba...@intel.com wrote:

I guess what I meant to say was: how does Hadoop make 64 MB blocks without cutting off parts of words at the end of each block? Does it only make blocks at whitespace?

-SB

-----Original Message-----
From: Michael Segel [mailto:michael_se...@hotmail.com]
Sent: Thursday, April 26, 2012 1:56 PM
To: common-user@hadoop.apache.org
Subject: Re: Changing the Java heap

Not sure of your question. The Java child heap size is independent of how files are split on HDFS. I suggest you look at Tom White's book on HDFS and how files are split into blocks. Blocks are split at set sizes - 64 MB by default. Your record boundaries are not necessarily on block boundaries, so one task may read the rest of the last record that began in block A at the start of block B, while a different task that starts with block B skips the first n bytes until it hits the start of a record.

HTH
-Mike

On Apr 26, 2012, at 3:46 PM, Barry, Sean F wrote:

Within my small 2-node cluster I set up my 4-core slave node to have 4 task trackers, and I also limited my Java heap size to -Xmx1024m. Is there a possibility that when the data gets broken up, it will be broken at a place in the file that is not whitespace? Or is that already handled when the data on HDFS is broken up into blocks?

-SB

--
Warm Regards,
Deepak Nettem
http://www.cs.stonybrook.edu/%7Ednettem/
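To make the boundary rule concrete, here is a toy demonstration (my own simplification for illustration, not the actual Hadoop source): a line belongs to the split in which it starts; a reader whose split does not begin at offset 0 skips its partial first line, and every reader finishes the line it started even if that means reading past its split boundary into the next block.

    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    public class SplitDemo {
        // Reads the lines "owned" by the byte range [start, end) of data.
        static List<String> readSplit(byte[] data, int start, int end) {
            List<String> lines = new ArrayList<String>();
            int pos = start;
            if (start != 0) {
                // Not the first split: skip the partial first line; the
                // previous split's reader consumes it in full.
                while (pos < data.length && data[pos - 1] != '\n') pos++;
            }
            while (pos < end) { // a line is ours if it *starts* before 'end'
                int lineEnd = pos;
                while (lineEnd < data.length && data[lineEnd] != '\n') lineEnd++;
                lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
                pos = lineEnd + 1; // may run past 'end' -- that is the point
            }
            return lines;
        }

        public static void main(String[] args) {
            byte[] data = "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.UTF_8);
            // A boundary at byte 8 falls inside "bravo"; "bravo" still goes to split 1.
            System.out.println(readSplit(data, 0, 8));            // [alpha, bravo]
            System.out.println(readSplit(data, 8, data.length));  // [charlie]
        }
    }

In real Hadoop the extra bytes past the block boundary are fetched transparently from the next block, so no words are ever cut in half from the application's point of view.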
Re: Structuring MapReduce Jobs
What a well structured question!

On Sun, Apr 8, 2012 at 6:37 AM, Tom Ferguson tomfergu...@gmail.com wrote:

Hello,

I'm very new to Hadoop and I am trying to carry out a proof of concept for processing some trading data. I am from a .NET background, so I am trying to prove whether it can be done primarily using C#; therefore I am looking at the Hadoop Streaming job (from the Hadoop examples) to call into some C# executables. My problem is that I am not certain of the best way to structure my jobs to process the data in the way I want.

I have data stored in an RDBMS in the following format:

    ID  TradeID  Date        Value
    --  -------  ----------  -----
    1   1        2012-01-01  12.34
    2   1        2012-01-02  12.56
    3   1        2012-01-03  13.78
    4   2        2012-01-04  18.94
    5   2        2012-05-17  19.32
    6   2        2012-05-18  19.63
    7   3        2012-05-19  17.32

What I want to do is take all the Date/Value pairs for a given TradeID into a mathematical function that will spit out the same set of Dates but with all the Values recalculated. I hope that makes sense. E.g.

    Date        Value
    ----------  -----
    2012-01-01  12.34
    2012-01-02  12.56
    2012-01-03  13.78

will have the mathematical function applied and spit out

    Date        Value
    ----------  -----
    2012-01-01  28.74
    2012-01-02  31.29
    2012-01-03  29.93

I am not exactly sure how to achieve this using Hadoop Streaming, but my thoughts so far are:

1. Use Sqoop to take the data out of the RDBMS and into HDFS, split by TradeID - will this guarantee that all the data points for a given TradeID will be processed by the same Map task?

2. Write a Map task as a C# executable that will stream data in in the format (ID, TradeID, Date, Value).

3. Gather all the data points for a given TradeID together into an array (or other data structure).

    A naive way - the Mapper will need to emit (key, value) pairs where TradeID is the key and the entire record is the value. Hadoop will make sure that all (key, value) pairs with the same key end up in the same reducer. In Java, for example, all records for the same TradeID would become available as an Iterable collection. The Reducer can apply the mathematical function that you're talking about. (A sketch of this follows below.)

    Another way - if it is guaranteed that records with the same TradeID occur one after the other (and occur a fixed number of times, say 'k' times), then you can use a custom input format that makes 'k' records at a time available to the mapper, instead of 1. The mapper can then apply the mathematical function; no reducer would be required in this case.

4. Pass the array into the mathematical function.

5. Get the results back as another array.

6. Stream the results back out in the format (TradeID, Date, ResultValue).

I will have around 500,000 TradeIDs, with up to 3,000 data points each, so I am hoping that the data/processing will be distributed appropriately by Hadoop.

Now, this seems a little bit long-winded, but is this the best way of doing it, given the constraint of having to use C# for writing my tasks? In the example above I do not have a Reduce job at all. Is that right for my scenario?

Thanks for any help you can give, and apologies if I am asking stupid questions here!

Kind Regards,
Tom

Deepak Nettem
MS CS
SUNY Stony Brook
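A minimal Java sketch of the "naive way" above (assumptions on my part: tab-separated input lines of ID, TradeID, Date, Value, and applyFunction as an identity stand-in for the real math; under streaming, the same shape would be two C# executables exchanging tab-separated lines):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TradeSeriesJob {

        // Emits (TradeID, "Date<TAB>Value") so all points of a trade meet in one reduce call.
        public static class TradeMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] f = line.toString().split("\t");   // ID, TradeID, Date, Value
                ctx.write(new Text(f[1]), new Text(f[2] + "\t" + f[3]));
            }
        }

        public static class TradeReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text tradeId, Iterable<Text> points, Context ctx)
                    throws IOException, InterruptedException {
                List<String> dates = new ArrayList<String>();
                List<Double> values = new ArrayList<Double>();
                for (Text p : points) {            // toString() copies; Hadoop reuses the Text object
                    String[] dv = p.toString().split("\t");
                    dates.add(dv[0]);
                    values.add(Double.parseDouble(dv[1]));
                }
                double[] recalculated = applyFunction(values); // placeholder for the real math
                for (int i = 0; i < dates.size(); i++) {
                    ctx.write(tradeId, new Text(dates.get(i) + "\t" + recalculated[i]));
                }
            }

            private static double[] applyFunction(List<Double> in) {
                double[] out = new double[in.size()];
                for (int i = 0; i < in.size(); i++) out[i] = in.get(i); // identity stand-in
                return out;
            }
        }
    }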
Re: CombineFileInputFormat
Hi Stan,

Just out of curiosity, care to explain the use case a bit?

On Mon, Apr 9, 2012 at 5:25 PM, Stan Rosenberg stan.rosenb...@gmail.com wrote:

Hi,

I just came across a use case requiring CombineFileInputFormat under Hadoop 0.20.2. I was surprised that the API does not provide a default implementation. A cursory check against newer APIs returned the same result. What's the rationale? I ended up writing my own implementation; however, it struck me that this might be a common use case, so why not provide an implementation?

Thanks,
stan

Deepak Nettem
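For anyone hitting the same gap: the boilerplate a concrete subclass needs is fairly small - delegate each file chunk in the CombineFileSplit to an ordinary per-file reader via CombineFileRecordReader. A rough sketch against the old 0.20 mapred API (written from memory and hedged accordingly; in particular, check the javadoc for the exact constructor signature CombineFileRecordReader expects of the per-chunk reader):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.LineRecordReader;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
    import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
    import org.apache.hadoop.mapred.lib.CombineFileSplit;

    public class CombineTextInputFormat extends CombineFileInputFormat<LongWritable, Text> {

        @Override
        @SuppressWarnings({"unchecked", "rawtypes"})
        public RecordReader<LongWritable, Text> getRecordReader(
                InputSplit split, JobConf conf, Reporter reporter) throws IOException {
            // CombineFileRecordReader walks the chunks of the CombineFileSplit,
            // instantiating one per-chunk reader at a time.
            return new CombineFileRecordReader<LongWritable, Text>(
                    conf, (CombineFileSplit) split, reporter, (Class) PerChunkLineReader.class);
        }

        // Wraps a plain LineRecordReader around the idx-th chunk of the combined split.
        public static class PerChunkLineReader implements RecordReader<LongWritable, Text> {
            private final LineRecordReader delegate;

            // CombineFileRecordReader instantiates readers reflectively; this
            // (CombineFileSplit, Configuration, Reporter, Integer) constructor
            // is the signature it looks for, as far as I recall.
            public PerChunkLineReader(CombineFileSplit split, Configuration conf,
                                      Reporter reporter, Integer idx) throws IOException {
                FileSplit fileSplit = new FileSplit(split.getPath(idx),
                        split.getOffset(idx), split.getLength(idx), split.getLocations());
                delegate = new LineRecordReader(conf, fileSplit);
            }

            public boolean next(LongWritable key, Text value) throws IOException { return delegate.next(key, value); }
            public LongWritable createKey() { return delegate.createKey(); }
            public Text createValue() { return delegate.createValue(); }
            public long getPos() throws IOException { return delegate.getPos(); }
            public void close() throws IOException { delegate.close(); }
            public float getProgress() throws IOException { return delegate.getProgress(); }
        }
    }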
Get Current Block or Split ID, and using it, the Block Path
Hi,

Is it possible to get the 'id' of the currently executing split or block from within the mapper? Using this block ID / split ID, I want to be able to query the NameNode to get the names of the hosts holding that block / split, and the actual path to the data. I need this for some analytics that I'm doing.

Is there a client API that allows doing this? If not, what's the best way to do this?

Best,
Deepak Nettem
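For what it's worth, the public client API exposes block locations and offsets rather than raw block IDs, and that is usually enough for locality analytics. A hedged sketch using the new (mapreduce) API - these are standard classes and methods, but treat it as an outline rather than tested code:

    import java.io.IOException;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class SplitInfoMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context ctx) throws IOException, InterruptedException {
            FileSplit split = (FileSplit) ctx.getInputSplit();
            Path path = split.getPath();      // the file backing this split
            long start = split.getStart();    // byte offset of the split in the file
            long len = split.getLength();

            FileSystem fs = path.getFileSystem(ctx.getConfiguration());
            FileStatus status = fs.getFileStatus(path);
            // Asks the NameNode which hosts hold the blocks overlapping [start, start+len)
            BlockLocation[] blocks = fs.getFileBlockLocations(status, start, len);
            for (BlockLocation b : blocks) {
                for (String host : b.getHosts()) {
                    System.err.println(path + " @" + b.getOffset() + " on " + host);
                }
            }
        }
    }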
ResetableIterator for Joins
Hi,

I don't quite understand the purpose of ResetableIterator (http://hadoop.apache.org/common/docs/r1.0.1/api/org/apache/hadoop/mapred/join/ResetableIterator.html). The description says: "This defines an interface to a stateful Iterator that can replay elements added to it directly. Note that this does not extend java.util.Iterator (http://java.sun.com/javase/6/docs/api/java/util/Iterator.html)."

What's a stateful iterator, and why do we need to replay elements added to it? Why does it not extend Iterator? I would appreciate any insight!

--
Warm Regards,
Deepak Nettem
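For anyone else puzzling over this: the join framework has to cross every value from one source with every value from another for the same key, so it buffers values and rewinds them repeatedly - that's the "stateful, replayable" part - and it needs add() and reset(), which don't fit the java.util.Iterator contract (presumably why it doesn't extend it). A toy illustration of the idea (my own sketch, not Hadoop's implementation):

    import java.util.ArrayList;
    import java.util.List;

    // Buffers everything added to it so reset() can replay the sequence
    // from the start -- e.g. once per matching value from the other source.
    public class ReplayableIterator<T> {
        private final List<T> buffer = new ArrayList<T>();
        private int pos = 0;

        public void add(T item) { buffer.add(item); }      // "elements added to it directly"
        public boolean hasNext() { return pos < buffer.size(); }
        public T next() { return buffer.get(pos++); }
        public void reset() { pos = 0; }                   // replay from the beginning
    }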
Re: Avro, Hadoop 0.20.2, Jackson Error
Hi,

I have moved to CDH3, which doesn't have this issue. Hope that helps anybody stuck with the same problem.

best,
Deepak

On Mon, Mar 26, 2012 at 11:19 PM, Scott Carey sc...@richrelevance.com wrote:

Does it still happen if you configure avro-tools to use the dependency

    <dependency>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro-tools</artifactId>
      <version>1.6.3</version>
      <classifier>nodeps</classifier>
    </dependency>

? You have two Hadoops, two Jacksons, and even two avro:avro artifacts in your classpath if you use the Avro bundle jar with the default classifier. The avro-tools jar is not intended for inclusion in a project, as it is a jar with its dependencies inside.

https://cwiki.apache.org/confluence/display/AVRO/Build+Documentation#BuildDocumentation-ProjectStructure

On 3/26/12 7:52 PM, Deepak Nettem deepaknet...@gmail.com wrote:

When I include some Avro code in my Mapper, I get this error:

    Error: org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$Feature;)Lorg/codehaus/jackson/JsonFactory;

Particularly, with just these two lines of code:

    InputStream in = getClass().getResourceAsStream("schema.avsc");
    Schema schema = Schema.parse(in);

This code works perfectly when run as a standalone application outside of Hadoop. Why do I get this error, and what's the best way to get rid of it?

I am using Hadoop 0.20.2, and writing code in the new API. I found that the Hadoop lib directory contains jackson-core-asl-1.0.1.jar and jackson-mapper-asl-1.0.1.jar. I removed these, but got this error:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/codehaus/jackson/map/JsonMappingException

I am using Maven as a build tool, and my pom.xml has this dependency:

    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-mapper-asl</artifactId>
      <version>1.5.2</version>
      <scope>compile</scope>
    </dependency>

I added the dependency:

    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-core-asl</artifactId>
      <version>1.5.2</version>
      <scope>compile</scope>
    </dependency>

But that still gives me this error:

    Error: org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$Feature;)Lorg/codehaus/jackson/JsonFactory;

I also tried replacing the earlier dependencies with these:

    <dependency>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro-tools</artifactId>
      <version>1.6.3</version>
    </dependency>
    <dependency>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro</artifactId>
      <version>1.6.3</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-mapper-asl</artifactId>
      <version>1.8.8</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-core-asl</artifactId>
      <version>1.8.8</version>
      <scope>compile</scope>
    </dependency>

And this is my app dependency tree:

    [INFO] --- maven-dependency-plugin:2.1:tree (default-cli) @ AvroTest ---
    [INFO] org.avrotest:AvroTest:jar:1.0-SNAPSHOT
    [INFO] +- junit:junit:jar:3.8.1:test (scope not updated to compile)
    [INFO] +- org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8:compile
    [INFO] +- org.codehaus.jackson:jackson-core-asl:jar:1.8.8:compile
    [INFO] +- net.sf.json-lib:json-lib:jar:jdk15:2.3:compile
    [INFO] |  +- commons-beanutils:commons-beanutils:jar:1.8.0:compile
    [INFO] |  +- commons-collections:commons-collections:jar:3.2.1:compile
    [INFO] |  +- commons-lang:commons-lang:jar:2.4:compile
    [INFO] |  +- commons-logging:commons-logging:jar:1.1.1:compile
    [INFO] |  \- net.sf.ezmorph:ezmorph:jar:1.0.6:compile
    [INFO] +- org.apache.avro:avro-tools:jar:1.6.3:compile
    [INFO] |  \- org.slf4j:slf4j-api:jar:1.6.4:compile
    [INFO] +- org.apache.avro:avro:jar:1.6.3:compile
    [INFO] |  +- com.thoughtworks.paranamer:paranamer:jar:2.3:compile
    [INFO] |  \- org.xerial.snappy:snappy-java:jar:1.0.4.1:compile
    [INFO] \- org.apache.hadoop:hadoop-core:jar:0.20.2:compile
    [INFO]    +- commons-cli:commons-cli:jar:1.2:compile
    [INFO]    +- xmlenc:xmlenc:jar:0.52:compile
    [INFO]    +- commons-httpclient:commons-httpclient:jar:3.0.1:compile
    [INFO]    +- commons-codec:commons-codec:jar:1.3:compile
    [INFO]    +- commons-net:commons-net:jar:1.4.1:compile
    [INFO]    +- org.mortbay.jetty:jetty:jar:6.1.14:compile
    [INFO]    +- org.mortbay.jetty:jetty-util:jar:6.1.14:compile
    [INFO]    +- tomcat:jasper-runtime:jar:5.5.12:compile
    [INFO]    +- tomcat:jasper-compiler:jar:5.5.12:compile
    [INFO]    +- org.mortbay.jetty:jsp-api-2.1:jar:6.1.14:compile
    [INFO]    +- org.mortbay.jetty:jsp-2.1:jar:6.1.14:compile
    [INFO]    |  \- ant:ant:jar:1.6.5:compile
    [INFO]    +- commons-el:commons-el:jar:1.0:compile
    [INFO]    +- net.java.dev.jets3t:jets3t:jar
Re: _temporary doesn't exist
In the logs directory in your Hadoop setup you will have NameNode, JobTracker and TaskTracker logs. Individual TaskTracker logs are generated on the nodes. You can also see the job-specific logs from one of the web UIs if you're using Apache Hadoop.

On Fri, Mar 16, 2012 at 5:04 PM, Vipul Bharakhada vipulr...@gmail.com wrote:

Where can I find those logs?
-Vipul

On Fri, Mar 16, 2012 at 1:33 PM, Bejoy Ks bejoy.had...@gmail.com wrote:

Hi Vipul,
AFAIK, the clean-up should happen after job completion, not task completion. As for what is causing this clean-up, you may be able to get some info from the NN logs.
Regards,
Bejoy

On Sat, Mar 17, 2012 at 12:48 AM, Vipul Bharakhada vipulr...@gmail.com wrote:

It's one of the servers that I don't have permission to upgrade, but any help is fine. I saw a bug filed against 0.20 saying MapReduce doesn't check for the _temporary directory's existence, but I'm not sure whether it's Hadoop or one of the scheduled jobs on the server that is cleaning things up. I am curious about why it's deleting the _temporary folder before the task is finished; the clean-up should happen at task completion, if I am not wrong. Correct me if I am wrong - I am new to Hadoop. Thank you.
-Vipul

On Fri, Mar 16, 2012 at 12:09 PM, Bejoy Ks bejoy.had...@gmail.com wrote:

Hi Vipul,
Is there any reason you are on the 0.17 version of Hadoop? It is a pretty old version (more than two years old) and tons of bug fixes and optimizations have gone into the trunk since then. You really should upgrade to one of the 1.0.x releases. It would be hard for anyone on the list to help you out with such an outdated version; try an upgrade and see whether this issue still persists.
Regards,
Bejoy

On Sat, Mar 17, 2012 at 12:27 AM, Vipul Bharakhada vipulr...@gmail.com wrote:

One more observation: usually this job takes 3 to 4 minutes; however, when it fails, it takes more than 42 to 50 minutes.
-Vipul

On Fri, Mar 16, 2012 at 11:38 AM, Vipul Bharakhada vipulr...@gmail.com wrote:

Hi,
I am using the old Hadoop version 0.17.2 and I am getting the following exception when I try to run a job. It only happens at a particular time: cron jobs run these tasks at regular intervals, but the job fails only at one particular time of day.

    Mar 14 06:49:23 7 08884: java.io.IOException: The directory hdfs://{IPADDRESS}:{PORT}/myserver/matcher/output/_temporary doesnt exist
    Mar 14 06:49:23 7 08884:     at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:1439)
    Mar 14 06:49:23 7 08884:     at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.launchTask(TaskTracker.java:1511)
    Mar 14 06:49:23 7 08884:     at org.apache.hadoop.mapred.TaskTracker.launchTaskForJob(TaskTracker.java:723)
    Mar 14 06:49:23 7 08884:     at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:716)
    Mar 14 06:49:23 7 08884:     at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1274)
    Mar 14 06:49:23 7 08884:     at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:915)
    Mar 14 06:49:23 7 08884:     at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1310)
    Mar 14 06:49:23 7 08884:     at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2251)

What can be the problem? This folder is created, used, and cleaned up by Hadoop internally. So why is this directory missing at that particular time? Any clue?
-Vipul
Control Resources / Cores assigned for a job
Hi,

I want to be able to control the number of nodes assigned to an MR job on the cluster. For example, I want the job to execute no more than 10 mappers at a time, irrespective of whether more nodes are available.

I don't wish to control the number of mappers that are created by the job. The number of mappers is tied to the problem size / input data and to my input splits, and I want it to remain that way.

Is this possible? If so, what's the best way to do this?

Deepak
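Not from this thread, but one commonly cited approach (an assumption to verify against your Hadoop version) is the Fair Scheduler's per-pool task caps: enable the Fair Scheduler on the JobTracker (mapred.jobtracker.taskScheduler = org.apache.hadoop.mapred.FairScheduler, with mapred.fairscheduler.allocation.file pointing at a file like the one below), then submit the job with -Dmapred.fairscheduler.pool=capped. The number of map tasks the job creates is untouched; only how many run concurrently is capped.

    <?xml version="1.0"?>
    <allocations>
      <!-- Jobs in this pool never run more than 10 map tasks at once,
           no matter how many free slots the cluster has. -->
      <pool name="capped">
        <maxMaps>10</maxMaps>
      </pool>
    </allocations>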
Mapper Only Job, Without Input or Output Path
Hi,

I have a use case - I have files lying on the local disk of every node on my cluster. I want to write a mapper-only MapReduce job that reads the file off the local disk on every machine, applies some transformation, and writes the result to HDFS. Specifically:

1. The job shouldn't have any input/output paths, and should use null key-value pairs.
2. Mapper only.
3. I want to be able to control the number of mappers, depending on the size of my cluster.

What's the best way to do this? I would appreciate any example code.

Deepak
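One workable pattern (a sketch under assumptions, not the only way, and the local path /data/local/input.txt plus the transform() stub are placeholders): give the job a tiny dummy input file with one line per desired mapper and use NLineInputFormat so each line becomes one map task; the mapper ignores its input, reads the node-local file with java.io, and writes to HDFS through the FileSystem API. NullOutputFormat plus zero reducers removes the output-path requirement.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.NLineInputFormat;
    import org.apache.hadoop.mapred.lib.NullOutputFormat;

    public class LocalToHdfs extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        private JobConf conf;
        @Override
        public void configure(JobConf conf) { this.conf = conf; }

        public void map(LongWritable key, Text ignored,
                        OutputCollector<Text, Text> out, Reporter rep) throws IOException {
            // Placeholder local path; assumed present on every node.
            BufferedReader in = new BufferedReader(new FileReader("/data/local/input.txt"));
            FileSystem fs = FileSystem.get(conf);
            String host = java.net.InetAddress.getLocalHost().getHostName();
            FSDataOutputStream hdfsOut = fs.create(new Path("/output/" + host));
            String line;
            while ((line = in.readLine()) != null) {
                hdfsOut.write((transform(line) + "\n").getBytes("UTF-8"));
            }
            in.close();
            hdfsOut.close();
        }

        private String transform(String line) { return line; } // your transformation here

        public static void main(String[] args) throws IOException {
            JobConf job = new JobConf(LocalToHdfs.class);
            // One dummy line per desired mapper; the line content is never used.
            FileInputFormat.setInputPaths(job, new Path("/control/dummy.txt"));
            job.setInputFormat(NLineInputFormat.class);
            job.setInt("mapred.line.input.format.linespermap", 1);
            job.setMapperClass(LocalToHdfs.class);
            job.setNumReduceTasks(0);                     // mapper only
            job.setOutputFormat(NullOutputFormat.class);  // no output path needed
            JobClient.runJob(job);
        }
    }

Two caveats: the job still needs the small dummy input file (so requirement 1 is only fully met on the output side), and nothing guarantees exactly one mapper lands on each node - dummy lines carry no locality, so for strict one-per-node placement you would need a custom InputFormat whose splits name specific hosts via getLocations().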
Suggestion for InputSplit and InputFormat - Split every line.
Hi,

I have this use case - I need to spawn as many mappers as there are lines in a file in HDFS. The file isn't big (only 10-50 lines). Each line represents the path of another data source that the mappers will work on. So each mapper will read one line (the map() method will need to be called only once) and work on that data source.

What's the best way to construct the InputSplit, InputFormat and RecordReader to achieve this? I would appreciate any example code :)

Best,
Deepak
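In case it helps others searching the archives: NLineInputFormat, which ships in the old mapred API (org.apache.hadoop.mapred.lib.NLineInputFormat), does exactly this - each split carries N lines of the input file, so with N = 1 every line becomes its own map task and map() fires once per task. A minimal sketch (the paths and the body of DataSourceMapper are placeholders):

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.NLineInputFormat;
    import org.apache.hadoop.mapred.lib.NullOutputFormat;

    public class OneMapperPerLine {

        public static class DataSourceMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, Text> {
            public void map(LongWritable offset, Text pathLine,
                            OutputCollector<Text, Text> out, Reporter rep) throws IOException {
                // Fires once per task: pathLine is the single line this mapper
                // owns, i.e. the path of the data source to open and process.
                String source = pathLine.toString().trim();
                // ... open and process 'source' ...
            }
        }

        public static void main(String[] args) throws IOException {
            JobConf job = new JobConf(OneMapperPerLine.class);
            FileInputFormat.setInputPaths(job, new Path("/control/sources.txt")); // the 10-50 line file
            job.setInputFormat(NLineInputFormat.class);
            job.setInt("mapred.line.input.format.linespermap", 1); // one line -> one map task
            job.setMapperClass(DataSourceMapper.class);
            job.setNumReduceTasks(0);
            job.setOutputFormat(NullOutputFormat.class);
            JobClient.runJob(job);
        }
    }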