Re: Changing the Java heap

2012-04-26 Thread Deepak Nettem
HDFS doesn't care about the contents of the file; the file simply gets divided
into 64MB blocks.

For example, if your input file contains data in a custom format (like
paragraphs) and you want the file split on paragraph boundaries, HDFS isn't
responsible for that - and rightly so.

The application developer needs to use a custom InputFormat, which internally
uses a RecordReader and InputSplits. The default TextInputFormat makes sure
that your mappers get each line as an input. Lines that span two blocks are
handled by the InputSplit, which makes sure that the necessary bytes from both
blocks are made available, and the RecordReader converts that byte view into
(key, value) pairs.
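
As a rough illustration (not from the thread): in the new API a custom
InputFormat only has to hand back a RecordReader. The ParagraphInputFormat
below is hypothetical and simply delegates to LineRecordReader to show where a
paragraph-aware reader would plug in.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical sketch: a real paragraph reader would replace LineRecordReader
// and, like it, skip a partial record at the start of its split and read past
// the end of the split to finish the last record.
public class ParagraphInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new LineRecordReader();
    }
}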



On Thu, Apr 26, 2012 at 4:59 PM, Barry, Sean F sean.f.ba...@intel.com wrote:

 I guess what I meant to say was, how does hadoop make 64M blocks without
 cutting off parts of words at the end of each block? Does it only make
 blocks at whitespace?

 -SB

 -Original Message-
 From: Michael Segel [mailto:michael_se...@hotmail.com]
 Sent: Thursday, April 26, 2012 1:56 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Changing the Java heap

 Not sure of your question.

 Java child Heap size is independent of how files are split on HDFS.

 I suggest you look at Tom White's book on HDFS and how files are split into
 blocks.

 Blocks are split at a set size, 64MB by default.
 Your record boundaries are not necessarily on block boundaries, so one task
 may read block A and then continue into the start of block B to finish its
 last record. A different task may start with block B and skip the first n
 bytes until it hits the start of a record.

 HTH

 -Mike

 On Apr 26, 2012, at 3:46 PM, Barry, Sean F wrote:

  Within my small 2 node cluster I set up my 4 core slave node to have 4
 task trackers and I also limited my java heap size to -Xmx1024m
 
  Is there a possibility that when the data gets broken up that it will
 break it at a place in the file that is not a whitespace? Or is that
 already handled when the data on HDFS is broken up into blocks?
 
  -SB




-- 
Warm Regards,
Deepak Nettem http://www.cs.stonybrook.edu/%7Ednettem/


Re: Structuring MapReduce Jobs

2012-04-09 Thread Deepak Nettem
What a well structured question!

On Sun, Apr 8, 2012 at 6:37 AM, Tom Ferguson tomfergu...@gmail.com wrote:

 Hello,

 I'm very new to Hadoop and I am trying to carry out a proof of concept for
 processing some trading data. I am from a .NET background, so I am trying to
 prove whether it can be done primarily using C#; therefore I am looking at
 Hadoop Streaming (from the Hadoop examples) to call into some C# executables.

 My problem is, I am not certain of the best way to structure my jobs to
 process the data in the way I want.

 I have data stored in an RDBMS in the following format:

 ID  TradeID  Date        Value
 -------------------------------
 1   1        2012-01-01  12.34
 2   1        2012-01-02  12.56
 3   1        2012-01-03  13.78
 4   2        2012-01-04  18.94
 5   2        2012-05-17  19.32
 6   2        2012-05-18  19.63
 7   3        2012-05-19  17.32
 What I want to do is take all the Dates & Values for a given TradeID into a
 mathematical function that will spit out the same set of Dates but will
 have recalculated all the Values. I hope that makes sense, e.g.

 Date Value
 ---
 2012-01-01 12.34
 2012-01-02 12.56
 2012-01-03 13.78
 will have the mathematical function applied and spit out

 Date Value
 ---
 2012-01-01 28.74
 2012-01-02 31.29
 2012-01-03 29.93
 I am not exactly sure how to achieve this using Hadoop Streaming, but my
 thoughts so far are...


   1. Use Sqoop to take the data out of the RDBMS and into HDFS, split
   by TradeID - will this guarantee that all the data points for a given
   TradeID will be processed by the same Map task?
   2. Write a Map task as a C# executable that will stream data in, in the
   format (ID, TradeID, Date, Value)
   3. Gather all the data points for a given TradeID together into an array
   (or other data structure)


A Naive Way -

The Mapper will need to emit (key, value) pairs with TradeID as the key and
the entire record as the value.

Hadoop guarantees that all (key, value) pairs with the same key end up in the
same reducer. In Java, for example, all records for the same TradeID become
available to the reducer as an Iterable collection.

The Reducer can then apply the mathematical function that you're talking about.
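
A minimal sketch of that naive approach in Java follows (hedged: the thread
itself targets Hadoop Streaming with C# executables, and the tab-separated
field layout and the recalculate() helper are hypothetical placeholders):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TradeRecalc {

    // Emits TradeID -> whole record, assuming tab-separated "ID TradeID Date Value" lines.
    public static class TradeMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            context.write(new Text(fields[1]), line); // fields[1] = TradeID
        }
    }

    // Sees all records of one TradeID together and applies the recalculation.
    public static class TradeReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text tradeId, Iterable<Text> records, Context context)
                throws IOException, InterruptedException {
            List<String> points = new ArrayList<String>();
            for (Text record : records) {
                points.add(record.toString()); // copy; Hadoop reuses the Text object
            }
            for (String out : recalculate(points)) {
                context.write(tradeId, new Text(out));
            }
        }

        // Stand-in for the mathematical function that takes all (Date, Value)
        // points of one trade and returns recalculated values.
        private List<String> recalculate(List<String> points) {
            return points;
        }
    }
}

With Streaming the same structure holds: the C# mapper prints TradeID, a tab,
and the record; the C# reducer then sees all lines sharing a TradeID as
consecutive lines on stdin.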

Another Way -

If it is guaranteed that records with the same TradeID occur one after the
other (and occur a fixed number of times, say 'k' times), then you can use
a custom input format that makes 'k' records at a time available to the
mapper, instead of 1. The mapper can then apply the mathematical function,
and no reducer would be required in this case (see the sketch below).
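
A rough sketch of such a reader (new API, not from the thread): it wraps
LineRecordReader and groups k consecutive lines into one value. It assumes k
is fixed and that a group never straddles a split boundary; the configuration
key "k.records.per.call" is made up for this example.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class KLineRecordReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader lineReader = new LineRecordReader();
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();
    private int k = 1;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        k = context.getConfiguration().getInt("k.records.per.call", 1);
        lineReader.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        StringBuilder group = new StringBuilder();
        int read = 0;
        while (read < k && lineReader.nextKeyValue()) {
            if (read == 0) {
                key.set(lineReader.getCurrentKey().get()); // offset of the group's first line
            }
            group.append(lineReader.getCurrentValue().toString()).append('\n');
            read++;
        }
        if (read == 0) {
            return false; // nothing left in this split
        }
        value.set(group.toString());
        return true;
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() throws IOException, InterruptedException {
        return lineReader.getProgress();
    }
    @Override public void close() throws IOException { lineReader.close(); }
}

A matching FileInputFormat subclass would return this reader from
createRecordReader(), and could return false from isSplitable() so that every
group of k lines stays within a single split.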


   4. Pass the array into the mathematical function
   5. Get the results back as another array
   6. Stream the results back out in the format (TradeID, Date, ResultValue)

 I will have around 500,000 Trade IDs, with up to 3,000 data points each, so
 I am hoping that the data/processing will be distributed appropriately by
 Hadoop.

 Now, this seems a little bit long-winded, but is this the best way of doing
 it, given the constraint of having to use C# for writing my tasks? In the
 example above I do not have a Reduce job at all. Is that right for my
 scenario?

 Thanks for any help you can give and apologies if I am asking stupid
 questions here!

 Kind Regards,

 Tom



Deepak Nettem
MS CS
SUNY Stony Brook


Re: CombineFileInputFormat

2012-04-09 Thread Deepak Nettem
Hi Stan,

Just out of curiosity, care to explain the use case a bit?

On Mon, Apr 9, 2012 at 5:25 PM, Stan Rosenberg stan.rosenb...@gmail.com wrote:

 Hi,

 I just came across a use case requiring CombineFileInputFormat under
 hadoop 0.20.2. I was surprised that the API does not provide a default
 implementation. A cursory check against newer APIs also returned the same
 result. What's the rationale? I ended up writing my own implementation.
 However, it struck me that this might be a common use case, so why not
 provide an implementation?

 Thanks,

 stan


Deepak Nettem


Get Current Block or Split ID, and using it, the Block Path

2012-04-08 Thread Deepak Nettem
Hi,

Is it possible to get the 'id' of the currently executing split or block
from within the mapper? Using this block ID / split ID, I want to be able
to query the namenode to get the names of the hosts holding that block /
split, and the actual path to the data.

I need this for some analytics that I'm doing. Is there a client API that
allows doing this? If not, what's the best way to do this?
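
Not from the thread, but a sketch of one possible approach (new API, offered
as an assumption rather than a confirmed answer): the mapper's context exposes
the current InputSplit, and for a FileSplit the FileSystem can be asked which
hosts hold the blocks backing that byte range.

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitInfoMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Path, offset and length identify the split being processed by this task.
        FileSplit split = (FileSplit) context.getInputSplit();
        System.out.println("split: " + split.getPath()
                + " @ " + split.getStart() + " len " + split.getLength());

        // Ask the FileSystem (backed by the NameNode for HDFS) for the block
        // locations that cover this split's byte range.
        FileSystem fs = split.getPath().getFileSystem(context.getConfiguration());
        FileStatus status = fs.getFileStatus(split.getPath());
        for (BlockLocation loc : fs.getFileBlockLocations(status,
                split.getStart(), split.getLength())) {
            System.out.println("block @ " + loc.getOffset()
                    + " hosts " + Arrays.toString(loc.getHosts()));
        }
    }
}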

Best,
Deepak Nettem


ResetableIterator for Joins

2012-03-31 Thread Deepak Nettem
Hi,

I don't quite understand the purpose of the ResetableIterator. (
http://hadoop.apache.org/common/docs/r1.0.1/api/org/apache/hadoop/mapred/join/ResetableIterator.html
).

The description says: "This defines an interface to a stateful Iterator that
can replay elements added to it directly. Note that this does not extend
Iterator (http://java.sun.com/javase/6/docs/api/java/util/Iterator.html?is-external=true)."


What's a stateful iterator and why do we need to replay elements added to
it? Why does it not extend Iterator? I would appreciate any insight!

-- 
Warm Regards,
Deepak Nettem


Re: Avro, Hadoop0.20.2, Jackson Error

2012-03-29 Thread Deepak Nettem
Hi,

I have moved to CDH3 which doesn't have this issue. Hope that helps anybody
stuck with the same issue.

best,
Deepak

On Mon, Mar 26, 2012 at 11:19 PM, Scott Carey sc...@richrelevance.com wrote:

 Does it still happen if you configure avro-tools to use

 <dependency>
   <groupId>org.apache.avro</groupId>
   <artifactId>avro-tools</artifactId>
   <version>1.6.3</version>
   <classifier>nodeps</classifier>
 </dependency>


 ?

 You have two copies of hadoop, two of jackson, and even two avro:avro
 artifacts in your classpath if you use the avro bundle jar with the default
 classifier.

 The avro-tools jar is not intended for inclusion in a project, as it is a jar
 with its dependencies bundled inside:
 https://cwiki.apache.org/confluence/display/AVRO/Build+Documentation#BuildDocumentation-ProjectStructure

 On 3/26/12 7:52 PM, Deepak Nettem deepaknet...@gmail.com wrote:

 When I include some Avro code in my Mapper, I get this error:
 
 Error: org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$Feature;)Lorg/codehaus/jackson/JsonFactory;
 
 Particularly, just these two lines of code:
 
 InputStream in = getClass().getResourceAsStream("schema.avsc");
 Schema schema = Schema.parse(in);
 
 This code works perfectly when run as a standalone application outside of
 Hadoop. Why do I get this error, and what's the best way to get rid of it?
 
 I am using Hadoop 0.20.2, and writing code in the new API.
 
 I found that the Hadoop lib directory contains jackson-core-asl-1.0.1.jar
 and jackson-mapper-asl-1.0.1.jar.
 
 I removed these, but got this error when running hadoop:
 Exception in thread "main" java.lang.NoClassDefFoundError: org/codehaus/jackson/map/JsonMappingException
 
 I am using Maven as a build tool, and my pom.xml has this dependency:
 
 <dependency>
   <groupId>org.codehaus.jackson</groupId>
   <artifactId>jackson-mapper-asl</artifactId>
   <version>1.5.2</version>
   <scope>compile</scope>
 </dependency>
 
 
 
 
 I added the dependency:
 
 
 <dependency>
   <groupId>org.codehaus.jackson</groupId>
   <artifactId>jackson-core-asl</artifactId>
   <version>1.5.2</version>
   <scope>compile</scope>
 </dependency>
 
 But that still gives me this error:
 
 Error: org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$Feature;)Lorg/codehaus/jackson/JsonFactory;
 
 -
 
 I also tried replacing the earlier dependencies with these:
 
 <dependency>
   <groupId>org.apache.avro</groupId>
   <artifactId>avro-tools</artifactId>
   <version>1.6.3</version>
 </dependency>

 <dependency>
   <groupId>org.apache.avro</groupId>
   <artifactId>avro</artifactId>
   <version>1.6.3</version>
 </dependency>
 
 
 <dependency>
   <groupId>org.codehaus.jackson</groupId>
   <artifactId>jackson-mapper-asl</artifactId>
   <version>1.8.8</version>
   <scope>compile</scope>
 </dependency>

 <dependency>
   <groupId>org.codehaus.jackson</groupId>
   <artifactId>jackson-core-asl</artifactId>
   <version>1.8.8</version>
   <scope>compile</scope>
 </dependency>
 
 And this is my app dependency tree:
 
 [INFO] --- maven-dependency-plugin:2.1:tree (default-cli) @ AvroTest ---
 [INFO] org.avrotest:AvroTest:jar:1.0-SNAPSHOT
 [INFO] +- junit:junit:jar:3.8.1:test (scope not updated to compile)
 [INFO] +- org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8:compile
 [INFO] +- org.codehaus.jackson:jackson-core-asl:jar:1.8.8:compile
 [INFO] +- net.sf.json-lib:json-lib:jar:jdk15:2.3:compile
 [INFO] |  +- commons-beanutils:commons-beanutils:jar:1.8.0:compile
 [INFO] |  +- commons-collections:commons-collections:jar:3.2.1:compile
 [INFO] |  +- commons-lang:commons-lang:jar:2.4:compile
 [INFO] |  +- commons-logging:commons-logging:jar:1.1.1:compile
 [INFO] |  \- net.sf.ezmorph:ezmorph:jar:1.0.6:compile
 [INFO] +- org.apache.avro:avro-tools:jar:1.6.3:compile
 [INFO] |  \- org.slf4j:slf4j-api:jar:1.6.4:compile
 [INFO] +- org.apache.avro:avro:jar:1.6.3:compile
 [INFO] |  +- com.thoughtworks.paranamer:paranamer:jar:2.3:compile
 [INFO] |  \- org.xerial.snappy:snappy-java:jar:1.0.4.1:compile
 [INFO] \- org.apache.hadoop:hadoop-core:jar:0.20.2:compile
 [INFO]    +- commons-cli:commons-cli:jar:1.2:compile
 [INFO]    +- xmlenc:xmlenc:jar:0.52:compile
 [INFO]    +- commons-httpclient:commons-httpclient:jar:3.0.1:compile
 [INFO]    +- commons-codec:commons-codec:jar:1.3:compile
 [INFO]    +- commons-net:commons-net:jar:1.4.1:compile
 [INFO]    +- org.mortbay.jetty:jetty:jar:6.1.14:compile
 [INFO]    +- org.mortbay.jetty:jetty-util:jar:6.1.14:compile
 [INFO]    +- tomcat:jasper-runtime:jar:5.5.12:compile
 [INFO]    +- tomcat:jasper-compiler:jar:5.5.12:compile
 [INFO]    +- org.mortbay.jetty:jsp-api-2.1:jar:6.1.14:compile
 [INFO]    +- org.mortbay.jetty:jsp-2.1:jar:6.1.14:compile
 [INFO]    |  \- ant:ant:jar:1.6.5:compile
 [INFO]    +- commons-el:commons-el:jar:1.0:compile
 [INFO]    +- net.java.dev.jets3t:jets3t:jar

Re: _temporary doesn't exist

2012-03-16 Thread Deepak Nettem
In the logs directory of your Hadoop setup. You will have NameNode,
JobTracker and TaskTracker logs; individual TaskTracker logs are generated
on the nodes.

You can also see the job-specific logs from one of the web UIs if you're
using Apache Hadoop.

On Fri, Mar 16, 2012 at 5:04 PM, Vipul Bharakhada vipulr...@gmail.com wrote:

 Where can I find those logs?
 -Vipul

 On Fri, Mar 16, 2012 at 1:33 PM, Bejoy Ks bejoy.had...@gmail.com wrote:

  Hi Vipul
   AFAIK, the clean-up should happen after job completion, not task
  completion. As for what is causing this clean-up, maybe you can get some
  info from the NN logs.

  Regards
  Bejoy
 
  On Sat, Mar 17, 2012 at 12:48 AM, Vipul Bharakhada vipulr...@gmail.com
  wrote:
 
   It's one of the servers which I don't have permission to upgrade, but any
   help is fine. I saw a bug filed against 0.20 saying that MapReduce doesn't
   check for the existence of the _temporary directory, but I'm not sure
   whether it is Hadoop or one of the scheduled jobs on the server that is
   cleaning up after Hadoop. I am curious about why it's deleting the
   _temporary folder before the task is finished; the clean-up should happen
   at task completion, if I am not wrong. Correct me if I am wrong; I am new
   to Hadoop.
   Thank you.
   -Vipul
  
   On Fri, Mar 16, 2012 at 12:09 PM, Bejoy Ks bejoy.had...@gmail.com
  wrote:
  
    Hi Vipul
    Is there any reason you are on the 0.17 version of hadoop? It is a
    pretty old version of hadoop (more than 2 years old) and tons of bug
    fixes and optimizations have gone into trunk since then. You really
    should upgrade to one of the 1.0.X releases. It would be hard for anyone
    on the list to help you out with such an outdated version. Try an
    upgrade and see whether this issue still persists.
   
Regards
Bejoy
   
On Sat, Mar 17, 2012 at 12:27 AM, Vipul Bharakhada 
  vipulr...@gmail.com
wrote:
   
  One more observation: usually this job takes 3 to 4 minutes; however,
  when it fails, at that particular time it takes more than 42 to 50
  minutes.
  -Vipul

 On Fri, Mar 16, 2012 at 11:38 AM, Vipul Bharakhada 
   vipulr...@gmail.com
 wrote:

   Hi,
   I am using the old hadoop version 0.17.2 and I am getting the following
   exception when I am trying to run a job. It only happens at a particular
   time: cron jobs run those tasks at regular intervals, but it only fails
   at one particular time of day.
   Mar 14 06:49:23 7 08884: java.io.IOException: The directory
   hdfs://{IPADDRESS}:{PORT}/myserver/matcher/output/_temporary doesnt exist
   Mar 14 06:49:23 7 08884:   at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:1439)
   Mar 14 06:49:23 7 08884:   at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.launchTask(TaskTracker.java:1511)
   Mar 14 06:49:23 7 08884:   at org.apache.hadoop.mapred.TaskTracker.launchTaskForJob(TaskTracker.java:723)
   Mar 14 06:49:23 7 08884:   at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:716)
   Mar 14 06:49:23 7 08884:   at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1274)
   Mar 14 06:49:23 7 08884:   at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:915)
   Mar 14 06:49:23 7 08884:   at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1310)
   Mar 14 06:49:23 7 08884:   at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2251)
 
   What can be the problem? This folder is created by hadoop internally and
   used internally; clean-up is also done internally by hadoop. So why is
   this directory missing at that particular time?
   Any clue?
   -Vipul
 
 

   
  
 



Control Resources / Cores assigned for a job

2012-03-16 Thread Deepak Nettem
Hi,

I want to be able to control the number of nodes assigned to a MR job on
the cluster. For example, I want the job to not execute more than 10
Mappers at a time, irrespective of whether there are more nodes available
or not.

I don't wish to control the number of mappers that are created by the job.
The number of mappers is tied to the problem size / input data and to my
input splits. I want it to remain that way.

Is this possible? If so, what's the best way to do this?

Deepak


Mapper Only Job, Without Input or Output Path

2012-03-15 Thread Deepak Nettem
Hi,

I have a use case - I have files lying on the local disk of every node in
my cluster. I want to write a Mapper-only MapReduce job that reads the file
off the local disk on every machine, applies some transformation, and writes
to HDFS.

Specifically,

1. The job shouldn't have any input/output paths, and can use null key-value pairs.
2. Mapper Only
3. I want to be able to control the number of Mappers, depending on the
size of my cluster.

What's the best way to do this? I would appreciate any example code.

Deepak


Suggestion for InputSplit and InputFormat - Split every line.

2012-03-15 Thread Deepak Nettem
Hi,

I have this use case - I need to spawn as many mappers as there are lines in
a file in HDFS. This file isn't big (only 10-50 lines); each line represents
the path of another data source that the mappers will work on. So each mapper
will read one line (the map() method will be called only once) and work on
that data source.

What's the best way to construct InputSplit, InputFormat and RecordReader
to achieve this? I would appreciate any example code :)
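
Not from the thread, but one common approach I'd sketch (assuming a Hadoop
version that ships NLineInputFormat for the new API) is to feed the small
file through NLineInputFormat with one line per split, so every mapper's
map() is called exactly once with one path. NLineInputFormat and
NullOutputFormat are standard Hadoop classes; the job wiring itself is just
an illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class OneMapperPerLine {

    public static class PathMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws java.io.IOException, InterruptedException {
            String dataSourcePath = line.toString().trim();
            // ... open and process the data source named on this line ...
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "one-mapper-per-line"); // Job.getInstance(conf) in later APIs
        job.setJarByClass(OneMapperPerLine.class);
        job.setMapperClass(PathMapper.class);
        job.setNumReduceTasks(0);                       // map-only
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1);   // one line -> one mapper
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        job.setOutputFormatClass(NullOutputFormat.class); // no output path required
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}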

Best,
Deepak