Re: namenode null pointer

2012-02-28 Thread Ben Cuthbert
So has the filesystem become corrupted?

Regards

Ben
On 29 Feb 2012, at 05:51, madhu phatak wrote:

> Hi,
> This may be an issue with the namenode not being formatted correctly.
> 
> On Sat, Feb 18, 2012 at 1:50 PM, Ben Cuthbert  wrote:
> 
>> All, sometimes when I start up my hadoop I get the following error
>> 
>> 12/02/17 10:29:56 INFO namenode.NameNode: STARTUP_MSG:
>> /
>> STARTUP_MSG: Starting NameNode
>> STARTUP_MSG: host =iMac.local/192.168.0.191
>> STARTUP_MSG: args = []
>> STARTUP_MSG: version = 0.20.203.0
>> STARTUP_MSG: build =
>> http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-203-r
>>  1099333; compiled by 'oom' on Wed May 4 07:57:50 PDT 2011
>> /
>> 12/02/17 10:29:56 WARN impl.MetricsSystemImpl: Metrics system not started:
>> Cannot locate configuration: tried hadoop-metrics2-namenode.properties,
>> hadoop-metrics2.properties
>> 2012-02-17 10:29:56.994 java[4065:1903] Unable to load realm info from
>> SCDynamicStore
>> 12/02/17 10:29:57 INFO util.GSet: VM type = 64-bit
>> 12/02/17 10:29:57 INFO util.GSet: 2% max memory = 17.77875 MB
>> 12/02/17 10:29:57 INFO util.GSet: capacity = 2^21 = 2097152 entries
>> 12/02/17 10:29:57 INFO util.GSet: recommended=2097152, actual=2097152
>> 12/02/17 10:29:57 INFO namenode.FSNamesystem: fsOwner=scottsue
>> 12/02/17 10:29:57 INFO namenode.FSNamesystem: supergroup=supergroup
>> 12/02/17 10:29:57 INFO namenode.FSNamesystem: isPermissionEnabled=true
>> 12/02/17 10:29:57 INFO namenode.FSNamesystem:
>> dfs.block.invalidate.limit=100
>> 12/02/17 10:29:57 INFO namenode.FSNamesystem: isAccessTokenEnabled=false
>> accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
>> 12/02/17 10:29:57 INFO namenode.FSNamesystem: Registered
>> FSNamesystemStateMBean and NameNodeMXBean
>> 12/02/17 10:29:57 INFO namenode.NameNode: Caching file names occuring more
>> than 10 times
>> 12/02/17 10:29:57 INFO common.Storage: Number of files = 190
>> 12/02/17 10:29:57 INFO common.Storage: Number of files under construction
>> = 0
>> 12/02/17 10:29:57 INFO common.Storage: Image file of size 26377 loaded in
>> 0 seconds.
>> 12/02/17 10:29:57 ERROR namenode.NameNode: java.lang.NullPointerException
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1113)
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1125)
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1028)
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:205)
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:613)
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1009)
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:827)
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:365)
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:97)
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:379)
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:353)
>> at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:254)
>> at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:434)
>> at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1153)
>> at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1162)
>> 
>> 12/02/17 10:29:57 INFO namenode.NameNode: SHUTDOWN_MSG:
>> /
>> SHUTDOWN_MSG: Shutting down NameNode at iMac.local/192.168.0.191
>> /
> 
> 
> 
> 
> -- 
> Join me at http://hadoopworkshop.eventbrite.com/



Re: What determines task attempt list URLs?

2012-02-28 Thread madhu phatak
Hi,
 It's better to use hostnames rather than IP addresses. If you use
hostnames, the task attempt URL will contain the hostname rather than
localhost.

On Fri, Feb 17, 2012 at 10:52 PM, Keith Wiley  wrote:

> What property or setup parameter determines the URLs displayed on the task
> attempts webpage of the job/task trackers?  My cluster seems to be
> configured such that all URLs for higher pages (the top cluster admin page,
> the individual job overview page, and the map/reduce task list page) show
> URLs by ip address, but the lowest page (the task attempt list for a single
> task) shows the URLs for the Machine and Task Logs columns by "localhost",
> not by ip address (although the Counters column still uses the ip address
> just like URLs on all the higher pages).
>
> The "localhost" links obviously don't work (the cluster is not on the
> local machine, it's on Tier 3)...unless I just happen to have a cluster
> also running on my local machine; then the links work but obviously they go
> to my local machine and thus describe a completely unrelated Hadoop
> cluster!!!  It goes without saying, that's ridiculous.
>
> So to get it to work, I have to manually copy/paste the ip address into
> the URLs every time I want to view those pages...which makes it incredibly
> tedious to view the task logs.
>
> I've asked this a few times now and have gotten no response.  Does no one
> have any idea how to properly configure Hadoop to get around this?  I've
> experimented with the mapred-site.xml mapred.job.tracker and
> mapred.task.tracker.http.address properties to no avail.
>
> What's going on here?
>
> 
>
>
> 
> Keith Wiley kwi...@keithwiley.com keithwiley.com
> music.keithwiley.com
>
> "I used to be with it, but then they changed what it was.  Now, what I'm
> with
> isn't it, and what's it seems weird and scary to me."
>   --  Abe (Grandpa) Simpson
>
> 
>
>


-- 
Join me at http://hadoopworkshop.eventbrite.com/


Re: namenode null pointer

2012-02-28 Thread madhu phatak
Hi,
 This may be an issue with the namenode not being formatted correctly.

On Sat, Feb 18, 2012 at 1:50 PM, Ben Cuthbert  wrote:

> All, sometimes when I start up my hadoop I get the following error
>
> 12/02/17 10:29:56 INFO namenode.NameNode: STARTUP_MSG:
> /
> STARTUP_MSG: Starting NameNode
> STARTUP_MSG: host =iMac.local/192.168.0.191
> STARTUP_MSG: args = []
> STARTUP_MSG: version = 0.20.203.0
> STARTUP_MSG: build =
> http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-203-r
>  1099333; compiled by 'oom' on Wed May 4 07:57:50 PDT 2011
> /
> 12/02/17 10:29:56 WARN impl.MetricsSystemImpl: Metrics system not started:
> Cannot locate configuration: tried hadoop-metrics2-namenode.properties,
> hadoop-metrics2.properties
> 2012-02-17 10:29:56.994 java[4065:1903] Unable to load realm info from
> SCDynamicStore
> 12/02/17 10:29:57 INFO util.GSet: VM type = 64-bit
> 12/02/17 10:29:57 INFO util.GSet: 2% max memory = 17.77875 MB
> 12/02/17 10:29:57 INFO util.GSet: capacity = 2^21 = 2097152 entries
> 12/02/17 10:29:57 INFO util.GSet: recommended=2097152, actual=2097152
> 12/02/17 10:29:57 INFO namenode.FSNamesystem: fsOwner=scottsue
> 12/02/17 10:29:57 INFO namenode.FSNamesystem: supergroup=supergroup
> 12/02/17 10:29:57 INFO namenode.FSNamesystem: isPermissionEnabled=true
> 12/02/17 10:29:57 INFO namenode.FSNamesystem:
> dfs.block.invalidate.limit=100
> 12/02/17 10:29:57 INFO namenode.FSNamesystem: isAccessTokenEnabled=false
> accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
> 12/02/17 10:29:57 INFO namenode.FSNamesystem: Registered
> FSNamesystemStateMBean and NameNodeMXBean
> 12/02/17 10:29:57 INFO namenode.NameNode: Caching file names occuring more
> than 10 times
> 12/02/17 10:29:57 INFO common.Storage: Number of files = 190
> 12/02/17 10:29:57 INFO common.Storage: Number of files under construction
> = 0
> 12/02/17 10:29:57 INFO common.Storage: Image file of size 26377 loaded in
> 0 seconds.
> 12/02/17 10:29:57 ERROR namenode.NameNode: java.lang.NullPointerException
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1113)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1125)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1028)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:205)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:613)
> at
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1009)
> at
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:827)
> at
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:365)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:97)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:379)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:353)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:254)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:434)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1153)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1162)
>
> 12/02/17 10:29:57 INFO namenode.NameNode: SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down NameNode at iMac.local/192.168.0.191
> /




-- 
Join me at http://hadoopworkshop.eventbrite.com/


Re: Invocation exception

2012-02-28 Thread Subir S
Sorry I missed this email.
Harsh's answer is apt. Please see the error log from the Job Tracker web UI for
the failed tasks (mapper/reducer) to find the exact reason.

On Tue, Feb 28, 2012 at 10:23 AM, Mohit Anchlia wrote:

> Does it matter if reducer is set even if the no of reducers is 0? Is there
> a way to get more clear reason?
>
> On Mon, Feb 27, 2012 at 8:23 PM, Subir S 
> wrote:
>
> > On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia  > >wrote:
> >
> > > For some reason I am getting invocation exception and I don't see any
> > more
> > > details other than this exception:
> > >
> > > My job is configured as:
> > >
> > >
> > > JobConf conf = *new* JobConf(FormMLProcessor.*class*);
> > >
> > > conf.addResource("hdfs-site.xml");
> > >
> > > conf.addResource("core-site.xml");
> > >
> > > conf.addResource("mapred-site.xml");
> > >
> > > conf.set("mapred.reduce.tasks", "0");
> > >
> > > conf.setJobName("mlprocessor");
> > >
> > > DistributedCache.*addFileToClassPath*(*new*
> Path("/jars/analytics.jar"),
> > > conf);
> > >
> > > DistributedCache.*addFileToClassPath*(*new* Path("/jars/common.jar"),
> > > conf);
> > >
> > > conf.setOutputKeyClass(Text.*class*);
> > >
> > > conf.setOutputValueClass(Text.*class*);
> > >
> > > conf.setMapperClass(Map.*class*);
> > >
> > > conf.setCombinerClass(Reduce.*class*);
> > >
> > > conf.setReducerClass(IdentityReducer.*class*);
> > >
> >
> > Why would you set the Reducer when the number of reducers is set to zero?
> > Not sure if this is the real cause.
> >
> >
> > >
> > > conf.setInputFormat(SequenceFileAsTextInputFormat.*class*);
> > >
> > > conf.setOutputFormat(TextOutputFormat.*class*);
> > >
> > > FileInputFormat.*setInputPaths*(conf, *new* Path(args[0]));
> > >
> > > FileOutputFormat.*setOutputPath*(conf, *new* Path(args[1]));
> > >
> > > JobClient.*runJob*(conf);
> > >
> > > -
> > > *
> > >
> > > java.lang.RuntimeException*: Error in configuring object
> > >
> > > at org.apache.hadoop.util.ReflectionUtils.setJobConf(*
> > > ReflectionUtils.java:93*)
> > >
> > > at
> > >
> org.apache.hadoop.util.ReflectionUtils.setConf(*ReflectionUtils.java:64*)
> > >
> > > at org.apache.hadoop.util.ReflectionUtils.newInstance(*
> > > ReflectionUtils.java:117*)
> > >
> > > at org.apache.hadoop.mapred.MapTask.runOldMapper(*MapTask.java:387*)
> > >
> > > at org.apache.hadoop.mapred.MapTask.run(*MapTask.java:325*)
> > >
> > > at org.apache.hadoop.mapred.Child$4.run(*Child.java:270*)
> > >
> > > at java.security.AccessController.doPrivileged(*Native Method*)
> > >
> > > at javax.security.auth.Subject.doAs(*Subject.java:396*)
> > >
> > > at org.apache.hadoop.security.UserGroupInformation.doAs(*
> > > UserGroupInformation.java:1157*)
> > >
> > > at org.apache.hadoop.mapred.Child.main(*Child.java:264*)
> > >
> > > Caused by: *java.lang.reflect.InvocationTargetException
> > > *
> > >
> > > at sun.reflect.NativeMethodAccessorImpl.invoke0(*Native Method*)
> > >
> > > at sun.reflect.NativeMethodAccessorImpl.invoke(*
> > > NativeMethodAccessorImpl.java:39*)
> > >
> > > at
> > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav
> > >
> >
>


Re: "Browse the filesystem" weblink broken after upgrade to 1.0.0: HTTP 404 "Problem accessing /browseDirectory.jsp"

2012-02-28 Thread madhu phatak
Hi,
 Just make sure that the Datanode is up by looking into the datanode logs.

On Sun, Feb 19, 2012 at 10:52 PM, W.P. McNeill  wrote:

> I am running in pseudo-distributed on my Mac and just upgraded from
> 0.20.203.0 to 1.0.0. The web interface for HDFS which was working in
> 0.20.203.0 is broken in 1.0.0.
>
> HDFS itself appears to work: a command line like "hadoop fs -ls /" returns
> a result, and the namenode web interface at http://
> http://localhost:50070/dfshealth.jsp comes up. However, when I click on
> the
> "Browse the filesystem" link on this page I get a 404 Error. The error
> message displayed in the browser reads:
>
> Problem accessing /browseDirectory.jsp. Reason:
>/browseDirectory.jsp
>
> The URL in the browser bar at this point is "
> http://0.0.0.0:50070/browseDirectory.jsp?namenodeInfoPort=50070&dir=/";.
> The
> HTML source to the link on the main namenode page is  href="/nn_browsedfscontent.jsp">Browse the filesystem. If I change the
> server location from 0.0.0.0 to localhost in my browser bar I get the same
> error.
>
> I updated my configuration files in the new hadoop 1.0.0 conf directory to
> transfer over my settings from 0.20.203.0. My conf/slaves file consists of
> the line "localhost".  I ran "hadoop-daemon.sh start namenode -upgrade"
> once when prompted by errors in the namenode logs. After that all the
> namenode and datanode logs contain no errors.
>
> For what it's worth, I've verified that the bug occurs on Firefox, Chrome,
> and Safari.
>
> Any ideas on what is wrong or how I should go about further debugging it?
>



-- 
Join me at http://hadoopworkshop.eventbrite.com/


Re: Invocation exception

2012-02-28 Thread Harsh J
Mohit,

If you visit the failed task attempt on the JT Web UI, you can see the
complete, informative stack trace on it. It would point to the exact line
the trouble came up in and show what the real error during the
configure-phase of task initialization was.

A simple attempts page goes like the following (replace job ID and
task ID of course):

http://host:50030/taskdetails.jsp?jobid=job_201202041249_3964&tipid=task_201202041249_3964_m_00

Once there, find and open the "All" logs link to see stdout, stderr,
and syslog of the specific failed task attempt. You'll have more info
sifting through this to debug your issue.

This is also explained in Tom's book under the title "Debugging a Job"
(p154, Hadoop: The Definitive Guide, 2nd ed.).

On Wed, Feb 29, 2012 at 1:40 AM, Mohit Anchlia  wrote:
> It looks like adding this line causes invocation exception. I looked in
> hdfs and I see that file in that path
>
> DistributedCache.*addFileToClassPath*(*new* Path("/jars/common.jar"), conf);
>
> I have similar code for another jar
> "DistributedCache.*addFileToClassPath*(*new* Path("/jars/analytics.jar"),
> conf);" but this works just fine.
>
>
> On Tue, Feb 28, 2012 at 11:44 AM, Mohit Anchlia wrote:
>
>> I commented reducer and combiner both and still I see the same exception.
>> Could it be because I have 2 jars being added?
>>
>>  On Mon, Feb 27, 2012 at 8:23 PM, Subir S wrote:
>>
>>> On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia >> >wrote:
>>>
>>> > For some reason I am getting invocation exception and I don't see any
>>> more
>>> > details other than this exception:
>>> >
>>> > My job is configured as:
>>> >
>>> >
>>> > JobConf conf = *new* JobConf(FormMLProcessor.*class*);
>>> >
>>> > conf.addResource("hdfs-site.xml");
>>> >
>>> > conf.addResource("core-site.xml");
>>> >
>>> > conf.addResource("mapred-site.xml");
>>> >
>>> > conf.set("mapred.reduce.tasks", "0");
>>> >
>>> > conf.setJobName("mlprocessor");
>>> >
>>> > DistributedCache.*addFileToClassPath*(*new* Path("/jars/analytics.jar"),
>>> > conf);
>>> >
>>> > DistributedCache.*addFileToClassPath*(*new* Path("/jars/common.jar"),
>>> > conf);
>>> >
>>> > conf.setOutputKeyClass(Text.*class*);
>>> >
>>> > conf.setOutputValueClass(Text.*class*);
>>> >
>>> > conf.setMapperClass(Map.*class*);
>>> >
>>> > conf.setCombinerClass(Reduce.*class*);
>>> >
>>> > conf.setReducerClass(IdentityReducer.*class*);
>>> >
>>>
>>> Why would you set the Reducer when the number of reducers is set to zero?
>>> Not sure if this is the real cause.
>>>
>>>
>>> >
>>> > conf.setInputFormat(SequenceFileAsTextInputFormat.*class*);
>>> >
>>> > conf.setOutputFormat(TextOutputFormat.*class*);
>>> >
>>> > FileInputFormat.*setInputPaths*(conf, *new* Path(args[0]));
>>> >
>>> > FileOutputFormat.*setOutputPath*(conf, *new* Path(args[1]));
>>> >
>>> > JobClient.*runJob*(conf);
>>> >
>>> > -
>>> > *
>>> >
>>> > java.lang.RuntimeException*: Error in configuring object
>>> >
>>> > at org.apache.hadoop.util.ReflectionUtils.setJobConf(*
>>> > ReflectionUtils.java:93*)
>>> >
>>> > at
>>> >
>>> org.apache.hadoop.util.ReflectionUtils.setConf(*ReflectionUtils.java:64*)
>>> >
>>> > at org.apache.hadoop.util.ReflectionUtils.newInstance(*
>>> > ReflectionUtils.java:117*)
>>> >
>>> > at org.apache.hadoop.mapred.MapTask.runOldMapper(*MapTask.java:387*)
>>> >
>>> > at org.apache.hadoop.mapred.MapTask.run(*MapTask.java:325*)
>>> >
>>> > at org.apache.hadoop.mapred.Child$4.run(*Child.java:270*)
>>> >
>>> > at java.security.AccessController.doPrivileged(*Native Method*)
>>> >
>>> > at javax.security.auth.Subject.doAs(*Subject.java:396*)
>>> >
>>> > at org.apache.hadoop.security.UserGroupInformation.doAs(*
>>> > UserGroupInformation.java:1157*)
>>> >
>>> > at org.apache.hadoop.mapred.Child.main(*Child.java:264*)
>>> >
>>> > Caused by: *java.lang.reflect.InvocationTargetException
>>> > *
>>> >
>>> > at sun.reflect.NativeMethodAccessorImpl.invoke0(*Native Method*)
>>> >
>>> > at sun.reflect.NativeMethodAccessorImpl.invoke(*
>>> > NativeMethodAccessorImpl.java:39*)
>>> >
>>> > at
>>> >
>>> >
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav
>>> >
>>>
>>
>>



-- 
Harsh J


Re: [blog post] Accumulo, Nutch, and Gora

2012-02-28 Thread Enis Söztutar
Fabulous work!

There are obviously a lot of local modifications to be done for nutch +
gora + accumulo to work. So feel free to propose these to upstream nutch
and gora.

It should feel good to run the web crawl, and store the results on
accumulo.

Cheers,
Enis

On Tue, Feb 28, 2012 at 6:24 PM, Mattmann, Chris A (388J) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> UMMM wow!
>
> That's awesome Jason! Thanks so much!
>
> Cheers,
> Chris
>
> On Feb 28, 2012, at 5:41 PM, Jason Trost wrote:
>
> > Blog post for anyone who's interested.  I cover a basic howto for
> > getting Nutch to use Apache Gora to store web crawl data in Accumulo.
> >
> > Let me know if you have any questions.
> >
> > Accumulo, Nutch, and GORA
> > http://www.covert.io/post/18414889381/accumulo-nutch-and-gora
> >
> > --Jason
>
>
> ++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattm...@nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>


Re: [blog post] Accumulo, Nutch, and Gora

2012-02-28 Thread Mattmann, Chris A (388J)
UMMM wow!

That's awesome Jason! Thanks so much!

Cheers,
Chris

On Feb 28, 2012, at 5:41 PM, Jason Trost wrote:

> Blog post for anyone who's interested.  I cover a basic howto for
> getting Nutch to use Apache Gora to store web crawl data in Accumulo.
> 
> Let me know if you have any questions.
> 
> Accumulo, Nutch, and GORA
> http://www.covert.io/post/18414889381/accumulo-nutch-and-gora
> 
> --Jason


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



toward Rack-Awareness approach

2012-02-28 Thread Patai Sangbutsarakum
Hi Hadoopers,

Currently I am running hadoop version 0.20.203 in production with 600 TB in her.
I am planning to enable rack awareness in production, but I still haven't
seen it through.

Plan/questions:

1. I have a script that can resolve a datanode/tasktracker IP to a rack name.
2. Add topology.script.file.name in hdfs-site.xml and restart the cluster.
3. After the cluster comes back, my questions start here:
- do I have to run the balancer or fsck or some other command to get those
600 TB redistributed across the racks in one go?
- currently I run the balancer for 2 hrs every day; can I keep this
routine and hope that at some point the data will be nicely
redistributed and aware of rack locations?
- how could we know that the data in the cluster is now fully rack
aware?
- if I just add the script and run the balancer 2 hrs every day, then before
all the data becomes rack aware it will be kind
  of a mix between "default-rack" existing data (not yet
balanced) and, probably, newly loaded data that is rack aware.
  Is it OK to have that mix of default-rack and rack-specific data together?

4. Thoughts?

Hope this makes sense,

Thanks in advance
Patai


[blog post] Accumulo, Nutch, and Gora

2012-02-28 Thread Jason Trost
Blog post for anyone who's interested.  I cover a basic howto for
getting Nutch to use Apache Gora to store web crawl data in Accumulo.

Let me know if you have any questions.

Accumulo, Nutch, and GORA
http://www.covert.io/post/18414889381/accumulo-nutch-and-gora

--Jason


Re: 100x slower mapreduce compared to pig

2012-02-28 Thread Prashant Kommireddi
It would be great if we could take a look at what you are doing in the UDF vs.
the Mapper.

100x slower does not make sense for the same job/logic; it's either the Mapper
code, or maybe the cluster was busy at the time you scheduled the MapReduce job?

Thanks,
Prashant

On Tue, Feb 28, 2012 at 4:11 PM, Mohit Anchlia wrote:

> I am comparing runtime of similar logic. The entire logic is exactly same
> but surprisingly map reduce job that I submit is 100x slow. For pig I use
> udf and for hadoop I use mapper only and the logic same as pig. Even the
> splits on the admin page are same. Not sure why it's so slow. I am
> submitting job like:
>
> java -classpath
>
> .:analytics.jar:/hadoop-0.20.2-cdh3u3/lib/*:/root/.mohit/hadoop-0.20.2-cdh3u3/*:common.jar
> com.services.dp.analytics.hadoop.mapred.FormMLProcessor
>
> /examples/testfile40.seq,/examples/testfile41.seq,/examples/testfile42.seq,/examples/testfile43.seq,/examples/testfile44.seq,/examples/testfile45.seq,/examples/testfile46.seq,/examples/testfile47.seq,/examples/testfile48.seq,/examples/testfile49.seq
> /examples/output1/
>
> How should I go about looking the root cause of why it's so slow? Any
> suggestions would be really appreciated.
>
>
>
> One of the things I noticed is that on the admin page of map task list I
> see status as "hdfs://dsdb1:54310/examples/testfile40.seq:0+134217728" but
> for pig the status is blank.
>


100x slower mapreduce compared to pig

2012-02-28 Thread Mohit Anchlia
I am comparing the runtime of similar logic. The entire logic is exactly the
same, but surprisingly the map reduce job that I submit is 100x slower. For Pig
I use a UDF and for Hadoop I use a mapper only, with the same logic as in Pig.
Even the splits on the admin page are the same. Not sure why it's so slow. I am
submitting the job like:

java -classpath
.:analytics.jar:/hadoop-0.20.2-cdh3u3/lib/*:/root/.mohit/hadoop-0.20.2-cdh3u3/*:common.jar
com.services.dp.analytics.hadoop.mapred.FormMLProcessor
/examples/testfile40.seq,/examples/testfile41.seq,/examples/testfile42.seq,/examples/testfile43.seq,/examples/testfile44.seq,/examples/testfile45.seq,/examples/testfile46.seq,/examples/testfile47.seq,/examples/testfile48.seq,/examples/testfile49.seq
/examples/output1/

How should I go about finding the root cause of why it's so slow? Any
suggestions would be really appreciated.



One of the things I noticed is that on the admin page of map task list I
see status as "hdfs://dsdb1:54310/examples/testfile40.seq:0+134217728" but
for pig the status is blank.


Re: How to modify hadoop-wordcount example to display File-wise results.

2012-02-28 Thread orayvah

Hi Srilatha,

I know this thread is quite old but I need your help with this.

I'm interested in also making some modifications to the hadoop Sort example.
Please could you give me pointers on how to rebuild hadoop to reflect the
changes made in the source.

I'm new to hadoop and would really appreciate your assistance.



us latha wrote:
> 
> Greetings!
> 
> Hi, Am trying to modify the WordCount.java mentioned at Example: WordCount
> v1.0 at
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html
> Would like to have output the following way,
> 
> FileOne    word1  itsCount
> FileOne    word2  itsCount
>   ..(and so on)
> FileTwo    word1  itsCount
> FileTwo    wordx  itsCount
>  ..
> FileThree  word1  itsCount
>  ..
> 
> Am trying to do following changes to the code of WordCount.java
> 
> 1)  private Text filename = new Text();  // Added this to the Map class. Not
> sure if I would have access to filename here.
> 2)  (line 18) OutputCollector output  // Changed the
> argument in the map() function to have another Text field.
> 3)  (line 23) output.collect(filename, word, one); // Trying to change the
> output format to 'filename word count'
> 
> Am not sure what other changes are to be made to achieve the required
> output. filename is not available to the map method.
> My requirement is to go through all the data available in hdfs and prepare
> an index file in <filename word count> format.
> Could you please throw light on how I can achieve this.
> 
> Thankyou
> Srilatha
> 
> 
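
For what it's worth, the file name can be picked up inside the old-API mapper
without changing the OutputCollector signature, either from the
"map.input.file" job property or from the task's input split, and folded into
the key. A rough sketch (class and variable names are made up):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FileWordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // The input split tells us which file this map task is reading.
    Path file = ((FileSplit) reporter.getInputSplit()).getPath();
    String fileName = file.getName();
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      // Fold the file name into the key so counts stay separate per file.
      output.collect(new Text(fileName + " " + itr.nextToken()), one);
    }
  }
}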

-- 
View this message in context: 
http://old.nabble.com/How-to-modify-hadoop-wordcount-example-to-display-File-wise-results.-tp19826857p33410747.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Invocation exception

2012-02-28 Thread Mohit Anchlia
It looks like adding this line causes the invocation exception. I looked in
HDFS and I see that file at that path:

DistributedCache.*addFileToClassPath*(*new* Path("/jars/common.jar"), conf);

I have similar code for another jar
"DistributedCache.*addFileToClassPath*(*new* Path("/jars/analytics.jar"),
conf);" but this works just fine.


On Tue, Feb 28, 2012 at 11:44 AM, Mohit Anchlia wrote:

> I commented reducer and combiner both and still I see the same exception.
> Could it be because I have 2 jars being added?
>
>  On Mon, Feb 27, 2012 at 8:23 PM, Subir S wrote:
>
>> On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia > >wrote:
>>
>> > For some reason I am getting invocation exception and I don't see any
>> more
>> > details other than this exception:
>> >
>> > My job is configured as:
>> >
>> >
>> > JobConf conf = *new* JobConf(FormMLProcessor.*class*);
>> >
>> > conf.addResource("hdfs-site.xml");
>> >
>> > conf.addResource("core-site.xml");
>> >
>> > conf.addResource("mapred-site.xml");
>> >
>> > conf.set("mapred.reduce.tasks", "0");
>> >
>> > conf.setJobName("mlprocessor");
>> >
>> > DistributedCache.*addFileToClassPath*(*new* Path("/jars/analytics.jar"),
>> > conf);
>> >
>> > DistributedCache.*addFileToClassPath*(*new* Path("/jars/common.jar"),
>> > conf);
>> >
>> > conf.setOutputKeyClass(Text.*class*);
>> >
>> > conf.setOutputValueClass(Text.*class*);
>> >
>> > conf.setMapperClass(Map.*class*);
>> >
>> > conf.setCombinerClass(Reduce.*class*);
>> >
>> > conf.setReducerClass(IdentityReducer.*class*);
>> >
>>
>> Why would you set the Reducer when the number of reducers is set to zero?
>> Not sure if this is the real cause.
>>
>>
>> >
>> > conf.setInputFormat(SequenceFileAsTextInputFormat.*class*);
>> >
>> > conf.setOutputFormat(TextOutputFormat.*class*);
>> >
>> > FileInputFormat.*setInputPaths*(conf, *new* Path(args[0]));
>> >
>> > FileOutputFormat.*setOutputPath*(conf, *new* Path(args[1]));
>> >
>> > JobClient.*runJob*(conf);
>> >
>> > -
>> > *
>> >
>> > java.lang.RuntimeException*: Error in configuring object
>> >
>> > at org.apache.hadoop.util.ReflectionUtils.setJobConf(*
>> > ReflectionUtils.java:93*)
>> >
>> > at
>> >
>> org.apache.hadoop.util.ReflectionUtils.setConf(*ReflectionUtils.java:64*)
>> >
>> > at org.apache.hadoop.util.ReflectionUtils.newInstance(*
>> > ReflectionUtils.java:117*)
>> >
>> > at org.apache.hadoop.mapred.MapTask.runOldMapper(*MapTask.java:387*)
>> >
>> > at org.apache.hadoop.mapred.MapTask.run(*MapTask.java:325*)
>> >
>> > at org.apache.hadoop.mapred.Child$4.run(*Child.java:270*)
>> >
>> > at java.security.AccessController.doPrivileged(*Native Method*)
>> >
>> > at javax.security.auth.Subject.doAs(*Subject.java:396*)
>> >
>> > at org.apache.hadoop.security.UserGroupInformation.doAs(*
>> > UserGroupInformation.java:1157*)
>> >
>> > at org.apache.hadoop.mapred.Child.main(*Child.java:264*)
>> >
>> > Caused by: *java.lang.reflect.InvocationTargetException
>> > *
>> >
>> > at sun.reflect.NativeMethodAccessorImpl.invoke0(*Native Method*)
>> >
>> > at sun.reflect.NativeMethodAccessorImpl.invoke(*
>> > NativeMethodAccessorImpl.java:39*)
>> >
>> > at
>> >
>> >
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav
>> >
>>
>
>


Re: Invocation exception

2012-02-28 Thread Mohit Anchlia
I commented out both the reducer and the combiner and still I see the same exception.
Could it be because I have 2 jars being added?

On Mon, Feb 27, 2012 at 8:23 PM, Subir S  wrote:

> On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia  >wrote:
>
> > For some reason I am getting invocation exception and I don't see any
> more
> > details other than this exception:
> >
> > My job is configured as:
> >
> >
> > JobConf conf = *new* JobConf(FormMLProcessor.*class*);
> >
> > conf.addResource("hdfs-site.xml");
> >
> > conf.addResource("core-site.xml");
> >
> > conf.addResource("mapred-site.xml");
> >
> > conf.set("mapred.reduce.tasks", "0");
> >
> > conf.setJobName("mlprocessor");
> >
> > DistributedCache.*addFileToClassPath*(*new* Path("/jars/analytics.jar"),
> > conf);
> >
> > DistributedCache.*addFileToClassPath*(*new* Path("/jars/common.jar"),
> > conf);
> >
> > conf.setOutputKeyClass(Text.*class*);
> >
> > conf.setOutputValueClass(Text.*class*);
> >
> > conf.setMapperClass(Map.*class*);
> >
> > conf.setCombinerClass(Reduce.*class*);
> >
> > conf.setReducerClass(IdentityReducer.*class*);
> >
>
> Why would you set the Reducer when the number of reducers is set to zero?
> Not sure if this is the real cause.
>
>
> >
> > conf.setInputFormat(SequenceFileAsTextInputFormat.*class*);
> >
> > conf.setOutputFormat(TextOutputFormat.*class*);
> >
> > FileInputFormat.*setInputPaths*(conf, *new* Path(args[0]));
> >
> > FileOutputFormat.*setOutputPath*(conf, *new* Path(args[1]));
> >
> > JobClient.*runJob*(conf);
> >
> > -
> > *
> >
> > java.lang.RuntimeException*: Error in configuring object
> >
> > at org.apache.hadoop.util.ReflectionUtils.setJobConf(*
> > ReflectionUtils.java:93*)
> >
> > at
> > org.apache.hadoop.util.ReflectionUtils.setConf(*ReflectionUtils.java:64*)
> >
> > at org.apache.hadoop.util.ReflectionUtils.newInstance(*
> > ReflectionUtils.java:117*)
> >
> > at org.apache.hadoop.mapred.MapTask.runOldMapper(*MapTask.java:387*)
> >
> > at org.apache.hadoop.mapred.MapTask.run(*MapTask.java:325*)
> >
> > at org.apache.hadoop.mapred.Child$4.run(*Child.java:270*)
> >
> > at java.security.AccessController.doPrivileged(*Native Method*)
> >
> > at javax.security.auth.Subject.doAs(*Subject.java:396*)
> >
> > at org.apache.hadoop.security.UserGroupInformation.doAs(*
> > UserGroupInformation.java:1157*)
> >
> > at org.apache.hadoop.mapred.Child.main(*Child.java:264*)
> >
> > Caused by: *java.lang.reflect.InvocationTargetException
> > *
> >
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(*Native Method*)
> >
> > at sun.reflect.NativeMethodAccessorImpl.invoke(*
> > NativeMethodAccessorImpl.java:39*)
> >
> > at
> >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav
> >
>


Did anyone notice this benchmark comparison between HyperTable and HBASE

2012-02-28 Thread Subir S
Is this true?
http://www.hypertable.com/why_hypertable/hypertable_vs_hbase_2/
I don't believe it completely, as it may be Hypertable's marketing as well.

Are there any benchmarks done by third parties comparing HBase with other NoSQL
databases? Pointers?

Thanks, Subir


Re: Hadoop and Hibernate

2012-02-28 Thread Owen O'Malley
On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
 wrote:

> If I create an executable jar file that contains all dependencies required
> by the MR job do all said dependencies get distributed to all nodes?

You can make a single jar and that will be distributed to all of the
machines that run the task, but it is better in most cases to use the
distributed cache.

See 
http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
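
As a rough sketch, the driver-side calls for that look something like the
following (the class name, jar names and HDFS paths are only placeholders, and
the jars have to be copied into HDFS first):

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(MyDriver.class); // MyDriver is a placeholder name
// Ship the dependency jars to every task's classpath instead of building
// one fat jar; the jars must already have been copied into HDFS.
DistributedCache.addFileToClassPath(new Path("/libs/hibernate3.jar"), conf);
DistributedCache.addFileToClassPath(new Path("/libs/my-entities.jar"), conf);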

> If I specify but one reducer, which node in the cluster will the reducer
> run on?

The scheduling is done by the JobTracker and it isn't possible to
control the location of the reducers.

-- Owen


Re: Spilled Records

2012-02-28 Thread Jie Li
Hi Dan,

You might want to post your Pig script to the Pig user mailing list.
Previously I did some experiments on Pig and Hive and I'll also be
interested in looking into your script.

Yeah Starfish now only supports Hadoop job-level tuning, and supporting
workflows like Pig and Hive is our top priority. We'll let you know once
we're ready.

Thanks,
Jie

On Tue, Feb 28, 2012 at 11:57 AM, Daniel Baptista <
daniel.bapti...@performgroup.com> wrote:

> Hi Jie,
>
> To be honest I don't think I understand enough of what our job is doing to
> be able to explain it.
>
> Thanks for the response though, I had figured that I was grasping at
> straws.
>
> I have looked at Starfish; however, all our jobs are submitted via Apache
> Pig so I don't know if it would be much good.
>
> Thanks again, Dan.
>
> -Original Message-
> From: Jie Li [mailto:ji...@cs.duke.edu]
> Sent: 28 February 2012 16:35
> To: common-user@hadoop.apache.org
> Subject: Re: Spilled Records
>
> Hello Dan,
>
> The fact that the spilled records are double the output records means
> the map task produces more than one spill file, and these spill files are
> read, merged and written to a single file, thus each record is spilled
> twice.
>
> I can't infer anything from the numbers of the two tasks. Could you provide
> more info, such as what the application is doing?
>
> If you like, you can also try our tool Starfish to see what's going on
> behind.
>
> Thanks,
> Jie
> --
> Starfish is an intelligent performance tuning tool for Hadoop.
> Homepage: www.cs.duke.edu/starfish/
> Mailing list: http://groups.google.com/group/hadoop-starfish
>
>
> On Tue, Feb 28, 2012 at 8:25 AM, Daniel Baptista <
> daniel.bapti...@performgroup.com> wrote:
>
> > Hi All,
> >
> > I am trying to improve the performance of my hadoop cluster and would
> like
> > to get some feedback on a couple of numbers that I am seeing.
> >
> > Below is the output from a single task (1 of 16) that took 3 mins 40
> > Seconds
> >
> > FileSystemCounters
> > FILE_BYTES_READ 214,653,748
> > HDFS_BYTES_READ 67,108,864
> > FILE_BYTES_WRITTEN 429,278,388
> >
> > Map-Reduce Framework
> > Combine output records 0
> > Map input records 2,221,478
> > Spilled Records 4,442,956
> > Map output bytes 210,196,148
> > Combine input records 0
> > Map output records 2,221,478
> >
> > And another task in the same job (16 of 16) that took 7 minutes and 19
> > seconds
> >
> > FileSystemCounters
> > FILE_BYTES_READ 199,003,192
> > HDFS_BYTES_READ 58,434,476
> > FILE_BYTES_WRITTEN 397,975,310
> >
> > Map-Reduce Framework
> > Combine output records 0
> > Map input records 2,086,789
> > Spilled Records 4,173,578
> > Map output bytes 194,813,958
> > Combine input records 0
> > Map output records 2,086,789
> >
> > Can anybody determine anything from these figures?
> >
> > The first task is twice as quick as the second yet the input and output
> > are comparable (certainly not double). In all of the tasks (in this and
> > other jobs) the spilled records are always double the output records,
> this
> > can't be 'normal'?
> >
> > Am I clutching at straws (it feels like I am).
> >
> > Thanks in advance, Dan.
> >
> >
>
>


Hadoop and Hibernate

2012-02-28 Thread Geoffry Roberts
All,

I am trying to use Hibernate within my reducer and it goeth not well.  Has
anybody ever successfully done this?

I have a java package that contains my Hadoop driver, mapper, and reducer
along with a persistence class.  I call Hibernate from the cleanup() method
in my reducer class.  It complains that it cannot find the persistence
class.  The class is in the same package as the reducer and this all would
work outside of Hadoop. The error is thrown when I attempt to begin a
transaction.

The error:

org.hibernate.MappingException: Unknown entity: qq.mob.depart.EpiState

The code:

protected void cleanup(Context ctx) throws IOException,
   InterruptedException {
...
org.hibernate.cfg.Configuration cfg = new org.hibernate.cfg.Configuration();
SessionFactory sessionFactory =
cfg.configure("hibernate.cfg.xml").buildSessionFactory();
cfg.addAnnotatedClass(EpiState.class); // This class is in the same
package as the reducer.
Session session = sessionFactory.openSession();
Transaction tx = session.getTransaction();
tx.begin(); //Error is thrown here.
...
}

If I create an executable jar file that contains all dependencies required
by the MR job do all said dependencies get distributed to all nodes?

If I specify but one reducer, which node in the cluster will the reducer
run on?

Thanks
-- 
Geoffry Roberts


RE: Spilled Records

2012-02-28 Thread Daniel Baptista
Hi Jie,

To be honest I don't think I understand enough of what our job is doing to be 
able to explain it. 

Thanks for the response though, I had figured that I was grasping at straws.

I have looked at Starfish; however, all our jobs are submitted via Apache Pig so 
I don't know if it would be much good.

Thanks again, Dan. 

-Original Message-
From: Jie Li [mailto:ji...@cs.duke.edu] 
Sent: 28 February 2012 16:35
To: common-user@hadoop.apache.org
Subject: Re: Spilled Records

Hello Dan,

The fact that the spilled records are double the output records means
the map task produces more than one spill file, and these spill files are
read, merged and written to a single file, thus each record is spilled
twice.

I can't infer anything from the numbers of the two tasks. Could you provide
more info, such as what the application is doing?

If you like, you can also try our tool Starfish to see what's going on
behind.

Thanks,
Jie
--
Starfish is an intelligent performance tuning tool for Hadoop.
Homepage: www.cs.duke.edu/starfish/
Mailing list: http://groups.google.com/group/hadoop-starfish


On Tue, Feb 28, 2012 at 8:25 AM, Daniel Baptista <
daniel.bapti...@performgroup.com> wrote:

> Hi All,
>
> I am trying to improve the performance of my hadoop cluster and would like
> to get some feedback on a couple of numbers that I am seeing.
>
> Below is the output from a single task (1 of 16) that took 3 mins 40
> Seconds
>
> FileSystemCounters
> FILE_BYTES_READ 214,653,748
> HDFS_BYTES_READ 67,108,864
> FILE_BYTES_WRITTEN 429,278,388
>
> Map-Reduce Framework
> Combine output records 0
> Map input records 2,221,478
> Spilled Records 4,442,956
> Map output bytes 210,196,148
> Combine input records 0
> Map output records 2,221,478
>
> And another task in the same job (16 of 16) that took 7 minutes and 19
> seconds
>
> FileSystemCounters
> FILE_BYTES_READ 199,003,192
> HDFS_BYTES_READ 58,434,476
> FILE_BYTES_WRITTEN 397,975,310
>
> Map-Reduce Framework
> Combine output records 0
> Map input records 2,086,789
> Spilled Records 4,173,578
> Map output bytes 194,813,958
> Combine input records 0
> Map output records 2,086,789
>
> Can anybody determine anything from these figures?
>
> The first task is twice as quick as the second yet the input and output
> are comparable (certainly not double). In all of the tasks (in this and
> other jobs) the spilled records are always double the output records, this
> can't be 'normal'?
>
> Am I clutching at straws (it feels like I am).
>
> Thanks in advance, Dan.
>
>


Re: Spilled Records

2012-02-28 Thread Jie Li
Hello Dan,

The fact that the spilled records are double the output records means
the map task produces more than one spill file, and these spill files are
read, merged and written to a single file, thus each record is spilled
twice.
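
If those extra spills turn out to matter, the usual knob is the map-side sort
buffer. A sketch of what that looks like in the job setup (the values are just
examples, not recommendations for your cluster):

import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf();
// Bigger in-memory sort buffer for map output, so that a map's output
// fits in memory and is spilled only once (0.20 default is 100 MB).
conf.set("io.sort.mb", "200");
// Fill level of that buffer at which a background spill kicks in
// (0.20 default is 0.80).
conf.setFloat("io.sort.spill.percent", 0.90f);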

I can't infer anything from the numbers of the two tasks. Could you provide
more info, such as what the application is doing?

If you like, you can also try our tool Starfish to see what's going on
behind.

Thanks,
Jie
--
Starfish is an intelligent performance tuning tool for Hadoop.
Homepage: www.cs.duke.edu/starfish/
Mailing list: http://groups.google.com/group/hadoop-starfish


On Tue, Feb 28, 2012 at 8:25 AM, Daniel Baptista <
daniel.bapti...@performgroup.com> wrote:

> Hi All,
>
> I am trying to improve the performance of my hadoop cluster and would like
> to get some feedback on a couple of numbers that I am seeing.
>
> Below is the output from a single task (1 of 16) that took 3 mins 40
> Seconds
>
> FileSystemCounters
> FILE_BYTES_READ 214,653,748
> HDFS_BYTES_READ 67,108,864
> FILE_BYTES_WRITTEN 429,278,388
>
> Map-Reduce Framework
> Combine output records 0
> Map input records 2,221,478
> Spilled Records 4,442,956
> Map output bytes 210,196,148
> Combine input records 0
> Map output records 2,221,478
>
> And another task in the same job (16 of 16) that took 7 minutes and 19
> seconds
>
> FileSystemCounters
> FILE_BYTES_READ 199,003,192
> HDFS_BYTES_READ 58,434,476
> FILE_BYTES_WRITTEN 397,975,310
>
> Map-Reduce Framework
> Combine output records 0
> Map input records 2,086,789
> Spilled Records 4,173,578
> Map output bytes 194,813,958
> Combine input records 0
> Map output records 2,086,789
>
> Can anybody determine anything from these figures?
>
> The first task is twice as quick as the second yet the input and output
> are comparable (certainly not double). In all of the tasks (in this and
> other jobs) the spilled records are always double the output records, this
> can't be 'normal'?
>
> Am I clutching at straws (it feels like I am).
>
> Thanks in advance, Dan.
>
>


Should splittable Gzip be a "core" hadoop feature?

2012-02-28 Thread Niels Basjes
Hi,

Some time ago I had an idea and implemented it.

Normally you can only run a single gzipped input file through a single
mapper and thus only on a single CPU core.
What I created makes it possible to process a Gzipped file in such a way
that it can run on several mappers in parallel.

I've put the javadoc I created on my homepage so you can read more about
the details.
http://howto.basjes.nl/hadoop/javadoc-for-skipseeksplittablegzipcodec

Now the question that was raised by one of the people reviewing this code
was: Should this implementation be part of the core Hadoop feature set?
The main reason that was given is that this needs a bit more understanding
of what is happening and as such cannot be enabled by default.
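
To make "setting the right configuration" concrete, enabling it would
essentially mean registering the codec in the io.compression.codecs list,
roughly like this (the codec class name below is just a placeholder for
wherever it ends up living):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// The last entry is a placeholder class name, not the real one.
conf.set("io.compression.codecs",
    "org.apache.hadoop.io.compress.DefaultCodec,"
  + "org.apache.hadoop.io.compress.GzipCodec,"
  + "nl.basjes.hadoop.io.compress.SplittableGzipCodec");

Part of the extra understanding needed is that both GzipCodec and this codec
would want to handle .gz files, so which one is used has to be a conscious
choice.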

I would like to hear from the Hadoop Core/Map reduce users what you think.

Should this be
- a part of the default Hadoop feature set so that anyone can simply enable
it by setting the right configuration?
- a separate library?
- a nice idea I had fun building but that no one needs?
- ... ?

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Handling bad records

2012-02-28 Thread Harsh J
Subir,

No, not unless you use a specialized streaming library (pydoop, dumbo,
etc. for python, for example).
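
For the plain Java API, for comparison, the example madhu quoted boils down
to roughly this (a sketch only; class and type choices are made up):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class BadRecordReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private MultipleOutputs mos;

  public void configure(JobConf conf) {
    // The driver must have declared the named output, e.g.:
    // MultipleOutputs.addMultiNamedOutput(conf, "seq",
    //     SequenceFileOutputFormat.class, Text.class, Text.class);
    mos = new MultipleOutputs(conf);
  }

  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // These writes go to the "seq" named output, split into the "A" and "B"
    // parts (the seq-A* / seq-B* files mentioned below); the regular
    // collector is still available for the normal output.
    mos.getCollector("seq", "A", reporter).collect(key, new Text("Bye"));
    mos.getCollector("seq", "B", reporter).collect(key, new Text("Chau"));
  }

  public void close() throws IOException {
    mos.close();
  }
}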

On Tue, Feb 28, 2012 at 2:19 PM, Subir S  wrote:
> Can multiple output be used with Hadoop Streaming?
>
> On Tue, Feb 28, 2012 at 2:07 PM, madhu phatak  wrote:
>
>> Hi Mohit,
>>  A and B refer to two different output files (multipart name). The file
>> names will be seq-A* and seq-B*.  It's similar to the "r" in part-r-0
>>
>> On Tue, Feb 28, 2012 at 11:37 AM, Mohit Anchlia > >wrote:
>>
>> > Thanks that's helpful. In that example what is "A" and "B" referring to?
>> Is
>> > that the output file name?
>> >
>> > mos.getCollector("seq", "A", reporter).collect(key, new Text("Bye"));
>> > mos.getCollector("seq", "B", reporter).collect(key, new Text("Chau"));
>> >
>> >
>> > On Mon, Feb 27, 2012 at 9:53 PM, Harsh J  wrote:
>> >
>> > > Mohit,
>> > >
>> > > Use the MultipleOutputs API:
>> > >
>> > >
>> >
>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
>> > > to have a named output of bad records. There is an example of use
>> > > detailed on the link.
>> > >
>> > > On Tue, Feb 28, 2012 at 3:48 AM, Mohit Anchlia > >
>> > > wrote:
>> > > > What's the best way to write records to a different file? I am doing
>> > xml
>> > > > processing and during processing I might come accross invalid xml
>> > format.
>> > > > Current I have it under try catch block and writing to log4j. But I
>> > think
>> > > > it would be better to just write it to an output file that just
>> > contains
>> > > > errors.
>> > >
>> > >
>> > >
>> > > --
>> > > Harsh J
>> > >
>> >
>>
>>
>>
>> --
>> Join me at http://hadoopworkshop.eventbrite.com/
>>



-- 
Harsh J


Spilled Records

2012-02-28 Thread Daniel Baptista
Hi All,

I am trying to improve the performance of my hadoop cluster and would like to 
get some feedback on a couple of numbers that I am seeing.

Below is the output from a single task (1 of 16) that took 3 mins 40 Seconds

FileSystemCounters
FILE_BYTES_READ 214,653,748
HDFS_BYTES_READ 67,108,864
FILE_BYTES_WRITTEN 429,278,388

Map-Reduce Framework
Combine output records 0
Map input records 2,221,478
Spilled Records 4,442,956
Map output bytes 210,196,148
Combine input records 0
Map output records 2,221,478

And another task in the same job (16 of 16) that took 7 minutes and 19 seconds

FileSystemCounters
FILE_BYTES_READ 199,003,192
HDFS_BYTES_READ 58,434,476
FILE_BYTES_WRITTEN 397,975,310

Map-Reduce Framework
Combine output records 0
Map input records 2,086,789
Spilled Records 4,173,578
Map output bytes 194,813,958
Combine input records 0
Map output records 2,086,789

Can anybody determine anything from these figures?

The first task is twice as quick as the second, yet the input and output are
comparable (certainly not double). In all of the tasks (in this and other jobs)
the spilled records are always double the output records; this can't be
'normal'?

Am I clutching at straws (it feels like I am).

Thanks in advance, Dan.



Re: LZO exception decompressing (returned -8)

2012-02-28 Thread Joey Echeverria
Try 0.4.15. You can get it from here:

https://github.com/toddlipcon/hadoop-lzo

Sent from my iPhone

On Feb 28, 2012, at 6:49, Marc Sturlese  wrote:

> I'm with 0.4.9 (think is the latest)
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3783927.html
> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: LZO exception decompressing (returned -8)

2012-02-28 Thread Marc Sturlese
I'm with 0.4.9 (think is the latest)

--
View this message in context: 
http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3783927.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: LZO exception decompressing (returned -8)

2012-02-28 Thread Joey Echeverria
Which version of the Hadoop LZO library are you using? It looks like something 
I'm pretty sure was fixed in a newer version. 

-Joey



On Feb 28, 2012, at 4:58, Marc Sturlese  wrote:

> Hey there,
> I've been running a cluster for over a year and was getting a lzo
> decompressing exception less than once a month. Suddenly it happens almost
> once per day. Any ideas what could be causing it? I'm with hadoop 0.20.2
> I've thought in moving to snappy but would like to know why this happens
> more often now
> 
> The exception happens always when the reducer gets data from the map and
> looks like:
> 
> Error: java.lang.InternalError: lzo1x_decompress returned: -8
>at com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native
> Method)
>at
> com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:305)
>at
> org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:76)
>at
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
>at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1553)
>at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1432)
>at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1285)
>at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1216)
> 
> Thanks in advance.
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3783652.html
> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: ClassNotFoundException: -libjars not working?

2012-02-28 Thread Ioan Eugen Stan

On 28.02.2012 10:58, madhu phatak wrote:

Hi,
  -libjars doesn't always work. A better way is to create a runnable jar with
all dependencies (if the number of dependencies is small), or you have to keep
the jars in the lib folder of Hadoop on all machines.



Thanks for the reply Madhu,

I adopted the second solution as explained in [1]. From what I found
browsing the net it seems that -libjars is broken in hadoop versions >
0.18. I haven't had time to check the code yet. Cloudera's released hadoop
sources are packaged a bit oddly and Netbeans doesn't seem to play well
with that, which really dampens my will to try to fix the problem.


"-libjars" is a nice feature that permits the use of skinny jars and 
would help system admins do better packaging. It also allows better 
control over the classpath. Too bad it didn't work.
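
One gotcha worth noting: as far as I can tell -libjars is only picked up when
the driver runs its arguments through GenericOptionsParser, i.e. when it goes
through ToolRunner. A skeleton like the following (the class name is made up)
is what that usually looks like:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJobDriver extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    // getConf() already contains whatever -libjars/-D/-files options
    // GenericOptionsParser stripped off the command line.
    Configuration conf = getConf();
    // ... configure and submit the job with this conf ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyJobDriver(), args));
  }
}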



[1] 
http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/


Cheers,

--
Ioan Eugen Stan
http://ieugen.blogspot.com


Re: Need help on hadoop eclipse plugin

2012-02-28 Thread praveenesh kumar
So I made the above changes using WinRAR, which embedded those changes inside
the jar itself. I didn't need to extract the jar contents and construct a new
jar again.
I just replaced the old jar with this new jar and restarted Eclipse with
eclipse -clean.
I am now able to run the hadoop eclipse plugin without any error in Eclipse
Helios 3.6.2.

However, now I am looking to use the same plugin in IBM RAD 8.0.
I am getting the following error in the .log:

!ENTRY org.eclipse.core.jobs 4 2 2012-02-28 05:26:12.056
!MESSAGE An internal error occurred during: "Connecting to DFS lxe9700".
!STACK 0
java.lang.NoClassDefFoundError:
org.apache.hadoop.security.UserGroupInformation (initialization failure)
at java.lang.J9VMInternals.initialize(Unknown Source)
at org.apache.hadoop.fs.FileSystem$Cache$Key.(Unknown Source)
at org.apache.hadoop.fs.FileSystem$Cache.get(Unknown Source)
at org.apache.hadoop.fs.FileSystem.get(Unknown Source)
at org.apache.hadoop.fs.FileSystem.get(Unknown Source)
at org.apache.hadoop.eclipse.server.HadoopServer.getDFS(Unknown Source)
at org.apache.hadoop.eclipse.dfs.DFSPath.getDFS(Unknown Source)
at
org.apache.hadoop.eclipse.dfs.DFSFolder.loadDFSFolderChildren(Unknown
Source)
at org.apache.hadoop.eclipse.dfs.DFSFolder$1.run(Unknown Source)
at org.eclipse.core.internal.jobs.Worker.run(Unknown Source)

I downloaded oracle jdk and changed the IBM RAD to use Oracle JDK 1.7 ,
still I am seeing the above error.
Can anyone help me in debugging this issue ?

Thanks,
Praveenesh

On Tue, Feb 28, 2012 at 1:12 PM, praveenesh kumar wrote:

> Hi all,
>
> I am trying to use hadoop eclipse plugin on my windows machine to connect
> to the my remote hadoop cluster. I am currently using putty to login to the
> cluster. So ssh is enable and my windows machine is able to listen to my
> hadoop cluster.
>
> I am using hadoop 0.20.205, hadoop-eclipse plugin -0.20.205.jar . eclipse
> helios Version: 3.6.2,  Oracle JDK 1.7
>
> If I am using original eclipse-plugin.jar by putting it inside my
> $ECLIPSE_HOME/dropins or /plugins folder, I am able to see Hadoop
> map-reduce perspective.
>
> But after specifying hadoop NN / JT connections, I am seeing the following
> error, whenever I am trying to access the HDFS.
>
> An internal error occurred during: "Connecting to DFS lxe9700".
> org/apache/commons/configuration/Configuration
>
> "Connecting to DFS lxe9700' has encountered a problem.
> An internal error occured during " Connecting to DFS"
>
> After seeing the .log file .. I am seeing the following lines :
>
> !MESSAGE An internal error occurred during: "Connecting to DFS lxe9700".
> !STACK 0
> java.lang.NoClassDefFoundError:
> org/apache/commons/configuration/Configuration
> at
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.(DefaultMetricsSystem.java:37)
> at
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.(DefaultMetricsSystem.java:34)
> at
> org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
> at
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:196)
> at
> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:159)
> at
> org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:216)
> at
> org.apache.hadoop.security.KerberosName.(KerberosName.java:83)
> at
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:189)
> at
> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:159)
> at
> org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:216)
> at
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:409)
> at
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:395)
> at
> org.apache.hadoop.fs.FileSystem$Cache$Key.(FileSystem.java:1436)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1337)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:244)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:122)
> at
> org.apache.hadoop.eclipse.server.HadoopServer.getDFS(HadoopServer.java:469)
> at org.apache.hadoop.eclipse.dfs.DFSPath.getDFS(DFSPath.java:146)
> at
> org.apache.hadoop.eclipse.dfs.DFSFolder.loadDFSFolderChildren(DFSFolder.java:61)
> at org.apache.hadoop.eclipse.dfs.DFSFolder$1.run(DFSFolder.java:178)
> at org.eclipse.core.internal.jobs.Worker.run(Worker.java:54)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.commons.configuration.Configuration
> at
> org.eclipse.osgi.internal.loader.BundleLoader.findClassInternal(BundleLoader.java:506)
> at
> org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:422)
> at
> org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:410)
>

LZO exception decompressing (returned -8)

2012-02-28 Thread Marc Sturlese
Hey there,
I've been running a cluster for over a year and was getting an LZO
decompression exception less than once a month. Suddenly it happens almost
once per day. Any ideas what could be causing it? I'm on hadoop 0.20.2.
I've thought of moving to Snappy, but would like to know why this happens
more often now.

The exception always happens when the reducer gets data from the map and
looks like:

Error: java.lang.InternalError: lzo1x_decompress returned: -8
at 
com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native
Method)
at
com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:305)
at
org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:76)
at
org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1553)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1432)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1285)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1216)

Thanks in advance.



Re: hadoop streaming : need help in using custom key value separator

2012-02-28 Thread Austin Chungath
Thanks Subir,

"-D stream.mapred.output.field.separator=*" is not a valid option; my mistake.
What I should have used is:

-D stream.map.output.field.separator=*
On Tue, Feb 28, 2012 at 2:36 PM, Subir S  wrote:

>
> http://hadoop.apache.org/common/docs/current/streaming.html#Customizing+How+Lines+are+Split+into+Key%2FValue+Pairs
>
> Read this link; the options you used below are wrong.
>
>
>
> On Tue, Feb 28, 2012 at 1:13 PM, Austin Chungath 
> wrote:
>
> > When I am using more than one reducer in Hadoop Streaming, where I am
> > using my custom separator rather than the tab, it looks like the Hadoop
> > shuffle process is not happening as it should.
> >
> > This is the reducer output when I am using '\t' to separate my key value
> > pair that is output from the mapper.
> >
> > *output from reducer 1:*
> > 10321,22
> > 23644,37
> > 41231,42
> > 23448,20
> > 12325,39
> > 71234,20
> > *output from reducer 2:*
> > 24123,43
> > 33213,46
> > 11321,29
> > 21232,32
> >
> > The above output is as expected: the first column is the key and the
> > second column is the count. There are 10 unique keys; 6 of them are in the
> > output of the first reducer and the remaining 4 in the second reducer's
> > output.
> >
> > But now I use a custom separator for the key/value pair output from my
> > mapper. Here I am using '*' as the separator:
> > -D stream.mapred.output.field.separator=*
> > -D mapred.reduce.tasks=2
> >
> > *output from reducer 1:*
> > 10321,5
> > 21232,19
> > 24123,16
> > 33213,28
> > 23644,21
> > 41231,12
> > 23448,18
> > 11321,29
> > 12325,24
> > 71234,9
> > *output from reducer 2:*
> > 10321,17
> > 21232,13
> > 33213,18
> > 23644,16
> > 41231,30
> > 23448,2
> > 24123,27
> > 12325,15
> > 71234,11
> >
> > Now both reducers are getting all the keys, and part of the values go to
> > reducer 1 while the rest go to reducer 2.
> > Why is it behaving like this when I am using a custom separator? Shouldn't
> > each reducer get a unique set of keys after the shuffle?
> > I am using Hadoop 0.20.205.0, and below is the command that I am using to
> > run Hadoop Streaming. Are there more options that I should specify for
> > Hadoop Streaming to work properly if I am using a custom separator?
> >
> > hadoop jar
> > $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-0.20.205.0.jar
> > -D stream.mapred.output.field.separator=*
> > -D mapred.reduce.tasks=2
> > -mapper ./map.py
> > -reducer ./reducer.py
> > -file ./map.py
> > -file ./reducer.py
> > -input /user/inputdata
> > -output /user/outputdata
> > -verbose
> >
> >
> > Any help is much appreciated,
> > Thanks,
> > Austin
> >
>


Re: PathFilter File Glob

2012-02-28 Thread Idris Ali
Hi,

Why not just use:
FileSystem fs = FileSystem.get(conf);
FileStatus[] files = fs.globStatus(new Path(path+filter));

Thanks,
-Idris

On Mon, Feb 27, 2012 at 1:06 PM, Harsh J  wrote:

> Hi Simon,
>
> You need to implement your custom PathFilter derivative class, and
> then set it via your {File}InputFormat class using setInputPathFilter:
>
>
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html#setInputPathFilter(org.apache.hadoop.mapred.JobConf,%20java.lang.Class)
>
> (TextInputFormat is a derivative of FileInputFormat, and hence has the
> same method.)
>
> HTH.
>
> 2012/2/23 Heeg, Simon :
> > Hello,
> >
> > I would like to use a PathFilter to filter, with a regular expression, the
> > files read by the TextInputFormat, but I don't know how to apply the
> > filter. I cannot find a setter. Unfortunately Google was not my friend with
> > this issue, and "The Definitive Guide" does not help that much. I am using
> > Hadoop 0.20.2-cdh3u3.
> >
>
> --
> Harsh J
>
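
A minimal sketch of the approach Harsh describes, using the old mapred API
that ships with 0.20.2-cdh3u3; the class name and the exclusion pattern are
made up for illustration:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class FilteredInputJob {

  // Custom filter: reject any path whose name matches the pattern;
  // everything else, including directories, is accepted.
  public static class RegexExcludeFilter implements PathFilter {
    public boolean accept(Path path) {
      return !path.getName().matches(".*\\.tmp");
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(FilteredInputJob.class);
    conf.setInputFormat(TextInputFormat.class);
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    // TextInputFormat inherits this setter from FileInputFormat.
    FileInputFormat.setInputPathFilter(conf, RegexExcludeFilter.class);
    // ... set the mapper, reducer and output path, then JobClient.runJob(conf) ...
  }
}

Idris's globStatus approach works just as well when the goal is simply to
build the list of matching paths up front and hand them to
FileInputFormat.setInputPaths.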


Re: HDFS problem in hadoop 0.20.203

2012-02-28 Thread madhu phatak
Hi,
 Did you format the HDFS?

On Tue, Feb 21, 2012 at 7:40 PM, Shi Yu  wrote:

> Hi Hadoopers,
>
> We are experiencing a strange problem on Hadoop 0.20.203
>
> Our cluster has 58 nodes, everything is started from a fresh
> HDFS (we deleted all local folders on datanodes and
> reformatted the namenode).  After running some small jobs, the
> HDFS starts behaving abnormally and the jobs become very
> slow. The namenode log is flooded with gigabytes of errors like
> this:
>
> 2012-02-21 00:00:38,632 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_4524177823306792294 is added
> to invalidSet of 10.105.19.31:50010
> 2012-02-21 00:00:38,632 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_4524177823306792294 is added
> to invalidSet of 10.105.19.18:50010
> 2012-02-21 00:00:38,632 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_4524177823306792294 is added
> to invalidSet of 10.105.19.32:50010
> 2012-02-21 00:00:38,632 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_2884522252507300332 is added
> to invalidSet of 10.105.19.35:50010
> 2012-02-21 00:00:38,632 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_2884522252507300332 is added
> to invalidSet of 10.105.19.27:50010
> 2012-02-21 00:00:38,632 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_2884522252507300332 is added
> to invalidSet of 10.105.19.33:50010
> 2012-02-21 00:00:38,632 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated:
> 10.105.19.21:50010 is added to blk_-
> 6843171124277753504_2279882 size 124490
> 2012-02-21 00:00:38,632 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.allocateBlock:
> /syu/output/naive/iter5_partout1/_temporary/_attempt_201202202
> 043_0013_m_000313_0/result_stem-m-00313. blk_-
> 6379064588594672168_2279890
> 2012-02-21 00:00:38,633 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated:
> 10.105.19.26:50010 is added to blk_5338983375361999760_2279887
> size 1476
> 2012-02-21 00:00:38,633 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated:
> 10.105.19.29:50010 is added to blk_-977828927900581074_2279887
> size 13818
> 2012-02-21 00:00:38,633 INFO
> org.apache.hadoop.hdfs.StateChange: DIR*
> NameSystem.completeFile: file
> /syu/output/naive/iter5_partout1/_temporary/_attempt_201202202
> 043_0013_m_000364_0/result_stem-m-00364 is closed by
> DFSClient_attempt_201202202043_0013_m_000364_0
> 2012-02-21 00:00:38,633 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated:
> 10.105.19.23:50010 is added to blk_5338983375361999760_2279887
> size 1476
> 2012-02-21 00:00:38,633 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated:
> 10.105.19.20:50010 is added to blk_5338983375361999760_2279887
> size 1476
> 2012-02-21 00:00:38,633 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.allocateBlock:
> /syu/output/naive/iter5_partout1/_temporary/_attempt_201202202
> 043_0013_m_000364_0/result_suffix-m-00364.
> blk_1921685366929756336_2279890
> 2012-02-21 00:00:38,634 INFO
> org.apache.hadoop.hdfs.StateChange: DIR*
> NameSystem.completeFile: file
> /syu/output/naive/iter5_partout1/_temporary/_attempt_201202202
> 043_0013_m_000279_0/result_suffix-m-00279 is closed by
> DFSClient_attempt_201202202043_0013_m_000279_0
> 2012-02-21 00:00:38,635 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_495061820035691700 is added
> to invalidSet of 10.105.19.20:50010
> 2012-02-21 00:00:38,635 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_495061820035691700 is added
> to invalidSet of 10.105.19.25:50010
> 2012-02-21 00:00:38,635 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_495061820035691700 is added
> to invalidSet of 10.105.19.33:50010
> 2012-02-21 00:00:38,635 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.allocateBlock:
> /syu/output/naive/iter5_partout1/_temporary/_attempt_201202202
> 043_0013_m_000284_0/result_stem-m-00284.
> blk_8796188324642771330_2279891
> 2012-02-21 00:00:38,638 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated:
> 10.105.19.34:50010 is added to blk_-977828927900581074_2279887
> size 13818
> 2012-02-21 00:00:38,638 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.allocateBlock:
> /syu/output/naive/iter5_partout1/_temporary/_attempt_201202202
> 043_0013_m_000296_0/result_stem-m-00296. blk_-
> 6800409224007034579_2279891
> 2012-02-21 00:00:38,638 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated:
> 10.105.19.29:50010 is added to blk_192168536692975

Re: Difference between hdfs dfs and hdfs fs

2012-02-28 Thread madhu phatak
Hi Mohit,
 FS is a generic filesystem which can point to any file system, such as
LocalFileSystem, HDFS, etc., while dfs is specific to HDFS. So when you use fs
it can copy from the local file system to HDFS, but when you specify dfs the
source file has to be on HDFS.
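
For what it's worth, the Java API has the same split; a rough sketch of how
the generic FileSystem entry point resolves to a concrete implementation (the
namenode host below is hypothetical):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FsVsDfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Generic entry point: returns whatever fs.default.name points to
    // (LocalFileSystem, DistributedFileSystem, ...).
    FileSystem defaultFs = FileSystem.get(conf);

    // Explicit schemes pick a concrete implementation.
    FileSystem local = FileSystem.get(URI.create("file:///"), conf);
    FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

    System.out.println(defaultFs.getUri() + " " + local.getUri() + " " + hdfs.getUri());
  }
}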

On Tue, Feb 21, 2012 at 10:46 PM, Mohit Anchlia wrote:

> What's the difference between the hdfs dfs and hdfs fs commands? When I run
> hdfs dfs -copyFromLocal /assa . and use Pig it can't find the file, but when
> I use hdfs fs, Pig is able to find it.
>



-- 
Join me at http://hadoopworkshop.eventbrite.com/


Re: hadoop streaming : need help in using custom key value separator

2012-02-28 Thread Subir S
http://hadoop.apache.org/common/docs/current/streaming.html#Customizing+How+Lines+are+Split+into+Key%2FValue+Pairs

Read this link; the options you used below are wrong.



On Tue, Feb 28, 2012 at 1:13 PM, Austin Chungath  wrote:

> When I am using more than one reducer in Hadoop Streaming, where I am using
> my custom separator rather than the tab, it looks like the Hadoop shuffle
> process is not happening as it should.
>
> This is the reducer output when I am using '\t' to separate my key value
> pair that is output from the mapper.
>
> *output from reducer 1:*
> 10321,22
> 23644,37
> 41231,42
> 23448,20
> 12325,39
> 71234,20
> *output from reducer 2:*
> 24123,43
> 33213,46
> 11321,29
> 21232,32
>
> The above output is as expected: the first column is the key and the second
> column is the count. There are 10 unique keys; 6 of them are in the output of
> the first reducer and the remaining 4 in the second reducer's output.
>
> But now I use a custom separator for the key/value pair output from my
> mapper. Here I am using '*' as the separator:
> -D stream.mapred.output.field.separator=*
> -D mapred.reduce.tasks=2
>
> *output from reducer 1:*
> 10321,5
> 21232,19
> 24123,16
> 33213,28
> 23644,21
> 41231,12
> 23448,18
> 11321,29
> 12325,24
> 71234,9
> *output from reducer 2:*
> 10321,17
> 21232,13
> 33213,18
> 23644,16
> 41231,30
> 23448,2
> 24123,27
> 12325,15
> 71234,11
>
> Now both reducers are getting all the keys, and part of the values go to
> reducer 1 while the rest go to reducer 2.
> Why is it behaving like this when I am using a custom separator? Shouldn't
> each reducer get a unique set of keys after the shuffle?
> I am using Hadoop 0.20.205.0, and below is the command that I am using to
> run Hadoop Streaming. Are there more options that I should specify for
> Hadoop Streaming to work properly if I am using a custom separator?
>
> hadoop jar
> $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-0.20.205.0.jar
> -D stream.mapred.output.field.separator=*
> -D mapred.reduce.tasks=2
> -mapper ./map.py
> -reducer ./reducer.py
> -file ./map.py
> -file ./reducer.py
> -input /user/inputdata
> -output /user/outputdata
> -verbose
>
>
> Any help is much appreciated,
> Thanks,
> Austin
>


Re: Setting eclipse for map reduce using maven

2012-02-28 Thread madhu phatak
Hi,
 You can find the Maven definitions for the Hadoop core jars here:
http://search.maven.org/#browse|-856937612

On Tue, Feb 21, 2012 at 10:48 PM, Mohit Anchlia wrote:

> I am trying to find the dependencies that would help me get started with
> developing MapReduce jobs in Eclipse, and I prefer to use Maven for this.
>
> Could someone point me in the right direction?
>



-- 
Join me at http://hadoopworkshop.eventbrite.com/


Re: ClassNotFoundException: -libjars not working?

2012-02-28 Thread madhu phatak
Hi,
 -libjars doesn't always work. A better way is to build a runnable jar with
all dependencies (if the number of dependencies is small), or to put the jars
into Hadoop's lib folder on all machines.
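
One thing worth checking, although there is no way to tell from this thread
alone whether it applies here: -libjars is one of the generic options handled
by GenericOptionsParser, so it only takes effect when the driver runs through
ToolRunner (or parses the generic options itself). A minimal driver sketch
with a made-up class name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ConvertorDriver extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    // getConf() already carries whatever -libjars / -D options
    // ToolRunner stripped off the command line.
    JobConf job = new JobConf(getConf(), ConvertorDriver.class);
    // ... configure input/output paths, mapper and reducer here ...
    JobClient.runJob(job);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner invokes GenericOptionsParser, which is what makes
    // "-libjars jar1,jar2,..." reach the submitted job.
    System.exit(ToolRunner.run(new Configuration(), new ConvertorDriver(), args));
  }
}

The generic options are then passed right after the main class and before any
application arguments, e.g.
hadoop jar my-app.jar com.example.ConvertorDriver -libjars /path/a.jar,/path/b.jar <app args>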

On Wed, Feb 22, 2012 at 8:13 PM, Ioan Eugen Stan wrote:

> Hello,
>
> I'm trying to run a map-reduce job and I get ClassNotFoundException, but I
> have the class submitted with -libjars. What's wrong with how I do things?
> Please help.
>
> I'm running hadoop-0.20.2-cdh3u1, and I have everything on the -libjars
> line. The job is submitted via a Java app like:
>
>  exec /usr/lib/jvm/java-6-sun/bin/**java -Dproc_jar -Xmx200m -server
> -Dhadoop.log.dir=/opt/ui/var/**log/mailsearch
> -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/**hadoop
> -Dhadoop.id.str=hbase -Dhadoop.root.logger=INFO,**console
> -Dhadoop.policy.file=hadoop-**policy.xml -classpath
> '/usr/lib/hadoop/conf:/usr/**lib/jvm/java-6-sun/lib/tools.**
> jar:/usr/lib/hadoop:/usr/lib/**hadoop/hadoop-core-0.20.2-**
> cdh3u1.jar:/usr/lib/hadoop/**lib/ant-contrib-1.0b3.jar:/**
> usr/lib/hadoop/lib/apache-**log4j-extras-1.1.jar:/usr/lib/**
> hadoop/lib/aspectjrt-1.6.5.**jar:/usr/lib/hadoop/lib/**
> aspectjtools-1.6.5.jar:/usr/**lib/hadoop/lib/commons-cli-1.**
> 2.jar:/usr/lib/hadoop/lib/**commons-codec-1.4.jar:/usr/**
> lib/hadoop/lib/commons-daemon-**1.0.1.jar:/usr/lib/hadoop/lib/**
> commons-el-1.0.jar:/usr/lib/**hadoop/lib/commons-httpclient-**
> 3.0.1.jar:/usr/lib/hadoop/lib/**commons-logging-1.0.4.jar:/**
> usr/lib/hadoop/lib/commons-**logging-api-1.0.4.jar:/usr/**
> lib/hadoop/lib/commons-net-1.**4.1.jar:/usr/lib/hadoop/lib/**
> core-3.1.1.jar:/usr/lib/**hadoop/lib/hadoop-**fairscheduler-0.20.2-cdh3u1.
> **jar:/usr/lib/hadoop/lib/**hsqldb-1.8.0.10.jar:/usr/lib/**
> hadoop/lib/jackson-core-asl-1.**5.2.jar:/usr/lib/hadoop/lib/**
> jackson-mapper-asl-1.5.2.jar:/**usr/lib/hadoop/lib/jasper-**
> compiler-5.5.12.jar:/usr/lib/**hadoop/lib/jasper-runtime-5.5.**
> 12.jar:/usr/lib/hadoop/lib
> /jcl-over-slf4j-1.6.1.jar:/**usr/lib/hadoop/lib/jets3t-0.6.**
> 1.jar:/usr/lib/hadoop/lib/**jetty-6.1.26.jar:/usr/lib/**
> hadoop/lib/jetty-servlet-**tester-6.1.26.jar:/usr/lib/**
> hadoop/lib/jetty-util-6.1.26.**jar:/usr/lib/hadoop/lib/jsch-**
> 0.1.42.jar:/usr/lib/hadoop/**lib/junit-4.5.jar:/usr/lib/**
> hadoop/lib/kfs-0.2.2.jar:/usr/**lib/hadoop/lib/log4j-1.2.15.**
> jar:/usr/lib/hadoop/lib/**mockito-all-1.8.2.jar:/usr/**
> lib/hadoop/lib/oro-2.0.8.jar:/**usr/lib/hadoop/lib/servlet-**
> api-2.5-20081211.jar:/usr/lib/**hadoop/lib/servlet-api-2.5-6.**
> 1.14.jar:/usr/lib/hadoop/lib/**slf4j-api-1.6.1.jar:/usr/lib/**
> hadoop/lib/slf4j-log4j12-1.6.**1.jar:/usr/lib/hadoop/lib/**
> xmlenc-0.52.jar:/usr/lib/**hadoop/lib/jsp-2.1/jsp-2.1.**
> jar:/usr/lib/hadoop/lib/jsp-2.**1/jsp-api-2.1.jar:/usr/share/**
> mailbox-convertor/lib/*:/usr/**lib/hadoop/contrib/capacity-**
> scheduler/hadoop-capacity-**scheduler-0.20.2-cdh3u1.jar:/**
> usr/lib/hbase/lib/hadoop-lzo-**0.4.13.jar:/usr/lib/hbase/**
> hbase.jar:/etc/hbase/conf:/**usr/lib/hbase/lib:/usr/lib/**
> zookeeper/zookeeper.jar:/usr/**lib/hadoop/contrib
> /capacity-scheduler/hadoop-**capacity-scheduler-0.20.2-**
> cdh3u1.jar:/usr/lib/hbase/lib/**hadoop-lzo-0.4.13.jar:/usr/**
> lib/hbase/hbase.jar:/etc/**hbase/conf:/usr/lib/hbase/lib:**
> /usr/lib/zookeeper/zookeeper.**jar' org.apache.hadoop.util.RunJar
> /usr/share/mailbox-convertor/**mailbox-convertor-0.1-**SNAPSHOT.jar
> -libjars=/usr/share/mailbox-**convertor/lib/antlr-2.7.7.jar,**
> /usr/share/mailbox-convertor/**lib/aopalliance-1.0.jar,/usr/**
> share/mailbox-convertor/lib/**asm-3.1.jar,/usr/share/**
> mailbox-convertor/lib/**backport-util-concurrent-3.1.**
> jar,/usr/share/mailbox-**convertor/lib/cglib-2.2.jar,/**
> usr/share/mailbox-convertor/**lib/hadoop-ant-3.0-u1.pom,/**
> usr/share/mailbox-convertor/**lib/speed4j-0.9.jar,/usr/**
> share/mailbox-convertor/lib/**jamm-0.2.2.jar,/usr/share/**
> mailbox-convertor/lib/uuid-3.**2.0.jar,/usr/share/mailbox-**
> convertor/lib/high-scale-lib-**1.1.1.jar,/usr/share/mailbox-**
> convertor/lib/jsr305-1.3.9.**jar,/usr/share/mailbox-**
> convertor/lib/guava-11.0.1.**jar,/usr/share/mailbox-**
> convertor/lib/protobuf-java-2.**4.0a.jar,/usr/share/mailbox-**
> convertor/lib/**concurrentlinkedhashmap-lru-1.**1.jar,/usr/share/mailbox-*
> *convertor/lib/json-simple-1.1.**jar,/usr/share/mailbox-**
> convertor/lib/itext-2.1.5.jar,**/usr/share/mailbox-convertor/**
> lib/jmxtools-1.2.1.jar,/usr/**share/mailbox-convertor/lib/**
> jersey-client-1.4.jar,/usr/**share/mailbox-converto
> r/lib/jersey-core-1.4.jar,/**usr/share/mailbox-convertor/**
> lib/jersey-json-1.4.jar,/usr/**share/mailbox-convertor/lib/**
> jersey-server-1.4.jar,/usr/**share/mailbox-convertor/lib/**
> jmxri-1.2.1.jar,/usr/share/**mailbox-convertor/lib/jaxb-**
> impl-2.1.12.jar,/usr/share/**mailbox-convertor/lib/xstream-**
> 1.2.2.jar,/usr/share/mailbox-**convertor/lib/commons-metrics-**
> 1.3.jar,/usr/share/mailbox-**convertor/lib/commons-**
> monitoring-2.9.1.jar,/us

Re: Handling bad records

2012-02-28 Thread Subir S
Can MultipleOutputs be used with Hadoop Streaming?

On Tue, Feb 28, 2012 at 2:07 PM, madhu phatak  wrote:

> Hi Mohit,
>  A and B refer to two different output files (multi-named outputs). The file
> names will be seq-A* and seq-B*. It's similar to the "r" in part-r-0
>
> On Tue, Feb 28, 2012 at 11:37 AM, Mohit Anchlia  >wrote:
>
> > Thanks, that's helpful. In that example, what are "A" and "B" referring
> > to? Are those the output file names?
> >
> > mos.getCollector("seq", "A", reporter).collect(key, new Text("Bye"));
> > mos.getCollector("seq", "B", reporter).collect(key, new Text("Chau"));
> >
> >
> > On Mon, Feb 27, 2012 at 9:53 PM, Harsh J  wrote:
> >
> > > Mohit,
> > >
> > > Use the MultipleOutputs API:
> > >
> > >
> >
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
> > > to have a named output of bad records. There is an example of use
> > > detailed on the link.
> > >
> > > On Tue, Feb 28, 2012 at 3:48 AM, Mohit Anchlia  >
> > > wrote:
> > > What's the best way to write records to a different file? I am doing
> > > XML processing, and during processing I might come across invalid XML.
> > > Currently I have it under a try/catch block and write the records to
> > > log4j, but I think it would be better to write them to an output file
> > > that just contains errors.
> > >
> > >
> > >
> > > --
> > > Harsh J
> > >
> >
>
>
>
> --
> Join me at http://hadoopworkshop.eventbrite.com/
>


Re: dfs.block.size

2012-02-28 Thread madhu phatak
You can use FileSystem.getFileStatus(Path p), which gives you the block size
recorded for a specific file.
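
A minimal sketch of that call (the path to check is taken from the command
line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Block size recorded in the namenode metadata for this file.
    FileStatus status = fs.getFileStatus(new Path(args[0]));
    System.out.println(status.getPath() + " block size = "
        + status.getBlockSize() + " bytes");
  }
}

Kai's fsck suggestion below shows the same information from the command line,
and, as Joey notes, dfs.block.size can also be overridden in the job
configuration before the files are written.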

On Tue, Feb 28, 2012 at 2:50 AM, Kai Voigt  wrote:

> "hadoop fsck  -blocks" is something that I think of quickly.
>
> http://hadoop.apache.org/common/docs/current/commands_manual.html#fsck has
> more details.
>
> Kai
>
> Am 28.02.2012 um 02:30 schrieb Mohit Anchlia:
>
> > How do I verify the block size of a given file? Is there a command?
> >
> > On Mon, Feb 27, 2012 at 7:59 AM, Joey Echeverria 
> wrote:
> >
> >> dfs.block.size can be set per job.
> >>
> >> mapred.tasktracker.map.tasks.maximum is per tasktracker.
> >>
> >> -Joey
> >>
> >> On Mon, Feb 27, 2012 at 10:19 AM, Mohit Anchlia  >
> >> wrote:
> >>> Can someone please suggest if parameters like dfs.block.size,
> >>> mapred.tasktracker.map.tasks.maximum are only cluster wide settings or
> >> can
> >>> these be set per client job configuration?
> >>>
> >>> On Sat, Feb 25, 2012 at 5:43 PM, Mohit Anchlia  >>> wrote:
> >>>
>  If I want to change the block size then can I use Configuration in
>  mapreduce job and set it when writing to the sequence file or does it
> >> need
>  to be cluster wide setting in .xml files?
> 
>  Also, is there a way to check the block of a given file?
> 
> >>
> >>
> >>
> >> --
> >> Joseph Echeverria
> >> Cloudera, Inc.
> >> 443.305.9434
> >>
>
> --
> Kai Voigt
> k...@123.org
>
>
>
>
>


-- 
Join me at http://hadoopworkshop.eventbrite.com/


Re: Handling bad records

2012-02-28 Thread madhu phatak
Hi Mohit,
 A and B refer to two different output files (multi-named outputs). The file
names will be seq-A* and seq-B*. It's similar to the "r" in part-r-0
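
A rough sketch of the pattern in the old mapred API, assuming the driver has
registered "seq" as a multi named output, e.g. with
MultipleOutputs.addMultiNamedOutput(conf, "seq", SequenceFileOutputFormat.class,
Text.class, Text.class); the key/value types here are illustrative:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class MultiOutputReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private MultipleOutputs mos;

  public void configure(JobConf conf) {
    mos = new MultipleOutputs(conf);
  }

  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // "seq" is the named output registered in the driver; "A" and "B" become
    // part of the output file names, as described above.
    mos.getCollector("seq", "A", reporter).collect(key, new Text("Bye"));
    mos.getCollector("seq", "B", reporter).collect(key, new Text("Chau"));
    output.collect(key, values.next());   // the normal part-* output
  }

  public void close() throws IOException {
    mos.close();   // flush and close the extra outputs
  }
}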

On Tue, Feb 28, 2012 at 11:37 AM, Mohit Anchlia wrote:

> Thanks, that's helpful. In that example, what are "A" and "B" referring to?
> Are those the output file names?
>
> mos.getCollector("seq", "A", reporter).collect(key, new Text("Bye"));
> mos.getCollector("seq", "B", reporter).collect(key, new Text("Chau"));
>
>
> On Mon, Feb 27, 2012 at 9:53 PM, Harsh J  wrote:
>
> > Mohit,
> >
> > Use the MultipleOutputs API:
> >
> >
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
> > to have a named output of bad records. There is an example of use
> > detailed on the link.
> >
> > On Tue, Feb 28, 2012 at 3:48 AM, Mohit Anchlia 
> > wrote:
> > > What's the best way to write records to a different file? I am doing
> > > XML processing, and during processing I might come across invalid XML.
> > > Currently I have it under a try/catch block and write the records to
> > > log4j, but I think it would be better to write them to an output file
> > > that just contains errors.
> >
> >
> >
> > --
> > Harsh J
> >
>



-- 
Join me at http://hadoopworkshop.eventbrite.com/