UserGroupInformation.getLoginUser: failure to login.
Hi hadoop. I recently ran a spark job that uses the hadoop.security libraries for login (the spark context does this)... It threw an exception:

java.io.IOException: failure to login
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:700)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
    at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2181)
    at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2181)
    at scala.Option.getOrElse(Option.scala:120)

And the root exception was:

Caused by: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name

This is running in a docker container. Is there anything in particular I need to do to run such containers (e.g. is there a privileged requirement for UserGroupInformation or anything like that?)...

-- jay vyas
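One common cause of this particular LoginException, offered as an assumption to check rather than a confirmed diagnosis for this thread: hadoop's Unix login path asks the OS for the name of the current uid, and a docker container started with a uid that has no /etc/passwd entry gets nothing back, which surfaces as "invalid null input: name". The same lookup can be sketched in plain Python (illustrative only, not Hadoop code):

```python
import os
import pwd

def current_os_user():
    """Return the login name for the current uid, or None when the uid
    has no passwd entry - the condition that makes the Unix login module
    fail with "invalid null input: name"."""
    try:
        return pwd.getpwuid(os.getuid()).pw_name
    except KeyError:
        return None
```

If the equivalent lookup fails inside your container, start the container with a uid that exists in the image's /etc/passwd (for example via `docker run -u` with a known user), or add a matching passwd entry at image build time; a privileged container is not required for login itself.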
Re: Use of hadoop in AWS - Build it from scratch on an EC2 instance / MapR hadoop distribution / Amazon hadoop distribution
Also, ASF BigTop packages hadoop for you. You can always grab our releases: http://www.apache.org/dist/bigtop/bigtop-1.0.0/repos/ We package pig, spark, hive, hbase, and more. It's not hard to set up a bigtop build server, as we have dockerized the packaging of both RPM and Deb packages, and you can experiment locally with this stuff using the vagrant recipes.

On Mon, Oct 19, 2015 at 6:26 AM, Jonathan Aquilina <jaquil...@eagleeyet.net> wrote:
> Hey Jose
>
> Have you looked at Amazon EMR (Elastic MapReduce)? Where I work we have used it, and when you provision the EMR instance you can use custom jars like the one you mentioned.
>
> In terms of storage you can use hdfs if you are going to keep a persistent cluster. If not, you can store your data in an Amazon S3 bucket.
>
> Documentation for EMR is really good. At the time when we did this (the beginning of this year), they supported Hadoop 2.6.
>
> In my honest opinion you are giving yourself a lot of extra work for nothing just to get going in Hadoop. Try out EMR with a temporary cluster and go from there. I managed to tool up and learn how to work with EMR in a week.
>
> Sent from my iPhone
>
> On 19 Oct 2015, at 02:10, José Luis Larroque <larroques...@gmail.com> wrote:
>
> Thanks for your answer Anders.
>
> - The amount of data that I'm going to manipulate is about the size of Wikipedia (I will use a dump).
> - I already have the basics of hadoop (I hope); I have a local multinode cluster set up and I have already executed some algorithms.
> - Because the amount of data is significant, I believe that I should use several nodes.
>
> Maybe another option to consider is that I'm running Giraph on top of the selected hadoop distribution/EC2.
>
> Bye!
> Jose > > 2015-10-18 18:53 GMT-03:00 Anders Nielsen <anders.shinde.niel...@gmail.com > >: > >> Dear Jose, >> >> It will help people answer your question if you specify your goals : >> >> -If you do it to learn how to USE a running Hadoop then go for one of the >> prebuilt distributions (Amazon or MapR) >> -If you do it to learn more about the setting up and administrating >> Hadoop then you are better off setting everything up from scratch on EC2. >> -Do you need to run on many nodes or just a 1 node to test some Mapreduce >> scripts on a small data set? >> >> Regards, >> >> Anders >> >> >> >> >> On Sun, Oct 18, 2015 at 10:03 PM, José Luis Larroque < >> larroques...@gmail.com> wrote: >> >>> Hi all ! >>> >>> I started to use hadoop with aws, and a big question appears in front of >>> me! >>> >>> I'm using a MapR distribution, for hadoop 2.4.0 in AWS. I already tried >>> some trivial examples, and before moving forward i have one question. >>> >>> What is the better option for using Hadoop on AWS? >>> - Build it from scratch on a EC2 instance >>> - Use MapR distribution of Hadoop >>> - Use Amazon distribution of Hadoop >>> >>> Sorry if my question is too broad. >>> >>> Bye! >>> Jose >>> >>> >>> >>> >>> >> > -- jay vyas
Re: spark
For a start, compare spark's word count with mapreduce's word count. Then compare spark sql with hive. If you get that far, for the final exercise, find out for yourself by running bigpetstore-mapreduce and bigpetstore-spark side by side :). They are two similar applications, curated in Apache bigtop, which generate data sets and process them for ETL and product recommendations.

On Aug 17, 2015, at 6:33 PM, Publius t...@yahoo.com wrote:

Hello, what is the difference between Hadoop and Spark? How is Spark better?
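To make the word count comparison concrete without a cluster, here is the same computation in plain Python in both shapes. This is a conceptual sketch only, not the Spark or MapReduce APIs: the first version spells out the map / shuffle / reduce phases, the second reads as one chained pipeline, which is the main ergonomic difference the exercise above will show you.

```python
from collections import defaultdict
from itertools import groupby

lines = ["the quick fox", "the lazy dog"]

# MapReduce shape: map emits (word, 1) pairs; the framework sorts and
# shuffles by key; reduce sums the values for each key.
mapped = [(w, 1) for line in lines for w in line.split()]
mr_counts = {
    word: sum(v for _, v in pairs)
    for word, pairs in groupby(sorted(mapped), key=lambda kv: kv[0])
}

# Spark shape: the same logic as one chained pipeline
# (flatMap -> map -> reduceByKey), approximated here with a dict fold.
spark_counts = defaultdict(int)
for word in (w for line in lines for w in line.split()):
    spark_counts[word] += 1
```

Both produce identical counts; the difference that matters at scale is execution (materialized intermediate shuffle files versus a lazily evaluated, largely in-memory DAG).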
Re: How to test DFS?
You could just list the file contents in your hadoop data/ directories on the individual nodes... somewhere in there the file blocks will be floating around.

On Tue, May 26, 2015 at 4:59 PM, Caesar Samsi caesarsa...@mac.com wrote:

Hello, how would I go about confirming that a file has been distributed successfully to all datanodes? I would like to demonstrate this capability in a short briefing for my colleagues. Can I access the file from the datanode itself (to date I can only access the files from the master node, not the slaves)? Thank you, Caesar.

-- jay vyas
Re: What skills to Learn to become Hadoop Admin
Setting up vendor distros is a great first step.

1) Running TeraSort and benchmarking is a good step. You can also run larger, full-stack hadoop applications like bigpetstore, which we curate here: https://github.com/apache/bigtop/tree/master/bigtop-bigpetstore/.

2) Write some mapreduce or spark jobs which write data to a persistent transactional store, such as SOLR or HBase. This is a hugely important part of real world hadoop administration, where you will encounter problems like running out of memory, possibly CPU oversubscription on some nodes, and so on.

3) Now, did you want to go deeper into the build/setup/deployment of hadoop? It's worth trying to build/deploy/debug hadoop ecosystem components from scratch by setting up Apache BigTop, which packages RPM/DEB artifacts and provides puppet recipes for distributions. It's the original root of both the cloudera and hortonworks distributions, so you will learn something about both by playing with it. We have some exercises you can use to guide you and get started: https://cwiki.apache.org/confluence/display/BIGTOP/BigTop+U%3A+Exersizes . Feel free to join the mailing list for questions.

On Sat, Mar 7, 2015 at 9:32 AM, max scalf oracle.bl...@gmail.com wrote:

Krish, I don't mean to hijack your mail here, but I wanted to find out how/what you did for the below portion, as I am trying to go down your path as well. I was able to get a 4-5 node cluster going using ambari and cdh, and now I want to take it to the next level. What have you done for the below? "I have done a web log integration using flume and twitter sentiment analysis."

On Sat, Mar 7, 2015 at 12:11 AM, Krish Donald gotomyp...@gmail.com wrote:

Hi, I would like to enter the Big Data world as a Hadoop Admin, and I have set up a 7 node cluster using Ambari, Cloudera Manager and Apache Hadoop. I have installed services like hive, oozie, zookeeper etc. I have done a web log integration using flume and twitter sentiment analysis.
I wanted to understand: what other skills should I learn? Thanks, Krish

-- jay vyas
Re: Interview Questions asked for Hadoop Admin
Hi Krish. I'm going to interpret this as "What is a real world hadoop project workload I can run to study for my upcoming job interview?" :) ...

You could look here: https://github.com/apache/bigtop/tree/master/bigtop-bigpetstore/bigpetstore-mapreduce If you understand that application, you will do just fine :). We use custom input formats to generate arbitrarily large data sets, pig processing, and mahout's recommender, all in the bigpetstore-mapreduce implementation. Also, it's all unit tested (the jobs themselves), so you can run and inspect changes locally and get a feel for maintaining a real world hadoop app. Running it and modifying the data generation and other phases will be a great form of preparation for you, and you can run it all by spinning up VMs in apache bigtop.

On Thu, Feb 12, 2015 at 1:03 PM, Krish Donald gotomyp...@gmail.com wrote:

Hi, does anybody have interview questions that were asked during their interview for a Hadoop admin role? I found a few on the internet, but if somebody who has attended an interview can give us an idea, that would be great. Thanks, Krish

-- jay vyas
Re: Home for Apache Big Data Solutions?
Bigtop.. yup! Mr Asanjar: why don't you post an email about what you're doing on the Apache bigtop list? We'd love to hear from you. There could possibly be some overlap, and our goal is to plumb the hadoop ecosystem as well.

On Feb 9, 2015, at 4:41 PM, Artem Ervits artemerv...@gmail.com wrote:

I believe Apache Bigtop is what you're looking for. Artem Ervits

On Feb 9, 2015 8:15 AM, Jean-Baptiste Onofré j...@nanthrax.net wrote:

Hi Amir, thanks for the update. Please let me know if you need some help on the proposal and to qualify your ideas. Regards, JB

On 02/09/2015 02:05 PM, MrAsanjar . wrote:

Hi Chris, thanks for the information, will get on it ...

Hi JB, glad that you are familiar with Juju. However, my personal goal is not to promote any tool but to take the next step, which is to build a community for apache big data solutions.

"do you already have a kind of proposal/description of your projects?" Working on it :) I got the idea while flying back from South Africa on Saturday. During my trip I noticed most of the communities spending their precious resources on solution plumbing, without much emphasis on solution best practices, due to the lack of expertise. By the time a Big Data solution framework becomes operational, funding has diminished enough to limit solution activity (i.e. data analytic payload development). I am sure we could find similar scenarios with other institutions and SMBs (small and medium-size businesses) anywhere. In a nutshell, my goals are as follows:
1) Make Big Data solutions available to everyone
2) Encapsulate the best practices
3) All orchestration tools are welcomed - some solutions could have a hybrid tooling model
4) Enforce automated testing and quality control
5) Share analytic payloads (i.e. mapreduce apps, storm topologies, Pig scripts, ...)

"Is it like distribution, or tooling?" Good question. I envision having a distribution model, as it has dependencies on Apache hadoop project distributions.

"What's the current license?"
Charms/Bundles are moving to the Apache 2.0 license, target date 2/27.

Regards, Amir Sanjar, Big Data Solution Lead, Canonical

On Sun, Feb 8, 2015 at 10:46 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

Dear Amir, thank you for your interest in contributing these projects to the ASF! Sincerely appreciate it. My suggestion would be to look into the Apache Incubator, which is the home for incoming projects at the ASF. The TL;DR answer is:
1. You'll need to create a proposal for each project that you would like to bring in, using: http://incubator.apache.org/guides/proposal.html
2. You should put your proposal up on a public wiki for each project: http://wiki.apache.org/incubator/ Create a new page, e.g. YourProjectProposal, which would in turn become http://wiki.apache.org/incubator/YouProjectProposal (you will need to request permissions to add the page on the wiki).
3. Recruit at least 3 IPMC/ASF members to mentor your project: http://people.apache.org/committers-by-project.html#incubator-pmc http://people.apache.org/committers-by-project.html#member
4. Submit your proposal for consideration at the Incubator
5. Enjoy!

Cheers and good luck. Cheers, Chris

Chris Mattmann, Ph.D.
Chief Architect, Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory, Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA

-----Original Message-----
From: MrAsanjar .
afsan...@gmail.com
Reply-To: user@hadoop.apache.org
Date: Sunday, February 8, 2015 at 8:36 AM
To: user@hadoop.apache.org, dev-i...@bigtop.apache.org
Subject: Home for Apache Big Data Solutions?

Hi all, my name is Amir Sanjar, Big Data Solution Development Lead at Canonical. My team has been developing various Big Data solutions built on top of Apache Hadoop projects (i.e. Hadoop, Hive, Pig, ..). We would like to contribute these pure open source solutions
Re: Any working VM of Apache Hadoop ?
Also, BigTop has a very flexible vagrant infrastructure: https://github.com/apache/bigtop/tree/master/bigtop-deploy/vm/vagrant-puppet

On Jan 18, 2015, at 3:37 PM, Andre Kelpe ake...@concurrentinc.com wrote:

Try our vagrant setup: https://github.com/Cascading/vagrant-cascading-hadoop-cluster - André

On Sat, Jan 17, 2015 at 10:07 PM, Krish Donald gotomyp...@gmail.com wrote:

Hi, I am looking for a working VM of Apache Hadoop, not the cloudera or Hortonworks VMs. If anybody has one and can share it, that would be great. Thanks, Krish

-- André Kelpe an...@concurrentinc.com http://concurrentinc.com
Re: HDFS-based database for Big and Small data?
1) Phoenix can be used on top of hbase for richer querying semantics. That combo might be good for complex workloads.

2) SolrCloud also might fit the bill here. Solr can be backed by any Hadoop compatible FS, including HDFS; it's resilient by that mechanism, and it offers sophisticated indexing and searching options, although the querying is limited...

On Jan 3, 2015, at 9:39 AM, Wilm Schumacher wilm.schumac...@gmail.com wrote:

On 03.01.2015 at 08:44, Alec Taylor wrote: "Want to replace MongoDB with an HDFS-based database in my architecture. Note that this is a new system, not a rewrite of an old one. Are there any open-source fast read/write databases built on HDFS"

yeah. As Ted wrote: hbase.

"with a model similar to a document-store" - well, then PERHAPS hbase isn't the right choice. What exactly do you need from the definition of a doc-store? If you e.g. rely highly on ad hoc queries or secondary indexes, then perhaps hbase could lead to some additional work for you.

"that can hold my regular business logic and enables an object model in Python? (E.g.: via Data Mapper or Active Record patterns)" - in addition to Ted's link, you could also use thrift, if this is enough control for you. Depends on your requirements.

Best wishes, Wilm
Re: New to this group.
Many demos out there are for the business community... For a demonstration of hadoop at a finer grained level (how it's deployed, packaged, installed and used), for a developer who wants to learn hadoop the hard way, I'd suggest:

1 - Getting Apache bigtop stood up on VMs, and
2 - Running the BigPetStore application, which is meant to demonstrate end to end building, testing and deployment of a hadoop batch analytics system with mapreduce, pig, and mahout.

This will also expose you to puppet, gradle, and vagrant, all in a big data app which solves real world problems like jar dependencies and multiple ecosystem components. Since BPS generates its own data, you don't waste time worrying about external data sets, Twitter credentials, etc., and can test both on your laptop and on a 100 node cluster (similar to teragen, but for the whole ecosystem). Since it features integration tests and is tested on Bigtop's hadoop distribution (which is 100% pure Apache based), it's IMO the purest learning source, not blurred with company specific downloads or branding.

Disclaimer: of course I'm biased, as I work on it... :) But we've been working hard to make bigtop easily consumable as a gateway drug to bigdata processing, and if you have a solid linux and Java background, I'm sure others would agree it's a great place to get immersed in the hadoop ecosystem.

On Jan 2, 2015, at 1:05 PM, Krish Donald gotomyp...@gmail.com wrote:

I would like to work on some case studies, like the ones I have seen on Hortonworks (twitter sentiment analysis, web log analysis etc.). If somebody can suggest other case studies which can be worked on and put on a resume later, that would be great, as I don't have real project experience.

On Fri, Jan 2, 2015 at 10:33 AM, Ted Yu yuzhih...@gmail.com wrote:

You can search for open JIRAs which are related to admin.
Here is an example query: https://issues.apache.org/jira/browse/HADOOP-9642?jql=project%20%3D%20HADOOP%20AND%20status%20%3D%20Open%20AND%20text%20~%20%22admin%22 FYI

On Fri, Jan 2, 2015 at 10:24 AM, Krish Donald gotomyp...@gmail.com wrote:

I have a fair understanding of the hadoop ecosystem... I have set up a multinode cluster using VMs on my personal laptop for Hadoop 2.0. But beyond that, I would like to work on some project to get a good hold on the subject. I basically would like to go into the Hadoop Administration side, as my background is as an RDBMS database administrator.

On Fri, Jan 2, 2015 at 10:11 AM, Wilm Schumacher wilm.schumac...@gmail.com wrote:

Hi, the standard books may be a good start. I liked the following:
the definitive guide: http://www.amazon.de/Hadoop-Definitive-Guide-Tom-White/dp/1449311520
hadoop in action: http://www.manning.com/lam2/
hadoop in practice: http://www.manning.com/holmes2/
A list is here: http://wiki.apache.org/hadoop/Books
Hope this helps. Best wishes, Wilm

On 02.01.2015 at 19:02, Krish Donald wrote:

Hi, I am new to this group and hadoop. Please help me learn hadoop and suggest some self study projects. Thanks, Krish Donald
Re: hadoop / hive / pig setup directions
Hi Bhupendra,

The Apache BigTop project was born to solve the general problem of dealing with, and verifying the functionality of, the various components in the hadoop ecosystem. It also creates RPM and APT repos for installing hadoop, and puppet recipes for initializing the file system and installing components in a clear and dependency aware manner. And we have smoke tests to validate that hive, pig, and so on are all working. You should definitely consider checking it out if you're building a hadoop environment or big data stack.

The best way to get started is with the vagrant recipes, which spin up a cluster from scratch for you. Once that works, you can take the puppet code and run it on bare metal. One advantage of this approach is that you are using bits that the community tests for you, and you will avoid reinventing the wheel of writing a bunch of shell scripts for things like synchronizing config files, yum installing components across a cluster, and smoke tests.

On Dec 16, 2014, at 9:05 AM, GUPTA bhupendra1...@gmail.com wrote:

Hello all, firstly, I am a neophyte in the world of Hadoop.. So far, I have got a hadoop single node cluster running on Ubuntu; the end state of this was that the datanode and namenode servers were running. But from here, I am not sure how to proceed, in the sense: how do I get the other pieces of the hadoop ecosystem, like Hive, Pig, Hbase and maybe Ambari as well, installed, set up and running? I would appreciate access to materials which say "these are MUST HAVEs for any hadoop project". Just trying to get all the pieces together...

Regards, Bhupendra
Re: Hadoop Learning Environment
Hi Tim. I'd suggest using apache bigtop for this. BigTop integrates the hadoop ecosystem into a single upstream distribution, packages everything, and curates smoke tests and vagrant/docker recipes for deployment. Also, we curate a blueprint hadoop application (bigpetstore) which you can build yourself, easily, and run to generate, process, and visualize data across the bigdata ecosystem. You can also easily deploy bigtop onto EC2 if you want to pay for it.

On Tue, Nov 4, 2014 at 2:28 PM, Tim Dunphy bluethu...@gmail.com wrote:

Hey all, I want to set up an environment where I can teach myself hadoop. Usually the way I'll handle this is to grab a machine off the Amazon free tier and set up whatever software I want. However, I realize that Hadoop is a memory intensive, big data solution. So what I'm wondering is: would a t2.micro instance be sufficient for setting up a cluster of hadoop nodes with the intention of learning it? To keep things running longer in the free tier, I would either set up however many nodes I want and keep them stopped when I'm not actively using them, or just set up a few nodes with a few different accounts (with a different gmail address for each one.. easy enough to do). Failing that, what are some other free/cheap solutions for setting up a hadoop learning environment?

Thanks, Tim

-- GPG me!! gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B

-- jay vyas
Re: Hadoop Learning Environment
Hi daemeon: Actually, for most folks who would want to actually use a hadoop cluster, I would think setting up bigtop is super easy! If you have issues with it, ping me and I can help you get started. Also, we have docker containers - so you don't even *need* a VM to run a 4 or 5 node hadoop cluster.

install vagrant
install VirtualBox
git clone https://github.com/apache/bigtop
cd bigtop/bigtop-deploy/vm/vagrant-puppet
vagrant up

Then vagrant destroy when you're done. This to me is easier than manually downloading an appliance, picking memory, starting the virtualbox gui, loading the appliance, etc... and also it's easy to turn the simple single node bigtop VM into a multinode one, by just modifying the Vagrantfile.

On Tue, Nov 4, 2014 at 5:32 PM, daemeon reiydelle daeme...@gmail.com wrote:

What you want as a sandbox depends on what you are trying to learn. If you are trying to learn to code in e.g. PigLatin, Sqoop, or similar, all of the suggestions (perhaps excluding BigTop due to its setup complexities) are great. Laptop? Perhaps, but laptops are really kind of infuriatingly slow (because of the hardware - you pay a price for a 30-45 watt average heating bill). A laptop is an OK place to start if it is e.g. an i5 or i7 with lots of memory. What do you think of the thought that you will pretty quickly graduate to wanting a small'ish desktop for your sandbox? A simple, single node Hadoop instance will let you learn many things. The next level of complexity comes when you are attempting to deal with data whose processing needs to be split up, so you can learn about how to split data in mapping, reduce the splits via reduce jobs, etc. For that, you could get a windows desktop box or e.g. RedHat/CentOS and use virtualization. Something like a 4 core i5 with 32gb of memory, running 3, or for some things 4, VMs. You could load e.g. hortonworks into each of the VMs and practice setting up a 3/4 way cluster.
Throw in 2-3 1tb drives off of eBay and you can have a lot of learning. *...“The race is not to the swift,nor the battle to the strong,but to those who can see it coming and jump aside.” - Hunter ThompsonDaemeon* On Tue, Nov 4, 2014 at 1:24 PM, oscar sumano osum...@gmail.com wrote: you can try the pivotal vm as well. http://pivotalhd.docs.pivotal.io/tutorial/getting-started/pivotalhd-vm.html On Tue, Nov 4, 2014 at 3:13 PM, Leonid Fedotov lfedo...@hortonworks.com wrote: Tim, download Sandbox from http://hortonworks/com You will have everything needed in a small VM instance which will run on your home desktop. *Thank you!* *Sincerely,* *Leonid Fedotov* Systems Architect - Professional Services lfedo...@hortonworks.com office: +1 855 846 7866 ext 292 mobile: +1 650 430 1673 On Tue, Nov 4, 2014 at 11:28 AM, Tim Dunphy bluethu...@gmail.com wrote: Hey all, I want to setup an environment where I can teach myself hadoop. Usually the way I'll handle this is to grab a machine off the Amazon free tier and setup whatever software I want. However I realize that Hadoop is a memory intensive, big data solution. So what I'm wondering is, would a t2.micro instance be sufficient for setting up a cluster of hadoop nodes with the intention of learning it? To keep things running longer in the free tier I would either setup however many nodes as I want and keep them stopped when I'm not actively using them. Or just setup a few nodes with a few different accounts (with a different gmail address for each one.. easy enough to do). Failing that, what are some other free/cheap solutions for setting up a hadoop learning environment? Thanks, Tim -- GPG me!! gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. 
-- jay vyas
Re: TestDFSIO with FS other than defaultFS
Hi jeff. Wrong fs means that your configuration doesn't know how to bind ofs to the OrangeFS file system class. You can debug the configuration using fs.dumpConfiguration(), and you will likely see references to hdfs in there. By the way, have you tried our bigtop hcfs tests yet? We now support over 100 Hadoop file system compatibility tests... You can see a good sample of what parameters should be set for a hcfs implementation here: https://github.com/gluster/glusterfs-hadoop/blob/master/conf/core-site.xml On Oct 2, 2014, at 12:42 PM, Jeffrey Denton den...@clemson.edu wrote: Hello all, I'm trying to run TestDFSIO using a different file system other than the configured defaultFS and it doesn't work for me: $ hadoop org.apache.hadoop.fs.TestDFSIO -Dtest.build.data=ofs://test/user/$USER/TestDFSIO -write -nrFiles 1 -fileSize 10240 14/10/02 11:24:19 INFO fs.TestDFSIO: TestDFSIO.1.7 14/10/02 11:24:19 INFO fs.TestDFSIO: nrFiles = 1 14/10/02 11:24:19 INFO fs.TestDFSIO: nrBytes (MB) = 10240.0 14/10/02 11:24:19 INFO fs.TestDFSIO: bufferSize = 100 14/10/02 11:24:19 INFO fs.TestDFSIO: baseDir = ofs://test/user/denton/TestDFSIO 14/10/02 11:24:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/10/02 11:24:20 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. 
14/10/02 11:24:20 INFO fs.TestDFSIO: creating control file: 10737418240 bytes, 1 files java.lang.IllegalArgumentException: Wrong FS: ofs://test/user/denton/TestDFSIO/io_control, expected: hdfs://dsci at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643) at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:191) at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:102) at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:595) at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:591) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:591) at org.apache.hadoop.fs.TestDFSIO.createControlFile(TestDFSIO.java:290) at org.apache.hadoop.fs.TestDFSIO.run(TestDFSIO.java:751) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) at org.apache.hadoop.fs.TestDFSIO.main(TestDFSIO.java:650) At Clemson University, we're running HDP-2.1 (Hadoop 2.4.0.2.1) on 16 data nodes and 3 separate master nodes for the resource manager and two namenodes; however, for this test, the data nodes are really being used to run the map tasks with job output being written to 16 separate OrangeFS servers. Ideally, we would like the 16 HDFS data nodes and two namenodes to be the defaultFS, but would also like the capability to run jobs using other OrangeFS installations. The above error does not occur when OrangeFS is configured to be the defaultFS. Also, we have no problems running teragen/terasort/teravalidate when OrangeFS IS NOT the defaultFS. So, is it possible to run TestDFSIO using a FS other than the defaultFS? If you're interested in the OrangeFS classes, they can be found here: I have not yet run any of the FS tests released with 2.5.1 but hope to soon. 
Regards, Jeff Denton OrangeFS Developer Clemson University den...@clemson.edu
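For reference, the scheme-to-class binding jay describes in the reply above looks roughly like this in core-site.xml. The property name follows Hadoop's fs.&lt;scheme&gt;.impl convention; the class name below is a placeholder, not taken from this thread, so substitute the actual OrangeFS implementation class from your jar:

```xml
<!-- core-site.xml: bind the ofs:// scheme to a FileSystem class so that
     URIs like ofs://test/... resolve instead of falling through to the
     defaultFS. Class name is illustrative only. -->
<property>
  <name>fs.ofs.impl</name>
  <value>org.orangefs.hadoop.fs.ofs.OrangeFileSystem</value>
</property>
```

With the binding in place, overriding the default per invocation (e.g. passing -Dfs.defaultFS=ofs://... on the TestDFSIO command line) is worth trying, since TestDFSIO creates its control files through the default FileSystem rather than through test.build.data alone.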
Re:
See https://wiki.apache.org/hadoop/HCFS/

YES, YARN is written to the FileSystem interface. It works on S3FileSystem and GlusterFileSystem and any other HCFS. We have run, and continue to run, the many tests in apache bigtop's test suite against our hadoop clusters running on alternative file system implementations, and it works.

When you say "HDFS does not support fs.AbstractFileSystem.s3.impl": that is true. If your file system is configured using HDFS, then s3 urls will not be used, ever. When you create a FileSystem object in hadoop, it reads the uri (i.e. glusterfs:///) and then finds the file system binding in your core-site.xml (i.e. fs.AbstractFileSystem.glusterfs.impl). So the URI must have a corresponding entry in the core-site.xml. As a reference implementation, you can see https://github.com/gluster/glusterfs-hadoop/blob/master/conf/core-site.xml

On Fri, Sep 26, 2014 at 10:10 AM, Naganarasimha G R (Naga) garlanaganarasi...@huawei.com wrote:

Hi All, I have the following doubts on pluggable FileSystems and YARN:
1. If all the implementations should extend FileSystem, then why is there a parallel class AbstractFileSystem, which ViewFS extends?
2. Is YARN supposed to run on any pluggable org.apache.hadoop.fs.FileSystem, like s3? If it is, then when submitting a job, on the client side YARNRunner calls FileContext.getFileContext(this.conf), which in turn calls FileContext.getAbstractFileSystem(), which throws an exception for S3. So I am not able to run a YARN job with ViewFS with S3 as a mount. And based on the code, even if I configure only S3, it is also going to fail.
3. HDFS does not support fs.AbstractFileSystem.s3.impl with some default class similar to org.apache.hadoop.fs.s3.S3FileSystem?

Regards, Naga
Huawei Technologies Co., Ltd. Mobile: +91 9980040283 Email: naganarasimh...@huawei.com http://www.huawei.com

-- jay vyas
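As a sketch of the double binding described above: a scheme needs one entry for the classic FileSystem API and a second for the newer AbstractFileSystem/FileContext API (the one YARN's client path uses). The class names below are illustrative; the linked glusterfs-hadoop core-site.xml has the authoritative values:

```xml
<!-- core-site.xml: one binding per API. Without the AbstractFileSystem
     entry, FileContext.getFileContext() cannot resolve the scheme even
     though FileSystem.get() can. Class names are illustrative. -->
<property>
  <name>fs.glusterfs.impl</name>
  <value>org.apache.hadoop.fs.glusterfs.GlusterFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.glusterfs.impl</name>
  <value>org.apache.hadoop.fs.local.GlusterFs</value>
</property>
```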
Re: To Generate Test Data in HDFS (PDGF)
While on the subject, you can also use the bigpetstore application to do this, in apache bigtop. This data is well suited to hbase (semi structured, transactional, and featuring some global patterns which can make for meaningful queries and so on).

Clone apache/bigtop
cd bigtop-bigpetstore
gradle clean package # build the jar

Then follow the instructions in the README to generate as many records as you want in a distributed context. Each record is around 80 bytes, so roughly 5 x 10^10 records should be on the scale you are looking for.

On Sep 22, 2014, at 5:14 AM, arthur.hk.c...@gmail.com wrote:

Hi, I need to generate a large amount of test data (4TB) in Hadoop. Has anyone used PDGF to do so? Could you share your cookbook for PDGF in Hadoop (or HBase)? Many thanks, Arthur
Re: how to setup Kerberized Hadoop?
Once you read the docs and get a base understanding... here is a recipe you can try for a maintainable, easy to manage setup:

- Puppet-IPA (a puppet recipe for FreeIPA, for setting up kerberos realms and users)
- then layer in apache bigtop's puppet hadoop modules (for installation and setup of the hadoop cluster)
- then do the glue necessary to kerberize the existing, running hadoop services (FreeIPA will set up the kerberos realm for you, add users, and so on - all you have to do is add the kerberos security info into the core-site.xml)

On Mon, Sep 15, 2014 at 3:52 PM, Shahab Yunus shahab.yu...@gmail.com wrote:

Hi, have you already looked at the existing documentation?
For apache: http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-common/SecureMode.html
For cloudera: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.6.0/CDH4-Security-Guide/cdh4sg_topic_3.html
Some random blogs: http://blog.godatadriven.com/kerberos-cloudera-setup.html
Regards, Shahab

On Mon, Sep 15, 2014 at 3:47 PM, Xiaohua Chen xiaohua.c...@gmail.com wrote:

Hi experts: I am new to Hadoop. We want to set up a Kerberized hadoop for testing. Can you share any guidelines or instructions on how to set up a Kerberized hadoop env? Thanks. Sophia

-- jay vyas
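The "kerberos security info" mentioned in the recipe above boils down to properties like these in core-site.xml (these two names are the standard Hadoop ones; each daemon additionally needs its own principal and keytab settings in hdfs-site.xml, yarn-site.xml, etc.):

```xml
<!-- core-site.xml: switch the cluster from simple auth to Kerberos. -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```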
Re: Tez and MapReduce
Yes: as an example of running a MapReduce job followed by a Tez job, you can see our last post on this: https://blogs.apache.org/bigtop/entry/testing_apache_tez_with_apache . You can see in the bigtop/tez testing blogpost that you can easily confirm on the web UI that Tez is being used. From TezClient.java: /** * TezClient is used to submit Tez DAGs for execution. DAGs are executed via a * Tez App Master. TezClient can run the App Master in session or non-session * mode. * In non-session mode, each DAG is executed in a different App Master that * exits after the DAG execution completes. * In session mode, the TezClient creates a single instance of the App Master * and all DAGs are submitted to the same App Master. * Session mode may give better performance when a series of DAGs need to be * executed, because it enables resource re-use across those DAGs. Non-session * mode should be used when the user wants to submit a single DAG or wants to * disconnect from the cluster after submitting a set of unrelated DAGs. * If API recommendations are followed, then the choice of running in session or * non-session mode is transparent to writing the application. By changing the * session mode configuration, the same application can run in session or * non-session mode. */ On Mon, Sep 1, 2014 at 12:43 PM, Alexander Pivovarov apivova...@gmail.com wrote: e.g. in hive, to switch engines: set hive.execution.engine=mr; or set hive.execution.engine=tez; tez is faster, especially on complex queries. On Aug 31, 2014 10:33 PM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: Can Tez and MapReduce live together and get along in the same cluster? B. -- jay vyas
Re: hadoop/yarn and task parallelization on non-hdfs filesystems
Your FileSystem implementation should provide specific tuning parameters for IO. For example, in the GlusterFileSystem, we have a buffer parameter that is typically embedded into core-site.xml: https://github.com/gluster/glusterfs-hadoop/blob/master/src/main/java/org/apache/hadoop/fs/glusterfs/GlusterVolume.java Similarly, in HDFS, there are tuning parameters that would go in hdfs-site.xml. IIRC from your stackoverflow question, the Hadoop Compatible FileSystem you are using is backed by a company of some sort, so you should contact the engineers working on the implementation about how to tune the underlying FS. Regarding mapreduce and yarn: task optimization at that level is independent of the underlying file system. There are some parameters you can specify with your job, like setting the minimum number of tasks, which can increase/decrease the total number of tasks. From some experience tuning web crawlers with this stuff, I can say that a high number will increase parallelism but might decrease availability of your cluster (and locality of individual jobs). A high number of tasks generally works well when doing something CPU- or network-intensive. On Fri, Aug 15, 2014 at 11:22 AM, java8964 java8...@hotmail.com wrote: I believe Calvin mentioned before that this parallel file system is mounted into the local file system. In this case, will Hadoop just use java.io.File as the local file system, treat them as local files, and not split the files? Just want to know the logic of hadoop handling local files. One suggestion I can think of is to split the files manually outside of hadoop. For example, generate lots of small files of 128M or 256M size. In this case, each mapper will process one small file, so you can get good utilization of your cluster, assuming you have a lot of small files. 
Yong From: ha...@cloudera.com Date: Fri, 15 Aug 2014 16:45:02 +0530 Subject: Re: hadoop/yarn and task parallelization on non-hdfs filesystems To: user@hadoop.apache.org Does your non-HDFS filesystem implement a getBlockLocations API, that MR relies on to know how to split files? The API is at http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/fs/FileSystem.html#getFileBlockLocations(org.apache.hadoop.fs.FileStatus , long, long), and MR calls it at https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L392 If not, perhaps you can enforce a manual chunking by asking MR to use custom min/max split sizes values via config properties: https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L66 On Fri, Aug 15, 2014 at 10:16 AM, Calvin iphcal...@gmail.com wrote: I've looked a bit into this problem some more, and from what another person has written, HDFS is tuned to scale appropriately [1] given the number of input splits, etc. In the case of utilizing the local filesystem (which is really a network share on a parallel filesystem), the settings might be set conservatively in order not to thrash the local disks or present a bottleneck in processing. Since this isn't a big concern, I'd rather tune the settings to efficiently utilize the local filesystem. Are there any pointers to where in the source code I could look in order to tweak such parameters? 
Thanks, Calvin [1] https://stackoverflow.com/questions/25269964/hadoop-yarn-and-task-parallelization-on-non-hdfs-filesystems On Tue, Aug 12, 2014 at 12:29 PM, Calvin iphcal...@gmail.com wrote: Hi all, I've instantiated a Hadoop 2.4.1 cluster and I've found that running MapReduce applications will parallelize differently depending on what kind of filesystem the input data is on. Using HDFS, a MapReduce job will spawn enough containers to maximize use of all available memory. For example, a 3-node cluster with 172GB of memory with each map task allocating 2GB, about 86 application containers will be created. On a filesystem that isn't HDFS (like NFS or in my use case, a parallel filesystem), a MapReduce job will only allocate a subset of available tasks (e.g., with the same 3-node cluster, about 25-40 containers are created). Since I'm using a parallel filesystem, I'm not as concerned with the bottlenecks one would find if one were to use NFS. Is there a YARN (yarn-site.xml) or MapReduce (mapred-site.xml) configuration that will allow me to effectively maximize resource utilization? Thanks, Calvin -- Harsh J -- jay vyas
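To make the custom min/max split size suggestion from this thread concrete, these are ordinary Hadoop 2.x job configuration properties that can go in mapred-site.xml or be passed per-job with -D flags (the byte values below are examples only, forcing splits between 128MB and 256MB):

```xml
<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>134217728</value>   <!-- 128 MB -->
</property>
<property>
  <name>mapreduce.input.fileinputformat.split.maxsize</name>
  <value>268435456</value>   <!-- 256 MB -->
</property>
```

Lowering the max split size yields more, smaller splits (and therefore more map tasks) even when the FileSystem reports a single large block.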
Re: Started learning Hadoop. Which distribution is best for native install in pseudo distributed mode?
Also, consider apache bigtop. That is the apache upstream Hadoop initiative, and it comes with smoke tests + puppet recipes for setting up your own Hadoop distro from scratch. IMHO, if learning or building your own tooling around Hadoop, bigtop is ideal. If interested in purchasing support, then the vendor distros are a good gateway. On Aug 12, 2014, at 5:31 PM, Aaron Eng a...@maprtech.com wrote: On that note, 2 is also misleading/incomplete. You might want to explain which specific features you are referencing so the original poster can figure out if those features are relevant. The inverse of 2 is also true; things like consistent snapshots and full random read/write over NFS are in MapR and not in HDFS. On Tue, Aug 12, 2014 at 2:10 PM, Kai Voigt k...@123.org wrote: 3. seems a biased and incomplete statement. Cloudera's distribution CDH is fully open source. The proprietary "stuff" you refer to is most likely Cloudera Manager, an additional tool to make deployment, configuration and monitoring easy. Nobody is required to use it to run a Hadoop cluster. Kai (a Cloudera Employee) On 12.08.2014 at 21:56, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: Hortonworks. Here is my reasoning: 1. Hortonworks is 100% open source. 2. MapR has stuff on their roadmap that Hortonworks has already accomplished and has moved on from to other things. 3. Cloudera has proprietary stuff in their stack. No. 4. Hortonworks makes training super accessible and there is a community around it. 5. Who the heck is BigInsights? (Which should tell you something.) Adaryl Bob Wakefield, MBA Principal Mass Street Analytics 913.938.6685 www.linkedin.com/in/bobwakefieldmba Twitter: @BobLovesData From: mani kandan Sent: Tuesday, August 12, 2014 3:12 PM To: user@hadoop.apache.org Subject: Started learning Hadoop. Which distribution is best for native install in pseudo distributed mode? Which distribution are you people using? Cloudera vs Hortonworks vs BigInsights? 
Re: Bench-marking Hadoop Performance
There are a lot of tests out there and it can be tough to determine what is a standard. - TeraGen/TeraSort and TestDFSIO are starting points. - Various other non-apache projects (such as YCSB or HiBench) will have good benchmarks for certain types of cases. - If looking for a more comprehensive long-term strategy, I'd suggest you ask on the bigtop mailing list, where we are building a broader community around uniform smoke testing and benchmarking of hadoop, hadoop compatible file systems, and YARN applications. On Tue, Jul 22, 2014 at 11:23 AM, Charley Newtonne cnewto...@gmail.com wrote: This is a new cluster I'm putting up and I need to get an idea of what to expect from a performance standpoint. Older docs point to GridMix and TestDFSIO. However, most of that documentation is obsolete and no longer applies to 2.4. Where can I find benchmarking docs for 2.4? What are my options? Also, I have searched safari books online, including rough cuts, but am not seeing books for the 2.4 release. If you know of a book for this release, please share. Thank you. -- jay vyas
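For the record, the TeraGen/TeraSort and TestDFSIO starting points above are invoked roughly like this on a 2.x install (a sketch only; the jar paths, row counts, and file sizes are illustrative and depend on your distribution layout):

```shell
# Generate rows, sort them, then validate the sort output
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen 1000000000 /bench/tera-in
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort /bench/tera-in /bench/tera-out
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teravalidate /bench/tera-out /bench/tera-report

# Filesystem throughput: write then read 10 files of 1000MB each
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
```

TestDFSIO prints aggregate throughput and average IO rate per file at the end of each run, which is the number most people compare across clusters.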
Re: Hadoop 2.4 test jar files.
FYI, the FS tests have just been overhauled and I'm not sure if those jars have the latest FS tests (HADOOP-9361). For those tests it's easy to add them by building hadoop and just adding the hadoop-common and hadoop-common test jars as maven dependencies locally. On Tue, Jul 22, 2014 at 2:00 PM, Charley Newtonne cnewto...@gmail.com wrote: ..You can expand the one(s) you're interested in and run tests contained in them... How is that done? How do I know what these classes do and what arguments they take? On Tue, Jul 22, 2014 at 1:42 PM, Ted Yu yuzhih...@gmail.com wrote: These jar files contain source code for the respective hadoop modules. You can expand the one(s) you're interested in and run the tests contained in them. Cheers On Tue, Jul 22, 2014 at 9:47 AM, Charley Newtonne cnewto...@gmail.com wrote: I have spent hours trying to find out how to run these jar files. The older versions are documented on the web and in some of the books. These, however, are not. How do I know ... - The purpose of each one of these jar files. - The class to call and what it does. - The arguments to pass. /a01/hadoop/2.4.0/share/hadoop/hdfs/hadoop-hdfs-2.4.0-tests.jar /a01/hadoop/2.4.0/share/hadoop/hdfs/sources/hadoop-hdfs-2.4.0-test-sources.jar /a01/hadoop/2.4.0/share/hadoop/tools/sources/hadoop-sls-2.4.0-test-sources.jar /a01/hadoop/2.4.0/share/hadoop/tools/sources/hadoop-datajoin-2.4.0-test-sources.jar /a01/hadoop/2.4.0/share/hadoop/tools/sources/hadoop-archives-2.4.0-test-sources.jar /a01/hadoop/2.4.0/share/hadoop/tools/sources/hadoop-gridmix-2.4.0-test-sources.jar /a01/hadoop/2.4.0/share/hadoop/tools/sources/hadoop-extras-2.4.0-test-sources.jar /a01/hadoop/2.4.0/share/hadoop/tools/sources/hadoop-streaming-2.4.0-test-sources.jar /a01/hadoop/2.4.0/share/hadoop/tools/sources/hadoop-distcp-2.4.0-test-sources.jar /a01/hadoop/2.4.0/share/hadoop/tools/sources/hadoop-rumen-2.4.0-test-sources.jar -- jay vyas
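To make the "add them as maven dependencies" step concrete: after building hadoop with mvn install, the test classes are available as a test-jar artifact alongside the main jar, so a local project can depend on them like this (the version shown is illustrative):

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.4.0</version>
</dependency>
<!-- the companion test jar, containing the FS contract test base classes -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.4.0</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
```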
Re: clarification on HBASE functionality
HBase is not hardcoded to hdfs: it works on any file system that implements the FileSystem interface; we've run it on glusterfs, for example. I assume some have also run it on s3 and other alternative file systems. ** However ** for best performance, the direct block IO hooks on hdfs can boost high-throughput applications. Ultimately, the hbase root directory only needs a fully qualified FileSystem URI which maps to a FileSystem class. On Jul 14, 2014, at 5:59 PM, Ted Yu yuzhih...@gmail.com wrote: Right. hbase is different from Cassandra in this regard. On Mon, Jul 14, 2014 at 2:57 PM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: Now this is different from Cassandra, which does NOT use HDFS, correct? (Sorry. Don’t know why that needed two emails.) B. From: Ted Yu Sent: Monday, July 14, 2014 4:53 PM To: user@hadoop.apache.org Subject: Re: clarification on HBASE functionality Yes. See http://hbase.apache.org/book.html#arch.hdfs On Mon, Jul 14, 2014 at 2:52 PM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: HBASE uses HDFS to store its data, correct? B.
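Concretely, the "fully qualified FileSystem URI" lives in hbase-site.xml; any scheme with a registered FileSystem implementation on the classpath works. The glusterfs authority below is purely illustrative:

```xml
<!-- hbase-site.xml: point HBase at any Hadoop-compatible FileSystem -->
<property>
  <name>hbase.rootdir</name>
  <!-- e.g. hdfs://namenode:8020/hbase for a stock HDFS deployment -->
  <value>glusterfs://server:port/hbase</value>
</property>
```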
Re: Hadoop virtual machine
I really like the Cascading recipes above, thanks for sharing that! Also, we have *apache bigtop vagrant recipes* which we curate for this kind of thing, and which are really useful: you can spin up a one-node or multi-node cluster just by running the startup.sh script. These are probably the most configurable and flexible, super easy to use, and allow you maximal control over your environment. 1) git clone https://github.com/apache/bigtop 2) cd bigtop-deploy/vm/vagrant/vagrant-puppet 3) Follow the directions in the README to create your hadoop cluster. You can look into the provision script to see how you can customize exactly which components (hbase, mahout, pig, ...) come installed in your distribution. Feel free to drop a line on the bigtop mailing list if you need any help getting them up and running. On Sun, Jul 6, 2014 at 12:47 PM, Andre Kelpe ake...@concurrentinc.com wrote: We have a multi-vm or single-vm setup with apache hadoop, if you want to give that a spin: https://github.com/Cascading/vagrant-cascading-hadoop-cluster - André On Sun, Jul 6, 2014 at 9:05 AM, MrAsanjar . afsan...@gmail.com wrote: For my hadoop development and testing I use LXC (linux containers) instead of a VM, mainly due to their lightweight resource consumption. As a matter of fact, as I am typing, my ubuntu system is automatically building a 6-node hadoop cluster on my 16G laptop. If you have an Ubuntu system you can install a fully configurable Hadoop 2.2.0 single-node or multi-node cluster in less than 10 minutes. 
Here is what you need to do: 1) Install and learn Ubuntu Juju (shouldn't take an hour) - instructions: https://juju.ubuntu.com/docs/getting-started.html 2) There are two types of hadoop charms: a) single node, for hadoop development: https://jujucharms.com/?text=hadoop2-devel b) multi-node, for testing: https://jujucharms.com/?text=hadoop Let me know if you need more help. On Sun, Jul 6, 2014 at 7:59 AM, Marco Shaw marco.s...@gmail.com wrote: Note that the CDH link is for Cloudera, which only provides Hadoop for Linux. HDP has pre-built VMs for both Linux and Windows hosts. You can also search for the HDInsight emulator, which runs on Windows and is based on HDP. Marco On Jul 6, 2014, at 12:38 AM, Gavin Yue yue.yuany...@gmail.com wrote: http://hortonworks.com/products/hortonworks-sandbox/ or CDH5 http://www.cloudera.com/content/cloudera-content/cloudera-docs/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html On Sat, Jul 5, 2014 at 11:27 PM, Manar Elkady m.elk...@fci-cu.edu.eg wrote: Hi, I am a newcomer to Hadoop, and I have read many online tutorials for setting up Hadoop on Windows using virtual machines, but all of them link to old versions of Hadoop virtual machines. Could anyone help me find a Hadoop virtual machine which includes a newer version of hadoop? Or should I do it myself from scratch? Also, any well-explained Hadoop installation tutorials and any other helpful material are appreciated. Manar, -- -- André Kelpe an...@concurrentinc.com http://concurrentinc.com -- jay vyas
Re: Hadoop with SAN
You can either use the SAN to back your datanodes, or implement a custom FileSystem over your SAN storage. Either approach has different drawbacks depending on your requirements.
Re: No job can run in YARN (Hadoop-2.2)
Sounds odd. So (1) you got a FileNotFoundException, and (2) you fixed it by commenting out memory-specific config parameters? Not sure how that would work... Any other details, or am I missing something else? On May 11, 2014, at 4:16 AM, Tao Xiao xiaotao.cs@gmail.com wrote: I'm sure this problem is caused by incorrect configuration. I commented out all the configurations regarding memory, and then jobs could run successfully. 2014-05-11 0:01 GMT+08:00 Tao Xiao xiaotao.cs@gmail.com: I installed Hadoop-2.2 on a cluster of 4 nodes, following Hadoop YARN Installation: The definitive guide. The configurations are as follows: ~/.bashrc core-site.xml hdfs-site.xml mapred-site.xml slaves yarn-site.xml I started the NameNode, DataNodes, ResourceManager and NodeManagers successfully, but no job can run successfully. For example, I ran the following job: [root@Single-Hadoop ~]# yarn jar /var/soft/apache/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 4 The output is as follows: 14/05/10 23:56:25 INFO mapreduce.Job: Task Id : attempt_1399733823963_0004_m_00_0, Status : FAILED Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: at org.apache.hadoop.util.Shell.runCommand(Shell.java:464) at org.apache.hadoop.util.Shell.run(Shell.java:379) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) 14/05/10 23:56:25 INFO mapreduce.Job: Task Id : attempt_1399733823963_0004_m_01_0, Status : FAILED Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: at org.apache.hadoop.util.Shell.runCommand(Shell.java:464) at org.apache.hadoop.util.Shell.run(Shell.java:379) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) ... ... 14/05/10 23:56:36 INFO mapreduce.Job: map 100% reduce 100% 14/05/10 23:56:37 INFO mapreduce.Job: Job job_1399733823963_0004 failed with state FAILED due to: Task failed task_1399733823963_0004_m_00 Job failed as tasks failed. 
failedMaps:1 failedReduces:0 14/05/10 23:56:37 INFO mapreduce.Job: Counters: 10 Job Counters Failed map tasks=7 Killed map tasks=1 Launched map tasks=8 Other local map tasks=6 Data-local map tasks=2 Total time spent by all maps in occupied slots (ms)=21602 Total time spent by all reduces in occupied slots (ms)=0 Map-Reduce Framework CPU time spent (ms)=0 Physical memory (bytes) snapshot=0 Virtual memory (bytes) snapshot=0 Job Finished in 24.515 seconds java.io.FileNotFoundException: File does not exist: hdfs://Single-Hadoop.zd.com/user/root/QuasiMonteCarlo_1399737371038_1022927375/out/reduce-out at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1110) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1749) at
Yarn hangs @Scheduled
Hi folks: My yarn jobs seem to be hanging in the SCHEDULED state. I've restarted my nodemanager a few times, but no luck. What are the possible reasons that YARN job submission hangs? I know one is resource availability, but this is a fresh cluster on a VM with only one job, one NM, and one RM. 14/04/24 16:20:32 INFO ipc.Server: Auth successful for yarn@IDH1.LOCAL(auth:SIMPLE) 14/04/24 16:20:32 INFO authorize.ServiceAuthorizationManager: Authorization successful for yarn@IDH1.LOCAL (auth:KERBEROS) for protocol=interface org.apache.hadoop.yarn.api.ApplicationClientProtocolPB 14/04/24 16:20:32 INFO resourcemanager.ClientRMService: Allocated new applicationId: 4 14/04/24 16:20:33 INFO resourcemanager.ClientRMService: Application with id 4 submitted by user yarn 14/04/24 16:20:33 INFO resourcemanager.RMAuditLogger: USER=yarn IP=192.168.122.100 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1398370674313_0004 14/04/24 16:20:33 INFO rmapp.RMAppImpl: Storing application with id application_1398370674313_0004 14/04/24 16:20:33 INFO rmapp.RMAppImpl: application_1398370674313_0004 State change from NEW to NEW_SAVING 14/04/24 16:20:33 INFO recovery.RMStateStore: Storing info for app: application_1398370674313_0004 14/04/24 16:20:33 INFO rmapp.RMAppImpl: application_1398370674313_0004 State change from NEW_SAVING to SUBMITTED 14/04/24 16:20:33 INFO fair.FairScheduler: Accepted application application_1398370674313_0004 from user: yarn, in queue: default, currently num of applications: 4 14/04/24 16:20:33 INFO rmapp.RMAppImpl: application_1398370674313_0004 State change from SUBMITTED to ACCEPTED 14/04/24 16:20:33 INFO resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1398370674313_0004_01 14/04/24 16:20:33 INFO attempt.RMAppAttemptImpl: appattempt_1398370674313_0004_01 State change from NEW to SUBMITTED 14/04/24 16:20:33 INFO fair.FairScheduler: Added Application Attempt 
appattempt_1398370674313_0004_01 to scheduler from user: yarn 14/04/24 16:20:33 INFO attempt.RMAppAttemptImpl: appattempt_1398370674313_0004_01 State change from SUBMITTED to SCHEDULED -- Jay Vyas http://jayunit100.blogspot.com
Re: Yarn hangs @Scheduled
I fixed the issue by setting yarn.scheduler.minimum-allocation-mb=1024. I'm thinking this happens a lot in VMs where you run with low memory. If memory is too low, I think other failures will occur at runtime when you start daemons or tasks... If too high, then the tasks will hang... On Apr 24, 2014, at 5:25 PM, Vinod Kumar Vavilapalli vino...@apache.org wrote: How much memory do you see as available on the RM web page? And what are the memory requirements for this app? And is this an MR job? +Vinod Hortonworks Inc. http://hortonworks.com/ On Thu, Apr 24, 2014 at 1:23 PM, Jay Vyas jayunit...@gmail.com wrote: Hi folks: My yarn jobs seem to be hanging in the SCHEDULED state. I've restarted my nodemanager a few times, but no luck. What are the possible reasons that YARN job submission hangs? I know one is resource availability, but this is a fresh cluster on a VM with only one job, one NM, and one RM. 14/04/24 16:20:32 INFO ipc.Server: Auth successful for yarn@IDH1.LOCAL (auth:SIMPLE) 14/04/24 16:20:32 INFO authorize.ServiceAuthorizationManager: Authorization successful for yarn@IDH1.LOCAL (auth:KERBEROS) for protocol=interface org.apache.hadoop.yarn.api.ApplicationClientProtocolPB 14/04/24 16:20:32 INFO resourcemanager.ClientRMService: Allocated new applicationId: 4 14/04/24 16:20:33 INFO resourcemanager.ClientRMService: Application with id 4 submitted by user yarn 14/04/24 16:20:33 INFO resourcemanager.RMAuditLogger: USER=yarn IP=192.168.122.100 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1398370674313_0004 14/04/24 16:20:33 INFO rmapp.RMAppImpl: Storing application with id application_1398370674313_0004 14/04/24 16:20:33 INFO rmapp.RMAppImpl: application_1398370674313_0004 State change from NEW to NEW_SAVING 14/04/24 16:20:33 INFO recovery.RMStateStore: Storing info for app: application_1398370674313_0004 14/04/24 16:20:33 INFO rmapp.RMAppImpl: application_1398370674313_0004 State change from NEW_SAVING to SUBMITTED 
14/04/24 16:20:33 INFO fair.FairScheduler: Accepted application application_1398370674313_0004 from user: yarn, in queue: default, currently num of applications: 4 14/04/24 16:20:33 INFO rmapp.RMAppImpl: application_1398370674313_0004 State change from SUBMITTED to ACCEPTED 14/04/24 16:20:33 INFO resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1398370674313_0004_01 14/04/24 16:20:33 INFO attempt.RMAppAttemptImpl: appattempt_1398370674313_0004_01 State change from NEW to SUBMITTED 14/04/24 16:20:33 INFO fair.FairScheduler: Added Application Attempt appattempt_1398370674313_0004_01 to scheduler from user: yarn 14/04/24 16:20:33 INFO attempt.RMAppAttemptImpl: appattempt_1398370674313_0004_01 State change from SUBMITTED to SCHEDULED -- Jay Vyas http://jayunit100.blogspot.com
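For anyone hitting the same hang, the fix described in this thread corresponds to something like the following in yarn-site.xml (both property names are standard; the NodeManager capacity value is illustrative and should be sized to your VM):

```xml
<!-- smallest container the scheduler will grant -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<!-- total memory each NodeManager advertises; must be >= the minimum allocation -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>
```

If the NodeManager's advertised capacity is smaller than the minimum allocation, no container can ever be granted and applications sit in SCHEDULED forever.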
Re: Strange error in Hadoop 2.2.0: FileNotFoundException: file:/tmp/hadoop-hadoop/mapred/
Is this happening in the job client, or in the mappers? On Tue, Apr 22, 2014 at 11:21 AM, Natalia Connolly natalia.v.conno...@gmail.com wrote: Hello, I am running Hadoop 2.2.0 in single-node cluster mode. My application dies with the following strange error: Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: file:/tmp/hadoop-hadoop/mapred/local/1398179594286/part-0 (No such file or directory) This looks like the kind of file that should have been created on the fly (and then deleted). Does anyone know what this error is really a symptom of? Perhaps some permissions issue? Thank you, Natalia -- Jay Vyas http://jayunit100.blogspot.com
Re: Shuffle Error after enabling Kerberos authentication
(bump) This is a good question. I'm new to kerberos as well, and have been wondering how to prevent scenarios such as this from happening. My thought is that, since Kerberos IIRC requires a ticket for each pair of client + service working together, maybe there is a chance that, if *any* two nodes in a cluster haven't been initialized with the right tickets to talk together, then an error can happen during shuffle-sort, because so much distributed copying is going on. In any case, I'd love to know any good smoke tests for a large kerberized hadoop cluster that don't require running a mapreduce job. On Apr 19, 2014, at 5:32 AM, Terance Dias terance.d...@gmail.com wrote: Hi, I'm using apache hadoop-2.1.0-beta. I'm able to set up a basic multi-node cluster and run map reduce jobs. But when I enable Kerberos authentication, the reduce task fails with the following error. Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#1 at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:121) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:380) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1477) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157) Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out. 
at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:311) at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:243) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:347) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165) I did a search and found that people have generally seen this error when their network configuration is not correct, so that the data nodes are not able to communicate with each other to shuffle the data. I don't think that is the problem in my case, because everything works fine when Kerberos authentication is disabled. Any idea what the problem could be? Thanks, Terance. -- Jay Vyas http://jayunit100.blogspot.com
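As a starting point for the non-MapReduce smoke tests asked about above, something like the following, run from each node, can at least confirm that tickets are granted and that authenticated RPC to the NameNode and ResourceManager works (the principal and keytab path are illustrative):

```shell
# Obtain a ticket non-interactively from a keytab, then confirm it
kinit -kt /etc/security/keytabs/smoke.keytab smokeuser@EXAMPLE.COM
klist

# Exercise authenticated RPC against the NameNode and ResourceManager
hdfs dfs -ls /
yarn node -list

# Negative test: with no ticket, the same call should fail with a GSS error
kdestroy
hdfs dfs -ls / && echo "WARNING: cluster accepted an unauthenticated request"
```

This does not exercise node-to-node shuffle traffic, but it quickly flags nodes whose keytabs, principals, or krb5.conf are broken before you burn time on a full job.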
Re: MapReduce for complex key/value pairs?
- Adding parsing logic directly in your mappers/reducers (e.g. just writing JSON strings) is the simplest, least elegant way to do it. - A more advanced approach is to write custom writables which serialize and parse the data. - The truly portable and right way to do it is to define a schema and use Avro to parse it. Unlike manually adding parsing to app logic, or adding JSON deserialization to your mappers/reducers, proper Avro serialization has the benefit of increasing performance and app portability while also making the code more maintainable (it interoperates with pure java domain objects). On Tue, Apr 8, 2014 at 2:30 PM, Harsh J ha...@cloudera.com wrote: Yes, you can write custom writable classes that detail and serialise your required data structure. If you have Hadoop: The Definitive Guide, check out the section Serialization under the chapter Hadoop I/O. On Tue, Apr 8, 2014 at 9:16 PM, Natalia Connolly natalia.v.conno...@gmail.com wrote: Dear All, I was wondering if the following is possible using MapReduce. I would like to create a job that loops over a bunch of documents, tokenizes them into ngrams, and stores not only the counts of the ngrams but also _which_ document(s) had each particular ngram. In other words, the key would be the ngram but the value would be an integer (the count) _and_ an array of document ids. Is this something that can be done? Any pointers would be appreciated. I am using Java, btw. Thank you, Natalia Connolly -- Harsh J -- Jay Vyas http://jayunit100.blogspot.com
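To make the custom-writable option concrete, here is a minimal sketch of the write/readFields pattern for an (ngram -> count + document ids) value. It uses only java.io so it compiles standalone; in a real job the class would additionally declare implements org.apache.hadoop.io.Writable (whose contract is exactly these two methods) and be set as the job's map output value class. The class and field names are illustrative, not from the thread:

```java
import java.io.*;
import java.util.*;

// Sketch of a composite value: an occurrence count plus the ids of the
// documents containing the ngram. write()/readFields() mirror the two
// methods Hadoop's Writable interface requires.
public class NgramStats {
    private int count;
    private List<String> docIds = new ArrayList<>();

    public NgramStats() {}                        // Writables need a no-arg constructor

    public void add(String docId) { count++; docIds.add(docId); }

    public void write(DataOutput out) throws IOException {
        out.writeInt(count);
        out.writeInt(docIds.size());              // length-prefix the variable part
        for (String id : docIds) out.writeUTF(id);
    }

    public void readFields(DataInput in) throws IOException {
        count = in.readInt();
        int n = in.readInt();
        docIds = new ArrayList<>(n);
        for (int i = 0; i < n; i++) docIds.add(in.readUTF());
    }

    public int getCount() { return count; }
    public List<String> getDocIds() { return docIds; }

    public static void main(String[] args) throws IOException {
        NgramStats s = new NgramStats();
        s.add("doc-1"); s.add("doc-7");

        // Round-trip through a byte stream, as the MR framework would do
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        s.write(new DataOutputStream(buf));
        NgramStats back = new NgramStats();
        back.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));

        System.out.println(back.getCount() + " " + back.getDocIds()); // 2 [doc-1, doc-7]
    }
}
```

The same shape extends naturally: the reducer sums counts and concatenates the id lists before emitting one NgramStats per ngram key.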
Re: Hadoop Serialization mechanisms
But I believe, w.r.t. whether we will see performance gains when using avro/thrift/... over writables: it depends on the writable implementation. For example, if I have a writable serialization which can use a bit map to store an enum, but then read that enum back as a string, it will look the same to the user, but my writable implementation would be superior. We can obviously say that if you use avro/thrift/protocol buffers in an efficient way, then yes, you will see a performance gain over, say, storing everything as Text writable objects. But clever optimizations can be done even within the Writable framework as well. On Sun, Mar 30, 2014 at 4:08 PM, Harsh J ha...@cloudera.com wrote: Does Hadoop provide a pluggable feature for Serialization for both the above cases? - You can override the RPC serialisation module and engine with a custom class if you wish to, but it would not be a trivial task. - You can easily use custom data serialisation modules for I/O. Is Writable the default Serialization mechanism for both the above cases? While MR's built-in examples in Apache Hadoop continue to use Writables, the RPCs have moved to using Protocol buffers from 2.x onwards. Were there any changes w.r.t. Serialization from Hadoop 1.x to Hadoop 2.x? Yes, partially; see above. Will there be a significant performance gain if the default Serialization, i.e. Writables, is replaced with Avro, Protocol Buffers or Thrift in Map Reduce programming? Yes, you should see a gain in using a more efficient data serialisation format for data files. On Sun, Mar 30, 2014 at 9:09 PM, Jay Vyas jayunit...@gmail.com wrote: Those are all great questions, and mostly difficult to answer. I haven't played with the serialization APIs in some time, but let me try to give some guidance. W.r.t. your bulleted questions above: 1) Serialization is file system independent: the use of any hadoop compatible file system should support any kind of serialization. 2) See (1). 
The default serialization is Writables: But you can easily add your own by modifying the io.serializations configuration parameter. 3) I doubt anything significantly affected the way serialization works: The main thrust of 1.x -> 2.x was in the way services are deployed, not changing the internals of how data is serialized. After all, the serialization APIs need to remain stable even as the architecture of hadoop changes. 4) It depends on the implementation. If you have a custom writable that is really good at compressing your data, that will be better than using a thrift auto-generated API for serialization that is uncustomized out of the box. Example: Say you are writing strings and you know each string is max 3 characters. A smart Writable serializer with custom implementations optimized for your data will beat a thrift serialization approach. But I think in general, the advantage of thrift/avro is that it's easier to get really good compression natively out-of-the-box, due to the fact that many different data types are strongly supported by the way they apply the schemas (for example, a thrift struct can contain a boolean, two strings, and an int; these types will all be optimized for you by thrift). Whereas in Writables, you cannot as easily create sophisticated types with optimization of nested properties. On Thu, Mar 27, 2014 at 2:59 AM, Radhe Radhe radhe.krishna.ra...@live.com wrote: Hello All, AFAIK Hadoop serialization comes into the picture in 2 areas: putting data on the wire, i.e., for interprocess communication between nodes using RPC; and putting data on disk, i.e. using Map Reduce for persistent storage, say HDFS. I have a couple of questions regarding the Serialization mechanisms used in Hadoop: Does Hadoop provide a pluggable feature for Serialization for both the above cases? Is Writable the default Serialization mechanism for both the above cases? Were there any changes w.r.t. Serialization from Hadoop 1.x to Hadoop 2.x?
Will there be a significant performance gain if the default Serialization, i.e. Writables, is replaced with Avro, Protocol Buffers or Thrift in Map Reduce programming? Thanks, -RR -- Jay Vyas http://jayunit100.blogspot.com -- Harsh J -- Jay Vyas http://jayunit100.blogspot.com
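The "max 3 characters" point above can be made concrete. If every value is known to be an ASCII string of at most 3 characters, a domain-aware serializer can emit a fixed 3-byte field, while the generic DataOutput.writeUTF() encoding spends two extra bytes per value on a length prefix. The class name below is hypothetical; this is a self-contained sketch of the comparison, not Hadoop code.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

// Illustrates the advantage of domain-aware serialization: for ASCII
// strings known to be at most 3 chars, a fixed-width encoding is smaller
// than the generic length-prefixed encoding of DataOutput.writeUTF().
class FixedWidthDemo {

    // Fixed 3-byte field, padded with NUL bytes for shorter strings.
    static byte[] writeFixed(String s) {
        byte[] buf = new byte[3];
        byte[] src = s.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(src, 0, buf, 0, src.length);
        return buf;
    }

    // Generic encoding: writeUTF() emits a 2-byte length prefix followed
    // by the modified-UTF-8 bytes.
    static byte[] writeGeneric(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(s);
        return bos.toByteArray();
    }
}
```

For "abc" the fixed encoding is 3 bytes versus 5 for writeUTF(); over millions of records in a shuffle, that kind of saving is exactly the "clever optimization within the Writable framework" described above.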
Jobs fail immediately in local mode ?
I'm running a job in local mode, and have found that it returns immediately, switching the job state to FAILURE. From the /tmp/hadoop-jay directory, I see that clearly an attempt was made to run the job, and that some files seem to have been created, but I don't see any clues. ├── [102] local │ └── [102] localRunner │ └── [170] jay │ ├── [ 68] job_local1531736937_0001 │ ├── [ 68] job_local218993552_0002 │ └── [136] jobcache │ ├── [102] job_local1531736937_0001 │ │ └── [102] attempt_local1531736937_0001_m_00_0 │ │ └── [136] output │ │ ├── [ 14] file.out │ │ └── [ 32] file.out.index │ └── [102] job_local218993552_0002 │ └── [102] attempt_local218993552_0002_m_00_0 │ └── [136] output │ ├── [ 14] file.out │ └── [ 32] file.out.index └── [136] staging ├── [102] jay1531736937 └── [102] jay218993552 Any thoughts on how I can further diagnose what's happening and why my job fails without a stacktrace? Because I don't have hadoop installed on the system (i.e. I'm just running a java app that fires up a hadoop client locally), I can't see anything in /var/log. -- Jay Vyas http://jayunit100.blogspot.com
Re: Doubt
Certainly it is, and quite common, especially if you have some high performance machines: they can run as mapreduce slaves and also double as mongo hosts. The problem would of course be that when running mapreduce jobs you might have very slow network bandwidth at times, and if your front end needs fast response times all the time from the mongo instances, you could be in trouble. On Wed, Mar 19, 2014 at 11:50 AM, praveenesh kumar praveen...@gmail.com wrote: Why not? It's just a matter of installing 2 different packages. Depending on what you want to use it for, you need to take care of a few things, but as far as installation is concerned, it should be easily doable. Regards Prav On Wed, Mar 19, 2014 at 3:41 PM, sri harsha rsharsh...@gmail.com wrote: Hi all, is it possible to install MongoDB on the same VM which contains hadoop? -- amiable harsha -- Jay Vyas http://jayunit100.blogspot.com
Re: What if file format is dependent upon first few lines?
-- method 1 -- You could, I think, just extend FileInputFormat with isSplitable returning false. Then each file won't be broken up into separate blocks, and will be processed as a whole by a single mapper. This is probably the easiest thing to do, but if you have huge files, it won't perform very well. -- method 2 -- You can use Harsh's suggestion (thanks for that idea, I didn't know about it). 1) In the setup method of a mapper, you can get the file path using ((FileSplit) context.getInputSplit()).getPath(); 2) Then, also in the mapper's setup method, you should be able to open a file input stream and call seek(0) to read the file header, as Harsh says. 3) When you process the header, you can store the results from the setup method in a local variable, and the map method can read from that variable and proceed. On Thu, Feb 27, 2014 at 9:09 PM, Fengyun RAO raofeng...@gmail.com wrote: thanks, Harsh. could you specify more detail, or give some links or an example where I can start? 2014-02-27 22:17 GMT+08:00 Harsh J ha...@cloudera.com: A mapper's record reader implementation need not be restricted to strictly only the input split boundary. It is a loose relationship - you can always seek(0), read the lines you need to prepare, then seek(offset) and continue reading. Apache Avro (http://avro.apache.org) has a similar format - the header contains the schema a reader needs to work. On Thu, Feb 27, 2014 at 1:59 AM, Fengyun RAO raofeng...@gmail.com wrote: Below is a fake sample of a Microsoft IIS log: #Software: Microsoft Internet Information Services 7.5 #Version: 1.0 #Date: 2013-07-04 20:00:00 #Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken 2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 2.2.2.2 someuserAgent 200 0 0 390 2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 3.3.3.3 someuserAgent 200 0 0 390 2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 4.4.4.4 someuserAgent 200 0 0 390 ...
The first four lines describe the file format, which is a must for parsing each log line. It means the log file could NOT simply be split; otherwise the second split would lose the file format information. How could each mapper get the first few lines of the file? -- Harsh J -- Jay Vyas http://jayunit100.blogspot.com
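The header-parsing half of method 2 above can be sketched on its own. In a real mapper the reader would come from FileSystem.open() on the path obtained via ((FileSplit) context.getInputSplit()).getPath(), followed by seek(0); here the Hadoop specifics are replaced by a plain Reader so the sketch is self-contained, and the class name is illustrative.

```java
import java.io.*;

// Sketch of method 2: before processing its split, a mapper re-reads the
// start of the file and parses the "#Fields:" header line from an IIS log.
// Hadoop specifics (FileSplit, FileSystem.open, seek(0)) are replaced by a
// plain BufferedReader here.
class IisHeaderReader {

    // Returns the column names declared by the "#Fields:" directive, or an
    // empty array if the header block contains none.
    static String[] readFieldNames(BufferedReader r) throws IOException {
        String line;
        while ((line = r.readLine()) != null) {
            if (line.startsWith("#Fields:")) {
                return line.substring("#Fields:".length()).trim().split(" ");
            }
            if (!line.startsWith("#")) {
                break; // past the header block; no #Fields line found
            }
        }
        return new String[0];
    }
}
```

A mapper would call this once in setup(), cache the returned names in a field, and then use them to map each tab/space-separated record to named columns in map().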
YARN job exits fast without failure, but does nothing
Hi yarn: I've traced an oozie problem to a yarn task log, which originates from an oozie-submitted job: http://paste.fedoraproject.org/78099/92698193/raw/ Although the above yarn task ends in SUCCESS, it seems to do essentially nothing. Has anyone ever seen a log like that before? Any insight into why I might have an empty task like this would be appreciated. I won't go into details about oozie here since this is the yarn mailing list, but the link to my original problem is here: http://qnalist.com/questions/4726691/oozie-reports-unkown-hadoop-job-failure-but-no-error-indication-in-yarn -- Jay Vyas http://jayunit100.blogspot.com
Re: How to ascertain why LinuxContainer dies?
Not sure where the containers dump standard out/error to? I figured it would be propagated in the node manager logs if anywhere, right? Sent from my iPhone On Feb 14, 2014, at 4:46 AM, Harsh J ha...@cloudera.com wrote: Hi, Does your container command generate any stderr/stdout outputs that you can check under the container's work directory after it fails? On Fri, Feb 14, 2014 at 9:46 AM, Jay Vyas jayunit...@gmail.com wrote: I have a linux container that dies. The nodemanager logs only say: WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exception from container-launch : org.apache.hadoop.util.Shell$ExitCodeException: at org.apache.hadoop.util.Shell.runCommand(Shell.java:202) at org.apache.hadoop.util.Shell.run(Shell.java:129) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Where can I find the root cause of the non-zero exit code? -- Jay Vyas http://jayunit100.blogspot.com -- Harsh J
Re: How to ascertain why LinuxContainer dies?
Okay Harsh: your hint was enough to get me back on track! I found the linux container logs and they are wonderful :)... I guess at the end of each container run, logs get propagated into the distributed file system's /var/log directories. In any case, once I dug in there, I found the cryptic failure was because my done_intermediate permissions were bad. Anyways, thanks for the hint Harsh! After monitoring the local /var/log/hadoop-yarn/container/ directory, I was able to see that the stdout/stderr files were being deleted, and then after some googling I found a post about how YARN aggregates logs into the DFS. Anyways, problem solved. For those curious: if debugging Yarn linux containers that are dying (as shown in the [local] /var/log/hadoop-yarn/ nodemanager logs), you can dig in more after the task dies by going into hadoop fs -cat /var/log/hadoop-yarn/apps/oozie_user/logs/application_1392385522708_0008/* On Fri, Feb 14, 2014 at 9:17 AM, German Florez-Larrahondo german...@samsung.com wrote: I believe that errors on containers are not propagated to the standard Java logs. You have to look into the std* and syslog files of the container. Here is an example: *.../userlogs/application_1391549207212_0006/container_1391549207212_0006_01_27* [htf@gfldesktop container_1391549207212_0006_01_27]$ ls -lart total 60 -rw-rw-r-- 1 htf htf 0 Feb 4 17:27 stdout -rw-rw-r-- 1 htf htf 0 Feb 4 17:27 stderr drwx--x--- 28 htf htf 4096 Feb 4 17:27 .. drwx--x--- 2 htf htf 4096 Feb 4 17:27 . -rw-rw-r-- 1 htf htf 50471 Feb 4 17:31 syslog Regards ./g -Original Message- From: Jay Vyas [mailto:jayunit...@gmail.com] Sent: Friday, February 14, 2014 7:02 AM To: user@hadoop.apache.org Cc: user@hadoop.apache.org Subject: Re: How to ascertain why LinuxContainer dies? Not sure where the containers dump standard out/error to? I figured it would be propagated in the node manager logs if anywhere, right?
Sent from my iPhone On Feb 14, 2014, at 4:46 AM, Harsh J ha...@cloudera.com wrote: Hi, Does your container command generate any stderr/stdout outputs that you can check under the container's work directory after it fails? On Fri, Feb 14, 2014 at 9:46 AM, Jay Vyas jayunit...@gmail.com wrote: I have a linux container that dies. The nodemanager logs only say: WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exception from container-launch : org.apache.hadoop.util.Shell$ExitCodeException: at org.apache.hadoop.util.Shell.runCommand(Shell.java:202) at org.apache.hadoop.util.Shell.run(Shell.java:129) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Where can I find the root cause of the non-zero exit code? -- Jay Vyas http://jayunit100.blogspot.com -- Harsh J -- Jay Vyas http://jayunit100.blogspot.com
How to ascertain why LinuxContainer dies?
I have a linux container that dies. The nodemanager logs only say: WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exception from container-launch : org.apache.hadoop.util.Shell$ExitCodeException: at org.apache.hadoop.util.Shell.runCommand(Shell.java:202) at org.apache.hadoop.util.Shell.run(Shell.java:129) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Where can I find the root cause of the non-zero exit code? -- Jay Vyas http://jayunit100.blogspot.com
Re: Test hadoop code on the cloud
As a slightly more advanced option for OpenStack people: consider trying Savanna (Hadoop provisioned on top of OpenStack) as well. On Wed, Feb 12, 2014 at 10:23 AM, Silvina Caíno Lores silvi.ca...@gmail.com wrote: You can check Amazon Elastic MapReduce, which comes preconfigured on EC2 but you need to pay a little for it, or make your own custom installation on EC2 (beware that EC2 instances come with nothing but really basic shell tools on them, so it may take a while to get it running). Amazon's free tier allows you to instantiate several tiny machines; when you spend your free quota they start charging you, so be careful. Good luck :D On 12 February 2014 13:27, Andrea Barbato and.barb...@gmail.com wrote: Thanks for the answer, but what if I want to test my code on a fully distributed installation? (for more accurate performance) 2014-02-12 13:01 GMT+01:00 Zhao Xiaoguang cool...@gmail.com: I think you can test it in Amazon EC2 with pseudo distributed mode; it supports 1 tiny instance for 1 year free. Send From My Macbook On Feb 12, 2014, at 6:29 PM, Andrea Barbato and.barb...@gmail.com wrote: Hi! I need to test my hadoop code on a cluster, what is the simplest way to do this on the cloud? Is there any way to do it for free? Thanks in advance -- Jay Vyas http://jayunit100.blogspot.com
YARN FSDownload: How did Mr1 do it ?
I'm noticing that resource localization is much more complex in YARN than MR1; in particular, the timestamps need to be identical, or else an exception is thrown. I never saw that in MR1. How did MR1 JobTrackers handle resource localization differently than MR2 App Masters? -- Jay Vyas http://jayunit100.blogspot.com
Re: performance of hadoop fs -put
No, I'm using a glob pattern; it's all done in one put statement. On Tue, Jan 28, 2014 at 9:22 PM, Harsh J ha...@cloudera.com wrote: Are you calling one command per file? That's bound to be slow as it invokes a new JVM each time. On Jan 29, 2014 7:15 AM, Jay Vyas jayunit...@gmail.com wrote: I'm finding that hadoop fs -put on a cluster is quite slow for me when I have large amounts of small files... much slower than native file ops. Note that I'm using the RawLocalFileSystem as the underlying backing filesystem that is being written to in this case, so HDFS isn't the issue. I see that the Put class creates a linked list of the elements in the path. 1) Is there a more performant way to run fs -put? 2) Has anyone else noted that fs -put has extra overhead? I'm going to trace some more, but I just wanted to bounce this off the mailing list... maybe others have also run into this issue. ** Is hadoop fs -put inherently slower than a unix cp action, regardless of filesystem -- and if so, why? ** -- Jay Vyas http://jayunit100.blogspot.com -- Jay Vyas http://jayunit100.blogspot.com
Re: DistributedCache deprecated
Gotcha, this makes sense. On Wed, Jan 29, 2014 at 4:44 PM, praveenesh kumar praveen...@gmail.com wrote: @Jay - Plus if you look at the DistributedCache class, these methods have been added inside the Job class. I am guessing they have kept the functionality the same, just merged the DistributedCache class into the Job class itself, giving developers more methods with fewer classes to worry about, thus simplifying the API. I hope that makes sense. Regards Prav On Wed, Jan 29, 2014 at 9:41 PM, praveenesh kumar praveen...@gmail.com wrote: @Jay - I don't know how the Job class is replacing the DistributedCache class, but I remember trying distributed cache functions like void addArchiveToClassPath(Path archive) - add an archive path to the current set of classpath entries; void addCacheArchive(URI uri) - add an archive to be localized; void addCacheFile(URI uri) - add a file to be localized (see http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html), and it works fine, the same way you were using DC before. Well, I am not sure what would be the best answer, but if you are trying to use DC, I was able to do it with the Job class itself. Regards Prav On Wed, Jan 29, 2014 at 9:27 PM, Jay Vyas jayunit...@gmail.com wrote: Thanks for asking this: I'm not sure, and didn't realize this until you mentioned it! 1) Prav: You are implying that we would use the Job class... but how could it replace the DC? 2) The point of the DC is to replicate a file so that it's present and local on ALL nodes.
I didn't know it was deprecated, but there must be some replacement for it - or maybe it just got renamed and moved? SO... what is the future of the DistributedCache for mapreduce jobs? On Wed, Jan 29, 2014 at 4:22 PM, praveenesh kumar praveen...@gmail.com wrote: I think you can use the Job class. http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html Regards Prav On Wed, Jan 29, 2014 at 9:13 PM, Giordano, Michael michael.giord...@vistronix.com wrote: I noticed that in Hadoop 2.2.0 org.apache.hadoop.mapreduce.filecache.DistributedCache has been deprecated. (http://hadoop.apache.org/docs/current/api/deprecated-list.html#class) Is there a class that provides equivalent functionality? My application relies heavily on DistributedCache. Thanks, Mike G. This communication, along with its attachments, is considered confidential and proprietary to Vistronix. It is intended only for the use of the person(s) named above. Note that unauthorized disclosure or distribution of information not generally known to the public is strictly prohibited. If you are not the intended recipient, please notify the sender immediately. -- Jay Vyas http://jayunit100.blogspot.com -- Jay Vyas http://jayunit100.blogspot.com
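The Job-class replacement that Prav describes can be sketched as follows. This is an API-usage sketch only, not a runnable example: it needs the Hadoop 2.x mapreduce client on the classpath plus a live cluster, and the HDFS path and job name are hypothetical; `conf` and `context` come from the surrounding driver and Mapper.

```java
// Client side (driver): the Job methods replace the old static
// DistributedCache helpers. Path and job name are hypothetical.
Job job = Job.getInstance(conf, "my-job");
job.addCacheFile(new URI("hdfs:///shared/lookup.txt"));

// Task side, e.g. in Mapper.setup(Context context): the localized
// files are retrievable via the context.
URI[] cached = context.getCacheFiles();
```

So the functionality survives; only the entry points moved from the deprecated DistributedCache class onto Job (for adding) and JobContext (for reading).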
Re: Passing data from Client to AM
While you're at it, what about adding values to the Configuration() object - does that still work as a hack for information passing? On Wed, Jan 29, 2014 at 5:25 PM, Arun C Murthy a...@hortonworks.com wrote: Command line arguments and env variables are the most direct options. A more onerous option is to write some data to a file in HDFS, use LocalResource to ship it to the container on each node, and get the application code to read that file locally. (In MRv1 parlance, that is the Distributed Cache). hth, Arun On Jan 29, 2014, at 12:59 PM, Brian C. Huffman bhuff...@etinternational.com wrote: I'm looking at Distributed Shell as an example for writing a YARN application. My question is why are the script path and associated metadata saved as environment variables? Are there any other ways besides environment variables or command line arguments for passing data from the Client to the ApplicationMaster? Thanks, Brian -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/ CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You. -- Jay Vyas http://jayunit100.blogspot.com
Re: Hadoop-2.2.0 and Pig-0.12.0 - error IBM_JAVA
Thanks for sharing this, as we had the same problem and are wrestling with similar errors. I'm starting to think that there is something overly difficult about pig/hadoop 2.x deployment, related to which version of pig you use. Cheolsoo has helped us resolve our issue by pointing us to https://issues.apache.org/jira/browse/PIG-3729 Hope somebody can illuminate what's actually going on with pig, hadoop 2.x, and hadoop 1.x: why don't the standard pig jars work on 2.x? On Tue, Jan 28, 2014 at 2:23 PM, Serge Blazhievsky hadoop...@gmail.com wrote: Which hadoop distribution are you using? On Tue, Jan 28, 2014 at 10:04 AM, Viswanathan J jayamviswanat...@gmail.com wrote: Hi Guys, I'm running hadoop version 2.2.0 with pig-0.12.0; when I try to run any job I get the error below: *java.lang.NoSuchFieldError: IBM_JAVA* Is this because of the Java version or a compatibility issue between hadoop and pig? I'm using Java version *1.6.0_31*. Please help me out. -- Regards, Viswa.J -- Jay Vyas http://jayunit100.blogspot.com
performance of hadoop fs -put
I'm finding that hadoop fs -put on a cluster is quite slow for me when I have large amounts of small files... much slower than native file ops. Note that I'm using the RawLocalFileSystem as the underlying backing filesystem that is being written to in this case, so HDFS isn't the issue. I see that the Put class creates a linked list of the elements in the path. 1) Is there a more performant way to run fs -put? 2) Has anyone else noted that fs -put has extra overhead? I'm going to trace some more, but I just wanted to bounce this off the mailing list... maybe others have also run into this issue. ** Is hadoop fs -put inherently slower than a unix cp action, regardless of filesystem -- and if so, why? ** -- Jay Vyas http://jayunit100.blogspot.com
Strange rpc exception in Yarn
Hi folks: At the **end** of a successful job, I'm getting some strange stack traces when using pig; however, judging from the stacktrace, it doesn't seem to be pig-specific. Rather, it appears that the job client is attempting to do something funny. Has anyone ever seen this sort of exception in Yarn? It seems as though it's related to an IPC call, but the IPC call is throwing an exception in the hasNext(..) implementation in the AbstractFileSystem. ERROR org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:roofmonkey (auth:SIMPLE) cause:java.io.IOException: org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): java.lang.NullPointerException at org.apache.hadoop.fs.AbstractFileSystem$1.hasNext(AbstractFileSystem.java:861) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectory(HistoryFileManager.java:656) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanDirectoryForHistoryFiles(HistoryFileManager.java:668) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:722) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.access$300(HistoryFileManager.java:77) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir.scanIfNeeded(HistoryFileManager.java:275) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:708) at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.getFileInfo(HistoryFileManager.java:847) at org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage.getFullJob(CachedHistoryStorage.java:107) at org.apache.hadoop.mapreduce.v2.hs.JobHistory.getJob(JobHistory.java:207) at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler$1.run(HistoryClientService.java:200) at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler$1.run(HistoryClientService.java:196) at java.security.AccessController.doPrivileged(Native Method) at
javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.verifyAndGetJob(HistoryClientService.java:196) at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.getJobReport(HistoryClientService.java:228) at org.apache.hadoop.mapreduce.v2.api.impl.pb.service.MRClientProtocolPBServiceImpl.getJobReport(MRClientProtocolPBServiceImpl.java:122) at org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.callBlockingMethod(MRClientProtocol.java:275) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2053) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2047)
Re: Shutdown hook for FileSystems
What happens when you remove the shutdown hook? Is that supposed to trigger an exception?
Re: What is the difference between Hdfs and DistributedFileSystem?
Yes, yes, and YES! The use of alternative file systems to HDFS is an exciting part of the hadoop ecosystem, allowing us to plug mapreduce applications into different storage backends. Lots of folks in the hadoop community are working hard to democratize storage on Hadoop. Take a moment to scan some of these articles to get an idea of how modular the hadoop stack really is, and how broad the ecosystem is in terms of storage backends. http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/s3/S3FileSystem.html http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/RawLocalFileSystem.html http://answers.oreilly.com/topic/456-get-to-know-hadoop-filesystems/ https://wiki.apache.org/hadoop/HCFS http://www.gluster.org/2013/10/automated-hadoop-deployment-on-glusterfs-with-apache-ambari/ http://www.redhat.com/about/news/archive/2013/10/red-hat-contributes-apache-hadoop-plug-in-to-the-gluster-community On Mon, Jan 13, 2014 at 12:10 PM, Michael sjp120...@gmail.com wrote: HDFS is an implementation of a distributed file system. There can be other implementations of a generic distributed file system (e.g. the Google File System, GFS). On 13 January 2014 17:01, 梁李印 liyin.lian...@aliyun-inc.com wrote: What is the difference between Hdfs.java and DistributedFileSystem.java in Hadoop2? Best Regards, Liyin Liang Tel: 78233 Email: liyin.lian...@alibaba-inc.com -- Jay Vyas http://jayunit100.blogspot.com
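As a concrete illustration of that pluggability: pointing a cluster at an alternative Hadoop-compatible file system is largely a matter of core-site.xml configuration. The property values below follow the style of the glusterfs plug-in linked above, but treat the exact scheme and class names as illustrative rather than authoritative:

```xml
<!-- core-site.xml sketch: values are illustrative, in the style of a
     glusterfs-hadoop plug-in configuration -->
<property>
  <name>fs.defaultFS</name>
  <value>glusterfs:///</value>
</property>
<property>
  <name>fs.glusterfs.impl</name>
  <value>org.apache.hadoop.fs.glusterfs.GlusterFileSystem</value>
</property>
```

Hadoop resolves a URI scheme to a FileSystem implementation through fs.SCHEME.impl-style properties, which is exactly what lets mapreduce jobs run unmodified against a non-HDFS backend.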
Re: Ways to manage user accounts on hadoop cluster when using kerberos security
I recently found a pretty simple and easy way to set up ldap for my machines on rhel and wrote it up, using jumpbox and authconfig. If you are in the cloud and only need a quick, easy ldap and nsswitch setup, this is, I think, the easiest / cheapest way to do it. I know rhel and fedora come with authconfig; not sure about the other Linux distros: http://jayunit100.blogspot.com/2013/12/an-easy-way-to-centralize.html?m=1 On Jan 8, 2014, at 1:22 PM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote: On Jan 7, 2014, at 2:55 PM, Manoj Samel manoj.sa...@gmail.com wrote: I am assuming that if the users are in LDAP, using PAM for LDAP can solve the issue. That's how I've seen this issue addressed. +Vinod CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.
Re: Debug Hadoop Junit Test in Eclipse
Excellent question. It's not trivial to debug a distributed app in eclipse, but it is totally doable using javaagents. We've written it up here: http://jayunit100.blogspot.com/2013/07/deep-dive-into-hadoop-with-bigtop-and.html FYI cc Brad Childs (https://github.com/childsb) at Red Hat has helped me with the tutorial; he might have some extra advice also (cc'd on this email). I've written up one way to do this using the bigtop VMs there. On Mon, Dec 16, 2013 at 8:07 AM, Karim Awara karim.aw...@kaust.edu.sa wrote: Hi, I want to trace how a file upload (-put) happens in hadoop, so I'm junit testing TestDFSShell.java. When I try to debug the test, it fails due to a test timed out exception. I believe this is because I am trying to stop one thread while the rest are working. I have changed the breakpoint property to suspend the VM, but still the same problem. How can I trace calls made by the datanode/namenode when running the TestDFSShell.java junit test through eclipse? I am using hadoop 2.2.0 -- Best Regards, Karim Ahmed Awara -- This message and its contents, including attachments are intended solely for the original recipient. If you are not the intended recipient or have received this message in error, please notify me immediately and delete this message from your computer system. Any unauthorized use or distribution is prohibited. Please consider the environment before printing this email. -- Jay Vyas http://jayunit100.blogspot.com
Re: Debug Hadoop Junit Test in Eclipse
In that case I guess you will have to statically trace the code yourself. On Mon, Dec 16, 2013 at 10:32 AM, Karim Awara karim.aw...@kaust.edu.sa wrote: Useful post; however, I am not trying to debug mapreduce programs with their associated VMs. I want to modify the HDFS source code for how it uploads files, so I am only looking to trace fs commands through the DFS shell. I believe this should require less work in debugging than actually going to the mapred VMs! -- Best Regards, Karim Ahmed Awara On Mon, Dec 16, 2013 at 5:57 PM, Jay Vyas jayunit...@gmail.com wrote: Excellent question. It's not trivial to debug a distributed app in eclipse, but it is totally doable using javaagents. We've written it up here: http://jayunit100.blogspot.com/2013/07/deep-dive-into-hadoop-with-bigtop-and.html FYI cc Brad Childs (https://github.com/childsb) at Red Hat has helped me with the tutorial; he might have some extra advice also (cc'd on this email). I've written up one way to do this using the bigtop VMs there. On Mon, Dec 16, 2013 at 8:07 AM, Karim Awara karim.aw...@kaust.edu.sa wrote: Hi, I want to trace how a file upload (-put) happens in hadoop, so I'm junit testing TestDFSShell.java. When I try to debug the test, it fails due to a test timed out exception. I believe this is because I am trying to stop one thread while the rest are working. I have changed the breakpoint property to suspend the VM, but still the same problem. How can I trace calls made by the datanode/namenode when running the TestDFSShell.java junit test through eclipse? I am using hadoop 2.2.0 -- Best Regards, Karim Ahmed Awara -- This message and its contents, including attachments are intended solely for the original recipient. If you are not the intended recipient or have received this message in error, please notify me immediately and delete this message from your computer system. Any unauthorized use or distribution is prohibited. Please consider the environment before printing this email.
-- Jay Vyas http://jayunit100.blogspot.com
Pluggable distributed cache impl
Are there any ways to plug in an alternate distributed cache implementation (i.e., when nodes of a cluster already have an NFS mount or other local data service...)?
Re: multiusers in hadoop through LDAP
So, not knowing much about LDAP, but being very interested in the multiuser problem on multiuser filesystems, I was excited to see this question. I'm researching the same thing at the moment, and it seems complicated by the fact that: - the FileSystem API itself provides implementations for getting group and user names / permissions. And furthermore - the Linux task controllers launch jobs as the user submitting the job, whereas the regular task controllers launch tasks under the YARN daemon name, IIRC. So where does LDAP begin and the TaskController / FileSystem notions of ownership end? I guess I'm also asking what the entities are which are ownable in a hadoop app, and how we can leverage the GroupMappingServiceProviders to deploy more flexible hadoop environments. Any thoughts on this would be appreciated. On Tue, Dec 10, 2013 at 6:38 PM, Adam Kawa kawa.a...@gmail.com wrote: Please have a look at the hadoop.security.group.mapping.ldap.* settings as Hardik Pandya suggests. In advance, just to share our story related to LDAP + hadoop.security.group.mapping.ldap.*, if you run into the same limitation as we did: In many cases hadoop.security.group.mapping.ldap.* should solve your problem. Unfortunately, they did not work for us. The problematic setting relates to an additional filter to use when searching for LDAP groups. We wanted to use a posixGroups filter, but it is currently not supported by Hadoop. Finally, we found a workaround using the name service switch configuration, where we specified that LDAP should be the primary source of information about the groups of our users. This means that we solved this problem at the operating system level, not at the Hadoop level. You can read more about this issue here: http://hakunamapdata.com/a-user-having-surprising-troubles-running-more-resource-intensive-hive-queries/ and here http://www.slideshare.net/AdamKawa/hadoop-adventures-at-spotify-strata-conference-hadoop-world-2013 (slides 18-26).
2013/12/10 Hardik Pandya smarty.ju...@gmail.com: Have you looked at hadoop.security.group.mapping.ldap.* in hadoop-common/core-default.xml (http://hadoop.apache.org/docs/current2/hadoop-project-dist/hadoop-common/core-default.xml)? This additional resource may also help: http://hakunamapdata.com/a-user-having-surprising-troubles-running-more-resource-intensive-hive-queries/ On Tue, Dec 10, 2013 at 3:06 AM, YouPeng Yang yypvsxf19870...@gmail.com wrote: Hi, In my cluster, I want to have multiple users for different purposes. The usual method is to add a user through the OS on the Hadoop NameNode. I notice hadoop also supports LDAP; could I add a user through LDAP instead of through the OS, so that a user authenticated by LDAP can also access their HDFS directory? Regards -- Jay Vyas http://jayunit100.blogspot.com
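For reference, a minimal core-site.xml sketch of the hadoop.security.group.mapping.ldap.* settings discussed in this thread. The server URL, bind user, and base DN below are placeholders you would replace with your own directory's values; the group filter shown is the default, and (as Adam notes above) a posixGroups filter is the case that was not supported:

```xml
<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.LdapGroupsMapping</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.url</name>
  <value>ldap://ldap.example.com:389</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.bind.user</name>
  <value>cn=hadoop,ou=services,dc=example,dc=com</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.base</name>
  <value>dc=example,dc=com</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.search.filter.group</name>
  <value>(objectClass=group)</value>
</property>
```

See core-default.xml (linked above) for the full list of ldap.* properties and their defaults.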
Write a file to local disks on all nodes of a YARN cluster.
I want to put a file on all nodes of my cluster so that it is locally readable (not in HDFS). Assuming that I can't guarantee a FUSE mount or NFS or anything of the sort on my cluster, is there a poor man's way to do this in YARN? Something like: for each node n in cluster: n.copyToLocal(a, /tmp/a); So that afterwards, all nodes in the cluster have a file a in /tmp. -- Jay Vyas http://jayunit100.blogspot.com
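The loop sketched above could look like this as a shell one-liner, assuming passwordless ssh and a slaves file listing one hostname per line (both assumptions on my part, not something YARN provides):

```shell
# naive push of ./a to /tmp/a on every node listed in slaves
for n in $(cat slaves); do
  scp ./a "$n:/tmp/a"
done
```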
FSMainOperations FSContract tests?
Mainly @steveloughran: Is it safe to say that *old* fs semantics are in the FSContract tests, and *new* fs semantics in the FSMainOps tests? I ask because your Swift filesystem work has tests which used the FSContract libs as well as the FSMainOps ones. Not sure why you need both? There is pretty high redundancy, it seems.
Re: how to prevent JAVA HEAP OOM happen in shuffle process in a MR job?
The version is really important here. - If 1.x, then where (NN, JT, TT?) - If 2.x, then where? (AM, NM, ...?) -- probably less likely here, since the resources are ephemeral. I know that some older 1.x versions had an issue with the jobtracker having an ever-expanding hashmap or something like that, so that if you ran 100s of jobs, you could get OOM errors on the JobTracker.
Re: Hadoop Test libraries: Where did they go ?
Yup, we figured it out eventually. The artifacts now use the test-jar directive, which creates a jar file that you can reference in mvn using the type tag in your dependencies. However, FYI, I haven't been able to successfully google for the quintessential classes in the hadoop test libs like the fs BaseContractTest by name, so they are now harder to find than before. So I think it's unfortunate that they are not a top-level maven artifact. It's misleading, as it's now very easy to assume from looking at hadoop in mvn central that hadoop-test is just an old library that nobody updates anymore. Just a thought, but maybe hadoop-test could be rejuvenated to point to hadoop-common somehow? On Nov 25, 2013, at 4:52 AM, Steve Loughran ste...@hortonworks.com wrote: I see a hadoop-common-2.2.0-tests.jar in org.apache.hadoop/hadoop-common; SHA1 a9994d261d00295040a402cd2f611a2bac23972a, which resolves in a search engine to http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-common/2.2.0/ It looks like it is now part of the hadoop-common artifacts; you just say you want the test bits: http://maven.apache.org/guides/mini/guide-attached-tests.html On 21 November 2013 23:28, Jay Vyas jayunit...@gmail.com wrote: It appears to me that http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-test is no longer updated. Where does hadoop now package the test libraries? Looking in the .//hadoop-common-project/hadoop-common/pom.xml file in the hadoop 2.x branches, I'm not sure whether or not src/test is packaged into a jar anymore... but I fear it is not. -- Jay Vyas http://jayunit100.blogspot.com -- CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law.
If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.
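For anyone landing on this thread later, the test-jar dependency Steve describes looks like this in a pom.xml (using version 2.2.0 as discussed above):

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.2.0</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
```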
Hadoop Test libraries: Where did they go ?
It appears to me that http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-test is no longer updated. Where does hadoop now package the test libraries? Looking in the .//hadoop-common-project/hadoop-common/pom.xml file in the hadoop 2.x branches, I'm not sure whether or not src/test is packaged into a jar anymore... but I fear it is not. -- Jay Vyas http://jayunit100.blogspot.com
Re: how to stream the video from hdfs
I believe there is a FUSE mount for HDFS which will allow you to open files normally in your streaming app rather than requiring the Java API. Also consider that for media and highly available binary data for a front end, HDFS might be overkill because of the block/NameNode requirement... If HDFS is not required but you still want a hadoop-compatible DFS, you could also try Gluster, which may be a little better suited for read-only, unblocked data for streaming from a front end. On Nov 13, 2013, at 12:50 AM, mallik arjun mallik.cl...@gmail.com wrote: Dear Jens, I want to put the videos into hdfs, and then I want to stream those videos to a PHP front end. On Tue, Nov 12, 2013 at 11:50 PM, Jens Scheidtmann jens.scheidtm...@gmail.com wrote: Dear Mallik, Please tell us what you are trying to accomplish; maybe then somebody is able to help you... Jens On Monday, 11 November 2013, mallik arjun wrote: hi all, how to stream the video from hdfs.
YARN And NTP
Hi folks. Is there a way to make YARN more forgiving with last modification times? I'm hitting the following exception from org.apache.hadoop.yarn.util.FSDownload: "changed on src filesystem (expected " + resource.getTimestamp() + ", was " + sStat.getModificationTime()); I realize that the times should be the same, but depending on the underlying filesystem, the semantics of this last modified time might vary. Any thoughts on this? -- Jay Vyas http://jayunit100.blogspot.com
Re: Uploading a file to HDFS
I've diagrammed the hadoop HDFS write path here: http://jayunit100.blogspot.com/2013/04/the-kv-pair-salmon-run-in-mapreduce-hdfs.html On Tue, Oct 1, 2013 at 5:24 PM, Ravi Prakash ravi...@ymail.com wrote: Karim! Look at DFSOutputStream.java:DataStreamer HTH Ravi -- From: Karim Awara karim.aw...@kaust.edu.sa To: user user@hadoop.apache.org Sent: Thursday, September 26, 2013 7:51 AM Subject: Re: Uploading a file to HDFS Thanks for the reply. When the client caches 64KB of data on its own side, do you know which set of major Java classes/files is responsible for that action? -- Best Regards, Karim Ahmed Awara On Thu, Sep 26, 2013 at 2:25 PM, Jitendra Yadav jeetuyadav200...@gmail.com wrote: Case 2: While selecting a target DN for write operations, the NN will always prefer the DN from which the client is sending the data; in some cases the NN ignores that DN when there are disk space issues or other health symptoms. The rest works the same. Thanks Jitendra On Thu, Sep 26, 2013 at 4:15 PM, Shekhar Sharma shekhar2...@gmail.com wrote: It's not the namenode that does the reading or breaking of the file. When you run the command hadoop fs -put input output, hadoop is a script file which is the default client for hadoop. When the client contacts the namenode for writing, the NN creates a block id and asks 3 DNs to host the block (replication factor of 3), and this information is sent to the client. The client caches 64KB of data on its own side and then pushes the data to the DN, and this data gets pushed through the pipeline. This process gets repeated until 64MB of data is written, and if the client wants to write more it will again contact the NN via a heartbeat signal, and this process continues... Check "how does writing happen in HDFS?" Regards, Som Shekhar Sharma +91-8197243810 On Thu, Sep 26, 2013 at 3:41 PM, Karim Awara karim.aw...@kaust.edu.sa wrote: Hi, I have a couple of questions about the process of uploading a large file (10GB) to HDFS.
To make sure my understanding is correct, assume I have a cluster of N machines. What happens in the following: Case 1: assuming I want to upload a file (input.txt) of size K GB that resides on the local disk of machine 1 (which happens to be the namenode only): if I am running the command -put input.txt {some hdfs dir} from the namenode (assuming it does not play the datanode role), will the namenode read the first 64MB into a temporary pipe and then transfer it to one of the cluster datanodes once finished? Or does the namenode not do any reading of the file, but rather ask a certain datanode to read the 64MB window from the file remotely? Case 2: assume machine 1 is the namenode, but I run the -put command from machine 3 (which is a datanode). Who will start reading the file? -- Best Regards, Karim Ahmed Awara -- Jay Vyas http://jayunit100.blogspot.com
Re: Retrieve and compute input splits
Technically, the block locations are provided by the InputSplit, which in the FileInputFormat case is provided by the FileSystem interface. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/InputSplit.html The thing to realize here is that the FileSystem implementation is provided at runtime - so the InputSplit class is responsible for creating a FileSystem implementation using reflection, and then calling getBlockLocations on a given file or set of files which the input split corresponds to. I think your confusion here lies in the fact that the input splits create a filesystem; however, they don't know what the filesystem implementation actually is - they only rely on the abstract contract, which provides a set of block locations. See the FileSystem abstract class for details on that. On Fri, Sep 27, 2013 at 7:02 PM, Peyman Mohajerian mohaj...@gmail.com wrote: For the JobClient to compute the input splits, doesn't it need to contact the NameNode? Only the NameNode knows where the splits are; how can it compute them without that additional call? On Fri, Sep 27, 2013 at 1:41 AM, Sonal Goyal sonalgoy...@gmail.com wrote: The input splits are not copied; only the information on the location of the splits is copied to the jobtracker so that it can assign tasktrackers which are local to the split. Check the Job Initialization section at http://answers.oreilly.com/topic/459-anatomy-of-a-mapreduce-job-run-with-hadoop/ To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the JobClient from the shared filesystem (step 6). It then creates one map task for each split. The number of reduce tasks to create is determined by the mapred.reduce.tasks property in the JobConf, which is set by the setNumReduceTasks() method, and the scheduler simply creates this number of reduce tasks to be run. Tasks are given IDs at this point.
Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Sep 27, 2013 at 10:55 AM, Sai Sai saigr...@yahoo.in wrote: Hi, I have attached the anatomy of MR from the Definitive Guide. In step 6 it says the JT/Scheduler retrieves input splits computed by the client from HDFS. The above line refers to the client computing input splits. 1. Why does the JT/Scheduler retrieve the input splits and what does it do? If it is retrieving the input split, does this mean it goes to the block and reads each record and gets the record back to the JT? If so, this is a lot of data movement for large files, which is not data locality, so I'm getting confused. 2. How does the client know how to calculate the input splits? Any help please. Thanks Sai -- Jay Vyas http://jayunit100.blogspot.com
Re: Extending DFSInputStream class
This is actually somewhat common in some of the hadoop core classes: private constructors and inner classes. I think in the long term JIRAs should be opened for these to make them public and pluggable with public parameterized constructors wherever possible, so that modularizations can be provided. On Thu, Sep 26, 2013 at 10:46 AM, Rob Blah tmp5...@gmail.com wrote: Hi, I would like to wrap DFSInputStream by extension. However, it seems that the DFSInputStream constructor is package private. Is there any way to achieve my goal? Also, just out of curiosity, why have you made this class inaccessible to developers, or am I missing something? regards tmp -- Jay Vyas http://jayunit100.blogspot.com
Re: Extending DFSInputStream class
The way we have gotten around this in the past is extending and then copying the private code and creating a brand new implementation. -- Jay Vyas http://jayunit100.blogspot.com
Re: Concatenate multiple sequence files into 1 big sequence file
IIRC, sequence files can be concatenated as-is and read as one large file, but maybe I'm forgetting something.
RawLocalFileSystem, getPos and NullPointerException
What is the correct behaviour for getPos in a record reader, and how should it behave when the underlying stream is null? It appears this can happen in the RawLocalFileSystem. Not sure if it's implemented more safely in DistributedFileSystem just yet. I've found that getPos in the RawLocalFileSystem's input stream can throw a NullPointerException if its underlying stream is closed. I discovered this when playing with a custom record reader. To patch it, I simply check if a call to stream.available() throws an exception, and if so, I return 0 in the getPos() function. The existing getPos() implementation is found here: https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20/src/examples/org/apache/hadoop/examples/MultiFileWordCount.java What should be the correct behaviour of getPos() in the RecordReader? http://stackoverflow.com/questions/18708832/hadoop-rawlocalfilesystem-and-getpos -- Jay Vyas http://jayunit100.blogspot.com
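A self-contained sketch of the workaround described above, using plain java.io rather than the actual Hadoop classes (SafePos and its fields are hypothetical names for illustration): getPos() probes the stream with available() and returns 0 if that throws, instead of letting a NullPointerException escape.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration (SafePos is a made-up name, not a Hadoop class):
// guard getPos() against a dead underlying stream by probing it with
// available(), which throws IOException on closed file-backed streams.
public class SafePos {
    private final InputStream in;
    private long pos;

    public SafePos(InputStream in) {
        this.in = in;
    }

    public int read() throws IOException {
        int b = in.read();
        if (b >= 0) {
            pos++;  // track the byte offset ourselves
        }
        return b;
    }

    // Returns the current byte offset, or 0 if the stream is unusable.
    public long getPos() {
        try {
            in.available();  // probe: throws on closed file-backed streams
            return pos;
        } catch (IOException e) {
            return 0L;
        }
    }

    public static void main(String[] args) throws IOException {
        SafePos s = new SafePos(new ByteArrayInputStream(new byte[]{1, 2, 3}));
        s.read();
        s.read();
        System.out.println(s.getPos()); // prints 2
    }
}
```

Whether returning 0 (versus rethrowing as IOException) is the "correct" contract is exactly the open question above; this just shows the shape of the patch.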
Re: hadoop cares about /etc/hosts ?
Jitendra: When you say "check your masters file content", what are you referring to? On Mon, Sep 9, 2013 at 8:31 AM, Jitendra Yadav jeetuyadav200...@gmail.com wrote: Also can you please check your masters file content in the hadoop conf directory? Regards Jitendra On Mon, Sep 9, 2013 at 5:11 PM, Olivier Renault orena...@hortonworks.com wrote: Could you confirm that you put the hash in front of 192.168.6.10 localhost? It should look like: # 192.168.6.10 localhost Thanks Olivier On 9 Sep 2013 12:31, Cipher Chen cipher.chen2...@gmail.com wrote: Hi everyone, I have solved a configuration problem due to myself in hadoop cluster mode. I had this configuration: <property> <name>fs.default.name</name> <value>hdfs://master:54310</value> </property> and the hosts file /etc/hosts: 127.0.0.1 localhost 192.168.6.10 localhost ### 192.168.6.10 tulip master 192.168.6.5 violet slave and when I was trying to start-dfs.sh, the namenode failed to start. The namenode log hinted that: 13/09/09 17:09:02 INFO namenode.NameNode: Namenode up at: localhost/192.168.6.10:54310 ... 13/09/09 17:09:10 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithF [the same retry line repeats through "Already tried 9 time(s)"] ... Now I know deleting the line 192.168.6.10 localhost ### would fix this. But I still don't know why hadoop would resolve master to localhost/127.0.0.1. It seems http://blog.devving.com/why-does-hbase-care-about-etchosts/ explains this, but I'm not quite sure. Is there any other explanation for this? Thanks. -- Cipher Chen -- Jay Vyas http://jayunit100.blogspot.com
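For the archives, the working /etc/hosts from this thread ends up looking like the fragment below; the key point is that 192.168.6.10 maps only to the real hostnames, never to localhost, so the NameNode binds to the routable address instead of 127.0.0.1:

```
127.0.0.1    localhost
192.168.6.10 tulip master
192.168.6.5  violet slave
```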
MultiFileLineRecordReader vs CombineFileRecordReader
I've found that there are two different implementations of seemingly the same class for implementing RecordReaders for the MultiFileWordCount class: MultiFileLineRecordReader (implemented as an inner class in some versions of MultiFileWordCount) and CombineFileRecordReader. Is there any major difference between these classes, and why the redundancy? I'm thinking maybe one was retroactively added at some point, based on some git detective work which I tried... but I figured it might just be easier to ask here :) -- Jay Vyas http://jayunit100.blogspot.com
examples of HADOOP REST API
Hi, it appears that there are some completed JIRAs for the Hadoop REST services for monitoring via HTTP calls. Are there any examples of these in use? I don't see any docs on the URLs over which the hadoop REST API publishes cluster information. I'm assuming also that there might be some overlap between this and the Ambari REST services, but I'm not sure where to start digging. I want to run some REST calls at the end of some jobs to query how many tasks failed, etc... Hopefully, I could get this in JSON rather than scraping HTML. Thanks! -- Jay Vyas http://jayunit100.blogspot.com
Re: e-Science app on Hadoop
There are literally hundreds. Here is a great review article on how mapreduce is used in the bioinformatics algorithms space: http://www.biomedcentral.com/1471-2105/11/S12/S1 On Fri, Aug 16, 2013 at 3:38 PM, Felipe Gutierrez felipe.o.gutier...@gmail.com wrote: Hello, Does anybody know an e-Science application to run on Hadoop? Thanks. Felipe -- Felipe Oliveira Gutierrez -- felipe.o.gutier...@gmail.com -- https://sites.google.com/site/lipe82/Home/diaadia -- Jay Vyas http://jayunit100.blogspot.com
Mapred.system.dir: should JT start without it?
Is there a startup contract for mapreduce to make its own mapred.system.dir? Also, it seems that the jobtracker can start up even if this directory was not created / doesn't exist - I'm thinking that if that's the case, the JT should fail up front.
Re: Why LineRecordWriter.write(..) is synchronized
Then is this a bug? Synchronization in the absence of any race condition is normally considered bad. In any case I'd like to know why this writer is synchronized whereas the other ones are not. That is, I think, the point at issue: either the other writers should be synchronized or else this one shouldn't be - consistency across the write implementations is probably desirable so that changes to output formats or record writers don't lead to bugs in multithreaded environments. On Aug 8, 2013, at 6:50 AM, Harsh J ha...@cloudera.com wrote: While we don't fork by default, we do provide a MultithreadedMapper implementation that would require such synchronization. But if you are asking is it necessary, then perhaps the answer is no. On Aug 8, 2013 3:43 PM, Azuryy Yu azury...@gmail.com wrote: It's not hadoop-forked threads; we may create a line record writer, then call this writer concurrently. On Aug 8, 2013 4:00 PM, Sathwik B P sathwik...@gmail.com wrote: Hi, Thanks for your reply. May I know where hadoop forks multiple threads to use a single RecordWriter? regards, sathwik On Thu, Aug 8, 2013 at 7:06 AM, Azuryy Yu azury...@gmail.com wrote: Because we may use multiple threads to write a single file. On Aug 8, 2013 2:54 PM, Sathwik B P sath...@apache.org wrote: Hi, LineRecordWriter.write(..) is synchronized. I did not find any other RecordWriter implementations that define write as synchronized. Any specific reason for this? regards, sathwik
Re: solr -Reg
True that it deserves a posting on the Solr lists, but I think it's still partially relevant here... The SolrInputFormat and SolrOutputFormat handle this for you and will be used in your map reduce jobs. They will output one core per reducer, where each reducer corresponds to a core. This is necessary since all indices are stored locally per core. Remember that even though you might be able to create shards from several terabytes easily in hadoop, hosting them will require some very high performance hardware. On Jul 28, 2013, at 1:11 PM, Harsh J ha...@cloudera.com wrote: The best place to ask questions pertaining to Solr would be on its own lists. Head over to http://lucene.apache.org/solr/discussion.html On Sun, Jul 28, 2013 at 11:37 AM, Venkatarami Netla venkatarami.ne...@cloudwick.com wrote: Hi, I am a beginner with Solr, so please explain step by step how to use Solr with HDFS and map reduce. Thanks, Regards -- N Venkata Rami Reddy -- Harsh J
Re: Staging directory ENOTDIR error.
This was a very odd error - it turns out that I had created a file called tmp in my fs root directory, which meant that when the jobs were trying to write to the tmp directory, they ran into the not-a-dir exception. In any case, I think the error reporting in the NativeIO class should be revised. On Thu, Jul 11, 2013 at 10:24 PM, Devaraj k devara...@huawei.com wrote: Hi Jay, Here the client is trying to create a staging directory in the local file system, which actually should be created in HDFS. Could you check whether you have configured "fs.defaultFS" in the client to point at HDFS? Thanks, Devaraj k From: Jay Vyas [mailto:jayunit...@gmail.com] Sent: 12 July 2013 04:12 To: common-u...@hadoop.apache.org Subject: Staging directory ENOTDIR error. Hi, I'm getting an ungoogleable exception, never seen this before. This is on a hadoop 1.1 cluster... It appears that it's permissions related... Any thoughts as to how this could crop up? I assume it's a bug in my filesystem, but not sure. 13/07/11 18:39:43 ERROR security.UserGroupInformation: PriviledgedActionException as:root cause:ENOTDIR: Not a directory ENOTDIR: Not a directory at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method) at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:699) at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:654) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189) at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116) -- Jay Vyas http://jayunit100.blogspot.com
Re: CompositeInputFormat
Map-side joins will use the CompositeInputFormat. They will only really be worth doing if one data set is small and the other is large. This is a good example: http://www.congiu.com/joins-in-hadoop-using-compositeinputformat/ The trick is to google for CompositeInputFormat.compose() :) On Thu, Jul 11, 2013 at 5:02 PM, Botelho, Andrew andrew.bote...@emc.com wrote: Hi, I want to perform a JOIN on two sets of data with Hadoop. I read that the class CompositeInputFormat can be used to perform joins on data, but I can't find any examples of how to do it. Could someone help me out? It would be much appreciated. Thanks in advance, Andrew -- Jay Vyas http://jayunit100.blogspot.com
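A hedged sketch of what the compose() call looks like with the old mapred API (the paths and input format here are made up for illustration; see the example link above for a full runnable job):

```java
// Sketch only: an inner join of two sorted, identically-partitioned inputs.
JobConf conf = new JobConf();
conf.setInputFormat(CompositeInputFormat.class);
conf.set("mapred.join.expr", CompositeInputFormat.compose(
    "inner",                          // join type: inner, outer, override...
    KeyValueTextInputFormat.class,    // format of each child dataset
    new Path("/data/left"),           // hypothetical inputs; must be sorted
    new Path("/data/right")));        // and partitioned identically
```

Note the precondition that makes map-side joins cheap: both inputs must already be sorted on the join key and split into the same number of partitions, otherwise the framework can't line the records up mapper-side.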
Staging directory ENOTDIR error.
Hi, I'm getting an ungoogleable exception, never seen this before. This is on a hadoop 1.1 cluster... It appears that it's permissions related... Any thoughts as to how this could crop up? I assume it's a bug in my filesystem, but not sure. 13/07/11 18:39:43 ERROR security.UserGroupInformation: PriviledgedActionException as:root cause:ENOTDIR: Not a directory ENOTDIR: Not a directory at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method) at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:699) at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:654) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189) at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116) -- Jay Vyas http://jayunit100.blogspot.com
Data node EPERM not permitted.
Hi: I've mounted my own ext4 disk on /mnt/sdb and chmodded it to 777. However, when starting the data node: /etc/init.d/hadoop-hdfs-datanode start I get the following error in my logs (bottom of this message). What is the EPERM error caused by, and how can I reproduce it? I'm assuming that, since the directory permissions are recursively set to 777, there shouldn't be a way that this error could occur, unless somewhere intermittently the directory permissions are being changed by hdfs to the wrong thing. 2013-07-06 15:54:13,968 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Invalid dfs.datanode.data.dir /mnt/sdb/hadoop-hdfs/cache/hdfs/dfs/data : EPERM: Operation not permitted at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:605) at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:439) at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsAndPermissionCheck(DiskChecker.java:138) at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:154) at org.apache.hadoop.hdfs.server.datanode.DataNode.getDataDirsFromURIs(DataNode.java:1659) at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1638) at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1575) at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1598) at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1751) at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1772) -- Jay Vyas http://jayunit100.blogspot.com
starting Hadoop, the new way
Hi: Is there a hadoop 2.0 tutorial for 1.0 people? I'm used to running start-all.sh, but it appears that the new MR2 version of hadoop is much more sophisticated. In any case, I'm wondering what the standard way to start the new generation of hadoop/mr2 hadoop/mapreduce and hadoop/hdfs is, and whether I need to set any particular env variables when doing so. -- Jay Vyas http://jayunit100.blogspot.com
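For what it's worth, the rough 2.x equivalents of start-all.sh are the sbin scripts below, assuming HADOOP_HOME is set and HADOOP_CONF_DIR already points at your configs (start-all.sh still ships in 2.x but is deprecated):

```shell
# HDFS daemons (NameNode, DataNodes, SecondaryNameNode)
$HADOOP_HOME/sbin/start-dfs.sh
# YARN daemons (ResourceManager, NodeManagers) - replaces JobTracker/TaskTrackers
$HADOOP_HOME/sbin/start-yarn.sh
# MapReduce JobHistory server (optional)
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
```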
Re: HDFS interfaces
Looking in the source, it appears that in HDFS the NameNode supports getting this info directly via the client, and ultimately communicates block locations to the DFSClient, which is used by the DistributedFileSystem:

    /**
     * @see ClientProtocol#getBlockLocations(String, long, long)
     */
    static LocatedBlocks callGetBlockLocations(ClientProtocol namenode,
        String src, long start, long length) throws IOException {
      try {
        return namenode.getBlockLocations(src, start, length);
      } catch (RemoteException re) {
        throw re.unwrapRemoteException(AccessControlException.class,
            FileNotFoundException.class,
            UnresolvedPathException.class);
      }
    }

On Tue, Jun 4, 2013 at 2:00 AM, Mahmood Naderan nt_mahm...@yahoo.com wrote: There are many instances of getFileBlockLocations in hadoop/fs. Can you explain which one is the main one? "It must be combined with a method of logically splitting the input data along block boundaries, and of launching tasks on worker nodes that are close to the data splits" Is this a user level task or a system level task? Regards, Mahmood

From: John Lilley john.lil...@redpoint.net
To: user@hadoop.apache.org; Mahmood Naderan nt_mahm...@yahoo.com
Sent: Tuesday, June 4, 2013 3:28 AM
Subject: RE: HDFS interfaces

Mahmood, It is in the FileSystem interface: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#getFileBlockLocations(org.apache.hadoop.fs.Path, long, long) This by itself is not sufficient for application programmers to make good use of data locality. It must be combined with a method of logically splitting the input data along block boundaries, and of launching tasks on worker nodes that are close to the data splits. MapReduce does both of these things internally along with the file-format input classes.
For an application to do so directly, see the new YARN-based interfaces ApplicationMaster and ResourceManager. These are, however, very new, and there is little documentation and few examples. john

From: Mahmood Naderan [mailto:nt_mahm...@yahoo.com]
Sent: Monday, June 03, 2013 12:09 PM
To: user@hadoop.apache.org
Subject: HDFS interfaces

Hello, It is stated in the HDFS architecture guide ( https://hadoop.apache.org/docs/r1.0.4/hdfs_design.html) that "HDFS provides interfaces for applications to move themselves closer to where the data is located." What are these interfaces, and where are they in the source code? Is there any manual for the interfaces? Regards, Mahmood -- Jay Vyas http://jayunit100.blogspot.com
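To make the locality idea above concrete: conceptually, a client maps a byte range of a file onto block indices, then asks the NameNode which hosts hold those blocks. Below is a toy, Hadoop-free sketch of just the byte-range-to-block arithmetic; the class and method names are made up for illustration and do not exist in Hadoop.

```java
// Toy sketch (no Hadoop dependencies): which blocks does a byte range touch?
// Bytes [start, start+length) of a file span the blocks whose indices fall
// in [start/blockSize, (start+length-1)/blockSize].
public class BlockRangeSketch {
    static long firstBlock(long start, long blockSize) {
        return start / blockSize;
    }

    static long lastBlock(long start, long length, long blockSize) {
        return (start + length - 1) / blockSize;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;  // a common HDFS block size
        // A 200MB read starting at byte 0 spans blocks 0 and 1.
        System.out.println(firstBlock(0, blockSize));                    // 0
        System.out.println(lastBlock(0, 200L * 1024 * 1024, blockSize)); // 1
    }
}
```

Given those block indices, getFileBlockLocations is what tells you which datanodes host each block, so a scheduler can place a task near its split.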
Re: Install hadoop on multiple VMs in 1 laptop like a cluster
Just FYI, if you are on linux, KVM and kickstart are really good for this as well, and we have some kickstart Fedora 16 hadoop setup scripts I can share to spin up a cluster of several VMs on the fly with static IPs (that, to me, is usually the tricky part with hadoop VM cluster setup - setting up the VMs with static IP addresses, getting the nodes to talk / ssh to each other, and consistently defining the slaves file). But if you are stuck with VMware, then I believe VMware also has a vagrant plugin now, which will be much easier for you to maintain. Manually cloning machines doesn't scale well when you want to rebuild your cluster. On Fri, May 31, 2013 at 10:56 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi Sai Sai, You can take a look at this also: http://goo.gl/iXzae I just did that yesterday for some other folks I'm working with. Maybe not the best way, but working like a charm. JM 2013/5/31 shashwat shriparv dwivedishash...@gmail.com: Try this http://www.youtube.com/watch?v=gIRubPl20oo - there will be three videos, 1-3; watch them and you can do what you need to do. Thanks Regards ∞ Shashwat Shriparv On Fri, May 31, 2013 at 5:52 PM, Jitendra Yadav jeetuyadav200...@gmail.com wrote: Hi, You can create a clone machine from an existing virtual machine in VMware and then run it as a separate virtual machine. http://www.vmware.com/support/ws55/doc/ws_clone_new_wizard.html After installing, you have to make sure that all the virtual machines are set up with the correct network configuration so that they can ping each other (you should use Host-only network settings in the network configuration). I hope this will help you. Regards Jitendra On Fri, May 31, 2013 at 5:23 PM, Sai Sai saigr...@yahoo.in wrote: Just wondering if anyone has any documentation or references to any articles on how to simulate a multi node cluster setup on 1 laptop, with hadoop running on multiple ubuntu VMs. Any help is appreciated. Thanks Sai -- Jay Vyas http://jayunit100.blogspot.com
Re: What else can be built on top of YARN.
What is the separation of concerns between YARN and Zookeeper? That is, where does YARN leave off and where does Zookeeper begin? Or is there some overlap? On Thu, May 30, 2013 at 2:42 AM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote: Hi Rahul, It is at least for the reasons that Vinod listed that porting my application onto YARN is easier than making it work in the MapReduce framework. The main purpose of me using YARN is to exploit its resource management capabilities. Thanks, Kishore On Wed, May 29, 2013 at 11:00 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Thanks for the response Krishna. I was wondering if it were possible to use MR to solve your problem instead of building the whole stack on top of yarn. Most likely it's not possible, that's why you are building it. I wanted to know why that is? I am just trying to find out the need, or why we might need to write the application on yarn. Rahul On Wed, May 29, 2013 at 8:23 PM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote: Hi Rahul, I am porting a distributed application that runs on a fixed set of given resources to YARN, with the aim of being able to run it on dynamically selected resources, whichever are available at the time of running the application. Thanks, Kishore On Wed, May 29, 2013 at 8:04 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Hi all, I was going through the motivation behind Yarn. Splitting the responsibility of the JT is the major concern. Ultimately the base (Yarn) was built in a generic way, for building other generic distributed applications too. I am not able to think of any other parallel processing use case that would be useful to build on top of YARN. I thought of a lot of use cases that would be beneficial when run in parallel, but again, we can do those using map-only jobs in MR.
Can someone tell me a scenario where an application can utilize Yarn features, or can be built on top of YARN, and at the same time cannot be done efficiently using MRv2 jobs. thanks, Rahul -- Jay Vyas http://jayunit100.blogspot.com
Re: understanding souce code structure
Hi! A few weeks ago I had the same question... I tried a first iteration at documenting this by going through the classes, starting with key/value pairs, in the blog post below. http://jayunit100.blogspot.com/2013/04/the-kv-pair-salmon-run-in-mapreduce-hdfs.html Note it's not perfect yet, but I think it should provide some insight into things. The linchpin of it all is the DFSOutputStream and DataStreamer classes. Anyways... feel free to borrow the contents and roll your own, or comment on it and leave some feedback, or let me know if anything is missing. It definitely would be awesome to have a rock solid view of the full write path. On May 27, 2013, at 2:10 PM, Mahmood Naderan nt_mahm...@yahoo.com wrote: Hello, I am trying to understand the source of hadoop, especially HDFS. I want to know where exactly I should look in the source code for how HDFS distributes the data. Also, how does the map reduce engine try to read the data? Any hint regarding the location of those in the source code is appreciated. Regards, Mahmood
Re: Configuring SSH - is it required? for a psedo distriburted mode?
Actually, I should amend my statement -- SSH is required, but passwordless ssh (I guess) you can live without, if you are willing to enter your password for each process that gets started. But why wouldn't you want to implement passwordless ssh in a pseudo-distributed cluster? It's very easy to implement on a single node: cat ~/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys On Thu, May 16, 2013 at 11:31 AM, Jay Vyas jayunit...@gmail.com wrote: Yes it is required -- in pseudo-distributed mode the jobtracker is not necessarily aware that the task trackers / data nodes are on the same machine, and will thus attempt to ssh into them when starting the respective daemons etc. (i.e. start-all.sh) On Thu, May 16, 2013 at 11:21 AM, kishore alajangi alajangikish...@gmail.com wrote: When you start the hadoop processes, each process will ask for a password to start; to overcome this we configure SSH, whether you use a single node or multiple nodes. If you can enter the password for each process, it's not mandatory, even if you use multiple systems. Thanks, Kishore. On Thu, May 16, 2013 at 8:24 PM, Raj Hadoop hadoop...@yahoo.com wrote: Hi, I have a dedicated user on a Linux server for hadoop. I am installing it in pseudo-distributed mode on this box. I want to test my programs on this machine. But I see that the installation steps mention that SSH needs to be configured. If it is a single node, I don't require it... right? Please advise. I was looking at this site http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ It mentioned this - "Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it (which is what we want to do in this short tutorial). For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous section."
Thanks, Raj -- Jay Vyas http://jayunit100.blogspot.com -- Jay Vyas http://jayunit100.blogspot.com
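For the archives, the single-node passwordless-ssh setup in full is something like the following sketch (default key paths assumed; adjust the user and home directory for your box):

```shell
# Passwordless ssh to localhost for a pseudo-distributed node.
mkdir -p ~/.ssh && chmod 700 ~/.ssh
# Generate a key only if one doesn't already exist (empty passphrase):
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
# Authorize the key for local logins:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# Verify (first connection will prompt to accept the host key):
# ssh localhost echo ok
```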
partition as block?
Hi guys: I'm wondering - if I'm running mapreduce jobs on a cluster with large block sizes - can I increase performance with either: 1) A custom FileInputFormat 2) A custom Partitioner 3) -DnumReducers Clearly, (3) will be an issue due to the fact that it might overload tasks and network traffic... but maybe (1) or (2) would be a precise way to use partitions as a poor man's block. Just a thought - not sure if anyone has tried (1) or (2) before in order to simulate blocks and increase locality by utilizing the partition API. -- Jay Vyas http://jayunit100.blogspot.com
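For context on option (2): the default partitioning just hashes the key modulo the partition count, so raising the number of partitions shrinks each partition's share of the data roughly linearly. A dependency-free sketch of that arithmetic (the class name here is mine; the logic mirrors what Hadoop's HashPartitioner does):

```java
// Sketch of hash partitioning: more partitions => smaller output per partition.
public class PartitionSketch {
    // Mask the sign bit so the result is non-negative, then take the modulus.
    static int partitionFor(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Every key lands in [0, numPartitions); doubling numPartitions
        // roughly halves the expected data per partition file.
        for (String k : new String[]{"alpha", "beta", "gamma"}) {
            int p = partitionFor(k, 16);
            System.out.println(k + " -> partition " + p);
        }
    }
}
```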
Re: partition as block?
Well, to be more clear, I'm wondering how hadoop-mapreduce can be optimized on a block-less filesystem... and am thinking about application-tier ways to simulate blocks - i.e. by making the granularity of partitions smaller. Wondering if there is a way to hack an increased number of partitions as a mechanism to simulate blocks - or whether this is just a bad idea altogether :) On Tue, Apr 30, 2013 at 2:56 PM, Mohammad Tariq donta...@gmail.com wrote: Hello Jay, What are you going to do in your custom InputFormat and Partitioner? Is your InputFormat going to create larger splits which will overlap with larger blocks? If that is the case, IMHO, then you are going to reduce the no. of mappers, thus reducing the parallelism. Also, a much larger block size will put extra overhead when it comes to disk I/O. Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Wed, May 1, 2013 at 12:16 AM, Jay Vyas jayunit...@gmail.com wrote: Hi guys: I'm wondering - if I'm running mapreduce jobs on a cluster with large block sizes - can I increase performance with either: 1) A custom FileInputFormat 2) A custom Partitioner 3) -DnumReducers Clearly, (3) will be an issue due to the fact that it might overload tasks and network traffic... but maybe (1) or (2) would be a precise way to use partitions as a poor man's block. Just a thought - not sure if anyone has tried (1) or (2) before in order to simulate blocks and increase locality by utilizing the partition API. -- Jay Vyas http://jayunit100.blogspot.com -- Jay Vyas http://jayunit100.blogspot.com
Re: partition as block?
Yes, it is a problem at the first stage. What I'm wondering, though, is whether the intermediate results - which happen after the mapper phase - can be optimized. On Tue, Apr 30, 2013 at 3:38 PM, Mohammad Tariq donta...@gmail.com wrote: Hmmm. I was actually thinking about the very first step: how are you going to create the maps? Suppose you are on a block-less filesystem and you have a custom Format that is going to give you the splits dynamically. This means that you are going to store the file as a whole and create the splits as you continue to read the file. Wouldn't it be a bottleneck from a 'disk' point of view?? Are you not going away from the distributed paradigm?? Am I taking it in the correct way? Please correct me if I am getting it wrong. Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Wed, May 1, 2013 at 12:34 AM, Jay Vyas jayunit...@gmail.com wrote: Well, to be more clear, I'm wondering how hadoop-mapreduce can be optimized on a block-less filesystem... and am thinking about application-tier ways to simulate blocks - i.e. by making the granularity of partitions smaller. Wondering if there is a way to hack an increased number of partitions as a mechanism to simulate blocks - or whether this is just a bad idea altogether :) On Tue, Apr 30, 2013 at 2:56 PM, Mohammad Tariq donta...@gmail.com wrote: Hello Jay, What are you going to do in your custom InputFormat and Partitioner? Is your InputFormat going to create larger splits which will overlap with larger blocks? If that is the case, IMHO, then you are going to reduce the no. of mappers, thus reducing the parallelism. Also, a much larger block size will put extra overhead when it comes to disk I/O.
Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Wed, May 1, 2013 at 12:16 AM, Jay Vyas jayunit...@gmail.com wrote: Hi guys: Im wondering - if I'm running mapreduce jobs on a cluster with large block sizes - can i increase performance with either: 1) A custom FileInputFormat 2) A custom partitioner 3) -DnumReducers Clearly, (3) will be an issue due to the fact that it might overload tasks and network traffic... but maybe (1) or (2) will be a precise way to use partitions as a poor mans block. Just a thought - not sure if anyone has tried (1) or (2) before in order to simulate blocks and increase locality by utilizing the partition API. -- Jay Vyas http://jayunit100.blogspot.com -- Jay Vyas http://jayunit100.blogspot.com -- Jay Vyas http://jayunit100.blogspot.com
Re: partition as block?
What do you mean, increasing the size? I'm talking more about increasing the number of partitions... which actually decreases individual file size. On Apr 30, 2013, at 4:09 PM, Mohammad Tariq donta...@gmail.com wrote: Increasing the size can help us to an extent, but increasing it further might cause problems during copy and shuffle. If the partitions are too big to be held in memory, we'll end up with a disk based shuffle, which is gonna be slower than a RAM based shuffle, thus delaying the entire reduce phase. Furthermore, the N/W might get overwhelmed. I think keeping it considerably high will definitely give you some boost. But it'll require high level tinkering. Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Wed, May 1, 2013 at 1:29 AM, Jay Vyas jayunit...@gmail.com wrote: Yes, it is a problem at the first stage. What I'm wondering, though, is whether the intermediate results - which happen after the mapper phase - can be optimized. On Tue, Apr 30, 2013 at 3:38 PM, Mohammad Tariq donta...@gmail.com wrote: Hmmm. I was actually thinking about the very first step: how are you going to create the maps? Suppose you are on a block-less filesystem and you have a custom Format that is going to give you the splits dynamically. This means that you are going to store the file as a whole and create the splits as you continue to read the file. Wouldn't it be a bottleneck from a 'disk' point of view?? Are you not going away from the distributed paradigm?? Am I taking it in the correct way? Please correct me if I am getting it wrong. Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Wed, May 1, 2013 at 12:34 AM, Jay Vyas jayunit...@gmail.com wrote: Well, to be more clear, I'm wondering how hadoop-mapreduce can be optimized on a block-less filesystem... and am thinking about application-tier ways to simulate blocks - i.e. by making the granularity of partitions smaller.
Wondering, if there is a way to hack an increased numbers of partitions as a mechanism to simulate blocks - or wether this is just a bad idea altogether :) On Tue, Apr 30, 2013 at 2:56 PM, Mohammad Tariq donta...@gmail.com wrote: Hello Jay, What are you going to do in your custom InputFormat and partitioner?Is your InputFormat is going to create larger splits which will overlap with larger blocks?If that is the case, IMHO, then you are going to reduce the no. of mappers thus reducing the parallelism. Also, much larger block size will put extra overhead when it comes to disk I/O. Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Wed, May 1, 2013 at 12:16 AM, Jay Vyas jayunit...@gmail.com wrote: Hi guys: Im wondering - if I'm running mapreduce jobs on a cluster with large block sizes - can i increase performance with either: 1) A custom FileInputFormat 2) A custom partitioner 3) -DnumReducers Clearly, (3) will be an issue due to the fact that it might overload tasks and network traffic... but maybe (1) or (2) will be a precise way to use partitions as a poor mans block. Just a thought - not sure if anyone has tried (1) or (2) before in order to simulate blocks and increase locality by utilizing the partition API. -- Jay Vyas http://jayunit100.blogspot.com -- Jay Vyas http://jayunit100.blogspot.com -- Jay Vyas http://jayunit100.blogspot.com
Re: Maven dependency
this should be enough to get started (you can pick the 1.* version if you want the newer APIs and stuff, but for the elephant book, the older APIs will work fine as well):

    <dependencies>
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-core</artifactId>
        <version>0.20.2</version>
      </dependency>
    </dependencies>

On Wed, Apr 24, 2013 at 3:13 PM, Kevin Burton rkevinbur...@charter.net wrote: I am reading "Hadoop in Action" and the author on page 51 puts forth this code:

    // imports (not shown in the book):
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.LongSumReducer;
    import org.apache.hadoop.mapred.lib.TokenCountMapper;

    public class WordCount2 {
      public static void main(String[] args) {
        JobClient client = new JobClient();
        JobConf conf = new JobConf(WordCount2.class);
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        conf.setMapperClass(TokenCountMapper.class);
        conf.setCombinerClass(LongSumReducer.class);
        conf.setReducerClass(LongSumReducer.class);
        client.setConf(conf);
        try {
          JobClient.runJob(conf);
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
    }

Which is an example of a simple MapReduce job. But being a beginner, I am not sure how to set up a project for this code. If I am using Maven, what are the Maven dependencies that I need? There are several map reduce dependencies and I am not sure which to pick. Are there other dependencies needed (such as JobConf)? What are the imports needed? During the construction of the configuration, what heuristics are used to find the configuration for the Hadoop cluster?

Thank you. -- Jay Vyas http://jayunit100.blogspot.com
Re: Append MR output file to an exitsted HDFS file
I might be misunderstanding, but you want each reducer to append its output to corresponding files that already exist in HDFS? Remember that the reducers usually output globs, so you will have several parts to your output - so the append has to be done in a way where each new reducer partition corresponds to an old partition. If so - maybe you could play with your own OutputFormat, taking the source from one that serves as a starting point, and replacing the part that creates the output stream with a call to append() instead of create(). The reason this is tricky is that each OutputFormat is going to have to find the corresponding file to append to. On Sun, Apr 21, 2013 at 10:54 PM, YouPeng Yang yypvsxf19870...@gmail.com wrote: Hi All, Can I append a MR output file to an existing file on HDFS? I'm using CDH4.1.2 with MRv2. Regards -- Jay Vyas http://jayunit100.blogspot.com
Re: Writing intermediate key,value pairs to file and read it again
How many intermediate keys? If small enough, you can keep them in memory. If large, you can just wait for the job to finish and siphon them into your job as input with the MultipleInputs API. On Apr 20, 2013, at 10:43 AM, Vikas Jadhav vikascjadha...@gmail.com wrote: Hello, can anyone help me with the following issue: writing intermediate key,value pairs to a file and reading them again. Let us say I have to write each intermediate pair received @reducer to a file, and then read that back as key value pairs and use it for processing. I found the IFile.java file, which has a reader and writer, but I am not able to understand how to use it. For example, I don't understand the Counter value as the last parameter (spilledRecordsCounter). Thanks. -- Regards, Vikas
JobSubmissionFiles: past , present, and future?
Hi guys: I'm curious about the changes and future of the JobSubmissionFiles class. Grepping around on the web, I'm finding some code snippets that suggest that hadoop security is not handled the same way on the staging directory as before: http://javasourcecode.org/html/open-source/hadoop/hadoop-0.20.203.0/org/apache/hadoop/mapreduce/JobSubmissionFiles.java.html http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201210.mbox/%3ccaocnvr0eylsckxaocpnm7kbzwphvcdjbbx5a+azes_s6pws...@mail.gmail.com%3E But I'm having trouble definitively pinning this to versions. Why the difference in the if/else logic here, and what is the future of permissions on .staging?
Re: JobSubmissionFiles: past , present, and future?
To update on this: it was just pointed out to me by Matt Farrellee that the auto-fix of permissions is a failsafe in case of a race condition, and is not meant to mend bad permissions in all cases: https://github.com/apache/hadoop-common/commit/f25dc04795a0e9836e3f237c802bfc1fe8a243ad Something to keep in mind - if you see the "fixing staging permissions" error message a lot, then there might be a more systemic problem in your fs... at least, that was the case for us. On Apr 12, 2013, at 6:11 AM, Jay Vyas jayunit...@gmail.com wrote: Hi guys: I'm curious about the changes and future of the JobSubmissionFiles class. Grepping around on the web, I'm finding some code snippets that suggest that hadoop security is not handled the same way on the staging directory as before: http://javasourcecode.org/html/open-source/hadoop/hadoop-0.20.203.0/org/apache/hadoop/mapreduce/JobSubmissionFiles.java.html http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201210.mbox/%3ccaocnvr0eylsckxaocpnm7kbzwphvcdjbbx5a+azes_s6pws...@mail.gmail.com%3E But I'm having trouble definitively pinning this to versions. Why the difference in the if/else logic here, and what is the future of permissions on .staging?
Re: No build.xml when trying to build FUSE
hadoop-hdfs builds with maven, not ant. You might also need to install the serialization libraries. See http://wiki.apache.org/hadoop/HowToContribute . As an aside, you could try to use gluster as a FUSE mount if you simply want an HA, FUSE-mountable filesystem which is mapreduce compatible: https://github.com/gluster/hadoop-glusterfs . -- Forwarded message -- From: YouPeng Yang yypvsxf19870...@gmail.com Date: Wed, Apr 10, 2013 at 10:06 AM Subject: No build.xml when trying to build FUSE To: user@hadoop.apache.org Dear All, I want to integrate FUSE with Hadoop, so I checked out the code using the command: [root@Hadoop ~]# svn checkout http://svn.apache.org/repos/asf/hadoop/common/trunk/ hadoop-trunk However, I did not find any ant build.xml to build fuse-dfs in hadoop-trunk/hadoop-hdfs-project/hadoop-hdfs/src/contrib. Did I check out the wrong code, or is there another way to build fuse-dfs? Please guide me. Thanks, regards -- Jay Vyas http://jayunit100.blogspot.com
Re: Copy Vs DistCP
DistCP is a full-blown mapreduce job (mapper only, where the mappers do a fully parallel copy to the destination). CP appears (correct me if I'm wrong) to simply invoke the FileSystem and issue a copy command for every source file. I have an additional question: how is CP, which is internal to a cluster, optimized (if at all)? On Wed, Apr 10, 2013 at 6:20 PM, KayVajj vajjalak...@gmail.com wrote: I have a few questions regarding the usage of DistCP for copying files in the same cluster. 1) Which one is better within the same cluster, and what factors (like file size etc.) would influence the usage of one over the other? 2) When we run a cp command like the one below from a client node of the cluster (not a data node), how does the cp command work: i) like an MR job, or ii) does it copy the files locally and then copy them back to the new location? Example of the copy command: hdfs dfs -cp /some_location/file /new_location/ Thanks, your responses are appreciated. -- Kay -- Jay Vyas http://jayunit100.blogspot.com
Re: Distributed cache: how big is too big?
Hmmm... maybe I'm missing something, but (@Bjorn) why would you use hdfs as a replacement for the distributed cache? After all, the distributed cache is just a file with replication over the whole cluster, which isn't in hdfs. Can't you just make the cache size big and store the file there? What advantage is hdfs distribution of the file over all nodes? On Apr 9, 2013, at 6:49 AM, Bjorn Jonsson bjorn...@gmail.com wrote: Put it once on hdfs with a replication factor equal to the number of DNs. No startup latency on job submission or max size, and you can access it from anywhere with fs since it sticks around until you replace it? Just a thought. On Apr 8, 2013 9:59 PM, John Meza j_meza...@hotmail.com wrote: I am researching a Hadoop solution for an existing application that requires a directory structure full of data for processing. To make the Hadoop solution work, I need to deploy the data directory to each DN when the job is executed. I know this isn't new and is commonly done with a Distributed Cache. Based on experience, what are the common file sizes deployed in a Distributed Cache? I know smaller is better, but how big is too big? The larger the cache deployed, I have read, the more startup latency there will be. I also assume there are other factors that play into this. I know that:
- Default local.cache.size = 10GB
- Range of desirable sizes for a Distributed Cache = 10KB - 1GB??
- Distributed Cache is normally not used if larger than = ?
Another option: put the data directories on each DN and provide the location to the TaskTracker? thanks John
The Job.xml file
Hi guys: I can't find much info about the life cycle of the job.xml file in hadoop. My thoughts are: 1) It is created by the job client. 2) It is only read by the JobTracker. 3) Task trackers are (indirectly) configured by information in job.xml, because the JobTracker decomposes its contents into individual tasks. So, my (related) questions are: Is there a way to start a job directly from a job.xml file? What components depend on and read the job.xml file? Where is job.xml defined/documented (if anywhere)? -- Jay Vyas http://jayunit100.blogspot.com
MVN repository for hadoop trunk
Hi guys: Is there a mvn repo for hadoop's 3.0.0 trunk build? Clearly the hadoop pom.xml allows us to build hadoop from scratch and installs it as 3.0.0-SNAPSHOT -- but it's not clear whether there is a published version of this snapshot jar somewhere. -- Jay Vyas http://jayunit100.blogspot.com
Re: MVN repository for hadoop trunk
This is awesome, thanks! On Sat, Apr 6, 2013 at 5:14 PM, Harsh J ha...@cloudera.com wrote: Thanks Giri, was not aware of this one! On Sun, Apr 7, 2013 at 2:38 AM, Giridharan Kesavan gkesa...@hortonworks.com wrote: All the hadoop snapshot artifacts are available through the snapshots url: https://repository.apache.org/content/groups/snapshots -Giri On Sat, Apr 6, 2013 at 2:00 PM, Harsh J ha...@cloudera.com wrote: I don't think we publish nightly or rolling jars anywhere on maven central from trunk builds. On Sun, Apr 7, 2013 at 2:17 AM, Jay Vyas jayunit...@gmail.com wrote: Hi guys: Is there a mvn repo for hadoop's 3.0.0 trunk build? Clearly the hadoop pom.xml allows us to build hadoop from scratch and installs it as 3.0.0-SNAPSHOT -- but it's not clear whether there is a published version of this snapshot jar somewhere. -- Jay Vyas http://jayunit100.blogspot.com -- Harsh J -- Harsh J -- Jay Vyas http://jayunit100.blogspot.com
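For anyone finding this thread later: to consume those snapshot artifacts from a build, you would point your pom at the snapshots repository Giri mentioned — something like the fragment below (the repository id is arbitrary; only the URL comes from this thread):

```xml
<repositories>
  <repository>
    <id>apache-snapshots</id>
    <url>https://repository.apache.org/content/groups/snapshots</url>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>
```

With that in place, a dependency on a 3.0.0-SNAPSHOT artifact should resolve without building trunk locally.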
cannot find /usr/lib/hadoop/mapred/
Hi guys: I'm getting an odd error involving a file called toBeDeleted. I've never seen this - somehow it's blocking my task trackers from starting.

2013-03-06 16:19:24,657 ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.lang.RuntimeException: Cannot find root /usr/lib/hadoop/mapred/ for execution of task deletion of toBeDeleted/2013-03-06_02-25-40.379_4 on /usr/lib/hadoop/mapred/ with original name /usr/lib/hadoop/mapred/toBeDeleted/2013-03-06_02-25-40.379_4
    at org.apache.hadoop.util.AsyncDiskService.execute(AsyncDiskService.java:95)
    at org.apache.hadoop.util.MRAsyncDiskService.execute(MRAsyncDiskService.java:115)
    at org.apache.hadoop.util.MRAsyncDiskService.init(MRAsyncDiskService.java:105)
    at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:742)
    at org.apache.hadoop.mapred.TaskTracker.init(TaskTracker.java:1522)
    at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3821)

-- Jay Vyas http://jayunit100.blogspot.com