Re: Start/stop scripts - particularly start-dfs.sh - in Hortonworks Data Platform 2.3.X

2015-10-24 Thread Stephen Boesch
OK, I will continue on the hdp list: I am already using the hdfs command for all
of those individual commands, but they are *not* a replacement for the
single start-dfs.sh.
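
For reference, a rough stand-in for start-dfs.sh when only the hdfs launcher is
available might look like the sketch below (log paths and the daemon set are
assumptions, and this does not replicate the ssh fan-out or HA handling of the
real script):

# Hypothetical approximation of start-dfs.sh using only the hdfs launcher.
# Each subcommand runs its daemon in the foreground, so background it and
# capture logs (paths are assumptions); run each on the appropriate host.
nohup hdfs namenode    > /var/log/hadoop/namenode.log    2>&1 &
nohup hdfs journalnode > /var/log/hadoop/journalnode.log 2>&1 &
nohup hdfs datanode    > /var/log/hadoop/datanode.log    2>&1 &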



2015-10-24 9:48 GMT-07:00 Ted Yu <yuzhih...@gmail.com>:

> See /usr/hdp/current/hadoop-hdfs-client/bin/hdfs which calls hdfs.distro
>
> At the top of hdfs.distro, you would see the usage:
>
> function print_usage(){
>   echo "Usage: hdfs [--config confdir] COMMAND"
>   echo "   where COMMAND is one of:"
>   echo "  dfs                  run a filesystem command on the file
> systems supported in Hadoop."
>   echo "  namenode -format     format the DFS filesystem"
>   echo "  secondarynamenode    run the DFS secondary namenode"
>   echo "  namenode             run the DFS namenode"
>   echo "  journalnode          run the DFS journalnode"
>
> BTW since this question is vendor specific, I suggest continuing on
> vendor's forum.
>
> Cheers
>
> On Fri, Oct 23, 2015 at 7:06 AM, Stephen Boesch <java...@gmail.com> wrote:
>
>>
>> We are setting up automated deployments on a headless system: so using
>> the GUI is not an option here.  When we search for those scripts under
>> HDP they are not found:
>>
>> $ pwd
>> /usr/hdp/current
>>
>> Which scripts exist in HDP?
>>
>> [stack@s1-639016 current]$ find -L . -name \*.sh
>> ...
>>
>> There are ZERO start/stop .sh scripts.
>>
>> In particular I am interested in the *start-dfs.sh* script that starts
>> the namenode(s), journalnode, and datanodes.
>>
>>
>


Re: Start/stop scripts - particularly start-dfs.sh - in Hortonworks Data Platform 2.3.X

2015-10-24 Thread Stephen Boesch
Artem, the first thing I did was a find /usr/hdp -name \*.sh . I have done
numerous variations on that.


2015-10-24 8:35 GMT-07:00 Artem Ervits <artemerv...@gmail.com>:

> Look in /usr/hdp/2.3
> On Oct 23, 2015 10:07 AM, "Stephen Boesch" <java...@gmail.com> wrote:
>
>>
>> We are setting up automated deployments on a headless system: so using
>> the GUI is not an option here.  When we search for those scripts under
>> HDP they are not found:
>>
>> $ pwd
>> /usr/hdp/current
>>
>> Which scripts exist in HDP?
>>
>> [stack@s1-639016 current]$ find -L . -name \*.sh
>> ...
>>
>> There are ZERO start/stop .sh scripts.
>>
>> In particular I am interested in the *start-dfs.sh* script that starts
>> the namenode(s), journalnode, and datanodes.
>>
>>


Start/stop scripts - particularly start-dfs.sh - in Hortonworks Data Platform 2.3.X

2015-10-23 Thread Stephen Boesch
We are setting up automated deployments on a headless system: so using the
GUI is not an option here.  When we search for those scripts under HDP they
are not found:

$ pwd
/usr/hdp/current

Which scripts exist in HDP?

[stack@s1-639016 current]$ find -L . -name \*.sh
...

There are ZERO start/stop .sh scripts.

In particular I am interested in the *start-dfs.sh* script that starts the
namenode(s), journalnode, and datanodes.


Re: Jr. to Mid Level Big Data jobs in Bay Area

2015-05-17 Thread Stephen Boesch
Hi,  This is not a job board. Thanks.

2015-05-17 16:00 GMT-07:00 Adam Pritchard apritchard...@gmail.com:

 Hi everyone,

 I was wondering if any of you know any openings looking to hire a big data
 dev in the Palo Alto area.

 Main thing I am looking for is to be on a team that will embrace having a
 Jr to Mid level big data developer, where I can grow my skill set and
 contribute.


 My skills are:

 3 years Java
 1.5 years Hadoop
 1.5 years Hbase
 1 year map reduce
 1 year Apache Storm
 1 year Apache Spark (did a Spark Streaming project in Scala)

 5 years PHP
 3 years iOS development
 4 years Amazon ec2 experience


 Currently I am working in San Francisco as a big data developer, but the
 team I'm on is content leaving me work that I already knew how to do when I
 came to the team (web services) and I want to work with big data
 technologies at least 70% of the time.


 I am not a senior big data dev, but I am motivated to be and am just
 looking for an opportunity where I can work all day or most of the day with
 big data technologies, and contribute and learn from the project at hand.


 Thanks if anyone can share any information,


 Adam





Re: Jobs for Hadoop Admin in SFO Bay Area

2015-02-17 Thread Stephen Boesch
Hi,
  There are plenty of job boards: please use them; this one is for
technical items.

2015-02-17 17:50 GMT-08:00 Krish Donald gotomyp...@gmail.com:

 Hi,

 Does any of you aware of the job opening for entry level Hadoop Admin in
 SFO Bay Area ?

 If yes, please let me know.

 Thanks
 Krish



Re: Sr.Technical Architect/Technical Architect/Sr. Hadoop /Big Data Developer for CA, GA, NJ, NY, AZ Locations_(FTE)Full Time Employment

2014-11-12 Thread Stephen Boesch
Please refrain from job postings on this mailing list.

2014-11-12 20:53 GMT-08:00 Larry McCay lmc...@hortonworks.com:

 Everyone should be aware that replying to this mail results in sending
 your papers to everyone on the list


 On Wed, Nov 12, 2014 at 8:17 PM, mark charts mcha...@yahoo.com wrote:

 Hello.


 I am interested. Attached are my cover letter and my resume.


 Mark Charts


   On Wednesday, November 12, 2014 2:45 PM, Amendra Singh Gangwar 
 amen...@exlarate.com wrote:


 Hi,


 Please let me know if you are available for this FTE position for CA, GA,
 NJ, NY with good travel.
 Please forward latest resume along with Salary Expectations, Work
 Authorization & Minimum joining time required.

 Job Descriptions:

 Positions: Technical Architect/Sr. Technical Architect/ Sr.
 Hadoop /Big Data Developer

 Location  : CA, GA, NJ, NY

 Job Type  : Full Time Employment

 Domain: BigData



 Requirement 1:

 Sr. Technical Architect: 12+ years of experience in the implementation
 role
 of high end software products in telecom/ financials/ healthcare/
 hospitality domain.



 Requirement 2:

 Technical Architect: 9+ years of experience in the implementation role of
 high end software products in telecom/ financials/ healthcare/
 hospitality
 domain.



 Requirement 3:

 Sr. Hadoop /Big Data Developer 7+ years of experience in the
 implementation
 role of high end software products in telecom/ financials/ healthcare/
 hospitality domain.



 Education: Engineering Graduate, MCA, Masters/Post Graduates (preferably
 IT/ CS)



 Primary Skills:

 1. Expertise on Java/ J2EE and should still be hands on.

 2. Implemented and in-depth knowledge of various java/ J2EE/ EAI patterns
 by
 using Open Source products.

 3. Design/ Architected and implemented complex projects dealing with the
 considerable data size (GB/ PB) and with high complexity.

 4. Sound knowledge of various Architectural concepts (Multi-tenancy, SOA,
 SCA etc) and capable of identifying and incorporating various NFR’s
 (performance, scalability, monitoring etc)

 5. Good in database principles, SQL, and experience working with large
 databases (Oracle/ MySQL/ DB2).

 6. Sound knowledge about the clustered deployment Architecture and should
 be
 capable of providing deployment solutions based on customer needs.

 7. Sound knowledge about the Hardware (CPU, memory, disk, network,
 Firewalls
 etc)

 8. Should have worked on open source products and also contributed
 towards
 it.

 9. Capable of working as an individual contributor and within team too.

 10. Experience in working in ODC model and capable of presenting the
 Design
 and Architecture to CTO’s, CEO’s at onsite

 11. Should have experience/ knowledge on working with batch processing/
 Real
 time systems using various Open source technologies like Solr, hadoop,
 NoSQL
 DB’s, Storm, kafka etc.



 Role & Responsibilities (Technical Architect/Sr. Technical Architect)

 •Anticipate on technological evolutions.

 •Coach the technical team in the development of the technical
 architecture.

 •Ensure the technical directions and choices.

 •Design/ Architect/ Implement various solutions arising out of the large
 data processing (GB’s/ PB’s) over various NoSQL, Hadoop and MPP based
 products.

 •Driving various Architecture and design calls with bigdata customers.

 •Working with offshore team and providing guidance on implementation
 details.

 •Conducting sessions/ writing whitepapers/ Case Studies pertaining to
 BigData

 •Responsible for Timely and quality deliveries.

 •Fulfill organization responsibilities – Sharing knowledge and experience
 within the other groups in the org., conducting various technical
 sessions
 and trainings.



 Role & Responsibilities (Sr. Hadoop /Big Data Developer)

 •Implementation of various solutions arising out of the large data
 processing (GB’s/ PB’s) over various NoSQL, Hadoop and MPP based products

 •Active participation in the various Architecture and design calls with
 bigdata customers.

 •Working with Sr. Architects and providing implementation details to
 offshore.

 •Conducting sessions/ writing whitepapers/ Case Studies pertaining to
 BigData

 •Responsible for Timely and quality deliveries.

 •Fulfill organization responsibilities – Sharing knowledge and experience
 within the other groups in the org., conducting various technical
 sessions
 and trainings

 --

 Thanks & Regards,

 Amendra Singh

 Exlarate LLC

 Cell : 323-250-0583

 E-mail :amen...@exlarate.com

 www.exlarate.com

 


Re: Spark vs. Storm

2014-07-02 Thread Stephen Boesch
Spark Streaming discretizes the stream into configurable intervals of no less
than 500 milliseconds. Therefore it is not appropriate for true real-time
processing. So if you need to capture events in the low 100s of milliseconds
range or less, then stick with Storm (at least for now).

If you can afford one second or more of latency, then Spark provides the advantage of
interoperability with the other Spark components and capabilities.


2014-07-02 12:59 GMT-07:00 Shahab Yunus shahab.yu...@gmail.com:

 Not exactly. There are of course  major implementation differences and
 then some subtle and high level ones too.

 My 2-cents:

 Spark is in-memory M/R and it simulates streaming or real-time distributed
 processing for large datasets by micro-batching. The gain in speed and
 performance as opposed to the batch paradigm is in-memory buffering or batching
 (and I am here being a bit naive/crude in explanation.)

 Storm on the other hand, supports stream processing even at a single
 record level (known as tuple in its lingo.) You can do micro-batching on
 top of it as well (using Trident API which is good for state maintenance
 too, if your BL requires that). This is more applicable where you want
 control to a single record level rather than set, collection or batch of
 records.

 Having said that, Spark Streaming is trying to simulate Storm's extreme
 granular approach but as far as I recall, it still is built on top of core
 Spark (basically another level of abstraction over core Spark constructs.)

 So given this, you can pick the framework which is more attuned to your
 needs.


 On Wed, Jul 2, 2014 at 3:31 PM, Adaryl Bob Wakefield, MBA 
 adaryl.wakefi...@hotmail.com wrote:

   Do these two projects do essentially the same thing? Is one better
 than the other?





Re: Big Data tech stack (was Spark vs. Storm)

2014-07-02 Thread Stephen Boesch
You will not arrive at a generic stack without oversimplifying to the
point of serious deficiencies. There are, as you say, a multitude of
options.  You are attempting to boil them down to "A vs. B" as opposed to "A
may work better under the following conditions ..."


2014-07-02 13:25 GMT-07:00 Adaryl Bob Wakefield, MBA 
adaryl.wakefi...@hotmail.com:

   You know what I’m really trying to do? I’m trying to come up with a
 best practice technology stack. There are so many freaking projects it is
 overwhelming. If I were to walk into an organization that had no Big Data
 capability, what mix of projects would be best to implement based on
 performance, scalability and ease of use/implementation? So far I’ve got:
 Ubuntu
 Hadoop
 Cassandra (Seems to be the highest performing NoSQL database out there.)
 Storm (maybe?)
 Python (Easier than Java. Maybe that shouldn’t be a concern.)
 Hive (For people to leverage their existing SQL skillset.)

 That would seem to cover transaction processing and warehouse storage and
 the capability to do batch and real time analysis. What am I leaving out or
 what do I have incorrect in my assumptions?

 B.



  *From:* Stephen Boesch java...@gmail.com
 *Sent:* Wednesday, July 02, 2014 3:07 PM
 *To:* user@hadoop.apache.org
 *Subject:* Re: Spark vs. Storm

  Spark Streaming discretizes the stream into configurable intervals of no
 less than 500 milliseconds. Therefore it is not appropriate for true real-time
 processing. So if you need to capture events in the low 100s of
 milliseconds range or less, then stick with Storm (at least for now).

 If you can afford one second or more of latency, then Spark provides the advantage of
 interoperability with the other Spark components and capabilities.


 2014-07-02 12:59 GMT-07:00 Shahab Yunus shahab.yu...@gmail.com:

 Not exactly. There are of course  major implementation differences and
 then some subtle and high level ones too.

 My 2-cents:

 Spark is in-memory M/R and it simulates streaming or real-time
 distributed processing for large datasets by micro-batching. The gain in speed
 and performance as opposed to the batch paradigm is in-memory buffering or
 batching (and I am here being a bit naive/crude in explanation.)

 Storm on the other hand, supports stream processing even at a single
 record level (known as tuple in its lingo.) You can do micro-batching on
 top of it as well (using Trident API which is good for state maintenance
 too, if your BL requires that). This is more applicable where you want
 control to a single record level rather than set, collection or batch of
 records.

 Having said that, Spark Streaming is trying to simulate Storm's extreme
 granular approach but as far as I recall, it still is built on top of core
 Spark (basically another level of abstraction over core Spark constructs.)

 So given this, you can pick the framework which is more attuned to your
 needs.


 On Wed, Jul 2, 2014 at 3:31 PM, Adaryl Bob Wakefield, MBA 
 adaryl.wakefi...@hotmail.com wrote:

   Do these two projects do essentially the same thing? Is one better
 than the other?







Re: Cloudera VM

2014-06-14 Thread Stephen Boesch
Hi Shashi,
   apologies, I was working on CDH5 enterprise (the real deal) today and had
misremembered the settings for the VM

the VM uses */opt/cloudera/parcels/CDH/lib/*[hive|hadoop|hbase]


2014-06-13 22:51 GMT-07:00 Maisnam Ns maisnam...@gmail.com:

 Hi Stephen,

 Your last command ls -lrta /etc/hadoop worked , but when I tried this
 command I could not find hadoop|hive|hbase etc, where are they located, the
 actual hadoop directory
 [cloudera@localhost /]$ ls /usr/lib
 anaconda-runtime  hue java-1.5.0   jvm-exports  rpm
 bonobojavajava-1.6.0   jvm-private  ruby
 ConsoleKitjava-1.3.1  java-1.7.0   locale   sendmail
 cups  java-1.4.0  java-ext lsb  sendmail.postfix
 games java-1.4.1  jvm  mozilla  vmware-tools
 gcc   java-1.4.2  jvm-commmon  python2.6yum-plugins


 Thanks
 shashi


 On Sat, Jun 14, 2014 at 10:30 AM, Stephen Boesch java...@gmail.com
 wrote:

 Hi Shashidhar,
   They are under /etc/[hadoop|hbase|hive|etc]/conf as symlinks   to
 /usr/lib/[hadoop|hbase|hive|etc]

 [cloudera@localhost CDH]$ ll -Ld /etc/h*

 drwxr-xr-x  2 root root 4096 Jun  8 16:37 /etc/hadoop-httpfs
 drwxr-xr-x. 3 root root 4096 Apr  4 10:36 /etc/hal
 drwxr-xr-x. 3 root root 4096 Jun  8 16:37 /etc/hbase
 drwxr-xr-x  2 root root 4096 Jun  8 16:37 /etc/hbase-solr
 drwxr-xr-x. 3 root root 4096 Jun  8 16:37 /etc/hive
 drwxr-xr-x  2 root root 4096 Jun  8 16:37 /etc/hive-hcatalog
 drwxr-xr-x  2 root root 4096 Jun  8 16:37 /etc/hive-webhcat
 -rw-r--r--. 1 root root    9 Sep 23  2011 /etc/host.conf
 -rw-r--r--. 1 root root   46 Apr  4 10:36 /etc/hosts
 -rw-r--r--. 1 root root  370 Jan 12  2010 /etc/hosts.allow
 -rw-r--r--. 1 root root  460 Jan 12  2010 /etc/hosts.deny
 drwxr-xr-x  2 root root 4096 Jun  8 16:37 /etc/hue


 [cloudera@localhost CDH]$ *ls -lrta /etc/hadoop*
 total 16
 drwxr-xr-x.   2 root root 4096 Apr  4 11:21 conf.cloudera.hdfs
 lrwxrwxrwx  1 root root   29 Jun  8 16:37 *conf ->
 /etc/alternatives/hadoop-conf*
 drwxr-xr-x.   4 root root 4096 Jun  8 16:37 .
 drwxr-xr-x.   2 cloudera cloudera 4096 Jun  8 19:14 conf.cloudera.yarn
 drwxr-xr-x. 113 root root 4096 Jun 13 21:57 ..




 2014-06-13 21:42 GMT-07:00 Shashidhar Rao raoshashidhar...@gmail.com:

 Hi,

 I just installed cloudera vm 5 .x on vmplayer. Can somebody having
 experience with cloudera help me in finding where are the *-site.xml files
 so that I can configure the various settings.



 Thanks
 Shashi






Re: Cloudera VM

2014-06-13 Thread Stephen Boesch
Hi Shashidhar,
  They are under /etc/[hadoop|hbase|hive|etc]/conf as symlinks   to
/usr/lib/[hadoop|hbase|hive|etc]

[cloudera@localhost CDH]$ ll -Ld /etc/h*

drwxr-xr-x  2 root root 4096 Jun  8 16:37 /etc/hadoop-httpfs
drwxr-xr-x. 3 root root 4096 Apr  4 10:36 /etc/hal
drwxr-xr-x. 3 root root 4096 Jun  8 16:37 /etc/hbase
drwxr-xr-x  2 root root 4096 Jun  8 16:37 /etc/hbase-solr
drwxr-xr-x. 3 root root 4096 Jun  8 16:37 /etc/hive
drwxr-xr-x  2 root root 4096 Jun  8 16:37 /etc/hive-hcatalog
drwxr-xr-x  2 root root 4096 Jun  8 16:37 /etc/hive-webhcat
-rw-r--r--. 1 root root    9 Sep 23  2011 /etc/host.conf
-rw-r--r--. 1 root root   46 Apr  4 10:36 /etc/hosts
-rw-r--r--. 1 root root  370 Jan 12  2010 /etc/hosts.allow
-rw-r--r--. 1 root root  460 Jan 12  2010 /etc/hosts.deny
drwxr-xr-x  2 root root 4096 Jun  8 16:37 /etc/hue


[cloudera@localhost CDH]$ *ls -lrta /etc/hadoop*
total 16
drwxr-xr-x.   2 root root 4096 Apr  4 11:21 conf.cloudera.hdfs
lrwxrwxrwx  1 root root   29 Jun  8 16:37 *conf ->
/etc/alternatives/hadoop-conf*
drwxr-xr-x.   4 root root 4096 Jun  8 16:37 .
drwxr-xr-x.   2 cloudera cloudera 4096 Jun  8 19:14 conf.cloudera.yarn
drwxr-xr-x. 113 root root 4096 Jun 13 21:57 ..
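
A quick way to locate the *-site.xml files on the VM, as a sketch (the set of
/etc directories is an assumption based on the listing above):

# List the Hadoop-ecosystem site files present on the CDH VM
find /etc/hadoop /etc/hive /etc/hbase -name '*-site.xml' 2>/dev/null
# Show where the active conf symlink actually points
readlink -f /etc/hadoop/conf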




2014-06-13 21:42 GMT-07:00 Shashidhar Rao raoshashidhar...@gmail.com:

 Hi,

 I just installed cloudera vm 5 .x on vmplayer. Can somebody having
 experience with cloudera help me in finding where are the *-site.xml files
 so that I can configure the various settings.



 Thanks
 Shashi



Re: FW: Lambdoop - We are hiring! CTO-Founder2Be based in Madrid, Spain

2014-01-29 Thread Stephen Boesch
Please refrain from making this a job posting site. thanks.


2014-01-29 Gursev Bajwa gba...@blackberry.com



 Gursev Bajwa
 BlackBerry Technical Support
 BlackBerry Care

 Office:
 (519) 888-7465 x39495
 Mobile:
 (226) 339-2981
 PIN:
 2FFFC504


 Always On, Always Connected

 Sent from my BlackBerry 10 Smartphone

 -Original Message-
 From: Info [mailto:i...@lambdoop.com]
 Sent: Wednesday, January 29, 2014 8:02 AM
 To: gene...@hadoop.apache.org
 Subject: Lambdoop - We are hiring! CTO-Founder2Be based in Madrid, Spain

 Dear all,

 At Lambdoop we are building a BigData middleware with a simple Java API
 for building Big Data applications using any processing paradigm:
 long-running (batch) tasks, streaming (real-time) processing and hybrid
 computation models (Lambda Architecture). No MapReduce coding, streaming
 topology processing or complex NoSQL management. No synchronization or
 aggregation issues. Just Lambdoop!

 More info to come at lambdoop.com

 Job description:
 Due to our next official launch, we are looking to hire a talented Chief
 Technology Officer with experience working in a start-up environment. Our
 engineering team has been working hard on implementing our initial
 product and now we are ready to launch and grow a new innovative BigData
 company that will make BigData application development easier and faster.
 With our ultimate goal to make our customers have better use of data,
 simplifying the way they create valuable insights from their sparse data
 sources.

 You will be responsible for building and leading our engineering team,
 setting the product strategy and roadmap. While this will be the main
 activity, our next CTO should be willing to get his hands on the products,
 writing code, solving problems and prototyping new functionalities. You
 will be providing technical guidance and direction to our entire
 engineering team. Ideally this individual will propose innovative solutions
 and explore new ways of increase the value we provide to our customers.

 We are looking for a CTO - a founder-to-be - based in Madrid, Spain. We
 offer a competitive salary and strong benefits (equity included). We offer a
 very innovative work culture, within a passionate team willing to make the
 difference.

 If you are interested, please drop us an email at inf...@lambdoop.com and
 tell us about you.

 Required Skills


 *PhD/MS Computer Science - Software Engineering
 *Strong OOP skills, ability to analyze requirements and prepare design
 *Strong background in Architecture Design, especially in parallel and
 distributed processing systems
 *Demonstrated track record building SW solutions and products, preferably
 in the data analytics space, in another successful start-up or
 well-respected technology company
 *Hands on experience with common open-source and JEE technologies
 *Knowledge on cloud environments (AWS, Google, ...)
 *Team-oriented individual with excellent interpersonal, planning,
 coordination, and problem-finding skills. Ability to prioritize and
 assign tasks throughout the team.
 *High degree of initiative and the ability to work independently and
 follow-through on assignments.

 Big Data experience

 *Hadoop (HDFS, MapReduce, YARN)
 *Batch technologies (Hive, Pig, Cascading)
 *NoSQL (HBase, Cassandra, Redis)
 *Real-time processing (Storm, Trident)
 *Related tools (Avro, Sqoop, Flume)
 *Machine Learning (Mahout, SAMOA)

 Personal skills

 *Strong communication capabilities in Spanish and English (both written
 and verbal)
 *Result oriented, capable of working under tight deadlines
 *Motivated
 *Open minded
 *A passion to learn and to make a difference

 Thanks.
 -
 The Lambdoop team
 lambdoop.com
 @infoLambdoop




Re: how to benchmark the whole hadoop cluster performance?

2013-09-02 Thread Stephen Boesch
You are on the right track.  TestDFSIO and TeraGen/TeraSort provide a good
characterization of I/O and shuffle/sort performance.  You would likely
want to run/save dstat (/vmstat/iostat/..) info on the individual nodes as
well.

HiBench does provide additional useful characterizations such as mixed
workloads using typical hadoop ecosystem tools.
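
As a concrete starting point, a hedged sketch of the usual invocations (jar
names and the sizes used are assumptions and vary by Hadoop version and
distribution):

# HDFS I/O throughput: write then read 10 files of 1000 MB each
hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar hadoop-*test*.jar TestDFSIO -read  -nrFiles 10 -fileSize 1000

# Shuffle/sort characterization: generate ~100 GB (1e9 100-byte rows), then sort it
hadoop jar hadoop-*examples*.jar teragen 1000000000 /bench/teragen
hadoop jar hadoop-*examples*.jar terasort /bench/teragen /bench/terasort

# Capture per-node system stats while the jobs run (run on each node)
dstat -cdnm --output /tmp/dstat-$(hostname).csv 60 &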


2013/9/2 Ravi Kiran ravikiranmag...@gmail.com

 You can also look at
 a ) https://github.com/intel-hadoop/HiBench


 Regards
 Ravi Magham


 On Mon, Sep 2, 2013 at 12:26 PM, ch huang justlo...@gmail.com wrote:

 hi ,all:
i want to evaluate my hadoop cluster performance ,what tool can i
 use? (TestDFSIO,nnbench?)





Re: Collect, Spill and Merge phases insight

2013-07-16 Thread Stephen Boesch
great questions, i am also looking forward to answers from expert(s) here.


2013/7/16 Felix.徐 ygnhz...@gmail.com

 Hi all,

 I am trying to understand the process of Collect, Spill and Merge in Map,
 I've referred to a few documentations but still have a few questions.

 Here is my understanding about the spill phase in map:

 1.Collect function add a record into the buffer.
 2.If the buffer exceeds a threshold (determined by parameters like
 io.sort.mb), spill phase begins.
 3.Spill phase includes 3 actions : sort , combine and compression.
 4.Spill may be performed multiple times thus a few spilled files will be
 generated.
 5.If there are more than 1 spilled files, Merge phase begins and merge
 these files into a big one.

  If there is any misunderstanding about these phases, please correct me,
  thanks!
 And my questions are:

 1.Where is the partition being calculated (in Collect or Spill) ?  Does
 Collect simply append a record into the buffer and check whether we should
 spill the buffer?

 2.At Merge phase, since the spilled files are compressed, does it need to
 uncompressed these files and compress them again? Since Merge may be
 performed more than 1 round, does it compress intermediate files?

  3.Is the Merge phase on the Map and Reduce sides almost the same (external
  merge-sort combined with a min-heap)?
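
For reference, the buffer/spill/merge behavior described above is governed by a
handful of job parameters; a sketch of setting them per job (Hadoop 1.x property
names; the jar and class names are placeholders, and the driver is assumed to
use ToolRunner/GenericOptionsParser):

# io.sort.mb                  size (MB) of the in-memory collect buffer
# io.sort.spill.percent       fill ratio at which a spill starts
# io.sort.factor              number of spill segments merged per merge pass
# mapred.compress.map.output  compress spilled/merged map output
hadoop jar my-job.jar com.example.MyJob \
  -Dio.sort.mb=256 -Dio.sort.spill.percent=0.80 \
  -Dio.sort.factor=50 -Dmapred.compress.map.output=true \
  input/ output/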




Hint on EOFException's on datanodes

2013-05-23 Thread Stephen Boesch
On a smallish (10 node) cluster with only 2 mappers per node after a few
minutes EOFExceptions are cropping up on the datanodes: an example is shown
below.

Any hint on what to tweak/change in hadoop / cluster settings to make this
more happy?


2013-05-24 05:03:57,460 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode
(org.apache.hadoop.hdfs.server.datanode.DataXceiver@1b1accfc): writeBlock
blk_7760450154173670997_48372 received exception java.io.EOFException:
while trying to read 65557 bytes
2013-05-24 05:03:57,262 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode (PacketResponder 0 for
Block blk_-3990749197748165818_48331): PacketResponder 0 for block
blk_-3990749197748165818_48331 terminating
2013-05-24 05:03:57,460 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode
(org.apache.hadoop.hdfs.server.datanode.DataXceiver@1b1accfc):
DatanodeRegistration(10.254.40.79:9200,
storageID=DS-1106090267-10.254.40.79-9200-1369343833886, infoPort=9102,
ipcPort=9201):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:268)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:406)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:112)
at java.lang.Thread.run(Thread.java:662)
2013-05-24 05:03:57,261 INFO org.apache.hadoop.hdfs.server.datanode.Dat


Re: Reduce starts before map completes (at 23%)

2013-04-11 Thread Stephen Boesch
Hi Sai,
  The first phase of the reducer is to copy/fetch the map output files from the
mapper machine(s) to the reducer. This can be started when some of the mappers
have completed. You will notice that the reducer will not surpass 33%,
though, since the next phase - sort - requires that all of the mappers be
completed.

Summary: that is fine.
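
If the early reduce launch itself is a concern, the copy-phase start can be
pushed back with the slow-start setting; a sketch (Hadoop 1.x property name, and
the wordcount example is assumed to accept generic -D options):

# Launch reducers only after 95% of maps have completed
hadoop jar hadoop-examples.jar wordcount \
  -Dmapred.reduce.slowstart.completed.maps=0.95 \
  input/ output/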


2013/4/11 Sai Sai saigr...@yahoo.in

 I am running the wordcount from hadoop-examples, i am giving as input a
 bunch of test files, i have noticed in the output given below reduce starts
 when the map is at 23%, i was wondering if it is not right that reducers
  will start only after the complete mapping is done, which means the reducers
  should start only when map is at 100%. Why are the reducers starting
  when map is still at 23%.

 13/04/11 21:10:32 INFO mapred.JobClient:  map 0% reduce 0%
 13/04/11 21:10:56 INFO mapred.JobClient:  map 1% reduce 0%
 13/04/11 21:10:59 INFO mapred.JobClient:  map 2% reduce 0%
 13/04/11 21:11:02 INFO mapred.JobClient:  map 3% reduce 0%
 13/04/11 21:11:05 INFO mapred.JobClient:  map 4% reduce 0%
 13/04/11 21:11:08 INFO mapred.JobClient:  map 6% reduce 0%
 13/04/11 21:11:11 INFO mapred.JobClient:  map 7% reduce 0%
 13/04/11 21:11:17 INFO mapred.JobClient:  map 8% reduce 0%
 13/04/11 21:11:23 INFO mapred.JobClient:  map 10% reduce 0%
 13/04/11 21:11:26 INFO mapred.JobClient:  map 12% reduce 0%
 13/04/11 21:11:32 INFO mapred.JobClient:  map 14% reduce 0%
 13/04/11 21:11:44 INFO mapred.JobClient:  map 23% reduce 0%
 13/04/11 21:11:50 INFO mapred.JobClient:  map 23% reduce 1%
 13/04/11 21:11:53 INFO mapred.JobClient:  map 33% reduce 7%
 13/04/11 21:12:02 INFO mapred.JobClient:  map 42% reduce 7%

  Please shed some light.
 Thanks
 Sai



Re: Job: Bigdata Architect

2013-04-01 Thread Stephen Boesch
Please let's refrain from recruiting on this DL, which to my understanding is
focused on hadoop questions/issues, not jobs.  Thanks,


2013/4/1 jessica k jessica.kudukisgr...@gmail.com

 We are a recruiting firm focused on The IT Service and Solutions industry.
 We were contracted by a top tier $7 Billion + consulting firm.
 I thought you may be interested , or know someone who may be, in the
 following position.

 **
 **
 *1.   **Job Role:* Sr. Architect (C2) / Architect (C1) / Tech Lead (B3) –
 Big Data and Online Analytics
 *2.   **Locations*: Mountain View, California or New York City, NY*
 3.   **Job Description: *
  We are looking for Architects for developing Big Data solutions. The
 candidates would be responsible for the following activities:

 - Meet clients, understand their needs and craft solutions meeting
 client needs
 - Technology evaluation, architecture & design, implementation,
 testing and technical reviews of highly scalable platforms, solutions and
 technology building blocks to address customer’s requirement.
- Lead the overall technical solution implementation as part of
customer’s project delivery team
- Mentor and groom project team members in the core technology areas,
usage of SDLC tools and Agile
- Engage in external and internal forums to demonstrate thought
leadership

 *4. Skills Required:*

- 7+ years of overall work experience.
- Strong hands-on work experience in architecting, designing and
implementing big data solutions including one or more of the following
skills:


    Programming Languages:  Core Java, J2EE, JDBC, Spring, Struts, Hibernate
    Scalable databases or object stores:  Cassandra, MongoDB, AWS DynamoDB, Hbase, OpenStack Swift, etc.
    Caching:  memcached, BerkelyDB, Gigaspaces, Infinispan
    Distributed File Systems:  HDFS, Gluster, Lustre
    Distributed Data Processing:  Hadoop Map Reduce, Parallel Processing, Stream Processing, Flume, Splunk
    Other skills:  Machine Learning, Mathematical background & NLP

 If you are interested, please respond with a current resume to *
 jess...@kudukisgroup.com*.
 I will give you a call to speak in further detail.
 We keep all information confidential. Feel free to reply with any
 questions.

 Thanks,
 ~ Jessica



Re: Job driver and 3rd party jars

2013-03-09 Thread Stephen Boesch
  *-Dmapreduce.task.classpath.user.precedence=true*

I have also experienced these issues with -libjars not working and am
following this thread with interest.

Where is this particular option - as opposed to
mapreduce.user.classpath.first, which in version 1.0.3 is in
TaskRunner.java? Any documentation / hints on this?
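
For what it's worth, the distinction discussed below can be sketched like this:
-libjars ships the jars to the map/reduce tasks, while the driver class run by
"hadoop jar" needs them on the client-side classpath via HADOOP_CLASSPATH
(jar names and paths here are hypothetical):

export HADOOP_CLASSPATH=/path/to/thirdparty.jar
hadoop jar my-job.jar com.example.MyDriver \
  -libjars /path/to/thirdparty.jar \
  input/ output/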


2013/3/7 刘晓文 lxw1...@qq.com

 try:
 hadoop jar  *-Dmapreduce.task.classpath.user.precedence=true *-libjars
 your_jar



 -- Original --
 *From: * Barak Yaishbarak.ya...@gmail.com;
 *Date: * Fri, Mar 8, 2013 03:06 PM
 *To: * useruser@hadoop.apache.org; **
 *Subject: * Re: Job driver and 3rd party jars

 Yep, my typo, I'm using the later. I was also trying export
 HADOOP_CLASSPATH_USER_FIRST =true and export HADOOP_CLASSPATH=myjar before
 launching the hadoop jar, but I still getting the same exception.
 I'm running hadoop 1.0.4.
 On Mar 8, 2013 2:27 AM, Harsh J ha...@cloudera.com wrote:

 To be precise, did you use -libjar or -libjars? The latter is the right
 option.

 On Fri, Mar 8, 2013 at 12:18 AM, Barak Yaish barak.ya...@gmail.com
 wrote:
  Hi,
 
  I'm able to run M/R jobs where the mapper and reducer required to use
 3rd
  party jars. I'm registering those jars in -libjar while invoking the
 hadoop
  jar command. I'm facing a strange problem, though, when the job driver
  itself ( extends Configured implements Tool ) required to run such code
 (
  for example notify some remote service upon start and end). Is there a
 way
  to configure classpath when submitting jobs using hadoop jar? Seems like
  -libjar doesn't work for this case...
 
  Exception in thread main java.lang.NoClassDefFoundError:
  com/me/context/DefaultContext
  at java.lang.ClassLoader.defineClass1(Native Method)
  at java.lang.ClassLoader.defineClassCond(ClassLoader.java:632)
  at java.lang.ClassLoader.defineClass(ClassLoader.java:616)
  at
  java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
  at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
  at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
  at
  com.peer39.bigdata.mr.pnm.PnmDataCruncher.run(PnmDataCruncher.java:50)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
  at com.me.mr.pnm.PnmMR.main(PnmDataCruncher.java:261)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
  Caused by: java.lang.ClassNotFoundException:
 com.me.context.DefaultContext
  at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:248)



 --
 Harsh J




Re: Tests to be done to check a library if it works on Hadoop MapReduce framework

2013-02-16 Thread Stephen Boesch
check out mrunit  http://mrunit.apache.org/


2013/2/16 Varsha Raveendran varsha.raveend...@gmail.com

 Hello!

 As part of my graduate project I am trying to create a library to support
 Genetic Algorithms to work on Hadoop MapReduce.

 A couple of things I have not understood -

  How to test the library in Hadoop? I mean how to check if it is doing
 what it is supposed to do.
  What is the user expected to give as input?

 I am a newbie to both GA and Hadoop.
 I have been studying both for the past couple of weeks. So now have a fair
 idea about GAs but not able to understand how to implement a library. Never
 done this before.
 Any suggestions/help would be appreciated.

 Regards,
 Varsha



Re: Help with DataDrivenDBInputFormat: splits are created properly but zero records are sent to the mappers

2013-01-24 Thread Stephen Boesch
It turns out to be an apparent problem in one of the two methods
of DataDrivenDBInputFormat.setInput().  The version I used does not work as
shown: it needs to have a primary key column set somehow. But no
information / documentation on how to set the pkcol that I could find.  So
I converted to using the other setInput() method, as follows:

DataDrivenDBInputFormat.setInput(job, DBTextWritable.class,
  APP_DETAILS_CRAWL_QUEUE_V, null, id, id);

Now  this is working .




2013/1/24 Stephen Boesch java...@gmail.com


 I have made an attempt to implement a job using DataDrivenDBInputFormat.
 The result is that the input splits are created successfully with 45K
 records apiece, but zero records are then actually sent to the mappers.

 If anyone can point to working example(s) of using DataDrivenDBInputFormat
 it would be much appreciated.


 Here are further details of my attempt:


 DBConfiguration.configureDB(job.getConfiguration(), props.getDriver(),
 props.getUrl(), props.getUser(), props.getPassword());
 // Note: i also include code here to verify able to get
 java.sql.Connection using the above props..

 DataDrivenDBInputFormat.setInput(job,
   DBLongWritable.class,
   "select id,status from app_detail_active_crawl_queue_v where " +
     DataDrivenDBInputFormat.SUBSTITUTE_TOKEN,
   "SELECT MIN(id),MAX(id) FROM app_detail_active_crawl_queue_v");
  // I verified by stepping with the debugger that the input queries were
  // successfully applied by DataDrivenDBInputFormat to create two splits
  // of 40K records each

 .. snip  ..
 // Register a custom DBLongWritable class
   static {
 WritableComparator.define(DBLongWritable.class, new
 DBLongWritable.DBLongKeyComparator());
 int x  = 1;
   }


 Here is the job output. No rows were processed (even though 90K rows were
 identified in the InputSplits phase and divided into two 45K splits). So why
 were the input splits not processed?

 [Thu Jan 24 12:19:59] Successfully connected to
 driver=com.mysql.jdbc.Driver url=jdbc:mysql://localhost:3306/classint
 user=stephenb
 [Thu Jan 24 12:19:59] select id,status from
 app_detail_active_crawl_queue_v where $CONDITIONS
 13/01/24 12:20:03 INFO mapred.JobClient: Running job: job_201301102125_0069
 13/01/24 12:20:05 INFO mapred.JobClient:  map 0% reduce 0%
 13/01/24 12:20:22 INFO mapred.JobClient:  map 50% reduce 0%
 13/01/24 12:20:25 INFO mapred.JobClient:  map 100% reduce 0%
 13/01/24 12:20:30 INFO mapred.JobClient: Job complete:
 job_201301102125_0069
 13/01/24 12:20:30 INFO mapred.JobClient: Counters: 17
 13/01/24 12:20:30 INFO mapred.JobClient:   Job Counters
 13/01/24 12:20:30 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=21181
 13/01/24 12:20:30 INFO mapred.JobClient: Total time spent by all
 reduces waiting after reserving slots (ms)=0
 13/01/24 12:20:30 INFO mapred.JobClient: Total time spent by all maps
 waiting after reserving slots (ms)=0
 13/01/24 12:20:30 INFO mapred.JobClient: Launched map tasks=2
 13/01/24 12:20:30 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
 13/01/24 12:20:30 INFO mapred.JobClient:   File Output Format Counters
 13/01/24 12:20:30 INFO mapred.JobClient: Bytes Written=0
 13/01/24 12:20:30 INFO mapred.JobClient:   FileSystemCounters
 13/01/24 12:20:30 INFO mapred.JobClient: HDFS_BYTES_READ=215
 13/01/24 12:20:30 INFO mapred.JobClient: FILE_BYTES_WRITTEN=44010
 13/01/24 12:20:30 INFO mapred.JobClient:   File Input Format Counters
 13/01/24 12:20:30 INFO mapred.JobClient: Bytes Read=0
 13/01/24 12:20:30 INFO mapred.JobClient:   Map-Reduce Framework
 13/01/24 12:20:30 INFO mapred.JobClient: Map input records=0
 13/01/24 12:20:30 INFO mapred.JobClient: Physical memory (bytes)
 snapshot=200056832
 13/01/24 12:20:30 INFO mapred.JobClient: Spilled Records=0
 13/01/24 12:20:30 INFO mapred.JobClient: CPU time spent (ms)=2960
 13/01/24 12:20:30 INFO mapred.JobClient: Total committed heap usage
 (bytes)=247201792
 13/01/24 12:20:30 INFO mapred.JobClient: Virtual memory (bytes)
 snapshot=4457689088
 13/01/24 12:20:30 INFO mapred.JobClient: Map output records=0
 13/01/24 12:20:30 INFO mapred.JobClient: SPLIT_RAW_BYTES=215




Getting hostname (or any environment variable) into *-site.xml files

2012-10-12 Thread Stephen Boesch
Hi,
   We are using a library with hdfs that uses custom properties inside the
*-site.xml files.  Instead of (a) hard-coding or (b) writing a sed script
to update to the local hostnames on each deployed node, is there a
mechanism to use environment variables?

<property>
  <name>custom.property</name>
  <value>${HOSTNAME}</value>
</property>


I have tried this, and out of the box the literal string ${HOSTNAME} is used.
 Does anyone have a recommendation/solution for this?

thanks,

stephen b
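
A sed-based workaround of the sort mentioned in (b) can be a one-liner run on
each node at deploy time; a sketch only (the target file is an assumption):

# Substitute this node's hostname for the placeholder at deploy time
sed -i "s/\${HOSTNAME}/$(hostname -f)/g" /etc/hadoop/conf/core-site.xml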


Re: Anyone successfully run Hadoop in pseudo or cluster model under Cygwin?

2012-04-13 Thread Stephen Boesch
Hi Tim,
Running in cygwin one encounters more bugs and often gets less
support (since fewer people run cygwin) than on native linux
distros.  So if you have a choice I'd not recommend it. If you do not have
a choice, then of course please wait for other responses on this ML.

stephenb

2012/4/12 Tim.Wu china.tim...@gmail.com

  If yes, could you send me an email?  Your prompt reply will be appreciated,
  because I asked two questions in this mailing list in March, but no one
  replied to me.

 Questions are listed in

 http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201203.mbox/%3CCA+2n-oGLocSo3YMUiYzhbtOPzO=11g1rl_b45+y-tggyvzk...@mail.gmail.com%3E

 and

 http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201203.mbox/%3CCA+2n-oERYM15BKb4dc9KETpjNfucqzd-=coecf2sgclj5u7...@mail.gmail.com%3E


 --
 Best,
 WU Pengcheng ( Tim )



Re: UDF compiling

2012-04-12 Thread Stephen Boesch
Hi,
   try adding the directory in which WordCount.class was placed to the
-classpath
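
A sketch of that suggestion, using the jar path from the report below and
adding the current directory (which holds the already-compiled classes) to the
compile classpath:

javac -classpath /usr/lib/hadoop/hadoop-core-0.20.2-cdh3u3.jar:. WordCount.java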
2012/4/12 Barry, Sean F sean.f.ba...@intel.com

 I am trying to compiling a customized WordCount UDF but I get this cannot
 find symbol error when I compile. And I'm not sure how to resolve this
 issue.

 hduser@master:~ javac -classpath
 /usr/lib/hadoop/hadoop-core-0.20.2-cdh3u3.jar WordCount.java
 WordCount.java:24: error: cannot find symbol
conf.setMapperClass(WordMapper.class);
^
  symbol:   class WordMapper
  location: class WordCount
 WordCount.java:25: error: cannot find symbol
conf.setReducerClass(SumReducer.class);
 ^
  symbol:   class SumReducer
  location: class WordCount
 2 errors



 hduser@master:~ ls
  SumReducer.class  WordMapper.class
  SumReducer.java  WordCount.java  WordMapper.java
 hduser@master:~





Re: Hadoop Oppurtunity

2012-02-19 Thread Stephen Boesch
yes please, let's focus on the technical issues

2012/2/18 real great.. greatness.hardn...@gmail.com

 Could we actually create a separate mailing list for Hadoop related jobs?

 On Sun, Feb 19, 2012 at 11:40 AM, larry la...@pssclabs.com wrote:

  Hi:
 
  We are looking for someone to help install and support hadoop clusters.
   We are in Southern California.
 
  Thanks,
 
  Larry Lesser
  PSSC Labs
  (949) 380-7288 Tel.
  la...@pssclabs.com
  20432 North Sea Circle
  Lake Forest, CA 92630
 
 


 --
 Regards,
 R.V.



Re: MRv2 DataNode problem: isBPServiceAlive invoked order of 200K times per second

2011-11-29 Thread Stephen Boesch
Update on this:  I've shut down all the servers multiple times.  Also
cleared the data directories and reformatted the namenode. Restarted it and
the same results: 100% cpu and millions of these calls to isBPServiceAlive.


2011/11/29 Stephen Boesch java...@gmail.com

 I am just trying to get off the ground with MRv2.  The first node (in
 pseudo distributed mode)  is working fine - ran a couple of TeraSort's on
 it.

 The second node has a serious issue with its single DataNode: it consumes
 100% of one of the CPU's.  Looking at it through JVisualVM, there are over
 8 million invocations of isBPServiceAlive in a matter of a minute or so and
  continually incrementing at a steady clip.   A screenshot of the JvisualVM
 cpu profile - showing just shy of 8M invocations is attached.

 What kind of configuration error could lead to this?  The conf/masters and
 conf/slaves simply say localhost.   If need be I'll copy the *-site.xml's.
  They are boilerplate from the Cloudera page by Ahmed Radwan.







Re: MRv2 DataNode problem: isBPServiceAlive invoked order of 200K times per second

2011-11-29 Thread Stephen Boesch
Hi Uma,
   I mentioned that I have restarted the datanode *many *times, and in fact
the entire cluster more than ten times.

2011/11/29 Uma Maheswara Rao G mahesw...@huawei.com

  Looks like you are getting HDFS-2553.

 The cause might be that, you cleared the datadirectories directly without
 DN restart. Workaround would be to restart DNs.



 Regards,

 Uma



 --

  *From:* Stephen Boesch [java...@gmail.com]
 *Sent:* Tuesday, November 29, 2011 8:53 PM
 *To:* mapreduce-user@hadoop.apache.org
 *Subject:* Re: MRv2 DataNode problem: isBPServiceAlive invoked order of
 200K times per second

  Update on this:  I've shut down all the servers multiple times.  Also
 cleared the data directories and reformatted the namenode. Restarted it and
 the same results: 100% cpu and millions of these calls to isBPServiceAlive.


 2011/11/29 Stephen Boesch java...@gmail.com

 I am just trying to get off the ground with MRv2.  The first node (in
 pseudo distributed mode)  is working fine - ran a couple of TeraSort's on
 it.

  The second node has a serious issue with its single DataNode: it
 consumes 100% of one of the CPU's.  Looking at it through JVisualVM, there
 are over 8 million invocations of isBPServiceAlive in a matter of a minute
 or so and  continually incrementing at a steady clip.   A screenshot of the
 JvisualVM cpu profile - showing just shy of 8M invocations is attached.

  What kind of configuration error could lead to this?  The conf/masters
 and conf/slaves simply say localhost.   If need be I'll copy the
 *-site.xml's.  They are boilerplate from the Cloudera page by Ahmed Radwan.








Re: MRv2 DataNode problem: isBPServiceAlive invoked order of 200K times per second

2011-11-29 Thread Stephen Boesch
I verified the DN was down via both jps and java. Anyways,  it was enough
to see via top since as mentioned DN was consuming 100% of one cpu when
running.

2011/11/29 Stephen Boesch java...@gmail.com

 Hi Uma,
I mentioned that I have restarted the datanode *many *times, and in
 fact the entire cluster more than ten times.


 2011/11/29 Uma Maheswara Rao G mahesw...@huawei.com

  Looks like you are getting HDFS-2553.

 The cause might be that, you cleared the datadirectories directly without
 DN restart. Workaround would be to restart DNs.



 Regards,

 Uma



 --

  *From:* Stephen Boesch [java...@gmail.com]
 *Sent:* Tuesday, November 29, 2011 8:53 PM
 *To:* mapreduce-user@hadoop.apache.org
 *Subject:* Re: MRv2 DataNode problem: isBPServiceAlive invoked order of
 200K times per second

  Update on this:  I've shut down all the servers multiple times.  Also
 cleared the data directories and reformatted the namenode. Restarted it and
 the same results: 100% cpu and millions of these calls to isBPServiceAlive.


 2011/11/29 Stephen Boesch java...@gmail.com

 I am just trying to get off the ground with MRv2.  The first node (in
 pseudo distributed mode)  is working fine - ran a couple of TeraSort's on
 it.

  The second node has a serious issue with its single DataNode: it
 consumes 100% of one of the CPU's.  Looking at it through JVisualVM, there
 are over 8 million invocations of isBPServiceAlive in a matter of a minute
 or so and  continually incrementing at a steady clip.   A screenshot of the
 JvisualVM cpu profile - showing just shy of 8M invocations is attached.

  What kind of configuration error could lead to this?  The conf/masters
 and conf/slaves simply say localhost.   If need be I'll copy the
 *-site.xml's.  They are boilerplate from the Cloudera page by Ahmed Radwan.









ProtocolProvider errors On MRv2 Failed to use org.apache.hadoop.mapred.YarnClientProtocolProvider

2011-11-28 Thread Stephen Boesch
Hi
  I set up a pseudo cluster  according to the instructions  here
http://www.cloudera.com/blog/2011/11/building-and-deploying-mr2/.

Initially the randomwriter example worked. But after a crash on the machine
and restarting the services I am getting the errors shown below.

Jps seems to think the processes are running properly:


had@mithril:/shared/hadoop$ jps
7980 JobHistoryServer
7668 NameNode
7821 ResourceManager
7748 DataNode
8021 Jps
7902 NodeManager


$ hadoop jar hadoop-mapreduce-examples-0.23.0.jar randomwriter \
    -Dmapreduce.job.user.name=$USER \
    -Dmapreduce.clientfactory.class.name=org.apache.hadoop.mapred.YarnClientFactory \
    -Dmapreduce.randomwriter.bytespermap=1 -Ddfs.blocksize=64m \
    -Ddfs.block.size=64m \
    -libjars $YARN_HOME/modules/hadoop-mapreduce-client-jobclient-0.23.0.jar output


2011-11-28 10:23:56,102 WARN  conf.Configuration
(Configuration.java:set(629)) - mapred.used.genericoptionsparser is
deprecated. Instead, use mapreduce.client.genericoptionsparser.used
2011-11-28 10:23:56,158 INFO  ipc.YarnRPC (YarnRPC.java:create(47)) -
Creating YarnRPC for org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC
2011-11-28 10:23:56,162 INFO  mapred.ResourceMgrDelegate
(ResourceMgrDelegate.java:init(95)) - Connecting to ResourceManager at /
0.0.0.0:8040
2011-11-28 10:23:56,163 INFO  ipc.HadoopYarnRPC
(HadoopYarnProtoRPC.java:getProxy(48)) - Creating a HadoopYarnProtoRpc
proxy for protocol interface org.apache.hadoop.yarn.api.ClientRMProtocol
2011-11-28 10:23:56,203 INFO  mapred.ResourceMgrDelegate
(ResourceMgrDelegate.java:init(99)) - Connected to ResourceManager at /
0.0.0.0:8040
2011-11-28 10:23:56,248 INFO  mapreduce.Cluster
(Cluster.java:initialize(116)) - Failed to use
org.apache.hadoop.mapred.YarnClientProtocolProvider due to error:
java.lang.reflect.InvocationTargetException
2011-11-28 10:23:56,250 INFO  mapreduce.Cluster
(Cluster.java:initialize(111)) - Cannot pick
org.apache.hadoop.mapred.LocalClientProtocolProvider as the
ClientProtocolProvider - returned null protocol
2011-11-28 10:23:56,251 INFO  mapreduce.Cluster
(Cluster.java:initialize(111)) - Cannot pick
org.apache.hadoop.mapred.JobTrackerClientProtocolProvider as the
ClientProtocolProvider - returned null protocol
java.io.IOException: Cannot initialize Cluster. Please check your
configuration for mapreduce.framework.name and the correspond server
addresses.


My  *-site.xml files are precisely as shown on the instructions page.

In any case copying here the one that is most germane - mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>


Re: ProtocolProvider errors On MRv2 Failed to use org.apache.hadoop.mapred.YarnClientProtocolProvider

2011-11-28 Thread Stephen Boesch
Here is complete output.



2011-11-28 16:34:27,606 WARN  conf.Configuration
(Configuration.java:set(629)) - mapred.used.genericoptionsparser is
deprecated. Instead, use mapreduce.client.genericoptionsparser.used
2011-11-28 16:34:27,660 INFO  ipc.YarnRPC (YarnRPC.java:create(47)) -
Creating YarnRPC for org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC
2011-11-28 16:34:27,663 INFO  mapred.ResourceMgrDelegate
(ResourceMgrDelegate.java:init(95)) - Connecting to ResourceManager at /
0.0.0.0:8040
2011-11-28 16:34:27,664 INFO  ipc.HadoopYarnRPC
(HadoopYarnProtoRPC.java:getProxy(48)) - Creating a HadoopYarnProtoRpc
proxy for protocol interface org.apache.hadoop.yarn.api.ClientRMProtocol
2011-11-28 16:34:27,700 INFO  mapred.ResourceMgrDelegate
(ResourceMgrDelegate.java:init(99)) - Connected to ResourceManager at /
0.0.0.0:8040
2011-11-28 16:34:27,734 INFO  mapreduce.Cluster
(Cluster.java:initialize(116)) - Failed to use
org.apache.hadoop.mapred.YarnClientProtocolProvider due to error:
java.lang.reflect.InvocationTargetException
2011-11-28 16:34:27,735 INFO  mapreduce.Cluster
(Cluster.java:initialize(111)) - Cannot pick
org.apache.hadoop.mapred.LocalClientProtocolProvider as the
ClientProtocolProvider - returned null protocol
2011-11-28 16:34:27,736 INFO  mapreduce.Cluster
(Cluster.java:initialize(111)) - Cannot pick
org.apache.hadoop.mapred.JobTrackerClientProtocolProvider as the
ClientProtocolProvider - returned null protocol
java.io.IOException: Cannot initialize Cluster. Please check your
configuration for mapreduce.framework.name and the correspond server
addresses.
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:123)
at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:85)
at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:78)
at org.apache.hadoop.mapred.JobClient.init(JobClient.java:460)
at org.apache.hadoop.mapred.JobClient.init(JobClient.java:450)
at org.apache.hadoop.examples.RandomWriter.run(RandomWriter.java:246)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
at org.apache.hadoop.examples.RandomWriter.main(RandomWriter.java:294)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:189)


2011/11/28 Stephen Boesch java...@gmail.com

 I had mentioned in the original post that the configuration files were set
 up exactly as in the cloudera post.  That includes the yarn-site.xml.

 but since there are questions about it, i'll go ahead and include those
 below.

 This setup DID work one time, just does not restart properly..



 yarn-site.xml

 <?xml version="1.0"?>
 <configuration>
   <property>
     <name>yarn.nodemanager.aux-services</name>
     <value>mapreduce.shuffle</value>
   </property>
   <property>
     <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
     <value>org.apache.hadoop.mapred.ShuffleHandler</value>
   </property>
 </configuration>


 core-site.xml

 <?xml version="1.0"?>
 <?xml-stylesheet href="configuration.xsl"?>
 <configuration>
   <property>
     <name>fs.default.name</name>
     <value>hdfs://localhost:9000</value>
   </property>
   <property>
     <name>yarn.user</name>
     <value>had</value>
   </property>
 </configuration>

 hdfs-site.xml

 <?xml version="1.0"?>
 <?xml-stylesheet href="configuration.xsl"?>
 <configuration>
   <property>
     <name>dfs.replication</name>
     <value>1</value>
   </property>
   <property>
     <name>dfs.permissions</name>
     <value>false</value>
   </property>
 </configuration>

 mapred-site.xml

 <?xml version="1.0"?>
 <?xml-stylesheet href="configuration.xsl"?>
 <configuration>
   <property>
     <name>mapreduce.framework.name</name>
     <value>yarn</value>
   </property>
 </configuration>

 thx


















 2011/11/28 Stephen Boesch java...@gmail.com

 Yes I did both of those already.


 2011/11/28 Marcos Luis Ortiz Valmaseda marcosluis2...@googlemail.com

 2011/11/28 Stephen Boesch java...@gmail.com:
 
  Hi
I set up a pseudo cluster  according to the instructions  here
   http://www.cloudera.com/blog/2011/11/building-and-deploying-mr2/.
  Initially the randomwriter example worked. But after a crash on the
 machine
  and restarting the services I am getting the errors shown below.
  Jps seems to think the processes are running properly:
 
  had@mithril:/shared/hadoop$ jps
  7980 JobHistoryServer
  7668 NameNode
  7821

Re: MRv2 with other hadoop projects

2011-11-27 Thread Stephen Boesch
thanks,  I looked into it more.  As you mentioned there are patches for PIG
available.  Hive is a WIP:  Tom White and a couple of other core committers
are getting close on the 0.23.0 branch.  Hbase is farther out, with more
work to do.

stephenb


2011/11/27 Mahadev Konar maha...@hortonworks.com

 Stephen,
  Coming up PIG 0.9.2 and 0.10 are supposed to work with 0.23 (mrv2). Hive
 - I am not too sure of, you might want to check their mailing lists.  As
 for HBase, there is some work needed (see: HBASE-4813 and MR-3169) to
 make it work with MRv2.

 Hope that helps.

 thanks
 mahadev

 On Nov 26, 2011, at 5:11 PM, Stephen Boesch wrote:


 Hi,
 Any work out there on using hbase, hive, pig with MRv2?

 thx!

 stephenb





MRv2 with other hadoop projects

2011-11-26 Thread Stephen Boesch
Hi,
Any work out there on using hbase, hive, pig with MRv2?

thx!

stephenb


Re: Gratuitous use of CHMOD in windows

2011-11-21 Thread Stephen Boesch
Hi,
  you'll probably need to be running this under cygwin since windows native
is not supported.

2011/11/21 Steve Lewis lordjoe2...@gmail.com

 I am running a job on a cluster, launching from a windows box and setting
 fs.default.name to point the job to the cluster.
 Everything works until the last step where I say

 FileSystem fileSystem = FileSystem.get(config); // this is hdfs
 fileSystem.copyToLocalFile(src, dst); // local system

 at this point RawLocalFileSystem tries to exec chmod which is not on
 windows.

 This looks like a bug - it seems that there should be a fallback system or
 better yet a call to the
 Java standard File.canWrite command before attempting to alter permissions

 --
 Steven M. Lewis PhD
 4221 105th Ave NE
 Kirkland, WA 98033
 206-384-1340 (cell)
 Skype lordjoe_com





Re: Matrix multiplication in Hadoop

2011-11-19 Thread Stephen Boesch
Hi,
   there are two solutions suggested that take advantage of either (a) a
vector x matrix (your CF / Mahout example) or (b) a small matrix x large
matrix (an earlier suggestion of putting the small matrix into the
Distributed Cache). It is not clear yet what a good approach is for (c), large matrix
x large matrix.


2011/11/19 bejoy.had...@gmail.com

 Hey Mike
  In mahout one place where   matrix multiplication is used is in
  Collaborative Filtering distributed implementation. The recommendations
 here are generated by the multiplication of a cooccurence matrix with a
 user vector. This user vector is treated as a single column matrix and then
 the matrix multiplication takes place in there.

 Regards
 Bejoy K S

 -Original Message-
 From: Mike Spreitzer mspre...@us.ibm.com
 Date: Fri, 18 Nov 2011 14:52:05
 To: common-user@hadoop.apache.org
 Reply-To: common-user@hadoop.apache.org
 Subject: RE: Matrix multiplication in Hadoop

 Well, this mismatch may tell me something interesting about Hadoop. Matrix
 multiplication has a lot of inherent parallelism, so from very crude
 considerations it is not obvious that there should be a mismatch.  Why is
 matrix multiplication ill-suited for Hadoop?

 BTW, I looked into the Mahout documentation some, and did not find matrix
 multiplication there.  It might be hidden inside one of the advertised
 algorithms; I looked at the documentation for a few, but did not notice
 mention of MM.

 Thanks,
 Mike



 From:   Michael Segel michael_se...@hotmail.com
 To: common-user@hadoop.apache.org
 Date:   11/18/2011 01:49 PM
 Subject:RE: Matrix multiplication in Hadoop




 Ok Mike,

 First I admire that you are studying Hadoop.

 To answer your question... not well.

 Might I suggest that if you want to learn Hadoop, you try and find a
 problem which can easily be broken in to a series of parallel tasks where
 there is minimal communication requirements between each task?

 No offense, but if I could draw a parallel... what you're asking is akin
 to taking a normalized relational model and trying to run it as-is in
 HBase.
 Yes it can be done. But not the best use of resources.

  To: common-user@hadoop.apache.org
  CC: common-user@hadoop.apache.org
  Subject: Re: Matrix multiplication in Hadoop
  From: mspre...@us.ibm.com
  Date: Fri, 18 Nov 2011 12:39:00 -0500
 
  That's also an interesting question, but right now I am studying Hadoop
  and want to know how well dense MM can be done in Hadoop.
 
  Thanks,
  Mike
 
 
 
  From:   Michel Segel michael_se...@hotmail.com
  To: common-user@hadoop.apache.org common-user@hadoop.apache.org
  Date:   11/18/2011 12:34 PM
  Subject:Re: Matrix multiplication in Hadoop
 
 
 
  Is Hadoop the best tool for doing large matrix math.
  Sure you can do it, but, aren't there better tools for these types of
  problems?
 
 
  Sent from a remote device. Please excuse any typos...
 
  Mike Segel
 





Namenode in inconsistent state: how to reinitialize the storage directory

2011-10-26 Thread Stephen Boesch
I am relatively new here and am starting up CDH3u1 (on VMware).  The
namenode is not coming up due to the following error:


2011-10-25 22:47:00,547 INFO org.apache.hadoop.hdfs.server.common.Storage:
Cannot access storage directory /var/lib/hadoop-0.20/cache/hadoop/dfs/name
2011-10-25 22:47:00,549 ERROR
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem
initialization failed.
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory
/var/lib/hadoop-0.20/cache/hadoop/dfs/name is in an inconsistent state:
storage directory does not exist or is not accessible.
at
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:305)
at
org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:99)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:358)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:327)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:271)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:465)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1224)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1233)

Now, I first noticed there was a lock file.  So I sudo rm'ed it and
retried, but got the same error.  Then, not knowing what files are required (if any)
to restart, I moved the entire dir aside and created a new empty one.  Here are
both the new dir and the '.sav' copy:


cloudera@cloudera-demo:/usr/lib/hadoop/logs$ ll
/var/lib/hadoop-0.20/cache/hadoop/dfs/name
total 8
drwxr-xr-x 2 hdfs hdfs   4096 2011-10-25 23:11 .
drwxr-xr-x 4 hdfs hadoop 4096 2011-10-25 23:11 ..
cloudera@cloudera-demo:/usr/lib/hadoop/logs$ ll
/var/lib/hadoop-0.20/cache/hadoop/dfs/name.sav
total 20
drwxr-xr-x 2 hdfs hdfs   4096 2011-01-24 15:24 image
drwxr-xr-x 2 hdfs hdfs   4096 2011-09-25 11:49 previous.checkpoint
drwxr-xr-x 2 hdfs hdfs   4096 2011-10-25 21:01 current
drwxr-xr-x 5 hdfs hdfs   4096 2011-10-25 23:02 .
drwxr-xr-x 4 hdfs hadoop 4096 2011-10-25 23:11 ..


So then, any recommendations on how to proceed?


thanks


Re: Namenode in inconsistent state: how to reinitialize the storage directory

2011-10-26 Thread Stephen Boesch
I found a suggestion to reformat the namenode.  In order to do so, I found
it necessary to set the dir to 777.  After that:


$ sudo chmod 777 /var/lib/hadoop-0.20/cache/hadoop/dfs/name

$ ./hadoop namenode -format

(successful)

$ ./hadoop-daemon.sh --config $HADOOP/conf start namenode

(success!)


So.. this leads to a related question:  What gives with these permissions?
Maybe this is cloudera-specific.  I am logged in as the cloudera
user, but these directories have owners/groups with a mix of hadoop,
mapred, hbase, hdfs, etc.  When I look in /etc/passwd and /etc/group there
is no clear indication that cloudera should be able to access files owned by
members of those groups.

Where is there more info about getting the file permissions right when
running the various hadoop services as the cloudera user?

i am on CDH3u1

thx


2011/10/25 Stephen Boesch java...@gmail.com


 I am relatively new here and starting the CDH3u1 (on vmware).   The
 nameserver is not coming up due to the following error:


 2011-10-25 22:47:00,547 INFO org.apache.hadoop.hdfs.server.common.Storage:
 Cannot access storage directory /var/lib/hadoop-0.20/cache/hadoop/dfs/name
 2011-10-25 22:47:00,549 ERROR
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem
 initialization failed.
 org.apache.hadoop.hdfs.server.common.InconsistentFSStateException:
 Directory /var/lib/hadoop-0.20/cache/hadoop/dfs/name is in an inconsistent
 state: storage directory does not exist or is not accessible.
 at
 org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:305)
 at
 org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:99)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:358)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:327)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:271)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:465)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1224)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1233)

 Now, I first noticed there was a lock file.  So I sudo rm'ed it and
 retried. But same error.  Then, not knowing what files are required (if any)
 to restart, I moved the entire dir and created a new empty one.  Here are
 both the new and the 'sav dirs


 cloudera@cloudera-demo:/usr/lib/hadoop/logs$ ll
 /var/lib/hadoop-0.20/cache/hadoop/dfs/name
 total 8
 drwxr-xr-x 2 hdfs hdfs   4096 2011-10-25 23:11 .
 drwxr-xr-x 4 hdfs hadoop 4096 2011-10-25 23:11 ..
 cloudera@cloudera-demo:/usr/lib/hadoop/logs$ ll
 /var/lib/hadoop-0.20/cache/hadoop/dfs/name.sav
 total 20
 drwxr-xr-x 2 hdfs hdfs   4096 2011-01-24 15:24 image
 drwxr-xr-x 2 hdfs hdfs   4096 2011-09-25 11:49 previous.checkpoint
 drwxr-xr-x 2 hdfs hdfs   4096 2011-10-25 21:01 current
 drwxr-xr-x 5 hdfs hdfs   4096 2011-10-25 23:02 .
 drwxr-xr-x 4 hdfs hadoop 4096 2011-10-25 23:11 ..


 So then, any recommendations on how to proceed?


 thanks





Suggestions on hadoop using virtualization on local hardware but can't use Hypervisor

2011-09-07 Thread Stephen Boesch
Hi
   I have been struggling for a long time with how to set up a hadoop test
environment with 6 to 10 nodes.  I have some concerns about using S3/EC2 but
have not ruled it out.

Another direction is that I have set up a higher-end box with the intent of
using virtualization.  The box is a Core i7 2600K with an MSI P67A-GD65 B3
motherboard.  This one apparently does not support either Citrix
XenServer or the VMware hypervisor.  (rats!!)

That puts me back at the 'drawing board'.  But I already have this really
fast box just sitting there..

Any suggestions for other ways to employ this machine to get the job done?

thanks

stephen


Re: Estimating Time required to compute M/Rjob

2011-04-16 Thread Stephen Boesch
You could consider two scenarios / sets of requirements for your estimator:


   1. Allow it to 'learn' from certain input data and then project running
   times of similar (or moderately dissimilar) workloads.  The first step
   could be to define a couple of relatively small control M/R jobs on a
   small-ish dataset and throw them at the unknown (cluster-under-test)
   HDFS / M/R cluster.  Try to design the control M/R jobs so that they
   completely load down all of the available DataNodes in the
   cluster-under-test for at least a brief period of time.  Then you will
   have obtained a decent signal on the capabilities of the cluster under
   test, which may allow a relatively high degree of predictive accuracy
   for even much larger jobs.
   2. If instead your goal were to drive the predictions off of a purely
   mathematical model - in your terms "the application and base file
   system" - without any empirical data, then here is an alternative
   approach:
      - Follow step (1) above against a variety of applications and base
      file systems - especially the configurations for which you wish your
      model to provide high-quality predictions.
      - Save the results in structured data.
      - Derive formulas for characterizing the curves of performance via
      those variables that you defined (application / base file system); a
      small worked sketch of this step is at the end of this message.

Now you have a trained model.  When it is applied to a new set of
applications / base file systems it can use the curves you have already
determined to provide the result without any runtime requirements.

Obviously the value of this second approach is limited by the degree of
similarity between the training data and the applications you attempt to
model.  If all of your training data came from a 50-node cluster of machines
with IDE drives, don't expect good results when asked to model a 1000-node
cluster using SANs / RAIDs / SCSI drives.

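Here is the small worked sketch promised above: an ordinary-least-squares fit
of runtime against input size, the simplest possible "formula for the curve".
The sample numbers are invented for illustration, not measurements; real ones
would come from timing the control jobs on the cluster under test.

// Hedged sketch only: fit T(x) = a + b*x to a few control-job timings with
// ordinary least squares, then project a larger job.
public class RuntimeEstimator {
  public static void main(String[] args) {
    double[] inputGb = {1, 2, 4, 8};           // control-job input sizes (hypothetical)
    double[] seconds = {95, 160, 290, 560};    // measured wall-clock times (hypothetical)

    int n = inputGb.length;
    double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
    for (int i = 0; i < n; i++) {
      sumX += inputGb[i];
      sumY += seconds[i];
      sumXY += inputGb[i] * seconds[i];
      sumXX += inputGb[i] * inputGb[i];
    }
    double b = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX); // seconds per GB
    double a = (sumY - b * sumX) / n;                                 // fixed job overhead

    double projectedGb = 100;                  // a much larger hypothetical job
    System.out.printf("T(x) = %.1f + %.1f * x seconds (x in GB)%n", a, b);
    System.out.printf("Projected runtime for %.0f GB: about %.0f seconds%n",
        projectedGb, a + b * projectedGb);
  }
}

In practice you would want more than one predictor (mapper/reducer counts,
shuffle volume, and so on), but the mechanics of deriving and then applying
the formula are the same.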

2011/4/16 Sonal Goyal sonalgoy...@gmail.com

 What is your MR job doing? What is the amount of data it is processing?
 What
 kind of a cluster do you have? Would you be able to share some details
 about
 what you are trying to do?

 If you are looking for metrics, you could look at the Terasort run ..

 Thanks and Regards,
 Sonal
 https://github.com/sonalgoyal/hihoHadoop ETL and Data
 Integrationhttps://github.com/sonalgoyal/hiho
 Nube Technologies http://www.nubetech.co

 http://in.linkedin.com/in/sonalgoyal





 On Sat, Apr 16, 2011 at 3:31 PM, real great..
 greatness.hardn...@gmail.comwrote:

  Hi,
  As a part of my final year BE final project I want to estimate the time
  required by a M/R job given an application and a base file system.
  Can you folks please help me by posting some thoughts on this issue or
  posting some links here.
 
  --
  Regards,
  R.V.
 



Re: Estimating Time required to compute M/Rjob

2011-04-16 Thread Stephen Boesch
Some additional thoughts about the 'variables' involved in
characterizing the M/R application itself:


   - the configuration of the cluster for numbers of mappers vs reducers,
   compared to the characteristics (amount of work/processing) required in
   each of the map/shuffle/reduce stages


   - is the application using multiple chained M/R stages?  Multi-stage
   M/R's are more difficult to tune properly in terms of keeping all
   workers busy, and that may be challenging to model.

2011/4/16 Stephen Boesch java...@gmail.com

 You could consider two scenarios / sets of requirements for your estimator:


 1. Allow it to 'learn' from certain input data and then project running
 times of similar (or moderately dissimilar) workloads.  The first step
 could be to define a couple of relatively small control M/R jobs on a
 small-ish dataset and throw them at the unknown (cluster-under-test)
 HDFS / M/R cluster.  Try to design the control M/R jobs so that they
 completely load down all of the available DataNodes in the
 cluster-under-test for at least a brief period of time.  Then you will
 have obtained a decent signal on the capabilities of the cluster under
 test, which may allow a relatively high degree of predictive accuracy
 for even much larger jobs.
 2. If instead your goal were to drive the predictions off of a purely
 mathematical model - in your terms "the application and base file
 system" - without any empirical data, then here is an alternative
 approach:
    - Follow step (1) above against a variety of applications and base
    file systems - especially the configurations for which you wish your
    model to provide high-quality predictions.
    - Save the results in structured data.
    - Derive formulas for characterizing the curves of performance via
    those variables that you defined (application / base file system).

 Now you have a trained model.  When it is applied to a new set of
 applications / base file systems it can use the curves you have already
 determined to provide the result without any runtime requirements.

 Obviously the value of this second approach is limited by the degree of
 similarity between the training data and the applications you attempt to
 model.  If all of your training data came from a 50-node cluster of machines
 with IDE drives, don't expect good results when asked to model a 1000-node
 cluster using SANs / RAIDs / SCSI drives.


 2011/4/16 Sonal Goyal sonalgoy...@gmail.com

 What is your MR job doing? What is the amount of data it is processing?
 What
 kind of a cluster do you have? Would you be able to share some details
 about
 what you are trying to do?

 If you are looking for metrics, you could look at the Terasort run ..

 Thanks and Regards,
 Sonal
 https://github.com/sonalgoyal/hihoHadoop ETL and Data
 Integrationhttps://github.com/sonalgoyal/hiho
 Nube Technologies http://www.nubetech.co

 http://in.linkedin.com/in/sonalgoyal





 On Sat, Apr 16, 2011 at 3:31 PM, real great..
 greatness.hardn...@gmail.comwrote:

  Hi,
  As a part of my final year BE final project I want to estimate the time
  required by a M/R job given an application and a base file system.
  Can you folks please help me by posting some thoughts on this issue or
  posting some links here.
 
  --
  Regards,
  R.V.
 





Re: Birthday Calendar

2011-04-11 Thread Stephen Boesch
Forum moderator: pls mark emails from this user as spam.

2011/4/10 Tiru Murugan veera.tirumurugan...@gmail.com

 Hi

 I am creating a birthday calendar of all my friends and family.  Can you
 please click on the link below to enter your birthday for me?


 http://www.birthdayalarm.com/bd2/86124361a478169686b1536202358c505931142d1386

 Thanks,
 Tiru



Re: running local hadoop job in windows

2011-04-05 Thread Stephen Boesch
Another approach you may well have already considered.. but may reconsider:
use the (free version of) VMware Player running on your Windows env (host)
and install a Linux distro of your choice as the guest OS.  You can spin up
essentially any number of instances that way.

.. and not be concerned about the configuration/behavioral discrepancies
between cygwin and native Linux.

If you wish, there are pre-packaged VMware images for the Cloudera, Apache,
and Yahoo! distributions.



2011/4/5 Tish Heyssel tish...@gmail.com

 Mark,

 Make sure you add the cygwin/bin to your global PATH variable in Windows
 too... and echo the PATH, if you're running in a command window to make
 sure
 it shows up there...  When running through eclipse, it should pick up the
 PATH variable.

 Good luck.  its worth the trouble.  this does work.

 tish

 On Mon, Apr 4, 2011 at 10:13 PM, Mark Kerzner markkerz...@gmail.com
 wrote:

  I understand now, I need to install cygwin correctly, asking it for all
 the
  right options.
 
  Thank you,
  Mark
 
  On Mon, Apr 4, 2011 at 9:06 PM, Lance Norskog goks...@gmail.com wrote:
 
   You're stuck with cygwin! Hadoop insists on running the 'chmod'
   program. You have to have a binary in your search path.
  
   Lance
  
   On Sat, Mar 19, 2011 at 9:15 PM, Mark Kerzner markkerz...@gmail.com
   wrote:
Now I AM running under cygwin, and I get the same error, as you can
 see
   from
the attached screenshot.
Thank you,
Mark
   
On Sat, Mar 19, 2011 at 9:16 PM, Simon gsmst...@gmail.com wrote:
   
As far as I know, currently hadoop can only run under *nix like
  systems.
Correct me if I am wrong.
And if you want to run it under windows, you can try cygwin as the
environment.
   
Thanks
Simon
   
On Fri, Mar 18, 2011 at 7:11 PM, Mark Kerzner 
 markkerz...@gmail.com
wrote:
   
 No, I hoped that it is not absolutely necessary for that kind of
  use.
   I
 am
 not even issuing the hadoop -jar command, but it is pure java
   -jar.
 It
 is true though that my Ubuntu has a Hadoop set up, so maybe it is
   doing
 a
 lot of magic behind my back.

 I did not want to have my inexperienced Windows users to have to
   install
 cygwin for just trying the package.

 Thank you,
 Mark

 On Fri, Mar 18, 2011 at 6:06 PM, Stephen Boesch 
 java...@gmail.com
 wrote:

  presumably you ran this under cygwin?
 
  2011/3/18 Mark Kerzner markkerz...@gmail.com
 
   Hi, guys,
  
   I want to give my users a sense of what my hadoop application
  can
   do,
 and
  I
   am trying to make it run in Windows, with this command
  
   java -jar dist\FreeEed.jar
  
   This command runs my hadoop job locally, and it works in
 Linux.
 However,
  in
   Windows I get the error listed below. Since I am running
   completely
   locally,
   I don't see why it is trying to do what it does. Is there a
   workaround?
  
   Thank you,
   Mark
  
   Error:
  
   11/03/18 17:57:43 INFO jvm.JvmMetrics: Initializing JVM
 Metrics
   with
   processName
   =JobTracker, sessionId=
   java.io.IOException: Failed to set permissions of path:
   file:/tmp/hadoop-Mark/ma
   pred/staging/Mark-1397630897/.staging to 0700
  at
  
  
   org.apache.hadoop.fs.RawLocalFileSystem.checkReturnValue(RawLocalFile
   System.java:526)
  at
  
  
   org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSys
   tem.java:500)
  at
  
  
   org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.jav
   a:310)
  at
  
  
   org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:18
   9)
  at
  
  
   org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmi
   ssionFiles.java:116)
  at
   org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:799)
  at
   org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:793)
  at java.security.AccessController.doPrivileged(Native
   Method)
  at javax.security.auth.Subject.doAs(Unknown Source)
  at
  
  
   org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInforma
   tion.java:1063)
  at
  
  
   org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:7
   93)
  at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
  at
 org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:495)
  at
   org.frd.main.FreeEedProcess.run(FreeEedProcess.java:66)
  at
   org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at
   org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
  at
   org.frd.main.FreeEedProcess.main

location for mapreduce next generation branch

2011-03-28 Thread Stephen Boesch
Looking under http://svn.apache.org/repos/asf/hadoop/mapreduce/branches/  it
does not seem to be present.

pointers to correct location appreciated.


Re: How to insert some print codes into Hadoop?

2011-03-23 Thread Stephen Boesch
Keep in mind, if you go that direction, that AspectJ has limited support in
some IDEs (e.g. IntelliJ), which can complicate development.

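For what it's worth, a minimal annotation-style aspect along the lines of the
suggestion quoted below might look like the following. The pointcut target
(FSNamesystem) is only an example, and the weaving step (compiling with ajc
against the hadoop jar, or load-time weaving with aspectjweaver) is left out:

// Hedged sketch only: prints a line on entry to every public FSNamesystem
// method, without touching Hadoop source.  Requires aspectjrt on the classpath
// and binary or load-time weaving of the hadoop jar to take effect.
import org.aspectj.lang.JoinPoint;
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.annotation.Before;

@Aspect
public class NameNodeTraceAspect {

  @Before("execution(public * org.apache.hadoop.hdfs.server.namenode.FSNamesystem.*(..))")
  public void logEntry(JoinPoint jp) {
    // Keep the output cheap; this fires on every matched call.
    System.out.println("ENTER " + jp.getSignature().toShortString());
  }
}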
2011/3/23 Konstantin Boudnik c...@apache.org

 [Moving to common-user@, Bcc'ing general@]

 If you know where you need to have your print statements you can use
 AspectJ to do runtime injection of needed java code into desirable
 spots. You don't need to even touch the source code for that - just
 instrument (weave) the jar file
 --
   Take care,
 Konstantin (Cos) Boudnik

 On Tue, Mar 22, 2011 at 09:19, Bo Sang sampl...@gmail.com wrote:
  Hi, guys:
 
  I would like to do some minor modification of Hadoop (just to insert some
  print codes into particular places). And I have the following questions:
 
  1. It seems there are three parts of Hadoop: common, hdfs, mapred. And
 they
  are packed as three independent jar packages. Could I only modify one
 part
  (eg, common) and pack a new jar package without modifying the rest two.
 
  2. I have try to import the folder hadoop-0.21.0/common into eclipse as a
  project. But eclipse fails to recognize it as an existing project. But if
 I
  import folder hadoop-0.21.0 as a existing project, it works. However, I
 only
  want to modify common part. How could I only modify  common part and
 export
  a new common jar package without modifying the rest two parts.
 
  --
  Best Regards!
 
  Sincerely
  Bo Sang
 



Re: Hadoop Jar

2011-03-22 Thread Stephen Boesch
Hola Miguel,

Wondering what Main method you are referring to that is "defined to run".
Can you be specific?

2011/3/22 Miguel Costa miguel-co...@telecom.pt

 I already fixed,



 I had the Main class defined on my project .



 Even if I execute hadoop –jar myyjar.jar mymainClass

 mymainClass is never executed.

 The project cannot have a Main method defined to run if it has more than one
 and we want to run different Main methods.



 Miguel





 *From:* Miguel Costa [mailto:miguel-co...@telecom.pt]
 *Sent:* Tuesday, March 22, 2011 16:18
 *To:* common-user@hadoop.apache.org
 *Subject:* Hadoop Jar



 Hi,



 I’m trying to execute a jar file from ./bin/hadoop jar /mypath/myjar
 mymainclass



 But the jar that is executed is an old jar.

 If I execute the jar from java –cp myjar maymainclass it runs fine.



 What am I doing wrong?



 Thanks,



 Miguel











Re: running local hadoop job in windows

2011-03-18 Thread Stephen Boesch
presumably you ran this under cygwin?

2011/3/18 Mark Kerzner markkerz...@gmail.com

 Hi, guys,

 I want to give my users a sense of what my hadoop application can do, and I
 am trying to make it run in Windows, with this command

 java -jar dist\FreeEed.jar

 This command runs my hadoop job locally, and it works in Linux. However, in
 Windows I get the error listed below. Since I am running completely
 locally,
 I don't see why it is trying to do what it does. Is there a workaround?

 Thank you,
 Mark

 Error:

 11/03/18 17:57:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with
 processName
 =JobTracker, sessionId=
 java.io.IOException: Failed to set permissions of path:
 file:/tmp/hadoop-Mark/ma
 pred/staging/Mark-1397630897/.staging to 0700
at
 org.apache.hadoop.fs.RawLocalFileSystem.checkReturnValue(RawLocalFile
 System.java:526)
at
 org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSys
 tem.java:500)
at
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.jav
 a:310)
at
 org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:18
 9)
at
 org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmi
 ssionFiles.java:116)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:799)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:793)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Unknown Source)
at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInforma
 tion.java:1063)
at
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:7
 93)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:495)
at org.frd.main.FreeEedProcess.run(FreeEedProcess.java:66)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.frd.main.FreeEedProcess.main(FreeEedProcess.java:71)
at org.frd.main.FreeEedMain.runProcessing(FreeEedMain.java:88)
at org.frd.main.FreeEedMain.processOptions(FreeEedMain.java:65)
at org.frd.main.FreeEedMain.main(FreeEedMain.java:31)