RE: Experience with Hadoop in production

2012-02-24 Thread GOEKE, MATTHEW (AG/1000)
I would add that it also depends on how thoroughly you have vetted your use 
cases. If you have already ironed out how ad-hoc access works, Kerberos vs. 
firewall and network segmentation, how code submission works, procedures for 
various operational issues, backup of your data, etc. (the list is a couple 
hundred bullets long at minimum...) on your current cluster, then there might be 
little need for that support. However, if you are still hoping to figure that 
out, you could potentially be in a world of hurt when you attempt the 
transition with just your own staff. It also helps to have outside advice in 
certain situations to resolve cross-department conflicts over how the cluster 
will be implemented :)

Matt

-Original Message-
From: Mike Lyon [mailto:mike.l...@gmail.com] 
Sent: Thursday, February 23, 2012 2:33 PM
To: common-user@hadoop.apache.org
Subject: Re: Experience with Hadoop in production

Just be sure you have that corporate card available 24x7 when you need
to call support ;)

Sent from my iPhone

On Feb 23, 2012, at 10:30, Serge Blazhievsky wrote:

> What I have seen companies often do is use the free version of a
> commercial vendor's distribution and only get their support if there are
> major problems that they cannot solve on their own.
>
>
> That way you will get a free distribution and the assurance that you have
> support if something goes wrong.
>
>
> Serge
>
> On 2/23/12 10:42 AM, "Jamack, Peter"  wrote:
>
>> A lot of it depends on your staff and their experiences.
>> Maybe they don't have Hadoop experience, but if they were involved with large
>> databases, data warehouses, etc., they can utilize their skills & experience
>> and provide a lot of help.
>> If you have Linux admins, system admins, and network admins with years of
>> experience, they will be a goldmine. At the other end, database
>> developers who know SQL, programmers who know Java, and so on can really
>> help staff up your 'big data' team. Having a few people who know ETL would
>> be great too.
>>
>> The biggest problem I've run into seems to be how big the Hadoop
>> project/team is or is not. Sometimes it's just an 'experimental'
>> department and therefore half the people are only 25-50 percent available
>> to help out.  And if they aren't really that knowledgeable about hadoop,
>> it tends to be one of those, not enough time in the day scenarios.  And
>> the few people dedicated to the Hadoop project(s) will get the brunt of
>> the work.
>>
>> It's like any ecosystem.  To do it right, you might need system/network
>> admins, a storage person to actually know how to set up the proper storage
>> architecture, maybe a security expert,  a few programmers, and a few data
>> people.   If you're combining analytics, that's another group.  Of course,
>> most companies outside the Googles and Facebooks of the world will have only a
>> few people dedicated to Hadoop.  Which means you need somebody who knows
>> storage, knows networking, knows Linux, knows how to be a system admin,
>> knows security, and maybe other things (AKA if you have a firewall issue,
>> somebody needs to figure out ways to make it work through or around), and
>> then you need some programmers who either know MapReduce or can pretty much
>> figure it out because they've done Java for years.
>>
>> Peter J
>>
>> On 2/23/12 10:17 AM, "Pavel Frolov"  wrote:
>>
>>> Hi,
>>>
>>> We are going into 24x7 production soon and we are considering whether we
>>> need vendor support or not.  We use a free vendor distribution of Cluster
>>> Provisioning + Hadoop + HBase and looked at their Enterprise version but
>>> it
>>> is very expensive for the value it provides (additional functionality +
>>> support), given that we've already ironed out many of our performance and
>>> tuning issues on our own and with generous help from the community (e.g.
>>> all of you).
>>>
>>> So, I wanted to run it through the community to see if anybody can share
>>> their experience of running a Hadoop cluster (50+ nodes with Apache
>>> releases or Vendor distributions) in production, with in-house support
>>> only, and how difficult it was.  How many people were involved, etc..
>>>
>>> Regards,
>>> Pavel
>>
>

RE: Tom White's book, 2nd ed. Which API?

2012-02-06 Thread GOEKE, MATTHEW (AG/1000)
I haven't gotten a chance to look at the rough cut of the 3rd edition out on 
Safari right now, but what are the main differences between it and the 2nd 
edition?

-Original Message-
From: Russell Jurney [mailto:russell.jur...@gmail.com] 
Sent: Monday, February 06, 2012 3:35 PM
To: common-user@hadoop.apache.org
Subject: Re: Tom White's book, 2nd ed. Which API?

Or get O'Reilly Safari, which would get you both?

On Feb 6, 2012, at 9:34 AM, Richard Nadeau  wrote:

> If you're looking to buy the 2nd edition you might want to wait; the third
> edition is in the works now.
> 
> Regards,
> Rick
> On Feb 6, 2012 10:24 AM, "W.P. McNeill"  wrote:
> 
>> The second edition of Tom White's *Hadoop: The Definitive Guide* uses the
>> old API for its examples, though it does contain a brief two-page overview
>> of the new API.
>> 
>> The first edition is all old API.
>> 


RE: Large server recommendations

2011-12-15 Thread GOEKE, MATTHEW (AG/1000)
mapred.map.tasks is a suggestion to the engine, and there is really no reason to 
define it, as it will be driven by the block-level partitioning of your files 
(e.g. a file that spans 30 blocks will by default spawn 30 map tasks). As for 
mapred.reduce.tasks, just set it to whatever you set your 
mapred.tasktracker.reduce.tasks.maximum to (the reasoning being that you are 
running all of this on a single tasktracker, so these two should essentially 
line up).

By now I should be able to answer whether those are JT-level vs TT-level 
parameters, but I have heard one thing and personally experienced another, so I 
will leave that answer to someone who can confirm 100%. Either way, I would 
recommend that your JT and TT site files not deviate from each other for 
clarity, but you can change mapred.reduce.tasks at the app level, so if you have 
something that needs a global sort order you can invoke it with 
mapred.reduce.tasks=1 using a job-level conf, as in the sketch below.
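
A minimal sketch of that job-level override against the old 0.20 mapred API
(the class name, job name and paths below are placeholders, not anything from
this thread):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class GlobalSortExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(GlobalSortExample.class);
    conf.setJobName("global-sort-example");

    // Job-level override: a single reducer yields one output file, totally
    // ordered by whatever key the map phase emits. The cluster-wide values
    // in the JT/TT mapred-site.xml files are left untouched.
    conf.setNumReduceTasks(1);

    // The old API defaults to identity map/reduce and TextInputFormat, so
    // with no mapper/reducer set the map output is <LongWritable, Text>.
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path("/user/demo/in"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/demo/out"));
    JobClient.runJob(conf);
  }
}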

Matt

From: Dale McDiarmid [mailto:d...@ravn.co.uk]
Sent: Thursday, December 15, 2011 3:58 PM
To: common-user@hadoop.apache.org
Cc: GOEKE, MATTHEW [AG/1000]
Subject: Re: Large server recommendations

Thanks Matt,
Assuming therefore I run a single tasktracker and have 48 cores available,
based on your recommendation of 2:1 mapper to reducer threads I will be
assigning:


mapred.tasktracker.map.tasks.maximum=30

mapred.tasktracker.reduce.tasks.maximum=15
This brings me onto my question:

"Can i confirm mapred.map.tasks and mapred.reduce.tasks are these JobTracker 
parameters? The recommendation for these settings seems to related to the 
number of task trackers. In my architecture, i have potentially only 1 if a 
single task tracker can only be configured on each host. What should i set 
these values to therefore considering the box spec?"

I have read:

mapred.local.tasks = 10x of task trackers
mapred.reduce.tasks=2x task trackers

Given I have a single task tracker with multiple concurrent processes, does 
this equate to:

mapred.local.tasks =300?
mapred.reduce.tasks=30?

Some reasoning behind these values appreciated...


appreciate this is a little simplified and we will need to profile. Just 
looking for a sensible starting position.
Thanks
Dale


On 15/12/2011 16:43, GOEKE, MATTHEW (AG/1000) wrote:

Dale,

Talking solely about Hadoop core, you will only need to run 4 daemons on that 
machine: Namenode, Jobtracker, Datanode and Tasktracker. There is no reason to 
run multiple of any of them, as the tasktracker will spawn multiple child JVMs, 
which is where you will get your task parallelism. When you set your 
mapred.tasktracker.map.tasks.maximum and 
mapred.tasktracker.reduce.tasks.maximum configurations you limit the upper 
bound of child JVM creation, but this needs to be configured based on job 
profile (I don't know much about Mahout, but traditionally I set up clusters 
as 2:1 mappers to reducers until the profile proves otherwise). If you look at 
blogs / archives you will see that you can assign 1 child task per *logical* 
core (e.g. hyper-threaded core), and to be safe you will want 1 daemon per 
*physical* core, so you can divvy it up based on that recommendation.

To summarize the above: if you are sharing the same IO pipe / box then there is 
no reason to have multiple daemons running, because you are not really gaining 
anything from that level of granularity. Others might disagree based on 
virtualization, but in your case I would say save yourself the headache and keep 
it simple.

Matt

-Original Message-
From: Dale McDiarmid [mailto:d...@ravn.co.uk]
Sent: Thursday, December 15, 2011 1:50 PM
To: common-user@hadoop.apache.org
Subject: Large server recommendations

Hi all
I'm new to the community and to Hadoop, and was looking for some advice on 
optimal configurations for very large servers. I have a single server with 
48 cores and 512GB of RAM and am looking to perform an LDA analysis using 
Mahout across approx 180 million documents. I have configured my namenode 
and job tracker. My questions are primarily around the optimal number of 
tasktrackers and datanodes. I have had no issues configuring multiple 
datanodes, each of which could potentially utilise its own disk location 
(the underlying disk is SAN - solid state).

However, from my reading the typical architecture for hadoop is a larger 
number of smaller nodes with a single tasktracker on each host. Could 
someone please clarify the following:

1. Can multiple task trackers be run on a single host? If so, how is 
this configured as it doesn't seem possible to control the host:port.

2. Can I confirm that mapred.map.tasks and mapred.reduce.tasks are 
JobTracker parameters? The recommendation for these settings seems to 
relate to the number of task trackers. In my architecture, I potentially 
have only 1 if a single task tracker can only be configured on each host. 
What should I set these values to, therefore, considering the box spec?

RE: Large server recommendations

2011-12-15 Thread GOEKE, MATTHEW (AG/1000)
Dale,

Talking solely about Hadoop core, you will only need to run 4 daemons on that 
machine: Namenode, Jobtracker, Datanode and Tasktracker. There is no reason to 
run multiple of any of them, as the tasktracker will spawn multiple child JVMs, 
which is where you will get your task parallelism. When you set your 
mapred.tasktracker.map.tasks.maximum and 
mapred.tasktracker.reduce.tasks.maximum configurations you limit the upper 
bound of child JVM creation, but this needs to be configured based on job 
profile (I don't know much about Mahout, but traditionally I set up clusters 
as 2:1 mappers to reducers until the profile proves otherwise). If you look at 
blogs / archives you will see that you can assign 1 child task per *logical* 
core (e.g. hyper-threaded core), and to be safe you will want 1 daemon per 
*physical* core, so you can divvy it up based on that recommendation.

To summarize the above: if you are sharing the same IO pipe / box then there is 
no reason to have multiple daemons running, because you are not really gaining 
anything from that level of granularity. Others might disagree based on 
virtualization, but in your case I would say save yourself the headache and keep 
it simple. A quick way to sanity-check the slot settings the daemons would 
actually pick up is sketched below.
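
Hedged illustration only - this assumes the 0.20-era property names above and
that the cluster's core-site.xml / mapred-site.xml are on the classpath; the
fallback value of 2 is simply the stock default:

import org.apache.hadoop.mapred.JobConf;

public class SlotCheck {
  public static void main(String[] args) {
    // JobConf pulls core-site.xml and mapred-site.xml off the classpath, so
    // this prints the slot limits the tasktracker on this box would pick up.
    JobConf conf = new JobConf();
    System.out.println("map slots    = "
        + conf.getInt("mapred.tasktracker.map.tasks.maximum", 2));
    System.out.println("reduce slots = "
        + conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2));
  }
}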

Matt

-Original Message-
From: Dale McDiarmid [mailto:d...@ravn.co.uk] 
Sent: Thursday, December 15, 2011 1:50 PM
To: common-user@hadoop.apache.org
Subject: Large server recommendations

Hi all
I'm new to the community and to Hadoop, and was looking for some advice on 
optimal configurations for very large servers.  I have a single server 
with 48 cores and 512GB of RAM and am looking to perform an LDA analysis 
using Mahout across approx 180 million documents.  I have configured my 
namenode and job tracker.  My questions are primarily around the optimal 
number of tasktrackers and datanodes.  I have had no issues configuring 
multiple datanodes, each of which could potentially utilise its own 
disk location (the underlying disk is SAN - solid state).

However, from my reading the typical architecture for hadoop is a larger 
number of smaller nodes with a single tasktracker on each host.  Could 
someone please clarify the following:

1. Can multiple task trackers be run on a single host? If so, how is 
this configured as it doesn't seem possible to control the host:port.

2. Can I confirm that mapred.map.tasks and mapred.reduce.tasks are JobTracker 
parameters? The recommendation for these settings seems to relate to 
the number of task trackers.  In my architecture, I potentially have 
only 1 if a single task tracker can only be configured on each host.  
What should I set these values to, therefore, considering the box spec?

3. I noticed the parameters mapred.tasktracker.map.tasks.maximum and 
mapred.tasktracker.reduce.tasks.maximum - do these control the number of 
JVM processes spawned to handle the respective steps? Is a tasktracker 
with 48 configured equivalent to 48 task trackers with a value of 1 
configured for these values?

4. Benefits of a large number of datanodes on a single large server? I 
can see value where the host has multiple IO interfaces and disk sets to 
avoid IO contention. In my case, however, a SAN negates this.  Are there 
still benefits of multiple datanodes outside of resiliency and potential 
increase of data transfer i.e. assuming a single data node is limited 
and single threaded?

5. Any other thoughts/recommended settings?

Thanks
Dale

RE: HDFS Explained as Comics

2011-11-30 Thread GOEKE, MATTHEW (AG/1000)
Maneesh,

Firstly, I love the comic :)

Secondly, I am inclined to agree with Prashant on this latest point. While one 
code path could take us through the user defining command-line overrides (e.g. 
hadoop fs -D blah -put foo bar), I think it might confuse a person new to 
Hadoop. The most common flow would be using admin-determined values from 
hdfs-site.xml, and the only thing that would need to change is that the 
conversation happens between client / server rather than user / client; a rough 
sketch of the two paths is below.
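
Hedged sketch of those two paths (API as of 0.20; the path names and numbers
are made up purely for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up hdfs-site.xml defaults
    FileSystem fs = FileSystem.get(conf);

    // Common flow: the client silently uses the admin-set dfs.block.size and
    // dfs.replication from the configuration chain Maneesh describes below.
    FSDataOutputStream out1 = fs.create(new Path("/tmp/defaults.txt"));
    out1.close();

    // Explicit flow: a caller *can* override per file via the long-form create().
    FSDataOutputStream out2 = fs.create(new Path("/tmp/custom.txt"),
        true,                 // overwrite
        4096,                 // io buffer size
        (short) 2,            // replication factor
        128L * 1024 * 1024);  // block size in bytes
    out2.close();
  }
}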

Matt

-Original Message-
From: Prashant Kommireddi [mailto:prash1...@gmail.com] 
Sent: Wednesday, November 30, 2011 3:28 PM
To: common-user@hadoop.apache.org
Subject: Re: HDFS Explained as Comics

Sure, it's just a case of how readers interpret it:

   1. The client is required to specify the block size and replication factor
   each time.
   2. The client does not need to worry about it, since an admin has set the
   properties in the default configuration files.

A client would not be allowed to override the default configs if they are
set final (well, there are ways to get around it, as you suggest, by
using create() :)

The information is great and helpful. I just want to make sure a beginner who
wants to write a "WordCount" in MapReduce does not worry about specifying
block size and replication factor in his code.

Thanks,
Prashant

On Wed, Nov 30, 2011 at 1:18 PM, maneesh varshney wrote:

> Hi Prashant
>
> Others may correct me if I am wrong here..
>
> The client (org.apache.hadoop.hdfs.DFSClient) has a knowledge of block size
> and replication factor. In the source code, I see the following in the
> DFSClient constructor:
>
>defaultBlockSize = conf.getLong("dfs.block.size", DEFAULT_BLOCK_SIZE);
>
>defaultReplication = (short) conf.getInt("dfs.replication", 3);
>
> My understanding is that the client considers the following chain for the
> values:
> 1. Manual values (the long form constructor; when a user provides these
> values)
> 2. Configuration file values (these are cluster level defaults:
> dfs.block.size and dfs.replication)
> 3. Finally, the hardcoded values (DEFAULT_BLOCK_SIZE and 3)
>
> Moreover, in org.apache.hadoop.hdfs.protocol.ClientProtocol the API to
> create a file is
> void create(..., short replication, long blocksize);
>
> I presume it means that the client already has knowledge of these values
> and passes them to the NameNode when creating a new file.
>
> Hope that helps.
>
> thanks
> -Maneesh
>
> On Wed, Nov 30, 2011 at 1:04 PM, Prashant Kommireddi  >wrote:
>
> > Thanks Maneesh.
> >
> > Quick question, does a client really need to know Block size and
> > replication factor - A lot of times client has no control over these (set
> > at cluster level)
> >
> > -Prashant Kommireddi
> >
> > On Wed, Nov 30, 2011 at 12:51 PM, Dejan Menges  > >wrote:
> >
> > > Hi Maneesh,
> > >
> > > Thanks a lot for this! Just distributed it over the team and comments
> are
> > > great :)
> > >
> > > Best regards,
> > > Dejan
> > >
> > > On Wed, Nov 30, 2011 at 9:28 PM, maneesh varshney  > > >wrote:
> > >
> > > > For your reading pleasure!
> > > >
> > > > PDF 3.3MB uploaded at (the mailing list has a cap of 1MB
> attachments):
> > > >
> > > >
> > >
> >
> https://docs.google.com/open?id=0B-zw6KHOtbT4MmRkZWJjYzEtYjI3Ni00NTFjLWE0OGItYTU5OGMxYjc0N2M1
> > > >
> > > >
> > > > Appreciate if you can spare some time to peruse this little
> experiment
> > of
> > > > mine to use Comics as a medium to explain computer science topics.
> This
> > > > particular issue explains the protocols and internals of HDFS.
> > > >
> > > > I am eager to hear your opinions on the usefulness of this visual
> > medium
> > > to
> > > > teach complex protocols and algorithms.
> > > >
> > > > [My personal motivations: I have always found text descriptions to be
> > too
> > > > verbose as lot of effort is spent putting the concepts in proper
> > > time-space
> > > > context (which can be easily avoided in a visual medium); sequence
> > > diagrams
> > > > are unwieldy for non-trivial protocols, and they do not explain
> > concepts;
> > > > and finally, animations/videos happen "too fast" and do not offer
> > > > self-paced learning experience.]
> > > >
> > > > All forms of criticisms, comments (and encouragements) welcome :)
> > > >
> > > > Thanks
> > > > Maneesh
> > > >
> > >
> >
>

RE: "No HADOOP COMMON HOME set."

2011-11-17 Thread GOEKE, MATTHEW (AG/1000)
Jay,

Did you download stable (0.20.203.X) or 0.23? From what I can tell, after 
looking in the tarball for 0.23, it is a different setup than 0.20 (e.g. 
hadoop-env.sh doesn't exist anymore and is replaced by yarn-env.sh), and the 
documentation you referenced below is for setting up 0.20.

I would suggest you go back and download stable and then the setup 
documentation you are following will make a lot more sense :)

Matt

-Original Message-
From: Jay Vyas [mailto:jayunit...@gmail.com] 
Sent: Thursday, November 17, 2011 2:07 PM
To: common-user@hadoop.apache.org
Subject: "No HADOOP COMMON HOME set."

Hi guys: I followed the exact directions on the hadoop installation guide
for pseudo-distributed mode
here
http://hadoop.apache.org/common/docs/current/single_node_setup.html#Configuration

However, I get that several environment variables are not set (for
example, "HADOOP_COMMON_HOME" is not set).

Also, hadoop reported that HADOOP CONF was not set, as well.

I'm wondering whether there is a resource on how to set environment
variables to run hadoop?

Thanks.

-- 
Jay Vyas
MMSB/UCHC

RE: updated example

2011-10-11 Thread GOEKE, MATTHEW (AG/1000)
The old API is still fully usable in 0.20.204.
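
For reference, a minimal sketch of both routes (purely illustrative -
TextInputFormat stands in for whatever input format you actually need):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatExample {
  public static void main(String[] args) throws Exception {
    // Old API (org.apache.hadoop.mapred): setInputFormat() on JobConf.
    org.apache.hadoop.mapred.JobConf conf =
        new org.apache.hadoop.mapred.JobConf(InputFormatExample.class);
    conf.setInputFormat(org.apache.hadoop.mapred.TextInputFormat.class);

    // New API (org.apache.hadoop.mapreduce): setInputFormatClass() on Job,
    // the equivalent Tom mentions below.
    Job job = new Job(new Configuration(), "new-api-example");
    job.setInputFormatClass(TextInputFormat.class);
  }
}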

Matt

-Original Message-
From: Jignesh Patel [mailto:jign...@websoft.com] 
Sent: Tuesday, October 11, 2011 12:17 PM
To: common-user@hadoop.apache.org
Subject: Re: updated example

That means the old API is not integrated in 0.20.204.0??

When do you expect the release of 0.20.205?

-Jignesh
On Oct 11, 2011, at 12:32 PM, Tom White wrote:

> JobConf and the old API are no longer deprecated in the forthcoming
> 0.20.205 release, so you can continue to use it without issue.
> 
> The equivalent in the new API is setInputFormatClass() on
> org.apache.hadoop.mapreduce.Job.
> 
> Cheers,
> Tom
> 
> On Tue, Oct 11, 2011 at 9:18 AM, Keith Thompson  
> wrote:
>> I see that the JobConf class used in the WordCount tutorial is deprecated
>> in favor of the Configuration class.  I want to change the file input format
>> (to the StreamInputFormat for XML as in Hadoop: The Definitive Guide pp.
>> 212-213) but I don't see a setInputFormat method in the Configuration class
>> as there was in the JobConf class.  Is there an updated example using the
>> non-deprecated classes and methods?  I have searched but not found one.
>> 
>> Regards,
>> Keith
>> 


RE: Learning curve after MapReduce and HDFS

2011-09-30 Thread GOEKE, MATTHEW (AG/1000)
Are you learning for the sake of experimenting or are there functional 
requirements driving you to dive into this space?

*If you are learning for the sake of adding new tools to your portfolio: look 
into high-level overviews of each of the projects and review architecture 
solutions that use them. Focus on how they interact and target the ones that 
pique your curiosity the most.

*If you are learning the ecosystem to fulfill some customer requirements then 
just learn the pieces as you need them. Compare the high level differences 
between the sub projects and let the requirements drive which pieces you focus 
on.

There are plenty of training videos out there (for free) that go over quite a 
few of the pieces. I recently came across 
https://www.db2university.com/courses/auth/openid/login.php which has a basic 
set of reference materials that reviews a few of the sub-projects within the 
ecosystem, with included labs. The Yahoo! Developer Network and Cloudera also 
have some great resources as well.

Any one of us could point you in a certain direction but it is all a matter of 
opinion. Compare your needs with each of the sub projects and that should 
filter the list down to a manageable size.

Matt
-Original Message-
From: Varad Meru [mailto:meru.va...@gmail.com] 
Sent: Friday, September 30, 2011 11:19 AM
To: common-user@hadoop.apache.org; Varad Meru
Subject: Learning curve after MapReduce and HDFS

Hi all,

I have been working with Hadoop core, Hadoop HDFS and Hadoop MapReduce for the 
past 8 months. 

Now I want to learn other projects under Apache Hadoop such as Pig, Hive, HBase 
...

Can you suggest a learning path for learning about the Hadoop ecosystem in a 
structured manner?
I am confused between so many alternatives such as 
Hive vs Jaql vs Pig
HBase vs Hypertable vs Cassandra
And many other projects which are similar to each other.  

Thanks in advance,
Varad


---
Varad Meru
Software Engineer
Persistent Systems and Solutions Ltd. 

RE: dump configuration

2011-09-28 Thread GOEKE, MATTHEW (AG/1000)
You could always check the web-ui job history for that particular run, open the 
job.xml, and search for what the value of that parameter was at runtime.
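
If digging through the web UI gets old, here is a hedged client-side sketch
along the same lines - it reads whatever *-site.xml files are on the classpath,
while the job.xml in the job history remains the authoritative record of what a
finished job actually used:

import java.util.Map;
import org.apache.hadoop.mapred.JobConf;

public class DumpConf {
  public static void main(String[] args) {
    // JobConf pulls core-site.xml and mapred-site.xml off the classpath.
    JobConf conf = new JobConf();

    // Dump every effective key/value pair (plain text rather than JSON).
    for (Map.Entry<String, String> entry : conf) {
      System.out.println(entry.getKey() + "=" + entry.getValue());
    }

    // Or just check the one property in question.
    System.out.println("mapred.user.jobconf.limit = "
        + conf.get("mapred.user.jobconf.limit"));
  }
}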

Matt

-Original Message-
From: patrick sang [mailto:silvianhad...@gmail.com] 
Sent: Wednesday, September 28, 2011 4:00 PM
To: common-user@hadoop.apache.org
Subject: dump configuration

Hi hadoopers,

I was looking for a way to dump the hadoop configuration in order to check
whether what I have just changed in mapred-site.xml has really kicked in.

Found that HADOOP-6184 is exactly what I want, but the thing is I am running
CDH3u0, which is 0.20.2 based.

I wonder if anyone here has a magic way to dump the hadoop configuration; it
doesn't need to be JSON, as long as I can check whether what I changed in the
configuration file has really kicked in.

PS, I changed "mapred.user.jobconf.limit"

-P

RE: Temporary Files to be sent to DistributedCache

2011-09-27 Thread GOEKE, MATTHEW (AG/1000)
The simplest route I can think of is to ingest the data directly into HDFS 
using Sqoop if there is a driver currently made for your database. At that 
point it would be relatively simple just to read directly from HDFS in your MR 
code. 
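
If the files do have to be staged by your own code rather than Sqoop, a hedged
sketch of that HDFS-first route (paths, file names and content are made up; API
as of 0.20):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StageLookupData {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write the database extract straight into HDFS - no local temp file.
    Path target = new Path("/user/demo/cache/lookup.csv");
    FSDataOutputStream out = fs.create(target, true);
    out.writeBytes("id,value\n1,foo\n");   // stand-in for the real DB dump
    out.close();

    // Register it with the DistributedCache; the '#lookup.csv' fragment plus
    // createSymlink() makes it show up under that name in each task's
    // working directory.
    DistributedCache.createSymlink(conf);
    DistributedCache.addCacheFile(new URI(target.toUri() + "#lookup.csv"), conf);
    // ... then build and submit the job with this conf.
  }
}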

Matt

-Original Message-
From: lessonz [mailto:less...@q.com] 
Sent: Tuesday, September 27, 2011 4:48 PM
To: common-user@hadoop.apache.org
Subject: Temporary Files to be sent to DistributedCache

I have a need to write information retrieved from a database to a series of
files that need to be made available to my mappers. Because each mapper
needs access to all of these files, I want to put them in the
DistributedCache. Is there a preferred method to writing new information to
the DistributedCache? I can use Java's File.createTempFile(String prefix,
String suffix), but that uses the system default temporary folder. While
that should usually work, I'd rather have a method that doesn't depend on
writing to the local file system before copying files to the
DistributedCache. As I'm extremely new to Hadoop, I hope I'm not missing
something obvious.

Thank you for your time.

RE: Environment consideration for a research on scheduling

2011-09-23 Thread GOEKE, MATTHEW (AG/1000)
If you are starting from scratch with no prior Hadoop install experience, I 
would configure stand-alone, migrate to pseudo-distributed, and then to fully 
distributed, verifying functionality at each step by doing a simple word count 
run. Also, if you don't mind using the CDH distribution, then SCM / their RPMs 
will greatly simplify both the bin installs as well as the user creation.

Your VM route will most likely work, but I can imagine the number of hiccups 
during migration from that to the real cluster will not make it worth your time.

Matt 

-Original Message-
From: Merto Mertek [mailto:masmer...@gmail.com] 
Sent: Friday, September 23, 2011 10:00 AM
To: common-user@hadoop.apache.org
Subject: Environment consideration for a research on scheduling

Hi,
in the first phase we are planning to establish a small cluster with a few
commodity computers (each 1GB, 200GB, ...). The cluster would run Ubuntu Server
10.10 and a Hadoop build from the 0.20.204 branch (I had some issues with
version 0.20.203 and missing libraries).
Would you suggest any other version?

In the second phase we are planning to analyse, test and modify some of the
hadoop schedulers.

Now I am interested in the best way to deploy Ubuntu and Hadoop to these few
machines. I was thinking of configuring the system in a local VM and then
converting it to each physical machine, but probably this is not the best
option. If you know of any other way, please share.

Thank you!

Hadoop RPC and general serialization question

2011-09-22 Thread GOEKE, MATTHEW (AG/1000)
I was reviewing a video from Hadoop Summit 2011 [1] where Arun Murthy mentioned 
that MRv2 was moving towards protocol buffers as the wire format, but I feel 
like this is contrary to an Avro presentation that Doug Cutting did back at 
Hadoop World '09 [2]. I haven't stayed up to date with the Jira for MRv2, but is 
there a disagreement between contributors as to which format will be the de 
facto standard going forward, and if so, what are the biggest points of 
contention? The only reason I bring this up is that I am trying to integrate a 
serialization framework into our best practices and, while I am currently 
working towards Avro, this disconnect caused a little concern.
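
For anyone curious what the Avro side of that looks like, a minimal sketch
using the Java generic API (circa Avro 1.5; the schema and field names are
invented for illustration):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroSketch {
  public static void main(String[] args) throws Exception {
    // Toy schema purely for illustration.
    String schemaJson = "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"payload\",\"type\":\"string\"}]}";
    Schema schema = new Schema.Parser().parse(schemaJson);

    GenericRecord rec = new GenericData.Record(schema);
    rec.put("id", 42L);
    rec.put("payload", "hello");

    // Write a self-describing container file; the schema travels with the data.
    DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
        new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, new File("events.avro"));
    writer.append(rec);
    writer.close();
  }
}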

Matt

*1 - http://www.youtube.com/watch?v=2FpO7w6X41I
*2 - http://www.cloudera.com/videos/hw09_next_steps_for_hadoop


Question regarding Oozie and Hive replication / backup

2011-09-22 Thread GOEKE, MATTHEW (AG/1000)
I would like to have a robust setup for anything residing on our edge nodes, 
which is where these two daemons will be, and I was curious if anyone had any 
suggestions around how to replicate / keep an active clone of the metadata for 
these components. We already use DRBD and a vip to get around this issue for 
our master nodes and I know this would work for the edge nodes but I wanted to 
make sure I wasn't overlooking any options.

Hive: Currently tinkering with the built-in db but evaluating whether to go 
with a dedicated MySQL or PostgreSQL instance so suggestions can reference 
either solution.

Thanks,
Matt

RE: risks of using Hadoop

2011-09-21 Thread GOEKE, MATTHEW (AG/1000)
I would completely agree with Mike's comments with one addition: Hadoop centers 
around how to manipulate the flow of data in a way to make the framework work 
for your specific problem. There are recipes for common problems but depending 
on your domain that might solve only 30-40% of your use cases. It should take 
little to no time for a good java dev to understand how to make an MR program. 
It will take significantly more time for that java dev to understand the domain 
and Hadoop well enough to consistently write *good* MR programs. Mike listed 
some great ways to cut down on that curve but you really want someone who has 
not only an affinity for code but can also apply the critical thinking to how 
you should pipeline your data. If you plan on using it purely with Pig/Hive 
abstractions on top then this can be negated significantly.

Some might disagree, but that is my $0.02.
Matt 

-Original Message-
From: Michael Segel [mailto:michael_se...@hotmail.com] 
Sent: Wednesday, September 21, 2011 12:48 PM
To: common-user@hadoop.apache.org
Subject: RE: risks of using Hadoop


Kobina

The points 1 and 2 are definitely real risks. SPOF is not.

As I pointed out in my mini-rant to Tom, your end users / developers who use 
the cluster can do more harm to your cluster than a SPOF machine failure.

I don't know what one would consider a 'long learning curve'. With the adoption 
of any new technology, you're talking at least 3-6 months based on the 
individual and the overall complexity of the environment. 

Take anyone who is a strong developer, put them through Cloudera's training, 
plus some play time, and you've shortened the learning curve.
The better the java developer, the easier it is for them to pick up Hadoop.

I would also suggest taking the approach of hiring a senior person who can 
cross train and mentor your staff. This too will shorten the runway.

HTH

-Mike


> Date: Wed, 21 Sep 2011 17:02:45 +0100
> Subject: Re: risks of using Hadoop
> From: kobina.kwa...@gmail.com
> To: common-user@hadoop.apache.org
> 
> Jignesh,
> 
> Will your point 2 still be valid if we hire very experienced Java
> programmers?
> 
> Kobina.
> 
> On 20 September 2011 21:07, Jignesh Patel  wrote:
> 
> >
> > @Kobina
> > 1. Lack of skill set
> > 2. Longer learning curve
> > 3. Single point of failure
> >
> >
> > @Uma
> > I am curious to know about .20.2 is that stable? Is it same as the one you
> > mention in your email(Federation changes), If I need scaled nameNode and
> > append support, which version I should choose.
> >
> > Regarding Single point of failure, I believe Hortonworks(a.k.a Yahoo) is
> > updating the Hadoop API. When that will be integrated with Hadoop.
> >
> > If I need
> >
> >
> > -Jignesh
> >
> > On Sep 17, 2011, at 12:08 AM, Uma Maheswara Rao G 72686 wrote:
> >
> > > Hi Kobina,
> > >
> > > Some experiences which may helpful for you with respective to DFS.
> > >
> > > 1. Selecting the correct version.
> > >I will recommend to use 0.20X version. This is pretty stable version
> > and all other organizations prefers it. Well tested as well.
> > > Dont go for 21 version.This version is not a stable version.This is risk.
> > >
> > > 2. You should perform thorough test with your customer operations.
> > >  (of-course you will do this :-))
> > >
> > > 3. 0.20x version has the problem of SPOF.
> > >   If NameNode goes down you will lose the data. One way of recovering is
> > by using the secondaryNameNode. You can recover the data till the last
> > checkpoint. But here manual intervention is required.
> > > In the latest trunk SPOF will be addressed by HDFS-1623.
> > >
> > > 4. 0.20x NameNodes can not scale. Federation changes included in latest
> > versions. ( i think in 22). this may not be the problem for your cluster.
> > But please consider this aspect as well.
> > >
> > > 5. Please select the hadoop version depending on your security
> > requirements. There are versions available for security as well in 0.20X.
> > >
> > > 6. If you plan to use Hbase, it requires append support. 20Append has the
> > support for append. 0.20.205 release also will have append support but not
> > yet released. Choose your correct version to avoid sudden surprises.
> > >
> > >
> > >
> > > Regards,
> > > Uma
> > > - Original Message -
> > > From: Kobina Kwarko 
> > > Date: Saturday, September 17, 2011 3:42 am
> > > Subject: Re: risks of using Hadoop
> > > To: common-user@hadoop.apache.org
> > >
> > >> We are planning to use Hadoop in my organisation for quality of
> > >> servicesanalysis out of CDR records from mobile operators. We are
> > >> thinking of having
> > >> a small cluster of may be 10 - 15 nodes and I'm preparing the
> > >> proposal. my
> > >> office requires that i provide some risk analysis in the proposal.
> > >>
> > >> thank you.
> > >>
> > >> On 16 September 2011 20:34, Uma Maheswara Rao G 72686
> > >> wrote:
> > >>
> > >>> Hello,
> > >>>
> > >>> First of all where you are planning to use Hadoop

RE: Using HBase for real time transaction

2011-09-20 Thread GOEKE, MATTHEW (AG/1000)
In order to answer your first question we would need to know what types of data 
you plan on storing and your latency requirements. If it is 
semi-structured/unstructured data then HBase *can* be a great fit, but I have 
seen very few cases where you will want to scrap your RDBMS completely. Most 
organizations that use HBase will still have a need for an RDBMS/MPP solution 
for real-time access to structured data.

Matt 

-Original Message-
From: Jignesh Patel [mailto:jign...@websoft.com] 
Sent: Tuesday, September 20, 2011 4:25 PM
To: common-user@hadoop.apache.org
Subject: Re: Using HBase for real time transaction

Tom,
Let me reword: can HBase be used as a transactional database (i.e. as a 
replacement for MySQL)?

The requirement is to have real-time read and write operations. I mean that as 
soon as data is written the user should see it (here, the data would be written 
to HBase).

-Jignesh


On Sep 20, 2011, at 5:11 PM, Tom Deutsch wrote:

> Real-time means different things to different people. Can you share your 
> latency requirements from the time the data is generated to when it needs 
> to be consumed, or how you are thinking of using Hbase in the overall 
> flow?
> 
> 
> Tom Deutsch
> Program Director
> CTO Office: Information Management
> Hadoop Product Manager / Customer Exec
> IBM
> 3565 Harbor Blvd
> Costa Mesa, CA 92626-1420
> tdeut...@us.ibm.com
> 
> 
> 
> 
> Jignesh Patel  
> 09/20/2011 12:57 PM
> Please respond to
> common-user@hadoop.apache.org
> 
> 
> To
> common-user@hadoop.apache.org
> cc
> 
> Subject
> Using HBase for real time transaction
> 
> 
> 
> 
> 
> 
> We are exploring possibility of using HBase for the real time 
> transactions. Is that possible?
> 
> -Jignesh
> 


RE: cannot start namenode

2011-09-20 Thread GOEKE, MATTHEW (AG/1000)
Possibly a stupid question but have you done a "df -h" on the namenode to 
verify that you didn't somehow fill up all of the free space on that mount? I 
can imagine that it would blow up if it was not able to initiate the namenode 
log for that session (and could potentially explain why it crashed in the first 
place).

Matt

-Original Message-
From: Peng, Wei [mailto:wei.p...@xerox.com] 
Sent: Tuesday, September 20, 2011 3:12 PM
To: common-user@hadoop.apache.org
Subject: cannot start namenode

Hi,

 

I was copying some files from an old cluster to a new cluster.

When it failed copying, I was not watching it. (I think over 85% of data
has been transferred).

The name node crashed, and I cannot restart it.

 

I got the following error when I try to restart the namenode

Hadoop-daemon.sh: line 114: echo: write error: No space left on device.

 

Does someone know how to fix this problem? I am in urgent need to
restart this cluster...

 

Thanks a lot

Wei


RE: how to set the number of mappers with 0 reducers?

2011-09-20 Thread GOEKE, MATTHEW (AG/1000)
There is currently no way to disable sort/shuffle. You can do many things to 
alleviate any issues you have with it though, one of which you mentioned below. 
Is there a reason why you are allowing each of your keys to be unique? If it is 
truly because you do not care, then just create an even distribution of keys 
that you assign, to allow for more aggregation; a rough sketch of that is below.
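
A hedged sketch of that key-bucketing idea with the new mapreduce API (the
bucket count, types and class name are placeholders you would tune):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BucketingMapper
    extends Mapper<LongWritable, Text, IntWritable, Text> {

  private static final int NUM_BUCKETS = 100;  // placeholder - roughly #reducers
  private final IntWritable bucket = new IntWritable();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Instead of a unique key per record, hash into a small, even key space
    // so records can be aggregated under far fewer distinct keys.
    bucket.set((line.hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS);
    context.write(bucket, line);
  }
}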

On a side note, what is the actual stack trace you are getting when the 
reducers fail and what is the reducer doing? I think for your use case using a 
reduce phase is the best way to go, as long as the job time meets your SLA, so 
we need to figure out why the job is failing.

Matt

-Original Message-
From: Peng, Wei [mailto:wei.p...@xerox.com] 
Sent: Tuesday, September 20, 2011 10:44 AM
To: common-user@hadoop.apache.org
Subject: RE: how to set the number of mappers with 0 reducers?

The input is 9010 files (each 500MB), and I would estimate the output to
be around 50GB.
My hadoop job failed with an out-of-memory error (with 66 reducers). I
guess that the key from each mapper output is unique, so the sorting
would be memory-intensive.
Although I can set another key to reduce the number of unique keys, I am
curious if there is a way to disable sorting/shuffling.

Thanks,
Wei

-Original Message-
From: GOEKE, MATTHEW (AG/1000) [mailto:matthew.go...@monsanto.com] 
Sent: Tuesday, September 20, 2011 8:34 AM
To: common-user@hadoop.apache.org
Subject: RE: how to set the number of mappers with 0 reducers?

Amusingly this is almost the same question that was asked the other day
:)


There isn't currently a way of getting a collated, but unsorted list of
key/value pairs. For most applications, the in memory sort is fairly
cheap relative to the shuffle and other parts of the processing.


If you know that you will be filtering out a significant amount of
information to the point where shuffle will be trivial then the impact
of a reduce phase should be minimal using an identity reducer. It is
either that, or aggregate as much data as you feel comfortable with into
each split and have 1 file per map.

How much data/percentage of input are you assuming will be output from
each of these maps?

Matt

-Original Message-
From: Peng, Wei [mailto:wei.p...@xerox.com] 
Sent: Tuesday, September 20, 2011 10:22 AM
To: common-user@hadoop.apache.org
Subject: RE: how to set the number of mappers with 0 reducers?

Thank you all for the quick reply!!

I think I was wrong. It has nothing to do with the number of mappers
because each input file has size 500M, which is not too small in terms
of 64M per block.

The problem is that the output from each mapper is too small. Is there a
way to combine some mappers output together? Setting the number of
reducers to 1 might get a very huge file. Can I set the number of
reducers to 100, but skip sorting, shuffling...etc.?

Wei

-Original Message-
From: Soumya Banerjee [mailto:soumya.sbaner...@gmail.com] 
Sent: Tuesday, September 20, 2011 2:06 AM
To: common-user@hadoop.apache.org
Subject: Re: how to set the number of mappers with 0 reducers?.

Hi,

If you want all your map outputs in a single file you can use an
IdentityReducer and set the number of reducers to 1.
This would ensure that all your mapper output goes into the reducer and it
writes into a single file.

Soumya

On Tue, Sep 20, 2011 at 2:04 PM, Harsh J  wrote:

> Hello Wei!
>
> On Tue, Sep 20, 2011 at 1:25 PM, Peng, Wei  wrote:
> (snip)
> > However, the output from the mappers result in many small files
(size is
> > ~50k, the block size is however 64M, so it wastes a lot of space).
> >
> > How can I set the number of mappers (say 100)?
>
> What you're looking for is to 'pack' several files per mapper, if I
> get it right.
>
> In that case, you need to check out the CombineFileInputFormat. It can
> pack several files per mapper (with some degree of locality).
>
> Alternatively, pass a list of files (as a text file) as your input,
> and have your Mapper logic read them one by one. This way, if you
> divide 50k filenames over 100 files, you will get 100 mappers as you
> want - but at the cost of losing almost all locality.
>
> > If there is no way to set the number of mappers, the only way to
solve
> > it is "cat" some files together?
>
> Concatenating is an alternative, if affordable - yes. You can lower
> the file count (down from 50k) this way.
>
> --
> Harsh J
>

RE: how to set the number of mappers with 0 reducers?

2011-09-20 Thread GOEKE, MATTHEW (AG/1000)
Amusingly this is almost the same question that was asked the other day :)


There isn't currently a way of getting a collated, but unsorted list of 
key/value pairs. For most applications, the in memory sort is fairly cheap 
relative to the shuffle and other parts of the processing.


If you know that you will be filtering out a significant amount of information 
to the point where shuffle will be trivial then the impact of a reduce phase 
should be minimal using an identity reducer. It is either that, or aggregate as 
much data as you feel comfortable with into each split and have 1 file per map. 

How much data/percentage of input are you assuming will be output from each of 
these maps?

Matt

-Original Message-
From: Peng, Wei [mailto:wei.p...@xerox.com] 
Sent: Tuesday, September 20, 2011 10:22 AM
To: common-user@hadoop.apache.org
Subject: RE: how to set the number of mappers with 0 reducers?

Thank you all for the quick reply!!

I think I was wrong. It has nothing to do with the number of mappers
because each input file has size 500M, which is not too small in terms
of 64M per block.

The problem is that the output from each mapper is too small. Is there a
way to combine some mappers output together? Setting the number of
reducers to 1 might get a very huge file. Can I set the number of
reducers to 100, but skip sorting, shuffling...etc.?

Wei

-Original Message-
From: Soumya Banerjee [mailto:soumya.sbaner...@gmail.com] 
Sent: Tuesday, September 20, 2011 2:06 AM
To: common-user@hadoop.apache.org
Subject: Re: how to set the number of mappers with 0 reducers?.

Hi,

If you want all your map outputs in a single file you can use an
IdentityReducer and set the number of reducers to 1.
This would ensure that all your mapper output goes into the reducer and
it writes into a single file.

Soumya

On Tue, Sep 20, 2011 at 2:04 PM, Harsh J  wrote:

> Hello Wei!
>
> On Tue, Sep 20, 2011 at 1:25 PM, Peng, Wei  wrote:
> (snip)
> > However, the output from the mappers result in many small files
(size is
> > ~50k, the block size is however 64M, so it wastes a lot of space).
> >
> > How can I set the number of mappers (say 100)?
>
> What you're looking for is to 'pack' several files per mapper, if I
> get it right.
>
> In that case, you need to check out the CombineFileInputFormat. It can
> pack several files per mapper (with some degree of locality).
>
> Alternatively, pass a list of files (as a text file) as your input,
> and have your Mapper logic read them one by one. This way, if you
> divide 50k filenames over 100 files, you will get 100 mappers as you
> want - but at the cost of losing almost all locality.
>
> > If there is no way to set the number of mappers, the only way to
solve
> > it is "cat" some files together?
>
> Concatenating is an alternative, if affordable - yes. You can lower
> the file count (down from 50k) this way.
>
> --
> Harsh J
>



RE: phases of Hadoop Jobs

2011-09-19 Thread GOEKE, MATTHEW (AG/1000)
Was the command line output really ever intended to be *that* verbose? I can 
see how it would be useful, but considering how easy it is to interact with the 
web-ui I can't see why much effort should be put into enhancing it. Even if you 
don't want to see all of the details the web-ui has to offer, it doesn't take 
long to learn how to skim it and get a 10x more accurate reading on your job 
progress.

Matt

-Original Message-
From: Arun C Murthy [mailto:a...@hortonworks.com] 
Sent: Sunday, September 18, 2011 11:27 PM
To: common-user@hadoop.apache.org
Subject: Re: phases of Hadoop Jobs

Agreed.

At least, I believe the new web-ui for MRv2 is (or will be soon) more verbose 
about this.

On Sep 18, 2011, at 9:23 PM, Kai Voigt wrote:

> Hi,
> 
> these 0-33-66-100% phases are really confusing to beginners. We see that in 
> our training classes. The output should be more verbose, such as breaking 
> down the phases into separate progress numbers.
> 
> Does that make sense?
> 
> Am 19.09.2011 um 06:17 schrieb Arun C Murthy:
> 
>> Nan,
>> 
>> The 'phase' is implicitly understood by the 'progress' (value) made by the 
>> map/reduce tasks (see o.a.h.mapred.TaskStatus.Phase).
>> 
>> For e.g. 
>> Reduce: 
>> 0-33% -> Shuffle
>> 34-66% -> Sort (actually, just 'merge', there is no sort in the reduce since 
>> all map-outputs are sorted)
>> 67-100% -> Reduce
>> 
>> With 0.23 onwards the Map has phases too:
>> 0-90% -> Map
>> 91-100% -> Final Sort/merge
>> 
>> Now, about starting reduces early - this is done to ensure shuffle can 
>> proceed for completed maps while rest of the maps run, there-by pipelining 
>> shuffle and map completion. There is a 'reduce slowstart' feature to control 
>> this - by default, reduces aren't started until 5% of maps are complete. 
>> Users can set this higher.
>> 
>> Arun
>> 
>> On Sep 18, 2011, at 7:24 PM, Nan Zhu wrote:
>> 
>>> Hi, all
>>> 
>>> recently, I was hit by a question, "how is a hadoop job divided into 2
>>> phases?",
>>> 
>>> In textbooks, we are told that MapReduce jobs are divided into 2 phases,
>>> map and reduce, and that reduce is further divided into 3 stages:
>>> shuffle, sort, and reduce. But in the Hadoop code I never thought about
>>> this question; I didn't see any member variables in the JobInProgress class
>>> to indicate this information.
>>> 
>>> Also, according to my understanding of the Hadoop source code, the reduce
>>> tasks are not necessarily started only after all mappers are finished; on the
>>> contrary, we can see reduce tasks in the shuffle stage while some
>>> mappers are still running.
>>> So how can I tell which phase a job is currently in?
>>> 
>>> Thanks
>>> -- 
>>> Nan Zhu
>>> School of Electronic, Information and Electrical Engineering,229
>>> Shanghai Jiao Tong University
>>> 800,Dongchuan Road,Shanghai,China
>>> E-Mail: zhunans...@gmail.com
>> 
>> 
> 
> -- 
> Kai Voigt
> k...@123.org
> 
> 
> 
> 




Hadoop/CDH + Avro

2011-09-13 Thread GOEKE, MATTHEW (AG/1000)
Would anyone happen to be able to share a good reference for Avro integration 
with Hadoop? I can find plenty of material around using Avro by itself but I 
have found little to no documentation on how to implement it as both the 
protocol and as custom key/value types.

Thanks,
Matt


Hadoop multi tier backup

2011-08-30 Thread GOEKE, MATTHEW (AG/1000)
All,

We were discussing how we would backup our data from the various environments 
we will have and I was hoping someone could chime in with previous experience 
in this. My primary concern about our cluster is that we would like to be able 
to recover anything within the last 60 days so having full backups both on tape 
and through distcp is preferred.

Our initial thoughts can be seen in the jpeg attached but just in case any of 
you are wary of attachments it can also be summarized below:

Prod Cluster --DistCp--> On-site Backup cluster with Fuse mount point running 
NetBackup daemon --NetBackup--> Media Server --> Tape

One of our biggest grey areas so far is how most people accomplish 
incremental backups. Our thought was to tie this into our NetBackup 
configuration, as this can be done for other connectors, but we do not see 
anything for HDFS yet.

Thanks,
Matt


RE: Hadoop in process?

2011-08-26 Thread GOEKE, MATTHEW (AG/1000)
It depends on what scope you want your unit tests to operate at. There is a 
class you might want to look into called MiniMRCluster if you are dead set on 
having as deep a set of tests as possible, but you can still cover quite a bit with 
MRUnit and Junit4/Mockito.
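
Since Frank's tests only touch HDFS, the HDFS counterpart (MiniDFSCluster,
which ships in the Hadoop test jar) may be the closer fit; a minimal sketch,
assuming the 0.20-era constructor and that the test jar is on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class HdfsSmokeTest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // In-process NameNode plus one DataNode: (conf, #datanodes, format, racks).
    MiniDFSCluster cluster = new MiniDFSCluster(conf, 1, true, null);
    try {
      FileSystem fs = cluster.getFileSystem();
      Path p = new Path("/tmp/smoke.txt");
      fs.create(p).close();                    // create a small file
      System.out.println("exists? " + fs.exists(p));
    } finally {
      cluster.shutdown();
    }
  }
}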

Matt

-Original Message-
From: Frank Astier [mailto:fast...@yahoo-inc.com] 
Sent: Friday, August 26, 2011 1:30 PM
To: common-user@hadoop.apache.org
Subject: Hadoop in process?

Hi -

Is there a way I can start HDFS (the namenode) from a Java main and run unit 
tests against that? I need to integrate my Java/HDFS program into unit tests, 
and the unit test machine might not have Hadoop installed. I'm currently 
running the unit tests by hand with hadoop jar ... My unit tests create a bunch 
of (small) files in HDFS and manipulate them. I use the fs API for that. I 
don't have map/reduce jobs (yet!).

Thanks!

Frank



RE: Making sure I understand HADOOP_CLASSPATH

2011-08-22 Thread GOEKE, MATTHEW (AG/1000)
If you are asking how to make those classes available at run time, you can 
either use the -libjars option for the distributed cache or you can just shade 
those classes into your jar using Maven. I have had enough issues in the past 
with the classpath being flaky that I prefer the shading method, but obviously that 
is not the preferred route.
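
For the -libjars route, the driver has to go through ToolRunner and
GenericOptionsParser so the option is actually picked up; a minimal sketch
(driver and jar names here are hypothetical):

// Invocation would look something like:
//   hadoop jar myjob.jar com.example.MyDriver -libjars /path/to/MyJar.jar /in /out
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();  // -libjars has already been folded in here
    // ... build and submit the Job using conf; args now holds only /in /out ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
  }
}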

Matt

-Original Message-
From: W.P. McNeill [mailto:bill...@gmail.com] 
Sent: Monday, August 22, 2011 1:01 PM
To: common-user@hadoop.apache.org
Subject: Making sure I understand HADOOP_CLASSPATH

What does HADOOP_CLASSPATH set in $HADOOP/conf/hadoop-env.sh do?

This isn't clear to me from documentation and books, so I did some
experimenting. Here's the conclusion I came to: the paths in
HADOOP_CLASSPATH are added to the class path of the Job Client, but they are
not added to the class path of the Task Trackers. Therefore if you put a JAR
called MyJar.jar on the HADOOP_CLASSPATH and don't do anything to make it
available to the Task Trackers as well, calls to MyJar.jar code from the
run() method of your job work, but calls from your Mapper or Reducer will
fail at runtime. Is this correct?

If it is, what is the proper way to make MyJar.jar available to both the Job
Client and the Task Trackers?


Unit testing MR without dependency injection

2011-08-15 Thread GOEKE, MATTHEW (AG/1000)
Does anyone have any code examples for how they persist join data across 
multiple input splits and how they test it? Currently I populate a singleton in 
the setup method of my mapper (along with having JVM reuse turned on for this 
job), but with no way to have dependency injection into the mapper I am really 
having a hard time wrapping a UT around the code. I could have a package-scoped 
setter simply for testing purposes but that just feels dirty to be honest. Any 
help is greatly appreciated and I have both MRUnit and Mockito at my disposal.

  private BitPackedMarkerMap markerMap =
      BitPackedMarkerMapSingleton.getInstance().getMarkerMap();
  private int numberOfIndividuals = -999;
  private int numberOfAlleles = -999;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    LongPackedDoubleInteger inputSizes;
    if (markerMap.getSize() == 0) {
      FileInputStream scoresInputStream = null;
      try {
        Path[] cacheFiles =
            DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cacheFiles != null && cacheFiles.length > 0) {
          scoresInputStream = new FileInputStream(cacheFiles[0].toString());
          inputSizes = markerMap.parse(scoresInputStream);
          numberOfIndividuals = inputSizes.getInt1();
          numberOfAlleles = inputSizes.getInt2();
        }
      } catch (IOException e) {
        System.err.println("Exception reading DistributedCache: " + e);
        throw e;
      } finally {
        if (scoresInputStream != null) {
          scoresInputStream.close();
        }
      }
    }
  }




Matt



RE: hadoop cluster on VM's

2011-08-15 Thread GOEKE, MATTHEW (AG/1000)
I was referring to multiple VM's on a single machine (that you have in house) 
for my previous comment and not EC2. FWIW, I would rather see a single heavy 
data node than partition off a single box into multiple machines, unless you 
are trying to do more on that server than just Hadoop. Obviously every person / 
company has their own constraints, but if this box is solely for Hadoop then 
don't partition it; otherwise you will incur a decent loss in possible 
map/reduce slots.

Matt

-Original Message-
From: Liam Friel [mailto:liam.fr...@gmail.com] 
Sent: Monday, August 15, 2011 3:04 PM
To: common-user@hadoop.apache.org
Subject: Re: hadoop cluster on VM's

On Mon, Aug 15, 2011 at 7:31 PM, GOEKE, MATTHEW (AG/1000) <
matthew.go...@monsanto.com> wrote:

> Is this just for testing purposes or are you planning on going into
> production with this? If it is the latter then I would STRONGLY advise you not
> to give that a second thought due to how the framework handles I/O. However
> if you are just trying to test out distributed daemon setup and get some ops
> documentation then have at it :)
>
> Matt
>
> -Original Message-
> From: Travis Camechis [mailto:camec...@gmail.com]
> Sent: Monday, August 15, 2011 12:45 PM
> To: common-user@hadoop.apache.org
> Subject: hadoop cluster on VM's
>
> Is it recommended to install a hadoop cluster on a set of VM's that are all
> connected to a SAN?
>
>
Could you expand on that? Do you mean multiple VMs on a single server are a
no-no?
Or do you mean running Hadoop on something like Amazon EC2 for production is
also a no-no?
With some pointers to background if the latter please ...

Just for my education. I have run some (test I guess you could call them)
Hadoop clusters on EC2 and it was working OK.
However I didn't have the equivalent pile of physical hardware lying around
to do a comparison ... which I guess is why EC2 is so attractive.

Ta
Liam



RE: hadoop cluster on VM's

2011-08-15 Thread GOEKE, MATTHEW (AG/1000)
Is this just for testing purposes or are you planning on going into production 
with this? If it is the latter then I would STRONGLY advise you not to give that a 
second thought due to how the framework handles I/O. However if you are just 
trying to test out distributed daemon setup and get some ops documentation then 
have at it :)

Matt

-Original Message-
From: Travis Camechis [mailto:camec...@gmail.com] 
Sent: Monday, August 15, 2011 12:45 PM
To: common-user@hadoop.apache.org
Subject: hadoop cluster on VM's

Is it recommended to install a hadoop cluster on a set of VM's that are all
connected to a SAN?

Thanks,
Travis



RE: Hadoop--store a sequence file in distributed cache?

2011-08-12 Thread GOEKE, MATTHEW (AG/1000)
Sofia, correct me if I am wrong, but Mike, I think this thread was about using 
the output of a previous job, in this case already in sequence file format, as 
in-memory join data for another job.

Side note: does anyone know what the rule of thumb on file size is when using 
the distributed cache vs just reading from HDFS (join data, not binary files)? I 
always thought that having a setup phase on a mapper read directly from HDFS 
was asking for trouble and that you should always distribute to each node, but 
I am hearing more and more people say to just read directly from HDFS for 
larger file sizes to avoid the I/O cost of the distributed cache.

Matt

-Original Message-
From: Ian Michael Gumby [mailto:michael_se...@hotmail.com] 
Sent: Friday, August 12, 2011 10:54 AM
To: common-user@hadoop.apache.org
Subject: RE: Hadoop--store a sequence file in distributed cache?


This whole thread doesn't make a lot of sense.

If your first m/r job creates the sequence files, which you then use as input 
files to your second job, you don't need to use distributed cache since the 
output of the first m/r job is going to be in HDFS.
(Dino is correct on that account.)

Sofia replied saying that she needed to open and close the sequence file to 
access the data in each Mapper.map() call. 
Without knowing more about the specific app, Ashook is correct that you could 
read the file in Mapper.setup() and then access it in memory.
Joey is correct you can put anything in distributed cache, but you don't want 
to put an HDFS file into distributed cache. Distributed cache is a tool for 
taking something from your job and distributing it to each task node as a 
local object. It does have a bit of overhead. 

A better example is if you're distributing binary objects that you want on 
each node: a C++ .so file that you want to call from within your Java m/r.

If you're not using all of the data in the sequence file, what about using 
HBase?
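
For Sofia's original question, a minimal mapper-side sketch, assuming the side
file was pushed with DistributedCache.addCacheFile() in the driver and that its
key/value types are Text/IntWritable (both assumptions, not from the thread):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, Integer> sideData = new HashMap<String, Integer>();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    // Files placed in the cache by the driver via DistributedCache.addCacheFile(uri, conf).
    Path[] cached = DistributedCache.getLocalCacheFiles(conf);
    if (cached != null && cached.length > 0) {
      // Assumed key/value types of the side file: Text -> IntWritable.
      SequenceFile.Reader reader =
          new SequenceFile.Reader(FileSystem.getLocal(conf), cached[0], conf);
      try {
        Text k = new Text();
        IntWritable v = new IntWritable();
        while (reader.next(k, v)) {
          sideData.put(k.toString(), v.get());   // load once per task, reuse in map()
        }
      } finally {
        reader.close();
      }
    }
  }
}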


> From: ash...@clearedgeit.com
> To: common-user@hadoop.apache.org
> Date: Fri, 12 Aug 2011 09:06:39 -0400
> Subject: RE: Hadoop--store a sequence file in distributed cache?
> 
> If you are looking for performance gains, then possibly reading these files 
> once during the setup() call in your Mapper and storing them in some data 
> structure like a Map or a List will give you benefits.  Having to open/close 
> the files during each map call will have a lot of unneeded I/O.  
> 
> You have to be conscious of your java heap size though since you are 
> basically storing the files in RAM. If your files are a few MB in size as you 
> said, then it shouldn't be a problem.  If the amount of data you need to 
> store won't fit, consider using HBase as a solution to get access to the data 
> you need.
> 
> But as Joey said, you can put whatever you want in the Distributed Cache -- 
> as long as you have a reader for it.  You should have no problems using the 
> SequenceFile.Reader.
> 
> -- Adam
> 
> 
> -Original Message-
> From: Joey Echeverria [mailto:j...@cloudera.com] 
> Sent: Friday, August 12, 2011 6:28 AM
> To: common-user@hadoop.apache.org; Sofia Georgiakaki
> Subject: Re: Hadoop--store a sequence file in distributed cache?
> 
> You can use any kind of format for files in the distributed cache, so
> yes you can use sequence files. They should be faster to parse than
> most text formats.
> 
> -Joey
> 
> On Fri, Aug 12, 2011 at 4:56 AM, Sofia Georgiakaki
>  wrote:
> > Thank you for the reply!
> > In each map(), I need to open-read-close these files (more than 2 in the 
> > general case, and maybe up to 20 or more), in order to make some checks. 
> > Considering the huge amount of data in the input, making all these file 
> > operations on HDFS will kill the performance!!! So I think it would be 
> > better to store these files in distributed Cache, so that the whole process 
> > would be more efficient -I guess this is the point of using Distributed 
> > Cache in the first place!
> >
> > My question is, if I can store sequence files in distributed Cache and 
> > handle them using e.g. the SequenceFile.Reader class, or if I should only 
> > keep regular text files in distributed Cache and handle them using the 
> > usual java API.
> >
> > Thank you very much
> > Sofia
> >
> > PS: The files have small size, a few KB to few MB maximum.
> >
> >
> >
> > 
> > From: Dino Kečo 
> > To: common-user@hadoop.apache.org; Sofia Georgiakaki 
> > 
> > Sent: Friday, August 12, 2011 11:30 AM
> > Subject: Re: Hadoop--store a sequence file in distributed cache?
> >
> > Hi Sofia,
> >
> > I assume that output of first job is stored on HDFS. In that case I would
> > directly read file from Mappers without using distributed cache. If you put
> > file into distributed cache that would add one more copy operation into your
> > process.
> >
> > Thanks,
> > dino
> >
> >
> > On Fri, Aug 12, 2011 at 9:53 AM, Sofia Georgiakaki
> > wrote:
> >
> >> Good morning,
> >>
> >> I wo

RE: Question about RAID controllers and hadoop

2011-08-11 Thread GOEKE, MATTHEW (AG/1000)
If I read that email chain correctly then they were referring to the classic 
JBOD vs multiple disks striped together conversation. The conversation that was 
started here is referring to JBOD vs 1 RAID 0 per disk and the effects of the 
raid controller on those independent raids.

Matt

-Original Message-
From: Kai Voigt [mailto:k...@123.org] 
Sent: Thursday, August 11, 2011 5:17 PM
To: common-user@hadoop.apache.org
Subject: Re: Question about RAID controllers and hadoop

Yahoo did some testing 2 years ago: http://markmail.org/message/xmzc45zi25htr7ry

But updated benchmark would be interesting to see.

Kai

Am 12.08.2011 um 00:13 schrieb GOEKE, MATTHEW (AG/1000):

> My assumption would be that having a set of 4 raid 0 disks would actually be 
> better than having a controller that allowed pure JBOD of 4 disks due to the 
> cache on the controller. If anyone has any personal experience with this I 
> would love to know performance numbers but our infrastructure guy is doing 
> tests on exactly this over the next couple days so I will pass it along once 
> we have it.
> 
> Matt
> 
> -Original Message-
> From: Bharath Mundlapudi [mailto:bharathw...@yahoo.com] 
> Sent: Thursday, August 11, 2011 5:00 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Question about RAID controllers and hadoop
> 
> True, you need a P410 controller. You can create RAID0 for each disk to make 
> it as JBOD.
> 
> 
> -Bharath
> 
> 
> 
> 
> From: Koert Kuipers 
> To: common-user@hadoop.apache.org
> Sent: Thursday, August 11, 2011 2:50 PM
> Subject: Question about RAID controllers and hadoop
> 
> Hello all,
> We are considering using low end HP proliant machines (DL160s and DL180s)
> for cluster nodes. However with these machines if you want to do more than 4
> hard drives then HP puts in a P410 raid controller. We would configure the
> RAID controller to function as JBOD, by simply creating multiple RAID
> volumes with one disk. Does anyone have experience with this setup? Is it a
> good idea, or am i introducing a i/o bottleneck?
> Thanks for your help!
> Best, Koert

-- 
Kai Voigt
k...@123.org






RE: Question about RAID controllers and hadoop

2011-08-11 Thread GOEKE, MATTHEW (AG/1000)
My assumption would be that having a set of 4 raid 0 disks would actually be 
better than having a controller that allowed pure JBOD of 4 disks due to the 
cache on the controller. If anyone has any personal experience with this I 
would love to know performance numbers but our infrastructure guy is doing 
tests on exactly this over the next couple days so I will pass it along once we 
have it.

Matt

-Original Message-
From: Bharath Mundlapudi [mailto:bharathw...@yahoo.com] 
Sent: Thursday, August 11, 2011 5:00 PM
To: common-user@hadoop.apache.org
Subject: Re: Question about RAID controllers and hadoop

True, you need a P410 controller. You can create RAID0 for each disk to make it 
as JBOD.


-Bharath




From: Koert Kuipers 
To: common-user@hadoop.apache.org
Sent: Thursday, August 11, 2011 2:50 PM
Subject: Question about RAID controllers and hadoop

Hello all,
We are considering using low end HP proliant machines (DL160s and DL180s)
for cluster nodes. However with these machines if you want to do more than 4
hard drives then HP puts in a P410 raid controller. We would configure the
RAID controller to function as JBOD, by simply creating multiple RAID
volumes with one disk. Does anyone have experience with this setup? Is it a
good idea, or am i introducing a i/o bottleneck?
Thanks for your help!
Best, Koert



RE: Giving filename as key to mapper ?

2011-07-15 Thread GOEKE, MATTHEW (AG/1000)
If you have the source downloaded (and if you don't I would suggest you get it) 
you can do a search for *InputFormat.java and you will have all the references 
you need. Also you might want to check out http://codedemigod.com/blog/?p=120 
or take a look at the books "Hadoop in action" or "Hadoop: The Definitive 
Guide".
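
If writing a full InputFormat is overkill, a new-API sketch of the simpler
route Harsh mentions below (pulling the file name from the FileSplit inside the
mapper) may be enough; the class and output types here are illustrative:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileNameKeyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text fileName = new Text();

  @Override
  protected void setup(Context context) {
    // For FileInputFormat-based jobs the split is a FileSplit, so the
    // file name can be pulled from it once per task.
    FileSplit split = (FileSplit) context.getInputSplit();
    fileName.set(split.getPath().getName());
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String word : line.toString().split("\\s+")) {
      if (word.length() > 0) {
        // Emit (filename, 1) per word; adjust the key to (word + filename) etc. as needed.
        context.write(fileName, ONE);
      }
    }
  }
}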

Matt

-Original Message-
From: praveenesh kumar [mailto:praveen...@gmail.com] 
Sent: Friday, July 15, 2011 9:42 AM
To: common-user@hadoop.apache.org
Subject: Re: Giving filename as key to mapper ?

I am new to this hadoop API. Can anyone give me some tutorial or code snipet
on how to write your own input format to do these kind of things.
Thanks.

On Fri, Jul 15, 2011 at 8:07 PM, Robert Evans  wrote:

> To add to that if you really want the file name to be the key instead of
> just calling a different API in your map to get it you will probably need to
> write your own input format to do it.  It should be fairly simple and you
> can base it off of an existing input format to do it.
>
> --Bobby
>
> On 7/15/11 7:40 AM, "Harsh J"  wrote:
>
> You can retrieve the filename in the new API as described here:
>
>
> http://search-hadoop.com/m/ZOmmJ1PZJqt1/map+input+filename&subj=Retrieving+Filename
>
> In the old API, its available in the configuration instance of the
> mapper as key "map.input.file". See the table below this section
>
> http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Task+JVM+Reuse
> for more such goodies.
>
> On Fri, Jul 15, 2011 at 5:44 PM, praveenesh kumar 
> wrote:
> > Hi,
> > How can I give filename as key to mapper ?
> > I want to know the occurence of word in set of docs, so I want to keep
> key
> > as filename. Is it possible to give input key as filename in map function
> ?
> > Thanks,
> > Praveenesh
> >
>
>
>
> --
> Harsh J
>
>



Issue with MR code not scaling correctly with data sizes

2011-07-14 Thread GOEKE, MATTHEW (AG/1000)
All,

I have a MR program that I feed in a list of IDs and it generates the unique 
comparison set as a result. Example: if I have a list {1,2,3,4,5} then the 
resulting output would be {2x1, 3x2, 3x1, 4x3, 4x2, 4x1, 5x4, 5x3, 5x2, 5x1} or 
(n^2-n)/2 number of comparisons. My code works just fine on smaller scaled sets 
(I can verify less than 1000 fairly easily) but fails when I try to push the 
set to 10-20k IDs which is annoying when the end goal is 1-10 million.

The flow of the program is:
1) Partition the IDs evenly, based on amount of output per value, into 
a set of keys equal to the number of reduce slots we currently have
2) Use the distributed cache to push the ID file out to the various 
reducers
3) In the setup of the reducer, populate an int array with the values 
from the ID file in distributed cache
4) Output a comparison only if the current ID from the values iterator 
is greater than the current iterator through the int array

I realize that this could be done many other ways but this will be part of an 
Oozie workflow so it made sense to just do it in MR for now. My issue is that 
when I try the larger sized ID files it only outputs part of the resulting data 
set and there are no errors to be found. Part of me thinks that I need to tweak 
some site configuration properties, due to the size of data that is spilling to 
disk, but after scanning through all three *-site files I am having issues 
pinpointing anything I think could be causing this. I moved from reading the file 
from HDFS to using the distributed cache for the join read thinking that might 
solve my problem but there seems to be something else I am overlooking.
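
To make step 4 concrete, a rough sketch of the reducer side only (names are
hypothetical and the distributed-cache loading from steps 2-3 is stubbed out):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PairGeneratingReducer extends Reducer<IntWritable, IntWritable, Text, NullWritable> {

  // In the real job this array is populated in setup() from the ID file that
  // was pushed to the distributed cache (steps 2 and 3 above).
  private int[] allIds = new int[0];

  @Override
  protected void reduce(IntWritable key, Iterable<IntWritable> ids, Context context)
      throws IOException, InterruptedException {
    for (IntWritable id : ids) {
      int current = id.get();
      for (int candidate : allIds) {
        // Emit only one direction so the full run produces (n^2 - n) / 2 pairs.
        if (current > candidate) {
          context.write(new Text(current + "x" + candidate), NullWritable.get());
        }
      }
    }
  }
}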

Any advice is greatly appreciated!

Matt



RE: Performance Tunning

2011-06-28 Thread GOEKE, MATTHEW (AG/1000)
Mike,

Somewhat of a tangent but it is actually very informative to hear that you are 
getting bound by I/O with a 2:1 core-to-disk ratio. Could you share what you 
used to make those calls? We have been using both a local ganglia daemon as 
well as the Hadoop ganglia daemon to get an overall look at the cluster, and the 
items of interest, I would assume, would be CPU I/O wait as well as the 
throughput of block operations.

Obviously the disconnect on my side was I didn't realize you were dedicating a 
physical core per daemon. I am a little surprised that you found that necessary, 
but then again after seeing some of the metrics from my own stress testing I am 
noticing that we might be overextending with our config on heavy loads. 
Unfortunately I am working with lower-specced hardware at the moment so I don't 
have the overhead to test that out.

Matt

-Original Message-
From: Michael Segel [mailto:michael_se...@hotmail.com] 
Sent: Tuesday, June 28, 2011 1:31 PM
To: common-user@hadoop.apache.org
Subject: RE: Performance Tunning



Matthew,

I understood that Juan was talking about a 2 socket quad core box. We run 
boxes with the e5500 (Xeon quad core) chips. Linux sees these as 16 cores. 
Our data nodes are 32GB RAM with 4 x 2TB SATA. It's a pretty basic configuration. 

What I was saying was that if you consider 1 core for each of the TT, DN and RS jobs, 
that's 3 out of the 8 physical cores, leaving you 5 cores or 10 'hyperthread 
cores'.
So you could put up 10 m/r slots on the machine. Note that on the main tasks 
(TT, DN, RS) I dedicate the physical core.

Of course your mileage may vary if you're doing non-standard things. 
A good starting point is 6 mappers and 4 reducers. 
And of course YMMV depending on if you're using MapR's release, Cloudera, and 
if you're running HBase or something else on the cluster.

From our experience... we end up getting disk I/O bound first, and then 
network or memory becomes the next constraint. The Xeon chipsets are 
really good. 

HTH

-Mike


> From: matthew.go...@monsanto.com
> To: common-user@hadoop.apache.org
> Subject: RE: Performance Tunning
> Date: Tue, 28 Jun 2011 14:46:40 +
> 
> Mike,
> 
> I'm not really sure I have seen a community consensus around how to handle 
> hyper-threading within Hadoop (although I have seen quite a few articles that 
> discuss it). I was assuming that when Juan mentioned they were 4-core boxes 
> that he meant 4 physical cores and not HT cores. I was more stating that the 
> starting point should be 1 slot per thread (or hyper-threaded core) but 
> obviously reviewing the results from ganglia, or any other monitoring 
> solution, will help you come up with a more concrete configuration based on 
> the load.
> 
> My brain might not be working this morning but how did you get the 10 slots 
> again? That seems low for an 8 physical core box but somewhat overextending 
> for a 4 physical core box.
> 
> Matt
> 
> -Original Message-
> From: im_gu...@hotmail.com [mailto:im_gu...@hotmail.com] On Behalf Of Michel 
> Segel
> Sent: Tuesday, June 28, 2011 7:39 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Performance Tunning
> 
> Matt,
> You have 2 threads per core, so your Linux box thinks an 8 core box has 16 
> cores. In my calcs, I tend to take a whole core for the TT, DN and RS and then a 
> thread per slot, so you end up with 10 slots per node. Of course memory is also a 
> factor.
> 
> Note this is only a starting point. You can always tune up. 
> 
> Sent from a remote device. Please excuse any typos...
> 
> Mike Segel
> 
> On Jun 27, 2011, at 11:11 PM, "GOEKE, MATTHEW (AG/1000)" 
>  wrote:
> 
> > Per node: 4 cores * 2 processes = 8 slots
> > Datanode: 1 slot
> > Tasktracker: 1 slot
> > 
> > Therefore max of 6 slots between mappers and reducers.
> > 
> > Below is part of our mapred-site.xml. The thing to keep in mind is the 
> > number of maps is defined by the number of input splits (which is defined 
> > by your data) so you only need to worry about setting the maximum number of 
> > concurrent processes per node. In this case the property you want to hone 
> > in on is mapred.tasktracker.map.tasks.maximum and 
> > mapred.tasktracker.reduce.tasks.maximum. Keep in mind there are a LOT of 
> > other tuning improvements that can be made but it requires a strong 
> > understanding of your job load.
> > 
> > 
> > <property>
> >   <name>mapred.tasktracker.map.tasks.maximum</name>
> >   <value>2</value>
> > </property>
> > 
> > <property>
> >   <name>mapred.tasktracker.reduce.tasks.maximum</name>
> >   <value>1</value>
> > </property>
> > 
> > <property>
> >   <name>mapred.child.java.opts</name>
> >   <value>-Xmx512m</value>
> > </property>
> > 
> >  
> 

RE: Performance Tunning

2011-06-28 Thread GOEKE, MATTHEW (AG/1000)
Mike,

I'm not really sure I have seen a community consensus around how to handle 
hyper-threading within Hadoop (although I have seen quite a few articles that 
discuss it). I was assuming that when Juan mentioned they were 4-core boxes 
that he meant 4 physical cores and not HT cores. I was more stating that the 
starting point should be 1 slot per thread (or hyper-threaded core) but 
obviously reviewing the results from ganglia, or any other monitoring solution, 
will help you come up with a more concrete configuration based on the load.

My brain might not be working this morning but how did you get the 10 slots 
again? That seems low for an 8 physical core box but somewhat overextending for 
a 4 physical core box.

Matt

-Original Message-
From: im_gu...@hotmail.com [mailto:im_gu...@hotmail.com] On Behalf Of Michel 
Segel
Sent: Tuesday, June 28, 2011 7:39 AM
To: common-user@hadoop.apache.org
Subject: Re: Performance Tunning

Matt,
You have 2 threads per core, so your Linux box thinks an 8 core box has 16 
cores. In my calcs, I tend to take a whole core for the TT, DN and RS and then a 
thread per slot, so you end up with 10 slots per node. Of course memory is also a 
factor.

Note this is only a starting point. You can always tune up. 

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jun 27, 2011, at 11:11 PM, "GOEKE, MATTHEW (AG/1000)" 
 wrote:

> Per node: 4 cores * 2 processes = 8 slots
> Datanode: 1 slot
> Tasktracker: 1 slot
> 
> Therefore max of 6 slots between mappers and reducers.
> 
> Below is part of our mapred-site.xml. The thing to keep in mind is the number 
> of maps is defined by the number of input splits (which is defined by your 
> data) so you only need to worry about setting the maximum number of 
> concurrent processes per node. In this case the property you want to hone in 
> on is mapred.tasktracker.map.tasks.maximum and 
> mapred.tasktracker.reduce.tasks.maximum. Keep in mind there are a LOT of 
> other tuning improvements that can be made but it requires a strong 
> understanding of your job load.
> 
> 
> <property>
>   <name>mapred.tasktracker.map.tasks.maximum</name>
>   <value>2</value>
> </property>
> 
> <property>
>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>   <value>1</value>
> </property>
> 
> <property>
>   <name>mapred.child.java.opts</name>
>   <value>-Xmx512m</value>
> </property>
> 
> <property>
>   <name>mapred.compress.map.output</name>
>   <value>true</value>
> </property>
> 
> <property>
>   <name>mapred.output.compress</name>
>   <value>true</value>
> </property>
> 
> 



RE: Performance Tunning

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
Per node: 4 cores * 2 processes = 8 slots
Datanode: 1 slot
Tasktracker: 1 slot

Therefore max of 6 slots between mappers and reducers.

Below is part of our mapred-site.xml. The thing to keep in mind is the number 
of maps is defined by the number of input splits (which is defined by your 
data) so you only need to worry about setting the maximum number of concurrent 
processes per node. In this case the property you want to hone in on is 
mapred.tasktracker.map.tasks.maximum and 
mapred.tasktracker.reduce.tasks.maximum. Keep in mind there are a LOT of other 
tuning improvements that can be made but it requires a strong understanding of 
your job load.


  
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>

<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>

<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>

<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>

<property>
  <name>mapred.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.75</value>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
</property>

<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>

<property>
  <name>io.sort.mb</name>
  <value>256</value>
</property>

<property>
  <name>io.sort.factor</name>
  <value>64</value>
</property>

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
  <final>true</final>
</property>

<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>mapreduce.job.user.name</value>
  <final>true</final>
</property>

<property>
  <name>mapred.fairscheduler.assignmultiple</name>
  <value>true</value>
  <final>true</final>
</property>

<property>
  <name>mapred.hosts</name>
  <value>/hadoop/hadoop/conf/mapred-hosts-include</value>
</property>

<property>
  <name>mapred.hosts.exclude</name>
  <value>/hadoop/hadoop/conf/mapred-hosts-exclude</value>
</property>


-Original Message-
From: Juan P. [mailto:gordoslo...@gmail.com] 
Sent: Monday, June 27, 2011 10:13 PM
To: common-user@hadoop.apache.org
Subject: Re: Performance Tunning

Ok,
So I tried putting the following config in the mapred-site.xml of all of my
nodes


  
<property>
  <name>mapred.job.tracker</name>
  <value>name-node:54311</value>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>7</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
</property>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
</property>


but when I start a new job it gets stuck at

11/06/28 03:04:47 INFO mapred.JobClient:  map 0% reduce 0%

Any thoughts?
Thanks for your help guys!

On Mon, Jun 27, 2011 at 7:33 PM, Juan P.  wrote:

> Matt,
> Thanks for your help!
> I think I get it now, but this part is a bit confusing:
> "so: tasktracker/datanode and 6 slots left. How you break it up from there
> is your call but I would suggest either 4 mappers / 2 reducers or 5 mappers
> / 1 reducer."
>
> If it's 2 processes per core, then it's: 4 Nodes * 4 Cores/Node * 2
> Processes/Core = 32 Processes Total
>
> So my configuration mapred-site.xml should include these props:
>
> <property>
>   <name>mapred.map.tasks</name>
>   <value>28</value>
> </property>
> <property>
>   <name>mapred.reduce.tasks</name>
>   <value>4</value>
> </property>
>
> Is that correct?
>
> On Mon, Jun 27, 2011 at 4:59 PM, GOEKE, MATTHEW (AG/1000) <
> matthew.go...@monsanto.com> wrote:
>
>> If you are running default configurations then you are only getting 2
>> mappers and 1 reducer per node. The rule of thumb I have gone on (and backed
>> up by the definitive guide) is 2 processes per core so: tasktracker/datanode
>> and 6 slots left. How you break it up from there is your call but I would
>> suggest either 4 mappers / 2 reducers or 5 mappers / 1 reducer.
>>
>> Check out the below configs for details on what you are *most likely*
>> running currently:
>> http://hadoop.apache.org/common/docs/r0.20.2/mapred-default.html
>> http://hadoop.apache.org/common/docs/r0.20.2/hdfs-default.html
>> http://hadoop.apache.org/common/docs/r0.20.2/core-default.html
>>
>> HTH,
>> Matt
>>
>> -Original Message-
>> From: Juan P. [mailto:gordoslo...@gmail.com]
>> Sent: Monday, June 27, 2011 2:50 PM
>> To: common-user@hadoop.apache.org
>> Subject: Performance Tunning
>>
>> I'm trying to run a MapReduce task against a cluster of 4 DataNodes with 4
>> cores each.
>> My input data is 4GB in size and it's split into 100MB files. Current
>> configuration is default so block size is 64MB.
>>
>> If I understand it correctly Hadoop should be running 64 Mappers to
>> process
>> the data.
>>
>> I'm running a simple data counting MapReduce and it's taking about 30mins
>> to
>> complete. This seems like way too much, doesn't it?
>> Is there any tuning you guys would recommend to try and see an
>> improvement
>> in performance?
>>
>> Thanks,
>> Pony

RE: Why I cannot see live nodes in a LAN-based cluster setup?

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
At this point if that is the correct ip then I would see if you can actually 
ssh from the DN to the NN to make sure it can actually connect to the other 
box. If you can successfully connect through ssh then it's just a matter of 
figuring out why that port is having issues (netstat is your friend in this 
case). If you see it listening on 54310 then just power cycle the box and try 
again.

Matt

-Original Message-
From: Jingwei Lu [mailto:j...@ucsd.edu] 
Sent: Monday, June 27, 2011 5:38 PM
To: common-user@hadoop.apache.org
Subject: Re: Why I cannot see live nodes in a LAN-based cluster setup?

Hi Matt and Jeff:

Thanks a lot for your instructions. I corrected the mistakes in the conf files
of the DN, and now the log on the DN becomes:

2011-06-27 15:32:36,025 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 0 time(s).
2011-06-27 15:32:37,028 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 1 time(s).
2011-06-27 15:32:38,031 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 2 time(s).
2011-06-27 15:32:39,034 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 3 time(s).
2011-06-27 15:32:40,037 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 4 time(s).
2011-06-27 15:32:41,040 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 5 time(s).
2011-06-27 15:32:42,043 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 6 time(s).
2011-06-27 15:32:43,046 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 7 time(s).
2011-06-27 15:32:44,049 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 8 time(s).
2011-06-27 15:32:45,052 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 9 time(s).
2011-06-27 15:32:45,053 INFO org.apache.hadoop.ipc.RPC: Server at
clock.ucsd.edu/132.239.95.91:54310 not available yet, Z...

It seems the DN keeps trying to connect to the NN but always fails...



Best Regards
Yours Sincerely

Jingwei Lu



On Mon, Jun 27, 2011 at 2:22 PM, GOEKE, MATTHEW (AG/1000) <
matthew.go...@monsanto.com> wrote:

> As a follow-up to what Jeff posted: go ahead and ignore the message you got
> on the NN for now.
>
> If you look at the address that the DN log shows it is 127.0.0.1 and the
> ip:port it is trying to connect to for the NN is 127.0.0.1:54310 ---> it
> is trying to bind to itself as if it was still in single machine mode. Make
> sure that you have correctly pushed the URI for the NN into the config files
> on both machines and then bounce DFS.
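
As an illustration of that last point, a minimal core-site.xml sketch for both
machines (hdfs://clock.ucsd.edu:54310 is the namenode address used elsewhere in
this thread; adjust it to your own setup, then bounce DFS with stop-dfs.sh /
start-dfs.sh):

  <?xml version="1.0"?>
  <configuration>
    <!-- Every node must point at the same namenode URI, not at localhost. -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://clock.ucsd.edu:54310</value>
    </property>
  </configuration>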
>
> Matt
>
> -Original Message-
> From: jeff.schm...@shell.com [mailto:jeff.schm...@shell.com]
> Sent: Monday, June 27, 2011 4:08 PM
> To: common-user@hadoop.apache.org
> Subject: RE: Why I cannot see live nodes in a LAN-based cluster setup?
>
> http://www.mentby.com/tim-robertson/error-register-getprotocolversion.html
>
>
>
> -Original Message-
> From: Jingwei Lu [mailto:j...@ucsd.edu]
> Sent: Monday, June 27, 2011 3:58 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Why I cannot see live nodes in a LAN-based cluster setup?
>
> Hi,
>
> I just manually modified the masters & slaves files on both machines.
>
> I found something wrong in the log files, as shown below:
>
> -- Master :
> namenode.log:
>
> 
> 2011-06-27 13:44:47,055 INFO org.mortbay.log: jetty-6.1.14
> 2011-06-27 13:44:47,394 INFO org.mortbay.log: Started
> SelectChannelConnector@0.0.0.0:50070
> 2011-06-27 13:44:47,395 INFO
> org.apache.hadoop.hdfs.server.namenode.NameNode: Web-server up at:
> 0.0.0.0:50070
> 2011-06-27 13:44:47,395 INFO org.apache.hadoop.ipc.Server: IPC Server
> Responder: starting
> 2011-06-27 13:44:47,395 INFO org.apache.hadoop.ipc.Server: IPC Server
> listener on 54310: starting
> 2011-06-27 13:44:47,396 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 0 on 54310: starting
> 2011-06-27 13:44:47,397 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 1 on 54310: starting
> 2011-06-27 13:44:47,397 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 2 on 54310: starting
> 2011-06-27 13:44:47,397 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 3 on 54310: starting
> 2011-06-27 13:44:47,402 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 4 on 54310: starting
> 2011-06-27 13:44:47,404 INFO org.apache.ha

RE: Why I cannot see live nodes in a LAN-based cluster setup?

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
.1:54310. Already tried 4 time(s).
 14 2011-06-27 13:45:07,643 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 5 time(s).
 15 2011-06-27 13:45:08,646 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 6 time(s).
 16 2011-06-27 13:45:09,661 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 7 time(s).
 17 2011-06-27 13:45:10,664 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 8 time(s).
 18 2011-06-27 13:45:11,678 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 9 time(s).
 19 2011-06-27 13:45:11,679 INFO org.apache.hadoop.ipc.RPC: Server at
hdl.ucsd.edu/127.0.0.1:54310 not available yet, Z...


(Just a guess: could this be due to some port problem?)

Any comments will be greatly appreciated!

Best Regards
Yours Sincerely

Jingwei Lu



On Mon, Jun 27, 2011 at 1:28 PM, GOEKE, MATTHEW (AG/1000) <
matthew.go...@monsanto.com> wrote:

> Did you make sure to define the datanode/tasktracker in the slaves file in
> your conf directory and push that to both machines? Also have you checked
> the logs on either to see if there are any errors?
>
> Matt
>
> -Original Message-
> From: Jingwei Lu [mailto:j...@ucsd.edu]
> Sent: Monday, June 27, 2011 3:24 PM
> To: HADOOP MLIST
> Subject: Why I cannot see live nodes in a LAN-based cluster setup?
>
> Hi Everyone:
>
> I am quite new to Hadoop. I am attempting to set up Hadoop locally on
> two machines connected by LAN. Both of them pass the single-node test.
> However, I failed in the two-node cluster setup, in the 2 cases below:
>
> 1) set one as dedicated namenode and the other as dedicated datanode
> 2) set one as both name- and data-node, and the other as just datanode
>
> I launch start-dfs.sh on the namenode. Since I have all the ssh issues
> cleared, I can always observe the daemon start up on every datanode.
> However, the web UI at http://(URI of namenode):50070 shows only 0 live
> nodes for (1) and 1 live node for (2), which matches the output of the
> command-line hadoop dfsadmin -report
>
> Generally it appears that from the namenode you cannot see the remote
> datanode as alive, let alone run a normal across-node MapReduce job.
>
> Could anyone give some hints / instructions at this point? I really
> appreciate it!
>
> Thanks.
>
> Best Regards
> Yours Sincerely
>
> Jingwei Lu


RE: Why I cannot see live nodes in a LAN-based cluster setup?

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
Did you make sure to define the datanode/tasktracker in the slaves file in your 
conf directory and push that to both machines? Also have you checked the logs 
on either to see if there are any errors?
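
As a rough sketch of what that looks like on a 0.20-style tarball install
(hostnames below are placeholders, not taken from this thread):

  # On the box you start the cluster from: one hostname per line.
  echo "namenode-host"   >  $HADOOP_HOME/conf/masters   # host that runs the secondary namenode
  echo "datanode-host-1" >  $HADOOP_HOME/conf/slaves    # hosts that run datanode + tasktracker
  echo "datanode-host-2" >> $HADOOP_HOME/conf/slaves
  # Push the whole conf directory so every machine sees the same settings.
  scp $HADOOP_HOME/conf/* datanode-host-1:$HADOOP_HOME/conf/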

Matt

-Original Message-
From: Jingwei Lu [mailto:j...@ucsd.edu] 
Sent: Monday, June 27, 2011 3:24 PM
To: HADOOP MLIST
Subject: Why I cannot see live nodes in a LAN-based cluster setup?

Hi Everyone:

I am quite new to Hadoop. I am attempting to set up Hadoop locally on
two machines connected by LAN. Both of them pass the single-node test.
However, I failed in the two-node cluster setup, in the 2 cases below:

1) set one as dedicated namenode and the other as dedicated datanode
2) set one as both name- and data-node, and the other as just datanode

I launch start-dfs.sh on the namenode. Since I have all the ssh issues
cleared, I can always observe the daemon start up on every datanode.
However, the web UI at http://(URI of namenode):50070 shows only 0 live
nodes for (1) and 1 live node for (2), which matches the output of the
command-line hadoop dfsadmin -report

Generally it appears that from the namenode you cannot see the remote
datanode as alive, let alone run a normal across-node MapReduce job.

Could anyone give some hints / instructions at this point? I really
appreciate it!

Thanks.

Best Regards
Yours Sincerely

Jingwei Lu


RE: Performance Tunning

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
If you are running default configurations then you are only getting 2 mappers 
and 1 reducer per node. The rule of thumb I have gone on (and backed up by the 
Definitive Guide) is 2 processes per core, so: tasktracker/datanode and 6 slots 
left. How you break it up from there is your call but I would suggest either 4 
mappers / 2 reducers or 5 mappers / 1 reducer.
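
For what it's worth, a minimal mapred-site.xml sketch of the 4 mappers / 2
reducers split described above (per-tasktracker slot counts; the numbers are
just this example, not a universal recommendation, and the tasktrackers need a
restart to pick them up):

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>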

Check out the below configs for details on what you are *most likely* running 
currently:
http://hadoop.apache.org/common/docs/r0.20.2/mapred-default.html
http://hadoop.apache.org/common/docs/r0.20.2/hdfs-default.html
http://hadoop.apache.org/common/docs/r0.20.2/core-default.html

HTH,
Matt

-Original Message-
From: Juan P. [mailto:gordoslo...@gmail.com] 
Sent: Monday, June 27, 2011 2:50 PM
To: common-user@hadoop.apache.org
Subject: Performance Tunning

I'm trying to run a MapReduce task against a cluster of 4 DataNodes with 4
cores each.
My input data is 4GB in size and it's split into 100MB files. Current
configuration is default so block size is 64MB.

If I understand it correctly Hadoop should be running 64 Mappers to process
the data.

I'm running a simple data counting MapReduce and it's taking about 30mins to
complete. This seems like way too much, doesn't it?
Is there any tuning you guys would recommend to try and see an improvement
in performance?

Thanks,
Pony



RE: Queue support from HDFS

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
Saumitra,

Two questions come to mind that could help you narrow down a solution:

1) How quickly do the downstream processes need the transformed data?
Reason: If you can delay the processing for a period of time, enough to 
batch the data into a blob that is a multiple of your block size, then you are 
obviously going to be working more towards the strong suit of vanilla MR.

2) What else will be running on the cluster?
Reason: If the cluster is primarily set up for this use case, then how often the 
job runs and what resources it consumes only need to be optimized if it can't 
process the batches fast enough. If it is not, then you could always set up a 
separate pool for this in the fair scheduler and allow it a certain amount of 
overhead on the cluster while these events are being generated.
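
For illustration, a sketch of what a dedicated pool could look like in the fair
scheduler allocation file (the file named by mapred.fairscheduler.allocation.file
in the 0.20 fair scheduler; the pool name and numbers here are made up):

  <?xml version="1.0"?>
  <allocations>
    <!-- A small guaranteed share for the streaming-ingest jobs. -->
    <pool name="stream-ingest">
      <minMaps>4</minMaps>
      <minReduces>1</minReduces>
      <weight>1.0</weight>
      <maxRunningJobs>2</maxRunningJobs>
    </pool>
  </allocations>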

Outside of the fact that you would have a lot of small files on the cluster 
(which can be resolved by running a nightly job to blob them and then delete the 
originals), I am not sure I would be too concerned about at least trying out 
this method. It would be helpful to know the size and type of data coming in, as 
well as what type of operation you are looking to do, if you would like a more 
concrete suggestion. Log data is a prime example of this type of workflow, and 
there are many suggestions out there, as well as projects that attempt to 
address it (e.g. Chukwa). 
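
A rough sketch of that kind of nightly "blob and delete" pass (the directory
layout and naming here are hypothetical):

  # Concatenate yesterday's small files into one large blob, then remove the originals.
  DAY=$(date -d yesterday +%Y%m%d)   # GNU date syntax; adjust on other platforms
  hadoop fs -cat /incoming/$DAY/'*' | hadoop fs -put - /archive/$DAY.dat \
    && hadoop fs -rmr /incoming/$DAY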

HTH,
Matt

-Original Message-
From: saumitra.shahap...@gmail.com [mailto:saumitra.shahap...@gmail.com] On 
Behalf Of Saumitra Shahapure
Sent: Friday, June 24, 2011 12:12 PM
To: common-user@hadoop.apache.org
Subject: Queue support from HDFS

Hi,

Is a queue-like structure supported on top of HDFS, where a stream of data is
processed as it is generated?
Specifically, I will have a stream of data coming in, and a data-independent
operation needs to be applied to it (so only a Map function; the reducer is the
identity).
I wish to distribute the data among nodes using HDFS and start processing it as
it arrives, preferably in a single MR job.

I agree that it can be done by starting a new MR job for each batch of data,
but is starting many MR jobs frequently for small data chunks a good idea?
(Consider that a new batch arrives every few seconds and processing one batch
takes a few minutes.)

Thanks,
-- 
Saumitra S. Shahapure



RE: Poor scalability with map reduce application

2011-06-21 Thread GOEKE, MATTHEW (AG/1000)
Harsh,

Is it possible for mapred.reduce.slowstart.completed.maps to even play a 
significant role in this? The only benefit he would find in tweaking that for 
his problem would be to spread network traffic from the shuffle over a longer 
period of time at a cost of having the reducer using resources earlier. Either 
way he would see this effect across both sets of runs if he is using the 
default parameters. I guess it would all depend on what kind of network layout 
the cluster is on.

Matt

-Original Message-
From: Harsh J [mailto:ha...@cloudera.com] 
Sent: Tuesday, June 21, 2011 12:09 PM
To: common-user@hadoop.apache.org
Subject: Re: Poor scalability with map reduce application

Alberto,

On Tue, Jun 21, 2011 at 10:27 PM, Alberto Andreotti
 wrote:
> I don't know if speculative maps are on, I'll check it. One thing I
> observed is that reduces begin before all maps have finished. Let me check
> also if the difference is on the map side or on the reduce side. I believe it's
> balanced, both are slower when adding more nodes, but I'll confirm that.

Maps and reduces are speculative by default, so must've been ON. Could
you also post a general input vs. output record counts and statistics
like that between your job runs, to correlate?

The reducers get scheduled early but do not exactly "reduce()" until
all maps are done. They just keep fetching outputs. Their scheduling
can be controlled with some configurations (say, to start only after
X% of maps are done -- by default it starts up when 5% of maps are
done).
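
For reference, the knob Harsh refers to is a single property; a sketch of
raising it so reducers are scheduled later (0.05, i.e. 5% of completed maps, is
the default):

  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <!-- Do not launch reducers until 80% of the maps have completed. -->
    <value>0.80</value>
  </property>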

-- 
Harsh J



TestDFSIO failure

2011-06-20 Thread GOEKE, MATTHEW (AG/1000)
Has anyone else run into issues using output compression (in our case LZO) with 
TestDFSIO, where it then fails to read the metrics file? I just assumed it would 
use the correct decompression codec after it finishes, but it always returns a 
'File not found' exception. Is there a simple way around this without spending 
the time to recompile a cluster/codec-specific version?

Matt


RE: large memory tasks

2011-06-15 Thread GOEKE, MATTHEW (AG/1000)
Is the lookup table constant across each of the tasks? You could try putting it 
into memcached:

http://hcil.cs.umd.edu/trs/2009-01/2009-01.pdf
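
A bare-bones sketch of that direction, assuming memcached is installed on a box
every task can reach (the host name and sizes below are illustrative, and the
tasks themselves would read entries through a memcached client library):

  # Start one 16 GB memcached instance on a well-provisioned machine.
  memcached -d -m 16384 -p 11211
  # Smoke test from a worker node using the plain text protocol.
  printf 'set testkey 0 0 5\r\nhello\r\nget testkey\r\nquit\r\n' | nc lookup-host 11211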

Matt

-Original Message-
From: Ian Upright [mailto:i...@upright.net] 
Sent: Wednesday, June 15, 2011 3:42 PM
To: common-user@hadoop.apache.org
Subject: large memory tasks

Hello, I'm quite new to Hadoop, so I'd like to get an understanding of
something.

Let's say I have a task that requires 16 GB of memory in order to execute.
Let's say hypothetically it's some sort of big lookup table that
needs that kind of memory.

I could have 8 cores run the task in parallel (multithreaded), and all 8
cores can share that 16gb lookup table.

On another machine, I could have 4 cores run the same task, and they still
share that same 16gb lookup table.

Now, with my understanding of Hadoop, each task has its own memory.

So if I have 4 tasks that run on one machine, and 8 tasks on another, then
the 4 tasks need a 64 GB machine, and the 8 tasks need a 128 GB machine, but
really, lets say I only have two machines, one with 4 cores and one with 8,
each machine only having 24 GB.

How can the work be evenly distributed among these machines?  Am I missing
something?  What other ways can this be configured such that this works
properly?

Thanks, Ian



RE: problems with start-all.sh

2011-05-10 Thread GOEKE, MATTHEW [AG/1000]
Keith, if you have a chance you might want to look at Hadoop: The
Definitive Guide or the various FAQs around for rolling a cluster from a
tarball. One thing that most recommend is to set up a hadoop user and
then to chown all of the files / directories it needs over to it. Right
now what you are running into is that you have not chown'ed the folder
to your user or correctly chmod'ed the directories.

User / group permissions will become increasingly important when you
move into the DFS setup, so it is important to get the core setup correct.
Matt

-Original Message-
From: Keith Thompson [mailto:kthom...@binghamton.edu] 
Sent: Tuesday, May 10, 2011 10:54 AM
To: common-user@hadoop.apache.org
Subject: Re: problems with start-all.sh

Thanks for catching that comma.  It was actually my HADOOP_CONF_DIR
rather
than HADOOP_HOME that was the culprit. :)
As for sudo ... I am not sure how to run it as a regular user.  I set up
ssh
for a passwordless login (and am able to ssh localhost without password)
but
I installed hadoop to /usr/local so every time I try to run it, it says
permission denied. So, I have to run hadoop using sudo (and it prompts
for
password as super user).  I should have installed hadoop to my home
directory instead I guess ... :/

On Tue, May 10, 2011 at 11:47 AM, Luca Pireddu  wrote:

> On May 10, 2011 17:39:12 Keith Thompson wrote:
> > Hi Luca,
> >
> > Thank you.  That worked ... at least I didn't get the same error.  Now I
> > get:
> >
> > k_thomp@linux-8awa:/usr/local/hadoop-0.20.2> sudo bin/start-all.sh
> > starting namenode, logging to
> > /usr/local/hadoop-0.20.2/bin/../logs/hadoop-root-namenode-linux-8awa.out
> > cat: /usr/local/hadoop-0,20.2/conf/slaves: No such file or directory
> > Password:
> > localhost: starting secondarynamenode, logging to
> > /usr/local/hadoop-0.20.2/bin/../logs/hadoop-root-secondarynamenode-linux-8awa.out
> > localhost: Exception in thread "main" java.lang.NullPointerException
> > localhost:  at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:134)
> > localhost:  at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:156)
> > localhost:  at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:160)
> > localhost:  at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:131)
> > localhost:  at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.<init>(SecondaryNameNode.java:115)
> > localhost:  at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:469)
> > starting jobtracker, logging to
> > /usr/local/hadoop-0.20.2/bin/../logs/hadoop-root-jobtracker-linux-8awa.out
> > cat: /usr/local/hadoop-0,20.2/conf/slaves: No such file or directory
>
> Don't try to run it as root with "sudo".  Just run it as your regular
user.
> If you try to run it as a different user then you'll have to set up
the ssh
> keys for that user (notice the "Password" prompt because ssh was
unable to
> perform a password-less login into localhost).
>
> Also, make sure you've correctly set HADOOP_HOME to the path where you
> extracted the Hadoop archive.  I'm seeing a comma in the path shown in
the
> error ("/usr/local/hadoop-0,20.2/conf/slaves") that probably shouldn't
be
> there :-)
>
>
> --
> Luca Pireddu
> CRS4 - Distributed Computing Group
> Loc. Pixina Manna Edificio 1
> Pula 09010 (CA), Italy
> Tel:  +39 0709250452
>



RE: Configuration for small Cluster

2011-05-02 Thread GOEKE, MATTHEW [AG/1000]
Have you tested the performance effect of adjusting the 
mapred.reduce.slowstart.completed.maps property? I'm curious what effect 
you have seen by dropping it from the default to .01, because my original 
assumption would have been to try something much higher so that you don't have 
threads spawning so soon for sort and shuffle. Also, what kind of network 
interface does each of these machines have, and how is the "rack" set up?

Matt

-Original Message-
From: baran cakici [mailto:barancak...@gmail.com] 
Sent: Monday, May 02, 2011 10:30 AM
To: common-user@hadoop.apache.org
Subject: Re: Configuration for small Cluster

I got it. I want to run one reduce task on each tasktracker, for 4 reduce
tasks overall across the cluster.

2011/5/2 baran cakici 

> Actually it was one; I changed that and got better reduce performance,
> because my reduce algorithm is a little bit complex.
>
> thanks anyway
>
> Regards,
>
> Baran
>
> 2011/5/2 Richard Nadeau 
>
>> I would change "mapred.tasktracker.reduce.tasks.maximum" to one. With your
>> setting
>>
>> On May 2, 2011 8:48 AM, "baran cakici"  wrote:
>> > without job;
>> >
>> > CPU Usage = 0%
>> > Memory = 585 MB (2GB Ram)
>> >
>> > Baran
>> > 2011/5/2 baran cakici 
>> >
>> >> CPU Usage = 95-100%
>> >> Memory = 650-850 MB (2GB Ram)
>> >>
>> >> Baran
>> >>
>> >>
>> >> 2011/5/2 James Seigel 
>> >>
>> >>> If you have Windows and cygwin you probably don't have a lot of memory
>> >>> left at 2 gig.
>> >>>
>> >>> Pull up system monitor on the data nodes and check for free memory
>> >>> when you have you jobs running. I bet it is quite low.
>> >>>
>> >>> I am not a windows guy so I can't take you much farther.
>> >>>
>> >>> James
>> >>>
>> >>> Sent from my mobile. Please excuse the typos.
>> >>>
>> >>> On 2011-05-02, at 8:32 AM, baran cakici 
>> wrote:
>> >>>
>> >>> > yes, I am running under cygwin on my datanodes too. OS of Datanodes
>> are
>> >>> > Windows as well.
>> >>> >
>> >>> > What can I do exactly for a better Performance. I changed
>> >>> > mapred.child.java.opts to default value.How can I solve this
>> "swapping"
>> >>> > problem?
>> >>> >
>> >>> > PS: I dont have a chance to get Slaves(Celeron 2GHz) with Liniux OS.
>> >>> >
>> >>> > thanks, both of you
>> >>> >
>> >>> > Regards,
>> >>> >
>> >>> > Baran
>> >>> > 2011/5/2 Richard Nadeau 
>> >>> >
>> >>> >> Are you running under cygwin on your data nodes as well? That is
>> >>> certain to
>> >>> >> cause performance problems. As James suggested, swapping to disk is
>> >>> going
>> >>> >> to
>> >>> >> be a killer, running on Windows with Celeron processors only
>> compounds
>> >>> the
>> >>> >> problem. The Celeron processor is also sub-optimal for CPU
>> intensive
>> >>> tasks
>> >>> >>
>> >>> >> Rick
>> >>> >>
>> >>> >> On Apr 28, 2011 9:22 AM, "baran cakici" 
>> wrote:
>> >>> >>> Hi Everyone,
>> >>> >>>
>> >>> >>> I have a cluster with one Master (JobTracker and NameNode - Intel
>> >>> >>> Core2Duo, 2 GB RAM) and four Slaves (Datanode and Tasktracker -
>> >>> >>> Celeron, 2 GB RAM). My input data is between 2GB-10GB and I read the
>> >>> >>> input data in MapReduce line by line. Now I am trying to accelerate
>> >>> >>> my system (benchmark), but I'm not sure whether my configuration is
>> >>> >>> correct. Can you please just take a look and see if it is OK?
>> >>> >>>
>> >>> >>> -mapred-site.xml
>> >>> >>>
>> >>> >>> <property>
>> >>> >>> <name>mapred.job.tracker</name>
>> >>> >>> <value>apple:9001</value>
>> >>> >>> </property>
>> >>> >>>
>> >>> >>> <property>
>> >>> >>> <name>mapred.child.java.opts</name>
>> >>> >>> <value>-Xmx512m -server</value>
>> >>> >>> </property>
>> >>> >>>
>> >>> >>> <property>
>> >>> >>> <name>mapred.job.tracker.handler.count</name>
>> >>> >>> <value>2</value>
>> >>> >>> </property>
>> >>> >>>
>> >>> >>> <property>
>> >>> >>> <name>mapred.local.dir</name>
>> >>> >>> <value>/cygwin/usr/local/hadoop-datastore/hadoop-Baran/mapred/local</value>
>> >>> >>> </property>
>> >>> >>>
>> >>> >>> <property>
>> >>> >>> <name>mapred.map.tasks</name>
>> >>> >>> <value>1</value>
>> >>> >>> </property>
>> >>> >>>
>> >>> >>> <property>
>> >>> >>> <name>mapred.reduce.tasks</name>
>> >>> >>> <value>4</value>
>> >>> >>> </property>
>> >>> >>>
>> >>> >>> <property>
>> >>> >>> <name>mapred.submit.replication</name>
>> >>> >>> <value>2</value>
>> >>> >>> </property>
>> >>> >>>
>> >>> >>> <property>
>> >>> >>> <name>mapred.system.dir</name>
>> >>> >>> <value>/cygwin/usr/local/hadoop-datastore/hadoop-Baran/mapred/system</value>
>> >>> >>> </property>
>> >>> >>>
>> >>> >>> <property>
>> >>> >>> <name>mapred.tasktracker.indexcache.mb</name>
>> >>> >>> <value>10</value>
>> >>> >>> </property>
>> >>> >>>
>> >>> >>> <property>
>> >>> >>> <name>mapred.tasktracker.map.tasks.maximum</name>
>> >>> >>> <value>1</value>
>> >>> >>> </property>
>> >>> >>>
>> >>> >>> <property>
>> >>> >>> <name>mapred.tasktracker.reduce.tasks.maximum</name>
>> >>> >>> <value>4</value>
>> >>> >>> </property>
>> >>> >>>
>> >>> >>> <property>
>> >>> >>> <name>mapred.temp.dir</name>
>> >>> >>> <value>/cygwin/usr/local/hadoop-datastore/hadoop-Baran/mapred/temp</value>
>> >>> >>> </property>
>> >>> >>>
>> >>> >>> <property>
>> >>> >>> <name>webinterface.private.actions</name>
>> >>> >>> <value>true</value>
>> >>> >>> </property>
>> >>> >>>
>> >>> >>> <property>
>> >>> >>> <name>mapred.reduce.slowstart.completed.maps</name>
>> >>> >>> <value>0.01</value>
>> >>> >>> </property>
>> >>> >>>
>> >>> >>> -hdfs-site.xml
>> >>> >>>
>> >>> >>> <property>
>> >>> >>> <name>dfs.block.size</name>
>> >>> >>> <value>268435456</value>
>> >>> >>> </property>
>> >>> >>> PS: I extended dfs.block.size, becaus