Re: Read Little Endian Input File Format

2012-07-09 Thread Owen O'Malley
On Mon, Jul 9, 2012 at 1:33 PM, Mike S  wrote:

> The input file to my M/R job is a file with binary data (a mix of 20
> ints, longs, floats and doubles per record), all saved in little
> endian. I have implemented my custom record reader to read a record,
> and to do so I am currently using a ByteBuffer to convert every entry
> in the file. I am wondering if there is a more efficient way of doing
> this?
>

I would either make a large ByteBuffer and read into it or use:

// read big endian int
int val = in.readInt();
// flip to little endian
val = ((val & 0xff) << 24) | ((val & 0xff00) << 8) |
      ((val & 0xff0000) >>> 8) | (val >>> 24);
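
If you stay with the ByteBuffer approach, a bulk-read sketch is below. The
record layout and field names are made up purely for illustration;
ByteOrder.LITTLE_ENDIAN and Integer.reverseBytes are standard JDK facilities.

import java.io.DataInput;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class LittleEndianRecordSketch {
  // Hypothetical record with three little endian fields.
  static void readRecord(DataInput in) throws IOException {
    byte[] raw = new byte[4 + 8 + 8];
    in.readFully(raw);
    ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
    int id = buf.getInt();           // illustrative field
    long stamp = buf.getLong();      // illustrative field
    double sample = buf.getDouble(); // illustrative field
    // For a single big endian read from DataInput, the JDK also offers:
    // int flipped = Integer.reverseBytes(in.readInt());
  }
}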

-- Owen


Re: Which hadoop version shoul I install in a production environment

2012-07-03 Thread Owen O'Malley
On Tue, Jul 3, 2012 at 1:19 PM, Pablo Musa  wrote:

> Which is the latest stable hadoop version to install in a production
> environment with package manager support?
>

 The current stable version of Hadoop is 1.0.3. It is available as both
source and rpms from here:

http://hadoop.apache.org/common/releases.html#Download


Which is the latest stable hadoop version to install in a production
> environment with package manager support?


Apache Ambari is a new project that can be used to install the Hadoop
ecosystem including Hadoop 1.0.3 and HBase 0.92.1 using rpms and providing
a web-based UI to control and monitor your cluster. They are in the process
of making their first release and we would love to discuss it with you on
ambari-u...@apache.org.

-- Owen


Re: hadoop kerberos security / unix kdc

2012-06-29 Thread Owen O'Malley
On Fri, Jun 29, 2012 at 2:07 PM, Tony Dean  wrote:

> Hadoop 1.0.3, JDK1.6.0_21 with JCE export jars for strong encryption.


You need to move up to a JDK > 1.6.0_27. I'd suggest 1.6.0_31.

For details, look at: http://wiki.apache.org/hadoop/HadoopJavaVersions

-- Owen


Re: hadoop kerberos security / unix kdc

2012-06-29 Thread Owen O'Malley
On Fri, Jun 29, 2012 at 1:50 PM, Tony Dean  wrote:

>  First, I’d like to thank the community for the time and effort they put
> into sharing their knowledge…
>

Which version of Hadoop are you running? Which JDK are you using? You
probably need HDFS-2617 and JDK 1.6.0_31.

-- Owen


Re: Need example programs other then wordcount for hadoop

2012-06-29 Thread Owen O'Malley
On Fri, Jun 29, 2012 at 9:46 AM, Saravanan Nagarajan <
saravanan.nagarajan...@gmail.com> wrote:

> HI all,
>
> I ran the word count example in Hadoop and it's a very good starting
> point. But I am looking for more programs with advanced concepts. If you
> have any programs or suggestions, please send them to me at "
> saravanan.nagarajan...@gmail.com".
>
> If you have best practices, please share them with me.
>

Also look at the other examples in the project that I've written over the
years.

* teragen, terasort, teravalidate  at
http://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0/src/examples/org/apache/hadoop/examples/terasort/
* puzzle solver
http://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0/src/examples/org/apache/hadoop/examples/dancing/
* secondary sort
http://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0/src/examples/org/apache/hadoop/examples/SecondarySort.java

-- Owen


Re: Hadoop security

2012-06-25 Thread Owen O'Malley
On Mon, Jun 25, 2012 at 8:02 AM, Fabio Pitzolu wrote:

> Hi community!
> I have a question concerning the Hadoop security, in particular I need some
> advice to configure the Kerberos authentication:
>
> 1 - I have an Active Directory domain, do I have to connect the Linux
> Hadoop nodes to the AD domain?
> 2 - Is it possible to use a KDC to authenticate and another KDC for user /
> groups authorization?
>

It is common to create a domain for the linux machines in the cluster with
the principals for the servers (nn/_HOST, jt/_HOST, dn/_HOST, tt/_HOST,
etc. where the _HOST is replaced by the full host name.) If you have an
Active Directory for the users, you need to set up a trust relationship
between the linux KDC and the ActiveDirectory. The other critical piece is
setting up the auth_to_local mapping so that the kerberos principals are
correctly mapped to unix login ids.
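
As an illustration only (the realm and the rule pattern are made up), the
auth_to_local mapping lives in core-site.xml and might look like this to strip
the realm so that, e.g., jdoe@CORP.EXAMPLE.COM becomes the local user jdoe:

<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[1:$1@$0](.*@CORP\.EXAMPLE\.COM)s/@.*//
    RULE:[2:$1@$0](.*@CORP\.EXAMPLE\.COM)s/@.*//
    DEFAULT
  </value>
</property>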

This is a common configuration, so you aren't even on the bleeding edge.
*grin*

-- Owen


Re: Terasort

2012-05-14 Thread Owen O'Malley
On Mon, May 14, 2012 at 10:40 AM, Barry, Sean F  wrote:
> I am having a bit of trouble understanding how the Terasort benchmark works, 
> especially the fundamentals of how the data is sorted. If the data is being 
> split into many chunks wouldn't it all have to be re-integrated back into the 
> entire dataset?

Before the job is launched, the input is sampled to find "cut" points.
Those cut points are used to assign keys to reduces. For example, if
you have 100 reduces, there are 99 keys chosen. All keys less than the
first are sent to the first reduce, between the first two keys are
sent to the second reduce and so on. The logic is done by the
TotalOrderPartitioner, which replaces MapReduce's default
HashPartitioner.
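
For anyone who wants to reuse the same idea in their own job, a rough sketch
with the stock library classes follows (old 0.20 API; the driver class, sampler
parameters, and partition-file path are placeholders):

// imports: org.apache.hadoop.fs.Path, org.apache.hadoop.io.Text,
//          org.apache.hadoop.mapred.JobConf,
//          org.apache.hadoop.mapred.lib.InputSampler and TotalOrderPartitioner
JobConf job = new JobConf(conf, MyTotalSort.class);   // MyTotalSort is a placeholder
job.setNumReduceTasks(100);                           // 100 reduces => 99 cut points
job.setPartitionerClass(TotalOrderPartitioner.class); // replaces HashPartitioner

Path cutPoints = new Path("/tmp/_sortPartitions");    // illustrative location
TotalOrderPartitioner.setPartitionFile(job, cutPoints);
InputSampler.writePartitionFile(job,
    new InputSampler.RandomSampler<Text, Text>(0.01, 10000, 10));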

-- Owen


Re: Is TeraGen's generated data deterministic?

2012-04-14 Thread Owen O'Malley
Yes, both versions of teragen are completely deterministic. They each use a 
random number generator with a fixed seed. 

-- Owen

On Apr 14, 2012, at 1:53 PM, David Erickson  wrote:

> Hi we are doing some benchmarking of some of our infrastructure and
> are using TeraGen/TeraSort to do the benchmarking.  I am wondering if
> the data generated by TeraGen is deterministic, in that if I repeat
> the same experiment multiple times with the same configuration options
> if it will continue to generate and sort the exact same data?  And if
> not, is there an easy mod to make this happen?
> 
> Thanks!
> David


Re: Very strange Java Collection behavior in Hadoop

2012-03-19 Thread Owen O'Malley
On Mon, Mar 19, 2012 at 11:05 PM, madhu phatak  wrote:

> Hi Owen O'Malley,
>  Thank you for the instant reply. It's working now. Can you explain
> what you mean by "input to reducer is reused" in a little more detail?


Each time the statement "Text value = values.next();" is executed it always
returns the same Text object with the contents of that object changed. When
you add the Text to the list, you are adding a pointer to the same Text
object. At the end you have 6 copies of the same pointer instead of 6
different Text objects.

The reason that I said it is my fault, is because I added the optimization
that causes it. If you are interested in Hadoop archeology, it was
HADOOP-2399 that made the change. I also did HADOOP-3522 to improve the
documentation in the area.

-- Owen


Re: Very strange Java Collection behavior in Hadoop

2012-03-19 Thread Owen O'Malley
On Mon, Mar 19, 2012 at 10:52 PM, madhu phatak  wrote:

> Hi All,
>  I am using Hadoop 0.20.2 . I am observing a Strange behavior of Java
> Collection's . I have following code in reducer


That is my fault. *sigh* The input to the reducer is reused. Replace:

list.add(value);

with:

list.add(new Text(value));

and the problem will go away.

-- Owen


Re: High quality hadoop logo?

2012-03-01 Thread Owen O'Malley
On Thu, Mar 1, 2012 at 2:14 PM, Keith Wiley  wrote:
> Sorry, false alarm.  I was looking at the popup thumbnails in google image 
> search.  If I click all the way through, there are some high quality
> versions available.  Why is the version on the Apache site (and the Wikipedia 
> page) so poor?

The high resolution images are in subversion:

http://svn.apache.org/repos/asf/hadoop/logos/

-- Owen


Re: Hadoop and Hibernate

2012-02-28 Thread Owen O'Malley
On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
 wrote:

> If I create an executable jar file that contains all dependencies required
> by the MR job do all said dependencies get distributed to all nodes?

You can make a single jar and that will be distributed to all of the
machines that run the task, but it is better in most cases to use the
distributed cache.

See 
http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
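
A hedged sketch of the distributed cache route (jar names and HDFS paths are
made up); the same effect is available from the command line via -libjars when
the driver goes through ToolRunner/GenericOptionsParser:

// Upload the dependency jars to HDFS once, then reference them per job.
DistributedCache.addFileToClassPath(new Path("/libs/hibernate-core.jar"), conf);
DistributedCache.addFileToClassPath(new Path("/libs/dom4j.jar"), conf);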

> If I specify but one reducer, which node in the cluster will the reducer
> run on?

The scheduling is done by the JobTracker and it isn't possible to
control the location of the reducers.

-- Owen


Re: Hadoop Oppurtunity

2012-02-19 Thread Owen O'Malley
The Hadoop PMC can create a Hadoop ecosystem specific job list if we want one. 
Would people find it useful?

-- Owen

On Feb 19, 2012, at 3:20 AM, Harsh J  wrote:

> Job-related mails must always go to the dedicated j...@apache.org
> mailing list. For more information, see
> http://www.apachenews.org/archives/000465.html
> 
> On Sun, Feb 19, 2012 at 12:36 PM, real great..
>  wrote:
>> Could we actually create a separate mailing list for Hadoop related jobs?
>> 
>> On Sun, Feb 19, 2012 at 11:40 AM, larry  wrote:
>> 
>>> Hi:
>>> 
>>> We are looking for someone to help install and support hadoop clusters.
>>>  We are in Southern California.
>>> 
>>> Thanks,
>>> 
>>> Larry Lesser
>>> PSSC Labs
>>> (949) 380-7288 Tel.
>>> la...@pssclabs.com
>>> 20432 North Sea Circle
>>> Lake Forest, CA 92630
>>> 
>>> 
>> 
>> 
>> --
>> Regards,
>> R.V.
> 
> 
> 
> -- 
> Harsh J
> Customer Ops. Engineer
> Cloudera | http://tiny.cloudera.com/about


Re: Hadoop Example in java

2012-02-17 Thread Owen O'Malley
On Fri, Feb 17, 2012 at 1:00 AM, vikas jain  wrote:
>
> Hi All,
>
> I am looking for examples in Java for Hadoop. I have done lots of searching
> but I have only found word count. Are there any other examples?

If you want to find them on the web, you can look in subversion:

http://svn.apache.org/repos/asf/hadoop/common/branches/branch-1/src/examples/org/apache/hadoop/examples/

-- Owen


Re: Sorting text data

2012-02-08 Thread Owen O'Malley
On Wed, Feb 8, 2012 at 5:59 AM, sangroya  wrote:
> Hi,
>
> I tried to run the sort example by specifying the input format. But I got
> the following error, while running it.

You actually need a different mapper to make the whole thing work. I
made a patch for Sort.java that should do the trick.

https://gist.github.com/1770850

Just run the sort with -text and it will set the input format, output
format, key type, value type, and also set the  mapper that I added so
that you move the line to the key instead of the value.

-- Owen


Re: Regarding security in hadoop

2012-01-30 Thread Owen O'Malley
On Mon, Jan 30, 2012 at 12:45 AM, renuka  wrote:
>
>
> Hi All,
>
> As per the below link, the security feature (strong authentication via the
> Kerberos authentication protocol) is added in the hadoop 1.0.0 release.
> http://www.infoq.com/news/2012/01/apache-hadoop-1.0.0

Actually, it was first released in the 0.20.203.0 release.

>
> But we didnt find any documentation related to this in 1.0.0 documentation.

It would be great if someone setup a nice how-to page. You can use the
old instructions from here:

http://yahoo.github.com/hadoop-common/installing.html

You can also get some of the motivation for security over here:

http://hortonworks.com/category/apache-hadoop/hadoop-security/

-- Owen


Re: Automate Hadoop installation

2011-12-07 Thread Owen O'Malley
On Mon, Dec 5, 2011 at 2:32 AM, praveenesh kumar  wrote:
> Hi all,
>
> Can anyone guide me how to automate the hadoop installation/configuration
> process?

We are rapidly making progress on Ambari. Ambari is an Apache project
that will deploy, configure, and administer Hadoop clusters with all
of the related tools (Hadoop, Hbase, Pig, Hive, Zookeeper, etc). We
will have a CLI, REST, and Web UI interfaces.

Please come check out the project and come join us if you are
interested in helping build it:

http://incubator.apache.org/ambari/

-- Owen


Re: Authentication

2011-11-18 Thread Owen O'Malley
On Fri, Nov 18, 2011 at 6:52 AM, Jignesh Patel  wrote:
>
> Harsh,
> Does that mean to implement authentication we need to have oozie jars with
> hadoop jars?

To be clear, all of the functionality is in Hadoop. The user "oozie"
was used as an example and we should probably change the example to
look more like:


<property>
  <name>hadoop.proxyuser.myproxy.groups</name>
  <value>group1,group2</value>
</property>

<property>
  <name>hadoop.proxyuser.myproxy.hosts</name>
  <value>host1,host2</value>
</property>

Which enables the "myproxy" user to impersonate "group1" or "group2"
when working on "host1" or "host2".

-- Owen


Re: source code of hadoop 0.20.2

2011-11-15 Thread Owen O'Malley
On Tue, Nov 15, 2011 at 5:23 AM, Uma Maheswara Rao G
wrote:

> http://svn.apache.org/repos/asf/hadoop/common/branches/
> all branches code will be under this.
>  You can choose required one.


Actually, you are looking for the tag:

http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.2/

You can either use subversion to check out the whole directory or browse
using http.

-- Owen


Re: Hadoop MapReduce Poster

2011-10-31 Thread Owen O'Malley
On Mon, Oct 31, 2011 at 6:14 AM, Mathias Herberts <
mathias.herbe...@gmail.com> wrote:

> Hi,
>
> I'm in the process of putting together a 'Hadoop MapReduce Poster' so
> my students can better understand the various steps of a MapReduce job
> as ran by Hadoop.


Most of it is probably beneath the radar, but if you want the details of
how the sort actually works in MapReduce, I'd suggest going through Chris
Douglas' presentation on it.

http://www.slideshare.net/hadoopusergroup/ordered-record-collection?from=ss_embed

At the very least, you want to show the serialization before the sort in
the Mapper and deserialization in the Reducer, which gives you a good
platform to talk about why you need to define RawComparators for your key
types if you want reasonable performance out of the sort.
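
As a concrete (hypothetical) illustration of what defining a RawComparator
means: assume a key class MyKey whose first serialized field is a 4-byte int;
a byte-level comparator then avoids deserializing keys during the sort.

public static class MyKeyComparator extends WritableComparator {
  protected MyKeyComparator() {
    super(MyKey.class, true);      // MyKey is a hypothetical WritableComparable
  }
  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    int left = readInt(b1, s1);    // first field of the serialized key
    int right = readInt(b2, s2);
    return left < right ? -1 : (left == right ? 0 : 1);
  }
}
// registered via WritableComparator.define(MyKey.class, new MyKeyComparator())
// or, in the new API, job.setSortComparatorClass(MyKeyComparator.class)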

 -- Owen


Re: Combiners

2011-10-31 Thread Owen O'Malley
On Mon, Oct 31, 2011 at 5:41 AM, Mathias Herberts <
mathias.herbe...@gmail.com> wrote:

> Thanks for listing the 5 requirements, if you don't mind I'll add them
> to the Hadoop MapReducer Poster.
>

Sure.


Re: Combiners

2011-10-30 Thread Owen O'Malley
On Sat, Oct 29, 2011 at 3:52 AM, Mathias Herberts <
mathias.herbe...@gmail.com> wrote:

> My question is, what happens if the combiner outputs different keys
> than what it is being fed? The output of the combiner will suffer two
> flaws:
>
> 1. It won't be sorted
> 2. It might end up in the wrong partition
>

Yes. We've talked about adding various checks, but I don't think anyone has
added them. We obviously have the input key and one option would be to
ignore the output key.


> Since a Combiner is simply a Reducer with no other constraints,
>

That isn't true. Combiners are required to be:
  1. Idempotent - The number of times the combiner is applied can't change
the output
  2. Transitive - The order of the inputs can't change the output
  3. Side-effect free - Combiners can't have side effects (or they won't be
idempotent).
  4. Preserve the sort order - They can't change the keys to disrupt the
sort order
  5. Preserve the partitioning - They can't change the keys to change the
partitioning

All 5 of them are required for combiners.
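
For readers following along, the classic word-count sum is a combiner that
meets all five constraints (it emits the same key, and addition doesn't care
about ordering or repeated partial aggregation); a sketch in the new 0.20 API:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable sum = new IntWritable();

  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int total = 0;
    for (IntWritable c : counts) {
      total += c.get();
    }
    sum.set(total);
    context.write(word, sum);   // same key out: sort order and partition preserved
  }
}
// wired up with job.setCombinerClass(SumCombiner.class)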

-- Owen


Re: Sudoku Example Program Inputs

2011-10-18 Thread Owen O'Malley
On Tue, Oct 18, 2011 at 3:20 PM, Owen O'Malley  wrote:

>
>
> On Tue, Oct 18, 2011 at 1:23 PM, Adam  wrote:
>
>> Does anyone know the syntax for the sudoku example program input and if I
>> can find some datasets for it?
>>
>
> There is an example puzzle at: 
> puzzle1.dta<http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security205/src/examples/org/apache/hadoop/examples/dancing/puzzle1.dta>
>

Oops, the link got messed up.
puzzle1.dta<http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205/src/examples/org/apache/hadoop/examples/dancing/puzzle1.dta>


Re: Sudoku Example Program Inputs

2011-10-18 Thread Owen O'Malley
On Tue, Oct 18, 2011 at 1:23 PM, Adam  wrote:

> Does anyone know the syntax for the sudoku example program input and if I
> can find some datasets for it?
>

There is an example puzzle at:
puzzle1.dta

Roughly, each cell can be either a number or '?' with spaces as separators.
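
Purely as an illustration of that layout (this is not the shipped puzzle1.dta,
and no claim is made that it has a unique solution), a 9x9 puzzle file would
look like:

8 5 ? ? ? 2 4 ? ?
7 2 ? ? ? ? ? ? 9
? ? 4 ? ? ? ? ? ?
? ? ? 1 ? 7 ? ? 2
3 ? 5 ? ? ? 9 ? ?
? 4 ? ? ? ? ? ? ?
? ? ? ? 8 ? ? 7 ?
? 1 7 ? ? ? ? ? ?
? ? ? ? 3 6 ? 4 ?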

-- Owen


Re: Disable Sorting?

2011-09-10 Thread Owen O'Malley
On Sat, Sep 10, 2011 at 12:33 PM, Meng Mao  wrote:

> Is there a way to collate the possibly large number of map output files,
> though?


You can make fewer mappers by setting the mapred.min.split.size to define
the smallest input that will be given to a mapper.

There isn't currently a way of getting a collated, but unsorted list of
key/value pairs. For most applications, the in memory sort is fairly cheap
relative to the shuffle and other parts of the processing.

-- Owen


Re: Binary content

2011-09-01 Thread Owen O'Malley
On Thu, Sep 1, 2011 at 8:37 AM, Mohit Anchlia wrote:

> Thanks! Is there a specific tutorial I can focus on to see how it could be
> done?
>

Take the word count example and change its output format to be
SequenceFileOutputFormat.

job.setOutputFormatClass(SequenceFileOutputFormat.class);

and it will generate SequenceFiles instead of text. There is
SequenceFileInputFormat for reading.

-- Owen


Re: Skipping Bad Records in M/R Job

2011-08-09 Thread Owen O'Malley
On Tue, Aug 9, 2011 at 5:28 PM, Maheshwaran Janarthanan <
ashwinwa...@hotmail.com> wrote:

>
> Hi,
>
> I have written a Map reduce job which uses third party libraries to process
> unseen data which makes job fail because of errors in records.
>
> I realized 'Skipping Bad Records' feature in Hadoop Map/Reduce. Can Anyone
> send me the code snippet which enables this feature by setting properties on
> JobConf
>

I wouldn't recommend using the bad record skipping, since it was always
experimental and I don't think it has been well maintained.

If your 3rd party library crashes the JVM, I'd suggest using a subprocess to
call it and handle the errors yourself.

-- Owen


Re: Is it ok to manually delete ~hadoop/mapred/local/taskTracker/archive/*

2011-08-09 Thread Owen O'Malley
On Tue, Aug 9, 2011 at 8:34 AM, Robert J Berger  wrote:

> Looks like I have something not configured particularly well so that
> mapred/local/taskTracker/archive is a local filesystem and its filling
> things up.
>

Configure the size of the distributed cache on each node using
local.cache.size, which defaults to 10gb.
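
For example, in mapred-site.xml on each tasktracker (value in bytes; the
10737418240 shown is just the 10gb default spelled out):

<property>
  <name>local.cache.size</name>
  <value>10737418240</value>
</property>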


> Is it ok to delete mapred/local/taskTracker/archive/* at the Unix level? Or
> is there some other way to force that to be deleted.


If you restart the task tracker, I believe it will delete it. You shouldn't
delete it behind the scenes, because you'll cause failures for any running
tasks and confuse the task tracker with what it has stored.


> I can't really restart my hadoop cluster just to fix this right n ow. I'm
> running hadoop 0.20.1.


I'd highly recommend you upgrade to 0.20.203.0.

-- Owen


Re: Which release to use?

2011-07-15 Thread Owen O'Malley

On Jul 15, 2011, at 7:58 AM, Michael Segel wrote:

> So while you can use the Apache release, it may not make sense for your 
> organization to do so. (Said as I don the flame retardant suit...)

I obviously disagree. *grin* Apache Hadoop 0.20.203.0 is the most stable and 
well tested release and has been deployed on Yahoo's 40,000 Hadoop machines in 
clusters of up to 4,500 machines and has been used extensively for running 
production workloads. We are actively working to make the install and
deployment of Apache Hadoop easier.

In terms of commercial support, HortonWorks is absolutely supporting the Apache 
releases. IBM is also supporting the Apache releases:

http://davidmenninger.ventanaresearch.com/2011/05/18/ibm-chooses-hadoop-unity-not-shipping-the-elephant/

So lack of commercial support isn't a problem...

-- Owen

Re: Which release to use?

2011-07-14 Thread Owen O'Malley

On Jul 14, 2011, at 4:33 PM, Teruhiko Kurosaka wrote:

> I'm a newbie and I am confused by the Hadoop releases.
> I thought 0.21.0 is the latest & greatest release that I
> should be using but I noticed 0.20.203 has been released
> lately, and 0.21.X is marked "unstable, unsupported".
> 
> Should I be using 0.20.203?

Yes, I apologize for confusing release numbering, but the best release to use 
is 0.20.203.0. It includes security, job limits, and many other improvements 
over 0.20.2 and 0.21.0. Unfortunately, it doesn't have the new sync support so 
it isn't suitable for using with HBase. Most large clusters use a separate 
version of HDFS for HBase.

-- Owen



Re: Can Mapper get paths of inputSplits ?

2011-05-13 Thread Owen O'Malley
On Thu, May 12, 2011 at 10:16 PM, Mark question  wrote:

>   Who's filling the map.input.file and map.input.offset (ie. which class)
> so I can extend it to have a function to return these strings.


MapTask.updateJobWithSplit is the method doing the work.

-- Owen


Re: Can Mapper get paths of inputSplits ?

2011-05-12 Thread Owen O'Malley
On Thu, May 12, 2011 at 9:23 PM, Mark question  wrote:

>  So there is no way I can see the other possible splits (start+length)?
> like
> some function that returns strings of map.input.file and map.input.offset
> of
> the other mappers ?
>

No, there isn't any way to do it using the public API.

The only way would be to look under the covers and read the split file
(job.split).

-- Owen


Re: Can Mapper get paths of inputSplits ?

2011-05-12 Thread Owen O'Malley
On Thu, May 12, 2011 at 8:59 PM, Mark question  wrote:

> Hi
>
>   I'm using FileInputFormat which will split files logically according to
> their sizes into splits. Can the mapper get a pointer to these splits? and
> know which split it is assigned ?
>

Look at
http://hadoop.apache.org/common/docs/r0.20.203.0/mapred_tutorial.html#Task+JVM+Reuse

 In particular, map.input.file and map.input.offset are the configuration
parameters that you want.

-- Owen


Re: Stable Release

2011-04-29 Thread Owen O'Malley
On Thu, Apr 28, 2011 at 12:28 PM, Juan P.  wrote:

> Hi guys,
> I wanted to know exactly which was the latest stable release of Hadoop.


0.20.2 is the current stable release. I actually rolled a 0.20.3 release
candidate, but didn't call a vote on it since 0.20.203.0 will
quickly supersede it. I've created a release candidate for 0.20.203.0 that
is currently being voted on. The 0.20.203.0 release has all of security and
many many improvements. It is currently in production on Yahoo's 4500
machine production clusters. The 0.20.2xx branch will be maintained for
the foreseeable future.

0.21.0 was never tested at scale, does not contain security, and is
unsupported.

0.22.0 is not ready for release yet.

-- Owen


Re: Applications creates bigger output than input?

2011-04-29 Thread Owen O'Malley
On Fri, Apr 29, 2011 at 5:02 AM, elton sky  wrote:

> For my benchmark purpose, I am looking for some non-trivial, real life
> applications which creates *bigger* output than its input. Trivial example
> I
> can think about is cross join...
>

As you say, almost all cross join jobs have that property. The other case
that almost always fits into that category is generating an index. For
example, if your input is a corpus of documents and you want to generate the
list of documents that contain each word, the output (and especially the
shuffle data) is much larger than the input.

-- Owen


Re: Does it mean that single disk failure causes the whole datanode to fail?

2011-04-26 Thread Owen O'Malley
On Tue, Apr 26, 2011 at 6:46 AM, Xiaobo Gu  wrote:

> How can I download the patched version of hadoop, I only know the
> initial versions of each release from the official download website.


The 0.20.204 version is still being tested. I'd expect a release next month.
You can look at the sources at:

http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-204

-- Owen


Re: Sequence.Sorter Performance

2011-04-25 Thread Owen O'Malley
The SequenceFile sorter is ok. It used to be the sort used in the shuffle.
*grin*

Make sure to set io.sort.factor and io.sort.mb to appropriate values for
your hardware. I'd usually use io.sort.factor as 25 * drives and io.sort.mb
is the amount of memory you can allocate to the sorting.

-- Owen


Re: Does it mean that single disk failure causes the whole datanode to fail?

2011-04-25 Thread Owen O'Malley
On Mon, Apr 25, 2011 at 9:17 AM, Mathias Herberts <
mathias.herbe...@gmail.com> wrote:

> You can configure how many failed volumes a datanode can tolerate.


 That code doesn't handle the corner cases very well. In particular, we've
had problems with nodes with bad drives causing problems when they are
restarted. A more stable solution has been committed to the
branch-0.20-security and will come out as part of 0.20.204.0.

-- Owen


Re: Hadoop client jar dependencies

2011-03-01 Thread Owen O'Malley
On Tue, Mar 1, 2011 at 2:33 AM, Bryan Keller  wrote:

> I am writing an application that submits job jar files to the job tracker.
> The application writes some files to HDFS among other things before
> triggering the job. I am using the hadoop-core library in the Maven central
> repository. Unfortunately this library has several dependencies that I don't
> believe I need for a client application, such as Jasper, Jetty, and such. Is
> there a list of the jar files needed for developing Hadoop client
> applications? Or alternatively, a list of jar files only needed when running
> the Hadoop server processes?


We should filter the dependencies. In theory to get the client jar
dependencies, you should use:

% hadoop classpath

unfortunately, currently that returns both the server and client jars.

-- Owen


Slides and videos from Feb 2011 Bay Area HUG posted

2011-02-24 Thread Owen O'Malley
The February 2011 Bay Area HUG had a record turn out with 336 people  
signed up. We had two great talks:


* The next generation of Hadoop MapReduce by Arun Murthy
* The next generation of Hadoop Operations at Facebook by Andrew Ryan

The videos and slides are posted on Yahoo's blog:
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/hug-feb-2011-recap/

-- Owen

Re: MRUnit and Herriot

2011-02-02 Thread Owen O'Malley
Please keep user questions off of general and use the user lists instead.
This is defined here.

MRUnit is for testing user's MapReduce applications. Herriot is for testing
the framework in the presence of failures.

-- Owen

On Wed, Feb 2, 2011 at 5:44 AM, Edson Ramiro  wrote:

> Hi all,
>
> Plz, could you explain me the difference between MRUnit and Herriot?
>
> I've read the documentation of both and they seem very similar to me.
>
> Is Herriot an evolution of MRUnit?
>
> What can Herriot do that MRUnit can't?
>
> Thanks in Advance
>
> --
> Edson Ramiro Lucas Filho
> {skype, twitter, gtalk}: erlfilho
> http://www.inf.ufpr.br/erlf07/
>


Re: Problem write on HDFS

2011-01-26 Thread Owen O'Malley
Please direct user questions to common-user@hadoop.apache.org.

-- Owen

On Tue, Jan 25, 2011 at 3:27 AM, Alessandro Binhara wrote:

> I build a servlet with a hadoop...
> i think that tomcat enviroment will be find a hadoop-core-0.20.2.jar .. but
> a get a same error
>
> *Type* Exception report
>
> *Message*
>
> *Description* The server encountered an internal error () that prevented
> it from fulfilling this request.
>
> *exception*
>
> javax.servlet.ServletException: Servlet execution threw an exception
>
> *root cause*
>
> java.lang.NoClassDefFoundError: Could not initialize class
> org.apache.hadoop.conf.Configuration
> HadoopWriterLib.HadoopWriter.OpenFileSystem(HadoopWriter.java:22)
>HadoopWriterLib.HadoopWriter.(HadoopWriter.java:16)
>HadoopServletTest.doGet(HadoopServletTest.java:35)
>javax.servlet.http.HttpServlet.service(HttpServlet.java:621)
>javax.servlet.http.HttpServlet.service(HttpServlet.java:722)
>
> *note* *The full stack trace of the root cause is available in the Apache
> Tomcat/7.0.6 logs.*
>
> The error can be a problem on my ubuntu server ?
>
> thanks
>
> On Mon, Jan 24, 2011 at 1:01 PM, Alessandro Binhara  >wrote:
>
> > ..i try
> >  java -classpath hadoop-core-0.20.1.jar -jar HahoopHdfsHello.jar
> >
> > i got a same error..
> > i will try build a servlet and run on tomcat...
> > i try many issues to config a classpath... all fail..
> >
> > thanks
> >
> >
> > On Mon, Jan 24, 2011 at 12:54 PM, Harsh J 
> wrote:
> >
> >> The issue would definitely lie with your CLASSPATH.
> >>
> >> Ideally, while beginning development using Hadoop 0.20, it is better
> >> to use the `hadoop jar` command to launch jars of any kind that
> >> require Hadoop libraries; be it MapReduce or not. The command will
> >> ensure that all the classpath requirements for Hadoop-side libraries
> >> are satisfied, so you don't have to worry.
> >>
> >> Anyhow, try launching it this way:
> >> $ java -classpath hadoop-0.20.2-core.jar -jar HadoopHdfsHello.jar; #
> >> This should run just fine.
> >>
> >> On Mon, Jan 24, 2011 at 5:06 PM, Alessandro Binhara 
> >> wrote:
> >> > Hello ..
> >> >
> >> > i solve problem in jar..
> >> > i put a hadoop-core-0.20.2.jar   in same jar dir.
> >> >
> >> > i configure a class path
> >> > export CLASSPATH=.:$JAVA_HOME
> >> >
> >> > i got this erro in shell
> >> >
> >> > root:~# java -jar HahoopHdfsHello.jar
> >> > Exception in thread "main" java.lang.NoClassDefFoundError:
> >> > org/apache/hadoop/conf/Configuration
> >> >at HadooHdfsHello.main(HadooHdfsHello.java:18)
> >> > Caused by: java.lang.ClassNotFoundException:
> >> > org.apache.hadoop.conf.Configuration
> >> >at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> >> >at java.security.AccessController.doPrivileged(Native Method)
> >> >at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> >> >at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
> >> >at
> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> >> >at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
> >> >... 1 more
> >> >
> >> >
> >> > What is the problem?
> >> >
> >> > thanks
> >> >
> >>
> >>
> >>
> >> --
> >> Harsh J
> >> www.harshj.com
> >>
> >
> >
>


Re: Why Hadoop uses HTTP for file transmission between Map and Reduce?

2011-01-13 Thread Owen O'Malley
At some point, we'll replace Jetty in the shuffle, because it imposes too
much overhead and go to Netty or some other lower level library. I don't
think that using HTTP adds that much overhead although it would be
interesting to measure that.

-- Owen


Re: SequenceFiles and streaming or hdfs thrift api

2011-01-04 Thread Owen O'Malley
On Tue, Jan 4, 2011 at 10:02 AM, Marc Sturlese wrote:

> The thing is I want this file to be a SequenceFile, where the key should be
> a Text and the value a Thrift serialized object. Is it possible to reach
> that goal?
>

I've done the work to support that in Java. See my patch in HADOOP-6685. It
also adds seamless support for ProtocolBuffers and Avro in SequenceFiles
with arbitrary combinations of keys and values using different
serializations.

-- Owen


Re: Caution using Hadoop 0.21

2010-11-15 Thread Owen O'Malley
I'm very sorry that you got burned by the change. Most MapReduce
applications don't extend the Context classes since those are objects that
are provided by the framework. In 0.21, we've marked which interfaces are
stable and which are still evolving. We try and hold all of the interfaces
stable, but evolving ones do change as we figure out what they should look
like.

Can I ask why you were extending the Context classes?

-- Owen


Re: Problem with custom WritableComparable

2010-11-12 Thread Owen O'Malley
On Thu, Nov 11, 2010 at 4:29 PM, Aaron Baff  wrote:

> I'm having a problem with a custom WritableComparable that I created to use
> as a Key object. I basically have a number of identifier's with a timestamp,
> and I'm wanting to group the Identifier's together in the reducer, and order
> the records by the timestamp (oldest to newest)


The reduce is called for each distinct key. Fortunately, there is an option
to get different grouping going into the reduce called the "grouping"
comparator. Look at the SecondarySort example for how to do it. Also note
that your partitioner needs to make sure that the partition is only picked
based on the primary key. (This can be achieved by making the hashCode only
depend on the primary key, if you use the HashPartitioner.)
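
A sketch of that wiring in the new API (the three classes named here are
placeholders for the poster's identifier+timestamp composite key):

job.setPartitionerClass(IdPartitioner.class);                // partition on identifier only
job.setSortComparatorClass(IdThenTimestampComparator.class); // order by identifier, then timestamp
job.setGroupingComparatorClass(IdGroupingComparator.class);  // one reduce() call per identifier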

-- Owen


Re: Prime number of reduces vs. linear hash function

2010-10-27 Thread Owen O'Malley
Prime numbers only matter if the hash function is bad and you are using a
hash partitioner. In most cases, the hashes are fine and thus the number of
reduces can be dictated by the desired degree of parallelism.

-- Owen


Re: BUG: Anyone use block size more than 2GB before?

2010-10-21 Thread Owen O'Malley
The block sizes were 2G. The input format made splits that were more than a
block because that led to better performance.

-- Owen


Re: BUG: Anyone use block size more than 2GB before?

2010-10-18 Thread Owen O'Malley
Block sizes larger than 2**31 are known to not work. I haven't ever  
tracked down the problem, just set my block size to be smaller than  
that.


-- Owen


Re: Hadoop - Solaris

2010-10-17 Thread Owen O'Malley

On Oct 16, 2010, at 1:08 PM, Bruce Williams wrote:

If anyone with experience with Hadoop and Solaris can contact me off  
list,

even to just say I am doing it and it is OK it would be appreciated.


LinkedIn is currently running Hadoop on Solaris. Hopefully, Allen  
Wittenauer can get back to you on some hints.


-- Owen


Re: Architecture

2010-10-13 Thread Owen O'Malley
Here is a presentation from Hadoop Summit 2009 "HBase goes Realtime"
that gives numbers for latency with HBase. Redirecting to common-user.

http://bit.ly/aJEwYj

-- Owen


Re: Why hadoop is written in java?

2010-10-10 Thread Owen O'Malley
The real answer is that Hadoop was written originally to support Nutch, which 
is in Java. Java has mostly served us well: it is reliable, has extremely powerful
libraries, and is far easier to debug than C++. There are issues of
course... Java's interface to the OS is very weak, object memory overhead is 
high, and program startup is very slow. 

-- Owen

On Oct 9, 2010, at 21:40, elton sky  wrote:

> I always have this question but couldn't find proper answer for this. For
> system level applications, c/c++ is preferable. But why this one using java?


Re: Generating an Index for sequence files

2010-10-02 Thread Owen O'Malley
On Sat, Oct 2, 2010 at 5:25 AM, Harsh J  wrote:
> Maybe you should take a look at the TFile classes?

The TFiles give you the meta information you want including row counts
and an index that is integrated with the compression. The only
downside is that you'll need to handle the serialization yourself,
because TFiles only handle binary data. I'm working on a patch that
include OFiles, which are TFiles that include serialization. The patch
also includes support for Writables, Avro, ProtocolBuffers, and Thrift
in SequenceFiles, MapFiles, and OFiles. (See
https://issues.apache.org/jira/browse/HADOOP-6685 .)

MapFiles are SequenceFiles with an index. (They are actually implemented
as two SequenceFiles, one as the index (key and position) and one as the
data (key and value).) MapFiles don't record the number of rows.

-- Owen


Re: Hive Configuration

2010-09-28 Thread Owen O'Malley


On Sep 28, 2010, at 2:18 PM, Matt Tanquary wrote:


How do I change the port that it tries to connect to?


Please move this discussion over to hive-u...@hadoop.apache.org.

Thanks,
   Owen


Re: do you need to call super in Mapper.Context.setup()?

2010-09-17 Thread Owen O'Malley


On Sep 17, 2010, at 7:29 AM, David Rosenstrauch wrote:


On 09/16/2010 11:38 PM, Mark Kerzner wrote:

Hi,

any need for this,

protected void setup(Mapper.Context context) throws IOException,
    InterruptedException {
  super.setup(context); // TODO - does this need to be done?
  this.context = context;
}

Thank you,
Mark


"Use the source Luke".

If you take a look through the source code, you'll see the answer is  
no.


It is generally a good practice to call the super method, but it  
doesn't do anything currently and in practice that is unlikely to  
change.


-- Owen


Re: changing SequenceFile format

2010-09-14 Thread Owen O'Malley


On Sep 13, 2010, at 9:19 PM, Matthew John wrote:

To sum it up, I should be writing InputFormat , OutputFormat where I  
will be
defining my RecordReader/Writer and InputSplits. Now, why cant I use  
the
FpMetadata and FpMetaId I implemented as the value and key classes.  
Would

not that solve a lot of problem since I have defined in.readfields and
out.write there itself.


You could, it just isn't very reusable. If you use BytesWritable, it  
is easy to make the input format parameterable to handle different  
size keys and values. It would work either way...


-- Owen


Re: changing SequenceFile format

2010-09-13 Thread Owen O'Malley


On Sep 13, 2010, at 12:11 PM, Matthew John wrote:

The terasort input you have implemented is text type. And the input  
is line
format where as I am dealing with sequence binary file. For my  
requirement I
have created two writable implementables for the key and value  
respectively


I would just use BytesWritable directly. The reader/writer should  
insist on the fixed lengths, not the types. The only restriction is  
that you can't use the BytesWritable readFields and write methods.  
You'll need to implement them in the file reader and writer.


I assume I should also implement a inputformat and outputformat  
along with

these. But I am not able to figure out how to provide the respective
filesplit and recordreader/writer.


To implement InputFormat, you'll need to implement getSplits and  
createRecordReader. You'll need to create a RecordReader class that  
understands your file's reader class. Once you implement an  
InputFormat, just set the class as the InputFormat for your job.


-- Owen


Re: changing SequenceFile format

2010-09-13 Thread Owen O'Malley


On Sep 13, 2010, at 2:15 AM, Matthew John wrote:


Hi guys,

I wanted to take in file with input :  
..
binary sequence file (key and value length are constant) as input  
for the
Sort (examples) . But as I understand the data in a standard  
Sequencefile of
hadoop is in the format :  
. . Where

should I modify the code so as to use my inputfile as input to the
recordreader.


Instead of modifying SequenceFile, I'd suggest that you create a new  
FixedRecordFile that has a fixed width for keys and values. In the  
terasort example in MapReduce I create an InputFormat that has 10 byte  
keys and 90 byte values with no markers.


See http://bit.ly/9RybHw .

The terasort example's InputFormat also does sampling, which you  
probably don't need. You will need to pay attention to the getSplits  
to ensure that you cut on record boundaries.


-- Owen


Re: Sorting Numbers using mapreduce

2010-09-05 Thread Owen O'Malley
The critical item is that your map's output key should be IntWritable
instead of Text. The default comparator for IntWritable will give you
properly sorted numbers. If you stringify the numbers and output them
as text, they'll get sorted as strings.
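
A small sketch of such a map (new 0.20 API, assuming one integer per input
line; the class name is arbitrary):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ParseIntMapper
    extends Mapper<LongWritable, Text, IntWritable, NullWritable> {
  private final IntWritable number = new IntWritable();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    number.set(Integer.parseInt(line.toString().trim()));
    context.write(number, NullWritable.get());  // IntWritable keys sort numerically
  }
}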

-- Owen


Re: Do I need to write a RawComparator if my custom writable is not used as a Key?

2010-09-02 Thread Owen O'Malley
No, RawComparator is only needed for Keys. 

-- Owen

On Sep 2, 2010, at 3:35, Vitaliy Semochkin  wrote:

> Hello,
> 
> Do I need to write a  RawComparator if my custom writable is not used
> as a Key to improve performance?
> 
> Regards,
> Vitaliy S


Re: Job performance issue: output.collect()

2010-09-01 Thread Owen O'Malley


On Sep 1, 2010, at 5:18 AM, Oded Rosen wrote:

I would like to know what happens in the output.collect line that  
takes lots

of time, in order to cut down this job's running time.
Please keep in mind that I have a combiner, and to my understanding
different things happen to the map output when a combiner is present.


The best presentation on the map side sort is the one that Chris  
Douglas (who did most of the implementation) did for the Bay Area HUG.


http://developer.yahoo.net/blogs/hadoop/2010/01/hadoop_bay_area_january_2010_u.html

There are both slides and a video of the presentation. I'd run through  
that first.


You most likely are getting more spills than you deserve. The  
variables to look at:


io.sort.mb - should be most of the task's ram budget
io.sort.record.percent - depends on record size
io.sort.factor - typically 25 * (# of disks / node)
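
As an illustrative starting point only (the right numbers depend on the task
heap and record size, per the guidance above), for a task with roughly 1 GB of
heap on a 4-disk node:

io.sort.mb = 400
io.sort.record.percent = 0.15
io.sort.factor = 100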

-- Owen


Re: api doc incomplete

2010-09-01 Thread Owen O'Malley


On Sep 1, 2010, at 8:56 AM, Gang Luo wrote:


Hi all,
does anybody notice the online api doc is incomplete? At
http://hadoop.apache.org/common/docs/current/api/ there is even no  
mapred or

mapreduce package there. I remember I use it well before. What happen?


When {common,hdfs,mapreduce}-0.21.0 was released, it became current.  
Since the project split happened between 0.20 to 0.21, that means the  
"current" docs are now split. If you look at http://hadoop.apache.org/mapreduce/docs/current/api 
, you'll find what you are looking for. Additionally, we should make a  
stable link that points to the latest of the 0.20 line.


-- Owen


Re: Combining Only Once?

2010-08-31 Thread Owen O'Malley
There used to be a compatibility switch, but I believe it was removed
in 0.19 or 0.20.

Can you describe what you are trying to accomplish? Combiners were
always intended to only be used for  operations that are idempotent,
associative, and commutative. Clearly your combiner doesn't satisfy
one of those properties or you wouldn't care if it was applied more
than once.

-- Owen


Re: Job in 0.21

2010-08-29 Thread Owen O'Malley
On Sun, Aug 29, 2010 at 4:39 PM, Mark  wrote:
>  How should I be creating a new Job instance in 0.21. It looks like
> Job(Configuration conf, String jobName) has been deprecated.

Go ahead and use that method. I have a jira open to undeprecate it.

-- Owen


Re: Command line arguments

2010-08-29 Thread Owen O'Malley
You would need to save the arguments into the Configuration (aka
JobConf) that you create your job with.

-- Owen


Re: Ivy

2010-08-27 Thread Owen O'Malley


On Aug 27, 2010, at 8:04 AM, Mark wrote:


Is there a public ivy repo that has the latest hadoop? Thanks


The hadoop jars and poms should be pushed into the central Maven  
repositories, which Ivy uses.


-- Owen


Re: svn/git revisions for 0.20.2

2010-08-25 Thread Owen O'Malley


On Aug 25, 2010, at 3:20 PM, Johannes Zillmann wrote:


Hey folks,

can somebody tell me how to get the source versions from git/svn for  
hadoop-hdfs and hadoop-mapreduce ?
In hadoop-common there are branches and tags for the release. But  
how to get the corresponding version of the other 2 projects ?


0.20 was pre-project split, so common included hdfs and mapreduce.

-- Owen


Re: Hadoop sorting algorithm on equal keys

2010-08-24 Thread Owen O'Malley


On Aug 24, 2010, at 2:21 AM, Teodor Macicas wrote:


Hello,

Let's say that we have two maps outputs which will be sorted before  
the reducer will start. Doesn't matter what {a,b0,b1,c} mean, but  
let's assume that b0=b1.

Map output1 : a, b0
Map output2:  c, b1
In this case we can have 2 different sets of sorted data:
1. {a,b0,b1,c}  and
2. {a,b1,b0,c}  since b0=b1 .

In my particular problem I want to distingush between b0 and b1.  
Basically, they are numbers but I have extra-info on which my  
comparison will be made.
Now, the question is: how can I change Hadoop default behaviour in  
order to control the sorting algorithm on equal keys ?


You need to extend the keys with the extra information to sort on. To  
get exactly one call to reduce for each logical key, you define a  
grouping comparator that determines when two keys should be distinct  
calls to reduce. Look at the SecondarySort example in MapReduce. http://bit.ly/a9B7hh


-- Owen


Re: Where is Hadoop 20.3?

2010-08-14 Thread Owen O'Malley
I'll probably roll a 0.20.3 in a couple of weeks. 

-- Owen

On Aug 14, 2010, at 5:32, thinke365  wrote:

> 
> 0.20.3 is not released yet, the latest release is 0.21.0rc1
> 
> Pete Tyler wrote:
>> 
>> Apologies for the newbie question but I think I'm a little lost. Hadoop
>> 20.2 came out in Feb 2010 but the fix I'm looking for is in Hadoop 20.3,
>> 
>> 
>> https://issues.apache.org/jira/browse/MAPREDUCE-118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>> 
>> However I can only find the somewhat mature 20.2 version under
>> 'downloads'. Is it possible that 20.3 is not out yet?
>> 
> 
> -- 
> View this message in context: 
> http://old.nabble.com/Where-is-Hadoop-20.3--tp29319909p29436584.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> 


Re: Passing information to Map Reduce

2010-08-13 Thread Owen O'Malley
Use Sequence Files if the objects are Writable. Otherwise, you can use the Java 
serialization. I'm working on a patch to allow Protocol Buffers, Thrift, 
Writables, Java serialization, and Avro in Sequence Files. 

-- Owen

On Aug 13, 2010, at 17:41, Pete Tyler  wrote:

> Distributed cache looks hopeful. However, at first glance it looks good for 
> distributing files but not instance data. Ideally I'm looking for something 
> similar to, say, objects being passed between client and server by RMI.
> 
> -Pete
> 
> On Aug 13, 2010, at 3:15 PM, Owen O'Malley  wrote:
> 
>> 
>> On Aug 13, 2010, at 12:55 PM, Pete Tyler wrote:
>> 
>>> I have only found two options, neither of which I really like,
>>> 1. Encode information in the job name string - a bit hokey and limited to 
>>> strings
>> 
>> I'd state this as encode the information into a string and add it to the 
>> JobConf. Look at the Base64 class if you want to uuencode your data. This is 
>> easiest, but causes problems if the JobConf gets much above 2MB or so.
>> 
>>> 2. Persist the information, which changes from job to job - if every map 
>>> task and every reduce task has to read one piece if specific, persisted 
>>> data that may be stored on another node won't this have significant 
>>> performance implications?
>> 
>> This is generally the preferred strategy. In particular, the framework 
>> supports the "distributed cache" which will cause files from HDFS to be 
>> downloaded to each node before the tasks run. The files will only be 
>> downloaded once for each node. Files in the distributed cache can be a 
>> couple GB without huge performance problems.
>> 
>> -- Owen


Re: Passing information to Map Reduce

2010-08-13 Thread Owen O'Malley


On Aug 13, 2010, at 12:55 PM, Pete Tyler wrote:


I have only found two options, neither of which I really like,
1. Encode information in the job name string - a bit hokey and  
limited to strings


I'd state this as encode the information into a string and add it to  
the JobConf. Look at the Base64 class if you want to uuencode your  
data. This is easiest, but causes problems if the JobConf gets much  
above 2MB or so.


2. Persist the information, which changes from job to job - if every  
map task and every reduce task has to read one piece if specific,  
persisted data that may be stored on another node won't this have  
significant performance implications?


This is generally the preferred strategy. In particular, the framework  
supports the "distributed cache" which will cause files from HDFS to  
be downloaded to each node before the tasks run. The files will only  
be downloaded once for each node. Files in the distributed cache can  
be a couple GB without huge performance problems.
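
A minimal sketch of that route with the 0.20 API (the HDFS path is made up and
exception handling is elided):

// In the driver: register the file; it is copied to each node once.
DistributedCache.addCacheFile(new Path("/user/pete/jobdata/params.bin").toUri(), conf);

// In the task's configure()/setup(): locate the local copy and read it.
Path[] localCopies = DistributedCache.getLocalCacheFiles(conf);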


-- Owen


Re: Partitioner in Hadoop 0.20

2010-08-04 Thread Owen O'Malley


On Aug 4, 2010, at 10:58 AM, David Rosenstrauch wrote:

So my partitioner needs to implement Configurable, then not  
JobConfigurable.  Tnx much!


ReflectionUtils.newInstance will use either Configurable or  
JobConfigurable (or both!). So implementing either one will work fine.


-- Owen


Re: Partitioner in Hadoop 0.20

2010-08-04 Thread Owen O'Malley


On Aug 4, 2010, at 8:38 AM, David Rosenstrauch wrote:

Anyone know if there's any particular reason why the new Partitioner  
class doesn't implement JobConfigurable?  (And, if not, whether  
there's any plans to fix this omission?)  We're working on a  
somewhat complex partitioner, and it would be extremely helpful to  
be able to pass it some parms via the jobconf.


The short answer is that it doesn't need to. If you make your  
partitioner either Configured or JobConfigurable, it will be  
configured. The API class doesn't depend on it precisely because it is  
not required for all partitioners.


-- Owen


Re: Set variables in mapper

2010-08-03 Thread Owen O'Malley


On Aug 3, 2010, at 6:12 AM, Erik Test wrote:


Really? This seems pretty nice.

In the future, with your implementation, would the value always have  
to be

wrapped in a MyMapper instance? How would parameters be removed if
necessary?


Sorry, I wasn't clear. I mean that if you make the sub-classes of  
Mapper serializable, the framework will serialize them for you and  
deserialize them on the cluster.


So a fuller example would look like:

public class MyMapper extends Mapper implements Writable {

  int param;

  public MyMapper() { param = 0; }
  public MyMapper(int param) { this.param = param; }

  public void map(IntWritable key, Text value, Context context) {...}

  public void readFields(DataInput in) throws IOException {
    param = in.readInt();
  }

  public void write(DataOutput out) throws IOException {
    out.writeInt(param);
  }
}

You won't need to use Writable, you can use ProtocolBuffers, Thrift,  
or Avro. Where this comes in really handy is places like the  
InputFormats and OutputFormats. It enables you to replace the current:


job.setInputFormatClass(SequenceFileInputFormat.class);
FileInputFormat.setInputPath(job, inDir);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
FileOutputFormat.setOutputPath(job, outDir);

with the more natural:

job.setInputFormat(new SequenceFileInputFormat(inDir));
job.setOutputFormat(new SequenceFileOutputFormat(outDir));

Is that clearer now?

-- Owen


Re: Set variables in mapper

2010-08-02 Thread Owen O'Malley


On Aug 2, 2010, at 9:17 AM, Erik Test wrote:

I'm trying to set a variable in my mapper class by reading an  
argument from
the command line and then passing the entry to the mapper from main.  
Is this

possible?


Others have already answered with the current solution of using  
JobConf to store the value. I should also note that I plan to  
implement MAPREDUCE-1183 for 0.22. It will allow you to do this  
directly like:


job.setMapper(new MyMapper(someIntegerParameter));

which will serialize MyMapper's state, including the integer  
parameter, and store it as part of your job.


-- Owen


Re: It is possible a bug,about BooleanWritable

2010-07-25 Thread Owen O'Malley
It is a bug. It was fixed as part of MAPREDUCE-365. The relevant fix is:

Index: src/java/org/apache/hadoop/io/BooleanWritable.java
===
--- src/java/org/apache/hadoop/io/BooleanWritable.java  (revision 769338)
+++ src/java/org/apache/hadoop/io/BooleanWritable.java  (revision 769339)
@@ -100,9 +100,7 @@

 public int compare(byte[] b1, int s1, int l1,
byte[] b2, int s2, int l2) {
-  boolean a = (readInt(b1, s1) == 1) ? true : false;
-  boolean b = (readInt(b2, s2) == 1) ? true : false;
-  return ((a == b) ? 0 : (a == false) ? -1 : 1);
+  return compareBytes(b1, s1, l1, b2, s2, l2);
 }
   }


-- Owen


Re: WritableComparable question

2010-07-19 Thread Owen O'Malley


On Jul 19, 2010, at 2:15 PM, Raymond Jennings III wrote:

The only way I could fix this was to re-initialize my vectors in the  
"public
void readFields(DataInput in)" method.  This does not seem like I  
should have to

do this or do I ???


Yes, readFields has to clear the data structures. MapReduce reuses  
objects in the loops.


-- Owen


Re: Terasort problem

2010-07-11 Thread Owen O'Malley


On Jul 10, 2010, at 4:29 AM, Tonci Buljan wrote:


mapred.tasktracker.reduce.tasks.maximum <- Is this configured on every
datanode separately? What number shall I put here?

mapred.tasktracker.map.tasks.maximum <- same question  as
mapred.tasktracker.reduce.tasks.maximum


Generally, RAM is the scarce resource. Decide how you want to divide  
your worker's RAM between tasks. So with 6 G of RAM,  I'd probably  
make 4 map slots of 0.75G each and 2 reduce slots of 1.5G each.


mapred.reduce.tasks <- Is this configured ONLY on Namenode and what  
value

should it have for my 8 node cluster?


You should set it to your reduce task capacity of 2 * 8 = 16.


mapred.map.tasks <- same question as mapred.reduce.tasks


It matters less, but go ahead and set it to the map capacity of 4 * 8  
= 32. More important is to set your vm and buffer sizes for the tasks.  
You also want to set your HDFS block size to be 0.5G to 2G. That will  
make your map inputs the right size.
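
Spelled out as configuration, with values that follow directly from the numbers
above (illustrative for this particular 8-node, 6 GB-per-node cluster):

On each worker's mapred-site.xml:
  mapred.tasktracker.map.tasks.maximum = 4
  mapred.tasktracker.reduce.tasks.maximum = 2

In the job configuration:
  mapred.reduce.tasks = 16
  mapred.map.tasks = 32

In hdfs-site.xml (1 GB blocks, inside the 0.5G-2G range suggested above):
  dfs.block.size = 1073741824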


-- Owen





Re: Next Release of Hadoop version number and Kerberos

2010-07-10 Thread Owen O'Malley
On Wed, Jul 7, 2010 at 8:54 AM, Todd Lipcon  wrote:
> On Wed, Jul 7, 2010 at 8:29 AM, Ananth Sarathy
> wrote:
>
> The Security/Kerberos support is a huge project that has been in progress
> for several months, so the implementation spans tens (if not hundreds?) of
> patches. Manually adding these patches to a prior Apache release will take
> days if not weeks of work, is my guess.

Based on a quick check from Yahoo's github
(http://github.com/yahoo/hadoop-common):

Between yahoo 0.20.10 to yahoo 0.20.104.2:

421 commits
combined diff of 8.75 mb
12 person-years worth of work
consists almost exclusively of security work

For a single person, who doesn't know the code it will take months to
apply it to one of the Apache branches.

-- Owen


Re: Terasort problem

2010-07-09 Thread Owen O'Malley
I would guess that you didn't set the number of reducers for the job,
and it defaulted to 2.

-- Owen


Re: Is the sort(in sort and shuffle) always required

2010-06-19 Thread Owen O'Malley
On Sat, Jun 19, 2010 at 9:16 AM, Saptarshi Guha
 wrote:
> My question: is the sort (in the sort and shuffle) absolutely required?
> If I wanted mapreduce to partition (using the map) and then aggregate(using
> reduce) without a need for the keys to be sorted
> is it possible to turn of the sorting? Or is the fact that keys come to the
> reducer in sorted order just a side effect of sorting and that
> the sorting is vital for the efficient operation of MapReduce?

If you have 0 reduces, you don't get any sorting or aggregation. It
isn't possible to turn off the sorting while leaving the aggregation. In
practice, the sort doesn't cost as much as the data transfer between
the map and reduce.

-- Owen


Re: Using wget to download file from HDFS

2010-06-15 Thread Owen O'Malley


On Jun 15, 2010, at 9:30 AM, Jaydeep Ayachit wrote:

Thanks, data node may not be known. Is it possible to direct url to  
namenode and namenode handling streaming by fetching data from  
various data nodes?


If you access the servlet on the NameNode, it will automatically  
redirect you to a data node that has some of the data on it. You  
certainly should not pick a random data node yourself.


Also note that in yahoo 0.20.104 or 0.22, you'll need a Kerberos  
ticket or delegation token to use the servlet.


-- Owen


Re: Caching in HDFS C API Client

2010-06-14 Thread Owen O'Malley
Indeed. On the terasort benchmark, I had to run intermediate jobs that
were larger than ram on the cluster to ensure that the data was not
coming from the file cache.

-- Owen


Re: Is it possible ....!!!

2010-06-10 Thread Owen O'Malley
You can define your own socket factory by setting the configuration parameter:

hadoop.rpc.socket.factory.class.default

to a class name of a SocketFactory. It is also possible to define
socket factories on a protocol by protocol basis. Look at the code in
NetUtils.getSocketFactory.
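
For example, to send client RPC through a SOCKS proxy with the factory that
ships with Hadoop (the proxy address is illustrative):

<property>
  <name>hadoop.rpc.socket.factory.class.default</name>
  <value>org.apache.hadoop.net.SocksSocketFactory</value>
</property>
<property>
  <name>hadoop.socks.server</name>
  <value>localhost:1080</value>
</property>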

-- Owen


Re: the same key in different reducers

2010-06-09 Thread Owen O'Malley
On Wed, Jun 9, 2010 at 3:15 PM, Alex Kozlov  wrote:
> So I assume it is entirely possible to write a partitioner that distributes
> the same key to multiple reducers and it does not have to be
> non-deterministic.  It can assign the partition based on the value.
>
> Is this correct?

Yes. I've never liked the fact that Partitioners get the value for
exactly that reason. It was originally put in for some obscure corner
case in Nutch. Fixing it now would be difficult.

Also note that "non-deterministic" doesn't imply using Random. You
could just fail to overload the hashcode method and take the default
from Object. That would cause you to hash based on the object's
address, which is different for each jvm.

-- Owen


Re: the same key in different reducers

2010-06-09 Thread Owen O'Malley


On Jun 9, 2010, at 1:17 AM, Oleg Ruchovets wrote:

So is that case possible or every and every reducer has unique  
output key?


The partitioner controls which reduce a given key is sent to. If the  
partitioner is non-deterministic, the key can end up going to  
different reduces. If you are using the default hash partitioner, that  
would imply that you didn't define a proper hash code for your key.


-- Owen


Re: calling C programs from Hadoop

2010-05-29 Thread Owen O'Malley
On Sat, May 29, 2010 at 12:52 PM, Asif Jan  wrote:
> Look at Hadoop streaming, may be it is helpful to you.

There is also Pipes, which is the C++ interface to MapReduce.

-- Owen


Re: Encryption in Hadoop 0.20.1?

2010-05-27 Thread Owen O'Malley
On Thu, May 27, 2010 at 6:58 AM, Arv Mistry  wrote:
> Thanks for responding Ted. I did see that link before but there weren't enough 
> details there for me to make sense of it. I'm not sure who Owen is ;(

I'm Owen, although I think I've used at least 5 different email
addresses on these lists at various times. *smile*

Since you specify 0.20, you'd probably want to put your keys into
HDFS and read them from the tasks. Note that this is *not* secure and
other users of your cluster can access your data in HDFS with only a
tiny bit of misdirection. (This will be fixed in 0.22, where we are
adding strong authentication based on Kerberos.)

The next step would be to define a compression codec that does the
encryption. So let's say you define a XorEncryption that does a simple
xor with a byte. (Obviously, you would use something better than xor,
it is just an example!) XorEncryption would need to implement
org.apache.hadoop.io.compress.CompressionCodec. You'd also need to add
your new class to the list of codecs in the configuration variable
io.compression.codecs.

For details of how to configure your mapreduce job with compression
(or in this case encryption), look at http://bit.ly/9PMHUA. If
XorEncryption returned ".xor" from getDefaultExtension, then any file that
ended in .xor would automatically be put through the encryption. So
input is automatically handled. You need to define some configuration
variables to get it applied to the output of MapReduce.
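
For concreteness, here is a hedged, minimal sketch of such a codec (the
package, class name, and single-byte key are all made up, XOR is of
course not real encryption, and the sketch skips the optional
Compressor/Decompressor objects):

package com.example;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.io.compress.Decompressor;

// Toy "encryption" codec: XORs every byte with a fixed key.
public class XorEncryption implements CompressionCodec {
  private static final int KEY = 0x5a;   // hard-coded toy key

  private static class XorOutputStream extends CompressionOutputStream {
    XorOutputStream(OutputStream out) throws IOException { super(out); }
    public void write(int b) throws IOException { out.write((b ^ KEY) & 0xff); }
    public void write(byte[] b, int off, int len) throws IOException {
      for (int i = 0; i < len; ++i) { write(b[off + i]); }
    }
    public void finish() throws IOException { }   // nothing is buffered
    public void resetState() throws IOException { }
  }

  private static class XorInputStream extends CompressionInputStream {
    XorInputStream(InputStream in) throws IOException { super(in); }
    public int read() throws IOException {
      int b = in.read();
      return b < 0 ? b : (b ^ KEY) & 0xff;
    }
    public int read(byte[] b, int off, int len) throws IOException {
      int n = in.read(b, off, len);
      for (int i = 0; i < n; ++i) { b[off + i] = (byte) (b[off + i] ^ KEY); }
      return n;
    }
    public void resetState() throws IOException { }
  }

  public CompressionOutputStream createOutputStream(OutputStream out)
      throws IOException {
    return new XorOutputStream(out);
  }

  public CompressionOutputStream createOutputStream(OutputStream out,
      Compressor unused) throws IOException {
    return new XorOutputStream(out);    // the stream does all the work
  }

  public CompressionInputStream createInputStream(InputStream in)
      throws IOException {
    return new XorInputStream(in);
  }

  public CompressionInputStream createInputStream(InputStream in,
      Decompressor unused) throws IOException {
    return new XorInputStream(in);      // the stream does all the work
  }

  // Simplification: no stand-alone Compressor/Decompressor in this sketch.
  public Class<? extends Compressor> getCompressorType() { return null; }
  public Compressor createCompressor() { return null; }
  public Class<? extends Decompressor> getDecompressorType() { return null; }
  public Decompressor createDecompressor() { return null; }

  public String getDefaultExtension() { return ".xor"; }
}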

-- Owen


Re: Can a Partitioner access the Reporter?

2010-05-12 Thread Owen O'Malley


On May 11, 2010, at 11:06 PM, gmar wrote:

I'd like to be able to have my customised Partitioner update counters
in the Reporter, i.e. so that I know how many keys have been sent to
each partition.

So, is it possible for the partitioner to obtain a reference to the
reporter?


No, even in the new API where we give access to the context within the  
close method, it isn't passed to the partitioner, unfortunately.


I guess it'd need to obtain this via the JobConf object it has
access to in the configure() method.


You can get the JobConf, but you can't get to the Reporter in the
configure method.



Or is there another way to skin this cat?


Roughly, you need to either set a static in the Mapper.map (in the new  
API use Mapper.setup) or emit a pseudo key or value with it. I'd lean  
toward a static...
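
A hedged sketch of the static approach in the old mapred API (all the
class names, the counter group, and the fixed array size are made up; it
relies on the partitioner and mapper sharing one task JVM, which is how
the map side runs):

import java.io.IOException;
import java.util.concurrent.atomic.AtomicLongArray;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Partitioner;
import org.apache.hadoop.mapred.Reporter;

// The partitioner tallies into a static; the mapper, which does get a
// Reporter, pushes the tallies into counters when it closes.
public class CountingPartitioner implements Partitioner<Text, IntWritable> {
  static final AtomicLongArray COUNTS = new AtomicLongArray(1024); // >= #reduces

  public void configure(JobConf job) { }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    int part = (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    COUNTS.incrementAndGet(part);
    return part;
  }

  public static class CountingMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private Reporter reporter;   // stashed on each map() call

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      this.reporter = reporter;
      output.collect(line, new IntWritable(1));
    }

    public void close() throws IOException {
      if (reporter == null) { return; }   // this map task saw no records
      for (int i = 0; i < COUNTS.length(); ++i) {
        long n = COUNTS.get(i);
        if (n > 0) {
          reporter.incrCounter("Partitions", "partition-" + i, n);
        }
      }
    }
  }
}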


-- Owen


Re: Questions about SequenceFiles

2010-05-11 Thread Owen O'Malley
On Tue, May 11, 2010 at 7:48 AM, Ananth Sarathy
 wrote:
> Ok,  how can I report that?

File a jira on the project that manages the type. I assume it is
Lucene in this case.

>  Also, it seems that requiring a no argument constructor but using an
> interface is kind of a broken paradigm. Shouldn't there be some other
> mechanism for this?

The problem is that given a class name from the SequenceFile, we need
to build an "empty" object. The most natural way to provide that
capability is with a 0 argument constructor.

-- Owen


Re: Questions about SequenceFiles

2010-05-11 Thread Owen O'Malley
Assumptions for Writables that should be documented somewhere:
  * Each type must have a 0 argument constructor.
  * Each call to write must not assume any shared state.
  * Each call to readFields must consume exactly the number of bytes
produced by write.

SequenceFile also assumes:
  * All keys are exactly the same type (not polymorphic).
  * All values are exactly the same type.
  * Both types are specified by the writer in the create call.
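
A minimal Writable sketch that follows those rules (the class name and
fields are invented):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class PointWritable implements Writable {
  private int x;
  private int y;

  public PointWritable() { }   // the required 0-argument constructor

  public PointWritable(int x, int y) { this.x = x; this.y = y; }

  public void write(DataOutput out) throws IOException {
    out.writeInt(x);           // no shared state is assumed
    out.writeInt(y);
  }

  public void readFields(DataInput in) throws IOException {
    x = in.readInt();          // consumes exactly the bytes write() produced
    y = in.readInt();
  }
}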

-- Owen


Re: Applying HDFS-630 patch to hadoop-0.20.2 tarball release?

2010-05-04 Thread Owen O'Malley
On Tue, May 4, 2010 at 10:03 AM, Joseph Chiu  wrote:
> Thanks Todd.    Where I really need help is to get up to speed on that
> process of recompiling (and re-installing the build outputs) with ant.

The place to look is in the wiki:

http://wiki.apache.org/hadoop/HowToRelease

It walks through the build process very well.

-- Owen


Re: hadoop conf for dynamically changing ips

2010-03-26 Thread Owen O'Malley

On Mar 26, 2010, at 9:39 AM, Gokulakannan M wrote:

I have a LAN in which the IPs of the machines will be changed
dynamically by the DHCP server.


I think you'd need to use a NAT translation so that inside your  
cluster you have stable IP addrs in 10.x.x.x but the external IP addr  
is set by DHCP.


-- Owen


Re: DeDuplication Techniques

2010-03-26 Thread Owen O'Malley


On Mar 25, 2010, at 11:09 AM, Joseph Stein wrote:


I have been researching ways to handle de-dupping data while running a
map/reduce program (so as to not re-calculate/re-aggregate data that
we have seen before[possibly months before]).


So roughly, your problem is that you have large amounts of historic  
data and you need to merge in the current month. The best solution  
that I've seen looks like:


Keep your historic data sorted by MD5.

Run a MapReduce job to sort your new data into MD5 order. Note that
you want a total order, but because the MD5s are evenly spaced across
the key space this is easy. Basically, you pick a number of reduces
(e.g. 256) and then use the top N bits of the MD5 to pick your reduce.
Since this job is only processing your new data, it is very fast.
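
A hedged sketch of the "top bits of the MD5 pick the reduce" partitioner
(old mapred API; the class name and the raw-digest key type are
assumptions, and with more than 256 reduces you would use more leading
bits):

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Keys are 16-byte MD5 digests; because digests are uniformly spread,
// slicing the key space by the top byte gives a balanced total order.
public class Md5RangePartitioner implements Partitioner<BytesWritable, Text> {
  public void configure(JobConf job) { }

  public int getPartition(BytesWritable md5, Text value, int numPartitions) {
    int topByte = md5.getBytes()[0] & 0xff;   // top 8 bits of the digest
    return topByte * numPartitions / 256;     // contiguous range per reduce
  }
}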


Next you do a map-side join where each input split consists of an MD5  
range. The RecordReader reads from the historic and new datasets  
merging them as they go. (You can use the map-side join library for  
this.) Your map does the merge of the new and old. This is a map-only  
job, so it also is very fast.


Of course if the new data is small enough you can read all of the new  
input in each of the maps and just keep (and sort in ram) the new  
records that are in the right range and do the merge from ram. This  
lets you avoid the step where you sort the new data. This kind of  
merge optimization is where Pig and Hive hide lots of the details from  
the developer.


-- Owen


Re: Measuring running times

2010-03-17 Thread Owen O'Malley


On Mar 17, 2010, at 4:47 AM, Antonio D'Ettole wrote:


Hi everybody,
as part of my project work at school I'm running some Hadoop jobs on a
cluster. I'd like to measure exactly how long each phase of the  
process

takes: mapping, shuffling (ideally divided into copying and sorting) and
reducing.


Look at the job history logs. They break down the times for each task.  
You need to run a script to aggregate them. You can see an example of  
the aggregation on my petabyte sort description:


http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html

-- Owen


Re: Security issue: hadoop fs shell bypass authentication?

2010-03-06 Thread Owen O'Malley


On Mar 5, 2010, at 4:49 PM, Allen Wittenauer wrote:


On 3/5/10 1:57 PM, "jiang licht"  wrote:
So, this means that hadoop fs shell does not require any
authentication and can be fired from anywhere?


There is no authentication/security layer in any released version of  
Hadoop.


True, although we are busily adding it. *Smile* It is going into trunk
and Yahoo is back-porting all of the security work on top of the Yahoo
0.20 branch. The primary coding is done, and it is undergoing QA now. The
plan is to get it onto the alpha clusters by April, and production
clusters by August. Although we haven't pushed the security branch out
yet to our github repository, we should soon.
(http://github.com/yahoo/hadoop-common)


-- Owen


Re: problem building trunk

2010-02-26 Thread Owen O'Malley


On Feb 26, 2010, at 10:22 AM, Massoud Mazar wrote:


I'm having issues building the trunk. I followed the steps mentioned at 
http://wiki.apache.org/hadoop/BuildingHadoopFromSVN



It is a documentation error. Giri, can you update it with the current  
targets (ie. mvn-install)?


Thanks,
   Owen


Re: CDH2 or Apache Hadoop - Official Debian packages

2010-02-25 Thread Owen O'Malley


On Feb 25, 2010, at 10:20 AM, Allen Wittenauer wrote:

Actually my hope is that Hadoop will eventually establish a stable API
(as planned) so that an upgrade will be backwards compatible.


History shows you are in for a long wait.


I hope not and I'm trying to make sure that isn't true. At this point,  
we have a lot of customers inside Yahoo who yell at our SVP when  
anyone breaks API compatibility with the previous release.


My hope is to get to the point where we do one major release a year and  
each major release is backwards compatible with the previous major  
release (as in you don't need to recompile your code). Bonus points if  
we can get a minor release out at the half year point. And of course  
bug fix releases as needed...


-- Owen


Re: Security Mechanisms in HDFS

2010-01-05 Thread Owen O'Malley


On Jan 5, 2010, at 7:44 AM, Yu Xi wrote:


Could any hadoop gurus tell me what kinds of security mechanisms are
already (or planned to be) implemented in the Hadoop filesystem?


It looks like you've found the ones that are already there.  You can  
see my slides about it here:


http://www.slideshare.net/oom65/plugging-the-holes-security-and-compatability-in-hadoop

We are actively working on it and have published a design document here:

http://bit.ly/75011o

-- Owen


I know there's a kind of Linux-like 9-bit (i.e. owner, group, other) access
control existing in HDFS. Unfortunately there are no user authentication
modules. Seems like a big defect for HDFS since, without authentication, user
authorization makes little sense.


It is enough to keep people from deleting things accidentally, like the
student who accidentally deleted /Users.



It is mentioned in the official HDFS docs
that another Kerberos authentication module will be added to HDFS in the
future. Could anybody tell me when this will happen?


We are planning to be feature complete in Feb 2010. We are also back  
porting the changes into Yahoo's 20 branch as well as putting them  
into trunk.


-- Owen


Re: use List in reducer

2009-12-26 Thread Owen O'Malley


On Dec 26, 2009, at 5:00 PM, Bryan McCormick wrote:

What appears to be going on is that the Iterable values seemed  
to be reusing the Text object being exposed in the for loop and just  
changing the content of the Text.


That is correct.


activeList.add(new Text(val.toString()));


It would be more efficient to just do:
   activeList.add(new Text(val));
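
Put together, a hedged new-API reducer sketch (the class name and output
format are invented) that collects the values safely looks like:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CollectingReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<Text> activeList = new ArrayList<Text>();
    for (Text val : values) {
      activeList.add(new Text(val));   // copy: the framework reuses 'val'
    }
    context.write(key, new Text("count=" + activeList.size()));
  }
}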

-- Owen

