hadoop 0.18.0 ec2 images?

2008-08-20 Thread Karl Anderson
Are there any publicly available EC2 images for Hadoop 0.18.0 yet?   
There don't seem to be any in the hadoop-ec2-images bucket.


RE: Cannot read reducer values into a list

2008-08-20 Thread Deepika Khera
Thanks...this works beautifully :) !

Deepika

-Original Message-
From: Owen O'Malley [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 20, 2008 7:52 AM
To: core-user@hadoop.apache.org
Subject: Re: Cannot read reducer values into a list


On Aug 19, 2008, at 4:57 PM, Deepika Khera wrote:

> Thanks for the clarification on this.
>
> So, it seems like cloning the object before adding to the list is the
> only solution for this problem. Is that right?

Yes. You can use WritableUtils.clone to do the job.

-- Owen


RE: Why is scaling HBase much simpler than scaling a relational db?

2008-08-20 Thread Jim Kellerman
Stuart,

In general you will get a quicker response to HBase questions by posting them 
to the HBase mailing list ([EMAIL PROTECTED]) see 
http://hadoop.apache.org/hbase/mailing_lists.html for how to subscribe.

Perhaps the best document on scaling HBase is actually the Bigtable paper:
http://labs.google.com/papers/bigtable.html


---
Jim Kellerman, Senior Engineer; Powerset (a Microsoft Company)

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
> Behalf Of Stuart Sierra
> Sent: Wednesday, August 20, 2008 1:03 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Why is scaling HBase much simpler than scaling a relational db?
>
> On Tue, Aug 19, 2008 at 9:44 AM, Mork0075 <[EMAIL PROTECTED]> wrote:
> > Can you please explain why someone should use HBase for horizontal
> > scaling instead of a relational database? One reason for me would be
> > that I don't have to implement the sharding logic myself. Are there others?
>
> A slight tangent -- there are various tools that implement sharding
> over relational databases like MySQL.  Two that I know of are
> DBSlayer,
> http://code.nytimes.com/projects/dbslayer
> and MySQL Proxy,
> http://forge.mysql.com/wiki/MySQL_Proxy
>
> I don't know of any formal comparisons between sharding traditional
> database servers and distributed databases like HBase.
> -Stuart


Re: Know how many records remain?

2008-08-20 Thread Qin Gao
Thanks Chris, that's exactly what I am trying to do. It solves my problem.

On Wed, Aug 20, 2008 at 4:36 PM, Chris Dyer <[EMAIL PROTECTED]> wrote:

> Qin, since I can guess what you're trying to do with this (emit a
> bunch of expected counts at the end of EM?), you can write output
> during the call to close().  It involves having to store the output
> collector object as a member of the class, but this is a way to do a
> final flush on the object before it is destroyed.
>
> Chris
>
> On Wed, Aug 20, 2008 at 7:02 PM, Qin Gao <[EMAIL PROTECTED]> wrote:
> > Hi mailing,
> >
> > Is there any way to know whether the mapper is processing the last record
> > assigned to this node, or how many records remain to be processed on this
> > node?
> >
> >
> > Qin
> >
>


Re: Know how many records remain?

2008-08-20 Thread Chris Dyer
Qin, since I can guess what you're trying to do with this (emit a
bunch of expected counts at the end of EM?), you can write output
during the call to close().  It involves having to store the output
collector object as a member of the class, but this is a way to do a
final flush on the object before it is destroyed.
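
A minimal sketch of this pattern against the old org.apache.hadoop.mapred API
(the class name, key/value types, and the "expected-count" output key are
illustrative only, not from this thread):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ExpectedCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  // Keep a reference to the collector so close() can do a final flush.
  private OutputCollector<Text, LongWritable> out;
  private long expectedCount = 0;

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    this.out = output;     // remember the collector for the final flush
    expectedCount++;       // accumulate whatever statistic is needed
  }

  public void close() throws IOException {
    // Called once per task, after the last record has been processed.
    if (out != null) {
      out.collect(new Text("expected-count"), new LongWritable(expectedCount));
    }
  }
}

Note that close() runs once per map or reduce task, after the last call to
map()/reduce() on that task, which is what makes the final emit possible.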

Chris

On Wed, Aug 20, 2008 at 7:02 PM, Qin Gao <[EMAIL PROTECTED]> wrote:
> Hi mailing,
>
> Is there any way to know whether the mapper is processing the last record
> assigned to this node, or how many records remain to be processed on this
> node?
>
>
> Qin
>


Re: Why is scaling HBase much simpler than scaling a relational db?

2008-08-20 Thread Stuart Sierra
On Tue, Aug 19, 2008 at 9:44 AM, Mork0075 <[EMAIL PROTECTED]> wrote:
> Can you please explain why someone should use HBase for horizontal
> scaling instead of a relational database? One reason for me would be
> that I don't have to implement the sharding logic myself. Are there others?

A slight tangent -- there are various tools that implement sharding
over relational databases like MySQL.  Two that I know of are
DBSlayer,
http://code.nytimes.com/projects/dbslayer
and MySQL Proxy,
http://forge.mysql.com/wiki/MySQL_Proxy

I don't know of any formal comparisons between sharding traditional
database servers and distributed databases like HBase.
-Stuart


Know how many records remain?

2008-08-20 Thread Qin Gao
Hi mailing,

Is there any way to know whether the mapper is processing the last record
assigned to this node, or how many records remain to be processed on this
node?


Qin


Reminder: Monthly Hadoop User Group Meeting (Bay Area) today

2008-08-20 Thread Ajay Anand
Reminder: The next Hadoop User Group (Bay Area) meeting is scheduled for
today, Wednesday, Aug 20th from 6 - 7:30 pm at Yahoo! Mission College,
Santa Clara, CA, Building 1, Training Rooms 3&4.

 

Agenda:

Pig Update: Olga Natkovich
Hadoop 0.18 and post-0.18: Sameer Paranjpye

Registration and directions: http://upcoming.yahoo.com/event/1011188

Look forward to seeing you there!

Ajay



Re: pseudo-global variable constuction

2008-08-20 Thread Sandy
Thank you very much, Paco and Jason. It works!

For any users who may be curious what this may look like in code, here is a
small snippet of mine:

file: myLittleMRProgram.java
package org.apache.hadoop.examples;

  public static class Reduce extends MapReduceBase implements Reducer {
    private int nTax = 0;

    public void configure(JobConf job) {
      super.configure(job);
      // read back the value the driver stored with conf.set("nTax", ...)
      String tax = job.get("nTax");
      nTax = Integer.parseInt(tax);
    }

    public void reduce(/* key, values, output, reporter */) throws IOException {
      ...
      System.out.println("nTax is: " + nTax);
    }
  }

  main() {
    ...
    conf.set("nTax", other_args.get(2));
    JobClient.runJob(conf);
    ...
    return 0;
  }



-SM

On Tue, Aug 19, 2008 at 5:02 PM, Jason Venner <[EMAIL PROTECTED]> wrote:

> Since the map & reduce tasks generally run in a separate java virtual
> machine and on distinct machines from your main task's java virtual machine,
> there is no sharing of variables between the main task and the map or reduce
> tasks.
>
> The standard way is to store the variable in the Configuration (or JobConf)
> object in your main task
> Then in the configure method of your map and reduce task class, extract the
> variable value from the JobConf object.
>
> You will need to override the configure method in your map and reduce
> classes.
>
> This will also require that the variable value be serializable.
>
> For lots of large variables this can be expensive.
>
>
> Sandy wrote:
>
>> Hello,
>>
>>
>> My M/R program is going smoothly, except for one small problem. I have a
>> "global" variable that is set by the user (and thus in the main function),
>> that I want one of my reduce functions to access. This is a read-only
>> variable. After some reading in the forums, I tried something like this:
>>
>> file: MyGlobalVars.java
>> package org.apache.hadoop.examples;
>> public class MyGlobalVars {
>>static public int nTax;
>> }
>> --
>>
>> file: myLittleMRProgram.java
>> package org.apache.hadoop.examples;
>> map function() {
>>   System.out.println("in map function, nTax is: " + MyGlobalVars.nTax);
>> }
>> 
>> main() {
>> MyGlobalVars.nTax = other_args.get(2);
>> System.out.println("in main function, nTax is: " + MyGlobalVars.nTax);
>> 
>> JobClient.runJob(conf);
>> 
>> return 0;
>> }
>> 
>>
>> When I run it, I get:
>> in main function, nTax is 20 (which is what I want)
>> in map function, nTax is 0 (<--- this is not right).
>>
>>
>> I am a little confused on how to resolve this. I apologize in advance if
>> this is a blatant Java error; I only began programming in the language a
>> few weeks ago.
>>
>> Since Map Reduce tries to avoid the whole shared-memory scene, I am more
>> than willing to have each reduce function receive a local copy of this
>> user
>> defined value. However, I am a little confused on what the best way to do
>> this would be. As I see it, my options are:
>>
>> 1.) Write the user-defined value to HDFS in the main function, and have it
>> read from HDFS in the reduce function. I can't quite figure out the code for
>> this, though. I know how to specify -an- input file for the map reduce task,
>> but if I did it this way, won't I need to specify two separate input files?
>>
>> 2.) Put it in the construction of the reduce object (I saw this mentioned in
>> the archives). How would I accomplish this exactly when the value is user
>> defined? Parameter passing? If so, won't this require me changing the
>> underlying map reduce base (which makes me a touch nervous, since I'm still
>> very new to Hadoop)?
>>
>> What would be the easiest way to do this?
>>
>> Thanks in advance for the help. I appreciate your time.
>>
>> -SM
>>
>>
>>
> --
> Jason Venner
> Attributor - Program the Web 
> Attributor is hiring Hadoop Wranglers and coding wizards, contact if
> interested
>


Re: Cannot read reducer values into a list

2008-08-20 Thread Owen O'Malley


On Aug 19, 2008, at 4:57 PM, Deepika Khera wrote:


> Thanks for the clarification on this.
>
> So, it seems like cloning the object before adding to the list is the
> only solution for this problem. Is that right?


Yes. You can use WritableUtils.clone to do the job.
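
A minimal sketch of buffering values this way with the old
org.apache.hadoop.mapred API. The Text/IntWritable types are assumptions, and
the exact WritableUtils.clone signature has varied between Hadoop versions, so
the cast may be redundant on newer releases:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class BufferingReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  private JobConf conf;

  public void configure(JobConf job) {
    this.conf = job;   // clone() needs a configuration to create new instances
  }

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    List<IntWritable> buffered = new ArrayList<IntWritable>();
    while (values.hasNext()) {
      // The framework reuses one Writable instance across next() calls,
      // so clone each value before holding a reference to it.
      buffered.add((IntWritable) WritableUtils.clone(values.next(), conf));
    }
    // ... work with the full list, then emit results ...
    for (IntWritable v : buffered) {
      output.collect(key, v);
    }
  }
}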

-- Owen


Hadoop 0.17.2 released

2008-08-20 Thread Owen O'Malley
Hadoop Core 0.17.2 has been released and the website updated. It fixes  
a couple of critical bugs in the 0.17 branch. It can be downloaded from:


http://www.apache.org/dyn/closer.cgi/hadoop/core/

-- Owen


Re: Missing lib/native/Linux-amd64-64 on hadoop-0.17.2.tar.gz

2008-08-20 Thread Yi-Kai Tsai

hi

Could anyone help re-pack 0.17.2 with the missing lib/native/Linux-amd64-64?

thanks

> On Wed, Aug 20, 2008 at 9:31 AM, Yi-Kai Tsai <[EMAIL PROTECTED]> wrote:
>
> > But we do have lib/native/Linux-amd64-64 on hadoop-0.17.1.tar.gz and
> > hadoop-0.18.0.tar.gz ?
>
> At least for -0.17.1, yes there is.
>
> Regards,
>
> Leon Mergen



--
Yi-Kai Tsai (cuma) <[EMAIL PROTECTED]>, Asia Regional Search Engineering.



Re: Missing lib/native/Linux-amd64-64 on hadoop-0.17.2.tar.gz

2008-08-20 Thread Leon Mergen
On Wed, Aug 20, 2008 at 9:31 AM, Yi-Kai Tsai <[EMAIL PROTECTED]> wrote:

> But we do have  lib/native/Linux-amd64-64 on  hadoop-0.17.1.tar.gz and
> hadoop-0.18.0.tar.gz ?


At least for -0.17.1, yes there is.

Regards,

Leon Mergen


Re: Missing lib/native/Linux-amd64-64 on hadoop-0.17.2.tar.gz

2008-08-20 Thread Yi-Kai Tsai

hi

But we do have lib/native/Linux-amd64-64 on hadoop-0.17.1.tar.gz and
hadoop-0.18.0.tar.gz ?

> ya, looks like Owen never built the 64bit native library.  It's an
> optional build step:
> wiki.apache.org/hadoop/HowToRelease
>
> Nige
>
> On Aug 19, 2008, at 9:24 PM, Yi-Kai Tsai wrote:
>
> > hi
> >
> > I found we miss lib/native/Linux-amd64-64 on hadoop-0.17.2.tar.gz ?
> >
> > thanks
> >
> > --
> > Yi-Kai Tsai (cuma) <[EMAIL PROTECTED]>, Asia Regional Search
> > Engineering.



--
Yi-Kai Tsai (cuma) <[EMAIL PROTECTED]>, Asia Regional Search Engineering.



Re: input files

2008-08-20 Thread Amareshwari Sriramadasu
You can add more paths to the input using
FileInputFormat.addInputPath(JobConf, Path).
You can also specify comma-separated filenames as the input path using
FileInputFormat.setInputPaths(JobConf, String commaSeparatedPaths).
More details at
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html

You can also use a glob to specify multiple paths as a single path.
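
A small sketch of these calls against the old mapred API; the paths here are
made up for illustration only:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class InputPathsExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(InputPathsExample.class);

    // Add files or directories one at a time.
    FileInputFormat.addInputPath(conf, new Path("/data/part-00000"));
    FileInputFormat.addInputPath(conf, new Path("/data/part-00001"));

    // Or replace the whole input list with comma-separated paths.
    FileInputFormat.setInputPaths(conf, "/data/a.txt,/data/b.txt,/logs/today");

    // A glob counts as a single path and expands to many files.
    FileInputFormat.setInputPaths(conf, new Path("/logs/2008-08-*/part-*"));

    // ... set mapper, reducer, output path, then JobClient.runJob(conf) ...
  }
}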

Thanks
Amareshwari
Deepak Diwakar wrote:

> Hadoop usually takes either a single file or a folder as an input parameter.
> But is it possible to modify it so that it can take a list of files (not a
> folder) as the input parameter?