Re: Hadoop-Archive Error for size of input data >2GB

2008-07-21 Thread Pratyush Banerjee

Thanks Mahadev,
Thanks for letting me know of the patch. I have already applied it, and the
archiving now runs fine for an input directory of about 5GB.


I am currently testing the same thing programmatically, but since it works
from the command line, it should work this way as well.


thanks and regards~

Pratyush

[EMAIL PROTECTED] wrote:

Hi Pratyush,

  I think this bug was fixed in
https://issues.apache.org/jira/browse/HADOOP-3545.

Can you apply the patch and see if it works?

Mahadev


On 7/21/08 5:56 AM, "Pratyush Banerjee" <[EMAIL PROTECTED]> wrote:

  

Hi All,

I have been using hadoop archives programmatically to generate har
archives from some logfiles which are being dumped into HDFS.

When the input directory to the Hadoop archiving program contains more
than 2GB of data, the archiving strangely fails with an error message saying

INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=   Illegal Capacity: -1

Going into the code, I found that this was due to numMaps having the
value -1.

As per the code in org.apache.hadoop.util.HadoopArchives:
archive(List srcPaths, String archiveName, Path dest)

the numMaps is initialized as
int numMaps = (int)(totalSize/partSize);
//run atleast one map.
conf.setNumMapTasks(numMaps == 0? 1:numMaps);

partSize is statically assigned the value of 2GB at the beginning of the
class as:

static final long partSize = 2 * 1024 * 1024 * 1024

Strangely enough, the value I find assigned to partSize is -2147483648.

As a result, for input directories larger than 2GB, numMaps is assigned -1,
which leads to the code throwing the error above.
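For anyone who hits this before applying the patch: the negative partSize is
plain Java integer overflow. All the operands in the literal expression are
ints, so the product is computed in 32-bit arithmetic and wraps to
Integer.MIN_VALUE before it is widened to long. A minimal standalone
illustration (not the Hadoop source itself; the class name is only for the
example):

public class PartSizeOverflow {
    public static void main(String[] args) {
        // 32-bit arithmetic wraps before the widening assignment to long
        long overflowed = 2 * 1024 * 1024 * 1024;    // -2147483648
        // forcing one operand to long gives the intended value
        long expected = 2L * 1024 * 1024 * 1024;     // 2147483648
        System.out.println(overflowed + " vs " + expected);
    }
}

Presumably the HADOOP-3545 patch addresses exactly this; in general the fix is
to force 64-bit arithmetic, e.g. 2L * 1024 * 1024 * 1024, so that numMaps
comes out positive for inputs over 2GB.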

I am using hadoop-0.17.1, and I got the archiving facility after applying
the hadoop-3307_4 patch.

This looks like a bug to me, so please let me know how to proceed.

Pratyush Banerjee




  




Re: Volunteer recruitment for matrix library project on Hadoop.

2008-07-21 Thread Edward J. Yoon
Thank you for all interest.

BTW, please subscribe to the Hama developer mailing list instead of
sending mail to [EMAIL PROTECTED]:

[EMAIL PROTECTED]

- Edward

On Thu, Jul 17, 2008 at 11:26 AM, Edward J. Yoon <[EMAIL PROTECTED]> wrote:
> Hello all,
>
> The Hama team, which is trying to port typical linear algebra
> operations to Hadoop, is looking for a couple more volunteers. This
> would essentially speed up development time for typical machine
> learning algorithms.
>
> If you are interested, contact [EMAIL PROTECTED]
>
> Thanks.
> --
> Best regards,
> Edward J. Yoon,
> http://blog.udanax.org
>



-- 
Best regards,
Edward J. Yoon,
http://blog.udanax.org


Regarding reading data from distributed hadoop cluster

2008-07-21 Thread Ninad Raut
Hi,

Can anyone help me understand how to read data distributed over a cluster?

For instance, if we give a path /user/hadoop/parsed_data/part-/data to
the map-reduce program, will it find the data at the same path on all the
servers in the cluster, or will it only be the local file?

If it only reads from the local file, how do we read data from the whole cluster?
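A path like that is resolved against the job's default filesystem, which is
HDFS when fs.default.name points at the namenode, rather than against each
machine's local disk, so every node in the cluster sees the same file. A
minimal sketch of reading such a path directly; the class name and argument
handling are only illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromDfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up hadoop-site.xml
    FileSystem fs = FileSystem.get(conf);          // HDFS if it is the default fs
    FSDataInputStream in = fs.open(new Path(args[0]));
    byte[] buf = new byte[4096];
    int n = in.read(buf);                          // read the first few KB as a smoke test
    System.out.println("read " + n + " bytes");
    in.close();
  }
}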


NR.


Re: New York user group?

2008-07-21 Thread Matt Kangas
Count me as another interested party.

--Matt

On Fri, Jul 18, 2008 at 8:59 AM, Alex Dorman <[EMAIL PROTECTED]> wrote:
> Please let me know if you would be interested in joining NY Hadoop user group 
> if one existed.
>
> I know about 5-6 people in New York City running Hadoop. I am sure there are 
> many more.
>
> Let me know. If there is some interest, I will try to put together first 
> meeting.
>
>
> thanks
>
> -Alex

-- 
Matt Kangas
[EMAIL PROTECTED]


Re: more than one reducer?

2008-07-21 Thread Taeho Kang
I don't know if there is any in-place mechanism for what you're looking for.


However, you could write a partitioner that distributes data in a way that
lower keys go to lower-numbered reducers and higher keys go to higher-numbered
reducers (e.g. keys starting with 'A~D' go to part-, 'E~H' go to
part-0001, and so on). If you knew beforehand how the keys are distributed,
you could also spread the data quite evenly across the reducers.

When the job is done, simply download the result files and merge them
together, and you have sorted output.
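A rough sketch of such a partitioner against the old mapred API; the class
name and the A-Z banding are only illustrative assumptions, so adjust the
banding to your actual key distribution:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sends keys to reducers in alphabetical bands, so the lowest-numbered
// part file ends up holding the lowest keys. Assumes Text keys and values.
public class AlphabetRangePartitioner implements Partitioner<Text, Text> {

  public void configure(JobConf job) {
    // nothing to configure in this sketch
  }

  public int getPartition(Text key, Text value, int numPartitions) {
    String k = key.toString();
    if (k.length() == 0) {
      return 0;
    }
    char first = Character.toUpperCase(k.charAt(0));
    if (first < 'A' || first > 'Z') {
      return 0;  // non-alphabetic keys all go to the first reducer
    }
    // Spread the 26 letters proportionally over the available reducers.
    return (first - 'A') * numPartitions / 26;
  }
}

It would be registered on the job with
conf.setPartitionerClass(AlphabetRangePartitioner.class).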



On Tue, Jul 22, 2008 at 9:08 AM, Mori Bellamy <[EMAIL PROTECTED]> wrote:

> hey all,
> i was wondering if its possible to split up the reduce task amongst more
> than one machine. i figured it might be possible for the map output to be
> copied to multiple machines; then each reducer could sort its keys and then
> combine them into one big sorted output (a la mergesort). does anybody know
> if there is an in-place mechanism for this?
>


hadoop-ec2 log access

2008-07-21 Thread Karl Anderson
I'm unable to access my logs with the JobTracker/TaskTracker web  
interface for a Hadoop job running on Amazon EC2.  The URLs given for  
the task logs are of the form:


  http://domu-[...].compute-1.internal:50060/

The Hadoop-EC2 docs suggest that I should be able to get onto port
50060 on the master and the task boxes. Is there a way to reach the
logs, maybe by finding out what IP address to use? Or is there a way
to see the logs on the master? When I run pseudo-distributed, the
logs show up in the logs/userlogs subdirectory of the Hadoop root, but
not on my EC2 instances.


I'm running a streaming job, so I need to be able to look at the  
stderr of my tasks.


Thanks for any help.


max number of files opened at the same time on Hdfs?

2008-07-21 Thread Eric Zhang

Hi,
Apologies if this question has been answered before, but I could not find
it in the archive or the twiki pages. What is the maximum number of files
that can be open for writing at the same time on an HDFS cluster? I am
streaming data into many different files (on the order of thousands) at
the same time on HDFS (a 200-node cluster) and seem to have problems
doing this. Any insight is greatly appreciated.





--
Eric Zhang
408-349-2466
Vespa Content team



Problem of Hadoop's Partitioner

2008-07-21 Thread Gopal Gandhi
I am following the example in
http://hadoop.apache.org/core/docs/current/streaming.html about Hadoop's
partitioner, org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner. It seems
that the values come out sorted lexicographically, e.g.:
1
12
15
2
28

What if I want to get a numerically sorted list:
1
2
12
15
28
What partitioner shall I use?



  

Re: HQL usage

2008-07-21 Thread Edward J. Yoon
HQL will be integrated into the HRdfStore project.
See http://groups.google.com/group/hrdfstore

Thanks,
Edward J. Yoon

On 7/22/08, stack <[EMAIL PROTECTED]> wrote:
> lucio Piccoli wrote:
> > hi Tho Pham
> >
> > i have checked the HQL api but the only reference i found was the
> org.apache.hadoop.hbase.hql package.
> > is this correct?
> >
> >
>
> HQL was the shell in 0.1.x.  It's been deprecated in TRUNK and the pending
> 0.2.0 release.
> > Since it uses a TableFormatter to display the result set, I *can't* see how
> > to use it efficiently to retrieve a result set. I guess I was expecting
> > JDBC-like semantics.
> > I really would like to port my existing application over to HBase but use
> > an existing SQL-like (Hibernate) query syntax.
> > Is this what HBase is all about, or have I misunderstood?
> >
> > any help appreciated.
> >
> >
> >
>
> Disabuse yourself of any notion that hbase is an RDBMS.  There's no SQL,
> JDBC, hibernate connector, etc.
>
> Going by your questions above, I'd suggest you do a little background
> reading so you get a better sense of what hbase is about.  Start with the
> hbase architecture paper up on our wiki:
> http://wiki.apache.org/hadoop/Hbase.   Feel free to ask
> questions in this forum if there is anything you need help with.
>
> Yours,
> St.Ack
>
>
>
> > adios dude
> >
> > -lucioto pop
> > ---
> > [EMAIL PROTECTED]
> >
> >
> >
> > > Date: Mon, 21 Jul 2008 22:14:33 +0700
> > > From: [EMAIL PROTECTED]
> > > To: [EMAIL PROTECTED]
> > > Subject: Re: HQL usage
> > >
> > > Dear lucio,
> > >
> > > Yes, sure.
> > > 1. Download hbase-0.1.3 library
> > > 2. Create new Java project
> > > 3. Refer to the download package
> > > 4. Using hql package in HBase. You should see the API docs of HBase for
> more detail.
> > >
> > > Good luck,
> > > Best regards,
> > > Tho Pham
> > >
> > > lucio Piccoli wrote:
> > >
> > >
> > > > hi all,
> > > >
> > > > i  have been reading the docs on HQL and thought it would be a great
> to use programatically.
> > > >
> > > > but after checking out the src, it seems it is only used for the
> shell.
> > > >
> > > > is that the intended usage of HQL or can it (or its replacement)  be
> used programatically?
> > > >
> > > > adios dude
> > > >
> > > > -lucio
> > > > ---
> > > > [EMAIL PROTECTED]
> > > >
> > > >
> > > >
> > >
> >
> >
> >
> >
>
>


-- 
Best regards,
Edward J. Yoon,
http://blog.udanax.org


more than one reducer?

2008-07-21 Thread Mori Bellamy

hey all,
i was wondering if it's possible to split up the reduce task amongst
more than one machine. i figured it might be possible for the map  
output to be copied to multiple machines; then each reducer could sort  
its keys and then combine them into one big sorted output (a la  
mergesort). does anybody know if there is an in-place mechanism for  
this?


Re: question about Counters

2008-07-21 Thread Daniel Yu
thats great, thanks a lot!

Daniel

2008/7/21 Christian Ulrik Søttrup <[EMAIL PROTECTED]>:

> Hi,
>
> I use a counter in my reducer to check whether another iteration (of map
> reduce cycle) is necessary. I have a similar declaration as yours.
> Then in my main program i have:
>
> ***
> client.setConf(conf);
> RunningJob rj = JobClient.runJob(conf);
> Counters cs = rj.getCounters();
> long swaps=cs.getCounter(Red.Count.SWAPS);
>
> where Red is the class that defines the reducer and contains the
> enumeration.
>
> Cheers,
> Christian
>
> Daniel Yu skrev:
>
>  hi,
>>  i defined a counter of my own, and updated it in map method,
>>
>> protected static enum MyCounter {
>>INPUT_WORDS
>>};
>> ...
>> public void map(...) {
>>  ...
>>reporter.incrCounter(MyCounter.INPUT_WORDS, 1);
>> }
>>
>> and can i fetch the counts later? like in the  run() method after the job
>> is
>> finished,
>> coz i want to store it as a parameter file for later MR jobs.
>>
>> looks like the framework will use Counters.getCounter(enum key) to return
>> a
>> certain count,
>> which can be seen when the JobClient tries to run a job, but in my
>> application(outside the
>> job running) how can i get the Counters object? my guess is its maintained
>> by the framework,
>> so user doesnt need to create Counters object in the application.
>>
>> Thanks.
>>
>> Daniel
>>
>>
>>
>
>


Re: question about Counters

2008-07-21 Thread Christian Ulrik Søttrup

Hi,

I use a counter in my reducer to check whether another iteration (of the
map-reduce cycle) is necessary. I have a similar declaration to yours.

Then in my main program i have:

***
client.setConf(conf);
RunningJob rj = JobClient.runJob(conf);
Counters cs = rj.getCounters();
long swaps=cs.getCounter(Red.Count.SWAPS);

where Red is the class that defines the reducer and contains the 
enumeration.
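A self-contained sketch of both sides, assuming the old mapred API; the class
name Red, the Count.SWAPS enum, and the summing reduce body are placeholders
that just mirror the fragment above:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.RunningJob;

public class Red extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  public static enum Count { SWAPS }

  public void reduce(Text key, Iterator<LongWritable> values,
                     OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    long sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    reporter.incrCounter(Count.SWAPS, 1);   // bump the custom counter
    out.collect(key, new LongWritable(sum));
  }

  // Driver side: run the job, then read the counter back from the RunningJob.
  public static long runAndGetSwaps(JobConf conf) throws IOException {
    RunningJob rj = JobClient.runJob(conf);
    Counters cs = rj.getCounters();
    return cs.getCounter(Count.SWAPS);
  }
}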


Cheers,
Christian

Daniel Yu skrev:

hi,
  i defined a counter of my own, and updated it in map method,

protected static enum MyCounter {
INPUT_WORDS
};
...
public void map(...) {
  ...
reporter.incrCounter(MyCounter.INPUT_WORDS, 1);
}

and can i fetch the counts later? like in the  run() method after the job is
finished,
coz i want to store it as a parameter file for later MR jobs.

looks like the framework will use Counters.getCounter(enum key) to return a
certain count,
which can be seen when the JobClient tries to run a job, but in my
application(outside the
job running) how can i get the Counters object? my guess is its maintained
by the framework,
so user doesnt need to create Counters object in the application.

Thanks.

Daniel

  




DFS, write sequence number and consistency

2008-07-21 Thread Kevin
Hi there,

It looks like the current Hadoop DFS treats the DFSClient as the "primary
node". See http://wiki.apache.org/hadoop/DFS_requirements

In the Google File System, write synchronization among multiple clients
is controlled by the primary node, which decides the sequence of
mutations to a block and applies it to every replica. But in Hadoop, how is
this achieved? If multiple clients write to the same block, what will
happen? Moreover, is this scenario even possible in the current implementation?

Thanks and regards,
-Kevin


question about Counters

2008-07-21 Thread Daniel Yu
hi,
  i defined a counter of my own and update it in the map method:

protected static enum MyCounter {
INPUT_WORDS
};
...
public void map(...) {
  ...
reporter.incrCounter(MyCounter.INPUT_WORDS, 1);
}

Can I fetch the counts later, e.g. in the run() method after the job is
finished? I want to store them in a parameter file for later MR jobs.

It looks like the framework uses Counters.getCounter(enum key) to return a
particular count, which can be seen when the JobClient runs a job, but in my
application (outside the running job) how can I get the Counters object? My
guess is that it is maintained by the framework, so the user doesn't need to
create a Counters object in the application.

Thanks.

Daniel


[Streaming] I figured out a way to do combining using mapper, would anybody check it?

2008-07-21 Thread Gopal Gandhi
I am using Hadoop Streaming.
I figured out a way to do combining in the mapper; is it the same as using a
separate combiner?

For example: the input is a list of words, and I want to count the total
number of occurrences of each word.
The traditional mapper is:

while (<STDIN>) {
  chomp ($_);
  $word = $_;
  print "$word\t1\n";
}


Instead of using an additional combiner, I modified the mapper to use a hash:

%hash = ();
while (<STDIN>) {
  chomp ($_);
  $word = $_;
  $hash{$word}++;
}

foreach $key (keys %hash) {
  print "$key\t$hash{$key}\n";
}

Is it the same as using a separate combiner?


  

Re: type mismatch from key to map

2008-07-21 Thread Khanh Nguyen
Never mind, I figured out my problem: I had not configured the output format.
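For anyone else who hits the same exception: the usual cause is that the
job's declared (map) output key class still defaults to LongWritable while
the mapper, here InverseMapper over KeyValueTextInputFormat, emits Text keys.
A sketch of a second job declared consistently; the class name, method name
and paths are placeholders, not the original code:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.InverseMapper;

public class InverseSortJob {
  public static void runSortJob(String inputDir, String outputDir) throws Exception {
    JobConf sortJob = new JobConf(InverseSortJob.class);
    sortJob.setInputFormat(KeyValueTextInputFormat.class);  // reads Text/Text pairs
    sortJob.setMapperClass(InverseMapper.class);            // swaps key and value, still Text/Text
    // Without these two declarations the framework expects the default
    // LongWritable key from the map, which is exactly the reported mismatch.
    sortJob.setOutputKeyClass(Text.class);
    sortJob.setOutputValueClass(Text.class);
    sortJob.setOutputFormat(TextOutputFormat.class);
    sortJob.setNumReduceTasks(1);                           // single sorted output file
    FileInputFormat.addInputPath(sortJob, new Path(inputDir));
    FileOutputFormat.setOutputPath(sortJob, new Path(outputDir));
    JobClient.runJob(sortJob);
  }
}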

On Mon, Jul 21, 2008 at 1:44 PM, Khanh Nguyen <[EMAIL PROTECTED]> wrote:
> Hi Daniel,
>
> The outputformat of my 1st hadoop job is TextOutputFormat. The
> skeleton of my code follows:
>
> public int run(String[] args) throws Exception {
> //set up and run job 1
> ...
>conf.setOutputFormat(TextOutputFormat.class);
>FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>
> 
> //set up and run job 2
> ...
>  FileInputFormat.addInputPath(sortJob, new Path(args[1] +
> "/part-0"));
>  FileOutputFormat.setOutputPath(sortJob, new Path(args[1]
> + "/result/"));
>  sortJob.setInputFormat(KeyValueTextInputFormat.class);
>  sortJob.setMapperClass(InverseMapper.class)
> ..
>
> }
>
> Please help.
>
> -k
>
> On Mon, Jul 21, 2008 at 1:30 PM, Daniel Yu <[EMAIL PROTECTED]> wrote:
>> hi k,
>>  i think u should look at ur map output format setting, and check if that
>> fits ur reduce input .
>>
>> Daniel
>>
>> 2008/7/21 Khanh Nguyen <[EMAIL PROTECTED]>:
>>
>>> Hello,
>>>
>>> I am getting this error
>>>
>>> java.io.IOException: Type mismatch in key from map: expected
>>> org.apache.hadoop.io.LongWritable, recieved org.apache.hadoop.io.Text
>>>
>>>
>>> Could someone please explain to me what i am doing wrong. Follow is
>>> the code I think is responsible...
>>>
>>> public int run() {
>>> .
>>>
>>> sortJob.setInputFormat(KeyValueTextInputFormat.class);
>>>
>>> sortJob.setMapperClass(InverseMapper.class);
>>> 
>>>
>>> }
>>>
>>> Thanks
>>>
>>> -k
>>>
>>
>


Re: Hadoop-Archive Error for size of input data >2GB

2008-07-21 Thread Mahadev Konar
Hi Pratyush,

  I think this bug was fixed in
https://issues.apache.org/jira/browse/HADOOP-3545.

Can you apply the patch and see if it works?

Mahadev


On 7/21/08 5:56 AM, "Pratyush Banerjee" <[EMAIL PROTECTED]> wrote:

> Hi All,
> 
> I have been using hadoop archives programmatically  to generate  har
> archives from some logfiles  which are being dumped into the hdfs.
> 
> When the input directory to Hadoop Archiving program has files of size
> more than 2GB, strangely the archiving fails with a error message saying
> 
> INFO jvm.JvmMetrics: Initializing JVM Metrics with
> processName=JobTracker, sessionId=   Illegal Capacity: -1
> 
> Going into the code i found out that this was due to numMaps having the
> Value of -1.
> 
> As per the code in org.apache.hadoop.util.HadoopArchives:
> archive(List srcPaths, String archiveName, Path dest)
> 
> the numMaps is initialized as
> int numMaps = (int)(totalSize/partSize);
> //run atleast one map.
> conf.setNumMapTasks(numMaps == 0? 1:numMaps);
> 
> partSize has been statically assigned the value of 2GB in the beginning
> of the class as,
> 
> static final long partSize = 2 * 1024 * 1024 * 1024
> 
> Strangely enough, the value i find assigned to partSize is  =  - 2147483648
> 
> Hence as a result in case of input directories of greater size, numMaps
> is assigned -1 which leads to the code throwing up error.
> 
> I am using hadoop-0.17.1 and I got the archiving facility after applying
> the patch hadoop-3307_4 patch.
> 
> This looks like a bug for me, so please let me know how to go about it.
> 
> Pratyush Banerjee
> 



Re: type mismatch from key to map

2008-07-21 Thread Khanh Nguyen
Hi Daniel,

The outputformat of my 1st hadoop job is TextOutputFormat. The
skeleton of my code follows:

public int run(String[] args) throws Exception {
//set up and run job 1
...
conf.setOutputFormat(TextOutputFormat.class);
FileOutputFormat.setOutputPath(conf, new Path(args[1]));


//set up and run job 2
...
  FileInputFormat.addInputPath(sortJob, new Path(args[1] +
"/part-0"));
  FileOutputFormat.setOutputPath(sortJob, new Path(args[1]
+ "/result/"));
  sortJob.setInputFormat(KeyValueTextInputFormat.class);
  sortJob.setMapperClass(InverseMapper.class)
..

}

Please help.

-k

On Mon, Jul 21, 2008 at 1:30 PM, Daniel Yu <[EMAIL PROTECTED]> wrote:
> hi k,
>  i think u should look at ur map output format setting, and check if that
> fits ur reduce input .
>
> Daniel
>
> 2008/7/21 Khanh Nguyen <[EMAIL PROTECTED]>:
>
>> Hello,
>>
>> I am getting this error
>>
>> java.io.IOException: Type mismatch in key from map: expected
>> org.apache.hadoop.io.LongWritable, recieved org.apache.hadoop.io.Text
>>
>>
>> Could someone please explain to me what i am doing wrong. Follow is
>> the code I think is responsible...
>>
>> public int run() {
>> .
>>
>> sortJob.setInputFormat(KeyValueTextInputFormat.class);
>>
>> sortJob.setMapperClass(InverseMapper.class);
>> 
>>
>> }
>>
>> Thanks
>>
>> -k
>>
>


Re: type mismatch from key to map

2008-07-21 Thread Daniel Yu
hi k,
  I think you should look at your map output format settings, and check that
they fit your reduce input.

Daniel

2008/7/21 Khanh Nguyen <[EMAIL PROTECTED]>:

> Hello,
>
> I am getting this error
>
> java.io.IOException: Type mismatch in key from map: expected
> org.apache.hadoop.io.LongWritable, recieved org.apache.hadoop.io.Text
>
>
> Could someone please explain to me what i am doing wrong. Follow is
> the code I think is responsible...
>
> public int run() {
> .
>
> sortJob.setInputFormat(KeyValueTextInputFormat.class);
>
> sortJob.setMapperClass(InverseMapper.class);
> 
>
> }
>
> Thanks
>
> -k
>


type mismatch from key to map

2008-07-21 Thread Khanh Nguyen
Hello,

I am getting this error

java.io.IOException: Type mismatch in key from map: expected
org.apache.hadoop.io.LongWritable, recieved org.apache.hadoop.io.Text


Could someone please explain to me what I am doing wrong? The following is
the code I think is responsible...

public int run() {
.

sortJob.setInputFormat(KeyValueTextInputFormat.class);

sortJob.setMapperClass(InverseMapper.class);


}

Thanks

-k


Reminder - User Group Meeting July 22nd

2008-07-21 Thread Ajay Anand
A reminder that the next user group meeting is scheduled for July 22nd
from 6 - 7:30 pm at Yahoo! Mission College, Building 1, Training Rooms 3
and 4. 

 

Agenda:

Cascading - Chris Wensel

Performance Benchmarking on Hadoop (Terabyte Sort, Gridmix) - Sameer
Paranjpye, Owen O'Malley, Runping Qi

 

Registration and directions: http://upcoming.yahoo.com/event/869166
 

 

Look forward to seeing you there!

Ajay



null objects in records.

2008-07-21 Thread Marc de Palol

Hi all,

I have a mapper's Value type which comes from a record like this one:

module org {

  class Something {
AnotherRecord aRecord;
int number1;
int number2;
  }
}


So I'm creating one of these Something objects and passing it to
output.collect(keyClass, Something). Sometimes I do not need the
AnotherRecord object, so I tried to pass a null object instead of an
empty AnotherRecord object, but at runtime I get a NullPointerException,
which comes from the Hadoop serializer trying to serialize the
AnotherRecord field, which is null.


Is there any way to achieve this without having to mess with Hadoop's
RccCompiler code? Or maybe this is a question for the dev mailing list?

I've been googling, but I really haven't been able to find an answer to
this, and I think it's quite strange that Hadoop's serialization does
not support null objects.
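One workaround, only a sketch and not a built-in null feature, is to make the
field's presence explicit in the record and always write a (possibly empty)
AnotherRecord; the hasRecord field name below is purely illustrative and is
set to true only when aRecord carries real data:

module org {

  class Something {
    boolean hasRecord;
    AnotherRecord aRecord;
    int number1;
    int number2;
  }
}

Readers would then check hasRecord before looking at aRecord, and writers
would fill aRecord with an empty AnotherRecord when there is nothing to send.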


thanks in advance,

--
Marc de Palol

Java Developer @ Last.fm
http://last.fm/user/grindthemall


Re: Timeouts when running balancer

2008-07-21 Thread David J. O'Dell
You are correct.
The default 1 MB/sec is too low.
1 GB/sec is too high.
I changed it to 10 MB/sec and it's humming along.
Thanks.
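For reference, the property is read by the datanodes and is specified in
bytes per second; a sketch of the hadoop-site.xml entry for the 10 MB/sec
value mentioned above (a datanode restart is assumed to be needed for it to
take effect):

<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>10485760</value>
  <description>Bytes per second each datanode may use for block balancing.</description>
</property>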


Taeho Kang wrote:
> By setting "dfs.balance.bandwidthPerSec" to 1GB/sec, each datanode is able
> to utilize up to 1GB/sec for block balancing. It seems to be too high as
> even a gigabit ethernet can't handle that much data per sec.
>
> When you get timeouts, it probably means your network is saturated. Maybe
> you were running a big map reduce job which required lots of data transfer
> among nodes by then?
>
> Try setting it to be 10~30MB/sec and see what happens.
>
> On Sat, Jul 19, 2008 at 1:56 AM, David J. O'Dell <[EMAIL PROTECTED]>
> wrote:
>
>   
>> I'm trying to re balance my cluster as I've added to more nodes.
>> When I run balancer with the default threshold I am seeing timeouts in
>> the logs:
>>
>> 2008-07-18 09:50:46,636 INFO org.apache.hadoop.dfs.Balancer: Decided to
>> move block -8432927406854991437 with a length of 128 MB bytes from
>> 10.11.6.234:50010 to 10.11.6.235:50010 using proxy source
>> 10.11.6.234:50010
>> 2008-07-18 09:50:46,636 INFO org.apache.hadoop.dfs.Balancer: Starting
>> Block mover for -8432927406854991437 from 10.11.6.234:50010 to
>> 10.11.6.235:50010
>> 2008-07-18 09:52:46,826 WARN org.apache.hadoop.dfs.Balancer: Timeout
>> moving block -8432927406854991437 from 10.11.6.234:50010 to
>> 10.11.6.235:50010 through 10.11.6.234:50010
>>
>> I read in the balancer guide->
>> http://issues.apache.org/jira/secure/attachment/12370966/BalancerUserGuide2
>> That the default transfer rate is 1mb/sec
>> I tried increasing this to 1gb/sec but I'm still seeing the timeouts.
>> All of the nodes have gigE nics and are on the same switch.
>>
>>
>> --
>> David O'Dell
>> Director, Operations
>> e: [EMAIL PROTECTED]
>> t:  (415) 738-5152
>> 180 Townsend St., Third Floor
>> San Francisco, CA 94107
>>
>>
>> 

-- 
David O'Dell
Director, Operations
e: [EMAIL PROTECTED]
t:  (415) 738-5152
180 Townsend St., Third Floor
San Francisco, CA 94107 



Re: Scandinavian user group?

2008-07-21 Thread Mads Toftum
On Mon, Jul 21, 2008 at 03:52:01PM +0200, tim robertson wrote:
> Is there a user base in Scandinavia that would be interested in meeting to
> exchange feedback / ideas ?
> (in English...)
> 
Yeah, I'd be interested although I barely qualify as a hadoop user yet.

> I can probably host a meeting in Copenhagen if there were interest.
> 
Cool. short commute for me then ;)

vh

Mads Toftum
-- 
http://soulfood.dk


Scandinavian user group?

2008-07-21 Thread tim robertson
Hi all,

I think these user groups are a great idea, but I can't get to any easily...

Is there a user base in Scandinavia that would be interested in meeting to
exchange feedback / ideas ?
(in English...)

I can probably host a meeting in Copenhagen if there were interest.

Cheers

Tim


Hadoop-Archive Error for size of input data >2GB

2008-07-21 Thread Pratyush Banerjee

Hi All,

I have been using hadoop archives programmatically to generate har
archives from some logfiles which are being dumped into HDFS.

When the input directory to the Hadoop archiving program contains more
than 2GB of data, the archiving strangely fails with an error message saying

INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=   Illegal Capacity: -1

Going into the code, I found that this was due to numMaps having the
value -1.


As per the code in org.apache.hadoop.util.HadoopArchives: 
archive(List srcPaths, String archiveName, Path dest)


the numMaps is initialized as 
int numMaps = (int)(totalSize/partSize);

//run atleast one map.
conf.setNumMapTasks(numMaps == 0? 1:numMaps);

partSize is statically assigned the value of 2GB at the beginning of the
class as:

static final long partSize = 2 * 1024 * 1024 * 1024

Strangely enough, the value I find assigned to partSize is -2147483648.

As a result, for input directories larger than 2GB, numMaps is assigned -1,
which leads to the code throwing the error above.


I am using hadoop-0.17.1, and I got the archiving facility after applying
the hadoop-3307_4 patch.

This looks like a bug to me, so please let me know how to proceed.

Pratyush Banerjee



RE: New York user group?

2008-07-21 Thread montag

I'd be up for a New York user group.




Alex Newman-3 wrote:
> 
> I am down as well.
> 
> 




building C++ API for windows, is it just bsd sockets that is incompatible with a native build?

2008-07-21 Thread Marc Vaillant
I see that Cygwin is the only supported option for building Hadoop Pipes
for Windows.  I'm trying a MinGW build, and it looks like the only
thing that needs porting is the communication layer, from BSD sockets to,
say, Winsock.  Is that correct?

Thanks,
Marc


Re: Can a MapReduce task only consist of a Map step?

2008-07-21 Thread Zhou, Yunqing
I've tried it and it works.
Thank you very much

On Mon, Jul 21, 2008 at 6:33 PM, Miles Osborne <[EMAIL PROTECTED]> wrote:

> then just do what i said --set the number of reducers to zero.  this should
> just run the mapper phase
>
> 2008/7/21 Zhou, Yunqing <[EMAIL PROTECTED]>:
>
> > since the whole data is 5TB.  the Identity reducer still cost a lot of
> > time.
> >
> > On Mon, Jul 21, 2008 at 5:09 PM, Christian Ulrik Søttrup <
> [EMAIL PROTECTED]>
> > wrote:
> >
> > > Hi,
> > >
> > > you can simply use the built in reducer that just copies the map
> output:
> > >
> > >
> conf.setReducerClass(org.apache.hadoop.mapred.lib.IdentityReducer.class);
> > >
> > > Cheers,
> > > Christian
> > >
> > >
> > > Zhou, Yunqing wrote:
> > >
> > >> I only use it to do something in parallel,but the reduce step will
> cost
> > me
> > >> additional several days, is it possible to make hadoop do not use a
> > reduce
> > >> step?
> > >>
> > >> Thanks
> > >>
> > >>
> > >>
> > >
> > >
> >
>
>
>
> --
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
>


Re: Can a MapReduce task only consist of a Map step?

2008-07-21 Thread Miles Osborne
Then just do what I said: set the number of reducers to zero. This should
just run the mapper phase.
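A minimal sketch of a map-only driver against the old JobConf API; the class
name, the IdentityMapper choice and the paths are placeholders, not code from
this thread:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MapOnlyJob {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(MapOnlyJob.class);
    conf.setJobName("map-only");
    // Zero reducers: map output is written straight to the output directory
    // and the sort/shuffle phase is skipped entirely.
    conf.setNumReduceTasks(0);
    conf.setMapperClass(IdentityMapper.class);   // swap in your real mapper
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}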

2008/7/21 Zhou, Yunqing <[EMAIL PROTECTED]>:

> since the whole data is 5TB.  the Identity reducer still cost a lot of
> time.
>
> On Mon, Jul 21, 2008 at 5:09 PM, Christian Ulrik Søttrup <[EMAIL PROTECTED]>
> wrote:
>
> > Hi,
> >
> > you can simply use the built in reducer that just copies the map output:
> >
> > conf.setReducerClass(org.apache.hadoop.mapred.lib.IdentityReducer.class);
> >
> > Cheers,
> > Christian
> >
> >
> > Zhou, Yunqing wrote:
> >
> >> I only use it to do something in parallel,but the reduce step will cost
> me
> >> additional several days, is it possible to make hadoop do not use a
> reduce
> >> step?
> >>
> >> Thanks
> >>
> >>
> >>
> >
> >
>



-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.


Re: Can a MapReduce task only consist of a Map step?

2008-07-21 Thread Zhou, Yunqing
Since the whole dataset is 5TB, the Identity reducer still costs a lot of time.

On Mon, Jul 21, 2008 at 5:09 PM, Christian Ulrik Søttrup <[EMAIL PROTECTED]>
wrote:

> Hi,
>
> you can simply use the built in reducer that just copies the map output:
>
> conf.setReducerClass(org.apache.hadoop.mapred.lib.IdentityReducer.class);
>
> Cheers,
> Christian
>
>
> Zhou, Yunqing wrote:
>
>> I only use it to do something in parallel,but the reduce step will cost me
>> additional several days, is it possible to make hadoop do not use a reduce
>> step?
>>
>> Thanks
>>
>>
>>
>
>


Re: Can a MapReduce task only consist of a Map step?

2008-07-21 Thread Gert Pfeifer

Did you try to use the IdentityReducer?

Zhou, Yunqing wrote:
> I only use it to do something in parallel,but the reduce step will cost me
> additional several days, is it possible to make hadoop do not use a reduce
> step?
> 
> Thanks
> 



Re: Can a MapReduce task only consist of a Map step?

2008-07-21 Thread Miles Osborne
... or better still, set the number of reducers to zero

Miles

2008/7/21 Christian Ulrik Søttrup <[EMAIL PROTECTED]>:

> Hi,
>
> you can simply use the built in reducer that just copies the map output:
>
> conf.setReducerClass(org.apache.hadoop.mapred.lib.IdentityReducer.class);
>
> Cheers,
> Christian
>
>
> Zhou, Yunqing wrote:
>
>> I only use it to do something in parallel,but the reduce step will cost me
>> additional several days, is it possible to make hadoop do not use a reduce
>> step?
>>
>> Thanks
>>
>>
>>
>
>


-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.


Re: Can a MapReduce task only consist of a Map step?

2008-07-21 Thread Christian Ulrik Søttrup

Hi,

you can simply use the built in reducer that just copies the map output:

conf.setReducerClass(org.apache.hadoop.mapred.lib.IdentityReducer.class);

Cheers,
Christian

Zhou, Yunqing wrote:

I only use it to do something in parallel,but the reduce step will cost me
additional several days, is it possible to make hadoop do not use a reduce
step?

Thanks

  




Can a MapReduce task only consist of a Map step?

2008-07-21 Thread Zhou, Yunqing
I only use it to do something in parallel, but the reduce step will cost me
several additional days. Is it possible to make Hadoop skip the reduce step
entirely?

Thanks


Re: Memory leak in DFS client

2008-07-21 Thread Gert Pfeifer

I found out that it is not a bug in my code. I can reproduce it with a plain:

bin $ ./hadoop fs -ls /seDNS/data/33
ls: timed out waiting for rpc response

It times out for this directory, but before it does so, the name node
takes 2GB more heap and never gives it back.

Any ideas?

Gert


Gert Pfeifer wrote:
> Hi,
> I am running some code dealing with file system operations (copying
> files and deleting). While it is runnung the web interface of the name
> node tells me that the heap size grows dramatically.
> 
> Are there any server-side data structures that I have to close
> explicitly, except FSData{IN|Out}putStreams ? Anything that takes heap
> in the name node...
> 
> I had something in mind like Statements in JDBC, but I just can't find
> anything.
> 
> Gert


Memory leak in DFS client

2008-07-21 Thread Gert Pfeifer

Hi,
I am running some code dealing with file system operations (copying
files and deleting). While it is running, the web interface of the name
node tells me that the heap size grows dramatically.

Are there any server-side data structures that I have to close
explicitly, except FSData{IN|Out}putStreams ? Anything that takes heap
in the name node...

I had something in mind like Statements in JDBC, but I just can't find
anything.

Gert