Re: Hadoop-Archive Error for size of input data >2GB

2008-07-21 Thread Pratyush Banerjee
Thanks Mahadev, thanks for letting me know of the patch. I have already applied it and the archiving seems to run fine for an input directory size of about 5GB. Currently I am testing the same programmatically, but since it is working from the command line, it should ideally also work this way.

Re: Volunteer recruitment for matrix library project on Hadoop.

2008-07-21 Thread Edward J. Yoon
Thank you all for your interest. BTW, please subscribe to the Hama developer mailing list instead of sending mail to [EMAIL PROTECTED] [EMAIL PROTECTED] - Edward On Thu, Jul 17, 2008 at 11:26 AM, Edward J. Yoon <[EMAIL PROTECTED]> wrote: > Hello all, > > The Hama team which is trying to port typical

Regarding reading data from distributed hadoop cluster

2008-07-21 Thread Ninad Raut
Hi, can anyone help me understand how to read data distributed over a cluster? For instance, if we give a path /user/hadoop/parsed_data/part-/data to the map reduce program, will that find the data on the same path on all the servers in the cluster, or will it be only the local file? If it on

Re: New York user group?

2008-07-21 Thread Matt Kangas
Count me as another interested party. --Matt On Fri, Jul 18, 2008 at 8:59 AM, Alex Dorman <[EMAIL PROTECTED]> wrote: > Please let me know if you would be interested in joining NY Hadoop user group > if one existed. > > I know about 5-6 people in New York City running Hadoop. I am sure there are

Re: more than one reducer?

2008-07-21 Thread Taeho Kang
I don't know if there is any built-in mechanism for what you're looking for. However, you could write a partitioner that distributes data in a way that lower keys go to a lower-numbered reducer, and higher keys go to a higher-numbered reducer. (e.g. A key starting with 'A~D' goes to part-, 'E~H' goes
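
A minimal sketch of such a range partitioner against the 0.17-era mapred API (untested; the four-letter buckets and class name are just an illustration):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class RangePartitioner implements Partitioner<Text, Writable> {
  public void configure(JobConf job) {}

  // Route keys by first letter so that reducer outputs are globally
  // ordered: part-00000 gets A-D, part-00001 gets E-H, and so on.
  public int getPartition(Text key, Writable value, int numReduceTasks) {
    String s = key.toString();
    if (s.length() == 0) return 0;
    int bucket = (Character.toUpperCase(s.charAt(0)) - 'A') / 4;
    return Math.min(Math.max(bucket, 0), numReduceTasks - 1);
  }
}

Registered in the driver with conf.setPartitionerClass(RangePartitioner.class), concatenating part-00000, part-00001, ... then yields one globally sorted result.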

hadoop-ec2 log access

2008-07-21 Thread Karl Anderson
I'm unable to access my logs with the JobTracker/TaskTracker web interface for a Hadoop job running on Amazon EC2. The URLs given for the task logs are of the form: http://domu-[...].compute-1.internal:50060/ The Hadoop-EC2 docs suggest that I should be able to get onto port 50060 for t
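
A commonly suggested workaround (a sketch; the keypair path and proxy port are illustrative): the domu-[...].compute-1.internal names only resolve inside EC2, so open a SOCKS proxy over SSH to the master node and point the browser at it:

ssh -i ~/.ec2/id_rsa-keypair -D 6666 root@<master-public-dns>

With the browser configured to use localhost:6666 as a SOCKS proxy, requests to the internal hostnames on ports 50030/50060 are resolved from inside the cloud.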

max number of files opened at the same time on Hdfs?

2008-07-21 Thread Eric Zhang
Hi, apologies if this question has been answered before, but I could not find it in the archive or twiki pages. I am wondering what the max number of files open for writes at the same time is, given an HDFS cluster? I am streaming data into many different files (on the order of thousands) at the

Problem of Hadoop's Partitioner

2008-07-21 Thread Gopal Gandhi
I am following the example in http://hadoop.apache.org/core/docs/current/streaming.html about Hadoop's partitioner: org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner. It seems that the values are sorted in dictionary order, e.g.: 1 12 15 2 28. What if I want to get a numerically sorted list:
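
One hedged possibility, assuming your release ships org.apache.hadoop.mapred.lib.KeyFieldBasedComparator (it appears in later streaming documentation, so verify it exists in your version; older releases spell -D as -jobconf, and the paths are illustrative):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
  -D mapred.text.key.comparator.options=-n \
  -input myInput -output myOutput \
  -mapper cat -reducer cat

The -n option requests numeric rather than dictionary comparison of the key, giving 1 2 12 15 28 instead of 1 12 15 2 28.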

Re: HQL usage

2008-07-21 Thread Edward J. Yoon
HQL will be integrated into the HRdfStore project. See http://groups.google.com/group/hrdfstore Thanks, Edward J. Yoon On 7/22/08, stack <[EMAIL PROTECTED]> wrote: > lucio Piccoli wrote: > > hi Tho Pham > > > > i have checked the HQL api but the only reference i found was the > org.apache.hadoop.hbase.

more than one reducer?

2008-07-21 Thread Mori Bellamy
Hey all, I was wondering if it's possible to split up the reduce task amongst more than one machine. I figured it might be possible for the map output to be copied to multiple machines; then each reducer could sort its keys and then combine them into one big sorted output (a la mergesort).

Re: question about Counters

2008-07-21 Thread Daniel Yu
That's great, thanks a lot! Daniel 2008/7/21 Christian Ulrik Søttrup <[EMAIL PROTECTED]>: > Hi, > > I use a counter in my reducer to check whether another iteration (of map > reduce cycle) is necessary. I have a similar declaration as yours. > Then in my main program I have: > > *** > client.setC

Re: question about Counters

2008-07-21 Thread Christian Ulrik Søttrup
Hi, I use a counter in my reducer to check whether another iteration (of map reduce cycle) is necessary. I have a similar declaration as yours. Then in my main program I have: *** client.setConf(conf); RunningJob rj = JobClient.runJob(conf); Counters cs = rj.getCounters(); long swaps=cs.getCou
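
For reference, a completed version of that pattern might look like this (a sketch against the 0.17-era API; MyCounter is the enum from Daniel's question below, and someThreshold is a placeholder):

import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.RunningJob;

// In the driver, after configuring the JobConf:
RunningJob rj = JobClient.runJob(conf);   // blocks until the job finishes
Counters cs = rj.getCounters();
// Read back the counter the tasks incremented via reporter.incrCounter(...).
long inputWords = cs.getCounter(MyCounter.INPUT_WORDS);
if (inputWords > someThreshold) {
  // launch another map/reduce iteration
}

Since runJob() blocks until completion, the counters are final by the time getCounters() is called.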

DFS, write sequence number and consistency

2008-07-21 Thread Kevin
Hi there, it looks like the current Hadoop DFS treats the DFSClient as the "primary node". See http://wiki.apache.org/hadoop/DFS_requirements. In the Google File System, write synchronization by multiple clients is controlled by the primary node, which decides the sequence of the mutations to a block and

question about Counters

2008-07-21 Thread Daniel Yu
Hi, I defined a counter of my own and updated it in the map method: protected static enum MyCounter { INPUT_WORDS }; ... public void map(...) { ... reporter.incrCounter(MyCounter.INPUT_WORDS, 1); } Can I fetch the counts later, like in the run() method after the job is finis

[Streaming] I figured out a way to do combining using mapper, would anybody check it?

2008-07-21 Thread Gopal Gandhi
I am using Hadoop Streaming. I figured out a way to do combining in the mapper; is it the same as using a separate combiner? For example: the input is a list of words, and I want to count the total occurrences of each word. The traditional mapper is: while () { chomp ($_); $word = $_; print ($
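
For comparison, the same idea in the Java API is the classic "in-mapper combining" pattern: aggregate in memory and emit once when the task ends. A sketch (untested, 0.17-era API). It is not quite equivalent to a separate combiner: a combiner may also run during the reduce-side merge, and the in-memory map here can grow as large as the number of distinct words in the split.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class InMapperCombiner extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private final Map<String, Long> counts = new HashMap<String, Long>();
  private OutputCollector<Text, LongWritable> out;

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    out = output;                          // keep a handle for close()
    String word = value.toString().trim();
    Long c = counts.get(word);
    counts.put(word, c == null ? 1L : c + 1L);
  }

  // Emit the aggregated counts once, after the last record of the split.
  public void close() throws IOException {
    if (out == null) return;
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      out.collect(new Text(e.getKey()), new LongWritable(e.getValue()));
    }
  }
}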

Re: type mismatch from key to map

2008-07-21 Thread Khanh Nguyen
Never mind, I figured out my problem. I had not configured the OutputFormat. On Mon, Jul 21, 2008 at 1:44 PM, Khanh Nguyen <[EMAIL PROTECTED]> wrote: > Hi Daniel, > > The outputformat of my 1st hadoop job is TextOutputFormat. The > skeleton of my code follows: > > public int run(String[] args) throws Ex

Re: Hadoop-Archive Error for size of input data >2GB

2008-07-21 Thread Mahadev Konar
Hi Pratyush, I think this bug was fixed in https://issues.apache.org/jira/browse/HADOOP-3545. Can you apply the patch and see if it works? Mahadev On 7/21/08 5:56 AM, "Pratyush Banerjee" <[EMAIL PROTECTED]> wrote: > Hi All, > > I have been using hadoop archives programmatically to generat

Re: type mismatch from key to map

2008-07-21 Thread Khanh Nguyen
Hi Daniel, The outputformat of my 1st hadoop job is TextOutputFormat. The skeleton of my code follows: public int run(String[] args) throws Exception { //set up and run job 1 ... conf.setOutputFormat(TextOutputFormat.class); FileOutputFormat.setOutputPath(conf, new

Re: type mismatch from key to map

2008-07-21 Thread Daniel Yu
Hi Khanh, I think you should look at your map output format setting and check if it fits your reduce input. Daniel 2008/7/21 Khanh Nguyen <[EMAIL PROTECTED]>: > Hello, > > I am getting this error > > java.io.IOException: Type mismatch in key from map: expected > org.apache.hadoop.io.LongWritable, re

type mismatch from key to map

2008-07-21 Thread Khanh Nguyen
Hello, I am getting this error: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.LongWritable, recieved org.apache.hadoop.io.Text Could someone please explain to me what I am doing wrong? Following is the code I think is responsible... public int run() { . sor
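
The usual cause: the map output key class defaults to the job's output key class (LongWritable if nothing is set), while the mapper actually emits Text. A hedged sketch of the driver-side fix (0.17-era API; the value class is a guess to adjust):

conf.setMapOutputKeyClass(Text.class);     // what the mapper really emits
conf.setMapOutputValueClass(Text.class);   // adjust to your actual value type
// setOutputKeyClass/setOutputValueClass then describe only the reducer output.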

Reminder - User Group Meeting July 22nd

2008-07-21 Thread Ajay Anand
A reminder that the next user group meeting is scheduled for July 22nd from 6 - 7:30 pm at Yahoo! Mission College, Building 1, Training Rooms 3 and 4. Agenda: Cascading - Chris Wensel Performance Benchmarking on Hadoop (Terabyte Sort, Gridmix) - Sameer Paranjpye, Owen O'Malley, Runping Qi

null objects in records.

2008-07-21 Thread Marc de Palol
Hi all, I have a mapper's Value type which comes from a record like this one: module org { class Something { AnotherRecord aRecord; int number1; int number2; } } So, I'm creating one of these Someth

Re: Timeouts when running balancer

2008-07-21 Thread David J. O'Dell
You are correct. The default 1 MB/sec is too low, and 1 GB/sec is too high. I changed it to 10 MB/sec and it's humming along. Thanks. Taeho Kang wrote: > By setting "dfs.balance.bandwidthPerSec" to 1GB/sec, each datanode is able > to utilize up to 1GB/sec for block balancing. It seems to be too high as >
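
For reference, the property takes bytes per second, so 10 MB/sec in hadoop-site.xml would look roughly like this (a sketch; datanodes need a restart to pick it up):

<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <!-- bytes per second: 10 * 1024 * 1024 = 10 MB/sec -->
  <value>10485760</value>
</property>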

Re: Scandinavian user group?

2008-07-21 Thread Mads Toftum
On Mon, Jul 21, 2008 at 03:52:01PM +0200, tim robertson wrote: > Is there a user base in Scandinavia that would be interested in meeting to > exchange feedback / ideas ? > (in English...) > Yeah, I'd be interested although I barely qualify as a hadoop user yet. > I can probably host a meeting in

Scandinavian user group?

2008-07-21 Thread tim robertson
Hi all, I think these user groups are a great idea, but I can't get to any easily... Is there a user base in Scandinavia that would be interested in meeting to exchange feedback / ideas ? (in English...) I can probably host a meeting in Copenhagen if there were interest. Cheers Tim

Hadoop-Archive Error for size of input data >2GB

2008-07-21 Thread Pratyush Banerjee
Hi All, I have been using Hadoop archives programmatically to generate har archives from some logfiles which are being dumped into HDFS. When the input directory to the Hadoop archiving program has files of size more than 2GB, strangely the archiving fails with an error message saying INF
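
For reference, the command-line form of the archive tool that the thread compares against (archive name and paths are illustrative):

bin/hadoop archive -archiveName logs.har /user/hadoop/logs /user/hadoop/archived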

RE: New York user group?

2008-07-21 Thread montag
I'd be up for a New York user group. Alex Newman-3 wrote: > > I am down as well.

building C++ API for windows, is it just bsd sockets that is incompatible with a native build?

2008-07-21 Thread Marc Vaillant
I see that Cygwin is the only supported option for building Hadoop Pipes for Windows. I'm trying a MinGW build, and it looks like the only thing needing porting is the communication layer, from BSD sockets to, say, Winsock? Is that correct? Thanks, Marc

Re: Can a MapReduce task only consist of a Map step?

2008-07-21 Thread Zhou, Yunqing
I've tried it and it works. Thank you very much. On Mon, Jul 21, 2008 at 6:33 PM, Miles Osborne <[EMAIL PROTECTED]> wrote: > Then just do what I said -- set the number of reducers to zero. This should > just run the mapper phase > > 2008/7/21 Zhou, Yunqing <[EMAIL PROTECTED]>: > > > since the whol

Re: Can a MapReduce task only consist of a Map step?

2008-07-21 Thread Miles Osborne
Then just do what I said -- set the number of reducers to zero. This should just run the mapper phase. 2008/7/21 Zhou, Yunqing <[EMAIL PROTECTED]>: > since the whole data is 5TB. the Identity reducer still cost a lot of > time. > > On Mon, Jul 21, 2008 at 5:09 PM, Christian Ulrik Søttrup <[EMAIL

Re: Can a MapReduce task only consist of a Map step?

2008-07-21 Thread Zhou, Yunqing
Since the whole data set is 5TB, the identity reducer still costs a lot of time. On Mon, Jul 21, 2008 at 5:09 PM, Christian Ulrik Søttrup <[EMAIL PROTECTED]> wrote: > Hi, > > you can simply use the built-in reducer that just copies the map output: > > conf.setReducerClass(org.apache.hadoop.mapred.lib

Re: Can a MapReduce task only consist of a Map step?

2008-07-21 Thread Gert Pfeifer
Did you try to use the IdentityReducer? Zhou, Yunqing wrote: > I only use it to do something in parallel, but the reduce step will cost me > several additional days. Is it possible to make Hadoop not use a reduce > step? > > Thanks

Re: Can a MapReduce task only consist of a Map step?

2008-07-21 Thread Miles Osborne
... or better still, set the number of reducers to zero. Miles 2008/7/21 Christian Ulrik Søttrup <[EMAIL PROTECTED]>: > Hi, > > you can simply use the built-in reducer that just copies the map output: > > conf.setReducerClass(org.apache.hadoop.mapred.lib.IdentityReducer.class); > > Cheers, > Chr
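
In the Java API that is a one-line change in the driver (a sketch); with zero reducers the map output is written directly to the output path and the sort/shuffle phase is skipped entirely:

conf.setNumReduceTasks(0);   // map-only job: mappers write straight to HDFS

The streaming equivalent would be setting mapred.reduce.tasks=0 via -jobconf.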

Re: Can a MapReduce task only consist of a Map step?

2008-07-21 Thread Christian Ulrik Søttrup
Hi, you can simply use the built-in reducer that just copies the map output: conf.setReducerClass(org.apache.hadoop.mapred.lib.IdentityReducer.class); Cheers, Christian Zhou, Yunqing wrote: I only use it to do something in parallel, but the reduce step will cost me several additional days, is

Can a MapReduce task only consist of a Map step?

2008-07-21 Thread Zhou, Yunqing
I only use it to do something in parallel, but the reduce step will cost me several additional days. Is it possible to make Hadoop not use a reduce step? Thanks

Re: Memory leak in DFS client

2008-07-21 Thread Gert Pfeifer
I found out that it is not a bug in my code. When I run bin/hadoop fs -ls /seDNS/data/33 I get: ls: timed out waiting for rpc response It times out for this directory, but before it does so, the name node takes 2GB more heap and never gives it back. A

Memory leak in DFS client

2008-07-21 Thread Gert Pfeifer
Hi, I am running some code dealing with file system operations (copying and deleting files). While it is running, the web interface of the name node tells me that the heap size grows dramatically. Are there any server-side data structures that I have t