Re: binary format for streaming

2009-03-03 Thread Yasuyuki Watanabe
Amareshwari, All right. So I will try the patch for branch 0.18 first. Thanks, Yasu At Wed, 04 Mar 2009 12:16:21 +0530, Amareshwari Sriramadasu wrote: > > [HADOOP-1722] Make streaming to handle non-utf8 byte array > http://issues.apache.org/jira/browse/HADOOP-1722 > is committed to branch 0.2

Re: binary format for streaming

2009-03-03 Thread Amareshwari Sriramadasu
[HADOOP-1722] Make streaming to handle non-utf8 byte array http://issues.apache.org/jira/browse/HADOOP-1722 is committed to branch 0.21 Yasuyuki Watanabe wrote: Hi, I would like to know the status of binary input/output format support for streaming. We found HADOOP-3227 and it was open. So we

Re: Jobs run slower and slower

2009-03-03 Thread Sean Laurent
It's quite possible that's the problem. I'll re-run the tests over night and collect the run times according to the JobTracker. If I want to test the patch in HADOOP-4780, should I pull down branch-0.19 and go from there? This is not a production environment, so I'm not worried about data loss or

binary format for streaming

2009-03-03 Thread Yasuyuki Watanabe
Hi, I would like to know the status of binary input/output format support for streaming. We found HADOOP-3227 and it was open. So we just posted some class files and patches we created. They will work with Hadoop 0.19.1. [HADOOP-3227] Implement a binary input/output format for Streaming http://i

Re: Regarding "Hadoop multi cluster" set-up

2009-03-03 Thread shefali pawar
We set-up a dedicated LAN consisting of the 2 computers using a switch. I think that made a difference and the 2 node cluster is working fine now. Also now we are working on Ubuntu and not Fedora. Thanks for all the help. Shefali On Thu, 12 Feb 2009 shefali pawar wrote : >I changed the value

Re: Reduce doesn't start until map finishes

2009-03-03 Thread Nick Cen
Thanks. About the "Secondary Sort", can you provide some example? What do the intermediate keys stand for? Assume I have two mappers, m1 and m2. The output of m1 is (k1,v1),(k2,v2) and the output of m2 is (k1,v3),(k2,v4). Assume k1 and k2 belong to the same partition and k1 < k2, so I think the

Re: Reduce doesn't start until map finishes

2009-03-03 Thread Chris Douglas
The output of each map is sorted by partition and by key within that partition. The reduce merges sorted map output assigned to its partition into the reduce. The following may be helpful: http://hadoop.apache.org/core/docs/current/mapred_tutorial.html If your job requires total order, consi
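The sort order Chris describes is what makes a "secondary sort" work: order by a composite of (natural key, secondary field), but group reduce calls by the natural key alone. A plain-Java sketch of the idea (the class and data here are illustrative, not the Hadoop API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch of secondary sort: sort map output by
// (natural key, secondary field); consecutive entries sharing a
// natural key then form one reduce group with pre-sorted values.
public class SecondarySortSketch {
    static List<String> sortedOrder() {
        // Pretend map output: (naturalKey, secondaryValue) pairs.
        List<String[]> mapOutput = new ArrayList<>(Arrays.asList(
            new String[]{"k2", "7"}, new String[]{"k1", "9"},
            new String[]{"k1", "3"}, new String[]{"k2", "1"}));

        // Sort comparator: natural key first, then the secondary field.
        mapOutput.sort(Comparator
            .comparing((String[] p) -> p[0])
            .thenComparingInt(p -> Integer.parseInt(p[1])));

        List<String> order = new ArrayList<>();
        for (String[] p : mapOutput) order.add(p[0] + ":" + p[1]);
        return order;
    }

    public static void main(String[] args) {
        System.out.println(sortedOrder()); // [k1:3, k1:9, k2:1, k2:7]
    }
}
```

In Hadoop itself the same split is expressed with a sort comparator on the composite key plus a grouping comparator that looks only at the natural key.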

Re: Jobs run slower and slower

2009-03-03 Thread Amar Kamat
Runping Qi wrote: Could it be the case that the latter jobs ran slower because the tasks took longer time to get initialized? If so, you may hit https://issues.apache.org/jira/browse/HADOOP-4780 Runping On Tue, Mar 3, 2009 at 2:02 PM, Sean Laurent wrote: Hrmmm. According to hadoop-defaults

Re: Reduce doesn't start until map finishes

2009-03-03 Thread Nick Cen
Can you provide more info about sorting? Does the sort happen on the whole data set, or just on the specified partition? 2009/3/4 Mikhail Yakshin > On Wed, Mar 4, 2009 at 2:09 AM, Chris Douglas wrote: > > This is normal behavior. The Reducer is guaranteed to receive all the > > results for its part

Re: Best way to write multiple files from a MR job?

2009-03-03 Thread Stuart White
On Tue, Mar 3, 2009 at 9:16 PM, Nick Cen wrote: > have you tried MultipleOutputFormat and its subclasses? Nope (didn't know it existed). I'll take a look at it. Both of these suggestions sound great. Thanks for the tips!

Re: Best way to write multiple files from a MR job?

2009-03-03 Thread Nick Cen
Have you tried MultipleOutputFormat and its subclasses? 2009/3/4 Stuart White > I have a large amount of data, from which I'd like to extract multiple > different types of data, writing each type of data to different sets > of output files. What's the best way to accomplish this? (I should

RE: Best way to write multiple files from a MR job?

2009-03-03 Thread Saranath Raghavan
This should help. String jobId = jobConf.get("mapred.job.id"); String taskId = jobConf.get("mapred.task.partition"); String filename = "file_" + jobId + "_" + taskId; - Saranath -Original Message- From: Stuart White [mailto:stuart.whi...@gmail.com] Sent: Tuesday, March 03, 2009 6:50 PM
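A runnable sketch of the same construction, with a plain Map standing in for JobConf (the job id and partition values below are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Builds a per-task output filename from the job id and the task
// partition, as in the snippet above. A HashMap stands in for JobConf;
// in a real job the framework supplies these values.
public class TaskFilenameSketch {
    static String outputFilename(Map<String, String> conf) {
        String jobId = conf.get("mapred.job.id");
        String taskId = conf.get("mapred.task.partition");
        return "file_" + jobId + "_" + taskId;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("mapred.job.id", "job_200903030001_0042");  // example value
        conf.put("mapred.task.partition", "3");              // example value
        // Unique per task, so concurrent tasks never clobber each other.
        System.out.println(outputFilename(conf)); // file_job_200903030001_0042_3
    }
}
```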

Best way to write multiple files from a MR job?

2009-03-03 Thread Stuart White
I have a large amount of data, from which I'd like to extract multiple different types of data, writing each type of data to different sets of output files. What's the best way to accomplish this? (I should mention, I'm only using a mapper. I have no need for sorting or reduction.) Of course, i
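One way to structure a map-only job like this is to keep one writer per record type and route each record as it is classified. A minimal sketch of that routing logic (StringWriter is a stand-in for whatever output the job really opens; the MultipleOutputFormat mentioned elsewhere in this thread does the per-key routing inside Hadoop itself):

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.HashMap;
import java.util.Map;

// Routes each record to a per-type output, opening outputs lazily.
// StringWriter is a stand-in; a real job would open one file (or one
// named output) per record type instead.
public class TypeRoutingSketch {
    final Map<String, Writer> writers = new HashMap<>();

    Writer writerFor(String type) {
        // Lazily create one output the first time a type is seen.
        return writers.computeIfAbsent(type, t -> new StringWriter());
    }

    void write(String type, String record) {
        try {
            writerFor(type).write(record + "\n");
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        TypeRoutingSketch sketch = new TypeRoutingSketch();
        sketch.write("clicks", "record-a");
        sketch.write("views", "record-b");
        sketch.write("clicks", "record-c");
        // Two record types seen, so two outputs were opened.
        System.out.println(sketch.writers.size()); // 2
    }
}
```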

Re: Mappers become less utilized as time goes on?

2009-03-03 Thread Nathan Marz
Nope... and there were no failed tasks. On Mar 3, 2009, at 5:16 PM, Runping Qi wrote: Were TaskTrackers blacklisted? On Tue, Mar 3, 2009 at 3:25 PM, Nathan Marz wrote: I'm seeing some really bizarre behavior from Hadoop 0.19.1. I have a fairly large job with about 29000 map tasks an

Re: Mappers become less utilized as time goes on?

2009-03-03 Thread Runping Qi
Were TaskTrackers blacklisted? On Tue, Mar 3, 2009 at 3:25 PM, Nathan Marz wrote: > I'm seeing some really bizarre behavior from Hadoop 0.19.1. I have a fairly > large job with about 29000 map tasks and 72 reducers. There are 304 map task > slots in the cluster. When the job starts, it runs 3

Re: Jobs run slower and slower

2009-03-03 Thread Runping Qi
Could it be the case that the latter jobs ran slower because the tasks took longer time to get initialized? If so, you may hit https://issues.apache.org/jira/browse/HADOOP-4780 Runping On Tue, Mar 3, 2009 at 2:02 PM, Sean Laurent wrote: > Hrmmm. According to hadoop-defaults.xml, > mapred.jobtrac

Re: Reduce doesn't start until map finishes

2009-03-03 Thread Mikhail Yakshin
On Wed, Mar 4, 2009 at 2:09 AM, Chris Douglas wrote: > This is normal behavior. The Reducer is guaranteed to receive all the > results for its partition in sorted order. No reduce can start until all the > maps are completed, since any running map could emit a result that would > violate the order

Mappers become less utilized as time goes on?

2009-03-03 Thread Nathan Marz
I'm seeing some really bizarre behavior from Hadoop 0.19.1. I have a fairly large job with about 29000 map tasks and 72 reducers. There are 304 map task slots in the cluster. When the job starts, it runs 304 map tasks at a time. As time goes on the number of map tasks run concurrently drops

Re: Reduce doesn't start until map finishes

2009-03-03 Thread Chris Douglas
This is normal behavior. The Reducer is guaranteed to receive all the results for its partition in sorted order. No reduce can start until all the maps are completed, since any running map could emit a result that would violate the order for the results it currently has. -C On Mar 1, 2009,

Running 0.19.2 branch in production before release

2009-03-03 Thread Nathan Marz
I would like to get the community's opinion on this. Do you think it's safe to run the unreleased 0.19.2 branch in production? Or do you recommend sticking with 0.19.1 for production use? There are some bug fixes in 0.19.2 which we would like to take advantage of although they are not block

Re: Jobs run slower and slower

2009-03-03 Thread Sean Laurent
Hrmmm. According to hadoop-defaults.xml, mapred.jobtracker.completeuserjobs.maximum defaults to 100. So I tried setting it to 1, but that had no effect. I still see each successive run taking longer than the previous run. 1) Restart M/R 2) Run #1: 142.12 (secs) 3) Run #2: 181.96 (secs) 4) Run #3 2

Re: Issues installing FUSE_DFS

2009-03-03 Thread Brian Bockelman
I've never heard of such a thing, but I would be (pleasantly) surprised if that worked. What kind of issues are you having? I would run things in debug mode, see what issues come up in the terminal, and whack at it until (a) it works or (b) you find Samba requires something simply not supp

Re: Jobs run slower and slower

2009-03-03 Thread Runping Qi
The jobtracker's memory increased as you ran more and more jobs because the job tracker still kept some data about those completed jobs. The maximum number of completed jobs kept is determined by the config variable mapred.jobtracker.completeuserjobs.maximum. You can lower that to lower the job tra
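For reference, the property Runping names is set in the JobTracker's configuration; a hadoop-site.xml fragment might look like this (the value 20 is only an example; as noted elsewhere in the thread, the default in this era was 100):

```xml
<!-- hadoop-site.xml on the JobTracker: cap how many completed jobs
     per user are kept in memory. The value here is illustrative. -->
<property>
  <name>mapred.jobtracker.completeuserjobs.maximum</name>
  <value>20</value>
</property>
```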

RE: Issues installing FUSE_DFS

2009-03-03 Thread Patterson, Josh
Brian, Do you know of anyone using Samba to access the FUSE-DFS mount point via windows? We have FUSE-DFS working, but read/write doesn't work via Samba. Josh Patterson -Original Message- From: Brian Bockelman [mailto:bbock...@cse.unl.edu] Sent: Tuesday, March 03, 2009 11:26 AM To: core

Re: Announcing CloudBase-1.2.1 release

2009-03-03 Thread tim robertson
Thanks Taran for the detailed explanation - makes perfect sense. 2009/3/3 Tarandeep Singh : > Hi Tim, > > I am currently writing developer's guide for CloudBase. It will explain the > CloudBase design in detail as well as the algorithms used to do Joins, > Indexing etc. > > However, for indexin

Re: [ANNOUNCE] Hadoop release 0.19.1 available

2009-03-03 Thread Aviad sela
Steve, I am sorry but it is not clear to me what to do. When I build the project in Eclipse (IBM Rational Application Developer 7.5) I use the procedure documented in the wiki. Did you mean that I need to comment out the jsp-compile sections? If I do this, what should be copied to the cluste

Read from a buffer/stream..

2009-03-03 Thread Mithila Nagendra
Hey guys I am currently working on a project where I need the input to be read by the map/reduce word count program as and when it is generated - I don't want the input to be stored in a text file. Is there a way hadoop can read from a stream? Its similar to the producer-consumer problem - word cou

Re: Announcing CloudBase-1.2.1 release

2009-03-03 Thread Tarandeep Singh
Hi Tim, I am currently writing developer's guide for CloudBase. It will explain the CloudBase design in detail as well as the algorithms used to do Joins, Indexing etc. However, for indexing here is a brief introduction which you can also find on the CloudBase website- http://cloudbase.sourceforg

Time Series Analysis using CloudBase

2009-03-03 Thread Tarandeep Singh
Hi, [ CloudBase is a data warehouse system built on top of Hadoop's Map-Reduce architecture. It uses ANSI SQL as its query language and comes with a JDBC driver. It is developed by Business.com and is released to open source community under GNU GPL license. One c

Re: Announcing CloudBase-1.2.1 release

2009-03-03 Thread tim robertson
Hi Taran, Have you a blog or something that explains how you process joins on cloudbase? E.g. how are indexes used, and how do you go through the joining using the data files and index files. Do you look at all possible indexes, determine the cardinality of each and from this pick a join order, o

Re: Announcing CloudBase-1.2.1 release

2009-03-03 Thread Tarandeep Singh
Tim is right. CloudBase is not equivalent to HBase. HBase is a column-oriented database based on Google's BigTable. CloudBase is a database/data warehouse layer on top of Hadoop and by means of its SQL interface makes it easier to mine logs. So instead of writing Map-Reduce jobs for analyzing data,

Re: Jobs run slower and slower

2009-03-03 Thread Sean Laurent
Interesting... from reading HADOOP-4766, I'm not entirely clear if that problem is related to the number of jobs or the number of tasks. - I'm only running a single job with approximately 900 map tasks as opposed to the 500-600+ jobs and 100K tasks described in HADOOP-4766. - I am seeing increase

Re: Announcing CloudBase-1.2.1 release

2009-03-03 Thread tim robertson
Hi Praveen, I think it is more equivalent to Hive than HBase - both offer joins and structured querying, whereas HBase is more of a column-oriented data store with many-to-ones embedded in a single row and (currently) only indexes on the primary key, but secondary keys are coming. I anticipate using HB

RE: Problem running jobs: Wrong FS exception

2009-03-03 Thread Bill Habermaas
Look at: https://issues.apache.org/jira/browse/HADOOP-5191 I had the same problem. Bill -Original Message- From: Andrich van Wyk [mailto:avan...@cs.up.ac.za] Sent: Tuesday, March 03, 2009 11:27 AM To: core-user@hadoop.apache.org Subject: Problem running jobs: Wrong FS exception Hallo

RE: Announcing CloudBase-1.2.1 release

2009-03-03 Thread Guttikonda, Praveen
Hi, Will this be competing in a sense with HBase then? Cheers, Praveen -Original Message- From: Tarandeep Singh [mailto:tarand...@gmail.com] Sent: Tuesday, March 03, 2009 10:12 PM To: core-user@hadoop.apache.org Subject: Re: Announcing CloudBase-1.2.1 release Hi Lukas, Yes, you are ri

Problem running jobs: Wrong FS exception

2009-03-03 Thread Andrich van Wyk
Hallo, I am using Hadoop version 0.19. I set up a hadoop cluster with 10 nodes. I set up keyless ssh between the master node and all the slave nodes, and modified /etc/hosts on all nodes so hostname lookup works. The master file (on the 10.160.0.52 node) contains a single IP: 10.160
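The "Wrong FS" exception usually comes from a mismatch between the authority in fs.default.name and the authority in the paths a job uses (hostname vs. raw IP). One common fix is to pick a single form and use it consistently on every node; an illustrative hadoop-site.xml fragment (the hostname and port are examples, not from this thread):

```xml
<!-- hadoop-site.xml, identical on all nodes: use one consistent
     authority (a hostname here; never mix hostname and IP forms). -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master-host:9000</value>
</property>
```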

Re: [ANNOUNCE] Hadoop release 0.19.1 available

2009-03-03 Thread Steve Loughran
Aviad sela wrote: Steve Hi, I have run ant -verbose and found that org.apache.log4j.Category is missing. The release 0.19.1 comes with the jar file log4j-1.2.15.jar. Indeed this jar does not include this class. I have looked into another jar version, log4j-1.2.13.jar, which INCLUDES this class. S

Re: Announcing CloudBase-1.2.1 release

2009-03-03 Thread Tarandeep Singh
Hi Lukas, Yes, you are right. As of now, CloudBase does not support unique keys and foreign keys on tables. CloudBase is designed as a database abstraction layer on top of Hadoop, thus making it easier to query/mine logs/huge data easily. -Taran On Tue, Mar 3, 2009 at 1:15 AM, Lukáš Vlček wrot

Re: Issues installing FUSE_DFS

2009-03-03 Thread Brian Bockelman
On Mar 3, 2009, at 10:01 AM, Patterson, Josh wrote: Hey Brian, I'm working with Matthew on our hdfs install, and he's doing the server admin on this project; We just tried the settings you suggested, and we got the following error: [r...@socdvmhdfs1 ~]# fuse_dfs -oserver=socdvmhdfs1 -o

RE: Issues installing FUSE_DFS

2009-03-03 Thread Patterson, Josh
Hey Brian, I'm working with Matthew on our hdfs install, and he's doing the server admin on this project; We just tried the settings you suggested, and we got the following error: [r...@socdvmhdfs1 ~]# fuse_dfs -oserver=socdvmhdfs1 -oport=9000 /hdfs -oallow_other -ordbuffer=131072 fuse-dfs did

Re: contrib EC2 with hadoop 0.17

2009-03-03 Thread falcon164
I am new to hadoop. I want to run hadoop on eucalyptus. Please let me know how to do this.

Re: [ANNOUNCE] Hadoop release 0.19.1 available

2009-03-03 Thread Aviad sela
Steve Hi, I have run ant -verbose and found that org.apache.log4j.Category is missing. The release 0.19.1 comes with the jar file log4j-1.2.15.jar. Indeed this jar does not include this class. I have looked into another jar version, log4j-1.2.13.jar, which INCLUDES this class. So I guess that one s

Re: [ANNOUNCE] Hadoop release 0.19.1 available

2009-03-03 Thread Steve Loughran
Aviad sela wrote: Nigel Thanks, I have extracted the new project. However, I am having problems building the project. I am using Eclipse 3.4 and ant 1.7. I receive an error compiling core classes * compile-core-classes*: BUILD FAILED * D:\Work\AviadWork\workspace\cur\WSAD\Hadoop_Core_19_1\Had

hadoop balancer failed

2009-03-03 Thread 王红宝
-bash-3.00$ hadoop balancer Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved 09/03/03 14:46:19 INFO net.NetworkTopology: Adding a new node: /default-rack/10.26.6.249:50010 09/03/03 14:46:19 INFO net.NetworkTopology: Adding a new node: /default-rack/10.

Re: Announcing CloudBase-1.2.1 release

2009-03-03 Thread Lukáš Vlček
Hi Taran, This looks impressive. I quickly looked at the documentation, am I right that it does not support unique keys and foreign keys for tables? Regards, Lukas On Mon, Mar 2, 2009 at 8:33 PM, Tarandeep Singh wrote: > Hi, > > We have just released 1.2.1 version of CloudBase on sourceforge- >

Re: Splittable lzo files

2009-03-03 Thread tim robertson
Thanks for posting this Johan, I tried unsuccessfully to handle GZip files for the reasons you state and resorted to uncompressed. I will try the Lzo format and post the performance difference of compressed vs uncompressed on EC2 which seems to have very slow disk IO. We have seen really bad imp

Re: Splittable lzo files

2009-03-03 Thread Johan Oskarsson
We use it with python (dumbo) and streaming, so it should certainly be possible. I haven't tried it myself though, so can't give any pointers. /Johan Miles Osborne wrote: that's very interesting. for us poor souls using streaming, would we be able to use it? (right now i'm looking at a 100+

Re: Splittable lzo files

2009-03-03 Thread Miles Osborne
that's very interesting. for us poor souls using streaming, would we be able to use it? (right now i'm looking at a 100+ GB gzipped file ...) Miles 2009/3/3 Johan Oskarsson : > Hi, > > thought I'd pass on this blog post I just wrote about how we compress our > raw log data in Hadoop using Lzo a

Splittable lzo files

2009-03-03 Thread Johan Oskarsson
Hi, thought I'd pass on this blog post I just wrote about how we compress our raw log data in Hadoop using Lzo at Last.fm. The essence of the post is that we're able to make them splittable by indexing where each compressed chunk starts in the file, similar to the gzip input format being wor
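The indexing idea in the post can be sketched in a few lines of plain Java, with DEFLATE standing in for LZO so the example is self-contained: compress each chunk independently, remember the byte offset where each chunk starts, then decompress any chunk by seeking straight to its recorded offset.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch of "splittable by index": each chunk is compressed on its own,
// and an index records where each chunk starts, so a split can begin at
// any chunk without reading the ones before it. DEFLATE stands in for
// LZO to keep the sketch dependency-free.
public class ChunkIndexSketch {
    static byte[] compress(String[] chunks, List<Integer> index) {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        for (String chunk : chunks) {
            index.add(file.size());            // record chunk start offset
            Deflater d = new Deflater();
            d.setInput(chunk.getBytes());
            d.finish();
            while (!d.finished()) file.write(buf, 0, d.deflate(buf));
            d.end();
        }
        return file.toByteArray();
    }

    static String readChunkAt(byte[] file, int offset) {
        try {
            Inflater inf = new Inflater();
            inf.setInput(file, offset, file.length - offset);
            byte[] out = new byte[4096];
            int n = inf.inflate(out);          // stops at this chunk's end
            inf.end();
            return new String(out, 0, n);
        } catch (DataFormatException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        List<Integer> index = new ArrayList<>();
        byte[] file = compress(new String[]{"first chunk", "second chunk"}, index);
        // Jump straight to the second chunk, never touching the first.
        System.out.println(readChunkAt(file, index.get(1))); // second chunk
    }
}
```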