heterogeneous cluster

2008-07-14 Thread Sandhya E
Hi, We currently have a cluster of 3 nodes, all of the same configuration [4 CPUs, 8GB RAM, 2TB]. The NameNode web UI shows that DFS used % is 30% [nearly 700GB used per node]. CPU utilization remains continuously high on all nodes in the cluster. So, to fit the needs of the growing number of jobs tha

Re: Hadoop and lucene integration

2008-07-14 Thread bhupendar
Thanks for the reply. Nice to get more information about Hadoop, and I have also started looking at Nutch. The only thing I am not able to work out yet is whether we can integrate Hadoop with Lucene or not, and whether there is any tutorial that will help me do this? Thanks a lot again for your response and help

Re: multiple Output Collectors ?

2008-07-14 Thread Alejandro Abdelnur
Check MultipleOutputFormat and MultipleOutputs (the latter was committed to trunk last week). On Mon, Jul 14, 2008 at 11:49 PM, Khanh Nguyen <[EMAIL PROTECTED]> wrote: > Hello, > > Is it possible to have more than one output collector for one map? > > My input are records of html pages. I am ma
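
A minimal sketch of the MultipleOutputs pattern Alejandro points to (the org.apache.hadoop.mapred.lib.MultipleOutputs API as committed to trunk; the named outputs "links" and "content" and the mapper fields are illustrative assumptions, not from this thread):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;
  import org.apache.hadoop.mapred.lib.MultipleOutputs;

  // Driver: declare two named outputs alongside the job's default output.
  JobConf conf = new JobConf(MyJob.class);
  MultipleOutputs.addNamedOutput(conf, "links", TextOutputFormat.class, Text.class, Text.class);
  MultipleOutputs.addNamedOutput(conf, "content", TextOutputFormat.class, Text.class, Text.class);

  // Mapper: route each record to whichever named output it belongs in.
  public class PageMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private MultipleOutputs mos;
    public void configure(JobConf job) { mos = new MultipleOutputs(job); }
    public void map(LongWritable offset, Text page,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      Text url = new Text("http://example.com");             // hypothetical parsed field
      mos.getCollector("links", reporter).collect(url, new Text("outlink"));
      mos.getCollector("content", reporter).collect(url, page);
    }
    public void close() throws IOException { mos.close(); }  // flush the named outputs
  }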

different dfs block size

2008-07-14 Thread Rong-en Fan
Hi, I'm wondering how dfs.block.size affects NameNode memory consumption for a fixed set of data. I know it is determined by the # of blocks and # of replications, but how much memory does one block use in the NameNode? In addition, what would be the pros/cons of bigger/smaller block size
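
A rough back-of-the-envelope illustration (assuming the commonly cited figure of roughly 150 bytes of NameNode heap per block object; these numbers are not from this thread):

  1 TB of data /  64 MB blocks = 16,384 blocks x ~150 bytes = ~2.5 MB of NameNode heap
  1 TB of data / 128 MB blocks =  8,192 blocks x ~150 bytes = ~1.2 MB of NameNode heap

Replication adds per-replica bookkeeping to each block entry rather than multiplying the block count. Beyond memory, larger blocks mean fewer, longer map tasks; smaller blocks give more parallelism and finer-grained scheduling at the cost of more NameNode state.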

Re: Writable readFields and write functions

2008-07-14 Thread Chris Douglas
-- Presently, my RecordReader converts XML strings from a file to MyWritable objects -- When readFields is called, RecordReader should provide the next MyWritable object, if there is one -- When write is called, MyWriter should write the objects out Not quite. Your RecordReader may produce My

Re: Is this supported that Combiner emits keys other than its input key set?

2008-07-14 Thread Chris Douglas
Yes; a combiner that emits a key that should go to a different partition is incorrect. If this were legal, then the combiner output would also need to be buffered, sorted, spilled, etc., effectively requiring another map phase. The combiner's purpose is to decrease the volume of data that n
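
For a concrete illustration (the standard WordCount setup, not code from this thread): a well-behaved combiner such as WordCount's re-emits exactly the keys it receives and only pre-aggregates their values, e.g. (word, [1,1,1]) becomes (word, 3), so every record stays in the partition the Partitioner originally assigned it to:

  conf.setCombinerClass(WordCountReducer.class); // same class as the reducer: same key space in and out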

Setting inputs in configure()

2008-07-14 Thread schnitzi
I have some mapping jobs that are chained together, and would like to set the inputs for them in an overridden configure(JobConf) method. When I try to do it this way, though, I get an error like this: aggregatorJob failed: java.io.IOException: No input paths specified in input at org.ap
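
A sketch of the usual fix (an assumption consistent with that error: input paths are read at job-submission time to compute splits, so a task-side configure() runs too late to set them; class and path names here are hypothetical):

  JobConf aggregatorJob = new JobConf(getConf(), MyAggregator.class);
  // Set inputs in the driver, before submission, rather than in configure():
  FileInputFormat.setInputPaths(aggregatorJob, new Path("step1/out"));
  FileOutputFormat.setOutputPath(aggregatorJob, new Path("step2/out"));
  // setMapperClass/setReducerClass etc. omitted
  JobClient.runJob(aggregatorJob);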

Re: Writable readFields and write functions

2008-07-14 Thread Kylie McCormick
Hello Chris: Thanks for the prompt reply! So, to conclude from your note: -- Presently, my RecordReader converts XML strings from a file to MyWritable objects -- When readFields is called, RecordReader should provide the next MyWritable object, if there is one -- When write is called, MyWriter shou

Is this supported that Combiner emits keys other than its input key set?

2008-07-14 Thread Keliang Zhao
Hi there, I read the code a bit, though I am not sure I got it right. It appears to me that when the mapper's memory buffer is full, it spills and gets sorted by partition id and by key. Then, if there is a combiner defined, it will work on each partition. However, it seems that the outputs of a

Re: Writable readFields and write functions

2008-07-14 Thread Chris Douglas
It's easiest to consider write as a function that converts your record to bytes and readFields as a function restoring your record from bytes. So it should be the case that: MyWritable i = new MyWritable(); i.initWithData(some_data); i.write(byte_stream); ... MyWritable j = new MyWritable();
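
A minimal sketch of that symmetry (the class and field names are hypothetical, not Kylie's actual ServiceWritable):

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;

  public class MyWritable implements Writable {
    private Text serviceName = new Text(); // hypothetical fields
    private long timestamp;

    public void write(DataOutput out) throws IOException {
      serviceName.write(out);              // record -> bytes
      out.writeLong(timestamp);
    }

    public void readFields(DataInput in) throws IOException {
      serviceName.readFields(in);          // bytes -> record, same field order
      timestamp = in.readLong();
    }
  }

The two methods must agree exactly on field order and encoding; Hadoop may also reuse a single instance across records, so readFields should fully overwrite any previous state.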

Re: When does reducer read mapper's intermediate result?

2008-07-14 Thread Chris Douglas
Not quite; the intermediate output is written to the local disk on the node executing MapTask and fetched over HTTP by the ReduceTask. The ReduceTask need only wait for the MapTask to complete successfully before fetching its output, but it cannot start before all MapTasks have finished. Th

Re: When does reducer read mapper's intermediate result?

2008-07-14 Thread Mori Bellamy
I'm pretty sure that the reducer waits for all of the map tasks' output to be written to HDFS (or else I see no use for the Combiner class). I'm not sure about your second question though. My gut tells me "no" On Jul 14, 2008, at 3:50 PM, Kevin wrote: Hi, there, I am interested in the

Re: How to chain multiple hadoop jobs?

2008-07-14 Thread Mori Bellamy
Weird. I use Eclipse, but that's never happened to me. When you set up your JobConfs, for example: JobConf conf2 = new JobConf(getConf(), MyClass.class) -- is your "MyClass" in the same package as your driver program? Also, do you run from Eclipse or from the command line? (I've never tried to l

When does reducer read mapper's intermediate result?

2008-07-14 Thread Kevin
Hi there, I am interested in the implementation details of Hadoop mapred. In particular, does the reducer wait till a map task ends and then fetch the output (key-value pairs)? If so, is the very file produced by a mapper for the reducer sorted before the reducer gets it? (which means that the reduce

Writable readFields and write functions

2008-07-14 Thread Kylie McCormick
Hi There! I'm currently working on code for my own Writable object (called ServiceWritable) and I've been working off LongWritable for this one. I was wondering, however, about the following two functions: public void readFields(java.io.DataInput in) and public void write(java.io.DataOutput out)

Re: How does org.apache.hadoop.mapred.join work?

2008-07-14 Thread Kevin
Thank you, Chris. This solves my questions. -Kevin On Mon, Jul 14, 2008 at 11:17 AM, Chris Douglas <[EMAIL PROTECTED]> wrote: > "Yielding equal partitions" means that each input source will offer n > partitions and for any given partition 0 <= i < n, the records in that > partition are 1) sorted

Re: How to chain multiple hadoop jobs?

2008-07-14 Thread Joman Chu
Hi, I don't have the code sitting in front of me at the moment, but I'll do some of it from memory and I'll post a real snippet tomorrow night. Hopefully, this can get you started public class MyMainClass { public static void main(String[] args) { ToolRunner.run(new Configu
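
Until the real snippet arrives, a sketch of that pattern (class names, paths, and job settings are assumptions):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.conf.Configured;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.*;
  import org.apache.hadoop.util.Tool;
  import org.apache.hadoop.util.ToolRunner;

  public class MyMainClass extends Configured implements Tool {
    public int run(String[] args) throws Exception {
      JobConf job1 = new JobConf(getConf(), MyMainClass.class);
      FileInputFormat.setInputPaths(job1, new Path(args[0]));
      FileOutputFormat.setOutputPath(job1, new Path("intermediate")); // hypothetical path
      // setMapperClass/setReducerClass etc. omitted
      JobClient.runJob(job1);                 // blocks until job1 completes

      JobConf job2 = new JobConf(getConf(), MyMainClass.class);
      FileInputFormat.setInputPaths(job2, new Path("intermediate"));  // consume job1's output
      FileOutputFormat.setOutputPath(job2, new Path(args[1]));
      JobClient.runJob(job2);
      return 0;
    }

    public static void main(String[] args) throws Exception {
      System.exit(ToolRunner.run(new Configuration(), new MyMainClass(), args));
    }
  }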

Re: multiple Output Collectors ?

2008-07-14 Thread Joman Chu
One cheap hack that comes to mind is to extend the GenericWritable and ArrayWritable classes and write a second and third MapReduce job that will both parse over your first job's output, and each will select for the Key-Value pair it wants. Joman Chu AIM: ARcanUSNUMquam IRC: irc.liquid-silver.net

FileSplit hosts

2008-07-14 Thread Nathan Marz
What's the behavior of giving FileSplit "null" for the hosts field in the constructor? Will the framework figure out on its own which hosts the data is on? Thanks, Nathan Marz

Re: How to chain multiple hadoop jobs?

2008-07-14 Thread Sean Arietta
Well, that's what I need to do also... but Hadoop complains when I attempt to do that. Are you using Eclipse by any chance to develop? The error I'm getting seems to stem from the fact that Hadoop thinks I am uploading a new jar for EVERY execution of JobClient.runJob(), so it fails ind

Ideal number of mappers and reducers; any physical limits?

2008-07-14 Thread Lukas Vlcek
Hi, I have a couple of *basic* questions about Hadoop internals. 1) If I understood correctly, the ideal number of Reducers is equal to the number of distinct keys (or custom Partitioners) emitted from all Mappers at a given Map-Reduce iteration. Is that correct? 2) In configuration there can be s

Re: How to chain multiple hadoop jobs?

2008-07-14 Thread Mori Bellamy
Hey Sean, I later learned that the method I originally posted (configuring different JobConfs and then running them, blocking style, with JobClient.runJob(conf)) was sufficient for my needs. The reason it was failing before was somehow my fault, and the bugs somehow got fixed x_X. Lukas ga

Re: How to chain multiple hadoop jobs?

2008-07-14 Thread Sean Arietta
Could you please provide some small code snippets elaborating on how you implemented that? I have a similar need as the author of this thread and I would appreciate any help. Thanks! Cheers, Sean Joman Chu-2 wrote: > > Hi, I use Toolrunner.run() for multiple MapReduce jobs. It seems to work >

Re: Hadoop and lucene integration

2008-07-14 Thread Naama Kraus
I think you may find a lot of information about Hadoop in general in Hadoop's Wiki: http://hadoop.apache.org/core/ Re. Hadoop and search, you might also want to take a look at Nutch: http://lucene.apache.org/nutch/ In general, Hadoop allows one to store huge amounts of data on a cluster of commodity

multiple Output Collectors ?

2008-07-14 Thread Khanh Nguyen
Hello, Is it possible to have more than one output collector for one map? My input is records of html pages. I am mapping each url to its html-content and want to have two output collectors: one that maps each --> and another one that maps to something else (difficult to explain). Please help

Re: How does org.apache.hadoop.mapred.join work?

2008-07-14 Thread Chris Douglas
"Yielding equal partitions" means that each input source will offer n partitions and for any given partition 0 <= i < n, the records in that partition are 1) sorted on the same key 2) unique to that partition, i.e. if a key k is in partition i for a given source, k appears in no other parti

Pulling input from http?

2008-07-14 Thread Khanh Nguyen
Hello, I am struggling to get Hadoop to pull input from an http source, but so far no luck. Is it even possible, given that in this case the input is not placed in Hadoop's file system? Example code would be ideal. Thanks. -k

How does org.apache.hadoop.mapred.join work?

2008-07-14 Thread Kevin
Hi, I find limited information about this package, which looks like it could do an "equi?" join: "Given a set of sorted datasets keyed with the same class and yielding equal partitions, it is possible to effect a join of those datasets prior to the map." What does "yielding equal partitions" mean? Than
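
A hedged configuration sketch of such a join (per Chris's explanation in the Re: above; CompositeInputFormat and its compose() helper live in org.apache.hadoop.mapred.join, while the paths and the choice of SequenceFileInputFormat are assumptions):

  JobConf conf = new JobConf(MyJoinJob.class);   // hypothetical driver class
  conf.setInputFormat(CompositeInputFormat.class);
  // Both sources must already be sorted on the same key and identically partitioned:
  conf.set("mapred.join.expr", CompositeInputFormat.compose(
      "inner", SequenceFileInputFormat.class,
      new Path("data/a"), new Path("data/b")));
  // The map then receives, per key, a TupleWritable holding the matching value from each source.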

Hadoop User Group UK

2008-07-14 Thread Johan Oskarsson
Update on the Hadoop user group in the UK: It will be hosted at Skills Matter in Clerkenwell, London on August 19. We'll have presentations from both developers and users of Apache Hadoop. The event is free and anyone is welcome, but we only have room for 60 people so make sure you're on the

Re: Why is the task run in a child JVM?

2008-07-14 Thread Torsten Curdt
On 7/14/08, Jason Venner <[EMAIL PROTECTED]> wrote: One benefit is that if your map or reduce behaves badly it can't take down the task tracker. As the tracker jvm could also be monitored (and restarted) from outside, the internal execution might still be worth looking into. At least t

Re: Why is the task run in a child JVM?

2008-07-14 Thread Shengkai Zhu
Well, I got it. On 7/14/08, Jason Venner <[EMAIL PROTECTED]> wrote: > > One benefit is that if your map or reduce behaves badly it can't take down > the task tracker. > > In our case we have some poorly behaved external native libraries we use, > and we have to forcibly ensure that the child vms

Re: Is it possible to input two different files under same mapper

2008-07-14 Thread Jason Venner
This sounds like a good task for the Data Join code. If you can set things up so that all of your data is stored in MapFiles, with the same type of key and the same partitioning setup and count, it will go very well. Mori Bellamy wrote: Hey Amer, It sounds to me like you're going to have to write yo

Re: Why is the task run in a child JVM?

2008-07-14 Thread Jason Venner
One benefit is that if your map or reduce behaves badly it can't take down the task tracker. In our case we have some poorly behaved external native libraries we use, and we have to forcibly ensure that the child vms are killed when the child main finishes (often by kill -9), so the fact the c

Why is the task run in a child JVM?

2008-07-14 Thread Shengkai Zhu
What are the benefits of such a design compared to multi-threading? -- 朱盛凯 Jash Zhu 复旦大学软件学院 Software School, Fudan University

Re: Hadoop and lucene integration

2008-07-14 Thread bhupendar
Thanks for the response. The problem I am facing here is that I don't have any clue about Hadoop. So first I am trying to analyse whether I can integrate Hadoop with the existing application developed using Lucene or not. I need some clue or a tutorial which talks about Hadoop integration with Luce

Re: Parameterized InputFormats

2008-07-14 Thread Alejandro Abdelnur
If your InputFormat implements Configurable you'll get access to the JobConf via the setConf(Configuration) method when Hadoop creates an instance of your class. On Mon, Jun 30, 2008 at 11:20 PM, Nathan Marz <[EMAIL PROTECTED]> wrote: > Hello, > > Are there any plans to change the JobConf API so t
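
A sketch of that pattern (the parameter name and the delegation to LineRecordReader are illustrative assumptions):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configurable;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class MyInputFormat extends FileInputFormat<LongWritable, Text>
      implements Configurable {
    private Configuration conf;
    private int chunkSize;                     // hypothetical parameter

    public void setConf(Configuration conf) {  // invoked when Hadoop instantiates the class
      this.conf = conf;
      chunkSize = conf.getInt("myjob.chunk.size", 64);
    }

    public Configuration getConf() { return conf; }

    public RecordReader<LongWritable, Text> getRecordReader(InputSplit split,
        JobConf job, Reporter reporter) throws IOException {
      return new LineRecordReader(job, (FileSplit) split); // delegate for the sketch
    }
  }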

Re: Hadoop and lucene integration

2008-07-14 Thread Naama Kraus
Hi bhupendra, You may find these links helpful: https://issues.apache.org/jira/browse/HADOOP-2951 http://www.mail-archive.com/[EMAIL PROTECTED]/msg00596.html Naama On Mon, Jul 14, 2008 at 8:37 AM, bhupendar <[EMAIL PROTECTED]> wrote: > > Hi all > > I have created a search engine using lucene t

Re: Outputting to different paths from the same input file

2008-07-14 Thread Alejandro Abdelnur
You can use MultipleOutputFormat or MultipleOutputs (the latter was committed to SVN a few days ago) for this. Then you can use a filter on your input dir for the next jobs so only files matching a given name/pattern are used. A On Fri, Jul 11, 2008 at 8:54 PM, Jason Venner <[EMAIL PROTECTED]> wrot
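
A sketch of the follow-on filtering (assuming the named-output files carry the output's name as a prefix; exact file naming varies by version, and the paths are hypothetical):

  // The next job reads only the files belonging to one named output of the previous job:
  FileInputFormat.setInputPaths(nextJob, new Path("job1/out/links-*")); // glob on the prefix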