DFS block size

2009-11-14 Thread Hrishikesh Agashe
Hi, The default DFS block size is 64 MB. Does this mean that if I put a file smaller than 64 MB on HDFS, it will not be divided any further? I have lots and lots of XMLs and I would like to process them directly. Currently I am converting them to SequenceFiles (10 XMLs per SequenceFile) and then puttin…
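A minimal sketch of the SequenceFile packing described above, assuming Text keys (the source file name) and Text values (the XML contents); the output path and the command-line list of local XML files are illustrative only:

import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Pack many small XML files into one SequenceFile. A file smaller than the DFS
// block size is stored in a single block and occupies only its actual size on
// disk, but every small file still costs NameNode memory, hence the packing.
public class XmlToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[0]);   // e.g. an HDFS path like /data/xml-pack-00000.seq

        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
        try {
            for (int i = 1; i < args.length; i++) {   // local XML files to pack
                byte[] xml = Files.readAllBytes(Paths.get(args[i]));
                writer.append(new Text(args[i]), new Text(new String(xml, "UTF-8")));
            }
        } finally {
            writer.close();
        }
    }
}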

How to call a method after all map jobs on slave nodes are done

2009-11-13 Thread Hrishikesh Agashe
Hi, I am implementing the MapRunnable interface to create the map jobs. I have a large data set to process (around 10 GB) and a cluster of 1 master and 10 slaves. When I run my program, Hadoop processes the data successfully. After processing, I am collecting all the data (all are files…
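One way to get a post-map hook, sketched under the assumption that the old (org.apache.hadoop.mapred) API is in use: JobClient.runJob() blocks until the whole job finishes, so the driver can simply call the collection step afterwards. PassThroughRunner and collectResults() are illustrative placeholders, not part of the original thread.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class DriverWithPostStep {

    // Minimal MapRunnable that just forwards each record (illustrative only).
    public static class PassThroughRunner
            implements MapRunnable<LongWritable, Text, Text, Text> {
        public void configure(JobConf job) { }
        public void run(RecordReader<LongWritable, Text> input,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            LongWritable key = input.createKey();
            Text value = input.createValue();
            while (input.next(key, value)) {
                output.collect(new Text(key.toString()), value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(DriverWithPostStep.class);
        conf.setJobName("process-then-collect");
        conf.setMapRunnerClass(PassThroughRunner.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);             // returns only after all map and reduce tasks finish
        collectResults(new Path(args[1]));  // placeholder for the file-collection step
    }

    private static void collectResults(Path outputDir) {
        // e.g. list and merge the part-* files written by the tasks
    }
}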

Lucene + Hadoop

2009-11-10 Thread Hrishikesh Agashe
Hi, I am trying to use Hadoop for Lucene index creation. I have to create multiple indexes based on the contents of the files (i.e. if the author is "hrishikesh", the file should be added to an index for "hrishikesh"; there has to be a separate index for every author). For this, I am keeping multiple IndexWri…
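A sketch of one way to get per-author indexes without juggling many writers at once, assuming the old mapred API and a Lucene 2.9-era IndexWriter: the author is used as the intermediate key, so each reduce() call sees exactly one author's documents. The "indexes/<author>" directory layout and field names are assumptions.

import java.io.File;
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class PerAuthorIndexReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text author, Iterator<Text> docs,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // One index directory per author, written on the task's local disk;
        // it can be copied to HDFS (or served directly) once the job is done.
        File indexDir = new File("indexes" + File.separator + author.toString());
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(indexDir),
                new StandardAnalyzer(Version.LUCENE_29),
                true,                                     // create a fresh index
                IndexWriter.MaxFieldLength.UNLIMITED);
        try {
            while (docs.hasNext()) {
                Document doc = new Document();
                doc.add(new Field("author", author.toString(),
                                  Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.add(new Field("content", docs.next().toString(),
                                  Field.Store.YES, Field.Index.ANALYZED));
                writer.addDocument(doc);
            }
        } finally {
            writer.close();
        }
        output.collect(author, new Text(indexDir.getPath()));
    }
}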

Multiple jobs on Hadoop

2009-07-24 Thread Hrishikesh Agashe
Hi, If I have one cluster with around 20 machines, can I submit different MR jobs from different machines in the cluster? Are there any precautions to be taken? (I want to start a Nutch crawl as one job and Katta indexing as another job.) --Hrishi
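Job submission is client-side, so jobs can be started from any machine that has the Hadoop client jars and the cluster configuration; a rough sketch follows, where the master host names/ports, paths, and job name are assumptions (they normally come from hadoop-site.xml rather than being set in code).

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

// The JobTracker queues submissions from different clients and runs them
// concurrently as map/reduce slots allow; the default identity mapper and
// reducer are kept here to keep the sketch short.
public class SubmitFromAnyNode {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitFromAnyNode.class);
        conf.set("fs.default.name", "hdfs://master:9000");  // assumed NameNode address
        conf.set("mapred.job.tracker", "master:9001");      // assumed JobTracker address
        conf.setJobName("katta-indexing");                  // e.g. while a Nutch crawl
                                                            // runs as a separate job
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient client = new JobClient(conf);
        RunningJob job = client.submitJob(conf);            // non-blocking submission
        System.out.println("submitted " + job.getID());
    }
}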

Relation between the number of map/reduce tasks per node and NameNode capacity

2009-07-14 Thread Hrishikesh Agashe
Hi, Is there any relationship between how many map and reduce tasks I run per node and the capacity (RAM, CPU) of my NameNode? That is, if I want to run more map and reduce tasks per node, should the NameNode have more RAM? Similarly, should NameNode capacity be driven…
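For what it's worth, the per-node task count is a TaskTracker (slave-side) setting, while NameNode heap grows with the number of files and blocks in HDFS rather than with task slots. A small sketch for inspecting the slot settings, assuming the 0.20-era key names:

import org.apache.hadoop.conf.Configuration;

public class ShowTaskSlots {
    public static void main(String[] args) {
        // Reads hadoop-site.xml / mapred-site.xml from the classpath.
        Configuration conf = new Configuration();
        System.out.println("map slots per node    = "
                + conf.getInt("mapred.tasktracker.map.tasks.maximum", 2));
        System.out.println("reduce slots per node = "
                + conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2));
    }
}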