Re: how to read a text file in Map function until reaching a specific line

2009-06-26 Thread Tarandeep Singh
The TextInputFormat gives the byte offset in the file as key and the entire line as value, so it won't work for you. You can modify NLineInputFormat to achieve what you want. NLineInputFormat gives each mapper N lines (in your case N=500). Since you are interested in only the first 500 lines of each
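
A minimal sketch of that setup with the old mapred API of that era (the driver class name is a placeholder; the property is the one NLineInputFormat is understood to read):

    // inside the driver (old mapred API)
    JobConf conf = new JobConf(MyJob.class); // MyJob: hypothetical driver class
    conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class);
    // hand each mapper 500 lines of its input file as a single split
    conf.setInt("mapred.line.input.format.linespermap", 500);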

Re: Announcing CloudBase-1.3.1 release

2009-06-19 Thread Tarandeep Singh
On Wed, Jun 17, 2009 at 6:33 PM, zsongbo zson...@gmail.com wrote: How about the index of CloudBase? CloudBase has support for Hash Indexing. We have tested it with our production data and found it very useful, especially if you want to index on a Date column and later want to query on a specific

Re: multiple file input

2009-06-19 Thread Tarandeep Singh
On Fri, Jun 19, 2009 at 2:41 PM, pmg parmod.me...@gmail.com wrote: For the sake of simplicity I have split my input into two files: 1. FileA 2. FileB. As I said earlier, I want to compare every record of FileA against every record in FileB. I know this is n^2, but this is the process. I

Re: multiple file input

2009-06-19 Thread Tarandeep Singh
Oh, my bad, I was not clear. For FileB, you will be running a second map-reduce job. In the mapper, you can use the Bloom filter created in the first map-reduce job (if you wish to use it) to eliminate the lines whose keys don't match. The mapper will emit a key,value pair, where the key is the field on which you want

Re: multiple file input

2009-06-19 Thread Tarandeep Singh
Hey, I think I got your question wrong. My solution won't let you achieve what you intended; your example made it clear. Since it is a cross product, the contents of one of the files have to be in memory for iteration, but since the size is big, that might not be possible, so how about this solution and

Re: Restrict output of mappers to reducers running on same node?

2009-06-18 Thread Tarandeep Singh
keys to specific reducers, but you would not have control over which node a given reduce task will run on. Jothi On 6/18/09 5:10 AM, Tarandeep Singh tarand...@gmail.com wrote: Hi, Can I restrict the output of mappers running on a node to go to reducer(s) running on the same

Restrict output of mappers to reducers running on same node?

2009-06-17 Thread Tarandeep Singh
Hi, Can I restrict the output of mappers running on a node to go to reducer(s) running on the same node? Let me explain why I want to do this: I am converting a huge number of XML files into SequenceFiles, so theoretically I don't even need reducers; mappers would read XML files and output

Re: Effects of increasing block size / min split size

2009-06-12 Thread Tarandeep Singh
, but if the individual task completion time is very high, there might not be any discernible performance gain. Jothi On 6/11/09 11:36 PM, Tarandeep Singh tarand...@gmail.com wrote: Hi, I am trying to understand the effects of increasing block size or minimum split size. If I increase

Re: Effects of increasing block size / min split size

2009-06-12 Thread Tarandeep Singh
On Fri, Jun 12, 2009 at 4:59 PM, Owen O'Malley omal...@apache.org wrote: On Jun 11, 2009, at 11:06 AM, Tarandeep Singh wrote: I am trying to understand the effects of increasing block size or minimum split size. If I increase them, then a mapper will process more data, effectively reducing

Effects of increasing block size / min split size

2009-06-11 Thread Tarandeep Singh
Hi, I am trying to understand the effects of increasing the block size or minimum split size. If I increase them, then a mapper will process more data, effectively reducing the number of mappers that will be spawned. As there is an overhead in starting mappers, this seems good. However, if I

Re: Indexing on top of Hadoop

2009-06-10 Thread Tarandeep Singh
We have built basic index support in CloudBase (a data warehouse on top of Hadoop- http://cloudbase.sourceforge.net/) and can share our experience here. The index we built is like a hash index: for a given column/field value, it tries to process only those data blocks that contain that value

Re: Sharing object between mappers on same node (reuse.jvm?)

2009-06-04 Thread Tarandeep Singh
first place, just a thought) -Tarandeep On Thu, Jun 4, 2009 at 12:49 AM, Kevin Peterson kpeter...@biz360.com wrote: On Wed, Jun 3, 2009 at 10:59 AM, Tarandeep Singh tarand...@gmail.com wrote: I want to share an object (Lucene IndexWriter instance) between mappers running on same node of 1

Sharing object between mappers on same node (reuse.jvm?)

2009-06-03 Thread Tarandeep Singh
Hi, I want to share an object (a Lucene IndexWriter instance) between mappers running on the same node for one job (not across multiple jobs). Please correct me if I am wrong: if I set -1 for the property mapred.job.reuse.jvm.num.tasks, then all mappers of one job will be executed in the same JVM
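
A sketch of the pattern that usually goes with JVM reuse, assuming the property above is honored and a static field serves as the shared handle (the Lucene writer is replaced by a plain Object stand-in here):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class IndexingMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        // survives across tasks only when the JVM is reused, i.e. when
        // mapred.job.reuse.jvm.num.tasks is set to -1 in the JobConf
        private static Object sharedWriter; // stand-in for an IndexWriter

        public void configure(JobConf job) {
            synchronized (IndexingMapper.class) {
                if (sharedWriter == null) {
                    sharedWriter = new Object(); // create the real writer here
                }
            }
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // use sharedWriter; every mapper in this JVM sees the same instance
        }
    }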

Re: Distributed Lucene Questions

2009-06-02 Thread Tarandeep Singh
://www.scaleunlimited.com http://www.101tec.com On Jun 1, 2009, at 9:54 AM, Tarandeep Singh wrote: Hi All, I am trying to build a distributed system to build and serve lucene indexes. I came across the Distributed Lucene project- http://wiki.apache.org/hadoop/DistributedLucene https

Distributed Lucene Questions

2009-06-01 Thread Tarandeep Singh
Hi All, I am trying to build a distributed system to build and serve lucene indexes. I came across the Distributed Lucene project- http://wiki.apache.org/hadoop/DistributedLucene https://issues.apache.org/jira/browse/HADOOP-3394 and have a couple of questions. It will be really helpful if

How to submit a project to Hadoop/Apache

2009-04-15 Thread Tarandeep Singh
Hi, Can anyone point me to documentation that explains how to submit a project to Hadoop as a subproject? Also, I would appreciate it if someone points me to the documentation on how to submit a project as an Apache project. We have a project that is built on Hadoop. It is released to the open

Re: How to submit a project to Hadoop/Apache

2009-04-15 Thread Tarandeep Singh
, that you can skip the incubator and go straight under a project's wing (e.g. Hadoop) if the project PMC approves. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Tarandeep Singh tarand...@gmail.com To: core-user

Announcing CloudBase-1.3 release

2009-04-14 Thread Tarandeep Singh
Hi, We have released version 1.3 of CloudBase on sourceforge- http://cloudbase.sourceforge.net/ [ CloudBase is a data warehouse system built on top of Hadoop's Map-Reduce architecture. It uses ANSI SQL as its query language and comes with a JDBC driver. It is developed by Business.com and is

Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-14 Thread Tarandeep Singh
I think there is one important comparison missing in the paper: cost. The paper does mention in the conclusion that Hadoop comes for free, but didn't give any details of how much it would cost to get a license for Vertica or DBMS X to run on 100 nodes. Further, with data warehouse products like

Re: Compare Files

2009-03-15 Thread Tarandeep Singh
Map- output key,value pairs as (source, file_num)- 1,1 2,1 3,1 2,2 7,2 Reduce- (1, [1]), (2, [1,2]), (3, [1]), (7, [2]). Output only those keys whose list of values does not contain file 2- 1 3 -Taran On Sun, Mar 15, 2009 at 7:24 AM, Tamir Kamara tamirkam...@gmail.com wrote: Hi, I have 2 files
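
A sketch of that reduce step in the old mapred API, assuming the map phase emitted (record_key, file_num) with file_num 1 or 2 (class name illustrative):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    public class FileDiffReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, NullWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, NullWritable> output,
                           Reporter reporter) throws IOException {
            boolean seenInFile2 = false;
            while (values.hasNext()) {
                if (values.next().get() == 2) {
                    seenInFile2 = true;
                }
            }
            // keep only keys that never occurred in file 2
            if (!seenInFile2) {
                output.collect(key, NullWritable.get());
            }
        }
    }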

Re: Announcing CloudBase-1.2.1 release

2009-03-03 Thread Tarandeep Singh
of Hive vs. CloudBase for performance and comparison of features? Cheers, Tim 2009/3/3 Guttikonda, Praveen praveen.guttiko...@hp.com: Hi, Will this be competing in a sense with HBase then? Cheers, Praveen -Original Message- From: Tarandeep Singh [mailto:tarand

Time Series Analysis using CloudBase

2009-03-03 Thread Tarandeep Singh
Hi, http://cloudbase.sourceforge.net/ [ CloudBase is a data warehouse system built on top of Hadoop's Map-Reduce architecture. It uses ANSI SQL as its query language and comes with a JDBC driver. It is developed by Business.com and is released to the open source community under the GNU GPL license. One

Re: Announcing CloudBase-1.2.1 release

2009-03-03 Thread Tarandeep Singh
cardinality as MySQL can't determine the best join order inherently), so I am wondering about porting my reporting application. I think this kind of info would be great for the CloudBase docs. Cheers, Tim 2009/3/3 Tarandeep Singh tarand...@gmail.com: Tim is right. CloudBase

Announcing CloudBase-1.2.1 release

2009-03-02 Thread Tarandeep Singh
Hi, We have just released version 1.2.1 of CloudBase on sourceforge- http://cloudbase.sourceforge.net [ CloudBase is a data warehouse system built on top of Hadoop's Map-Reduce architecture. It uses ANSI SQL as its query language and comes with a JDBC driver. It is developed by Business.com and

Announcing CloudBase-1.2 release

2009-02-26 Thread Tarandeep Singh
Hi, We have released version 1.2 of CloudBase on sourceforge- http://cloudbase.sourceforge.net/ [ CloudBase is a data warehouse system built on top of Hadoop's Map-Reduce architecture. It uses ANSI SQL as its query language and comes with a JDBC driver. It is developed by Business.com and is

Announcing CloudBase-1.1 release

2008-12-22 Thread Tarandeep Singh
Hi, We have released version 1.1 of CloudBase on sourceforge- http://cloudbase.sourceforge.net/ [ CloudBase is a data warehouse system built on top of Hadoop's Map-Reduce architecture. It uses ANSI SQL as its query language and comes with a JDBC driver. It is developed by Business.com and is

Re: API Documentation question - WritableComparable

2008-12-11 Thread Tarandeep Singh
The example is just there to illustrate how one should implement one's own WritableComparable class, and in the compareTo method it is just showing how it works in the case of IntWritable, with value as its member variable. You are right that the example's code is misleading. It should have used either timestamp
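
For reference, a corrected version of that kind of example, comparing on both member fields (the field names follow the javadoc example the thread refers to; treat the whole class as illustrative):

    import java.io.*;
    import org.apache.hadoop.io.WritableComparable;

    public class MyWritableComparable
            implements WritableComparable<MyWritableComparable> {
        private int counter;
        private long timestamp;

        public void write(DataOutput out) throws IOException {
            out.writeInt(counter);
            out.writeLong(timestamp);
        }

        public void readFields(DataInput in) throws IOException {
            counter = in.readInt();
            timestamp = in.readLong();
        }

        // compare on both fields so the ordering is unambiguous
        public int compareTo(MyWritableComparable o) {
            if (counter != o.counter) {
                return counter < o.counter ? -1 : 1;
            }
            return timestamp < o.timestamp ? -1
                 : (timestamp == o.timestamp ? 0 : 1);
        }
    }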

Re: When I system.out.println() in a map or reduce, where does it go?

2008-12-10 Thread Tarandeep Singh
You can see the output in the Hadoop log directory (if you have used default settings, it would be $HADOOP_HOME/logs/userlogs). On Wed, Dec 10, 2008 at 1:31 PM, David Coe [EMAIL PROTECTED] wrote: I've noticed that if I put a System.out.println in the run() method I see the result on my console. If

Re: File Splits in Hadoop

2008-12-10 Thread Tarandeep Singh
On Wed, Dec 10, 2008 at 11:12 AM, amitsingh [EMAIL PROTECTED] wrote: Hi, I am stuck with some questions based on the following scenario. 1) Hadoop normally splits the input file and distributes the splits across slaves (referred to as Psplits from now on), into chunks of 64 MB. a) Is there any way

How to find partition number in reducer

2008-12-09 Thread Tarandeep Singh
Hi, I want to find out the partition number (which is being handled by the reducer). I can use HashPartitioner.getPartition(...), but it takes the key as an argument. Is there a way I can do something similar in the configure() method (where I have not got the key yet)? Thanks, Taran

Re: How to find partition number in reducer

2008-12-09 Thread Tarandeep Singh
but this worked for me- jobConf.getInt("mapred.task.partition", 0) thanks, Taran Zheng -Original Message- From: Tarandeep Singh [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 09, 2008 6:16 PM To: core-user@hadoop.apache.org Subject: How to find partition number in reducer Hi, I want
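
That is, in the old API the reduce task's partition number can be read in configure() like this (class name illustrative):

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;

    public class MyReducer extends MapReduceBase {
        private int partition;

        public void configure(JobConf job) {
            // the partition (reduce task number) this reducer is handling
            partition = job.getInt("mapred.task.partition", 0);
        }
    }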

Question about ChainMapper and ChainReducer

2008-11-25 Thread Tarandeep Singh
Hi, I would like to know how ChainMapper and ChainReducer save IO. The doc says the output of the first mapper becomes the input of the second and so on. So does this mean the output of the first map is *not* written to HDFS, and a second map process is started that operates on the data generated by

Caching data selectively on slaves

2008-11-11 Thread Tarandeep Singh
Hi, Is it possible to cache data selectively on slave machines? Let's say I have data partitioned as D1, D2... and so on. D1 is required by reducer R1, D2 by R2 and so on. I know this beforehand because HashPartitioner.getPartition was used to partition the data. If I put D1, D2.. in

Key,values at reducer: local or present in DFS?

2008-11-05 Thread Tarandeep Singh
Hi, I want to know whether the key,values received by a particular reducer at a node are stored locally on that node or are stored on DFS (and hence replicated over the cluster according to the replication factor set by the user). One more question: how does the framework replicate the data? Say Node A writes

Re: Pushing jar files on slave machines

2008-10-17 Thread Tarandeep Singh
cluster. I needed the third-party jar files to be available to all nodes without me manually distributing them from the master node where I launch my job. Kyle On Mon, 2008-10-13 at 12:11 -0700, Allen Wittenauer wrote: On 10/13/08 11:06 AM, Tarandeep Singh [EMAIL PROTECTED

CloudBase: Data warehouse system built on top of Hadoop

2008-10-16 Thread Tarandeep Singh
Hi, CloudBase is a data warehouse system built on top of Hadoop. It is developed by Business.com (www.business.com) and is released to the open source community under the GNU General Public License 2.0. CloudBase provides a database abstraction layer on top of flat log files and allows one to query the

Pushing jar files on slave machines

2008-10-13 Thread Tarandeep Singh
Hi, I want to push third-party jar files that are required to execute my job onto the slave machines. What is the best way to do this? I tried setting HADOOP_CLASSPATH before submitting my job, but I got a ClassNotFoundException. This is what I tried- for f in $MY_HOME/lib/*.jar; do

Questions regarding adding resource via Configuration

2008-10-06 Thread Tarandeep Singh
Hi, I have a configuration file (similar to hadoop-site.xml) and I want to include this file as a resource while running Map-Reduce jobs. Similarly, I want to add a jar file that is required by mappers and reducers. ToolRunner.run(...) allows me to do this easily; my question is, can I add these

Add jar file via -libjars - giving errors

2008-10-06 Thread Tarandeep Singh
Hi, I want to add a jar file (that is required by mappers and reducers) to the classpath. Initially I had copied the jar file to all the slave nodes in the $HADOOP_HOME/lib directory and it was working fine. However, when I tried the -libjars option to add jar files- $HADOOP_HOME/bin/hadoop jar
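
-libjars is handled by GenericOptionsParser, so it only takes effect when the driver goes through ToolRunner; a minimal driver shape for that (class name hypothetical, job setup elided):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyDriver extends Configured implements Tool {

        public int run(String[] args) throws Exception {
            // build the JobConf from getConf() and submit the job here;
            // by this point -libjars has already been consumed
            return 0;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
        }
    }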

Re: Add jar file via -libjars - giving errors

2008-10-06 Thread Tarandeep Singh
side so that it gets picked up on the client side as well. mahadev On 10/6/08 2:30 PM, Tarandeep Singh [EMAIL PROTECTED] wrote: Hi, I want to add a jar file (that is required by mappers and reducers) to the classpath. Initially I had copied the jar file to all the slave nodes

Optimal values of parameters in hadoop-site.xml

2008-09-23 Thread Tarandeep Singh
Hi, I am running a small cluster of 4 nodes, each node having quad cores and 8 GB of RAM. I have used the following values for parameters in hadoop-site.xml. I want to know whether I can increase the performance further by changing one or more of these- dfs.replication: I have set it to 2. Will I get
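
For concreteness, the shape such overrides take in hadoop-site.xml; the replication value is the one under discussion, and the second property is just another commonly tuned setting of that era, not a recommendation:

    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>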

Stop MR jobs after N records have been produced?

2008-09-04 Thread Tarandeep Singh
Hi, Can I stop Map-Reduce jobs after the mappers (or reducers) have produced N records? For example, I am interested in finding any 5 rows in the log files that have a specific keyword. Once I have got 5 lines, there is no need to check the other lines in the log files, and thus mappers and reducers

How to access counter value in Reducer?

2008-09-04 Thread Tarandeep Singh
Hi, How can I access the value of a counter in a reducer? Basically I am interested in knowing how many records I have got from file1, file2 .. fileN. The mapper is maintaining N counters and incrementing counter i for every record read from the ith file. Initially I was tagging my records with file

Re: how to get number of records written by reducer ?

2008-08-28 Thread Tarandeep Singh
On Thu, Aug 28, 2008 at 2:39 PM, Owen O'Malley [EMAIL PROTECTED] wrote: On Aug 28, 2008, at 2:33 PM, Tarandeep Singh wrote: Hi, I want to know how many records were written by the reducer via the API. Should I define my own counter or is there a way to get the value of this counter

Re: questions on sorting big files and sorting order

2008-08-27 Thread Tarandeep Singh
On Tue, Aug 26, 2008 at 7:50 AM, Owen O'Malley [EMAIL PROTECTED] wrote: On Tue, Aug 26, 2008 at 12:39 AM, charles du [EMAIL PROTECTED] wrote: I would like to sort a large number of records in a big file based on a given field (key). The property you are looking for is a total order and

Multiple output files by reducers?

2008-08-26 Thread Tarandeep Singh
Hi, Is it correct that the output of a Map-Reduce job can result in multiple files in the output directory? If yes, then how can I read the output in the order generated by the MR job? Can I use FileStatus.getModificationTime() and pick the files in increasing order of their modification
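
Yes: each reducer writes its own part-NNNNN file. A sketch of reading them back in a deterministic order; sorting by file name (part-00000, part-00001, ...) reflects partition order and is steadier than modification times (the output path here is illustrative):

    import java.util.Arrays;
    import java.util.Comparator;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadParts {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus[] parts = fs.listStatus(new Path("/job/output"));
            Arrays.sort(parts, new Comparator<FileStatus>() {
                public int compare(FileStatus a, FileStatus b) {
                    return a.getPath().getName().compareTo(b.getPath().getName());
                }
            });
            for (FileStatus part : parts) {
                if (!part.getPath().getName().startsWith("part-")) {
                    continue; // skip _logs and other non-output entries
                }
                // fs.open(part.getPath()) and read, in partition order
            }
        }
    }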

How to set System property for my job

2008-08-08 Thread Tarandeep Singh
Hi, While submitting a job to Hadoop, how can I set system properties that are required by my code? Passing -Dmy.prop=myvalue to the hadoop job command is not going to work, as the hadoop command will pass this to my program as a command-line argument. Is there any way to achieve this? Thanks, Taran
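
One workaround from that era: route the property through the child task JVM options instead of the launcher's command line (sketch only; the heap setting shown is just the old default, and note this replaces any existing value of the property):

    // inside the driver, before submitting the job
    JobConf conf = new JobConf();
    // mapred.child.java.opts is handed verbatim to each task JVM
    conf.set("mapred.child.java.opts", "-Xmx200m -Dmy.prop=myvalue");
    // task code can then call System.getProperty("my.prop")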

Re: Too many fetch failures AND Shuffle error

2008-06-30 Thread Tarandeep Singh
I am getting this error as well. As Sayali mentioned in his mail, I updated the /etc/hosts file with the slave machines' IP addresses, but I am still getting this error. Amar, which is the URL that you were talking about in your mail- There will be a URL associated with a map that the reducer try

MapWritable as output value of Reducer

2008-06-05 Thread Tarandeep Singh
Hi, Can I use MapWritable as an output value of a Reducer? If yes, how will the (key, value) pairs in the MapWritable object be written to the file? What output format should I use in this case? Further, I want to chain the output of the first map-reduce job to another map-reduce job,

Behavior of MapWritable as Key in Map-Reduce

2008-05-28 Thread Tarandeep Singh
Hi, I want to understand the behavior of MapWritable if used as an intermediate key in mappers and reducers. Suppose I create a MapWritable object with the following key-values in it- (K1, V1), (K2, V2), (K3, V3). So how will the Map-Reduce framework group and sort the keys (MapWritable objects)

Need example of MapWritable as Intermediate Key

2008-05-28 Thread Tarandeep Singh
Hi, Can someone point me to example code where MapWritable/SortedMapWritable is used as an intermediate key. I am looking for how to set the comparator for MapWritable/SortedMapWritable so that the framework groups/sorts the intermediate keys in accordance with my requirement- sort the

Question - Sum of a column in log file - memory requirement in reducer

2008-05-27 Thread Tarandeep Singh
Hi, Is it correct that an intermediate key from a mapper goes to only one reducer? If yes, then if I have to sum up the values of some column in a log file, a reducer will consume a lot of memory. I have a simple requirement: to sum up the values of one of the columns in the log files. Suppose the log
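
Yes, each distinct intermediate key goes to exactly one reducer. A combiner is the usual way to keep the data volume down: it pre-sums map output locally before the shuffle. A sketch of the wiring, inside a driver's run() method (the mapper class is hypothetical; LongSumReducer is the stock summing reducer in the old API):

    // ColumnSumMapper is hypothetical: it emits (column-name, value)
    conf.setMapperClass(ColumnSumMapper.class);
    conf.setOutputKeyClass(org.apache.hadoop.io.Text.class);
    conf.setOutputValueClass(org.apache.hadoop.io.LongWritable.class);
    // the combiner pre-aggregates per map task, shrinking the shuffle
    conf.setCombinerClass(org.apache.hadoop.mapred.lib.LongSumReducer.class);
    conf.setReducerClass(org.apache.hadoop.mapred.lib.LongSumReducer.class);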

JobConf: How to pass List/Map

2008-04-30 Thread Tarandeep Singh
Hi, How can I set a list or map on JobConf that I can access in my Mapper/Reducer class? The get/setObject methods from Configuration have been deprecated, and the documentation says: A side map of Configuration to Object should be used instead. I could not follow this :( Can someone please explain
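
The common substitute is to flatten the collection into a delimited string on the driver side and parse it back in configure(); a sketch (the property name, delimiter, and values are arbitrary):

    // driver side: flatten the list into one configuration value
    conf.set("my.keyword.list", "alpha,beta,gamma");

    // task side, inside configure(JobConf job):
    String[] keywords = job.get("my.keyword.list", "").split(",");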

Submitting map-reduce jobs without creating a jar file?

2008-04-22 Thread Tarandeep Singh
Hi, Can I submit a map-reduce job without creating a jar file (and without using the $HADOOP_HOME/bin/hadoop script)? I looked into the hadoop script and it is invoking the org.apache.hadoop.util.RunJar class. Should I (or rather, do I have to) do the same thing this class is doing if I don't want to use the

Hadoop input path - can it have subdirectories

2008-04-01 Thread Tarandeep Singh
Hi, Can I give a directory (having subdirectories) as the input path to a Hadoop Map-Reduce job? I tried, but got an error. Can Hadoop recursively traverse the input directory and collect all the file names, or does the input path have to be a directory containing only files (and no subdirectories)? -Taran
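
One workaround, assuming the files sit a known number of levels deep: input paths accept glob patterns, so the subdirectories can be matched explicitly (paths illustrative; FileInputFormat.setInputPaths is the form from slightly later 0.x releases):

    // inside the driver's run() method (old mapred API)
    JobConf conf = new JobConf();
    // the glob matches every file one level below /data/logs,
    // e.g. /data/logs/2008-03-31/part1
    org.apache.hadoop.mapred.FileInputFormat.setInputPaths(
        conf, new org.apache.hadoop.fs.Path("/data/logs/*/*"));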

Hadoop input path - can it have subdirectories

2008-03-31 Thread Tarandeep Singh
Hi, Can I pass a directory having subdirectories (which further have subdirectories) to Hadoop as the input path? I tried it, but I got an error :( -Taran

Re: Sorting output data on value

2008-02-22 Thread Tarandeep Singh
On Fri, Feb 22, 2008 at 5:46 AM, Owen O'Malley [EMAIL PROTECTED] wrote: On Feb 21, 2008, at 11:01 PM, Ted Dunning wrote: But this only guarantees that the results will be sorted within each reducer's input. Thus, this won't result in getting the results sorted by the reducers

Re: Hadoop: how to find top N frequently occurring words

2008-02-04 Thread Tarandeep Singh
) and count directly in that. This would lead to some quantifiable error rate, which may be acceptable for your application. Thanks for suggesting this. I didn't know about it. I will read more about it and hopefully it will solve my problem. thanks, Taran Miles On 04/02/2008, Tarandeep Singh

Re: Hadoop: how to find top N frequently occurring words

2008-02-04 Thread Tarandeep Singh
, Taran Miles On 04/02/2008, Tarandeep Singh [EMAIL PROTECTED] wrote: Hi, Can someone guide me on how to write a program using the Hadoop framework that analyzes the log files and finds out the top most frequently occurring keywords? The log file has the format- keyword source dateId

More than one map-reduce task in one program?

2008-02-04 Thread Tarandeep Singh
Hi, I am working on a problem: process log files and count the number of times all keywords occur, kind of like the word count program that comes with the Hadoop examples. In addition to that, I need to do post-processing of the result, like identifying the top 10 most frequently occurring keywords or keywords
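
The usual answer is to run two jobs back to back from one driver; JobClient.runJob blocks until each job finishes (both JobConf objects are assumed to be set up elsewhere, names hypothetical):

    // pass 1: the word-count-style job over the logs
    JobClient.runJob(countConf);
    // pass 2: read pass 1's output and keep the top 10,
    // e.g. with a single reducer maintaining a bounded TreeMap
    JobClient.runJob(topTenConf);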