The TextInputFormat gives the byte offset in the file as the key and the entire line
as the value, so it won't work for you.
You can modify NLineInputFormat to achieve what you want. NLineInputFormat
gives each mapper N lines (in your case N=500).
Since you are interested in only the first 500 lines of each
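A minimal driver sketch of that suggestion, assuming the old mapred API: NLineInputFormat reads the mapred.line.input.format.linespermap property, so each map task receives 500 consecutive lines. The job name, paths, and the identity mapper are placeholders.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class FirstLinesJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(FirstLinesJob.class);
    conf.setJobName("n-line-demo");

    // NLineInputFormat hands each mapper N consecutive lines of a file.
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 500);

    conf.setMapperClass(IdentityMapper.class);   // placeholder mapper
    conf.setNumReduceTasks(0);                   // map-only for the demo
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}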
On Wed, Jun 17, 2009 at 6:33 PM, zsongbo zson...@gmail.com wrote:
How about the index of CloudBase?
CloudBase has support for hash indexing. We have tested it with our
production data and found it very useful, especially if you want to index on
a date column and later want to query on specific
On Fri, Jun 19, 2009 at 2:41 PM, pmg parmod.me...@gmail.com wrote:
For the sake of simplification I have reduced my input to two files: 1.
FileA 2. FileB
As I said earlier, I want to compare every record of FileA against every
record in FileB. I know this is n^2, but this is the process. I
Oh, my bad, I was not clear-
For FileB, you will be running a second map-reduce job. In the mapper, you can
use the Bloom filter created in the first map-reduce job (if you wish to use it)
to eliminate the lines whose keys don't match. The mapper will emit a key,value
pair, where the key is the field on which you want
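A rough sketch of that second job's mapper, assuming the first job serialized a org.apache.hadoop.util.bloom.BloomFilter to an HDFS file; the property name bloom.filter.path and the tab-separated join field are made-up details for illustration.

import java.io.DataInputStream;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

public class FileBFilterMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final BloomFilter filter = new BloomFilter();

  @Override
  public void configure(JobConf job) {
    // "bloom.filter.path" is a made-up property; the first job is assumed
    // to have written the filter with BloomFilter.write(...).
    try {
      Path p = new Path(job.get("bloom.filter.path"));
      DataInputStream in = FileSystem.get(job).open(p);
      filter.readFields(in);
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("could not load Bloom filter", e);
    }
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    // Assume the join field is the first tab-separated column of FileB.
    String joinKey = line.toString().split("\t", 2)[0];
    if (filter.membershipTest(new Key(joinKey.getBytes("UTF-8")))) {
      out.collect(new Text(joinKey), line);   // keep only keys FileA might contain
    }
  }
}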
Hey, I think I got your question wrong. My solution won't let you achieve
what you intended; your example made it clear.
Since it is a cross product, the contents of one of the files have to be in
memory for iteration, but since the size is big that might not be possible, so
how about this solution and
keys to specific reducers, but you would not have control over which node a
given reduce task will run on.
Jothi
On 6/18/09 5:10 AM, Tarandeep Singh tarand...@gmail.com wrote:
Hi,
Can I restrict the output of mappers running on a node to go to reducer(s)
running on the same
Hi,
Can I restrict the output of mappers running on a node to go to reducer(s)
running on the same node?
Let me explain why I want to do this-
I am converting a huge number of XML files into SequenceFiles. So
theoretically I don't even need reducers; mappers would read XML files and
output
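A rough driver sketch of that map-only approach (old mapred API): with zero reduce tasks there is no shuffle at all, so the question of routing map output to co-located reducers does not arise; each map task writes its SequenceFile part directly to the output path. The XML parsing itself is only stubbed out here.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class XmlToSequenceFileJob {

  // Placeholder mapper: real code would parse one XML record here.
  public static class XmlToSeqMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      out.collect(new Text("doc"), line);   // stand-in for parsed content
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(XmlToSequenceFileJob.class);
    conf.setJobName("xml-to-seqfile");

    conf.setMapperClass(XmlToSeqMapper.class);
    conf.setNumReduceTasks(0);                  // map-only: no shuffle, no reduce
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}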
, but if the individual task completion time is very
high, there might not be any discernible performance gain.
Jothi
On 6/11/09 11:36 PM, Tarandeep Singh tarand...@gmail.com wrote:
Hi,
I am trying to understand the effects of increasing block size or minimum
split size. If I increase
On Fri, Jun 12, 2009 at 4:59 PM, Owen O'Malley omal...@apache.org wrote:
On Jun 11, 2009, at 11:06 AM, Tarandeep Singh wrote:
I am trying to understand the effects of increasing block size or minimum
split size. If I increase them, then a mapper will process more data,
effectively reducing
Hi,
I am trying to understand the effects of increasing block size or minimum
split size. If I increase them, then a mapper will process more data,
effectively reducing the number of mappers that will be spawned. As there is
overhead in starting mappers, this seems good.
However, if I
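For reference, a small sketch of the knob being discussed, assuming the old property names: raising mapred.min.split.size makes each map task consume more data, so fewer map tasks are launched, while dfs.block.size only takes effect when the files are written.

import org.apache.hadoop.mapred.JobConf;

public class SplitSizeDemo {
  public static void main(String[] args) {
    JobConf conf = new JobConf(SplitSizeDemo.class);
    // Ask for splits of at least 256 MB instead of one split per 64 MB block.
    conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);
    // Block size is fixed at write time, e.g. 128 MB blocks for new files.
    conf.setLong("dfs.block.size", 128L * 1024 * 1024);
  }
}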
We have built basic index support in CloudBase (a data warehouse on top of
Hadoop- http://cloudbase.sourceforge.net/) and can share our experience
here-
The index we built is like a Hash Index- for a given column/field value, it
tries to process only those data blocks which contain that value
first place,
just a thought)
-Tarandeep
On Thu, Jun 4, 2009 at 12:49 AM, Kevin Peterson kpeter...@biz360.comwrote:
On Wed, Jun 3, 2009 at 10:59 AM, Tarandeep Singh tarand...@gmail.com
wrote:
I want to share an object (a Lucene IndexWriter instance) between mappers
running on the same node of one
Hi,
I want to share an object (a Lucene IndexWriter instance) between mappers
running on the same node of one job (not across multiple jobs). Please correct me
if I am wrong -
If I set the property mapred.job.reuse.jvm.num.tasks to -1, then all
mappers of one job will be executed in the same JVM
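A rough sketch of that idea: enable JVM reuse and keep the IndexWriter in a static field, so every map task run by the reused JVM sees the same instance. The Lucene calls assume the 2.x-era IndexWriter(File, Analyzer, boolean) constructor and the index directory is a placeholder. Note that JVM reuse only makes tasks that run one after another in the same JVM share state; concurrent task slots on the same node still get separate JVMs and separate writers.

import java.io.File;
import java.io.IOException;

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class SharedWriterMapper extends MapReduceBase /* implements Mapper<...> */ {

  // One writer per task JVM; with JVM reuse, tasks of this job share it.
  private static IndexWriter writer;

  @Override
  public void configure(JobConf job) {
    synchronized (SharedWriterMapper.class) {
      if (writer == null) {
        try {
          // "/tmp/task-local-index" is a placeholder local directory.
          writer = new IndexWriter(new File("/tmp/task-local-index"),
                                   new StandardAnalyzer(), true);
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }
    }
  }

  public static void enableJvmReuse(JobConf conf) {
    // -1 means "reuse the JVM for an unlimited number of tasks of this job".
    conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
  }
}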
://www.scaleunlimited.com
http://www.101tec.com
On Jun 1, 2009, at 9:54 AM, Tarandeep Singh wrote:
Hi All,
I am trying to build a distributed system to build and serve lucene
indexes.
I came across the Distributed Lucene project-
http://wiki.apache.org/hadoop/DistributedLucene
https
Hi All,
I am trying to build a distributed system to build and serve lucene indexes.
I came across the Distributed Lucene project-
http://wiki.apache.org/hadoop/DistributedLucene
https://issues.apache.org/jira/browse/HADOOP-3394
and have a couple of questions. It will be really helpful if
Hi,
Can anyone point me to documentation that explains how to submit a
project to Hadoop as a subproject? Also, I would appreciate it if someone points
me to documentation on how to submit a project as an Apache project.
We have a project that is built on Hadoop. It is released to the open
, that you can skip the incubator and go
straight under a project's wing (e.g. Hadoop) if the project PMC
approves.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Tarandeep Singh tarand...@gmail.com
To: core-user
Hi,
We have released 1.3 version of CloudBase on sourceforge-
http://cloudbase.sourceforge.net/
[ CloudBase is a data warehouse system built on top of Hadoop's Map-Reduce
architecture. It uses ANSI SQL as its query language and comes with a JDBC
driver. It is developed by Business.com and is
I think there is one important comparison missing in the paper- cost. The
paper does mention in the conclusion that Hadoop comes for free, but it didn't
give any details of how much it would cost to get a license for Vertica or
DBMS X to run on 100 nodes.
Further, with data warehouse products like
Map- Output key,value pair as- (source, file_num)
1,1
2,1
3,1
2,2
7,2
Reduce- (1, [1]), (2, [1,2]), (3, [1]), (7, [2])
Output only those keys whose list of values does not contain file 2-
1
3
-Taran
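A small reducer sketch for that last step (old mapred API), assuming the map output is (source, file_num) with both fields as IntWritable, exactly as in the example above.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MissingFromFile2Reducer extends MapReduceBase
    implements Reducer<IntWritable, IntWritable, IntWritable, NullWritable> {

  public void reduce(IntWritable source, Iterator<IntWritable> fileNums,
                     OutputCollector<IntWritable, NullWritable> out,
                     Reporter reporter) throws IOException {
    boolean inFile2 = false;
    while (fileNums.hasNext()) {
      if (fileNums.next().get() == 2) {
        inFile2 = true;
      }
    }
    // Emit only the keys whose value list never contained file 2.
    if (!inFile2) {
      out.collect(source, NullWritable.get());
    }
  }
}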
On Sun, Mar 15, 2009 at 7:24 AM, Tamir Kamara tamirkam...@gmail.com wrote:
Hi,
I have 2 files
of Hive vs. Cloudbase for performance and
comparison of features?
Cheers,
Tim
2009/3/3 Guttikonda, Praveen praveen.guttiko...@hp.com:
Hi ,
Will this be competing in a sense with HBASE then ?
Cheers,
Praveen
-Original Message-
From: Tarandeep Singh [mailto:tarand
Hi,
http://cloudbase.sourceforge.net/
[ CloudBase is a data warehouse system
built on top of Hadoop's Map-Reduce architecture. It uses ANSI SQL as its
query language and comes with a JDBC driver. It is developed by Business.com
and is released to open source community under GNU GPL license. One
cardinality as mysql can't determine the best
join order inherently) so I am wondering about porting my reporting
application.
I think this kind of info would be great for cloudbase docs.
Cheers,
Tim
2009/3/3 Tarandeep Singh tarand...@gmail.com:
Tim is right. CloudBase
Hi,
We have just released 1.2.1 version of CloudBase on sourceforge-
http://cloudbase.sourceforge.net
[ CloudBase is a data warehouse system built on top of Hadoop's Map-Reduce
architecture. It uses ANSI SQL as its query language and comes with a JDBC
driver. It is developed by Business.com and
Hi,
We have released 1.2 version of CloudBase on sourceforge-
http://cloudbase.sourceforge.net/
[ CloudBase is a data warehouse system built on top of Hadoop's Map-Reduce
architecture. It uses ANSI SQL as its query language and comes with a JDBC
driver. It is developed by Business.com and is
Hi,
We have released 1.1 version of CloudBase on sourceforge-
http://cloudbase.sourceforge.net/
[ CloudBase is a data warehouse system built on top of Hadoop's Map-Reduce
architecture. It uses ANSI SQL as its query language and comes with a JDBC
driver. It is developed by Business.com and is
The example is just to illustrate how one should implement one's own
WritableComparable class, and in the compareTo method it is just showing how
it works in the case of IntWritable with value as its member variable.
You are right, the example's code is misleading. It should have used either
timestamp
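For reference, a minimal sketch of such a class with a timestamp as the member variable (the class name is made up); write(), readFields(), and compareTo() are the methods the framework relies on for serialization and sorting.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class TimestampKey implements WritableComparable<TimestampKey> {

  private long timestamp;

  public TimestampKey() {}                        // required for deserialization

  public TimestampKey(long timestamp) { this.timestamp = timestamp; }

  public void write(DataOutput out) throws IOException {
    out.writeLong(timestamp);
  }

  public void readFields(DataInput in) throws IOException {
    timestamp = in.readLong();
  }

  // The framework sorts intermediate keys with this method.
  public int compareTo(TimestampKey other) {
    return (timestamp < other.timestamp) ? -1
         : (timestamp == other.timestamp) ? 0 : 1;
  }

  @Override
  public int hashCode() { return (int) (timestamp ^ (timestamp >>> 32)); }

  @Override
  public boolean equals(Object o) {
    return o instanceof TimestampKey && ((TimestampKey) o).timestamp == timestamp;
  }
}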
You can see the output in the Hadoop log directory (if you have used the default
settings, it would be $HADOOP_HOME/logs/userlogs).
On Wed, Dec 10, 2008 at 1:31 PM, David Coe [EMAIL PROTECTED] wrote:
I've noticed that if I put a System.out.println in the run() method I
see the result on my console. If
On Wed, Dec 10, 2008 at 11:12 AM, amitsingh [EMAIL PROTECTED]wrote:
Hi,
I am stuck with some questions based on the following scenario.
1) Hadoop normally splits the input file into chunks of 64 MB and distributes
the splits (referred to as Psplits from now on) across slaves.
a) Is there any way
Hi,
I want to find out the partition number (which is being handled by the
reducer). I can use
HashPartitioner.getPartition(...), but it takes the key as an argument.
Is there a way I can do something similar in the configure() method (where I
have not got the key yet)
Thanks,
Taran
but this worked for me-
jobConf.getInt("mapred.task.partition", 0)
thanks,
Taran
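Put together, a minimal sketch of a reducer picking up its partition number in configure() (old mapred API):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class PartitionAwareReducer extends MapReduceBase /* implements Reducer<...> */ {

  private int partition;

  @Override
  public void configure(JobConf job) {
    // Each reduce task handles exactly one partition; the framework records
    // its number in mapred.task.partition before configure() is called.
    partition = job.getInt("mapred.task.partition", 0);
  }
}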
Zheng
-Original Message-
From: Tarandeep Singh [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 09, 2008 6:16 PM
To: core-user@hadoop.apache.org
Subject: How to find partition number in reducer
Hi,
I want
Hi,
I would like to know how ChainMapper and ChainReducer save IO.
The doc says the output of the first mapper becomes the input of the second and so
on. So does this mean the output of the first map is *not* written to HDFS and
a second map process is started that operates on the data generated by
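For reference, a rough sketch of how such a chain is declared with the old mapred API. The chained mappers run inside the same map task and hand records to each other in memory, so nothing is written to HDFS between them; the two toy mappers below are made up for illustration.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.ChainMapper;

public class ChainDemo {

  // Toy first stage: upper-case each line.
  public static class UpperMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, LongWritable, Text> {
    public void map(LongWritable k, Text v, OutputCollector<LongWritable, Text> out,
                    Reporter r) throws IOException {
      out.collect(k, new Text(v.toString().toUpperCase()));
    }
  }

  // Toy second stage: trim whitespace; it consumes UpperMapper's records
  // directly in memory, inside the same map task.
  public static class TrimMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, LongWritable, Text> {
    public void map(LongWritable k, Text v, OutputCollector<LongWritable, Text> out,
                    Reporter r) throws IOException {
      out.collect(k, new Text(v.toString().trim()));
    }
  }

  public static void configureChain(JobConf job) {
    ChainMapper.addMapper(job, UpperMapper.class,
        LongWritable.class, Text.class, LongWritable.class, Text.class,
        true, new JobConf(false));
    ChainMapper.addMapper(job, TrimMapper.class,
        LongWritable.class, Text.class, LongWritable.class, Text.class,
        true, new JobConf(false));
    // ChainReducer.setReducer(...) / ChainReducer.addMapper(...) extend the
    // same pattern on the reduce side.
  }
}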
Hi,
Is it possible to cache data selectively on slave machines?
Let's say I have data partitioned as D1, D2... and so on. D1 is required by
reducer R1, D2 by R2, and so on. I know this beforehand because
HashPartitioner.getPartition was used to partition the data.
If I put D1, D2.. in
Hi,
I want to know whether the key,value pairs received by a particular reducer at a
node are stored locally on that node or are stored on DFS (and hence
replicated over the cluster according to the replication factor set by the user).
One more question- how does the framework replicate the data? Say node A
writes
cluster. I needed to have the third
party jar files to become available to all nodes without me manually
distributing them from the master node where I launch my job.
Kyle
On Mon, 2008-10-13 at 12:11 -0700, Allen Wittenauer wrote:
On 10/13/08 11:06 AM, Tarandeep Singh [EMAIL PROTECTED
Hi,
CloudBase is a data warehouse system built on top of Hadoop. It is
developed by Business.com (www.business.com) and is released to the
open source community under the GNU General Public License 2.0.
CloudBase provides a database abstraction layer on top of flat log files
and allows one to query the
Hi,
I want to push third-party jar files that are required to execute my job onto
the slave machines. What is the best way to do this?
I tried setting HADOOP_CLASSPATH before submitting my job, but I got a
ClassNotFoundException.
This is what I tried-
for f in $MY_HOME/lib/*.jar; do
Hi,
I have a configuration file (similar to hadoop-site.xml) and I want to
include this file as a resource while running Map-Reduce jobs. Similarly, I
want to add a jar file that is required by mappers and reducers.
ToolRunner.run(...) allows me to do this easily; my question is, can I add
these
Hi,
I want to add a jar file (that is required by mappers and reducers) to the
classpath. Initially I had copied the jar file to all the slave nodes in the
$HADOOP_HOME/lib directory and it was working fine.
However, when I tried the -libjars option to add jar files -
$HADOOP_HOME/bin/hadoop jar
side so that it gets picked up on the client
side as well.
mahadev
On 10/6/08 2:30 PM, Tarandeep Singh [EMAIL PROTECTED] wrote:
Hi,
I want to add a jar file (that is required by mappers and reducers) to
the
classpath. Initially I had copied the jar file to all the slave nodes
Hi,
I am running a small cluster of 4 nodes, each node having quad cores and 8
GB of RAM. I have used the following values for the parameters in
hadoop-site.xml. I want to know whether I can increase the performance further by
changing one or more of these-
dfs.replication: I have set it to 2. Will I get
Hi,
Can I stop Map-Reduce jobs after the mappers (or reducers) have produced N
records?
For example, I am interested in finding any 5 rows in the log files that
have a specific keyword. Once I have got 5 lines, there is no need to check
the other lines in the log files, and thus the mappers and reducers
Hi,
How can I access the value of a counter in a reducer?
Basically, I am interested in knowing how many records I have got from file1,
file2, ..., fileN.
The mapper is maintaining N counters and incrementing counter i for every
record read from the ith file.
Initially I was tagging my records with file
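A sketch of that mapper-side bookkeeping with the old mapred API; the counter group name is made up, and the counter is named after the input file of the current split. Counter values are aggregated by the framework and are normally read through the job client after the job finishes rather than from inside a reducer.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class PerFileCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    // One counter per input file, named after the file itself.
    String fileName = ((FileSplit) reporter.getInputSplit()).getPath().getName();
    reporter.incrCounter("RecordsPerFile", fileName, 1);

    out.collect(line, ONE);
  }
}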
On Thu, Aug 28, 2008 at 2:39 PM, Owen O'Malley [EMAIL PROTECTED] wrote:
On Aug 28, 2008, at 2:33 PM, Tarandeep Singh wrote:
Hi,
I want to know how many records were written by the reducer via the API. Should
I define my own counter, or is there a way to get the value of this counter
On Tue, Aug 26, 2008 at 7:50 AM, Owen O'Malley [EMAIL PROTECTED] wrote:
On Tue, Aug 26, 2008 at 12:39 AM, charles du [EMAIL PROTECTED] wrote:
I would like to sort a large number of records in a big file based on a
given field (key).
The property you are looking for is a total order and
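One way to get a total order, sketched here under the assumption that the old mapred.lib classes are available (this names a concrete technique that the truncated reply may or may not have suggested): sample the input to build a partition file, then let TotalOrderPartitioner route key ranges to reducers so the concatenated reducer outputs are globally sorted. The partition file path is a placeholder, the number of reduce tasks is assumed to be set already, and the map output keys must be comparable with the sampled input keys.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.InputSampler;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

public class TotalSortSketch {
  public static void configure(JobConf job) throws Exception {
    // Sample the input keys to pick split points (one fewer than reduce tasks).
    InputSampler.Sampler<Text, Text> sampler =
        new InputSampler.RandomSampler<Text, Text>(0.01, 10000, 10);
    Path partitionFile = new Path("/tmp/sort-partitions");   // placeholder path
    TotalOrderPartitioner.setPartitionFile(job, partitionFile);
    InputSampler.writePartitionFile(job, sampler);

    // Route keys to reducers by range, so reducer i's output precedes
    // reducer i+1's in the overall sort order.
    job.setPartitionerClass(TotalOrderPartitioner.class);
  }
}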
Hi,
Is it correct that the output of a Map-Reduce job can result in multiple files
in the output directory?
If yes, then how can I read the output in the order generated by the MR job?
Can I use FileStatus.getModificationTime() and pick the files in
increasing order of their modification
Hi,
While submitting a job to Hadoop, how can I set system properties that are
required by my code?
Passing -Dmy.prop=myvalue to the hadoop job command is not going to work, as
the hadoop command will pass this to my program as a command line argument.
Is there any way to achieve this?
Thanks,
Taran
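Two possible ways around that, sketched here with a made-up property name my.prop: carry the value in the job Configuration and read it back in configure(), or pass it as a real JVM system property to the task children via mapred.child.java.opts.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class PropertyDemo {

  public static void setup(JobConf conf) {
    // Option 1: carry the value in the job configuration itself.
    conf.set("my.prop", "myvalue");

    // Option 2: make it a true -D system property inside every task JVM.
    conf.set("mapred.child.java.opts", "-Xmx200m -Dmy.prop=myvalue");
  }

  public static class MyMapper extends MapReduceBase /* implements Mapper<...> */ {
    private String myProp;

    @Override
    public void configure(JobConf job) {
      myProp = job.get("my.prop");                        // option 1
      // String viaSystem = System.getProperty("my.prop"); // option 2
    }
  }
}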
I am getting this error as well.
As Sayali mentioned in his mail, I updated the /etc/hosts file with the
slave machines' IP addresses, but I am still getting this error.
Amar, which is the URL that you were talking about in your mail -
There will be a URL associated with a map that the reducer try
hi,
Can I use MapWritable as an output value of a reducer?
If yes, how will the (key, value) pairs in the MapWritable object be
written to the file? What output format should I use in this case?
Further, I want to chain the output of the first map-reduce job to another
map-reduce job,
Hi,
I want to understand the behavior of MapWritable if used as an intermediate
key in mappers and reducers.
Suppose I create a MapWritable object with the following key-value pairs in it-
(K1, V1), (K2, V2), (K3, V3)
So how will the Map-Reduce framework group and sort the keys (MapWritable
objects)
Hi,
Can someone point me to example code where MapWritable/SortedMapWritable
is used as an intermediate key? I am looking for how to set the comparator
for MapWritable/SortedMapWritable so that the framework groups/sorts the
intermediate keys in accordance with my requirement - sort the
Hi,
Is it correct that an intermediate key from a mapper goes to only one reducer?
If yes, then if I have to sum up the values of some column in a log file, a
reducer will consume a lot of memory -
I have a simple requirement - to sum up the values of one of the
columns in the log files.
Suppose the log
Hi,
How can I set a list or map in the JobConf that I can access in the
Mapper/Reducer class?
The get/setObject methods of Configuration have been deprecated and
the documentation says -
A side map of Configuration to Object should be used instead.
I could not follow this :(
Can someone please explain
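One possible workaround for the list case, sketched with a made-up property name my.keywords: Configuration already has setStrings()/getStrings(), which store the values under a single comma-separated property. A small map can be flattened the same way with a delimiter of your choice.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class ListInConfDemo {

  public static void setup(JobConf conf) {
    // Stored as one comma-separated value under "my.keywords".
    conf.setStrings("my.keywords", "error", "warning", "timeout");
  }

  public static class MyMapper extends MapReduceBase /* implements Mapper<...> */ {
    private String[] keywords;

    @Override
    public void configure(JobConf job) {
      keywords = job.getStrings("my.keywords");
    }
  }
}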
hi,
Can I submit a map-reduce job without creating the jar file (and without using
the $HADOOP_HOME/bin/hadoop script)? I looked into the hadoop script and
it invokes the org.apache.hadoop.util.RunJar class. Should I (or
rather, do I) have to do the same thing as this class does if I
don't want to use the
Hi,
Can I give a directory (having subdirectories) as the input path to a Hadoop
Map-Reduce job?
I tried, but got an error.
Can Hadoop recursively traverse the input directory and collect all
the file names, or does the input path have to be a directory containing only
files (and no subdirectories)?
-Taran
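FileInputFormat in the releases being discussed does not recurse into subdirectories on its own, so one possible workaround (a sketch, not the only option) is to walk the tree yourself and add each plain file as an input path:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class RecursiveInput {

  /** Collect every plain file under root, descending into subdirectories. */
  public static List<Path> listFilesRecursively(FileSystem fs, Path root)
      throws IOException {
    List<Path> files = new ArrayList<Path>();
    FileStatus[] entries = fs.listStatus(root);
    if (entries == null) {
      return files;                               // path does not exist
    }
    for (FileStatus stat : entries) {
      if (stat.isDir()) {
        files.addAll(listFilesRecursively(fs, stat.getPath()));
      } else {
        files.add(stat.getPath());
      }
    }
    return files;
  }

  public static void addInputs(JobConf conf, Path root) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    for (Path file : listFilesRecursively(fs, root)) {
      FileInputFormat.addInputPath(conf, file);
    }
  }
}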
Hi,
Can I pass a directory having subdirectories (which further have
subdirectories) to Hadoop as the input path?
I tried it, but I got an error :(
-Taran
On Fri, Feb 22, 2008 at 5:46 AM, Owen O'Malley [EMAIL PROTECTED] wrote:
On Feb 21, 2008, at 11:01 PM, Ted Dunning wrote:
But this only guarantees that the results will be sorted within each
reducer's input. Thus, this won't result in getting the results sorted by
the reducers
) and count
directly in that. This would lead to some quantifiable error rate, which
may be acceptable for your application.
Thanks for suggesting this. I didn't know about it. I will read more
about it and hopefully it will solve my problem.
thanks,
Taran
Miles
On 04/02/2008, Tarandeep Singh
,
Taran
Miles
On 04/02/2008, Tarandeep Singh [EMAIL PROTECTED] wrote:
Hi,
Can someone guide me on how to write a program using the Hadoop framework
that analyzes the log files and finds the most frequently
occurring keywords. The log file has the format -
keyword source dateId
Hi,
I am working on a problem - process log files and count the number of
times each keyword occurs - kind of like the word count program that comes with
the Hadoop examples. In addition to that, I need to do post-processing of
the result, like identifying the top 10 most frequently occurring
keywords or keywords
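A word-count-style sketch for the first half of that (counting keyword occurrences), assuming the stated "keyword source dateId" layout with whitespace separators; the top-10 selection would be a small post-processing pass over these counts.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class KeywordCount {

  public static class KeywordMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      // Assumed layout: "keyword source dateId", whitespace separated.
      String[] fields = line.toString().split("\\s+");
      if (fields.length > 0 && fields[0].length() > 0) {
        out.collect(new Text(fields[0]), ONE);
      }
    }
  }

  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text keyword, Iterator<LongWritable> counts,
                       OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      long total = 0;
      while (counts.hasNext()) {
        total += counts.next().get();
      }
      out.collect(keyword, new LongWritable(total));
    }
  }
}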