Thanks Christophe, Amr and Ted for your recommendations.
Cheers,
Usman
Hey Usman, your second approach is on the right track. You don't want
to have your end users interacting directly with HDFS. The latency is
too high, and it wasn't designed for this.
OTOH, running a script (a mapreduce,
hello!
Can you tell me precisely why I should use HBase?
Also, as my data is going to increase day by day, will I be able to search
for a particular file or folder present in HDFS efficiently and in a fast
manner?
I have written a small Java program using the FileSystem API of Hadoop, and in
Hi Group,
I have data to be analyzed and I would like to dump this data to Hadoop from
machine.X, whereas Hadoop is running on machine.Y. After dumping this data,
I would like to initiate a job, get the data analyzed, and get the
output information back to machine.X.
I would like to do
Hi,
Are there any instructions on how to build Hadoop from source? Now that the
project seems to have been split into separate projects (common, hdfs, and
mapreduce), there are 3 separate repositories under svn. Information on this
page is no longer correct:
Use ant jar if you want the jar file.
2009/7/9 Harish Mallipeddi harish.mallipe...@gmail.com:
Hi,
Are there any instructions on how to build Hadoop from source? Now that the
project seems to have been split into separate projects (common, hdfs, and
mapreduce), there are 3 separate repositories
-- Forwarded message --
From: Sugandha Naolekar sugandha@gmail.com
Date: Thu, Jul 9, 2009 at 1:41 PM
Subject: how to compress..!
To: core-u...@hadoop.apache.org
Hello!
How can I compress data using the Hadoop APIs?
I want to write Java code to compress the core files (the
The simplest way is to swap the key and value in your mapper's output, then
swap them back afterward.
On Thu, Jul 9, 2009 at 7:52 AM, Marcus Herou marcus.he...@tailsweep.com wrote:
Hi many times I want to sort by value instead of key.
For instance when counting the top used tags in blog posts
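The swap trick can be sketched in plain Java, outside Hadoop, to show the idea (the class and method names here are illustrative, not from the thread):

```java
import java.util.*;

public class SortByValue {
    // "Mapper" step: swap (tag, count) to (count, tag) so the count becomes
    // the sort key; in a real job the shuffle would then sort by that key.
    // "Reducer" step: swap back, so the output is (tag, count) ordered by count.
    public static List<String> topTags(Map<String, Integer> tagCounts) {
        List<Map.Entry<Integer, String>> swapped = new ArrayList<>();
        for (Map.Entry<String, Integer> e : tagCounts.entrySet())
            swapped.add(new AbstractMap.SimpleEntry<>(e.getValue(), e.getKey()));
        swapped.sort((a, b) -> b.getKey().compareTo(a.getKey())); // descending count
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> e : swapped)
            out.add(e.getValue() + "\t" + e.getKey()); // swap back: tag, count
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("hadoop", 42);
        counts.put("java", 17);
        counts.put("pig", 99);
        topTags(counts).forEach(System.out::println);
    }
}
```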
Thanks a lot Jason. My copy of that book is on the way, so soon I will be
able to use it.
Pankil
On Thu, Jul 9, 2009 at 1:54 AM, jason hadoop jason.had...@gmail.com wrote:
In the example code from Pro Hadoop, there is a sample MapReduce job that uses
a map-side join to merge the files into a single
You don't mention what size cluster you have, but we use a relatively small
cluster and index hundreds of GB in an hour to few hours (depending on the
content and the size of the cluster). So your results are anomalous.
However, we wrote our own indexer. The way it works is that documents are
Hi,
Does anyone have a hint on how to implement a SORT-MERGE JOIN using the
map-reduce paradigm?
I read an article about it on the Pig wiki but did not get clarity, as it
doesn't show the approach in the form of map and reduce steps.
Pankil
This approach may not combine the results by key before they reach the reduce function.
-Original Message-
From: miles...@gmail.com [mailto:miles...@gmail.com] On Behalf Of Miles
Osborne
Sent: Thursday, July 09, 2009 10:03 AM
To: common-user@hadoop.apache.org
Subject: Re: Sort by value
if you have
Hi Pankil,
Basically there are two steps here - the first is to sort the two files.
This can be done using a MapReduce job where the mapper extracts the join
column as the key.
If you make sure you have the same number of reducers (and partition by the
equijoin column) for both sorts, then you'll end
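The merge step can be sketched in plain Java (no Hadoop), assuming both inputs are already sorted by the join key, as the two sort jobs above would leave them; all names here are illustrative:

```java
import java.util.*;

public class SortMergeJoin {
    // Joins two lists of (key, value) pairs that are already sorted by key.
    // Emits one output row per matching (left, right) pair, like an equijoin.
    public static List<String> join(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) i++;          // left key is behind: advance left
            else if (cmp > 0) j++;     // right key is behind: advance right
            else {
                // Same key: pair every left row with every right row sharing it.
                String key = left.get(i)[0];
                int jStart = j;
                while (i < left.size() && left.get(i)[0].equals(key)) {
                    for (j = jStart; j < right.size() && right.get(j)[0].equals(key); j++)
                        out.add(key + "\t" + left.get(i)[1] + "\t" + right.get(j)[1]);
                    i++;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.asList(
                new String[]{"u1", "alice"}, new String[]{"u2", "bob"});
        List<String[]> orders = Arrays.asList(
                new String[]{"u1", "book"}, new String[]{"u1", "pen"}, new String[]{"u3", "mug"});
        for (String row : join(users, orders)) System.out.println(row);
    }
}
```

In the actual MapReduce setting, the "advance the smaller side" loop corresponds to each reducer walking the two sorted, identically partitioned files side by side.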
A few comments before I answer:
1) Each time you send an email, we receive two emails. Is your mail client
misconfigured?
2) You already asked this question in another thread :). See my response
there.
Short answer:
Hey Ram.
The problem is I initialize these variables in the run function after
receiving the command-line arguments.
I want to access the same variables in the map function.
Is there a different way other than passing the variables through a Conf object?
-Hrishi
- Original Message
From:
The only way to do it as of now is through the conf object.
On Thu, Jul 9, 2009 at 11:35 AM, smarthr...@yahoo.co.in wrote:
Hey Ram.
The problem is I initialize these variables in the run function after
receiving the command-line arguments.
I want to access the same variables in the map function.
Hi Pankil,
Simply use the normal FileSystem APIs to open the side input. You can
construct a SequenceFile.Reader from a Path and use the normal methods
inside that class to do the reading of the records.
-Todd
On Thu, Jul 9, 2009 at 11:12 AM, Pankil Doshi forpan...@gmail.com wrote:
Dear Todd,
Hi Jothi,
We are trying to index around 245GB compressed data (~1TB uncompressed)
on a 9 node Hadoop cluster with 8 slaves and 1 master. In Map, we are
just parsing the files and passing the same to reduce. In Reduce, we are
indexing the parsed data in much the same style as Nutch.
When we ran the job,
I had the same issue with getting a static member filled out, so I used
the JobConf object to get variables I had stored in the run() method
from the command line. With the 0.19 api, the MapReduceBase class has a
public void configure( JobConf job ) method to override that will be
called before
You are basically re-inventing lots of capabilities that others have solved
before.
The idea of building an index that refers to files which are constructed by
progressive merging is very standard and very similar to the way that Lucene
works.
You don't say how much data you are moving, but I
Use the configuration object.
Remember that the outer class is replicated all across the known universe.
Your command line arguments only exist on your original machine.
On Thu, Jul 9, 2009 at 11:35 AM, smarthr...@yahoo.co.in wrote:
Hey Ram.
The problem is I initialize these variables in the
Yep figured that.
On Thu, Jul 9, 2009 at 7:09 PM, Owen O'Malley omal...@apache.org wrote:
You need two jobs:
1. map: line -> (line, 1); combiner/reducer: sum values, sorted by line
2. map: (line, count) -> (count, line); reducer: (count, line) -> (line, count)
So job 1 looks like word count and job 2
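The two jobs Owen describes can be mimicked in plain Java (no Hadoop; the method names are illustrative) to show how the chaining works:

```java
import java.util.*;
import java.util.stream.*;

public class TwoJobTopWords {
    // Job 1: word count. map: line -> (word, 1); combiner/reducer: sum per word.
    static Map<String, Long> countJob(List<String> lines) {
        return lines.stream()
                .flatMap(l -> Arrays.stream(l.split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    // Job 2: map: (word, count) -> (count, word); the shuffle would sort by
    // the new key, i.e. by count; reduce: swap back to (word, count).
    static List<Map.Entry<String, Long>> sortJob(Map<String, Long> counts) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "to do");
        for (Map.Entry<String, Long> e : sortJob(countJob(lines)))
            System.out.println(e.getKey() + "\t" + e.getValue());
    }
}
```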
Reduce tasks which require more than twenty minutes are not a problem. But
you must emit some data periodically to inform the rest of the system that
each reducer is still alive. Emitting a (k, v) output pair to the collector
will reset the timer. Similarly, calling Reporter.incrCounter() will
I've seen this behavior before with reduces going over 100% on big jobs.
What version of Hadoop are you using? I think there are some old bugs filed
for this if you search the Jira.
On Thu, Jul 9, 2009 at 5:31 PM, Aaron Kimball aa...@cloudera.com wrote:
Reduce tasks which require more than
Hi Matthew,
You can set the heap size for child jobs by calling
conf.set("mapred.child.java.opts", "-Xmx1024m") to get a gig of heap space.
That should fix the OOM issue in IsolationRunner. You can also change the
heap size used in Eclipse; if you go to Debug Configurations, create a new
To clarify, for all of the writers:
Store the values you wish to share with your map tasks in the JobConf
object.
In the configure method of your mapper class, unpack the variables and store
them in class fields of the mapper class.
Then use them as needed in the map method of your mapper class.
On
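A plain-Java sketch of this store-then-unpack pattern, using a Properties object to stand in for JobConf (it is not the real Hadoop class, and the key names are made up):

```java
import java.util.*;

public class SideDataMapper {
    private String separator;   // unpacked from the "conf" once per task
    private int threshold;

    // Mirrors Mapper.configure(JobConf): called once before any map() call.
    public void configure(Properties conf) {
        separator = conf.getProperty("myjob.separator", ",");
        threshold = Integer.parseInt(conf.getProperty("myjob.threshold", "0"));
    }

    // Uses the unpacked fields rather than re-reading the conf per record.
    // Emits the name when the count meets the threshold, else null.
    public String map(String value) {
        String[] parts = value.split(separator);
        return Integer.parseInt(parts[1]) >= threshold ? parts[0] : null;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();   // stands in for JobConf
        conf.setProperty("myjob.separator", ";");
        conf.setProperty("myjob.threshold", "10");
        SideDataMapper m = new SideDataMapper();
        m.configure(conf);
        System.out.println(m.map("alice;42")); // passes the threshold
        System.out.println(m.map("bob;3"));    // filtered out: prints null
    }
}
```

The point, as the answers above note, is that the command-line arguments only exist on the submitting machine; the conf travels with the job, and configure() is the hook where each task unpacks it.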
Hi Shravan,
By Hadoop client, I think he means the hadoop command-line program
available under $HADOOP_HOME/bin. You can either write a custom Java program
which directly uses the Hadoop APIs or just write a bash/python script which
will invoke this command-line app and delegate work to it.
-
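A minimal bash sketch of that delegation, assuming the hadoop binary is on the PATH; the HDFS paths, jar name, and driver class are all placeholders:

```shell
#!/usr/bin/env bash
# Delegate to the hadoop CLI from a script; all paths and names below
# are placeholders, and this assumes `hadoop` is on the PATH.
set -euo pipefail

INPUT=/user/me/input            # HDFS input path (placeholder)
OUTPUT=/user/me/output          # HDFS output path (placeholder)

hadoop fs -put localdata.txt "$INPUT"             # copy local data into HDFS
hadoop jar myjob.jar MyDriver "$INPUT" "$OUTPUT"  # submit the MapReduce job
hadoop fs -get "$OUTPUT/part-00000" result.txt    # fetch the result locally
```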