Hi
We currently have a cluster of 3 nodes, all of the same
configuration [4 CPU, 8 GB RAM, 2 TB]. The NameNode web UI shows that DFS
used % is 30% [nearly 700 GB used per node]. CPU utilization
remains continuously high on all nodes in the cluster. So, to fit
the needs of the growing number of jobs tha
Thanks for the reply.
It's nice to get more information about Hadoop, and I have also started looking
at Nutch.
The only thing I am not able to work out yet is whether we can integrate Hadoop
with Lucene or not, and whether there is any tutorial that will help me do this?
Thanks a lot again for your response and help.
Check MultipleOutputFormat and MultipleOutputs (the latter was
committed to trunk last week).
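For the original question (two collectors from one map), a minimal sketch of the MultipleOutputs route might look like the following; the class name, key/value types, and the "links" output name are all made up, and the named output would be registered in the driver with MultipleOutputs.addNamedOutput(conf, "links", TextOutputFormat.class, Text.class, Text.class):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

// Mapper that feeds two outputs: the regular collector plus a named
// output configured in the driver.
public class PageMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  private MultipleOutputs mos;

  public void configure(JobConf job) {
    mos = new MultipleOutputs(job);
  }

  public void map(Text url, Text html,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    output.collect(url, html);                     // the first mapping
    mos.getCollector("links", reporter)
        .collect(url, new Text("second mapping")); // the second mapping
  }

  public void close() throws IOException {
    mos.close();  // flush the named outputs
  }
}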
On Mon, Jul 14, 2008 at 11:49 PM, Khanh Nguyen <[EMAIL PROTECTED]> wrote:
> Hello,
>
> Is it possible to have more than one output collector for one map?
>
> My input are records of html pages. I am ma
Hi,
I'm wondering what the NameNode memory cost of a given
dfs.block.size would be for a fixed set of data. I know
it is determined by the # of blocks and # of replicas, but
how much memory does one block use in the NameNode?
In addition, what would be the pros/cons of a bigger/smaller
block size
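For a rough sense of scale (hedged: the ~150 bytes per namespace object -- file, directory, or block -- is a commonly quoted rule of thumb rather than an exact figure, the data and block sizes below are made-up assumptions, and replication adds some per-replica bookkeeping on top):

// Back-of-the-envelope only; all numbers here are assumptions.
public class NameNodeHeapEstimate {
  public static void main(String[] args) {
    long data = 1L << 40;         // assume 1 TB of raw data
    long blockSize = 64L << 20;   // assume dfs.block.size = 64 MB
    long blocks = data / blockSize;            // 16384 blocks
    long heapBytes = blocks * 150;             // ~2.4 MB of heap
    System.out.println(blocks + " blocks -> ~" + heapBytes
        + " bytes of NameNode heap for block objects");
    // Doubling the block size halves the block count and this memory;
    // the trade-off is fewer, larger map tasks and coarser parallelism.
  }
}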
-- Presently, my RecordReader converts XML strings from a file to a
MyWritable object
-- When readFields is called, RecordReader should provide the next
MyWritable object, if there is one
-- When write is called, MyWriter should write the objects out
Not quite. Your RecordReader may produce My
Yes; a combiner that emits a key that should go to a different
partition is incorrect. If this were legal, then the combiner output
would also need to be buffered, sorted, spilled, etc., effectively
requiring another map phase. The combiner's purpose is to decrease the
volume of data that n
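To make the partition rule concrete, here is a sketch of a legal combiner in the old mapred API (the class name is made up; it would be registered with conf.setCombinerClass(SumCombiner.class)):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// A well-behaved combiner: it shrinks the data (many values -> one sum)
// but always re-emits the key it was given, so no record ever moves to
// a different partition.
public class SumCombiner extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));  // same key in, same key out
  }
}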
I have some mapping jobs that are chained together, and would like to set the
inputs for them in an overridden configure(JobConf) method. When I try to
do it this way, though, I get an error like this:
aggregatorJob failed: java.io.IOException: No input paths specified in input
at
org.ap
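One plausible cause (hedged, since the stack trace is cut off): input splits are computed from the JobConf at submission time, so input paths set in a task-side configure(JobConf) arrive too late. Setting them in the driver before runJob(), as in this sketch with made-up class and path names, avoids that error:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class AggregatorDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(AggregatorDriver.class);
    conf.setJobName("aggregator");
    // Paths are set here, in the driver, before the job is submitted;
    // by the time configure(JobConf) runs in a task, the splits have
    // already been computed from whatever the JobConf held at submission.
    FileInputFormat.setInputPaths(conf, new Path("/data/aggregator/in"));
    FileOutputFormat.setOutputPath(conf, new Path("/data/aggregator/out"));
    JobClient.runJob(conf);
  }
}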
Hello Chris:
Thanks for the prompt reply!
So, to conclude from your note:
-- Presently, my RecordReader converts XML strings from a file to a MyWritable
object
-- When readFields is called, RecordReader should provide the next
MyWritable object, if there is one
-- When write is called, MyWriter shou
Hi there,
I read the code a bit, though I am not sure if I got it right. It
appears to me that when the mapper's in-memory buffer is full, it spills and
gets sorted by partition id and by key. Then, if a combiner is
defined, it will run on each partition. However, it seems that the
outputs of a
It's easiest to consider write as a function that converts your record
to bytes and readFields as a function restoring your record from
bytes. So it should be the case that:
MyWritable i = new MyWritable();
i.initWithData(some_data);
i.write(byte_stream);
...
MyWritable j = new MyWritable();
j.readFields(byte_stream);
// j is now equal to i
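Spelled out as a complete (hypothetical) Writable with made-up fields, the symmetry looks like this:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Hypothetical sketch: the whole contract is that readFields() restores,
// field by field and in the same order, exactly what write() emitted.
public class ServiceWritable implements Writable {
  private String serviceName = "";
  private long timestamp;

  public void write(DataOutput out) throws IOException {
    out.writeUTF(serviceName);   // field 1
    out.writeLong(timestamp);    // field 2
  }

  public void readFields(DataInput in) throws IOException {
    serviceName = in.readUTF();  // field 1, same order
    timestamp = in.readLong();   // field 2
  }
}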
Not quite; the intermediate output is written to the local disk of the
node executing the MapTask and fetched over HTTP by the ReduceTask. A
ReduceTask need only wait for a given MapTask to complete successfully
before fetching its output, but the reduce phase itself cannot start
before all MapTasks have finished. Th
I'm pretty sure that the reducer waits for all of the map tasks'
output to be written to HDFS (or else I see no use for the Combiner
class). I'm not sure about your second question, though. My gut tells
me "no"
On Jul 14, 2008, at 3:50 PM, Kevin wrote:
Hi, there,
I am interested in the
Weird. I use Eclipse, but that's never happened to me. When you set
up your JobConfs, for example:
JobConf conf2 = new JobConf(getConf(), MyClass.class);
is your "MyClass" in the same package as your driver program? Also, do
you run from Eclipse or from the command line? (I've never tried to
l
Hi, there,
I am interested in the implementation details of Hadoop mapred. In
particular, does the reducer wait until a map task ends and then fetch
its output (key-value pairs)? If so, is the file produced by a
mapper for the reducer sorted before the reducer gets it? (which means
that the reduce
Hi There!
I'm currently working on code for my own Writable object (called
ServiceWritable) and I've been working off LongWritable for this one. I was
wondering, however, about the following two functions:
public void readFields(java.io.DataInput in)
and
public void write(java.io.DataOutput out)
Thank you, Chris. This solves my questions.
-Kevin
On Mon, Jul 14, 2008 at 11:17 AM, Chris Douglas <[EMAIL PROTECTED]> wrote:
> "Yielding equal partitions" means that each input source will offer n
> partitions and for any given partition 0 <= i < n, the records in that
> partition are 1) sorted
Hi, I don't have the code sitting in front of me at the moment, but
I'll do some of it from memory and I'll post a real snippet tomorrow
night. Hopefully, this can get you started
public class MyMainClass {
public static void main(String[] args) {
ToolRunner.run(new Configu
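Until that real snippet arrives, here is a hedged completion of the skeleton; every class and path name below is made up, and the mapper/reducer wiring is omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Blocking-style chaining: each JobConf is configured and run to
// completion before the next job starts.
public class MyMainClass extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    // First job: reads the real input, writes an intermediate directory.
    JobConf conf1 = new JobConf(getConf(), MyMainClass.class);
    conf1.setJobName("pass-1");
    FileInputFormat.setInputPaths(conf1, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf1, new Path("/tmp/pass-1-out"));
    JobClient.runJob(conf1);  // blocks until the first job completes

    // Second job: consumes the first job's output.
    JobConf conf2 = new JobConf(getConf(), MyMainClass.class);
    conf2.setJobName("pass-2");
    FileInputFormat.setInputPaths(conf2, new Path("/tmp/pass-1-out"));
    FileOutputFormat.setOutputPath(conf2, new Path(args[1]));
    JobClient.runJob(conf2);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyMainClass(), args));
  }
}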
One cheap hack that comes to mind is to extend the GenericWritable and
ArrayWritable classes and write a second and third MapReduce job that
will both parse over your first job's output, and each will select for
the Key-Value pair it wants.
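For the GenericWritable half of that hack, a sketch (the class name and the two value types are made-up choices):

import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Wrapper that can carry either of two concrete value types through one
// job's output; downstream jobs call get() and check the runtime type.
public class EitherWritable extends GenericWritable {

  private static final Class[] TYPES = { Text.class, IntWritable.class };

  @SuppressWarnings("unchecked")
  protected Class<? extends Writable>[] getTypes() {
    return TYPES;
  }
}

In the mapper you would wrap a value with set(...), and the follow-up jobs unwrap it with get() plus an instanceof check to select the pair they want.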
Joman Chu
AIM: ARcanUSNUMquam
IRC: irc.liquid-silver.net
What's the behavior of giving FileSplit "null" for the hosts field in
the constructor? Will the framework figure out on its own which hosts
the data is on?
Thanks,
Nathan Marz
Well, that's what I need to do also... but Hadoop complains when I
attempt to do that. Are you using Eclipse, by any chance, to develop? The
error I'm getting seems to stem from the fact that Hadoop thinks I am
uploading a new jar for EVERY execution of JobClient.runJob(), so it fails
ind
Hi,
I have a couple of *basic* questions about Hadoop internals.
1) If I understood correctly, the ideal number of Reducers is equal to the number
of distinct keys (or custom Partitioners) emitted from all Mappers at a
given Map-Reduce iteration. Is that correct?
2) In configuration there can be s
hey sean,
i later learned that the method i originally posted (configuring
different JobConfs and then running them, blocking style, with
JobClient.runJob(conf)) was sufficient for my needs. the reason it was
failing before was somehow my fault and the bugs somehow got fixed x_X.
Lukas ga
Could you please provide some small code snippets elaborating on how you
implemented that? I have a similar need as the author of this thread and I
would appreciate any help. Thanks!
Cheers,
Sean
Joman Chu wrote:
>
> Hi, I use ToolRunner.run() for multiple MapReduce jobs. It seems to work
>
I think you may find a lot of information about Hadoop in general in
Hadoop's wiki: http://hadoop.apache.org/core/
Re. Hadoop and search, you might also want to take a look at Nutch:
http://lucene.apache.org/nutch/
In general, Hadoop allows one to store huge amounts of data on a cluster of
commodity
Hello,
Is it possible to have more than one output collector for one map?
My input is records of HTML pages. I am mapping each url to its
html-content and want to have two output collectors: one that maps
each --> and another one that maps
to something else (difficult to explain).
Please help
"Yielding equal partitions" means that each input source will offer n
partitions and for any given partition 0 <= i < n, the records in that
partition are 1) sorted on the same key 2) unique to that partition,
i.e. if a key k is in partition i for a given source, k appears in no
other parti
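In code, wiring the join package up might look like this hedged driver sketch (the paths are made up, and both inputs must already be sorted on the same key class and split into the same number of partitions, as described above):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class JoinDriverSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(JoinDriverSketch.class);
    // Inner join of two sorted, identically partitioned SequenceFile
    // datasets, performed before the map.
    conf.setInputFormat(CompositeInputFormat.class);
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", SequenceFileInputFormat.class,
        new Path("/data/source-a"), new Path("/data/source-b")));
    // The map then sees each key once, with a TupleWritable holding
    // the matching values from both sources.
  }
}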
Hello,
I am struggling to get Hadoop to pull input from an HTTP source, but so
far no luck. Is it even possible, given that in this case the input is
not placed in Hadoop's file system? Example code would be ideal.
Thanks.
-k
Hi,
I find limited information about this package, which looks like it could
do an "equi?" join: "Given a set of sorted datasets keyed with the same
class and yielding equal partitions, it is possible to effect a join
of those datasets prior to the map." What does "yielding equal
partitions" mean?
Than
Update on the Hadoop user group in the UK:
It will be hosted at Skills Matter in Clerkenwell, London on August 19.
We'll have presentations from both developers and users of Apache Hadoop.
The event is free and anyone is welcome, but we only have room for 60
people so make sure you're on the
On 7/14/08, Jason Venner <[EMAIL PROTECTED]> wrote:
> One benefit is that if your map or reduce behaves badly it can't
> take down
> the task tracker.
As the tracker JVM could also be monitored (and restarted) from
outside, the internal execution might still be worth looking into. At
least t
Well, I got it.
On 7/14/08, Jason Venner <[EMAIL PROTECTED]> wrote:
>
> One benefit is that if your map or reduce behaves badly it can't take down
> the task tracker.
>
> In our case we have some poorly behaved external native libraries we use,
> and we have to forcibly ensure that the child vms
This sounds like a good task for the Data Join code.
If you can set up so that all of your data is stored in MapFiles, with
the same type of key and the same partitioning setup and count, it will
go very well.
Mori Bellamy wrote:
Hey Amer,
It sounds to me like you're going to have to write yo
One benefit is that if your map or reduce behaves badly it can't take
down the task tracker.
In our case we have some poorly behaved external native libraries we
use, and we have to forcibly ensure that the child vms are killed when
the child main finishes (often by kill -9), so the fact the c
What are the benefits of such a design compared to multi-threading?
--
朱盛凯
Jash Zhu
复旦大学软件学院
Software School, Fudan University
Thanks for the response.
The problem I am facing here is that I don't have any clue about Hadoop. So
first I am trying to analyse whether I can integrate Hadoop with the
existing application developed using Lucene or not. I need some clue or
tutorial which talks about Hadoop integration with luce
If your InputFormat implements Configurable you'll get access to the
JobConf via the setConf(Configuration) method when Hadoop creates an
instance of your class.
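A minimal sketch of that pattern in the old mapred API (the line-oriented record reader is just filler so the class compiles):

import java.io.IOException;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Because this InputFormat implements Configurable, Hadoop hands it the
// JobConf through setConf() when it instantiates the class reflectively.
public class MyInputFormat extends FileInputFormat<LongWritable, Text>
    implements Configurable {

  private Configuration conf;

  public void setConf(Configuration conf) {
    this.conf = conf;  // called by the framework after construction
  }

  public Configuration getConf() {
    return conf;
  }

  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    reporter.setStatus(split.toString());
    return new LineRecordReader(job, (FileSplit) split);
  }
}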
On Mon, Jun 30, 2008 at 11:20 PM, Nathan Marz <[EMAIL PROTECTED]> wrote:
> Hello,
>
> Are there any plans to change the JobConf API so t
Hi bhupendra,
You may find these links helpful:
https://issues.apache.org/jira/browse/HADOOP-2951
http://www.mail-archive.com/[EMAIL PROTECTED]/msg00596.html
Naama
On Mon, Jul 14, 2008 at 8:37 AM, bhupendar <[EMAIL PROTECTED]> wrote:
>
> Hi all
>
> I have created a search engine using lucene t
You can use MultipleOutputFormat or MultipleOutputs (the latter was
committed to SVN a few days ago) for this.
Then you can use a filter on your input dir for the next jobs so only
files matching a given name/pattern are used.
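For example, assuming the first job wrote a named output called "links", the next job's driver might narrow its input with a glob; the "links-*" pattern is an assumption about the actual file names job one produced:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class NextJobDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(NextJobDriver.class);
    // Only files from the "links" named output feed this job; adjust
    // the glob to whatever the first job actually wrote.
    FileInputFormat.setInputPaths(conf, new Path("/output/job1/links-*"));
    // ... remaining job setup as usual ...
  }
}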
A
On Fri, Jul 11, 2008 at 8:54 PM, Jason Venner <[EMAIL PROTECTED]> wrot