I believe that this is turned on by default (at least in 0.15).
On 4/16/08 10:57 AM, Milind Bhandarkar [EMAIL PROTECTED] wrote:
Yes. In hadoop, you can enable backup tasks by setting
mapred.speculative.execution to true.
- milind
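Set in driver code rather than in hadoop-site.xml, that looks roughly like this sketch against the old JobConf API (mapper/reducer classes and paths are omitted):

  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class SpeculativeJobDriver {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf();
      // Enable backup ("speculative") tasks for slow stragglers.
      conf.setBoolean("mapred.speculative.execution", true);
      // ... set mapper/reducer classes and input/output paths here ...
      JobClient.runJob(conf);
    }
  }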
On 4/16/08 8:07 AM, Chaman Singh Verma [EMAIL PROTECTED]
Would it be better to have lots of records arrive at the same reducer?
That has a simpler mechanism for ignoring data.
You can just add a (trivial) partition function in addition to your sort.
On 4/16/08 12:07 PM, Karl Wettin [EMAIL PROTECTED] wrote:
I have a job that out of a list with
That design is fine.
You should read your map in the configure method of the reducer.
There is a MapFile format supported by Hadoop, but they tend to be pretty
slow. I usually find it better to just load my hash table by hand. If you
do this, you should use whatever format you like.
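A rough sketch of loading a hash table by hand in the reducer's configure method; the side file name ("lookup.txt") and its tab-separated format are assumptions for illustration, not anything from this thread:

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Iterator;
  import java.util.Map;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  public class LookupReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<String, String>();

    // configure() runs once per task, before any reduce() call,
    // so the table is ready when the first key arrives.
    public void configure(JobConf job) {
      try {
        // "lookup.txt" is a hypothetical side file shipped with the job
        // (e.g. via the DistributedCache); format: key<TAB>value per line.
        BufferedReader in = new BufferedReader(new FileReader("lookup.txt"));
        String line;
        while ((line = in.readLine()) != null) {
          String[] parts = line.split("\t", 2);
          if (parts.length == 2) {
            lookup.put(parts[0], parts[1]);
          }
        }
        in.close();
      } catch (IOException e) {
        throw new RuntimeException("could not load lookup table", e);
      }
    }

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      String extra = lookup.get(key.toString());
      while (values.hasNext()) {
        // Emit the original value joined with whatever the table holds.
        output.collect(key, new Text(values.next() + "\t" + (extra == null ? "" : extra)));
      }
    }
  }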
On
it is
called before the reduce job.
I need to eliminate rows from the HashMap when all the keys are read.
Also, my concern is: if the dataset is large, will this HashMap approach work?
On Wed, Apr 16, 2008 at 10:07 PM, Ted Dunning [EMAIL PROTECTED] wrote:
That design is fine.
You should read your
Please include the Mahout sub-project when you report what you find. This
kind of dataset would be very helpful for that project as well.
And you might find something helpful there as well. The goal is to support
machine learning on hadoop.
On 4/15/08 8:29 AM, Chaman Singh Verma [EMAIL
script which will sequentially parse each line and iterate.
Thanks,
Senthil
-Original Message-
From: Ted Dunning [mailto:[EMAIL PROTECTED]
Sent: Monday, April 14, 2008 2:20 PM
To: core-user@hadoop.apache.org
Subject: Re: Reduce Output
Try using Text, Text as the output type
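In the old mapred driver API that is just this (sketch; the rest of the job setup is omitted):

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class TextOutputDriver {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf();
      conf.setOutputKeyClass(Text.class);    // reduce output key type
      conf.setOutputValueClass(Text.class);  // reduce output value type
      // ... mapper/reducer classes and input/output paths omitted ...
      JobClient.runJob(conf);
    }
  }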
Power law algorithms are ideal for this kind of parallelized problem.
The basic idea is that hub and authority style algorithms are intimately
related to eigenvector or singular value decompositions (depending on
whether the links are symmetrical). This also means that there is a close
On 4/15/08 11:59 AM, Chaman Singh Verma [EMAIL PROTECTED] wrote:
How does Google handle such a large matrix and solve it? Do they use the MapReduce
framework for this, or do they adopt a standard and reliable Message Passing
Interface/RPC, etc., for this task?
They use map-reduce.
What about the
Why do you want to do this perverse thing?
How does it help to have more than one datanode per machine? And what in
the world is better when you have 10?
On 4/15/08 12:53 PM, Cagdas Gerede [EMAIL PROTECTED] wrote:
I have a follow-up question,
Is there a way to programmatically configure
working on the Distributed File System part. I do not use the MR part,
and I need to run multiple processes to test some scenarios on the file
system.
On Tue, Apr 15, 2008 at 1:37 PM, Ted Dunning [EMAIL PROTECTED] wrote:
I have had no issues in scaling the number of datanodes. The location
Write an additional map-reduce step to join the data items together by
treating different input files differently.
OR
Write an additional map-reduce step that reads in your string values in the
map configuration method and keeps them in memory for looking up as you pass
over the output of your
So do you know of any class or method that I can use to have the values separated
by a space or any other separator?
Thanks,
Senthil
-Original Message-
From: Ted Dunning [mailto:[EMAIL PROTECTED]
Sent: Monday, April 14, 2008 12:47 PM
To: core-user@hadoop.apache.org
Subject: Re
, new IntWritable(sum));
}
}
-Original Message-
From: Ted Dunning [mailto:[EMAIL PROTECTED]
Sent: Monday, April 14, 2008 1:49 PM
To: core-user@hadoop.apache.org
Subject: Re: Reduce Output
The format of the reduce output is the responsibility of the reducer. You
Are you trying to read from mySQL?
If so, it isn't very surprising that you could get lower performance with
more readers.
On 4/9/08 7:07 PM, Nate Carlson [EMAIL PROTECTED] wrote:
Hey all,
We've got a job that we're running in both a development environment, and
out on EC2. I've been
Hadoop also does much better with spindles spread across many machines.
Putting 16 TB on each of two nodes is distinctly sub-optimal on many fronts.
Much better to put 0.5-2TB on 16-64 machines. With 2x1TB SATA drives, your
cost and performance are likely to both be better than two machines with
I haven't done a detailed comparison, but I have seen some effects:
A) RAID doesn't usually work really well on low-end machines compared to
independent drives. This would make me distrust RAID.
B) hadoop doesn't do very well, historically speaking, with more than one
partition if the
On 4/8/08 10:43 AM, Natarajan, Senthil [EMAIL PROTECTED] wrote:
I would like to try using Hadoop.
That is good for education, probably bad for run time. It could take
SECONDS longer to run (oh my).
Do you mean to write another MapReduce program which takes the output of the
first
Looks like it is up to me.
On 4/8/08 12:36 PM, Ian Tegebo [EMAIL PROTECTED] wrote:
The wiki has been down for more than a day, any ETA? I was going to search the
archives for the status, but I'm getting 403's for each of the Archive links on
the mailing list page:
input files into a single file, so that at the
end of the copy process, we will have as many files as there are machines
in the cluster.
Any thoughts on how I should proceed on this? Or if this is a good idea
at all?
Ted Dunning [EMAIL PROTECTED] wrote:
The split will depend
Are you implementing this for instruction or production?
If production, why not use Lucene?
On 4/3/08 6:45 PM, Aayush Garg [EMAIL PROTECTED] wrote:
Hi Amar, Theodore, Arun,
Thanks for your reply. Actually I am new to hadoop so I can't figure out much.
I have written following code for
Take a look at the way that the text input format moves to the next line
after a split point.
There are a couple of possible problems with your input format not found
problem.
First, is your input in a package? If so, you need to provide a complete
name for the class.
Secondly, you have to
On 4/4/08 10:18 AM, Francesco Tamberi [EMAIL PROTECTED] wrote:
Thanks for your fast reply!
Ted Dunning wrote:
Take a look at the way that the text input format moves to the next line
after a split point.
I'm not sure I understand... is my way correct, or are you suggesting
, Ted Dunning [EMAIL PROTECTED] wrote:
Are you implementing this for instruction or production?
If production, why not use Lucene?
On 4/3/08 6:45 PM, Aayush Garg [EMAIL PROTECTED] wrote:
Hi Amar, Theodore, Arun,
Thanks for your reply. Actually I am new to hadoop so I can't figure out
You can overwrite it, but you can't update it. Soon you will be able to
append to it, but you won't be able to do any other updates.
On 4/2/08 11:39 PM, Garri Santos [EMAIL PROTECTED] wrote:
Hi!
I'm starting to take a look at hadoop and the whole HDFS idea. I'm wondering
if it's just fine
That depends on where the file is. If you are reading a file on a normal
file system, you use normal Java functions. If you are reading a file from
HDFS, you use hadoop functions.
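A minimal sketch of the HDFS case, assuming the path handed in already exists in DFS:

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ReadFromDfs {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();   // picks up hadoop-site.xml on the classpath
      FileSystem fs = FileSystem.get(conf);       // the DFS client, not the local file system
      Path p = new Path(args[0]);                 // e.g. /user/someone/input.txt (hypothetical)
      BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(p)));
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
      in.close();
    }
  }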
On 4/3/08 1:22 AM, Jeremy Chow [EMAIL PROTECTED] wrote:
Hi list,
If I define a method named configure in a
The easiest way is to package all of your code (classes and jars) into a
single jar file which you then execute. When you instantiate a JobClient
and run a job, your jar gets copied to all necessary nodes. The machine you
use to launch the job need not even be in the cluster, just able to see
Interesting you should say this.
I have been using this exact example (slightly modified) as an interview
question lately. I have to admit I stole it from Doug's Hadoop slides.
If you have a 1TB database with 100 B records and you want to update 1% of
them, how long will it take?
Assume for
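One plausible back-of-the-envelope, with assumed hardware numbers that are not from the original question (10 ms per random seek, roughly 100 MB/s of sequential transfer): 1 TB of 100 B records is about 10^10 records, so 1% is 10^8 updates. Done as random seeks, that is 10^8 x 10 ms = 10^6 seconds, on the order of eleven days; streaming the full 1 TB and rewriting it takes about 10^12 B / 10^8 B/s = 10^4 seconds, a few hours. That gap is presumably the point of the exercise.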
I would expect that most file systems can saturate the disk bandwidth for
the large sequential reads that hadoop does.
We use ext3 with good results.
On 4/1/08 8:08 AM, Colin Freas [EMAIL PROTECTED] wrote:
Is the performance of Hadoop impacted by the underlying file system on the
nodes at
But wildcards that match directories that contain files work well.
On 4/1/08 10:41 AM, Peeyush Bishnoi [EMAIL PROTECTED] wrote:
Hello ,
No, Hadoop can't traverse recursively inside subdirectories with a Java Map-Reduce
program. It has to be just a directory containing files (and no
Are you missing a colon on the first command?
Probably just a transcription error when you composed your email (but I have
made similar mistakes often enough and been unable to see them).
On 4/1/08 1:18 PM, Prasan Ary [EMAIL PROTECTED] wrote:
Just to make sure that I am specifying the
Try opening the desired output file in the reduce method. Make sure that
the output files are relative to the correct task specific directory (look
for side-effect files on the wiki).
On 4/1/08 5:57 PM, Ashish Venugopal [EMAIL PROTECTED] wrote:
Hi, I am using Hadoop streaming and I am
that indicates that you can...
Ashish
On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning [EMAIL PROTECTED] wrote:
Try opening the desired output file in the reduce method. Make sure that
the output files are relative to the correct task specific directory (look
for side-effect files on the wiki
Hadoop can't split a gzipped file so you will only get as many maps as you
have files.
Why the obsession with hadoop streaming? It is at best a jury-rigged
solution.
On 3/31/08 3:12 PM, lin [EMAIL PROTECTED] wrote:
Does Hadoop automatically decompress the gzipped file? I only have a single
This seems a bit surprising. In my experience well-written Java is
generally just about as fast as C++, especially for I/O bound work. The
exceptions are:
- java startup is still slow. This shouldn't matter much here because you
are using streaming anyway so you have java startup + C
++ and it is easier to migrate to hadoop streaming. Also we have very
strict performance requirements. Java seems to be too slow. I rewrote the
first program in Java and it runs 4 to 5 times slower than the C++ one.
On Mon, Mar 31, 2008 at 3:15 PM, Ted Dunning [EMAIL PROTECTED] wrote
My experiences with Groovy are similar. Noticeable slowdown, but quite
bearable (almost always better than 50% of best attainable speed).
The highest virtue is that simple programs become simple again. Word count
is 5 lines of code.
On 3/31/08 6:10 PM, Colin Evans [EMAIL PROTECTED]
Yes.
The present work-arounds for this are pretty complicated.
option 1) you can write small files relatively frequently and every time you
write some number of them, you can concatenate them and delete them. These
concatenations can receive the same treatment. If managed carefully in
We evaluated several options for just this problem and eventually settled on
MogileFS. That said, Mogile needed several weeks of work to get it ready
for prime time. It will work pretty well for modest sized collections, but
for our stuff (many hundreds of millions of files, approaching PB of
PROTECTED] wrote:
might be off-topic but how would you compare GlusterFS to HDFS and
MogileFS for such an application? Did you look at that at all and
decide against it?
Ted Dunning wrote:
We evaluated several options for just this problem and eventually settled on
MogileFS. That said, Mogile
It depends on the failure.
For some failure modes, the disk just becomes very slow.
On 3/26/08 4:39 PM, Cagdas Gerede [EMAIL PROTECTED] wrote:
I was wondering
1) what happens if a data node is alive but its hard drive fails? Does it
throw an exception and die?
2) If it continues to run
Copy from a machine that is *not* running as a data node in order to get
better balancing. Using distcp may also help because the nodes actually
doing the copying will be spread across the cluster.
You should probably be running a rebalancing script as well if your nodes
have differing sizes.
Map-reduce excels at gluing together files like this.
The map phase selects the key and makes sure that you have some way of
telling what the source of the record is.
The reduce phase takes all of the records with the same key and glues them
together. It can do your processing, but it is also
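The map half of that glue step can be as simple as the sketch below; the "A"/"B" tags, the file-name test, and the assumption that the join key is the first tab-separated field are all just for illustration:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class TaggingJoinMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private String tag;   // remembers which source file this map task is reading

    public void configure(JobConf job) {
      // "map.input.file" carries the current input file for file-based splits.
      String file = job.get("map.input.file", "");
      tag = file.contains("orders") ? "A" : "B";   // hypothetical file naming
    }

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // Assume the join key is the first tab-separated field.
      String[] fields = line.toString().split("\t", 2);
      if (fields.length == 2) {
        // Key by the join field; prefix the value with a hint about its source.
        output.collect(new Text(fields[0]), new Text(tag + "\t" + fields[1]));
      }
    }
  }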
I hate to point this out, but losing *any* data node will decrease the
replication of some blocks.
On 3/24/08 4:53 PM, lohit [EMAIL PROTECTED] wrote:
Improves performance on the basis that files are copied locally on
that node, so there is no need for network transmission. But isn't that
policy
Also, streaming is not likely to be the fastest way to solve your problem
because it introduces quite a bit more copying and, even worse, context
switches into the process (java moves the data, passes it to the mapper,
reads the results). I have seen a comment that there were flushes being
done
I think that a custom partitioner is half of the answer. The other half is
that the reducer can open and close output files as needed. With the
partitioner, only one file need be kept open at a time. It is good practice
to open the files relative to the task directory so that process failure
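A minimal partitioner of the kind being described, assuming Text keys whose first tab-separated field is what you want to group output files by:

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  public class FirstFieldPartitioner implements Partitioner<Text, Text> {

    public void configure(JobConf job) {
      // nothing to configure in this sketch
    }

    public int getPartition(Text key, Text value, int numPartitions) {
      // Partition on the first tab-separated field only, so every record
      // sharing that field lands in the same reducer (and hence one file).
      String first = key.toString().split("\t", 2)[0];
      return (first.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }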
I think the original request was to limit the sum of maps and reduces rather
than limiting the two parameters independently.
Clearly, with a single job running at a time, this is a non-issue since
reducers don't do much until the maps are done. With multiple jobs it is a
bit more of an issue.
Also see my comment about side effect files.
Basically, if you partition on date, then each set of values in the reduce
will have the same date. Thus the reducer can open a file, write the
values, close the file (repeat).
This gives precisely the effect you were seeking.
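Concretely, with dates as reduce keys, that open/write/close loop might look like the sketch below; the per-date file naming is made up, and FileOutputFormat.getWorkOutputPath is assumed to be available for the task-specific directory:

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  public class FilePerDateReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {

    private JobConf job;

    public void configure(JobConf job) {
      this.job = job;
    }

    public void reduce(Text date, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // One file per date, created under the task's work directory so that
      // speculative or failed attempts do not trample each other.
      Path dir = FileOutputFormat.getWorkOutputPath(job);
      FSDataOutputStream out = FileSystem.get(job).create(new Path(dir, date.toString()));
      while (values.hasNext()) {
        out.write((values.next().toString() + "\n").getBytes());
      }
      out.close();   // closed before the next date arrives; only one file open at a time
    }
  }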
On 3/18/08 6:17 PM,
Replication is vital in large or even medium-sized clusters for reliability.
Replication also helps distribution.
On 3/17/08 2:48 AM, Alfonso Olias Sanz [EMAIL PROTECTED]
wrote:
But what I wanted to say was that we need to set up a
cluster in a way that the data is distributed among all the
This sounds very different from your earlier questions.
If you have a moderate (10's to 1000's) number of binary files, then it is
very easy to write a special purpose InputFormat that tells hadoop that the
file is not splittable. This allows you to add all of the files as inputs
to the map
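The unsplittable-file trick is a one-method override in the old API; this sketch reuses TextInputFormat's record reader just to keep it short, while a real binary format would supply its own:

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.TextInputFormat;

  public class WholeFileInputFormat extends TextInputFormat {
    // Returning false tells the framework never to split a file,
    // so each input file becomes exactly one map task.
    protected boolean isSplitable(FileSystem fs, Path file) {
      return false;
    }
  }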
it is not a good idea to separate them out. I was just
wondering if it is possible at all. Thanks!
Ted Dunning wrote:
It is quite possible to do this.
It is also a bad idea.
One of the great things about map-reduce architectures is that data is near
the computation so that you don't have
Identity reduce is nice because the result values can be sorted.
On 3/12/08 8:21 AM, Jason Rennie [EMAIL PROTECTED] wrote:
Map could perform all the dot-products, which is the heavy lifting
in what we're trying to do. Might want to do a reduce after that, not
sure...
Ahhh...
There is an old saying for this. I think you are pulling fly specks out of
pepper.
Unless your input format is very, very strange, doing the split again for
two jobs does, indeed, lead to some small inefficiency, but this cost should
be so low compared to other inefficiencies that you
factor using hadoop dfs?
Chris
On Wed, Mar 12, 2008 at 6:36 PM, Ted Dunning [EMAIL PROTECTED] wrote:
What about just taking down half of the nodes and then loading your data
into the remainder? Should take about 20 minutes each time you remove nodes
but only a few seconds each time you
Yes. Each task is launching a JVM.
Map reduce is not generally useful for real-time applications. It is VERY
useful for large scale data reductions done in advance of real-time
operations.
The basic issue is that the major performance contribution of map-reduce
architectures is large scale
Would you be interested in the grool extension to Groovy described in the
attached README?
I am looking for early collaborators/guinea pigs.
On 3/11/08 1:43 PM, Jason Rennie [EMAIL PROTECTED] wrote:
Have been working my way through the Map-Reduce tutorial. Just got the
WordCount example
Amar's comments are a little strange.
Replication occurs at the block level, not the file level. Storing data in
a small number of large files or a large number of small files will have
less than a factor of two effect on number of replicated blocks if the small
files are 64MB. Files smaller
Have you looked at hbase? It looks like you are trying to reimplement a
bunch of it.
On 3/10/08 11:01 AM, Richard K. Turner [EMAIL PROTECTED] wrote:
... [storing data in columns is nice] ... I would also do the same for dir
csv_file2. Does anyone know how to do this
in Hadoop?
for map ?
Thanks, Naama
On Thu, Mar 6, 2008 at 6:02 PM, Ted Dunning [EMAIL PROTECTED] wrote:
This is not difficult to do. Simply open an extra file in the reducer's
configure method and close it in the close method. Make sure you make it
relative to the map reduce output directory so
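As a sketch, open-in-configure / close-in-close has this shape; the extra file name is hypothetical, and getWorkOutputPath stands in for "relative to the map reduce output directory":

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  public class ExtraFileReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {

    private FSDataOutputStream extra;

    public void configure(JobConf job) {
      try {
        // Opened once per task; lives alongside the regular part-XXXXX output.
        Path dir = FileOutputFormat.getWorkOutputPath(job);
        extra = FileSystem.get(job).create(new Path(dir, "extra-output"));
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      while (values.hasNext()) {
        Text v = values.next();
        output.collect(key, v);                            // normal channel
        extra.write((key + "\t" + v + "\n").getBytes());   // second channel
      }
    }

    public void close() throws IOException {
      extra.close();
    }
  }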
I thought so as well until I reflected for a moment.
But if you include the top N from every combiner, then you are guaranteed to
have the global top N in the output of all of the combiners.
On 3/6/08 11:50 PM, Owen O'Malley [EMAIL PROTECTED] wrote:
On Mar 6, 2008, at 5:02 PM, Ted Dunning
This is not difficult to do. Simply open an extra file in the reducer's
configure method and close it in the close method. Make sure you make it
relative to the map reduce output directory so that you can take advantage
of all of the machinery that handles lost jobs and such.
Search the
You can use System.out if you like and then look at the results as each map
or reduce completes via the web administration tool.
Also, you can use counters via the reporter passed to your map and reduce
classes to get immediate feedback.
On 3/6/08 8:12 AM, Prasan Ary [EMAIL PROTECTED] wrote:
You can definitely use the approach that you suggest and you should have
good results if you are looking for only a small fraction of the file.
Basically, you should have the record reader check to see if any interesting
records exist in the current split and if so, read them and if not, just
The right answer really depends on your workload and what your needs and
goals are.
You say that this is a research lab. If you are researching parallel
algorithms, then I would recommend much higher parallelism.
If you are working on problems where you want throughput, then the answer
may be
Just use a standard rebalancing script and the empty node will fill in
quickly enough.
The most common approach to rebalancing is to iterate through the files in
your system and increase the replication substantially for about a minute
and then drop it back down. It helps to overlap the time
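Scripted against the FileSystem API, the bump-then-drop idea looks roughly like this (the replication factors and the single-directory walk are just examples; a real script would cover the whole tree and pace itself):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BumpReplication {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      Path dir = new Path(args[0]);              // directory whose files should be re-spread
      FileStatus[] files = fs.listStatus(dir);
      if (files == null) {
        return;                                  // nothing there
      }
      for (FileStatus stat : files) {
        if (!stat.isDir()) {
          fs.setReplication(stat.getPath(), (short) 6);   // temporarily over-replicate
        }
      }
      Thread.sleep(60 * 1000L);                  // give the namenode a minute to spread copies
      for (FileStatus stat : files) {
        if (!stat.isDir()) {
          fs.setReplication(stat.getPath(), (short) 3);   // drop back to the normal factor
        }
      }
    }
  }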
Yes.
Use the configure method which is called each time a new file is used in the
map. Save the file name in a field of the mapper.
The other alternative is to derive a new InputFormat that remembers the
input file name.
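A sketch of the configure() approach; "map.input.file" is the job property the framework fills in for file-based splits in these releases, so treat the exact name as an assumption on much older or newer versions:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class FileNameAwareMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private String inputFile;   // saved once per task in configure()

    public void configure(JobConf job) {
      inputFile = job.get("map.input.file");   // full path of the split's file
    }

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // Example use: emit the source file alongside each record.
      output.collect(new Text(inputFile), line);
    }
  }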
On 3/4/08 5:38 PM, Tarandeep Singh [EMAIL PROTECTED] wrote:
Hi,
I
Just call reporter.incrCounter(specificEnumValueOfSomeKind, n)
where the first argument is some enum value. The framework will work out
that it is your enum and put it in a box of its own along with any other
values.
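For example (the enum and the malformed-record test are made up):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class CountingMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {

    // The framework groups these counters under the enum's class name.
    enum RecordCounters { GOOD, MALFORMED }

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      if (line.toString().indexOf('\t') < 0) {
        reporter.incrCounter(RecordCounters.MALFORMED, 1);   // visible in the web UI per job
        return;
      }
      reporter.incrCounter(RecordCounters.GOOD, 1);
      output.collect(new Text(line.toString().split("\t", 2)[0]), new LongWritable(1));
    }
  }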
On 3/2/08 7:46 PM, dennis81 [EMAIL PROTECTED] wrote:
Hi,
I was
In our case, we looked at the problem and decided that Hadoop wasn't
feasible for our real-time needs in any case. There were several issues,
- first of all, map-reduce itself didn't seem very plausible for real-time
applications. That left hbase and hdfs as the capabilities offered by
hadoop
are
only writing 1MB/s. If you need a day of buffering (about 100,000 seconds), then
you need 100GB of buffer storage. These are very, very moderate
requirements for your ingestion point.
On 2/29/08 11:18 AM, Steve Sapovits [EMAIL PROTECTED] wrote:
Ted Dunning wrote:
In our case, we looked
This is exactly what we do as well. We also have auto-detection for
modifications and downstream processing so that back-filling in the presence of
error correction is possible (the errors can be old processing code or file
munging).
On 2/28/08 6:06 PM, Joydeep Sen Sarma [EMAIL PROTECTED]
Have you tried using http to fetch the file instead?
http://name-node-and-port/data/file-path
This will get redirected to one of the datanodes to handle and should be
pretty fast. It would be interesting to find out if this alternative path
is subject to the same hangs that you are seeing.
Ooops. Should have read the rest of your posting. Sorry about the noise.
On 2/27/08 12:05 PM, C G [EMAIL PROTECTED] wrote:
Hi All:
The following write-up is offered to help out anybody else who has seen
performance problems and hangs while using dfs -copyToLocal/-cat.
One
Joins are easy.
Just reduce on a key composed of the stuff you want to join on. If the data
you are joining is disparate, leave some kind of hint about what kind of
record you have.
The reducer will be iterating through sets of records that have the same
key. This is similar to the results
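The reduce half of such a join, continuing the record-hint idea (the "A"/"B" tags and the tab-separated layout are only for illustration):

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.Iterator;
  import java.util.List;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  public class HintedJoinReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // Values for one key arrive mixed from both sources; the leading
      // tag written by the mapper says which side each record came from.
      List<String> left = new ArrayList<String>();
      List<String> right = new ArrayList<String>();
      while (values.hasNext()) {
        String[] tagged = values.next().toString().split("\t", 2);
        if (tagged.length < 2) {
          continue;   // no payload; skip malformed records in this sketch
        }
        if ("A".equals(tagged[0])) {
          left.add(tagged[1]);
        } else {
          right.add(tagged[1]);
        }
      }
      // Emit the cross product of the two sides for this key (an inner join).
      for (String l : left) {
        for (String r : right) {
          output.collect(key, new Text(l + "\t" + r));
        }
      }
    }
  }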
But this only guarantees that the results will be sorted within each
reducer's input. Thus, this won't result in getting the results sorted by
the reducer's output value.
On 2/21/08 8:40 PM, Owen O'Malley [EMAIL PROTECTED] wrote:
On Feb 21, 2008, at 5:47 PM, Ted Dunning wrote:
It may
Sorry to be picky about the math, but 1 Trillion = 10^12 = million million.
At 10 links per page, this gives 100 x 10^9 pages, not 1 x 10^9. At 100
links per page, this gives 10B pages.
On 2/19/08 2:25 PM, Peter W. [EMAIL PROTECTED] wrote:
Amazing milestone,
Looks like Y! had
There is a kill job link at the bottom of the map-reduce admin panel for the
job.
Did that not work?
On 2/12/08 10:33 PM, Jim the Standing Bear [EMAIL PROTECTED] wrote:
What is the best way to kill a bad job (e.g. an infinite loop)? The
job I was running went into an infinite loop and I
But that map will have to read the file again (and is likely to want a
different key than the reduce produces).
On 2/12/08 12:33 PM, Miles Osborne [EMAIL PROTECTED] wrote:
You may well have another Map operation operate over the Reducer output, in
which case you'd want key-value pairs.
Welcome to the club.
The good news is that if you give the output collector a null key, it will
just output the data in the value argument and ignore the key entirely.
Occasionally, the distinction is useful to avoid constructing yet another
temporary data structure to hold a tuple. Word
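In code the null-key trick is just this (a sketch against the old API, relying on the behavior described above):

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  public class ValueOnlyReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      while (values.hasNext()) {
        // Null key: per the behavior described above, only the value line
        // ends up in the output file.
        output.collect(null, values.next());
      }
    }
  }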
It isn't popular much anymore, but once upon a time, network topology for
clustering was a big topic. Since then, switches have gotten pretty fast
and worrying about these things has gone out of fashion a bit other than
something on the level of the current rack-aware locality in Hadoop.
With 4
Doesn't the incremental CPU cost you as much as an entire extra box?
On 2/12/08 12:19 PM, Colin Evans [EMAIL PROTECTED] wrote:
The big question for me is how well a dual-CPU 4-core (8 cores per box)
configuration will do. Has anyone tried out this configuration with
Intel or AMD CPUs? Is
I would concur that it is much better to have sufficient storage in the
compute farm for DFS files to be local for the compute tasks.
Also, a 16 disk machine typically costs a good bit more than a 6 disk
machine + 10 disks because you usually require a second chassis. Sun's
Thumper would be an
Why not down-grade the CPU power and increase the number of chassis to get
more disks (and controllers and network interfaces)?
On 2/12/08 12:53 PM, Jason Venner [EMAIL PROTECTED] wrote:
We have 3 types of machines we can get, 2 disk, 6 disk and 16 disk
machines. *They all have 4 dual core
I have had issues with machines that are highly disparate in terms of disk
space. I expect that some of those issues have been mitigated in recent
releases.
On 2/12/08 11:51 AM, Jason Venner [EMAIL PROTECTED] wrote:
We are starting to build larger clusters, and want to better understand
Jeff,
Doesn't the reducer see all of the data points for each cluster (canopy) in
a single list?
If so, why the need to output during close?
If not, why not?
On 2/11/08 12:24 PM, Jeff Eastman [EMAIL PROTECTED] wrote:
Hi Owen,
Thanks for the information. I took Ted's advice and
You should be looking at HDFS (part of hadoop) plus hbase or code that you
write yourself.
Hadoop is built in two parts. One part is the distributed file system that
provides replication and similar functions. You can access this file system
pretty easily from Java. Your requirements are
in
parallel on many files as possible. This way I would be able to return a
result faster than if I had used one machine.
Is there a way to tell which files are in memory?
On Feb 10, 2008 10:33 PM, Ted Dunning [EMAIL PROTECTED] wrote:
But if your files DO fit into memory
You got it exactly.
On 2/10/08 5:08 PM, Jeff Eastman [EMAIL PROTECTED] wrote:
mapper assigns points to clusters, combiner computes partial centroids,
reducer computes final centroids... Using a combiner in this manner would
avoid [outputting data in the mapper's close method].
Did I get
Hmmm
I think that computing centroids in the mapper may not be the best idea.
A different structure that would work well is to use the mapper to assign
data records to centroids and use the centroid number as the key for the
reduce. Then the reduce itself can compute the centroids.
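A compact sketch of that structure; the value encoding and the two-dimensional case are assumptions for illustration, not anyone's actual code from this thread:

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  public class KMeansStep {

    // Value encoding used by this sketch: "count,sumX,sumY" is a partial sum
    // of `count` two-dimensional points. A mapper (not shown) would emit
    // (nearestCentroidId, "1,x,y") for each input point.

    /** Combiner: merges partial sums so little data crosses the network. */
    public static class PartialSumCombiner extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text clusterId, Iterator<Text> values,
                         OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
        long n = 0;
        double sx = 0, sy = 0;
        while (values.hasNext()) {
          String[] p = values.next().toString().split(",");
          n += Long.parseLong(p[0]);
          sx += Double.parseDouble(p[1]);
          sy += Double.parseDouble(p[2]);
        }
        out.collect(clusterId, new Text(n + "," + sx + "," + sy));
      }
    }

    /** Reducer: turns the merged partial sums into the new centroid per cluster. */
    public static class CentroidReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text clusterId, Iterator<Text> values,
                         OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
        long n = 0;
        double sx = 0, sy = 0;
        while (values.hasNext()) {
          String[] p = values.next().toString().split(",");
          n += Long.parseLong(p[0]);
          sx += Double.parseDouble(p[1]);
          sy += Double.parseDouble(p[2]);
        }
        out.collect(clusterId, new Text((sx / n) + "," + (sy / n)));
      }
    }
  }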
To: core-user@hadoop.apache.org
Subject: Re: Namenode fails to replicate file
Doesn't the -setrep command force the replication to be increased
immediately?
./hadoop dfs -setrep [replication] path
(I may have misunderstood)
On Thu, 2008-02-07 at 17:05 -0800, Ted Dunning wrote:
Chris Kline reported
-0800, Ted Dunning wrote:
Chris Kline reported a problem in early January where a file which had too
few replicated blocks did not get replicated until a DFS restart.
I just saw a similar issue. I had a file that had a block with 1 replica (2
required) that did not get replicated. I
I will see if I can replicate the problem and do as you suggest.
On 2/8/08 4:29 PM, Raghu Angadi [EMAIL PROTECTED] wrote:
Ted Dunning wrote:
That makes it wait, but I don't think it increases the urgency on the part
of the namenode.
As an interesting experiment, I had a cluster
map task, or does your suggestion already do this and this is a
moot point?
David
On Thu, 2008-02-07 at 09:39 -0800, Ted Dunning wrote:
Set numReducers to 0.
On 2/7/08 9:35 AM, David Alves [EMAIL PROTECTED] wrote:
Hi All
First of all since this is my first post I must say congrats
Set numReducers to 0.
On 2/7/08 9:35 AM, David Alves [EMAIL PROTECTED] wrote:
Hi All
First of all since this is my first post I must say congrats for the
great piece of software (both Hadoop and HBase).
I've been using Hadoop/HBase for a while and I have a question, let me
just explain a
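In driver code that is a single call (sketch; everything else about the job is omitted):

  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class MapOnlyDriver {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf();
      conf.setNumReduceTasks(0);   // zero reducers: map output goes straight to the output format
      // ... mapper class and input/output paths omitted ...
      JobClient.runJob(conf);
    }
  }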
Chris Kline reported a problem in early January where a file which had too
few replicated blocks did not get replicated until a DFS restart.
I just saw a similar issue. I had a file that had a block with 1 replica (2
required) that did not get replicated. I changed the number of required
I don't think anybody has figured out how to patent the Lanczos algorithm
itself!
On 2/6/08 10:03 AM, Peter W. [EMAIL PROTECTED] wrote:
Hello,
This Mahout project seems very interesting.
Any problem that has reducibility components
using mapreduce and can then be described as a
,
C G
Ted Dunning [EMAIL PROTECTED] wrote:
I am looking for a way for scripts to write data to HDFS without having to
install anything.
The /data and /listPaths URLs on the nameserver are ideal for reading
files, but I can't find anything comparable for writing files.
Am I missing
Very nice summary.
One of the issues that we have had with multiple search servers is that on
linux, there can be substantial contention for disk I/O. This means that as
a new index is being written, access to the current index can be stalled for
very long periods of time (sometimes 10s). This
The method you describe is the standard approach.
The benefit is that the data that arrives at the reducer might be larger
than you want to store in memory (for sorting by the reduce). Also, reading
the entire set of reduce values would increase the amount of data allocated
and would mean that
On 2/6/08 11:58 AM, Joydeep Sen Sarma [EMAIL PROTECTED] wrote:
But it actually adds duplicate data (i.e., the value column which needs
sorting) to the key.
Why? You can always take it out of the value to remove the redundancy.
Actually, you can't in most cases.
Suppose you have
If there is video recorded, please consider posting it on Veoh.
:-)
On 2/6/08 1:44 PM, Stefan Groschupf [EMAIL PROTECTED] wrote:
Hi Otis,
can you suggest a technology how we could do that? Skype? Ichat?
Something that is free?
I'm happy to set up a video conf, however there are no big
We have quite a few serving the load, but if we are trying to update
relatively often (say every 30 minutes), then having a server out of action
for several minutes really hurts. The outage is that long because you have
to
A) turn off traffic
B) wait for traffic to actually stop
C) move the