Hi Steve,
I'll second David's opinion (if I understand it correctly ;-) that importing your data
first and doing interesting processing later would be a good idea.
Of course you can read your data from NFS during your actual processing, as
long as all of your TaskTrackers have access to the NFS. Keep
.com/adamjshook/mapreducepatterns/blob/master/MRDP/src/main/java/mrdp/ch5/CartesianProduct.java
will
not work for a dataset with 100 million items
Any bright ideas?
--
Christoph Schmitz
Software Architect
Targeting Core Product
1&1 Internet AG | Brauerstraße 50 | 76135 Karlsruhe
Hi Dave,
I haven't actually done this in practice, so take this with a grain of
salt ;-)
One way to circumvent your problem might be to add entropy to the keys,
i.e., if your keys are "a", "b", etc. and you have too many "a"s and too
many "b"s, you could inflate your keys randomly to be (a, 1), ..., (a, n),
so that the load for each hot key is spread over several reducers.
Hi Hamid,
I'm not sure if I understand your question correctly, but I think this is
exactly what the standard workflow in a Hadoop application looks like:
Job job1 = new Job(...);
// setup job, set Mapper and Reducer, etc.
job1.waitForCompletion(...); // at this point, the cluster will run job1
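If (as I suspect) the point is to chain a second job after the first one, a
sketch could look like this (the class names and paths are placeholders):

Configuration conf = new Configuration();

Job job1 = new Job(conf, "first pass");
job1.setJarByClass(MyDriver.class);        // placeholder driver class
job1.setMapperClass(FirstMapper.class);    // placeholder mapper/reducer classes
job1.setReducerClass(FirstReducer.class);
// ... plus setOutputKeyClass/setOutputValueClass etc. as usual
FileInputFormat.addInputPath(job1, new Path("in"));
FileOutputFormat.setOutputPath(job1, new Path("tmp"));

// Blocks until job1 has finished on the cluster; bail out if it failed.
if (!job1.waitForCompletion(true)) {
  System.exit(1);
}

Job job2 = new Job(conf, "second pass");   // submitted only after job1 is done
job2.setJarByClass(MyDriver.class);
job2.setMapperClass(SecondMapper.class);
job2.setReducerClass(SecondReducer.class);
FileInputFormat.addInputPath(job2, new Path("tmp"));
FileOutputFormat.setOutputPath(job2, new Path("out"));
System.exit(job2.waitForCompletion(true) ? 0 : 1);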
er@hadoop.apache.org
Subject: Re: how to overwrite output in HDFS?
I created such a class in the project, built an instance of it in
main, and tried to use the method it includes, but it didn't work.
Can you explain a little more about how to make this function work?
On Tue, Apr 3, 2012 at 6:39 P
Hi Xin,
you can derive your own output format class from one of the Hadoop
OutputFormats and make sure the "checkOutputSpecs" method, which usually does
the checking, is empty:
---
public final class OverwritingTextOutputFormat<K, V> extends
    TextOutputFormat<K, V> {
  @Override
  public void checkOutputSpecs(JobContext job) {
    // Intentionally empty: do not fail if the output directory already exists.
  }
}
---
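To use it (assuming the class above), point your job at it:

job.setOutputFormatClass(OverwritingTextOutputFormat.class);

Keep in mind that with an empty checkOutputSpecs Hadoop will no longer
complain if the output directory already exists, so old part files can end up
mixed with new ones; depending on your use case it may be cleaner to delete
the old output explicitly with FileSystem.delete before submitting the job.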
Hi Ashish,
IMHO your numbers (2 machines, 10 URLs) are way too small to outweigh the
natural overhead that occurs with a distributed computation (distributing the
program code, coordinating the distributed file system, making sure everybody
is starting and stopping, etc.). Also, if you're web c
How about GridGain? Not sure about its liveliness, though.
Regards,
Christoph
-Original Message-
From: real great.. [mailto:greatness.hardn...@gmail.com]
Sent: Monday, January 30, 2012 14:48
To: mapreduce-user@hadoop.apache.org; ashwanthku...@googlemail.com
Subject: Re: Other t
Hi Rajen,
you can write stuff to the task attempt directory and it will be included in
the output of your MapReduce job.
You can get the directory from the Mapper context:
FileOutputFormat.getWorkOutputPath(context)
In that path, you can just open new files via the FileSystem methods.
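For example (a sketch; "side-data.txt" is just an illustrative name, and you
probably want to work the task attempt ID into it to avoid collisions between
tasks):

// inside the mapper or reducer (new API)
Path workPath = FileOutputFormat.getWorkOutputPath(context); // task attempt dir
Path sideFile = new Path(workPath, "side-data.txt");
FileSystem fs = sideFile.getFileSystem(context.getConfiguration());
FSDataOutputStream out = fs.create(sideFile);
out.writeBytes("whatever extra data you need to write\n");
out.close();

Files created this way are promoted to the job output directory only when the
task attempt commits, so you don't get half-written leftovers from failed or
speculative attempts.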
Hope this helps!
> From: Harsh J [mailto:ha...@cloudera.com]
> Sent: Tuesday, August 16, 2011 07:15
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: Under-replication warnings for Distributed Cache?
>
> On Mon, Aug 15, 2011 at 7:10 PM, Christoph Schmitz
> wrote:
> >
ut this warning?
Thanks and best regards,
Christoph
--
Christoph Schmitz
1&1 Internet AG
Ernst-Frey-Straße 10 · DE-76135 Karlsruhe
Phone: +49 721 91374-6733
christoph.schm...@1und1.de
Executive Board: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert Hoffmann,
Markus Huhn, Hans-Henning
only go to 1 mapper,
as opposed to the case where it is split into 60 1 GB files, which would make
the map-reduce job finish earlier than one 60 GB file since it would have 60
mappers running in parallel. Isn't that so?
Sent from my iPhone
On Jun 20, 2011, at 12:59 AM, Christoph Schmitz
wrote:
> Simple answer: do
-
_30 ?
On Sun, Jun 19, 2011 at 11:19 PM, Mapred Learn wrote:
Thanks !
I will try this !
On Sun, Jun 19, 2011 at 11:16 PM, Christoph Schmitz
wrote:
Hi JJ,
you can do that by subclassing
d job on those 60 text fixed-length files? If yes, do you have any
idea how to do this?
On Sun, Jun 19, 2011 at 11:28 PM, Christoph Schmitz
wrote:
JJ,
uploading 60 GB single-threaded (i.e. hadoop fs -copyFromLocal etc.)
will be slow. If possible, try to get the
JJ,
uploading 60 GB single-threaded (i.e. hadoop fs -copyFromLocal etc.) will be
slow. If possible, try to get the files in smaller chunks where they are
created, and upload them in parallel with a simple MapReduce job that only
passes the data through (i.e. uses the standard Mapper and Reducer classes).
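A sketch of such a pass-through job (the driver class name and paths are
placeholders; it assumes the chunks are somewhere every node can read, e.g. an
NFS mount or a staging directory in HDFS):

Job job = new Job(new Configuration(), "parallel copy");
job.setJarByClass(ParallelCopy.class);       // placeholder driver class

// No mapper/reducer classes set: the identity Mapper runs, one per input
// split, and with zero reduce tasks there is no sort/shuffle at all.
job.setNumReduceTasks(0);

job.setOutputKeyClass(LongWritable.class);   // identity mapper output types
job.setOutputValueClass(Text.class);         // when reading via TextInputFormat

FileInputFormat.addInputPath(job, new Path("/staging/chunks"));   // placeholder
FileOutputFormat.setOutputPath(job, new Path("/data/imported"));  // placeholder
System.exit(job.waitForCompletion(true) ? 0 : 1);

Note that with TextInputFormat the identity mapper emits (byte offset, line)
pairs, so the output lines carry an offset prefix; if that matters, a two-line
mapper that writes only the value (with a NullWritable key) fixes it.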
Hi JJ,
you can do that by subclassing TextOutputFormat (or whichever output format
you're using) and overriding the getDefaultWorkFile method:
public class MyOutputFormat<K, V> extends TextOutputFormat<K, V> {
  @Override
  public Path getDefaultWorkFile(TaskAttemptContext context, String extension) throws IOException {
    FileOutputCommitter committer = (FileOutputCommitter) getOutputCommitter(context);
    // For illustration: prefix the usual unique part-xxxxx name with "mydata-".
    return new Path(committer.getWorkPath(), "mydata-" + getUniqueFile(context, "part", extension));
  }
}
rg
Subject: Re: How to merge several SequenceFile into one?
Hi Christoph,
If there is no reducer, how can these sequence files be merged?
Thanks for your advice.
Best Wishes,
-Lin
On May 12, 2011 at 7:44 PM, Christoph Schmitz wrote:
> Hi Lin,
>
> you could run a map-only job, i.e. read your data
Hi Lin,
you could run a map-only job, i.e. read your data and output it from the mapper
without any reducer at all (set mapred.reduce.tasks=0 or, equivalently, use
job.setNumReduceTasks(0)).
That way, you parallelize over your inputs through a number of mappers and do
not have any sort/shuffle overhead.
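Something like this (a sketch; the driver class, paths, and key/value classes
are placeholders - use whatever types your SequenceFiles actually contain):

Job job = new Job(new Configuration(), "seqfile pass-through");
job.setJarByClass(PassThrough.class);                 // placeholder driver class

job.setInputFormatClass(SequenceFileInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);

job.setNumReduceTasks(0);             // map-only: identity Mapper, no sort/shuffle

job.setOutputKeyClass(Text.class);    // placeholder: your SequenceFiles' key class
job.setOutputValueClass(Text.class);  // placeholder: your SequenceFiles' value class

FileInputFormat.addInputPath(job, new Path("/input/seqfiles"));   // placeholder
FileOutputFormat.setOutputPath(job, new Path("/output/merged"));  // placeholder
System.exit(job.waitForCompletion(true) ? 0 : 1);

Be aware that this produces one output file per mapper rather than a single
merged file; if you really need exactly one SequenceFile, use a single reducer
(job.setNumReduceTasks(1)) instead and accept the sort/shuffle.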
Hi,
for reporting and monitoring purposes, I would like to access - from Java code
- the job configuration of Jobs that someone else has submitted to a JobTracker
(in 0.20.169).
Basically, this would mean doing a lot of what "hadoop job -status <jobid>"
does (to get to the location of the job.xml file).
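To make it concrete, roughly this is what I am after (a sketch using the old
mapred API; the JobTracker address and job ID are placeholders):

JobConf conf = new JobConf();
conf.set("mapred.job.tracker", "jobtracker.example.com:8021");          // placeholder

JobClient client = new JobClient(conf);
RunningJob job = client.getJob(JobID.forName("job_201101010000_0001")); // placeholder

System.out.println("job.xml location: " + job.getJobFile());
System.out.println("job state:        " + job.getJobState());

i.e. essentially what "hadoop job -status" prints, but from within my own
Java code.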
Gah, that sucks. I'm using 0.20.1-169 from Cloudera CDH2 and assumed it would
be there in 0.20.2 as well.
Sorry, I have no idea what happened to MultipleOutputs in 0.20.2.
Regards,
Christoph
-Original Message-
From: Panayotis Antonopoulos [mailto:antonopoulos...@hotmail.com]
Sent:
r 2011 14:48:10 +0530
> Subject: Re: Out-of-band writing from mapper
> To: mapreduce-user@hadoop.apache.org
>
> Hello Christoph,
>
> On Wed, Apr 20, 2011 at 2:12 PM, Christoph Schmitz
> wrote:
> > My question is: is there any mechanism to assist me in writing to some
far as I understand, MultipleOutputs would be used in the
reducer, right? (Which I wanted to avoid for the bulk of my data.)
Regards,
Christoph
On 04/20/2011 11:18 AM, Harsh J wrote:
Hello Christoph,
On Wed, Apr 20, 2011 at 2:12 PM, Christoph Schmitz
wrote:
My question is: is there any mechanism to ass
Hi,
I need to process data in a Java MR job (using 0.20.1) in a way such that the
largest part of the data is manipulated in the mapper only (i.e. some simple
per-record transformation without the need for sort + shuffle), and some small
pieces have to be passed on to the reducer. The mapper-on