RE: I need advice on whether my starting data needs to be in HDFS

2014-05-19 Thread Christoph Schmitz
Hi Steve, I'll second David's opinion (if I get it correctly ;-) that importing your data first and doing interesting processing later would be a good idea. Of course you can read your data from NFS during your actual processing, as long as all of your TaskTrackers have access to the NFS. Keep

Re: How do I perform a scalable cartesian product

2013-08-15 Thread Christoph Schmitz
github.com/adamjshook/mapreducepatterns/blob/master/MRDP/src/main/java/mrdp/ch5/CartesianProduct.java will not work for a dataset with 100 million items. Any bright ideas? -- Christoph Schmitz Software Architect Targeting Core Product 1&1 Internet AG | Brauerstraße 50 | 76135 Kar

Re: Distributing Keys across Reducers

2012-07-20 Thread Christoph Schmitz
Hi Dave, I haven't actually done this in practice, so take this with a grain of salt ;-) One way to circumvent your problem might be to add entropy to the keys, i.e., if your keys are "a", "b", etc., and you get too many "a"s and too many "b"s, you could inflate your keys randomly to be (a, 1)
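A minimal sketch of the salting idea described above (new-API Java; the class name, the "#" suffix format, and NUM_SALTS are illustrative assumptions, not code from the original mail):
---
import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SaltingMapper extends Mapper<Text, Text, Text, Text> {

    // Spread each hot key over up to 10 reducers.
    private static final int NUM_SALTS = 10;
    private final Random random = new Random();
    private final Text saltedKey = new Text();

    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // "a" becomes "a#0" ... "a#9"; the reducer (or a second, small job)
        // strips the suffix and re-aggregates per original key.
        saltedKey.set(key.toString() + "#" + random.nextInt(NUM_SALTS));
        context.write(saltedKey, value);
    }
}
---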

Re: Understanding job completion in other nodes

2012-06-26 Thread Christoph Schmitz
Hi Hamid, I'm not sure if I understand your question correctly, but I think this is exactly what the standard workflow in a Hadoop application looks like: Job job1 = new Job(...); // setup job, set Mapper and Reducer, etc. job1.waitForCompletion(...); // at this point, the cluster will run job
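Spelled out, the chaining pattern the mail sketches might look like this (a minimal sketch; the job names and error handling are illustrative):
---
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job1 = new Job(conf, "step-1");
        // ... set Mapper, Reducer, input/output paths for job1 here ...
        if (!job1.waitForCompletion(true)) { // blocks until the cluster has run job1
            System.exit(1);                  // abort the chain if step 1 failed
        }

        Job job2 = new Job(conf, "step-2");
        // ... job2 typically reads the output directory job1 just wrote ...
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}
---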

Re: how to overwrite output in HDFS?

2012-04-03 Thread Christoph Schmitz
mapreduce-user@hadoop.apache.org Subject: Re: how to overwrite output in HDFS? I created such a class in the project, built an instance of it in main, and tried to use the method it includes, but it didn't work. Can you explain a little bit more about how to make this function work? On Tue, Apr 3, 2012 at 6:39 P

Re: how to overwrite output in HDFS?

2012-04-03 Thread Christoph Schmitz
Hi Xin, you can derive your own output format class from one of the Hadoop OutputFormats and make sure the "checkOutputSpecs" method, which usually does the checking, is empty: --- public final class OverwritingTextOutputFormat extends TextOutputFormat { @Override public void c
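The preview cuts the class off; completed along the lines the mail describes, it could look like this (generics added; the empty method body is the whole trick):
---
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public final class OverwritingTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
    @Override
    public void checkOutputSpecs(JobContext context) {
        // Intentionally empty: do not fail if the output directory exists.
        // Note that Hadoop will not delete old files for you; part files
        // from earlier runs remain unless overwritten by same-named ones.
    }
}
---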

Re: Performance improvement-Cluster vs Pseudo

2012-03-30 Thread Christoph Schmitz
Hi Ashish, IMHO your numbers (2 machines, 10 URLs) are way too small to outweigh the natural overhead that comes with a distributed computation (distributing the program code, coordinating the distributed file system, making sure everything starts and stops in order, etc.). Also, if you're web c

Re: Other than hadoop

2012-01-30 Thread Christoph Schmitz
How about GridGain? Not sure about its liveliness, though. Regards, Christoph -----Original Message----- From: real great.. [mailto:greatness.hardn...@gmail.com] Sent: Monday, January 30, 2012 14:48 To: mapreduce-user@hadoop.apache.org; ashwanthku...@googlemail.com Subject: Re: Other t

Re: Output of MAP Class only

2011-09-30 Thread Christoph Schmitz
Hi Rajen, you can write stuff to the task attempt directory and it will be included in the output of your MapReduce job. You can get the directory from the Mapper context: FileOutputFormat.getWorkOutputPath(context). In that path, you can just open new files via the FileSystem methods. Hope th
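A minimal sketch of that approach inside a Mapper (new API assumed; the mapper types and the file name "side-data.txt" are illustrative):
---
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SideFileMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // The task attempt's work directory; files created here are promoted
        // into the job output directory when the task commits.
        Path workDir = FileOutputFormat.getWorkOutputPath(context);
        FileSystem fs = workDir.getFileSystem(context.getConfiguration());
        FSDataOutputStream out = fs.create(new Path(workDir, "side-data.txt"));
        out.writeUTF("written out-of-band from the mapper");
        out.close();
    }
}
---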

Re: Under-replication warnings for Distributed Cache?

2011-08-15 Thread Christoph Schmitz
> From: Harsh J [mailto:ha...@cloudera.com] > Sent: Tuesday, August 16, 2011 07:15 > To: mapreduce-user@hadoop.apache.org > Subject: Re: Under-replication warnings for Distributed Cache? > > On Mon, Aug 15, 2011 at 7:10 PM, Christoph Schmitz > wrote: > >

Under-replication warnings for Distributed Cache?

2011-08-15 Thread Christoph Schmitz
about this warning? Thanks and best regards, Christoph -- Christoph Schmitz 1&1 Internet AG Ernst-Frey-Straße 10 · DE-76135 Karlsruhe Phone: +49 721 91374-6733 christoph.schm...@1und1.de Executive Board: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert Hoffmann, Markus Huhn, Hans-Henning

Re: Re: How to split a big file in HDFS by size

2011-06-20 Thread Christoph Schmitz
only go to 1 mapper, unlike the case where it is split into 60 1 GB files, which will make the map-red job finish earlier than one 60 GB file, as it will have 60 mappers running in parallel. Isn't it so? Sent from my iPhone On Jun 20, 2011, at 12:59 AM, Christoph Schmitz wrote: > Simple answer: do

Re: how to change default name of a sequence file

2011-06-20 Thread Christoph Schmitz
- _30 ? On Sun, Jun 19, 2011 at 11:19 PM, Mapred Learn wrote: Thanks! I will try this! On Sun, Jun 19, 2011 at 11:16 PM, Christoph Schmitz wrote: Hi JJ, you can do that by subclassing

Re: How to split a big file in HDFS by size

2011-06-20 Thread Christoph Schmitz
d job on those 60 fixed-length text files? If yes, do you have any idea how to do this? On Sun, Jun 19, 2011 at 11:28 PM, Christoph Schmitz wrote: JJ, uploading 60 GB single-threaded (i.e. hadoop fs -copyFromLocal etc.) will be slow. If possible, try to get the

Re: How to split a big file in HDFS by size

2011-06-19 Thread Christoph Schmitz
JJ, uploading 60 GB single-threaded (i.e. hadoop fs -copyFromLocal etc.) will be slow. If possible, try to get the files in smaller chunks where they are created, and upload them in parallel with a simple MapReduce job that only passes the data through (i.e. uses the standard Mapper and Reducer
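A sketch of such a pass-through job (paths and class names are illustrative). The standard (identity) Mapper the mail mentions would also work, but with TextInputFormat it would carry byte offsets as keys into the output, so this version drops the key explicitly:
---
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ParallelIngest {

    // Copies each input line unchanged; with a NullWritable value,
    // TextOutputFormat writes just the line itself.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(line, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "parallel-ingest");
        job.setJarByClass(ParallelIngest.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0);               // map-only: no sort/shuffle
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path("chunks/"));  // the small chunks
        FileOutputFormat.setOutputPath(job, new Path("merged-in-hdfs/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
---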

Re: how to change default name of a sequence file

2011-06-19 Thread Christoph Schmitz
Hi JJ, you can do that by subclassing TextOutputFormat (or whichever output format you're using) and overloading the getDefaultWorkFile method: public class MyOutputFormat extends TextOutputFormat { // ... public Path getDefaultWorkFile(TaskAttemptContext context, String exte
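Completed along those lines (a sketch; the naming scheme in the method body is an assumption, not the original code):
---
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MyOutputFormat<K, V> extends TextOutputFormat<K, V> {
    @Override
    public Path getDefaultWorkFile(TaskAttemptContext context, String extension)
            throws IOException {
        FileOutputCommitter committer =
            (FileOutputCommitter) getOutputCommitter(context);
        // Produce e.g. "mydata-00000" instead of the default "part-r-00000".
        int task = context.getTaskAttemptID().getTaskID().getId();
        return new Path(committer.getWorkPath(),
                        String.format("mydata-%05d%s", task, extension));
    }
}
---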

Re: How to merge several SequenceFile into one?

2011-05-12 Thread Christoph Schmitz
mapreduce-user@hadoop.apache.org Subject: Re: How to merge several SequenceFile into one? Hi Christoph, If there is no reducer, how can these sequence files be merged? Thanks for your advice. Best Wishes, -Lin On May 12, 2011 at 7:44 PM, Christoph Schmitz wrote: > Hi Lin, > > you could run a map-only job, i.e. read your data

Re: How to merge several SequenceFile into one?

2011-05-12 Thread Christoph Schmitz
Hi Lin, you could run a map-only job, i.e. read your data and output it from the mapper without any reducer at all (set mapred.reduce.tasks=0 or, equivalently, use job.setNumReduceTasks(0)). That way, you parallelize over your inputs through a number of mappers and do not have any sort/shuffle
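A sketch of the driver for that map-only job over SequenceFiles (key/value classes and paths are illustrative). Keep in mind that each map task writes its own part file, so this consolidates many small files into fewer, larger ones; producing literally one file would need a single reducer instead:
---
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SeqFileConsolidator {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "consolidate-seqfiles");
        job.setJarByClass(SeqFileConsolidator.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);    // match the files' actual key class
        job.setOutputValueClass(Text.class);  // match the files' actual value class
        job.setNumReduceTasks(0);             // map-only: the default (identity)
                                              // Mapper passes records straight through
        SequenceFileInputFormat.addInputPath(job, new Path("seqfiles-in/"));
        SequenceFileOutputFormat.setOutputPath(job, new Path("seqfiles-out/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
---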

Accessing jobs on the JobTracker

2011-05-10 Thread Christoph Schmitz
Hi, for reporting and monitoring purposes, I would like to access - from Java code - the job configuration of jobs that someone else has submitted to a JobTracker (in 0.20.169). Basically, this would mean doing a lot of what "hadoop job -status <jobid>" does (to get to the location of the job.xml fi
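One plausible route, sketched with the old "mapred" client API that the command-line tool itself builds on (the job ID is a placeholder, and the JobConf must be configured to reach the cluster's JobTracker):
---
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class JobXmlLocator {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();  // must point at the cluster's JobTracker
        JobClient client = new JobClient(conf);
        RunningJob job = client.getJob(JobID.forName("job_201105100000_0001"));
        // getJobFile() returns the HDFS location of the submitted job.xml.
        System.out.println(job.getJobFile());
    }
}
---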

Re: Re: Out-of-band writing from mapper

2011-04-20 Thread Christoph Schmitz
Gah, that sucks. I'm using 0.20.1-169 from Cloudera CDH2 and assumed it would be there in 0.20.2 as well. Sorry, I have no idea what happened to MultipleOutputs in 0.20.2. Regards, Christoph -----Original Message----- From: Panayotis Antonopoulos [mailto:antonopoulos...@hotmail.com] Sent

Re: Out-of-band writing from mapper

2011-04-20 Thread Christoph Schmitz
r 2011 14:48:10 +0530 > Subject: Re: Out-of-band writing from mapper > To: mapreduce-user@hadoop.apache.org > > Hello Christoph, > > On Wed, Apr 20, 2011 at 2:12 PM, Christoph Schmitz > wrote: > > My question is: is there any mechanism to assist me in writing to some

Re: Out-of-band writing from mapper

2011-04-20 Thread Christoph Schmitz
As far as I understand, MultipleOutputs would be used in the reducer, right? (Which I wanted to avoid for the bulk of my data.) Regards, Christoph On 04/20/2011 11:18 AM, Harsh J wrote: Hello Christoph, On Wed, Apr 20, 2011 at 2:12 PM, Christoph Schmitz wrote: My question is: is there any mechanism to ass

Out-of-band writing from mapper

2011-04-20 Thread Christoph Schmitz
Hi, I need to process data in a Java MR job (using 0.20.1) in a way such that the largest part of the data is manipulated in the mapper only (i.e. some simple per-record transformation without the need for sort + shuffle), and some small pieces have to be passed on to the reducer. The mapper-on
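The approach discussed in this thread is MultipleOutputs (new API; as noted above, not present in every 0.20 release). A sketch of a mapper that writes the bulk directly and forwards only the small part to the reducer; the named output "bulk" and the routing predicate are illustrative:
---
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplittingMapper extends Mapper<Text, Text, Text, Text> {

    private MultipleOutputs<Text, Text> out;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        if (needsReduction(key)) {
            context.write(key, value);     // small part: goes through sort+shuffle
        } else {
            out.write("bulk", key, value); // bulk: written directly, no shuffle
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        out.close();
    }

    // Illustrative predicate deciding which records need the reducer.
    private boolean needsReduction(Text key) {
        return key.toString().startsWith("!");
    }
}
---
The driver would register the named output via MultipleOutputs.addNamedOutput(job, "bulk", TextOutputFormat.class, Text.class, Text.class).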