Re: Number of Clustering MR-Jobs

2013-03-28 Thread Sebastian Briesemeister
Thank you. Splitting the files leads to multiple MR tasks! Changing only the MR settings of Hadoop did not help. In the future it would be nice if the drivers scaled themselves and split the data according to the dataset size and the number of available MR slots. Cheers, Sebastian
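For reference, a minimal sketch of what "splitting the files" can look like: read the single input SequenceFile and rewrite it round-robin into several part files, so each part becomes its own split and map task. The Text/VectorWritable key-value types and the split-i file names are assumptions for illustration, not necessarily what was used here.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

public class SplitSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path input = new Path(args[0]);            // the single file written by SequenceFile.Writer
    int numParts = Integer.parseInt(args[1]);  // how many part files (and map tasks) to aim for

    SequenceFile.Writer[] writers = new SequenceFile.Writer[numParts];
    for (int i = 0; i < numParts; i++) {
      writers[i] = SequenceFile.createWriter(fs, conf,
          new Path(input.getParent(), "split-" + i), Text.class, VectorWritable.class);
    }

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, input, conf);
    Text key = new Text();
    VectorWritable value = new VectorWritable();
    long n = 0;
    while (reader.next(key, value)) {
      writers[(int) (n++ % numParts)].append(key, value);  // round-robin keeps parts balanced
    }
    reader.close();
    for (SequenceFile.Writer w : writers) {
      w.close();
    }
  }
}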

Re: Number of Clustering MR-Jobs

2013-03-28 Thread Sean Owen
This is really a Hadoop-level thing. I am not sure I have ever successfully induced M/R to run multiple mappers on less than one block of data, even with a low max split size. Reducers you can control.
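For anyone trying the same knobs, a sketch of the settings in question; property names vary across Hadoop versions, so treat these as illustrative rather than definitive:

import org.apache.hadoop.conf.Configuration;

public class SplitAndReduceSettings {
  public static Configuration configure() {
    Configuration conf = new Configuration();
    // Ask for smaller input splits; as noted above, this often has no effect
    // when the input is smaller than one HDFS block.
    conf.setLong("mapred.max.split.size", 16L * 1024 * 1024);  // old-style (Hadoop 1.x) name
    // The number of reduce tasks, by contrast, can be set directly.
    conf.setInt("mapred.reduce.tasks", 8);
    return conf;
  }
}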

Re: Number of Clustering MR-Jobs

2013-03-28 Thread Ted Dunning
This is a longstanding Hadoop issue. Your suggestion is interesting, but only a few cases would benefit. The problem is that splitting involves reading from a very small number of nodes and thus is not much better than just running the program with few mappers. If the data is large enough to

Re: Number of Clustering MR-Jobs

2013-03-28 Thread Sebastian Schelter
It would also be very hard to do automatically, as clusters are shared and a framework cannot know how much of the shared resources (available map slots) it can take.

Re: Number of Clustering MR-Jobs

2013-03-28 Thread Sebastian Schelter
Sebastian, For CPU-bound problems like matrix factorization with ALS, we have recently seen good results with multithreaded mappers, where we let the users specify the number of cores to use per mapper.
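A rough sketch of that multithreaded-mapper setup using Hadoop's MultithreadedMapper wrapper; the CpuBoundMapper class and the Text/VectorWritable types are placeholders for whatever mapper actually does the work:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
import org.apache.mahout.math.VectorWritable;

public class MultithreadedMapperSetup {

  // Placeholder for the mapper that does the CPU-heavy per-record work.
  public static class CpuBoundMapper extends Mapper<Text, VectorWritable, Text, VectorWritable> {
  }

  public static Job configure(Configuration conf, int threadsPerMapper) throws Exception {
    Job job = new Job(conf, "multithreaded mapper example");
    // MultithreadedMapper runs several copies of the real mapper inside one map task.
    job.setMapperClass(MultithreadedMapper.class);
    MultithreadedMapper.setMapperClass(job, CpuBoundMapper.class);
    MultithreadedMapper.setNumberOfThreads(job, threadsPerMapper);
    return job;
  }
}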

Re: Number of Clustering MR-Jobs

2013-03-28 Thread Sebastian Briesemeister
In my case, each map process requires a lot of memory, and I would like to distribute this consumption across multiple nodes. However, I still get out-of-memory exceptions even if I split the input file into several very small input files. I thought the mapper would consider only one file at a time

Re: Number of Clustering MR-Jobs

2013-03-28 Thread Dan Filimon
From what I've seen, even if the mapper does throw an out-of-memory exception, Hadoop will restart it, increasing the memory. There are ways to configure the mapper/reducer JVMs to use more memory by default through the Configuration, although I don't recall the exact options. It's probably
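The options in question are probably the child-JVM heap settings sketched below; the exact property names differ between Hadoop releases, so this is a best guess rather than a definitive list:

import org.apache.hadoop.conf.Configuration;

public class ChildJvmHeap {
  public static Configuration configure() {
    Configuration conf = new Configuration();
    // Hadoop 1.x: one setting covers both map and reduce child JVMs.
    conf.set("mapred.child.java.opts", "-Xmx2048m");
    // Later releases allow separate map/reduce settings.
    conf.set("mapred.map.child.java.opts", "-Xmx2048m");
    conf.set("mapred.reduce.child.java.opts", "-Xmx1024m");
    return conf;
  }
}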

Re: Number of Clustering MR-Jobs

2013-03-28 Thread Sebastian Briesemeister
I tried to increase the heap space, but it wasn't enough. It seems the problem is not the number of mappers. I will start another thread for this problem with some more details. Cheers, Sebastian

Number of Clustering MR-Jobs

2013-03-27 Thread Sebastian Briesemeister
Dear all, I am trying to start the FuzzyKMeansDriver on a Hadoop cluster so that it starts multiple MapReduce jobs. However, it always starts just a single MR job?! I figured it might be caused by the fact that I generated my input data into a single file using SequenceFile.Writer? Or is there
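For context, input generated that way typically looks something like the sketch below (Text keys and Mahout VectorWritable values are assumptions). Because everything lands in one SequenceFile, a file smaller than an HDFS block yields exactly one input split, and therefore one map task.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class WritePoints {
  public static void writePoints(List<Vector> points, Path file, Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, file, Text.class, VectorWritable.class);
    try {
      int i = 0;
      for (Vector v : points) {
        // All points go into one file; FileInputFormat will see a single split
        // unless the file grows beyond the block size.
        writer.append(new Text("point-" + i++), new VectorWritable(v));
      }
    } finally {
      writer.close();
    }
  }
}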

Re: Number of Clustering MR-Jobs

2013-03-27 Thread Ted Dunning
Do you mean that it starts a single map task?

Re: Number of Clustering MR-Jobs

2013-03-27 Thread Sebastian Briesemeister
Yes, correct. It currently starts a single map task.

Re: Number of Clustering MR-Jobs

2013-03-27 Thread Ted Dunning
Your idea that this is related to your single input file is most likely correct. If your input file is relatively small, then splitting it up to force multiple mappers is the easiest solution. If your input file is larger, then you might be able to convince the map-reduce framework to use more
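For the larger-file case, one way to try to convince the framework is to cap the split size through FileInputFormat; a sketch under the assumption that the job uses the new-API FileInputFormat, and with the caveat discussed earlier that data below one block may still yield a single mapper:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MoreMappers {
  public static Job configure(Configuration conf, Path input) throws Exception {
    Job job = new Job(conf, "more mappers example");
    FileInputFormat.addInputPath(job, input);
    // Cap the split size well below the HDFS block size so a large file
    // is carved into more splits, and therefore more map tasks.
    FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024);
    return job;
  }
}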