Data of that size would need to run on a Hadoop cluster.

Right now, I don't think there is a utility which can collect the data in the form you want. You will have to read it record by record and group the vectors belonging to the same cluster. It would be good if you could write the output to the file system incrementally, as this would avoid the memory problem.
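
Something along these lines might work (just a rough sketch, not the existing
ClusterDumper code). It assumes the files under pointsDir are SequenceFiles of
(IntWritable clusterId, WeightedVectorWritable point) records, which is what the
clustering drivers write, and the local output file names are made up for
illustration:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable;

public class StreamingPointsDumper {

  /** Streams clustered points into one local text file per cluster id. */
  public static void dump(Path pointsDir, Configuration conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    // One open writer per cluster: memory now scales with the number of
    // clusters, not with the number of points.
    Map<Integer, BufferedWriter> writers = new HashMap<Integer, BufferedWriter>();
    try {
      for (FileStatus status : fs.listStatus(pointsDir)) {
        if (status.isDir() || status.getPath().getName().startsWith("_")) {
          continue; // skip _SUCCESS, _logs and subdirectories
        }
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
        try {
          IntWritable clusterId = new IntWritable();
          WeightedVectorWritable point = new WeightedVectorWritable();
          while (reader.next(clusterId, point)) { // one record in memory at a time
            BufferedWriter out = writers.get(clusterId.get());
            if (out == null) {
              out = new BufferedWriter(new FileWriter("cluster-" + clusterId.get() + ".txt"));
              writers.put(clusterId.get(), out);
            }
            out.write(point.getVector().asFormatString());
            out.newLine();
          }
        } finally {
          reader.close();
        }
      }
    } finally {
      for (BufferedWriter out : writers.values()) {
        out.close();
      }
    }
  }
}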

Or, try CanopyDriver with clusterFilter > 0, which might reduce the number of clusters you get as output and, in turn, lower the memory usage.
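
For example, something like this (a sketch only; the option names are from
memory, so please verify them against "mahout canopy --help" in your version,
and the input/output paths, distance measure and t1/t2 values are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.clustering.canopy.CanopyDriver;

public class RunCanopyWithClusterFilter {
  public static void main(String[] args) throws Exception {
    String[] jobArgs = {
        "--input", "vectors",          // placeholder path to the input vectors
        "--output", "canopy-output",   // placeholder output path
        "--distanceMeasure", "org.apache.mahout.common.distance.EuclideanDistanceMeasure",
        "--t1", "3.0",                 // placeholder thresholds, tune for your data
        "--t2", "1.5",
        "--clusterFilter", "2"         // drop canopies containing fewer than 2 points
    };
    ToolRunner.run(new Configuration(), new CanopyDriver(), jobArgs);
  }
}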

On 04-11-2011 11:43, gaurav redkar wrote:
Actually, I have to run the mean shift algorithm on a large dataset for my
project. The ClusterDumper facility works on smaller data sets.

But my project will mostly involve large-scale data (sizes will mostly
extend to gigabytes), so I need to modify the ClusterDumper facility to
work on such datasets. Also, the vectors are densely populated.

I probably need to read each file from pointsDir one at a time while
constructing the "result" map. Any pointers on how to do it?

Thanks

On Fri, Nov 4, 2011 at 11:27 AM, Paritosh Ranjan<pran...@xebia.com>  wrote:

Reducing the dimensionality (drastically; try fewer than 100 dimensions if the
functionality allows it) can be a solution.

Which vector implementation are you using? If the vectors are sparsely
populated (i.e. have lots of uninitialized/unused dimensions), you can use
RandomAccessSparseVector or SequentialAccessSparseVector, which store
only the dimensions you actually use. This can also decrease
memory consumption.
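
For example (a minimal illustration with made-up values; only the entries you
set are stored, so a mostly-empty 1000-dimensional sparse vector costs far less
memory than a DenseVector):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class SparseVectorExample {
  public static void main(String[] args) {
    // Cardinality 1000, but only three entries are ever set, so only
    // those three are stored.
    Vector v = new RandomAccessSparseVector(1000);
    v.set(3, 0.5);
    v.set(250, 1.2);
    v.set(999, -0.7);
    System.out.println(v.getNumNondefaultElements() + " stored entries out of " + v.size());
  }
}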


On 04-11-2011 11:19, gaurav redkar wrote:

Hi,

Yes Paritosh, I think the same. Actually, I am using a test data set
that has 5000 tuples with 1000 dimensions each. The thing is, there are too
many files created in the pointsDir folder, and I think the program tries to
open a path to all the files (i.e. read all the files into memory at once). Is
my interpretation correct? Also, how do I go about fixing it?

Thanks



On Fri, Nov 4, 2011 at 11:03 AM, Paritosh Ranjan<pran...@xebia.com>  wrote:

Reading the points keeps everything in memory, which might be what crashed it:

pointList.add(record.getSecond());


Your dataset size is 40 MB, but the vectors might be too large. How many
dimensions does your Vector have?


On 04-11-2011 10:57, gaurav redkar wrote:

  Hello,
I am in a fix with the ClusterDumper utility. The clusterdump utility
crashes when it tries to output the clusters, throwing an out-of-memory
exception: Java heap space.

When I checked the error stack, it seems that the program crashed in the
readPoints() function. I guess it is unable to build the "result" map.
Any idea how I can fix this?

I am working on a dataset of size 40 MB. I had tried increasing the heap
space, but with no luck.

Thanks

Gaurav


