Thanks a lot for your help. Yes, I will be running it on a Hadoop cluster. Can
you elaborate a bit on writing to the file system incrementally?

On Fri, Nov 4, 2011 at 11:51 AM, Paritosh Ranjan <pran...@xebia.com> wrote:

> Data of that size would need to run on a Hadoop cluster.
>
> Right now, I don't think there is a utility that can collect the data in
> the form you want. You will have to read it line by line and group the
> vectors that belong to the same cluster. It would be good if you could
> write the output to the file system incrementally, as that would avoid the
> memory problem.
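>
> Something along these lines might work (an untested sketch: it streams each
> part file in pointsDir and appends one "clusterId <tab> vector" line per
> point, so only one record is ever held in memory; I am assuming the
> clusteredPoints files are keyed by IntWritable cluster ids with
> WeightedVectorWritable values, and the WeightedVectorWritable package may
> differ in your Mahout version):
>
>   import java.io.BufferedWriter;
>   import java.io.IOException;
>   import java.io.OutputStreamWriter;
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FileStatus;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.io.IntWritable;
>   import org.apache.hadoop.io.SequenceFile;
>   import org.apache.mahout.clustering.WeightedVectorWritable;
>
>   public final class IncrementalClusterDump {
>
>     // Streams every part file under pointsDir and writes one
>     // "clusterId <tab> vector" line per point to a single text file.
>     public static void dump(Configuration conf, Path pointsDir, Path outFile)
>         throws IOException {
>       FileSystem fs = FileSystem.get(conf);
>       BufferedWriter out = new BufferedWriter(
>           new OutputStreamWriter(fs.create(outFile), "UTF-8"));
>       try {
>         for (FileStatus status : fs.listStatus(pointsDir)) {
>           String name = status.getPath().getName();
>           if (name.startsWith("_") || name.startsWith(".")) {
>             continue; // skip _SUCCESS, _logs and hidden checksum files
>           }
>           SequenceFile.Reader reader =
>               new SequenceFile.Reader(fs, status.getPath(), conf);
>           try {
>             IntWritable clusterId = new IntWritable();
>             WeightedVectorWritable point = new WeightedVectorWritable();
>             while (reader.next(clusterId, point)) {
>               out.write(clusterId.get() + "\t"
>                   + point.getVector().asFormatString());
>               out.newLine();
>             }
>           } finally {
>             reader.close();
>           }
>         }
>       } finally {
>         out.close();
>       }
>     }
>   }
>
> Grouping by cluster can then be done afterwards with a sort on the first
> column (or by keeping one writer per cluster id, if the number of clusters
> is small), so nothing has to be accumulated in memory.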
>
> Alternatively, try CanopyDriver with clusterFilter > 0, which might reduce
> the number of clusters you get as output and, in turn, reduce memory usage.
>
>
> On 04-11-2011 11:43, gaurav redkar wrote:
>
>> Actually, I have to run the mean shift algorithm on a large dataset for my
>> project. The ClusterDumper facility works on smaller data sets.
>>
>> But my project will mostly involve large-scale data (sizes extending into
>> gigabytes), so I need to modify the ClusterDumper facility to work on such
>> datasets. Also, the vectors are densely populated.
>>
>> I probably need to read each file from pointsDir one at a time while
>> constructing the "result" map. Any pointers on how to do that?
>>
>> Thanks
>>
>> On Fri, Nov 4, 2011 at 11:27 AM, Paritosh Ranjan<pran...@xebia.com>
>>  wrote:
>>
>>> Reducing the dimensionality (drastically; try fewer than 100 dimensions if
>>> your use case allows it) can be a solution.
>>>
>>> Which vector implementation are you using? If the vectors are sparsely
>>> populated (i.e. have lots of uninitialized/unused dimensions), you can use
>>> RandomAccessSparseVector or SequentialAccessSparseVector, which store only
>>> the dimensions you actually use. This can also decrease memory consumption.
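>>>
>>> For example (a small sketch against the Mahout math API; the class and
>>> method names below are the standard org.apache.mahout.math ones, but check
>>> them against your Mahout version):
>>>
>>>   import org.apache.mahout.math.DenseVector;
>>>   import org.apache.mahout.math.RandomAccessSparseVector;
>>>   import org.apache.mahout.math.SequentialAccessSparseVector;
>>>   import org.apache.mahout.math.Vector;
>>>
>>>   public class VectorChoiceExample {
>>>     public static void main(String[] args) {
>>>       int cardinality = 1000;
>>>
>>>       // DenseVector allocates storage for all 1000 doubles up front.
>>>       Vector dense = new DenseVector(cardinality);
>>>
>>>       // RandomAccessSparseVector stores only the entries that are set;
>>>       // good for random reads and writes while building the vector.
>>>       Vector sparse = new RandomAccessSparseVector(cardinality);
>>>       sparse.set(3, 1.5);
>>>       sparse.set(997, 2.0);
>>>
>>>       // SequentialAccessSparseVector is more compact and iterates faster;
>>>       // it is usually built once from another vector.
>>>       Vector sequential = new SequentialAccessSparseVector(sparse);
>>>
>>>       System.out.println("dense size: " + dense.size());
>>>       System.out.println("non-default entries: "
>>>           + sparse.getNumNondefaultElements());
>>>       System.out.println(sequential.asFormatString());
>>>     }
>>>   }
>>>
>>> With only a handful of non-zero entries, the two sparse implementations use
>>> memory proportional to the entries you set rather than to the cardinality.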
>>>
>>>
>>> On 04-11-2011 11:19, gaurav redkar wrote:
>>>
>>>> Hi,
>>>>
>>>> Yes Paritosh, I think the same. Actually, I am using a test data set that
>>>> has 5000 tuples with 1000 dimensions each. The thing is, there are too
>>>> many files created in the pointsDir folder, and I think the program tries
>>>> to open a path to all the files (i.e. reads all the files into memory at
>>>> once). Is my interpretation correct? Also, how do I go about fixing it?
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>> On Fri, Nov 4, 2011 at 11:03 AM, Paritosh Ranjan<pran...@xebia.com>
>>>>  wrote:
>>>>
>>>>> Reading the points keeps everything in memory, which might be what
>>>>> crashed it:
>>>>>
>>>>> pointList.add(record.getSecond());
>>>>>
>>>>>
>>>>>
>>>>> Your dataset is only 40 MB, but the vectors might be too large. How many
>>>>> dimensions does each Vector have?
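>>>>>
>>>>> If it helps, the same records can be consumed one at a time instead of
>>>>> being collected into lists. A rough, untested sketch (it only counts the
>>>>> points per cluster to show the pattern; I am assuming the
>>>>> IntWritable/WeightedVectorWritable types and the SequenceFileDirIterable
>>>>> helper that readPoints() already uses, whose package names may differ in
>>>>> your Mahout version):
>>>>>
>>>>>   import java.util.HashMap;
>>>>>   import java.util.Map;
>>>>>
>>>>>   import org.apache.hadoop.conf.Configuration;
>>>>>   import org.apache.hadoop.fs.Path;
>>>>>   import org.apache.hadoop.io.IntWritable;
>>>>>   import org.apache.mahout.clustering.WeightedVectorWritable;
>>>>>   import org.apache.mahout.common.Pair;
>>>>>   import org.apache.mahout.common.iterator.sequencefile.PathFilters;
>>>>>   import org.apache.mahout.common.iterator.sequencefile.PathType;
>>>>>   import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;
>>>>>
>>>>>   public final class ClusterSizeCounter {
>>>>>
>>>>>     // Iterates the clustered points lazily and keeps only a count per
>>>>>     // cluster, instead of retaining every vector the way
>>>>>     // pointList.add(record.getSecond()) does.
>>>>>     public static Map<Integer, Integer> countPointsPerCluster(
>>>>>         Path pointsDir, Configuration conf) {
>>>>>       Map<Integer, Integer> counts = new HashMap<Integer, Integer>();
>>>>>       for (Pair<IntWritable, WeightedVectorWritable> record :
>>>>>            new SequenceFileDirIterable<IntWritable, WeightedVectorWritable>(
>>>>>                pointsDir, PathType.LIST, PathFilters.logsCRCFilter(), conf)) {
>>>>>         int clusterId = record.getFirst().get();
>>>>>         Integer current = counts.get(clusterId);
>>>>>         counts.put(clusterId, current == null ? 1 : current + 1);
>>>>>         // record.getSecond().getVector() could be written out or
>>>>>         // processed here and then discarded
>>>>>       }
>>>>>       return counts;
>>>>>     }
>>>>>   }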
>>>>>
>>>>>
>>>>> On 04-11-2011 10:57, gaurav redkar wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I am in a fix with the ClusterDumper utility. The clusterdump utility
>>>>>> crashes when it tries to output the clusters, throwing an
>>>>>> OutOfMemoryError: Java heap space.
>>>>>>
>>>>>> When I checked the stack trace, it seems that the program crashed in the
>>>>>> readPoints() function. I guess it is unable to build the "result" map.
>>>>>> Any idea how I can fix this?
>>>>>>
>>>>>> I am working on a dataset of about 40 MB. I tried increasing the heap
>>>>>> space, but with no luck.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Gaurav
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>
>>
>>
>
>
