Thanks a lot for your help. Yes, I will be running it on a Hadoop cluster. Can you elaborate a bit on writing to the file system incrementally?
On Fri, Nov 4, 2011 at 11:51 AM, Paritosh Ranjan <pran...@xebia.com> wrote:

> Such big data would need to run on a Hadoop cluster.
>
> Right now, I think there is no utility which can collect the data in the
> form you want. You will have to read it line by line, group the vectors
> belonging to the same cluster, and write them to the file system
> incrementally; that would get rid of the memory problem.
>
> Or, try CanopyDriver with clusterFilter > 0, which might help reduce the
> number of clusters you are getting as output and, in turn, the memory
> usage.
>
> On 04-11-2011 11:43, gaurav redkar wrote:
>
>> Actually, I have to run the mean shift algorithm on a large dataset for
>> my project. The clusterdumper facility works on smaller data sets.
>>
>> But my project will mostly involve large-scale data (the size will
>> likely extend to gigabytes), so I need to modify the clusterdumper
>> facility to work on such a dataset. Also, the vectors are densely
>> populated.
>>
>> I probably need to read each file from pointsDir one at a time while
>> constructing the "result" map. Any pointers on how to do that?
>>
>> Thanks
>>
>> On Fri, Nov 4, 2011 at 11:27 AM, Paritosh Ranjan <pran...@xebia.com> wrote:
>>
>>> Reducing the dimension (drastically; try fewer than 100 if the
>>> functionality allows it) can be a solution.
>>>
>>> Which vector implementation are you using? If the vectors are sparsely
>>> populated (have lots of uninitialized/unused dimensions), you can use
>>> RandomAccessSparseVector or SequentialAccessSparseVector, which store
>>> only the dimensions you actually use. This can also decrease memory
>>> consumption.
>>>
>>> On 04-11-2011 11:19, gaurav redkar wrote:
>>>
>>>> Hi,
>>>>
>>>> Yes Paritosh, I think the same. I am using a test data set that has
>>>> 5000 tuples with 1000 dimensions each. The thing is, there are too
>>>> many files created in the pointsDir folder, and I think the program
>>>> tries to open a path to all the files (i.e. read all the files into
>>>> memory at once). Is my interpretation correct? Also, how do I go
>>>> about fixing it?
>>>>
>>>> Thanks
>>>>
>>>> On Fri, Nov 4, 2011 at 11:03 AM, Paritosh Ranjan <pran...@xebia.com> wrote:
>>>>
>>>>> Reading the points keeps everything in memory, which might have
>>>>> crashed it:
>>>>>
>>>>>     pointList.add(record.getSecond());
>>>>>
>>>>> Your dataset size is 40 MB, but the vectors might be too large. How
>>>>> many dimensions do your vectors have?
>>>>>
>>>>> On 04-11-2011 10:57, gaurav redkar wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I am in a fix with the ClusterDumper utility. The clusterdump
>>>>>> utility crashes when it tries to output the clusters, throwing an
>>>>>> out of memory exception: java heap space.
>>>>>>
>>>>>> When I checked the error stack, it seems the program crashed in the
>>>>>> readPoints() function. I guess it is unable to build the "result"
>>>>>> map. Any idea how I can fix this?
>>>>>>
>>>>>> I am working on a dataset of size 40 MB. I had tried increasing the
>>>>>> heap space, but with no luck.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Gaurav
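[Editorial note] A minimal sketch of the incremental approach Paritosh describes above: visit the part files under pointsDir one at a time, and append each point to a per-cluster text file as soon as it is read, so the full "result" map is never built in memory. This is not ClusterDumper's actual code; it assumes (as ClusterDumper does for Mahout's clusteredPoints output around 0.5/0.6) that each record has an IntWritable cluster id as key and a WeightedVectorWritable as value, and the paths/class names are illustrative. Adjust the key/value types to whatever your mean shift run actually writes.

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.mahout.clustering.WeightedVectorWritable;

    public class IncrementalClusterDump {

      public static void main(String[] args) throws IOException {
        Path pointsDir = new Path(args[0]); // e.g. the clusteredPoints directory
        Path dumpDir = new Path(args[1]);   // where the per-cluster text files go

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // One text writer per cluster id, opened lazily. There are usually far
        // fewer clusters than points, so this map stays small.
        Map<Integer, BufferedWriter> writers = new HashMap<Integer, BufferedWriter>();

        IntWritable clusterId = new IntWritable();
        WeightedVectorWritable point = new WeightedVectorWritable();

        // Stream the files one at a time instead of loading everything into
        // a single in-memory map the way readPoints() does.
        for (FileStatus status : fs.listStatus(pointsDir)) {
          if (status.isDir() || status.getPath().getName().startsWith("_")) {
            continue; // skip _logs, _SUCCESS, etc.
          }
          SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
          try {
            while (reader.next(clusterId, point)) {
              BufferedWriter out = writers.get(clusterId.get());
              if (out == null) {
                Path outFile = new Path(dumpDir, "cluster-" + clusterId.get() + ".txt");
                out = new BufferedWriter(new OutputStreamWriter(fs.create(outFile)));
                writers.put(clusterId.get(), out);
              }
              // Write the point immediately and forget it; nothing accumulates.
              out.write(point.getVector().asFormatString());
              out.newLine();
            }
          } finally {
            reader.close();
          }
        }

        for (BufferedWriter out : writers.values()) {
          out.close();
        }
      }
    }

With this pattern the memory footprint is bounded by one record plus one open writer per cluster, regardless of how many files are in pointsDir or how large the dataset grows.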
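[Editorial note] On the earlier vector-implementation point: a small illustration (not from the thread) of what Paritosh means by using a sparse vector class. A DenseVector allocates storage for every dimension, while RandomAccessSparseVector and SequentialAccessSparseVector store only the entries that are set; this helps only if many of the 1000 dimensions are actually zero/unused, which is not the case for densely populated vectors.

    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.SequentialAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class SparseVectorExample {

      public static void main(String[] args) {
        int cardinality = 1000;

        // Dense: allocates a double for every one of the 1000 dimensions.
        Vector dense = new DenseVector(cardinality);
        dense.setQuick(3, 1.5);
        dense.setQuick(750, 2.25);

        // Sparse: same logical cardinality, but only the two set entries are stored.
        Vector sparse = new RandomAccessSparseVector(cardinality);
        sparse.setQuick(3, 1.5);
        sparse.setQuick(750, 2.25);

        // SequentialAccessSparseVector suits vectors that are mostly iterated in
        // order (e.g. during distance computations) rather than updated randomly.
        Vector sequential = new SequentialAccessSparseVector(sparse);

        System.out.println("non-zero entries: " + sparse.getNumNondefaultElements());
        System.out.println("copy for iteration: " + sequential.asFormatString());
      }
    }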