Re: Incremently load big RDD file into Memory

MUHAMMAD AAMIR Wed, 08 Apr 2015 23:26:14 -0700

Hi,

Thanks a lot for such a detailed response.


On Wed, Apr 8, 2015 at 8:55 PM, Guillaume Pitel <guillaume.pi...@exensa.com>
wrote:

>  Hi Muhammad,
>
> There are lots of ways to do it. My company actually develops a text
> mining solution which embeds a very fast Approximate Neighbours solution (a
> demo with real time queries on the wikipedia dataset can be seen at
> wikinsights.org). For the record, we now prepare a dataset of 4.5 million
> documents for querying in about 2 or 3 minutes on a 32 cores cluster, and
> the queries take less than 10ms when the dataset is in memory.
>
> But if you just want to precompute everything and don't mind waiting a few
> tens of minutes (or hours), and don't want to bother with an approximate
> neighbour solution, then the best way is probably something like this :
>
> 1 - block your data (i.e. group your items in X large groups). Instead of
> a dataset of N elements, you should now have a dataset of X blocks
> containing N/X elements each.
> 2 - do the cartesian product (instead of N*N elements, you now have "just"
> X*X blocks, which should take less memory)
> 3 - for each pair of blocks (blockA,blockB), perform the computation of
> distances for each elements of blockA with each element of blockB, but keep
> only the top K best for each element of blockA. Output is
> List((elementOfBlockA, listOfKNearestElementsOfBlockBWithTheDistance),..)
> 4 - reduceByKey (the key is the elementOfBlockA), by merging the
> listOfNearestElements and always keeping the K nearest.
>
> This is an exact version of top K. This is only interesting if K << N/X.
> But even if K is large, it is possible that it will fit your needs.
> Remember that you will still compute N*N distances (this is the problem
> with exact nearest neighbours), the only difference with what you're doing
> now is that you produces less items and duplicates less data. Indeed, if
> one of your elements takes 100bytes, the per element cartesian will produce
> N*N*100*2 bytes, while the blocked version will produce X*X*100*2*N/X, ie
> X*N*100*2 bytes.
>
> Guillaume
>
> Hi Guillaume,
>
>  Thanks for you reply. Can you please tell me how can i improve for Top-k
> nearest points.
>
>  P.S. My post is not accepted on the list thats why i am sending you
> email here.
> I would be really grateful to you if you reply it.
> Thanks,
>
> On Wed, Apr 8, 2015 at 1:23 PM, Guillaume Pitel <
> guillaume.pi...@exensa.com> wrote:
>
>>  This kind of operation is not scalable, not matter what you do, at
>> least if you _really_ want to do that.
>>
>> However, if what you're looking for is not to really compute all
>> distances, (for instance if you're looking only for the top K nearest
>> points), then it can be highly improved.
>>
>> It all depends of what you want to do eventually.
>>
>> Guillaume
>>
>> val locations = filelines.map(line => line.split("\t")).map(t =>
>> (t(5).toLong, (t(2).toDouble, t(3).toDouble))).distinct().collect()
>>
>> val cartesienProduct=locations.cartesian(locations).map(t=>
>> Edge(t._1._1,t._2._1,distanceAmongPoints(t._1._2._1,t._1._2._2,t._2._2._1,t._2._2._2)))
>>
>> Code executes perfectly fine uptill here but when i try to use
>> "cartesienProduct" it got stuck i.e.
>>
>> val count =cartesienProduct.count()
>>
>> Any help to efficiently do this will be highly appreciated.
>>
>>
>>
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/Incremently-load-big-RDD-file-into-Memory-tp22410.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>>
>> --
>>    [image: eXenSa]
>>  *Guillaume PITEL, Président*
>> +33(0)626 222 431
>>
>> eXenSa S.A.S. <http://www.exensa.com/>
>>  41, rue Périer - 92120 Montrouge - FRANCE
>> Tel +33(0)184 163 677 / Fax +33(0)972 283 705
>>
>
>
>
>  --
>  Regards,
> Muhammad Aamir
>
>
> *CONFIDENTIALITY:This email is intended solely for the person(s) named and
> may be confidential and/or privileged.If you are not the intended
> recipient,please delete it,notify me and do not copy,use,or disclose its
> content.*
>
>
>
> --
>    [image: eXenSa]
>  *Guillaume PITEL, Président*
> +33(0)626 222 431
>
> eXenSa S.A.S. <http://www.exensa.com/>
>  41, rue Périer - 92120 Montrouge - FRANCE
> Tel +33(0)184 163 677 / Fax +33(0)972 283 705
>



-- 
Regards,
Muhammad Aamir


*CONFIDENTIALITY:This email is intended solely for the person(s) named and
may be confidential and/or privileged.If you are not the intended
recipient,please delete it,notify me and do not copy,use,or disclose its
content.*

Re: Incremently load big RDD file into Memory

Reply via email to