Hi, Thanks a lot for such a detailed response.
On Wed, Apr 8, 2015 at 8:55 PM, Guillaume Pitel <guillaume.pi...@exensa.com> wrote: > Hi Muhammad, > > There are lots of ways to do it. My company actually develops a text > mining solution which embeds a very fast Approximate Neighbours solution (a > demo with real time queries on the wikipedia dataset can be seen at > wikinsights.org). For the record, we now prepare a dataset of 4.5 million > documents for querying in about 2 or 3 minutes on a 32 cores cluster, and > the queries take less than 10ms when the dataset is in memory. > > But if you just want to precompute everything and don't mind waiting a few > tens of minutes (or hours), and don't want to bother with an approximate > neighbour solution, then the best way is probably something like this : > > 1 - block your data (i.e. group your items in X large groups). Instead of > a dataset of N elements, you should now have a dataset of X blocks > containing N/X elements each. > 2 - do the cartesian product (instead of N*N elements, you now have "just" > X*X blocks, which should take less memory) > 3 - for each pair of blocks (blockA,blockB), perform the computation of > distances for each elements of blockA with each element of blockB, but keep > only the top K best for each element of blockA. Output is > List((elementOfBlockA, listOfKNearestElementsOfBlockBWithTheDistance),..) > 4 - reduceByKey (the key is the elementOfBlockA), by merging the > listOfNearestElements and always keeping the K nearest. > > This is an exact version of top K. This is only interesting if K << N/X. > But even if K is large, it is possible that it will fit your needs. > Remember that you will still compute N*N distances (this is the problem > with exact nearest neighbours), the only difference with what you're doing > now is that you produces less items and duplicates less data. Indeed, if > one of your elements takes 100bytes, the per element cartesian will produce > N*N*100*2 bytes, while the blocked version will produce X*X*100*2*N/X, ie > X*N*100*2 bytes. > > Guillaume > > Hi Guillaume, > > Thanks for you reply. Can you please tell me how can i improve for Top-k > nearest points. > > P.S. My post is not accepted on the list thats why i am sending you > email here. > I would be really grateful to you if you reply it. > Thanks, > > On Wed, Apr 8, 2015 at 1:23 PM, Guillaume Pitel < > guillaume.pi...@exensa.com> wrote: > >> This kind of operation is not scalable, not matter what you do, at >> least if you _really_ want to do that. >> >> However, if what you're looking for is not to really compute all >> distances, (for instance if you're looking only for the top K nearest >> points), then it can be highly improved. >> >> It all depends of what you want to do eventually. >> >> Guillaume >> >> val locations = filelines.map(line => line.split("\t")).map(t => >> (t(5).toLong, (t(2).toDouble, t(3).toDouble))).distinct().collect() >> >> val cartesienProduct=locations.cartesian(locations).map(t=> >> Edge(t._1._1,t._2._1,distanceAmongPoints(t._1._2._1,t._1._2._2,t._2._2._1,t._2._2._2))) >> >> Code executes perfectly fine uptill here but when i try to use >> "cartesienProduct" it got stuck i.e. >> >> val count =cartesienProduct.count() >> >> Any help to efficiently do this will be highly appreciated. >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/Incremently-load-big-RDD-file-into-Memory-tp22410.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> >> >> -- >> [image: eXenSa] >> *Guillaume PITEL, Président* >> +33(0)626 222 431 >> >> eXenSa S.A.S. <http://www.exensa.com/> >> 41, rue Périer - 92120 Montrouge - FRANCE >> Tel +33(0)184 163 677 / Fax +33(0)972 283 705 >> > > > > -- > Regards, > Muhammad Aamir > > > *CONFIDENTIALITY:This email is intended solely for the person(s) named and > may be confidential and/or privileged.If you are not the intended > recipient,please delete it,notify me and do not copy,use,or disclose its > content.* > > > > -- > [image: eXenSa] > *Guillaume PITEL, Président* > +33(0)626 222 431 > > eXenSa S.A.S. <http://www.exensa.com/> > 41, rue Périer - 92120 Montrouge - FRANCE > Tel +33(0)184 163 677 / Fax +33(0)972 283 705 > -- Regards, Muhammad Aamir *CONFIDENTIALITY:This email is intended solely for the person(s) named and may be confidential and/or privileged.If you are not the intended recipient,please delete it,notify me and do not copy,use,or disclose its content.*