Do you want to compare within the RDD, or do you have some external list or data coming in?
For matching, you could look at string edit distances or cosine similarity if you are only comparing title strings.

On Oct 20, 2015 9:09 PM, "Ascot Moss" <ascot.m...@gmail.com> wrote:

> Hi,
>
> I have my RDD that stores the titles of some articles:
> 1. "About Spark Streaming"
> 2. "About Spark MLlib"
> 3. "About Spark SQL"
> 4. "About Spark Installation"
> 5. "Kafka Streaming"
> 6. "Kafka Setup"
> 7. ....
>
> I need to build a model to find titles by similarity,
> e.g
> if given "About Spark", hope to get:
>
> "About Spark Installation", 0.98622 (where 0.98622 is the score
> of similarity, range between 0 to 1)
> "About Spark MLlib", 0.95394
> "About Spark Streaming", 0.94332
> "About Spark SQL", 0.9111
>
> Any idea or reference to do so?
>
> Thanks
> Ascot
>
> and need to find out similar titles
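
To make the cosine similarity suggestion concrete for the titles quoted above, here is a minimal sketch in plain Python. It assumes character trigram counts as the feature vector, which is only one possible choice (word counts or TF-IDF weights would work the same way). The helper names `char_ngrams`, `cosine_similarity` and `rank_titles` are made up for illustration and are not part of Spark. The scores it produces will not match the example numbers in the question.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of the lowercased text."""
    s = text.lower()
    if len(s) < n:
        # Assumption: very short strings are treated as a single gram.
        return Counter([s]) if s else Counter()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine_similarity(a, b):
    """Cosine between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_titles(query, titles, n=3):
    """Return (title, score) pairs sorted by descending similarity to query."""
    q = char_ngrams(query, n)
    scored = [(t, cosine_similarity(q, char_ngrams(t, n))) for t in titles]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


titles = [
    "About Spark Streaming",
    "About Spark MLlib",
    "About Spark SQL",
    "About Spark Installation",
    "Kafka Streaming",
    "Kafka Setup",
]

ranked = rank_titles("About Spark", titles)
for title, score in ranked:
    print(f"{title!r}, {score:.5f}")
```

Since the counts are non-negative, the scores fall between 0 and 1, as asked for in the question. If the titles live in an RDD, the same scoring function could probably be applied per element with `map` and the result sorted, provided the query vector is small enough to ship to the workers.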