There is support for Spark in Elasticsearch’s Hadoop integration package (elasticsearch-hadoop):

http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/spark.html

Maybe you could split your documents into sentences, insert them all from 
Spark, and then run a “more_like_this” query against the Elasticsearch index.  
I haven’t tried it, but maybe someone else has more experience using Spark 
with Elasticsearch.  At some point, maybe there could be an information 
retrieval package for Spark with locality-sensitive hashing and other similar 
functions.  In the meantime, something like the untested sketch below might be 
a starting point.
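
For example, here is a rough PySpark sketch of the write side, assuming the 
elasticsearch-hadoop jar is on the classpath and an Elasticsearch node on 
localhost; the "articles/sentence" index/type and the document fields are 
made up for illustration:

    from pyspark import SparkContext

    sc = SparkContext(appName="EsWriteSketch")

    # elasticsearch-hadoop expects (key, dict) pairs; one doc per sentence.
    docs = sc.parallelize([
        ("1", {"article": "SomeArticle.txt", "sentence": "...as shown in Fig. 1."}),
        ("2", {"article": "SomeArticle.txt", "sentence": "The figure shows..."}),
    ])

    es_conf = {
        "es.nodes": "localhost",             # assumption: local ES node
        "es.port": "9200",
        "es.resource": "articles/sentence",  # made-up index/type
    }

    # Write through elasticsearch-hadoop's MapReduce OutputFormat.
    docs.saveAsNewAPIHadoopFile(
        path="-",  # ignored by EsOutputFormat, but required by the API
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=es_conf)

The “more like this” part would then be an ordinary more_like_this query 
against that index, outside of Spark.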

 
On Sep 3, 2014, at 10:40 AM, Victor Tso-Guillen <v...@paxata.com> wrote:

> Interestingly, there was an almost identical question posed on Aug 22 by 
> cjwang. Here's the link to the archive: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Finding-previous-and-next-element-in-a-sorted-RDD-td12621.html#a12664
> 
> 
> On Wed, Sep 3, 2014 at 10:33 AM, Daniel, Ronald (ELS-SDG) 
> <r.dan...@elsevier.com> wrote:
> Hi all,
> 
> Assume I have read the lines of a text file into an RDD:
> 
>     textFile = sc.textFile("SomeArticle.txt")
> 
> Also assume that the sentence breaks in SomeArticle.txt were made by a machine 
> and contain some errors, such as the break at "Fig." in the sample text below.
> 
> Index   Text
> N       ...as shown in Fig.
> N+1     1.
> N+2     The figure shows...
> 
> What I want is an RDD with:
> 
> N       ...as shown in Fig. 1.
> N+1     The figure shows...
> 
> Is there some way a filter() can look at neighboring elements in an RDD? That 
> way I could examine neighbors in parallel and come up with a new RDD that may 
> have a different number of elements.  Or do I just have to iterate through 
> the RDD sequentially?
> 
> Thanks,
> Ron
> 
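On the question above: one way to look at neighbors without iterating 
sequentially is zipWithIndex plus a join against an index-shifted copy of the 
RDD.  An untested sketch (the merge heuristic is invented for the Fig. 
example):

    textFile = sc.textFile("SomeArticle.txt")

    # (index, line) pairs; zipWithIndex preserves the original line order.
    indexed = textFile.zipWithIndex().map(lambda li: (li[1], li[0]))

    # Shift keys down by one so that line i joins against line i+1.
    successors = indexed.map(lambda kv: (kv[0] - 1, kv[1]))

    # (index, (line, next_line_or_None)): each element now sees its successor.
    withNext = indexed.leftOuterJoin(successors)

    # Crude heuristic for the example: a line ending in "Fig." broke too early.
    def brokenBreak(line):
        return line.rstrip().endswith("Fig.")

    # Glue a line to its successor when the break looks wrong.
    merged = withNext.map(lambda kv:
        (kv[0], kv[1][0] + " " + kv[1][1])
        if kv[1][1] is not None and brokenBreak(kv[1][0])
        else (kv[0], kv[1][0]))

A second pass with the same trick, joining against predecessors instead of 
successors, could then drop the absorbed lines, leaving an RDD with a 
different number of elements, as in the example.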
