Thanks for the pointer to that thread. Looks like there is some demand for this capability, but not a lot yet. Also doesn't look like there is an easy answer right now.
Thanks,
Ron

From: Victor Tso-Guillen [mailto:v...@paxata.com]
Sent: Wednesday, September 03, 2014 10:40 AM
To: Daniel, Ronald (ELS-SDG)
Cc: user@spark.apache.org
Subject: Re: Accessing neighboring elements in an RDD

Interestingly, there was an almost identical question posed on Aug 22 by cjwang. Here's the link to the archive:

http://apache-spark-user-list.1001560.n3.nabble.com/Finding-previous-and-next-element-in-a-sorted-RDD-td12621.html#a12664

On Wed, Sep 3, 2014 at 10:33 AM, Daniel, Ronald (ELS-SDG) <r.dan...@elsevier.com> wrote:

Hi all,

Assume I have read the lines of a text file into an RDD:

    textFile = sc.textFile("SomeArticle.txt")

Also assume that the sentence breaks in SomeArticle.txt were done by machine and have some errors, such as the break at Fig. in the sample text below.

    Index   Text
    N       ...as shown in Fig.
    N+1     1.
    N+2     The figure shows...

What I want is an RDD with:

    N       ... as shown in Fig. 1.
    N+1     The figure shows...

Is there some way a filter() can look at neighboring elements in an RDD? That way I could look, in parallel, at neighboring elements in an RDD and come up with a new RDD that may have a different number of elements. Or do I just have to sequentially iterate through the RDD?

Thanks,
Ron
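
For anyone reading this thread later, here is a minimal sketch of one way the question above might be approached. It is not a method confirmed by anyone in the thread. The names merge_fragments, with_next_line, and abbreviations are hypothetical, and the rule "a line ending in Fig. was split by mistake" is an assumption taken from the sample text. The first function is plain Python and does the merging; the second assumes a PySpark RDD and uses zipWithIndex() plus an index-shifted leftOuterJoin() to put each line next to its successor.

```python
def merge_fragments(lines, abbreviations=("Fig.",)):
    """Glue a line onto the previous one when the previous line
    ends with one of the given abbreviations (assumed split errors)."""
    merged = []
    for line in lines:
        if merged and merged[-1].rstrip().endswith(tuple(abbreviations)):
            # The previous line was cut off at an abbreviation: rejoin it.
            merged[-1] = merged[-1].rstrip() + " " + line.lstrip()
        else:
            merged.append(line)
    return merged


def with_next_line(lines_rdd):
    """Return an RDD of (line, next_line_or_None) pairs, in line order.
    Assumes lines_rdd is a PySpark RDD of strings."""
    # (index, line) for every line.
    by_index = lines_rdd.zipWithIndex().map(lambda pair: (pair[1], pair[0]))
    # Re-key line i+1 under key i so the join lines it up with line i.
    shifted = by_index.map(lambda kv: (kv[0] - 1, kv[1]))
    return by_index.leftOuterJoin(shifted).sortByKey().values()


# Plain-Python check using the sample text from the question.
sample = ["...as shown in Fig.", "1.", "The figure shows..."]
print(merge_fragments(sample))
```

Something like textFile.mapPartitions(merge_fragments) would run the merge in parallel, but it would miss a broken sentence that straddles a partition boundary. The pairs from with_next_line(textFile) are one possible way to detect such cases, at the cost of a shuffle for the join.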