OK, then you do not want to collect() the RDD. You can get a local iterator, yes. There is no direct way to make an Iterator into an InputStream: an Iterator is a sequence of arbitrary objects, while an InputStream is a channel to a stream of bytes. I think you can employ similar Guava / Commons utilities to turn an Iterator of Strings into a sequence of Readers, join the Readers, and encode the result as bytes in an InputStream.
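For what it's worth, here is a rough sketch of the iterator-to-stream idea using only the JDK rather than Guava / Commons, so it swaps Readers for one small byte stream per String, chained with java.io.SequenceInputStream. The class and method names (IteratorInputStreams, toInputStream) are made up for illustration, and appending "\n" after each element is just one assumed way of concatenating. In Spark the iterator would presumably come from the RDD's toLocalIterator, which as far as I know fetches one partition at a time rather than everything at once.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.Charset;
import java.util.Enumeration;
import java.util.Iterator;

// Hypothetical helper, not part of Spark, Guava or Commons.
final class IteratorInputStreams {

    private IteratorInputStreams() {}

    // Exposes an Iterator of Strings as one InputStream of encoded bytes.
    // Each element is pulled from the iterator only when the previous
    // chunk has been fully read, so at most one element is held at a time.
    static InputStream toInputStream(final Iterator<String> lines, final Charset charset) {
        Enumeration<InputStream> chunks = new Enumeration<InputStream>() {
            @Override
            public boolean hasMoreElements() {
                return lines.hasNext();
            }

            @Override
            public InputStream nextElement() {
                // Assumption: elements are joined with a trailing newline.
                byte[] bytes = (lines.next() + "\n").getBytes(charset);
                return new ByteArrayInputStream(bytes);
            }
        };
        return new SequenceInputStream(chunks);
    }
}
```

Usage would be along the lines of IteratorInputStreams.toInputStream(javaRdd.toLocalIterator(), StandardCharsets.UTF_8) for a JavaRDD<String>; from Scala the iterator would need converting to a java.util.Iterator first. Note that the driver still needs enough memory for the largest partition.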
On Fri, Mar 13, 2015 at 10:33 AM, Ayoub <benali.ayoub.i...@gmail.com> wrote:
> Thanks Sean,
>
> I forgot to mention that the data is too big to be collected on the driver.
>
> So yes, your proposition would work in theory, but in my case I cannot hold
> all the data in the driver memory, therefore it wouldn't work.
>
> I guess the crucial point is to do the collect in a lazy way, and on that
> subject I noticed that we can get a local iterator from an RDD, but that
> raises two questions:
>
> - does that involve an immediate collect just like with "collect()", or is
> it a lazy process?
> - how to go from an iterator to an InputStream?
>
>
> 2015-03-13 11:17 GMT+01:00 Sean Owen <[hidden email]>:
>>
>> These are quite different creatures. You have a distributed set of
>> Strings, but want a local stream of bytes, which involves three
>> conversions:
>>
>> - collect data to driver
>> - concatenate strings in some way
>> - encode strings as bytes according to an encoding
>>
>> Your approach is OK but it might be faster to avoid disk, if you have
>> enough memory:
>>
>> - collect() to an Array[String] locally
>> - use Guava utilities to turn a bunch of Strings into a Reader
>> - use the Apache Commons ReaderInputStream to read it as encoded bytes
>>
>> I might wonder if that's all really what you want to do though.
>>
>>
>> On Fri, Mar 13, 2015 at 9:54 AM, Ayoub <[hidden email]> wrote:
>> > Hello,
>> >
>> > I need to convert an RDD[String] to a java.io.InputStream but I didn't
>> > find an easy way to do it.
>> > Currently I am saving the RDD as a temporary file and then opening an
>> > InputStream on the file, but that is not really optimal.
>> >
>> > Does anybody know a better way to do that?
>> >
>> > Thanks,
>> > Ayoub.
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/RDD-to-InputStream-tp22031.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org