In case it is of interest to other people, here is what I came up with, and it seems to work fine:

    import org.apache.spark.rdd.RDD

    case class RDDAsInputStream(private val rdd: RDD[String]) extends java.io.InputStream {
      // toLocalIterator pulls partitions to the driver one at a time instead of collecting everything
      private val bytes = rdd.flatMap(_.getBytes("UTF-8")).toLocalIterator

      // Mask with 0xff: read() must return a value in 0..255, or -1 at end of stream
      def read(): Int = if (bytes.hasNext) bytes.next() & 0xff else -1

      override def markSupported(): Boolean = false
    }

2015-03-13 13:56 GMT+01:00 Sean Owen <so...@cloudera.com>:

> OK, then you do not want to collect() the RDD. You can get an iterator,
> yes.
> There is no such thing as making an Iterator into an InputStream. An
> Iterator is a sequence of arbitrary objects; an InputStream is a
> channel to a stream of bytes.
> I think you can employ similar Guava / Commons utilities to make an
> Iterator of Streams in a stream of Readers, join the Readers, and
> encode the result as bytes in an InputStream.
>
> On Fri, Mar 13, 2015 at 10:33 AM, Ayoub <benali.ayoub.i...@gmail.com>
> wrote:
> > Thanks Sean,
> >
> > I forgot to mention that the data is too big to be collected on the
> > driver.
> >
> > So yes, your proposition would work in theory, but in my case I cannot
> > hold all the data in the driver memory, therefore it wouldn't work.
> >
> > I guess the crucial point is to do the collect in a lazy way, and on that
> > subject I noticed that we can get a local iterator from an RDD, but that
> > raises two questions:
> >
> > - does that involve an immediate collect, just like with "collect()", or
> >   is it a lazy process?
> > - how to go from an iterator to an InputStream?
> >
> > 2015-03-13 11:17 GMT+01:00 Sean Owen <[hidden email]>:
> >>
> >> These are quite different creatures. You have a distributed set of
> >> Strings, but want a local stream of bytes, which involves three
> >> conversions:
> >>
> >> - collect data to driver
> >> - concatenate strings in some way
> >> - encode strings as bytes according to an encoding
> >>
> >> Your approach is OK but might be faster to avoid disk, if you have
> >> enough memory:
> >>
> >> - collect() to an Array[String] locally
> >> - use Guava utilities to turn a bunch of Strings into a Reader
> >> - use the Apache Commons ReaderInputStream to read it as encoded bytes
> >>
> >> I might wonder if that's all really what you want to do though.
> >>
> >> On Fri, Mar 13, 2015 at 9:54 AM, Ayoub <[hidden email]> wrote:
> >> > Hello,
> >> >
> >> > I need to convert an RDD[String] to a java.io.InputStream but I didn't
> >> > find an easy way to do it.
> >> > Currently I am saving the RDD as a temporary file and then opening an
> >> > InputStream on the file, but that is not really optimal.
> >> >
> >> > Does anybody know a better way to do that?
> >> >
> >> > Thanks,
> >> > Ayoub.
> >> >
> >> > --
> >> > View this message in context:
> >> > http://apache-spark-user-list.1001560.n3.nabble.com/RDD-to-InputStream-tp22031.html
> >> > Sent from the Apache Spark User List mailing list archive at
> >> > Nabble.com.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Re-RDD-to-InputStream-tp22121.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
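
P.S. For anyone who finds this thread later, here is a rough sketch of a variant that wraps a plain Iterator[String] rather than an RDD, so it can be tried without a cluster. The class name StringIteratorInputStream and the separator parameter are made up for illustration; they are not part of Spark, Guava, or Commons IO. It encodes one string at a time on the driver, inserts a separator between elements, masks each byte with 0xff so that read() stays within 0..255, and overrides the bulk read so that buffered wrappers do not go byte by byte.

```scala
import java.io.InputStream
import java.nio.charset.{Charset, StandardCharsets}

// Hypothetical helper (not a Spark API): exposes an Iterator[String] as a byte stream.
// Strings are encoded one at a time, so only the current element's bytes are buffered.
class StringIteratorInputStream(
    lines: Iterator[String],
    separator: String = "\n", // assumption: elements should be newline-separated
    charset: Charset = StandardCharsets.UTF_8
) extends InputStream {

  private var current: Array[Byte] = Array.emptyByteArray
  private var pos: Int = 0

  // Advance to the next non-empty chunk; returns false once the iterator is exhausted.
  private def fill(): Boolean = {
    while (pos >= current.length && lines.hasNext) {
      current = (lines.next() + separator).getBytes(charset)
      pos = 0
    }
    pos < current.length
  }

  override def read(): Int =
    if (fill()) {
      val b = current(pos) & 0xff // bytes are signed in Scala/Java; read() must return 0..255
      pos += 1
      b
    } else -1

  // Bulk read: copies as much of the current chunk as fits into buf.
  override def read(buf: Array[Byte], off: Int, len: Int): Int =
    if (len == 0) 0
    else if (!fill()) -1
    else {
      val n = math.min(len, current.length - pos)
      System.arraycopy(current, pos, buf, off, n)
      pos += n
      n
    }
}
```

With Spark this would presumably be used as new StringIteratorInputStream(rdd.toLocalIterator). As far as I understand, toLocalIterator fetches one partition at a time, so the driver only needs enough memory for the largest partition, but please check that against the documentation for your Spark version.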