Thanks Sean, I forgot to mention that the data is too big to be collected on the driver.
So yes, your proposal would work in theory, but in my case I cannot hold all the data in driver memory, so it won't work here.

I guess the crucial point is to do the collect lazily. On that subject, I noticed that we can get a local iterator from an RDD, but that raises two questions:

- does that involve an immediate collect, just like "collect()", or is it a lazy process?
- how do I go from an iterator to an InputStream?

2015-03-13 11:17 GMT+01:00 Sean Owen <so...@cloudera.com>:

> These are quite different creatures. You have a distributed set of
> Strings, but want a local stream of bytes, which involves three
> conversions:
>
> - collect data to driver
> - concatenate strings in some way
> - encode strings as bytes according to an encoding
>
> Your approach is OK but might be faster to avoid disk, if you have
> enough memory:
>
> - collect() to a Array[String] locally
> - use Guava utilities to turn a bunch of Strings into a Reader
> - Use the Apache Commons ReaderInputStream to read it as encoded bytes
>
> I might wonder if that's all really what you want to do though.
>
>
> On Fri, Mar 13, 2015 at 9:54 AM, Ayoub <benali.ayoub.i...@gmail.com> wrote:
> > Hello,
> >
> > I need to convert an RDD[String] to a java.io.InputStream but I didn't find
> > an easy way to do it.
> > Currently I am saving the RDD as temporary file and then opening an
> > inputstream on the file but that is not really optimal.
> >
> > Does anybody know a better way to do that ?
> >
> > Thanks,
> > Ayoub.
> >
> > --
> > View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-to-InputStream-tp22031.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
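
To make the two questions more concrete: as far as I understand, toLocalIterator fetches one partition at a time, so the driver should only need enough memory for the largest partition rather than the whole data set. For the second question, here is a rough sketch of the kind of adapter I have in mind. It uses only the JDK (SequenceInputStream). The class name IteratorInputStream, the helper toInputStream, the UTF-8 encoding and the newline separator are all my own assumptions, and the in-memory list in main is only a stand-in for the iterator coming from the RDD.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Enumeration;
import java.util.Iterator;

// Hypothetical helper, not part of Spark: adapts an Iterator<String> to an InputStream.
class IteratorInputStream {

    // Each line is encoded as UTF-8 and terminated with '\n' (both are assumptions).
    // SequenceInputStream fetches the first element when it is constructed and
    // pulls the remaining elements from the iterator only as the stream is read.
    public static InputStream toInputStream(final Iterator<String> lines) {
        Enumeration<InputStream> chunks = new Enumeration<InputStream>() {
            @Override
            public boolean hasMoreElements() {
                return lines.hasNext();
            }

            @Override
            public InputStream nextElement() {
                byte[] bytes = (lines.next() + "\n").getBytes(StandardCharsets.UTF_8);
                return new ByteArrayInputStream(bytes);
            }
        };
        return new SequenceInputStream(chunks);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the iterator returned by toLocalIterator() on the RDD.
        Iterator<String> lines = Arrays.asList("first", "second").iterator();
        try (InputStream in = toInputStream(lines)) {
            byte[] buffer = new byte[8192];
            int n;
            // Copy the stream to stdout chunk by chunk.
            while ((n = in.read(buffer)) != -1) {
                System.out.write(buffer, 0, n);
            }
        }
        System.out.flush();
    }
}
```

With something like this, only one line at a time is held as bytes by the adapter, on top of whatever the underlying iterator keeps buffered for the current partition.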