Thanks Sean,

I forgot to mention that the data is too big to be collected on the driver.

So yes, your suggestion would work in theory, but in my case I cannot hold
all the data in driver memory, so it wouldn't work.

I guess the crucial point is to do the collect lazily. On that subject, I
noticed that we can get a local iterator from an RDD, but that raises two
questions:

- does that involve an immediate collect, just like "collect()", or is it a
lazy process?
- how do I go from an iterator to an InputStream?
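
For the second question, here is a minimal sketch of one possible way to do
it, using only the JDK. It wraps each string in a small ByteArrayInputStream
and chains them with java.io.SequenceInputStream, so strings are pulled from
the iterator only as the stream is read (the constructor may fetch the first
element up front). The class name LazyLines, the method name toInputStream,
the UTF-8 encoding and the "\n" separator are my own assumptions, not
anything from Spark. As for the first question, my understanding of the Spark
API docs is that toLocalIterator fetches one partition at a time, so the
driver should need memory for the largest partition rather than for the whole
RDD.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Enumeration;
import java.util.Iterator;

// Hypothetical helper: adapts an Iterator<String> to an InputStream.
final class LazyLines {

    private LazyLines() {}

    // Assumptions: UTF-8 encoding, one "\n" appended after every element.
    static InputStream toInputStream(final Iterator<String> lines) {
        Enumeration<InputStream> chunks = new Enumeration<InputStream>() {
            @Override
            public boolean hasMoreElements() {
                return lines.hasNext();
            }

            @Override
            public InputStream nextElement() {
                // Only the current element is held in memory as bytes.
                byte[] bytes = (lines.next() + "\n").getBytes(StandardCharsets.UTF_8);
                return new ByteArrayInputStream(bytes);
            }
        };
        // SequenceInputStream asks for the next chunk only when the
        // current one is exhausted.
        return new SequenceInputStream(chunks);
    }
}
```

With the Spark Java API, I would expect the iterator to come from something
like rdd.toLocalIterator() on a JavaRDD<String>; from Scala, the Scala
iterator would first need converting to a java.util.Iterator.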


2015-03-13 11:17 GMT+01:00 Sean Owen <so...@cloudera.com>:

> These are quite different creatures. You have a distributed set of
> Strings, but want a local stream of bytes, which involves three
> conversions:
>
> - collect data to driver
> - concatenate strings in some way
> - encode strings as bytes according to an encoding
>
> Your approach is OK, but it might be faster to avoid disk if you have
> enough memory:
>
> - collect() to an Array[String] locally
> - use Guava utilities to turn a bunch of Strings into a Reader
> - Use the Apache Commons ReaderInputStream to read it as encoded bytes
>
> I might wonder if that's all really what you want to do though.
>
>
> On Fri, Mar 13, 2015 at 9:54 AM, Ayoub <benali.ayoub.i...@gmail.com>
> wrote:
> > Hello,
> >
> > I need to convert an RDD[String] to a java.io.InputStream but I didn't
> > find an easy way to do it.
> > Currently I am saving the RDD as a temporary file and then opening an
> > InputStream on the file, but that is not really optimal.
> >
> > Does anybody know a better way to do that ?
> >
> > Thanks,
> > Ayoub.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Re-RDD-to-InputStream-tp22032.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.