OK, then you do not want to collect() the RDD. You can get an iterator, yes.
There is no such thing as making an Iterator into an InputStream. An
Iterator is a sequence of arbitrary objects; an InputStream is a
channel to a stream of bytes.
I think you can employ similar Guava / Commons utilities to turn an
Iterator of Strings into a sequence of Readers, join the Readers, and
encode the result as bytes in an InputStream.
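
As a rough illustration of the idea, here is a minimal sketch that uses only
the JDK instead of the Guava / Commons route: it wraps each String in a small
byte stream and lets java.io.SequenceInputStream chain them on demand. The
class and method names (IteratorInputStreams, fromLines) are hypothetical, and
appending "\n" after each element is an assumption about how the Strings
should be joined.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.Charset;
import java.util.Enumeration;
import java.util.Iterator;

// Hypothetical helper, not part of Spark, Guava, or Commons IO.
final class IteratorInputStreams {
    private IteratorInputStreams() {}

    /**
     * Exposes the Strings of an iterator as one stream of encoded bytes.
     * Elements are pulled from the iterator only as the stream is read
     * (SequenceInputStream fetches the first element when it is constructed).
     */
    static InputStream fromLines(final Iterator<String> lines, final Charset charset) {
        Enumeration<InputStream> chunks = new Enumeration<InputStream>() {
            @Override
            public boolean hasMoreElements() {
                return lines.hasNext();
            }

            @Override
            public InputStream nextElement() {
                // Assumption: elements are joined with a newline separator.
                String line = lines.next() + "\n";
                return new ByteArrayInputStream(line.getBytes(charset));
            }
        };
        return new SequenceInputStream(chunks);
    }
}
```

With Spark's Java API this could be called as
IteratorInputStreams.fromLines(rdd.toLocalIterator(), StandardCharsets.UTF_8).
toLocalIterator() brings one partition at a time to the driver, so memory use
is bounded by the largest partition rather than by the whole RDD.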

On Fri, Mar 13, 2015 at 10:33 AM, Ayoub <benali.ayoub.i...@gmail.com> wrote:
> Thanks Sean,
>
> I forgot to mention that the data is too big to be collected on the driver.
>
> So yes your proposition would work in theory but in my case I cannot hold
> all the data in the driver memory, therefore it wouldn't work.
>
> I guess the crucial point is to do the collect in a lazy way. On that
> subject, I noticed that we can get a local iterator from an RDD, but that
> raises two questions:
>
> - does that involve an immediate collect, just like with "collect()", or is
> it a lazy process?
> - how do we go from an iterator to an InputStream?
>
>
> 2015-03-13 11:17 GMT+01:00 Sean Owen <[hidden email]>:
>>
>> These are quite different creatures. You have a distributed set of
>> Strings, but want a local stream of bytes, which involves three
>> conversions:
>>
>> - collect data to driver
>> - concatenate strings in some way
>> - encode strings as bytes according to an encoding
>>
>> Your approach is OK, but it might be faster to avoid disk, if you have
>> enough memory:
>>
>> - collect() to an Array[String] locally
>> - use Guava utilities to turn a bunch of Strings into a Reader
>> - use the Apache Commons ReaderInputStream to read it as encoded bytes
>>
>> I do wonder whether that is really what you want to do, though.
>>
>>
>> On Fri, Mar 13, 2015 at 9:54 AM, Ayoub <[hidden email]> wrote:
>> > Hello,
>> >
>> > I need to convert an RDD[String] to a java.io.InputStream but I didn't
>> > find an easy way to do it.
>> > Currently I am saving the RDD as a temporary file and then opening an
>> > InputStream on the file, but that is not really optimal.
>> >
>> > Does anybody know a better way to do that ?
>> >
>> > Thanks,
>> > Ayoub.
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/RDD-to-InputStream-tp22031.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: [hidden email]
>> > For additional commands, e-mail: [hidden email]
>> >

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
