In case it is of interest to others, here is what I came up with, and it
seems to work fine:

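  import org.apache.spark.rdd.RDD

  case class RDDAsInputStream(private val rdd: RDD[String]) extends java.io.InputStream {
    // toLocalIterator fetches one partition at a time, so the driver never
    // holds the whole RDD. The strings are concatenated with no separator.
    private val bytes = rdd.flatMap(_.getBytes("UTF-8")).toLocalIterator

    // read() must return 0..255, or -1 at end of stream; mask the signed
    // Byte so that non-ASCII bytes are not returned as negative values.
    override def read(): Int = {
      if (bytes.hasNext) bytes.next() & 0xff
      else -1
    }
    override def markSupported(): Boolean = false
  }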
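
A minimal sketch of the same idea without a Spark dependency, for anyone
who wants to try it locally. It wraps a plain Iterator[String], which is
what rdd.toLocalIterator would return for an RDD[String]. The class name
IteratorInputStream and the separator parameter are hypothetical and are
not part of Spark; the newline separator is only an assumption about how
the strings should be joined. Each byte is masked with 0xff because
InputStream.read() must return a value from 0 to 255, or -1 at the end.

```scala
import java.io.InputStream
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical helper (not from Spark): exposes an Iterator[String] as an
// InputStream, encoding one string at a time instead of all of them up front.
class IteratorInputStream(strings: Iterator[String], separator: String = "\n")
    extends InputStream {
  // Bytes of the string currently being served, and the read position in it.
  private var current: Array[Byte] = Array.empty[Byte]
  private var pos = 0

  // Load the next non-empty chunk; false once the iterator is exhausted.
  private def fill(): Boolean = {
    while (pos >= current.length && strings.hasNext) {
      current = (strings.next() + separator).getBytes(UTF_8)
      pos = 0
    }
    pos < current.length
  }

  override def read(): Int =
    if (fill()) {
      val b = current(pos) & 0xff // mask to 0..255; a raw Byte may be negative
      pos += 1
      b
    } else -1
}

// Usage: read the stream back as text, including a non-ASCII character.
val stream = new IteratorInputStream(Iterator("h\u00e9llo", "world"))
val text = scala.io.Source.fromInputStream(stream, "UTF-8").mkString
```

Encoding per string keeps memory proportional to the longest string rather
than to the whole data set, which matches the reason for using a local
iterator in the first place.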


2015-03-13 13:56 GMT+01:00 Sean Owen <so...@cloudera.com>:

> OK, then you do not want to collect() the RDD. You can get an iterator,
> yes.
> There is no such thing as making an Iterator into an InputStream. An
> Iterator is a sequence of arbitrary objects; an InputStream is a
> channel to a stream of bytes.
> I think you can employ similar Guava / Commons utilities to turn an
> Iterator of Strings into a stream of Readers, join the Readers, and
> encode the result as bytes in an InputStream.
>
> On Fri, Mar 13, 2015 at 10:33 AM, Ayoub <benali.ayoub.i...@gmail.com>
> wrote:
> > Thanks Sean,
> >
> > I forgot to mention that the data is too big to be collected on the
> > driver.
> >
> > So yes your proposition would work in theory but in my case I cannot hold
> > all the data in the driver memory, therefore it wouldn't work.
> >
> > I guess the crucial point is to do the collect lazily. On that
> > subject, I noticed that we can get a local iterator from an RDD, but
> > that raises two questions:
> >
> > - does that involve an immediate collect, just like with "collect()",
> >   or is it a lazy process?
> > - how to go from an iterator to an InputStream?
> >
> >
> > 2015-03-13 11:17 GMT+01:00 Sean Owen <[hidden email]>:
> >>
> >> These are quite different creatures. You have a distributed set of
> >> Strings, but want a local stream of bytes, which involves three
> >> conversions:
> >>
> >> - collect data to driver
> >> - concatenate strings in some way
> >> - encode strings as bytes according to an encoding
> >>
> >> Your approach is OK, but it might be faster to avoid disk, if you have
> >> enough memory:
> >>
> >> - collect() to an Array[String] locally
> >> - use Guava utilities to turn a bunch of Strings into a Reader
> >> - use the Apache Commons ReaderInputStream to read it as encoded bytes
> >>
> >> I do wonder whether that's really what you want to do, though.
> >>
> >>
> >> On Fri, Mar 13, 2015 at 9:54 AM, Ayoub <[hidden email]> wrote:
> >> > Hello,
> >> >
> >> > I need to convert an RDD[String] to a java.io.InputStream but I didn't
> >> > find an easy way to do it.
> >> > Currently I am saving the RDD as a temporary file and then opening an
> >> > InputStream on the file, but that is not really optimal.
> >> >
> >> > Does anybody know a better way to do that?
> >> >
> >> > Thanks,
> >> > Ayoub.
> >> >
> >> >
> >> >
> >> > --
> >> > View this message in context:
> >> > http://apache-spark-user-list.1001560.n3.nabble.com/RDD-to-InputStream-tp22031.html
> >> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Re-RDD-to-InputStream-tp22121.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
