The General

On 2015/03/18 17:20:54 Ayoub wrote:
> In case it would interest other people, here is what I came up with, and it
> seems to work fine:
>
>   import org.apache.spark.rdd.RDD
>
>   case class RDDAsInputStream(private val rdd: RDD[String]) extends
>       java.io.InputStream {
>     // toLocalIterator pulls the bytes to the driver one partition at a time
>     private val bytes = rdd.flatMap(_.getBytes("UTF-8")).toLocalIterator
>
>     // read() must return 0..255, or -1 at end of stream, so mask the sign bit
>     override def read(): Int = {
>       if (bytes.hasNext) bytes.next() & 0xFF
>       else -1
>     }
>     override def markSupported(): Boolean = false
>   }
>
> 2015-03-13 13:56 GMT+01:00 Sean Owen <so...@cloudera.com>:
>
> > OK, then you do not want to collect() the RDD. You can get an iterator,
> > yes.
> > There is no such thing as making an Iterator into an InputStream. An
> > Iterator is a sequence of arbitrary objects; an InputStream is a
> > channel to a stream of bytes.
> > I think you can employ similar Guava / Commons utilities to turn an
> > Iterator of Strings into a stream of Readers, join the Readers, and
> > encode the result as bytes in an InputStream.
> >
> > On Fri, Mar 13, 2015 at 10:33 AM, Ayoub <be...@gmail.com>
> > wrote:
> > > Thanks Sean,
> > >
> > > I forgot to mention that the data is too big to be collected on the
> > > driver.
> > >
> > > So yes, your proposal would work in theory, but in my case I cannot hold
> > > all the data in driver memory, so it wouldn't work.
> > >
> > > I guess the crucial point is to do the collect in a lazy way. On that
> > > subject, I noticed that we can get a local iterator from an RDD, but that
> > > raises two questions:
> > >
> > > - does that involve an immediate collect, just like with "collect()", or
> > >   is it a lazy process?
> > > - how to go from an iterator to an InputStream?
> > >
> > > 2015-03-13 11:17 GMT+01:00 Sean Owen <[hidden email]>:
> > >>
> > >> These are quite different creatures. You have a distributed set of
> > >> Strings, but want a local stream of bytes, which involves three
> > >> conversions:
> > >>
> > >> - collect data to driver
> > >> - concatenate strings in some way
> > >> - encode strings as bytes according to an encoding
> > >>
> > >> Your approach is OK, but it might be faster to avoid disk, if you have
> > >> enough memory:
> > >>
> > >> - collect() to an Array[String] locally
> > >> - use Guava utilities to turn a bunch of Strings into a Reader
> > >> - use the Apache Commons ReaderInputStream to read it as encoded bytes
> > >>
> > >> I might wonder if that's all really what you want to do though.
> > >>
> > >> On Fri, Mar 13, 2015 at 9:54 AM, Ayoub <[hidden email]> wrote:
> > >> > Hello,
> > >> >
> > >> > I need to convert an RDD[String] to a java.io.InputStream but I didn't
> > >> > find an easy way to do it.
> > >> > Currently I am saving the RDD as a temporary file and then opening an
> > >> > InputStream on the file, but that is not really optimal.
> > >> >
> > >> > Does anybody know a better way to do that?
> > >> >
> > >> > Thanks,
> > >> > Ayoub.
> > >> >
> > >> > --
> > >> > View this message in context:
> > >> > http://apache-spark-user-list.1001560.n3.nabble.com/RDD-to-InputStream-tp22031.html
> > >> > Sent from the Apache Spark User List mailing list archive at
> > >> > Nabble.com.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Re-RDD-to-InputStream-tp22121.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

Sent from my Samsung Galaxy smartphone.
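
For readers who want to try the "iterator of streams, joined together" idea from the thread without Guava or Commons, here is a minimal sketch using only the JDK. It is not the code anyone in the thread posted. The names LazyStringStream and asInputStream, and the newline separator appended after each element, are assumptions made for illustration. It takes a plain Iterator[String] so that it runs without a Spark cluster.

```scala
import java.io.{ByteArrayInputStream, InputStream, SequenceInputStream}
import java.nio.charset.StandardCharsets
import java.util.{Enumeration => JEnumeration}

// Hypothetical helper, not part of Spark or of the original thread.
object LazyStringStream {

  /** Exposes an Iterator[String] as a single InputStream of UTF-8 bytes.
    *
    * Elements are pulled from the iterator one at a time as the stream is
    * consumed, so only one encoded element is held in memory at once.
    * `separator` is appended after every element; that is a choice made
    * for this sketch, since the thread leaves concatenation open.
    */
  def asInputStream(lines: Iterator[String],
                    separator: String = "\n"): InputStream = {
    // Adapt the Scala iterator to the Enumeration that
    // SequenceInputStream expects, encoding each element on demand.
    val chunks = new JEnumeration[InputStream] {
      override def hasMoreElements(): Boolean = lines.hasNext
      override def nextElement(): InputStream =
        new ByteArrayInputStream(
          (lines.next() + separator).getBytes(StandardCharsets.UTF_8))
    }
    // SequenceInputStream reads each chunk to its end, then moves on.
    new SequenceInputStream(chunks)
  }
}
```

With Spark, the iterator would presumably come from rdd.toLocalIterator, for example LazyStringStream.asInputStream(rdd.toLocalIterator). As far as I know, toLocalIterator fetches one partition at a time, so the driver needs enough memory for the largest partition rather than for the whole RDD.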
