RE: Re: RDD to InputStream

2022-12-25 Thread ayuio5799
The General

On 2015/03/18 17:20:54 Ayoub wrote:
> In case it is of interest to others, here is what I came up with; it
> seems to work fine:
>
>   case class RDDAsInputStream(private val rdd: RDD[String]) extends
>   java.io.InputStream {
>     private val bytes = rdd.flatMap(_.getBytes("UTF-8")).toLocalIterator
>
>     // read() must return a value in 0..255, so mask the signed byte.
>     override def read(): Int = {
>       if (bytes.hasNext) bytes.next() & 0xff
>       else -1
>     }
>     override def markSupported(): Boolean = false
>   }

Sent from my Samsung Galaxy smartphone.

Re: RDD to InputStream

2015-03-18 Thread Ayoub
In case it is of interest to others, here is what I came up with; it
seems to work fine:

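  import org.apache.spark.rdd.RDD

  case class RDDAsInputStream(private val rdd: RDD[String]) extends
  java.io.InputStream {
    // Bytes are pulled to the driver lazily, one partition at a time.
    private val bytes = rdd.flatMap(_.getBytes("UTF-8")).toLocalIterator

    // read() must return a value in 0..255, so mask the signed byte.
    override def read(): Int = {
      if (bytes.hasNext) bytes.next() & 0xff
      else -1
    }
    override def markSupported(): Boolean = false
  }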
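
A possible variant of the class above, as a rough sketch rather than a tested replacement: instead of building an RDD of individual bytes, it encodes one string at a time on the driver and also offers a bulk read. The class name StringIteratorInputStream is hypothetical, and it assumes the input is any Iterator[String], for example the result of rdd.toLocalIterator. Like the original, it inserts no separator between strings.

```scala
import java.io.InputStream
import java.nio.charset.StandardCharsets

// Hypothetical helper, not from the thread: streams an Iterator[String]
// (for example rdd.toLocalIterator) as UTF-8 bytes, one string at a time.
class StringIteratorInputStream(strings: Iterator[String]) extends InputStream {
  private var buffer: Array[Byte] = Array.emptyByteArray
  private var pos = 0

  // Move to the next non-empty string; false once the iterator is exhausted.
  private def fill(): Boolean = {
    while (pos >= buffer.length && strings.hasNext) {
      buffer = strings.next().getBytes(StandardCharsets.UTF_8)
      pos = 0
    }
    pos < buffer.length
  }

  override def read(): Int =
    if (fill()) {
      val b = buffer(pos) & 0xff // mask so bytes >= 0x80 are not negative
      pos += 1
      b
    } else -1

  // Bulk read: copies from the current buffer instead of going byte by byte.
  override def read(out: Array[Byte], off: Int, len: Int): Int =
    if (len == 0) 0
    else if (!fill()) -1
    else {
      val n = math.min(len, buffer.length - pos)
      System.arraycopy(buffer, pos, out, off, n)
      pos += n
      n
    }
}
```

With Spark this would be used as new StringIteratorInputStream(rdd.toLocalIterator), so at most one partition should need to be held on the driver at a time.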


2015-03-13 13:56 GMT+01:00 Sean Owen so...@cloudera.com:

 OK, then you do not want to collect() the RDD. You can get an iterator,
 yes.
 There is no such thing as making an Iterator into an InputStream. An
 Iterator is a sequence of arbitrary objects; an InputStream is a
 channel to a stream of bytes.
 I think you can employ similar Guava / Commons utilities to turn an
 Iterator of Strings into a sequence of Readers, join the Readers, and
 encode the result as bytes in an InputStream.

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Re-RDD-to-InputStream-tp22121.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: RDD to InputStream

2015-03-13 Thread Sean Owen
These are quite different creatures. You have a distributed set of
Strings, but want a local stream of bytes, which involves three
conversions:

- collect data to driver
- concatenate strings in some way
- encode strings as bytes according to an encoding

Your approach is OK, but it might be faster to avoid disk if you have
enough memory:

- collect() to an Array[String] locally
- use Guava utilities to turn a bunch of Strings into a Reader
- use the Apache Commons ReaderInputStream to read it as encoded bytes

I do wonder whether that's really what you want to do, though.
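
A minimal sketch of the in-memory route, for illustration only. It swaps the Guava and Commons utilities mentioned above for plain JDK classes so that it is self-contained. The function name collectedToInputStream and the newline separator are assumptions, and the argument stands for the result of rdd.collect(), so it only applies when the data fits in driver memory.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import java.nio.charset.StandardCharsets

// Hypothetical helper: `collected` stands for the result of rdd.collect();
// `separator` is an assumed choice for how to concatenate the strings.
def collectedToInputStream(collected: Array[String],
                           separator: String = "\n"): InputStream = {
  val joined = collected.mkString(separator)          // concatenate the strings
  val bytes = joined.getBytes(StandardCharsets.UTF_8) // encode as UTF-8 bytes
  new ByteArrayInputStream(bytes)                     // expose as a local stream
}
```

Note that this holds the strings, the joined string, and the encoded bytes in memory at once.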


On Fri, Mar 13, 2015 at 9:54 AM, Ayoub benali.ayoub.i...@gmail.com wrote:
 Hello,

 I need to convert an RDD[String] to a java.io.InputStream but I didn't find
 an easy way to do it.
 Currently I am saving the RDD as a temporary file and then opening an
 InputStream on the file, but that is not really optimal.

 Does anybody know a better way to do that?

 Thanks,
 Ayoub.



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/RDD-to-InputStream-tp22031.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: RDD to InputStream

2015-03-13 Thread Ayoub
Thanks Sean,

I forgot to mention that the data is too big to be collected on the driver.

So yes, your proposal would work in theory, but in my case I cannot hold
all the data in driver memory, so it wouldn't work.

I guess the crucial point is to do the collect lazily. On that
subject, I noticed that we can get a local iterator from an RDD, but that
raises two questions:

- does that involve an immediate collect, just like collect(), or is it
a lazy process?
- how do I go from an iterator to an InputStream?


2015-03-13 11:17 GMT+01:00 Sean Owen so...@cloudera.com:

 These are quite different creatures. You have a distributed set of
 Strings, but want a local stream of bytes, which involves three
 conversions:

 - collect data to driver
 - concatenate strings in some way
 - encode strings as bytes according to an encoding

 Your approach is OK, but it might be faster to avoid disk if you have
 enough memory:

 - collect() to an Array[String] locally
 - use Guava utilities to turn a bunch of Strings into a Reader
 - use the Apache Commons ReaderInputStream to read it as encoded bytes

 I do wonder whether that's really what you want to do, though.


--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Re-RDD-to-InputStream-tp22032.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: RDD to InputStream

2015-03-13 Thread Sean Owen
OK, then you do not want to collect() the RDD. You can get an iterator, yes.
There is no such thing as making an Iterator into an InputStream. An
Iterator is a sequence of arbitrary objects; an InputStream is a
channel to a stream of bytes.
I think you can employ similar Guava / Commons utilities to turn an
Iterator of Strings into a sequence of Readers, join the Readers, and
encode the result as bytes in an InputStream.
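
One way the "join" idea could look, sketched with JDK classes only rather than Guava or Commons (a deliberate substitution to keep the example self-contained). It joins byte streams instead of Readers: each string is encoded on demand and the pieces are chained with java.io.SequenceInputStream. The function name iteratorToInputStream and the newline separator are assumptions; the iterator would typically be rdd.toLocalIterator.

```scala
import java.io.{ByteArrayInputStream, InputStream, SequenceInputStream}
import java.nio.charset.StandardCharsets
import java.util.{Enumeration => JEnumeration}

// Hypothetical helper: lazily joins one small stream per string.
// `separator` is an assumed choice and is appended after every string.
def iteratorToInputStream(lines: Iterator[String],
                          separator: String = "\n"): InputStream = {
  val streams = new JEnumeration[InputStream] {
    def hasMoreElements(): Boolean = lines.hasNext
    // Encode the next string only when the previous stream has been drained.
    def nextElement(): InputStream =
      new ByteArrayInputStream(
        (lines.next() + separator).getBytes(StandardCharsets.UTF_8))
  }
  // SequenceInputStream reads each stream to its end, then asks for the next.
  new SequenceInputStream(streams)
}
```

SequenceInputStream fetches the first element when it is constructed and the rest as reading proceeds, so only one string's bytes are buffered at a time.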

On Fri, Mar 13, 2015 at 10:33 AM, Ayoub benali.ayoub.i...@gmail.com wrote:
 Thanks Sean,

 I forgot to mention that the data is too big to be collected on the driver.

 So yes, your proposal would work in theory, but in my case I cannot hold
 all the data in driver memory, so it wouldn't work.

 I guess the crucial point is to do the collect lazily. On that
 subject, I noticed that we can get a local iterator from an RDD, but that
 raises two questions:

 - does that involve an immediate collect, just like collect(), or is it
 a lazy process?
 - how do I go from an iterator to an InputStream?

