Re: MappedStream vs Transform API

2015-03-17 Thread madhu phatak
Hi,
 Thank you for the response.

 Can I submit a PR to use transform for all the functions like map, flatMap,
etc., so they are consistent with the other APIs?
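
For illustration, a minimal sketch of what that change could look like for
flatMap, assuming the existing transform API inside DStream[T] (a
hypothetical rewrite, not the actual Spark code):

  def flatMap[U: ClassTag](flatMapFunc: T => TraversableOnce[U]): DStream[U] =
    this.transform((rdd: RDD[T]) => rdd.flatMap(flatMapFunc))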

Regards,
Madhukara Phatak
http://datamantra.io/

Re: MappedStream vs Transform API

2015-03-17 Thread madhu phatak
Hi,

On Tue, Mar 17, 2015 at 2:31 PM, Tathagata Das t...@databricks.com wrote:

 That's not super essential, and hence hasn't been done till now. Even in
 core Spark there are MappedRDD, etc., even though all of them could be
 implemented by MapPartitionsRDD. So it's nice to maintain the consistency:
 MappedDStream creates MappedRDDs. :)
 Though this does not eliminate the possibility that we will do it. Maybe
 in the future, if we find that maintaining these different DStreams is
 becoming a maintenance burden (it isn't yet), we may collapse them to use
 transform. We did so in the Python API for exactly this reason.


Ok. When I was going through the source code, it was confusing to figure
out what the right extension points were, so I thought whoever goes through
the code might run into the same situation. But if it's not super essential,
then OK.


 If you are interested in contributing to Spark Streaming, I can point you
 to a number of issues where your contributions will be more valuable.


Yes please.


Regards,
Madhukara Phatak
http://datamantra.io/

Re: MappedStream vs Transform API

2015-03-17 Thread Tathagata Das
That's not super essential, and hence hasn't been done till now. Even in
core Spark there are MappedRDD, etc., even though all of them could be
implemented by MapPartitionsRDD. So it's nice to maintain the consistency:
MappedDStream creates MappedRDDs. :)
Though this does not eliminate the possibility that we will do it. Maybe in
the future, if we find that maintaining these different DStreams is becoming
a maintenance burden (it isn't yet), we may collapse them to use transform.
We did so in the Python API for exactly this reason.
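
To make that concrete, map can indeed be expressed through mapPartitions; a
minimal sketch (a hypothetical standalone helper, not the actual RDD method):

  import scala.reflect.ClassTag
  import org.apache.spark.rdd.RDD

  // Apply f lazily to every element within each partition.
  def mapViaPartitions[T, U: ClassTag](rdd: RDD[T], f: T => U): RDD[U] =
    rdd.mapPartitions(iter => iter.map(f))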

If you are interested in contributing to Spark Streaming, I can point you
to a number of issues where your contributions will be more valuable.

TD


Re: MappedStream vs Transform API

2015-03-17 Thread madhu phatak
Hi,
 Sorry for the wrong formatting in the earlier mail.

On Tue, Mar 17, 2015 at 2:31 PM, Tathagata Das t...@databricks.com wrote:

 That's not super essential, and hence hasn't been done till now. Even in
 core Spark there are MappedRDD, etc., even though all of them could be
 implemented by MapPartitionsRDD. So it's nice to maintain the consistency:
 MappedDStream creates MappedRDDs. :)
 Though this does not eliminate the possibility that we will do it. Maybe
 in the future, if we find that maintaining these different DStreams is
 becoming a maintenance burden (it isn't yet), we may collapse them to use
 transform. We did so in the Python API for exactly this reason.


Ok. When I was going through the source code, it was confusing to figure
out what the right extension points were, so I thought whoever goes through
the code might run into the same situation. But if it's not super essential,
then OK.



 If you are interested in contributing to Spark Streaming, I can point you
 to a number of issues where your contributions will be more valuable.


That will be great.



Regards,
Madhukara Phatak
http://datamantra.io/


Re: MappedStream vs Transform API

2015-03-16 Thread madhu phatak
Hi,
 Thanks for the response. I understand that part. But I am asking why the
internal implementation uses a subclass when it could use an existing API.
Unless there is a real difference, it feels like a code smell to me.


Regards,
Madhukara Phatak
http://datamantra.io/


Re: MappedStream vs Transform API

2015-03-16 Thread Tathagata Das
It's mostly for legacy reasons. First we added all the MappedDStream,
etc., and then later we realized we needed to expose something more generic
for arbitrary RDD-to-RDD transformations. It can be easily replaced.
However, there is slight value in having MappedDStream: it helps developers
learn about DStreams.
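
For reference, MappedDStream itself is small; a sketch along the lines of
the Spark 1.x source (details may vary between versions, and members such as
ssc and getOrCompute are private[streaming]):

  import scala.reflect.ClassTag
  import org.apache.spark.rdd.RDD
  import org.apache.spark.streaming.{Duration, Time}
  import org.apache.spark.streaming.dstream.DStream

  class MappedDStream[T: ClassTag, U: ClassTag](
      parent: DStream[T],
      mapFunc: T => U
    ) extends DStream[U](parent.ssc) {

    // This stream depends only on its parent stream.
    override def dependencies: List[DStream[_]] = List(parent)

    // Batches slide at the same rate as the parent's.
    override def slideDuration: Duration = parent.slideDuration

    // For each batch time, map the parent's RDD for that time, if one exists.
    override def compute(validTime: Time): Option[RDD[U]] = {
      parent.getOrCompute(validTime).map(_.map[U](mapFunc))
    }
  }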

TD


RE: MappedStream vs Transform API

2015-03-16 Thread Shao, Saisai
I think these two ways are both OK for writing a streaming job. `transform`
is a more general way to go from one DStream to another when there is no
corresponding DStream API (but there is a corresponding RDD API). But using
map may be more straightforward and easier to understand.
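
For example, there is no sortByKey on DStream, but the RDD API has one, so
transform bridges the gap (a hypothetical snippet, assuming a DStream of
key-value pairs):

  import org.apache.spark.rdd.RDD
  import org.apache.spark.streaming.dstream.DStream

  // Sort every batch by key; no direct DStream equivalent exists.
  def sortedBatches(stream: DStream[(String, Int)]): DStream[(String, Int)] =
    stream.transform((rdd: RDD[(String, Int)]) => rdd.sortByKey())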

Thanks
Jerry

From: madhu phatak [mailto:phatak@gmail.com]
Sent: Monday, March 16, 2015 4:32 PM
To: user@spark.apache.org
Subject: MappedStream vs Transform API

Hi,
  The current implementation of the map function in Spark Streaming looks
like this:

  def map[U: ClassTag](mapFunc: T => U): DStream[U] = {
    new MappedDStream(this, context.sparkContext.clean(mapFunc))
  }
It creates an instance of MappedDStream, which is a subclass of DStream.

The same function can also be implemented using the transform API:


  def map[U: ClassTag](mapFunc: T => U): DStream[U] =
    this.transform(rdd => {
      rdd.map(mapFunc)
    })

Both implementations look the same. If they are the same, is there any
advantage to having a subclass of DStream? Why can't we just use the
transform API?
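
At the call site the two forms behave the same; a minimal usage sketch (a
hypothetical local job with a socket source):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setMaster("local[2]").setAppName("MapVsTransform")
  val ssc = new StreamingContext(conf, Seconds(1))
  val lines = ssc.socketTextStream("localhost", 9999)

  val viaMap = lines.map(_.length)                             // MappedDStream under the hood
  val viaTransform = lines.transform(rdd => rdd.map(_.length)) // transform-based equivalent

  viaMap.print()
  viaTransform.print()
  ssc.start()
  ssc.awaitTermination()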


Regards,
Madhukara Phatak
http://datamantra.io/