[ https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257728#comment-14257728 ]
Tathagata Das edited comment on SPARK-4817 at 12/24/14 12:36 AM:
-----------------------------------------------------------------

I agree with [~srowen]'s point.

1. Updating a database within a map operation is inherently not a good idea. The map-reduce model is based on the assumption that the map and reduce functions have no side effects and are idempotent; updating a database from a map operation (i) violates that property and (ii) is not good programming style with RDDs.

2. Sean Owen's suggestion of using {{rdd.foreach}} and {{rdd.print}} is a good one in this case. After {{rdd.foreach}} has executed, {{rdd.print}} (which usually does not launch a job) should usually be cheap.

Hence I am not convinced that this PR is needed.

> [streaming]Print the specified number of data and handle all of the elements
> in RDD
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-4817
>                 URL: https://issues.apache.org/jira/browse/SPARK-4817
>             Project: Spark
>          Issue Type: New Feature
>          Components: Streaming
>            Reporter: 宿荣全
>            Priority: Minor
>
> The DStream.print function prints 10 elements but processes 11 elements.
> A new function based on DStream.print is proposed:
> print the specified number of elements but process all of the elements in the RDD.
> There is a typical use case:
> val dstream = stream.map -> filter -> mapPartitions -> print
> The data remaining after the filter needs to update a database in
> mapPartitions, but not every record needs to be printed; only the top 20
> need to be printed to inspect the data processing.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
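The pattern recommended in the comment above (keep map/filter pure, do side effects in a separate foreach-style pass, then print only the first few elements) can be sketched without Spark. Below is a minimal, hypothetical plain-Python analogy: lists stand in for RDDs, and `db.append` stands in for the real database update; none of these names come from the Spark API.

```python
# Minimal sketch (plain Python, no Spark) of the rdd.foreach + rdd.print
# pattern discussed above. Names here are illustrative stand-ins.

def transform(records):
    # Pure map/filter stage: no side effects, safe to re-run (idempotent).
    return [r * 2 for r in records if r % 2 == 0]

def foreach(records, action):
    # Side effects (e.g. database updates) belong here, not inside map.
    for r in records:
        action(r)

def print_first(records, n=10):
    # Cheap "print" step: only the first n elements, like DStream.print.
    for r in records[:n]:
        print(r)

db = []                      # stand-in for a real database
data = list(range(100))
out = transform(data)        # pure transformation over all elements
foreach(out, db.append)      # side-effecting pass over all elements
print_first(out, n=20)       # view only the top 20 for inspection
```

This mirrors the suggestion in the thread: every element reaches the database via the foreach pass, while the print step stays limited and cheap, instead of mixing the database update into the map/mapPartitions stage.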