[ https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257728#comment-14257728 ]
Tathagata Das edited comment on SPARK-4817 at 12/24/14 12:36 AM:
-----------------------------------------------------------------

I agree with [~srowen]'s point.

1. Updating a database within a map operation is inherently not a good idea. The map-reduce model is based on the assumption that the map and reduce functions have no side effects and are idempotent; updating a database from a map operation (i) violates that property and (ii) is not good programming style with RDDs.

2. Sean Owen's suggestion of using {{rdd.foreach}} and {{rdd.print}} is a good one in this case. After {{rdd.foreach}} has executed, {{rdd.print}} (which usually does not launch a job) should usually be cheap.

Hence I am not convinced that this PR is needed.

> [streaming]Print the specified number of data and handle all of the elements
> in RDD
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-4817
>                 URL: https://issues.apache.org/jira/browse/SPARK-4817
>             Project: Spark
>          Issue Type: New Feature
>          Components: Streaming
>            Reporter: 宿荣全
>            Priority: Minor
>
> The DStream.print function prints 10 elements but processes 11 elements.
> A new function based on DStream.print is proposed:
> print the specified number of elements but process all of the elements in the RDD.
> There is a typical use case:
> val dstream = stream.map -> filter -> mapPartitions -> print
> The data remaining after the filter needs to update a database in
> mapPartitions, but not every record needs to be printed; only the top 20
> need to be printed to inspect the data processing.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
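The pattern recommended in the comment above (keep map/filter pure, do side effects in a separate foreach-style pass, then print only the first few elements) can be sketched without Spark. Below is a minimal, hypothetical plain-Python analogy: lists stand in for RDDs, and `db.append` stands in for the real database update; none of these names come from the Spark API.

```python
# Minimal sketch (plain Python, no Spark) of the rdd.foreach + rdd.print
# pattern discussed above. Names here are illustrative stand-ins.

def transform(records):
    # Pure map/filter stage: no side effects, safe to re-run (idempotent).
    return [r * 2 for r in records if r % 2 == 0]

def foreach(records, action):
    # Side effects (e.g. database updates) belong here, not inside map.
    for r in records:
        action(r)

def print_first(records, n=10):
    # Cheap "print" step: only the first n elements, like DStream.print.
    for r in records[:n]:
        print(r)

db = []                      # stand-in for a real database
data = list(range(100))
out = transform(data)        # pure transformation over all elements
foreach(out, db.append)      # side-effecting pass over all elements
print_first(out, n=20)       # view only the top 20 for inspection
```

This mirrors the suggestion in the thread: every element reaches the database via the foreach pass, while the print step stays limited and cheap, instead of mixing the database update into the map/mapPartitions stage.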