[ https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243683#comment-14243683 ]

宿荣全 commented on SPARK-4817:
----------------------------

[~srowen]
I'm sorry I didn't describe the problem clearly.
Consider a scenario with multiple outputs:
*data from HDFS files -> map -> filter -> map (each row updates the MySQL DB) ->
filter -> map -> print (print 20 rows to the console)*
# {color:red}output to the MySQL DB{color}
# {color:red}output to the console{color}

With this patch ({{processAllAndPrintFirst}} is the newly defined function):
{code}
ssc.textFileStream("path").map(func1).filter(func2)
  .map { f => updateMysql(f); f }  // side effect: update MySQL, then pass the row through
  .filter(func3).map(func4)
  .processAllAndPrintFirst(20)
{code}
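
A possible single-job implementation of {{processAllAndPrintFirst}} (a minimal sketch under my own assumptions, not the actual patch) would drain every partition in one {{runJob}} call, so upstream side effects such as the MySQL update run for all elements while only the first 20 come back to the driver:
{code}
import scala.reflect.ClassTag
import org.apache.spark.streaming.dstream.DStream

// Hypothetical sketch, not the actual patch: consume every element of
// every partition in ONE job per batch, returning at most `num` elements
// per partition for the driver to print.
def processAllAndPrintFirst[T: ClassTag](stream: DStream[T], num: Int): Unit = {
  stream.foreachRDD { rdd =>
    val heads = rdd.sparkContext.runJob(rdd, (iter: Iterator[T]) => {
      val buf = scala.collection.mutable.ArrayBuffer.empty[T]
      while (iter.hasNext) {            // drain the whole partition, forcing
        val elem = iter.next()          // upstream side effects to run
        if (buf.size < num) buf += elem
      }
      buf.toArray
    })
    heads.flatten.take(num).foreach(println)
  }
}
{code}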

How would this scenario be implemented with {{foreachRDD}} and {{take}}, or with {{print(num)}}?
Either combination, [ {{rdd.foreach}} and {{rdd.take}} ] or [ {{rdd.foreach}} and
{{stream.print(100)}} ], launches two jobs in each streaming batch.
Compared with doing it in a single job, is that as efficient?
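
For comparison, the {{foreachRDD}} workaround in question would look roughly like this ({{updateMysql}} is the hypothetical DB routine from above); the two actions trigger two separate jobs per batch, and the second recomputes the lineage unless the RDD is cached:
{code}
// Assumes: stream is a DStream[String], updateMysql is the user's DB routine.
stream.foreachRDD { rdd =>
  rdd.foreach(row => updateMysql(row))  // job 1: process every element
  rdd.take(20).foreach(println)         // job 2: fetch the first 20 rows to print
}
{code}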

> [streaming] Print a specified number of elements and handle all of the elements
> in the RDD
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-4817
>                 URL: https://issues.apache.org/jira/browse/SPARK-4817
>             Project: Spark
>          Issue Type: New Feature
>          Components: Streaming
>            Reporter: 宿荣全
>            Priority: Minor
>
> The {{DStream.print}} function prints 10 elements but handles only 11 elements.
> A new function based on {{DStream.print}} is proposed:
> print a specified number of elements while processing all of the elements in the RDD.
> An example workload:
> val dstream = stream.map -> filter -> mapPartitions -> print
> The data remaining after the filter needs to update a database in mapPartitions,
> but not every record needs to be printed; only the first 20 are needed to inspect
> the data processing.
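
For reference, the {{DStream.print}} behavior described above comes down to it taking only num + 1 elements per batch, roughly like this (a paraphrase of the Spark 1.x behavior, not the exact source):
{code}
// Assumes: num = 10 (the default) and stream is the DStream being printed.
val num = 10
stream.foreachRDD { rdd =>
  val firstNum = rdd.take(num + 1)           // handles 11 elements when num = 10;
  firstNum.take(num).foreach(println)        // partitions beyond those elements
  if (firstNum.length > num) println("...")  // may never be computed at all
}
{code}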


