[ 
https://issues.apache.org/jira/browse/SPARK-24106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-24106:
-----------------------------------
    Fix Version/s:     (was: 2.4.0)
                       (was: 2.3.0)

> Spark Structure Streaming with RF model taking long time in processing 
> probability for each mini batch
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24106
>                 URL: https://issues.apache.org/jira/browse/SPARK-24106
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.2.0, 2.2.1, 2.3.0
>         Environment: Spark yarn / Standalone cluster
> 2 master nodes - 32 cores - 124 GB
> 9 worker nodes - 32 cores - 124 GB
> Kafka input and output topic with 6 partition
>            Reporter: Tamilselvan Veeramani
>            Priority: Major
>              Labels: performance
>
> RandomForestClassificationModel broadcasted to executors for every mini batch 
> in spark streaming while try to find probability
> RF model size 45MB
> spark kafka streaming job jar size 8 MB (including kafka dependency’s)
> following log show model broad cast to executors for every mini batch when we 
> call rf_model.transform(dataset).select("probability").
> due to which task deserialization time also increases comes to 6 to 7 second 
> for 45MB of rf model, although processing time is just 400 to 600 ms for mini 
> batch
> 18/04/15 03:21:23 INFO KafkaSource: GetBatch generating RDD of offset range: 
> KafkaSourceRDDOffsetRange(Kafka_input_topic-0,242,242,Some(executor_xx.xxx.xx.110_2)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-1,239,239,Some(executor_xx.xxx.xx.107_0)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-2,241,241,Some(executor_xx.xxx.xx.102_3)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-3,238,239,Some(executor_xx.xxx.xx.138_4)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-4,240,240,Some(executor_xx.xxx.xx.137_1)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-5,242,242,Some(executor_xx.xxx.xx.111_5))
>  18/04/15 03:21:24 INFO SparkContext: Starting job: start at App.java:106
> 18/04/15 03:21:31 INFO BlockManagerInfo: Added broadcast_92_piece0 in memory 
> on xx.xxx.xx.137:44682 (size: 62.6 MB, free: 1942.0 MB)
> After 2 to 3 weeks of struggle, I found a potentially solution which will 
> help many people who is looking to use RF model for “probability” in real 
> time streaming context
> Since RandomForestClassificationModel class of transformImpl method 
> implements only “prediction” in current version of spark. Which can be 
> leveraged to implement “probability” also in RandomForestClassificationModel 
> class of transformImpl method.
> I have modified the code and implemented in our server and it’s working as 
> fast as 400ms to 500ms for every mini batch
> I see many people our there facing this issue and no solution provided in any 
> of the forums, Can you please review and put this fix in next release ? thanks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to