[ https://issues.apache.org/jira/browse/SPARK-24106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tamilselvan Veeramani updated SPARK-24106: ------------------------------------------ Target Version/s: 2.3.0, 2.2.1 (was: 2.2.1, 2.3.0) Component/s: (was: MLlib) ML > Spark Structure Streaming with RF model taking long time in processing > probability for each mini batch > ------------------------------------------------------------------------------------------------------ > > Key: SPARK-24106 > URL: https://issues.apache.org/jira/browse/SPARK-24106 > Project: Spark > Issue Type: Bug > Components: ML > Affects Versions: 2.2.0, 2.2.1, 2.3.0 > Environment: Spark yarn / Standalone cluster > 2 master nodes - 32 cores - 124 GB > 9 worker nodes - 32 cores - 124 GB > Kafka input and output topic with 6 partition > Reporter: Tamilselvan Veeramani > Priority: Major > Labels: performance > Fix For: 2.3.0, 2.4.0 > > > RandomForestClassificationModel broadcasted to executors for every mini batch > in spark streaming while try to find probability > RF model size 45MB > spark kafka streaming job jar size 8 MB (including kafka dependency’s) > following log show model broad cast to executors for every mini batch when we > call rf_model.transform(dataset).select("probability"). > due to which task deserialization time also increases comes to 6 to 7 second > for 45MB of rf model, although processing time is just 400 to 600 ms for mini > batch > 18/04/15 03:21:23 INFO KafkaSource: GetBatch generating RDD of offset range: > KafkaSourceRDDOffsetRange(Kafka_input_topic-0,242,242,Some(executor_xx.xxx.xx.110_2)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-1,239,239,Some(executor_xx.xxx.xx.107_0)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-2,241,241,Some(executor_xx.xxx.xx.102_3)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-3,238,239,Some(executor_xx.xxx.xx.138_4)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-4,240,240,Some(executor_xx.xxx.xx.137_1)), > > KafkaSourceRDDOffsetRange(Kafka_input_topic-5,242,242,Some(executor_xx.xxx.xx.111_5)) > 18/04/15 03:21:24 INFO SparkContext: Starting job: start at App.java:106 > 18/04/15 03:21:31 INFO BlockManagerInfo: Added broadcast_92_piece0 in memory > on xx.xxx.xx.137:44682 (size: 62.6 MB, free: 1942.0 MB) > After 2 to 3 weeks of struggle, I found a potentially solution which will > help many people who is looking to use RF model for “probability” in real > time streaming context > Since RandomForestClassificationModel class of transformImpl method > implements only “prediction” in current version of spark. Which can be > leveraged to implement “probability” also in RandomForestClassificationModel > class of transformImpl method. > I have modified the code and implemented in our server and it’s working as > fast as 400ms to 500ms for every mini batch > I see many people our there facing this issue and no solution provided in any > of the forums, Can you please review and put this fix in next release ? thanks -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org