Tamilselvan Veeramani created SPARK-24106: ---------------------------------------------
Summary: Spark Structure Streaming with RF model taking long time in processing probability for each mini batch Key: SPARK-24106 URL: https://issues.apache.org/jira/browse/SPARK-24106 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 2.3.0, 2.2.1, 2.2.0 Environment: Spark yarn / Standalone cluster 2 master nodes - 32 cores - 124 GB 9 worker nodes - 32 cores - 124 GB Kafka input and output topic with 6 partition Reporter: Tamilselvan Veeramani Fix For: 2.4.0, 2.3.0 RandomForestClassificationModel broadcasted to executors for every mini batch in spark streaming while try to find probability RF model size 45MB spark kafka streaming job jar size 8 MB (including kafka dependency’s) following log show model broad cast to executors for every mini batch when we call rf_model.transform(dataset).select("probability"). due to which task deserialization time also increases comes to 6 to 7 second for 45MB of rf model, although processing time is just 400 to 600 ms for mini batch 18/04/15 03:21:23 INFO KafkaSource: GetBatch generating RDD of offset range: KafkaSourceRDDOffsetRange(Kafka_input_topic-0,242,242,Some(executor_xx.xxx.xx.110_2)), KafkaSourceRDDOffsetRange(Kafka_input_topic-1,239,239,Some(executor_xx.xxx.xx.107_0)), KafkaSourceRDDOffsetRange(Kafka_input_topic-2,241,241,Some(executor_xx.xxx.xx.102_3)), KafkaSourceRDDOffsetRange(Kafka_input_topic-3,238,239,Some(executor_xx.xxx.xx.138_4)), KafkaSourceRDDOffsetRange(Kafka_input_topic-4,240,240,Some(executor_xx.xxx.xx.137_1)), KafkaSourceRDDOffsetRange(Kafka_input_topic-5,242,242,Some(executor_xx.xxx.xx.111_5)) 18/04/15 03:21:24 INFO SparkContext: Starting job: start at App.java:106 18/04/15 03:21:31 INFO BlockManagerInfo: Added broadcast_92_piece0 in memory on xx.xxx.xx.137:44682 (size: 62.6 MB, free: 1942.0 MB) After 2 to 3 weeks of struggle, I found a potentially solution which will help many people who is looking to use RF model for “probability” in real time streaming context Since RandomForestClassificationModel class of transformImpl method implements only “prediction” in current version of spark. Which can be leveraged to implement “probability” also in RandomForestClassificationModel class of transformImpl method. I have modified the code and implemented in our server and it’s working as fast as 400ms to 500ms for every mini batch I see many people our there facing this issue and no solution provided in any of the forums, Can you please review and put this fix in next release ? thanks -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org