Tamilselvan Veeramani created SPARK-24106:
---------------------------------------------

             Summary: Spark Structure Streaming with RF model taking long time 
in processing probability for each mini batch
                 Key: SPARK-24106
                 URL: https://issues.apache.org/jira/browse/SPARK-24106
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 2.3.0, 2.2.1, 2.2.0
         Environment: Spark yarn / Standalone cluster
2 master nodes - 32 cores - 124 GB
9 worker nodes - 32 cores - 124 GB
Kafka input and output topics with 6 partitions
            Reporter: Tamilselvan Veeramani
             Fix For: 2.4.0, 2.3.0


RandomForestClassificationModel is broadcast to the executors for every 
mini-batch in Spark streaming when computing the probability.

RF model size: 45 MB
Spark Kafka streaming job jar size: 8 MB (including Kafka dependencies)

The following log shows the model being broadcast to the executors for every 
mini-batch when we call rf_model.transform(dataset).select("probability"). 
As a result, task deserialization time rises to 6 to 7 seconds for the 45 MB 
RF model, although the processing time is only 400 to 600 ms per mini-batch.

18/04/15 03:21:23 INFO KafkaSource: GetBatch generating RDD of offset range:
  KafkaSourceRDDOffsetRange(Kafka_input_topic-0,242,242,Some(executor_xx.xxx.xx.110_2)),
  KafkaSourceRDDOffsetRange(Kafka_input_topic-1,239,239,Some(executor_xx.xxx.xx.107_0)),
  KafkaSourceRDDOffsetRange(Kafka_input_topic-2,241,241,Some(executor_xx.xxx.xx.102_3)),
  KafkaSourceRDDOffsetRange(Kafka_input_topic-3,238,239,Some(executor_xx.xxx.xx.138_4)),
  KafkaSourceRDDOffsetRange(Kafka_input_topic-4,240,240,Some(executor_xx.xxx.xx.137_1)),
  KafkaSourceRDDOffsetRange(Kafka_input_topic-5,242,242,Some(executor_xx.xxx.xx.111_5))
18/04/15 03:21:24 INFO SparkContext: Starting job: start at App.java:106
18/04/15 03:21:31 INFO BlockManagerInfo: Added broadcast_92_piece0 in memory on xx.xxx.xx.137:44682 (size: 62.6 MB, free: 1942.0 MB)
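
The delay is visible in the log timestamps themselves. As a small check, the 
snippet below computes the gap between the "Starting job" line and the "Added 
broadcast_92_piece0" line. The two timestamps are copied from the log above; 
the variable names are only illustrative.

```python
from datetime import datetime

# Timestamp layout used in the log excerpt: yy/MM/dd HH:mm:ss
FMT = "%y/%m/%d %H:%M:%S"

job_started = datetime.strptime("18/04/15 03:21:24", FMT)
broadcast_added = datetime.strptime("18/04/15 03:21:31", FMT)

# Seconds between the job start and the broadcast block being stored
gap_seconds = (broadcast_added - job_started).total_seconds()
print(gap_seconds)  # 7.0
```

A gap of 7 seconds matches the 6 to 7 seconds of task deserialization time 
reported above.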

After 2 to 3 weeks of struggle, I found a potential solution that should help 
the many people who are looking to use an RF model for “probability” in a 
real-time streaming context.
In the current version of Spark, the transformImpl method of the 
RandomForestClassificationModel class implements only “prediction”. The same 
method can be extended to implement “probability” as well.
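
To illustrate why “probability” and “prediction” can come out of the same 
pass over the model, here is a minimal, Spark-free sketch. It assumes the 
forest sums the normalized leaf class counts of each tree and then normalizes 
that vector, which is my understanding of how the probability is derived. It 
is not the Spark source or the patch itself, and the names stump, 
raw_prediction and transform_row are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

# A toy "tree": maps a feature vector to the class counts of the leaf it hits.
Tree = Callable[[Sequence[float]], List[float]]


def stump(feature: int, threshold: float,
          left_counts: List[float], right_counts: List[float]) -> Tree:
    """Build a one-split tree (hypothetical helper for this sketch)."""
    def leaf_counts(x: Sequence[float]) -> List[float]:
        return left_counts if x[feature] <= threshold else right_counts
    return leaf_counts


def raw_prediction(trees: List[Tree], x: Sequence[float],
                   num_classes: int) -> List[float]:
    """Sum each tree's normalized leaf class distribution."""
    votes = [0.0] * num_classes
    for tree in trees:
        counts = tree(x)
        total = sum(counts)
        for c in range(num_classes):
            votes[c] += counts[c] / total
    return votes


def transform_row(trees: List[Tree], x: Sequence[float],
                  num_classes: int) -> Dict[str, object]:
    """Derive probability and prediction from one raw vector."""
    raw = raw_prediction(trees, x, num_classes)
    total = sum(raw)
    probability = [v / total for v in raw]
    prediction = max(range(num_classes), key=lambda c: raw[c])
    return {"probability": probability, "prediction": prediction}


trees = [stump(0, 0.5, [8.0, 2.0], [1.0, 9.0]),
         stump(1, 1.0, [6.0, 4.0], [3.0, 7.0])]
row = transform_row(trees, [0.2, 2.0], num_classes=2)
print(row)
```

For the sample row the two trees contribute (0.8, 0.2) and (0.3, 0.7), so the 
probability is roughly (0.55, 0.45) and the prediction is class 0. Both values 
come from a single evaluation of the trees.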

I have modified the code and deployed it on our server, and every mini-batch 
is now processed in 400 ms to 500 ms.
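
One way to read this result: once the model no longer has to be shipped and 
deserialized for each mini-batch, only the processing time remains. The toy 
sketch below is not the modified Spark code. It uses a hypothetical Executor 
class and a JSON payload as a stand-in for the serialized model, and simply 
counts how often the payload is deserialized under each strategy.

```python
import json


class Executor:
    """Hypothetical executor that counts model deserializations."""

    def __init__(self):
        self.deserializations = 0
        self._cached_weights = None

    def score_reloading(self, payload, batch):
        # The model payload is deserialized again for every mini-batch.
        self.deserializations += 1
        weights = json.loads(payload)
        return [sum(w * x for w in weights) for x in batch]

    def score_cached(self, payload, batch):
        # Deserialize on first use only, then reuse the in-memory copy.
        if self._cached_weights is None:
            self.deserializations += 1
            self._cached_weights = json.loads(payload)
        return [sum(w * x for w in self._cached_weights) for x in batch]


payload = json.dumps([0.5, 1.5, 2.0])       # stand-in for the serialized model
batches = [[1.0, 2.0], [3.0], [4.0, 5.0]]   # three mini-batches

reloading = Executor()
cached = Executor()
out_reloading = [reloading.score_reloading(payload, b) for b in batches]
out_cached = [cached.score_cached(payload, b) for b in batches]

print(reloading.deserializations, cached.deserializations)  # 3 1
```

Both strategies return the same scores. Only the number of deserializations 
differs, and with a 45 MB model that cost is what dominates each mini-batch.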

I see many people out there facing this issue, with no solution provided in 
any of the forums. Can you please review this fix and include it in the next 
release? Thanks.



