Hi, I'm implementing a Spark Streaming + ML application. The data arrives on a Kafka topic in JSON format, and the Spark Kafka connector reads it as a DStream. After several preprocessing steps, the input DStream is transformed into a feature DStream, which is fed into a Spark ML pipeline model. The code sample below shows how the feature DStream interacts with the pipeline model.
prediction_stream = feature_stream.transform(lambda rdd: predict_rdd(rdd, pipeline_model))

def predict_rdd(rdd, pipeline_model):
    if (rdd is not None) and (not rdd.isEmpty()):
        try:
            df = rdd.toDF()
            predictions = pipeline_model.transform(df)
            return predictions.rdd
        except Exception as e:
            print("Unable to make predictions")
            return None
    else:
        return None

Here is the problem: if pipeline_model.transform(df) fails because of data issues in some rows of df, the try...except block cannot catch the exception, since the transformation is lazy and the exception is actually thrown on the executors when the batch job runs. As a result, the exception bubbles up to Spark and the streaming application is terminated. I want to catch the exception in some way so that the streaming application is not terminated and keeps processing incoming data. Is that possible?

I know one solution could be doing more thorough data validation in the preprocessing step. However, some error handling should still be in place around the pipeline model's transform method, just in case anything unexpected happens.

Thanks,
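P.S. One workaround I've been considering (not sure whether it is sound) is to force evaluation inside the try block, so that a failing batch raises its exception on the driver where it can be caught, and to return an empty RDD instead of None so downstream operations still receive a valid RDD. A rough sketch, assuming a SparkSession is available on the driver; the persist/count approach is just my idea, not something I've validated:

from pyspark.sql import SparkSession

def predict_rdd(rdd, pipeline_model):
    spark = SparkSession.builder.getOrCreate()
    empty = spark.sparkContext.emptyRDD()
    if rdd is None or rdd.isEmpty():
        return empty
    try:
        df = rdd.toDF()
        predictions = pipeline_model.transform(df)
        # Cache and count here to force the batch to run inside this try block,
        # so an executor-side failure (after task retries) surfaces on the driver.
        predictions.persist()
        predictions.count()
        return predictions.rdd
    except Exception as e:
        # Skip this batch instead of letting the exception kill the application.
        print("Unable to make predictions for this batch: %s" % e)
        return empty

Would something along these lines be reasonable, or is there a more idiomatic way to isolate a bad batch (or bad rows) without terminating the whole application?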