I am implementing a lambda architecture system for stream processing. I have no issue creating a Pipeline with GridSearch in Spark batch:
pipeline = Pipeline(stages=[data1_indexer, data2_indexer, ..., assembler, logistic_regressor]) paramGrid = ( ParamGridBuilder() .addGrid(logistic_regressor.regParam, (0.01, 0.1)) .addGrid(logistic_regressor.tol, (1e-5, 1e-6)) ...etcetera ).build() cv = CrossValidator( estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=BinaryClassificationEvaluator(), numFolds=4) pipeline_cv = cv.fit(raw_train_df) model_fitted = pipeline_cv.getEstimator().fit(raw_validation_df) model_fitted.write().overwrite().save("pipeline") However, I cant seem to find how to plug the pipeline in the Spark Streaming Process. I am using kafka as the DStream source and my code as of now is as follows: import json from pyspark.ml import PipelineModel from pyspark.streaming.kafka import KafkaUtils from pyspark.streaming import StreamingContext ssc = StreamingContext(sc, 1) kafkaStream = KafkaUtils.createStream( ssc, "localhost:2181", "spark- streaming-consumer", {"kafka_topic": 1} ) model = PipelineModel.load('pipeline/') parsed_stream = kafkaStream.map(lambda x: json.loads(x[1])) CODE MISSING GOES HERE ssc.start() ssc.awaitTermination() and now I need to find some way of doing the actual prediction on the StreamingContext. Based on the documentation found in the gitbooks Twitter Streaming Example (https://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/predict.html) ) it seems like the model needs to implement the method predict in order to be able to use it on an rdd object (and hopefully on a kafkastream?) How could I use the pipeline on the Streaming context? The reloaded PipelineModel only seems to implement transform Does that mean the only way to use batch models in a Streaming context is to use pure models , and no pipelines? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-How-to-load-a-Pipeline-on-a-Stream-tp27828.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe e-mail: user-unsubscr...@spark.apache.org