Hey, I have some additional Spark ML algorithms implemented in Scala that I would like to make available in PySpark. For reference, I am looking at the existing logistic regression implementation here:
https://spark.apache.org/docs/1.6.0/api/python/_modules/pyspark/ml/classification.html

I have a couple of questions:

1) The constructor of *class LogisticRegression*, as far as I understand, just accepts the arguments, constructs the underlying Scala object via /py4j/, and passes the arguments on to it. This is done via the line *self._java_obj = self._new_java_obj("org.apache.spark.ml.classification.LogisticRegression", self.uid)*. Is this correct? What does the line *super(LogisticRegression, self).__init__()* do? Does this mean that any Python data structures used with it will be converted to Java structures once the object is instantiated?

2) The corresponding model, *class LogisticRegressionModel(JavaModel)*, again just wraps the Java object and nothing else? Is it enough for me to just forward the arguments and instantiate the Scala objects? Does this mean that when a pipeline is created, even though the pipeline is Python, it expects objects that are backed by underlying Scala code instantiated via /py4j/? Can one use pure Python elements inside the pipeline (dealing with RDDs)? What would be the performance implications?

Cheers,
Nick
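P.S. To make question 1 a bit more concrete, below is roughly the pattern I think I would need to follow, based on my reading of the 1.6 wrapper classes. The class org.example.ml.MyScalaAlgorithm is made up and stands in for one of my Scala estimators, and I have left out all Params for brevity, so please treat it as a sketch rather than working code:

from pyspark.ml.wrapper import JavaEstimator, JavaModel


class MyScalaAlgorithm(JavaEstimator):
    """Python-side wrapper for a (hypothetical) Scala estimator."""

    def __init__(self):
        # Initialises the Python-side Params/uid machinery; no data is converted here.
        super(MyScalaAlgorithm, self).__init__()
        # Creates the Scala estimator instance on the JVM through the py4j gateway.
        self._java_obj = self._new_java_obj("org.example.ml.MyScalaAlgorithm", self.uid)

    def _create_model(self, java_model):
        # Called by JavaEstimator after fit() has run on the JVM; we only wrap
        # the Scala model object that comes back.
        return MyScalaAlgorithmModel(java_model)


class MyScalaAlgorithmModel(JavaModel):
    """Thin wrapper around the fitted Scala model; transform() is delegated to it."""
    pass

If I read wrapper.py correctly, fit() then just calls the Scala estimator's fit() through the gateway on the DataFrame's underlying Java object and hands the returned Java model to _create_model().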
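P.P.S. For the last part of question 2, this is the kind of pure Python stage I had in mind: an illustrative Transformer that is backed by a Python UDF instead of a Scala object (the class and column names are made up):

from pyspark.ml import Transformer
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType


class DoubleColumn(Transformer):
    """Pure Python stage: adds a column holding twice the value of another column."""

    def __init__(self, inputCol="value", outputCol="doubled"):
        super(DoubleColumn, self).__init__()
        self.inputCol = inputCol
        self.outputCol = outputCol

    def _transform(self, dataset):
        # Runs as a Python UDF, so rows are shipped to the Python workers and back.
        double_it = udf(lambda v: None if v is None else 2.0 * float(v), DoubleType())
        return dataset.withColumn(self.outputCol, double_it(dataset[self.inputCol]))

My understanding is that such a stage can sit in a Pipeline next to the Java-backed ones, but since the UDF moves rows out to the Python workers, I would expect it to be noticeably slower than a stage that stays on the JVM. Please correct me if I have that wrong.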