Re: [SPARK-48423] Unable to save ML Pipeline to azure blob storage

2024-06-19 Thread Chhavi Bansal
Hello Team,
I am pinging back on this thread to get a pair of eyes on this issue.
Ticket:  https://issues.apache.org/jira/browse/SPARK-48423

On Thu, 6 Jun 2024 at 00:19, Chhavi Bansal  wrote:

> Hello team,
> I was exploring on how to save ML pipeline to azure blob storage, but was
> setback by an issue where it complains of  `fs.azure.account.key`  not
> being found in the configuration even when I have provided the values in
> the pipelineModel.option(key1,value1) field. I considered raising a
> ticket on spark https://issues.apache.org/jira/browse/SPARK-48423, where
> I describe the entire scenario. I tried debugging the code and found that
> this key is being explicitly asked for in the code. The only solution was
> to again set it part of spark.conf which could result to a race condition
> since we work on multi-tenant architecture.
>
>
>
> Since saving to Azure blob storage would be common, Can someone please
> guide me if I am missing something in the `.option` clause?
>
>
>
> I would be happy to make a contribution to the code if someone can shed
> some light on how this could be solved.
>
> --
> Thanks and Regards,
> Chhavi Bansal
>


-- 
Thanks and Regards,
Chhavi Bansal


[SPARK-48423] Unable to save ML Pipeline to azure blob storage

2024-06-05 Thread Chhavi Bansal
Hello team,
I was exploring how to save an ML pipeline to Azure Blob Storage, but was
set back by an issue where it complains of `fs.azure.account.key` not
being found in the configuration, even when I have provided the values via
the pipelineModel.option(key1, value1) field. I raised a ticket on Spark,
https://issues.apache.org/jira/browse/SPARK-48423, where I
describe the entire scenario. I tried debugging the code and found that
this key is explicitly looked up in the code. The only solution was
to set it again as part of spark.conf, which could result in a race condition
since we work on a multi-tenant architecture.



Since saving to Azure Blob Storage should be a common use case, can someone
please guide me if I am missing something in the `.option` clause?



I would be happy to make a contribution to the code if someone can shed
some light on how this could be solved.
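
For reference, a minimal sketch of the two paths described above, assuming the
wasbs:// (hadoop-azure) connector; the storage account, container, and key are
placeholders:

import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val model: PipelineModel = ???  // a previously fitted pipeline model

// Session-wide workaround mentioned above: the key goes on the Hadoop configuration
// shared by the whole SparkContext, which is what makes it racy across tenants.
spark.sparkContext.hadoopConfiguration.set(
  "fs.azure.account.key.<storage-account>.blob.core.windows.net",
  "<account-key>")

model.write
  .overwrite()
  // The per-writer option that SPARK-48423 reports as not being picked up:
  // .option("fs.azure.account.key.<storage-account>.blob.core.windows.net", "<account-key>")
  .save("wasbs://<container>@<storage-account>.blob.core.windows.net/models/my-pipeline")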

-- 
Thanks and Regards,
Chhavi Bansal


Re: Building a ML pipeline with no training

2022-07-20 Thread Sean Owen
The data transformation is all the same.
Sure, linear regression is easy:
https://spark.apache.org/docs/latest/ml-classification-regression.html#linear-regression
These are components that operate on DataFrames.

You'll want to look at VectorAssembler to assemble your input columns into a single vector column.
There are other transformations you may want like normalization, also in
the Spark ML components.
You can put those steps together into a Pipeline to fit and transform with
it as one unit.
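
A minimal sketch of those pieces wired together (the column names and the input
dataframe are placeholders):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.ml.regression.LinearRegression

// df is assumed to have numeric feature columns "x1", "x2" and a numeric "label" column.
val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("rawFeatures")

val scaler = new StandardScaler()      // optional normalization step
  .setInputCol("rawFeatures")
  .setOutputCol("features")

val lr = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")

// Fit and apply the whole thing as one unit.
val pipeline = new Pipeline().setStages(Array(assembler, scaler, lr))
val model = pipeline.fit(df)
val predictions = model.transform(df)  // adds a "prediction" column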

On Wed, Jul 20, 2022 at 3:04 AM Edgar H  wrote:

> Morning everyone,
>
> The question may seem to broad but will try to synth as much as possible:
>
> I'm used to work with Spark SQL, DFs and such on a daily basis, easily
> grouping, getting extra counters and using functions or UDFs. However, I've
> come to an scenario where I need to make some predictions and linear
> regression is the way to go.
>
> However, lurking through the docs this belongs to the ML side of Spark and
> never been in there before...
>
> How is it working with Spark ML compared to what I'm used to? Training
> models, building a new one, adding more columns and such... Is there even a
> change or I'm just confused and it's pretty easy?
>
> When deploying ML pipelines, is there anything to take into account
> compared to the usual ones with Spark SQL and such?
>
> And... Is it even possible to do linear regression (or any other ML
> method) inside a traditional pipeline without training or any other ML
> related aspects?
>
> Some guidelines (or articles, ref to docs) would be helpful to start if
> possible.
>
> Thanks!
>


Building a ML pipeline with no training

2022-07-20 Thread Edgar H
Morning everyone,

The question may seem too broad, but I will try to synthesize as much as possible:

I'm used to working with Spark SQL, DataFrames and such on a daily basis, easily
grouping, getting extra counters and using functions or UDFs. However, I've
come to a scenario where I need to make some predictions, and linear
regression is the way to go.

However, looking through the docs, this belongs to the ML side of Spark, and I've
never been in there before...

How is working with Spark ML compared to what I'm used to? Training
models, building a new one, adding more columns and such... Is there even much of a
change, or am I just confused and it's pretty easy?

When deploying ML pipelines, is there anything to take into account
compared to the usual ones with Spark SQL and such?

And... Is it even possible to do linear regression (or any other ML method)
inside a traditional pipeline without training or any other ML related
aspects?

Some guidelines (or articles, references to the docs) would be helpful to get
started, if possible.

Thanks!


RE: Re: [Spark ML Pipeline]: Error Loading Pipeline Model with Custom Transformer

2022-01-12 Thread Alana Young
I have updated the gist 
(https://gist.github.com/ally1221/5acddd9650de3dc67f6399a4687893aa 
). Please 
let me know if there are any additional questions.

Re: [Spark ML Pipeline]: Error Loading Pipeline Model with Custom Transformer

2022-01-12 Thread Gourav Sengupta
Hi,

Maybe I have less time, but can you please add some inline comments in
your code to explain what you are trying to do?

Regards,
Gourav Sengupta



On Tue, Jan 11, 2022 at 5:29 PM Alana Young  wrote:

> I am experimenting with creating and persisting ML pipelines using custom
> transformers (I am using Spark 3.1.2). I was able to create a transformer
> class (for testing purposes, I modeled the code off the SQLTransformer
> class) and save the pipeline model. When I attempt to load the saved
> pipeline model, I am running into the following error:
>
> java.lang.NullPointerException at
> java.base/java.lang.reflect.Method.invoke(Method.java:559) at
> org.apache.spark.ml.util.DefaultParamsReader$.loadParamsInstanceReader(ReadWrite.scala:631)
> at
> org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$4(Pipeline.scala:276)
> at
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
> at
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at
> scala.collection.TraversableLike.map(TraversableLike.scala:238) at
> scala.collection.TraversableLike.map$(TraversableLike.scala:231) at
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at
> org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$3(Pipeline.scala:274)
> at
> org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
> at scala.util.Try$.apply(Try.scala:213) at
> org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
> at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:268)
> at
> org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$7(Pipeline.scala:356)
> at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:160) at
> org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:155) at
> org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:42)
> at
> org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$6(Pipeline.scala:355)
> at
> org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
> at scala.util.Try$.apply(Try.scala:213) at
> org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
> at
> org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:355)
> at
> org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:349)
> ... 38 elided
>
>
> Here is a gist
>  containing
> the relevant code. Any feedback and advice would be appreciated. Thank
> you.
>


[Spark ML Pipeline]: Error Loading Pipeline Model with Custom Transformer

2022-01-11 Thread Alana Young
I am experimenting with creating and persisting ML pipelines using custom 
transformers (I am using Spark 3.1.2). I was able to create a transformer class 
(for testing purposes, I modeled the code off the SQLTransformer class) and 
save the pipeline model. When I attempt to load the saved pipeline model, I am 
running into the following error: 

java.lang.NullPointerException
  at java.base/java.lang.reflect.Method.invoke(Method.java:559)
  at 
org.apache.spark.ml.util.DefaultParamsReader$.loadParamsInstanceReader(ReadWrite.scala:631)
  at 
org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$4(Pipeline.scala:276)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
  at scala.collection.TraversableLike.map(TraversableLike.scala:238)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
  at 
org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$3(Pipeline.scala:274)
  at 
org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
  at scala.util.Try$.apply(Try.scala:213)
  at 
org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
  at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:268)
  at 
org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$7(Pipeline.scala:356)
  at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:160)
  at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:155)
  at 
org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:42)
  at 
org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$6(Pipeline.scala:355)
  at 
org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
  at scala.util.Try$.apply(Try.scala:213)
  at 
org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
  at 
org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:355)
  at 
org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:349)
  ... 38 elided


Here is a gist 
 containing 
the relevant code. Any feedback and advice would be appreciated. Thank you. 

Re: Problem in Restoring ML Pipeline with UDF

2021-06-08 Thread Sean Owen
It's a little bit of a guess, but the class name
$line103090609224.$read$FeatureModder looks like something generated by the
shell. I think it's your 'real' classname in this case. If you redefined
this later and loaded it you may not find it matches up. Can you declare
this in a package?
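
A minimal sketch of that suggestion (the package name is a placeholder, and the body
is a simplified rewrite rather than the original code): compile the transformer into a
named package, build it into a jar, and put that jar on the classpath of every session
that saves or loads the model, instead of defining the class in the shell. A companion
object extending DefaultParamsReadable, following the pattern of the built-in
SQLTransformer, gives the persistence machinery a static read/load to resolve.

package com.example.ml

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{IntegerType, StructType}

// Simplified take on the transformer below, declared in a real package so that the
// class name recorded in the saved metadata resolves in any session that has the jar.
class FeatureModder(override val uid: String) extends Transformer with DefaultParamsWritable {
  def this() = this(Identifiable.randomUID("FeatureModder"))

  final val inputCol = new Param[String](this, "inputCol", "input column")
  final def setInputCol(value: String): this.type = set(inputCol, value)

  final val outputCol = new Param[String](this, "outputCol", "output column")
  final def setOutputCol(value: String): this.type = set(outputCol, value)

  final val size = new Param[String](this, "size", "length of output vector")
  final def setSize(n: Int): this.type = set(size, n.toString)

  override def transform(data: Dataset[_]): DataFrame = {
    val modUDF = udf { n: Int => n % $(size).toInt }
    data.withColumn($(outputCol), modUDF(col($(inputCol)).cast(IntegerType)))
  }

  override def transformSchema(schema: StructType): StructType =
    schema.add($(outputCol), IntegerType, nullable = false)

  override def copy(extra: ParamMap): FeatureModder = defaultCopy(extra)
}

// Companion object so the reader can resolve a static read/load for this class.
object FeatureModder extends DefaultParamsReadable[FeatureModder]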

On Tue, Jun 8, 2021 at 10:50 AM Artemis User  wrote:

> We have a feature engineering transformer defined as a custom class with
> UDF as follows:
>
> class FeatureModder extends Transformer with DefaultParamsWritable with
> DefaultParamsReadable[FeatureModder] {
> val uid: String = "FeatureModder"+randomUUID
>
> final val inputCol: Param[String] = new Param[String](this,
> "inputCos", "input column")
> final def setInputCol(col:String) = set(inputCol, col)
>
> final val outputCol: Param[String] = new Param[String](this,
> "outputCol", "output column")
> final def setOutputCol(col:String) = set(outputCol, col)
>
> final val size: Param[String] = new Param[String](this, "size",
> "length of output vector")
> final def setSize = (n:Int) => set(size, n.toString)
>
> override def transform(data: Dataset[_]) = {
> val modUDF = udf({n: Int => n % $(size).toInt})
> data.withColumn($(outputCol),
> modUDF(col($(inputCol)).cast(IntegerType)))
> }
>
> def transformSchema(schema: org.apache.spark.sql.types.StructType):
> org.apache.spark.sql.types.StructType = {
> val actualType = schema($(inputCol)).dataType
> require(actualType.equals(IntegerType) ||
> actualType.equals(DoubleType), s"Input column must be of numeric type")
> DataTypes.createStructType(schema.fields :+
> DataTypes.createStructField($(outputCol), IntegerType, false))
> }
>
> override def copy(extra: ParamMap): Transformer = copy(extra)
> }
>
> This was included in an ML pipeline, fitted into a model and persisted to
> a disk file.  When we try to load the pipeline model in a separate notebook
> (we use Zeppelin), an exception is thrown complaining class not fund.
>
> java.lang.ClassNotFoundException: $line103090609224.$read$FeatureModder at
> scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:72)
> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589) at
> java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522) at
> java.base/java.lang.Class.forName0(Native Method) at
> java.base/java.lang.Class.forName(Class.java:398) at
> org.apache.spark.util.Utils$.classForName(Utils.scala:207) at
> org.apache.spark.ml.util.DefaultParamsReader$.loadParamsInstanceReader(ReadWrite.scala:630)
> at
> org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$4(Pipeline.scala:276)
> at
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
> at
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at
> scala.collection.TraversableLike.map(TraversableLike.scala:238) at
> scala.collection.TraversableLike.map$(TraversableLike.scala:231) at
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at
> org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$3(Pipeline.scala:274)
> at
> org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
> at scala.util.Try$.apply(Try.scala:213) at
> org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
> at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:268)
> at
> org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$7(Pipeline.scala:356)
> at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:160) at
> org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:155) at
> org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:42)
> at
> org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$6(Pipeline.scala:355)
> at
> org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
> at scala.util.Try$.apply(Try.scala:213) at
> org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
> at
> org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:355)
> at
> org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:349)
> at org.apache.spark.ml.util.MLReadable.load(ReadWrite.scala:355) at
> org.apache.spark.ml.util.MLReadable.load$(ReadWrite.scala:355) at
> org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:

Problem in Restoring ML Pipeline with UDF

2021-06-08 Thread Artemis User
We have a feature engineering transformer defined as a custom class with 
UDF as follows:


class FeatureModder extends Transformer with DefaultParamsWritable with
    DefaultParamsReadable[FeatureModder] {

    val uid: String = "FeatureModder" + randomUUID

    final val inputCol: Param[String] = new Param[String](this,
        "inputCos", "input column")
    final def setInputCol(col: String) = set(inputCol, col)

    final val outputCol: Param[String] = new Param[String](this,
        "outputCol", "output column")
    final def setOutputCol(col: String) = set(outputCol, col)

    final val size: Param[String] = new Param[String](this, "size",
        "length of output vector")
    final def setSize = (n: Int) => set(size, n.toString)

    override def transform(data: Dataset[_]) = {
        val modUDF = udf({ n: Int => n % $(size).toInt })
        data.withColumn($(outputCol),
            modUDF(col($(inputCol)).cast(IntegerType)))
    }

    def transformSchema(schema: org.apache.spark.sql.types.StructType):
            org.apache.spark.sql.types.StructType = {
        val actualType = schema($(inputCol)).dataType
        require(actualType.equals(IntegerType) ||
            actualType.equals(DoubleType), s"Input column must be of numeric type")
        DataTypes.createStructType(schema.fields :+
            DataTypes.createStructField($(outputCol), IntegerType, false))
    }

    override def copy(extra: ParamMap): Transformer = copy(extra)
}

This was included in an ML pipeline, fitted into a model, and persisted
to a disk file.  When we try to load the pipeline model in a separate
notebook (we use Zeppelin), an exception is thrown complaining that the
class is not found.


java.lang.ClassNotFoundException: $line103090609224.$read$FeatureModder 
at 
scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:72) 
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589) at 
java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522) at 
java.base/java.lang.Class.forName0(Native Method) at 
java.base/java.lang.Class.forName(Class.java:398) at 
org.apache.spark.util.Utils$.classForName(Utils.scala:207) at 
org.apache.spark.ml.util.DefaultParamsReader$.loadParamsInstanceReader(ReadWrite.scala:630) 
at 
org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$4(Pipeline.scala:276) 
at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) 
at 
scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) 
at 
scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) 
at scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at 
org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$3(Pipeline.scala:274) 
at 
org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191) 
at scala.util.Try$.apply(Try.scala:213) at 
org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191) 
at 
org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:268) 
at 
org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$7(Pipeline.scala:356) 
at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:160) 
at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:155) 
at 
org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:42) 
at 
org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$6(Pipeline.scala:355) 
at 
org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191) 
at scala.util.Try$.apply(Try.scala:213) at 
org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191) 
at 
org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:355) 
at 
org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:349) 
at org.apache.spark.ml.util.MLReadable.load(ReadWrite.scala:355) at 
org.apache.spark.ml.util.MLReadable.load$(ReadWrite.scala:355) at 
org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:337) ... 40 elided

Could someone help explain why? My guess is that the class definition is not on the
classpath. The question is how to include the class definition or class metadata as
part of the pipeline model serialization, or how to include the class definition in a
notebook (we did include the class definition in the notebook that loads the pipeline
model)?


Thanks a lot in advance for your help!

ND


[Spark MLlib]: Multiple input dataframes and non-linear ML pipeline

2020-04-09 Thread Qingsheng Ren
Hi all,

I'm using ML Pipeline to construct a flow of transformations. I'm wondering
if it is possible to set multiple dataframes as the input of a transformer?
For example, I need to join two dataframes together in a transformer, then
feed the result into the estimator for training. If not, is there any plan to
support this in the future?

Another question is about non-linear pipelines. Since we can freely assign
the input and output columns of a pipeline stage, what will happen if I build
a problematic DAG (like a circular one)? Is there any mechanism to prevent
this from happening?
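
On the first question, one pattern that is sometimes used (a sketch under my own
assumptions, not a built-in feature of the Pipeline API) is a custom Transformer that
closes over the second dataframe at construction time and joins against it inside
transform(); the class, column, and key names below are placeholders:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

// Joins the stage's input against a second dataframe captured at construction time.
class JoinWith(other: DataFrame, joinKey: String) extends Transformer {
  override val uid: String = Identifiable.randomUID("joinWith")

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.toDF().join(other, Seq(joinKey))

  // Approximate output schema: the input fields plus the non-key fields of `other`.
  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields ++ other.schema.fields.filterNot(_.name == joinKey))

  override def copy(extra: ParamMap): JoinWith = new JoinWith(other, joinKey)
}

Note that such a stage is not persistable with the default writers, since the captured
dataframe is not a Param.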

Thanks~

Qingsheng (Patrick) Ren



Re: ml Pipeline read write

2019-05-10 Thread Koert Kuipers
i guess it simply is never set, in which case it is created in:

  protected final def sparkSession: SparkSession = {
if (optionSparkSession.isEmpty) {
  optionSparkSession = Some(SparkSession.builder().getOrCreate())
}
optionSparkSession.get
  }
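
A small sketch of that hook being used explicitly (the model and path are
placeholders); if session() is never called, the getOrCreate() fallback above kicks in:

import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val pipelineModel: PipelineModel = ???   // a fitted pipeline model

// Writer side: pin the session explicitly instead of relying on getOrCreate().
pipelineModel.write.session(spark).overwrite().save("hdfs:///models/my-pipeline")

// Reader side: the same hook exists on the MLReader.
val reloaded = PipelineModel.read.session(spark).load("hdfs:///models/my-pipeline")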

On Fri, May 10, 2019 at 4:31 PM Koert Kuipers  wrote:

> i am trying to understand how ml persists pipelines. it seems a
> SparkSession or SparkContext is needed for this, to write to hdfs.
>
> MLWriter and MLReader both extend BaseReadWrite to have access to a
> SparkSession. but this is where it gets confusing... the only way to set
> the SparkSession seems to be in BaseReadWrite:
>
> def session(sparkSession: SparkSession): this.type
>
> and i can find no place this is actually used, except for in one unit
> test: org.apache.spark.ml.util.JavaDefaultReadWriteSuite
>
> i confirmed it is not used by simply adding a line inside that method that
> throws an error, and all unit tests pass except for
> JavaDefaultReadWriteSuite.
>
> how is the sparkSession set?
> thanks!
>
> koert
>
>
>


ml Pipeline read write

2019-05-10 Thread Koert Kuipers
i am trying to understand how ml persists pipelines. it seems a
SparkSession or SparkContext is needed for this, to write to hdfs.

MLWriter and MLReader both extend BaseReadWrite to have access to a
SparkSession. but this is where it gets confusing... the only way to set
the SparkSession seems to be in BaseReadWrite:

def session(sparkSession: SparkSession): this.type

and i can find no place this is actually used, except for in one unit test:
org.apache.spark.ml.util.JavaDefaultReadWriteSuite

i confirmed it is not used by simply adding a line inside that method that
throws an error, and all unit tests pass except for
JavaDefaultReadWriteSuite.

how is the sparkSession set?
thanks!

koert


Re: [Spark Streaming] [ML]: Exception handling for the transform method of Spark ML pipeline model

2018-08-17 Thread sudododo
Hi,

Any help on this?

Thanks,



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[Spark Streaming] [ML]: Exception handling for the transform method of Spark ML pipeline model

2018-08-16 Thread sudododo
Hi,

I'm implementing a Spark Streaming + ML application. The data comes into a
Kafka topic in JSON format. The Spark Kafka connector reads the data from
the Kafka topic as a DStream. After several preprocessing steps, the input
DStream is transformed into a feature DStream, which is fed into a Spark ML
pipeline model. The code sample below shows how the feature DStream interacts
with the pipeline model.

prediction_stream = feature_stream.transform(lambda rdd: predict_rdd(rdd, pipeline_model))

def predict_rdd(rdd, pipeline_model):
    if (rdd is not None) and (not rdd.isEmpty()):
        try:
            df = rdd.toDF()
            predictions = pipeline_model.transform(df)
            return predictions.rdd
        except Exception as e:
            print("Unable to make predictions")
            return None
    else:
        return None

Here comes the problem: if the pipeline_model.transform(df) call fails due to
data issues in some rows of df, the try...except block won't be able to catch
the exception, since the exception is thrown on the executors. As a result,
the exception bubbles up to Spark and the streaming application is
terminated.

I want the exception to be caught in such a way that the streaming application
isn't terminated and keeps processing incoming data. Is that possible?

I know one solution could be doing more thorough data validation in the
preprocessing step. However, some sort of error handling should be put in
place around the transform method of the pipeline model, just in case anything
unexpected happens.


Thanks,



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Deploying ML Pipeline Model

2016-07-05 Thread Nick Pentreath
It all depends on your latency requirements and volume. 100s of queries per
minute, with an acceptable latency of up to a few seconds? Yes, you could
use Spark for serving, especially if you're smart about caching results
(and I don't mean just Spark caching, but caching recommendation results
for example similar items etc).

However for many serving use cases using a Spark cluster is too much
overhead. Bear in mind real-world serving of many models (recommendations,
ad-serving, fraud etc) is one component of a complex workflow (e.g. one
page request in ad tech cases involves tens of requests and hops between
various ad servers and exchanges). That is why often the practical latency
bounds are < 100ms (or way, way tighter for ad serving for example).


On Fri, 1 Jul 2016 at 21:59 Saurabh Sardeshpande <saurabh...@gmail.com>
wrote:

> Hi Nick,
>
> Thanks for the answer. Do you think an implementation like the one in this
> article is infeasible in production for say, hundreds of queries per
> minute?
> https://www.codementor.io/spark/tutorial/building-a-web-service-with-apache-spark-flask-example-app-part2.
> The article uses Flask to define routes and Spark for evaluating requests.
>
> Regards,
> Saurabh
>
>
>
>
>
>
> On Fri, Jul 1, 2016 at 10:47 AM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>> Generally there are 2 ways to use a trained pipeline model - (offline)
>> batch scoring, and real-time online scoring.
>>
>> For batch (or even "mini-batch" e.g. on Spark streaming data), then yes
>> certainly loading the model back in Spark and feeding new data through the
>> pipeline for prediction works just fine, and this is essentially what is
>> supported in 1.6 (and more or less full coverage in 2.0). For large batch
>> cases this can be quite efficient.
>>
>> However, usually for real-time use cases, the latency required is fairly
>> low - of the order of a few ms to a few 100ms for a request (some examples
>> include recommendations, ad-serving, fraud detection etc).
>>
>> In these cases, using Spark has 2 issues: (1) latency for prediction on
>> the pipeline, which is based on DataFrames and therefore distributed
>> execution, is usually fairly high "per request"; (2) this requires pulling
>> in all of Spark for your real-time serving layer (or running a full Spark
>> cluster), which is usually way too much overkill - all you really need for
>> serving is a bit of linear algebra and some basic transformations.
>>
>> So for now, unfortunately there is not much in the way of options for
>> exporting your pipelines and serving them outside of Spark - the
>> JPMML-based project mentioned on this thread is one option. The other
>> option at this point is to write your own export functionality and your own
>> serving layer.
>>
>> There is (very initial) movement towards improving the local serving
>> possibilities (see https://issues.apache.org/jira/browse/SPARK-13944 which
>> was the "first step" in this process).
>>
>> On Fri, 1 Jul 2016 at 19:24 Jacek Laskowski <ja...@japila.pl> wrote:
>>
>>> Hi Rishabh,
>>>
>>> I've just today had similar conversation about how to do a ML Pipeline
>>> deployment and couldn't really answer this question and more because I
>>> don't really understand the use case.
>>>
>>> What would you expect from ML Pipeline model deployment? You can save
>>> your model to a file by model.write.overwrite.save("model_v1").
>>>
>>> model_v1
>>> |-- metadata
>>> |   |-- _SUCCESS
>>> |   `-- part-0
>>> `-- stages
>>> |-- 0_regexTok_b4265099cc1c
>>> |   `-- metadata
>>> |   |-- _SUCCESS
>>> |   `-- part-0
>>> |-- 1_hashingTF_8de997cf54ba
>>> |   `-- metadata
>>> |   |-- _SUCCESS
>>> |   `-- part-0
>>> `-- 2_linReg_3942a71d2c0e
>>> |-- data
>>> |   |-- _SUCCESS
>>> |   |-- _common_metadata
>>> |   |-- _metadata
>>> |   `--
>>> part-r-0-2096c55a-d654-42b2-90d3-5a310101cba5.gz.parquet
>>> `-- metadata
>>> |-- _SUCCESS
>>> `-- part-0
>>>
>>> 9 directories, 12 files
>>>
>>> What would you like to have outside SparkContext? What's wrong with
>>> using Spark? Just curious hoping to understand the use case better.
>>> Thanks.
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>>

Re: Deploying ML Pipeline Model

2016-07-05 Thread Nick Pentreath
Sean is correct - we now use jpmml-model (which is actually BSD 3-clause,
where the old jpmml was A2L, but either works).

On Fri, 1 Jul 2016 at 21:40 Sean Owen  wrote:

> (The more core JPMML libs are Apache 2; OpenScoring is AGPL. We use
> JPMML in Spark and couldn't otherwise because the Affero license is
> not Apache compatible.)
>
> On Fri, Jul 1, 2016 at 8:16 PM, Nick Pentreath 
> wrote:
> > I believe open-scoring is one of the well-known PMML serving frameworks
> in
> > Java land (https://github.com/jpmml/openscoring). One can also use the
> raw
> > https://github.com/jpmml/jpmml-evaluator for embedding in apps.
> >
> > (Note the license on both of these is AGPL - the older version of JPMML
> used
> > to be Apache2 if I recall correctly).
> >
>


Re: Deploying ML Pipeline Model

2016-07-01 Thread Saurabh Sardeshpande
Hi Nick,

Thanks for the answer. Do you think an implementation like the one in this
article is infeasible in production for say, hundreds of queries per
minute?
https://www.codementor.io/spark/tutorial/building-a-web-service-with-apache-spark-flask-example-app-part2.
The article uses Flask to define routes and Spark for evaluating requests.

Regards,
Saurabh






On Fri, Jul 1, 2016 at 10:47 AM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> Generally there are 2 ways to use a trained pipeline model - (offline)
> batch scoring, and real-time online scoring.
>
> For batch (or even "mini-batch" e.g. on Spark streaming data), then yes
> certainly loading the model back in Spark and feeding new data through the
> pipeline for prediction works just fine, and this is essentially what is
> supported in 1.6 (and more or less full coverage in 2.0). For large batch
> cases this can be quite efficient.
>
> However, usually for real-time use cases, the latency required is fairly
> low - of the order of a few ms to a few 100ms for a request (some examples
> include recommendations, ad-serving, fraud detection etc).
>
> In these cases, using Spark has 2 issues: (1) latency for prediction on
> the pipeline, which is based on DataFrames and therefore distributed
> execution, is usually fairly high "per request"; (2) this requires pulling
> in all of Spark for your real-time serving layer (or running a full Spark
> cluster), which is usually way too much overkill - all you really need for
> serving is a bit of linear algebra and some basic transformations.
>
> So for now, unfortunately there is not much in the way of options for
> exporting your pipelines and serving them outside of Spark - the
> JPMML-based project mentioned on this thread is one option. The other
> option at this point is to write your own export functionality and your own
> serving layer.
>
> There is (very initial) movement towards improving the local serving
> possibilities (see https://issues.apache.org/jira/browse/SPARK-13944 which
> was the "first step" in this process).
>
> On Fri, 1 Jul 2016 at 19:24 Jacek Laskowski <ja...@japila.pl> wrote:
>
>> Hi Rishabh,
>>
>> I've just today had similar conversation about how to do a ML Pipeline
>> deployment and couldn't really answer this question and more because I
>> don't really understand the use case.
>>
>> What would you expect from ML Pipeline model deployment? You can save
>> your model to a file by model.write.overwrite.save("model_v1").
>>
>> model_v1
>> |-- metadata
>> |   |-- _SUCCESS
>> |   `-- part-0
>> `-- stages
>> |-- 0_regexTok_b4265099cc1c
>> |   `-- metadata
>> |   |-- _SUCCESS
>> |   `-- part-0
>> |-- 1_hashingTF_8de997cf54ba
>> |   `-- metadata
>> |   |-- _SUCCESS
>> |   `-- part-0
>> `-- 2_linReg_3942a71d2c0e
>> |-- data
>> |   |-- _SUCCESS
>> |   |-- _common_metadata
>> |   |-- _metadata
>> |   `--
>> part-r-0-2096c55a-d654-42b2-90d3-5a310101cba5.gz.parquet
>> `-- metadata
>> |-- _SUCCESS
>> `-- part-0
>>
>> 9 directories, 12 files
>>
>> What would you like to have outside SparkContext? What's wrong with
>> using Spark? Just curious hoping to understand the use case better.
>> Thanks.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Fri, Jul 1, 2016 at 12:54 PM, Rishabh Bhardwaj <rbnex...@gmail.com>
>> wrote:
>> > Hi All,
>> >
>> > I am looking for ways to deploy a ML Pipeline model in production .
>> > Spark has already proved to be a one of the best framework for model
>> > training and creation, but once the ml pipeline model is ready how can I
>> > deploy it outside spark context ?
>> > MLlib model has toPMML method but today Pipeline model can not be saved
>> to
>> > PMML. There are some frameworks like MLeap which are trying to abstract
>> > Pipeline Model and provide ML Pipeline Model deployment outside spark
>> > context,but currently they don't have most of the ml transformers and
>> > estimators.
>> > I am looking for related work going on this area.
>> > Any pointers will be helpful.
>> >
>> > Thanks,
>> > Rishabh.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: Deploying ML Pipeline Model

2016-07-01 Thread Sean Owen
(The more core JPMML libs are Apache 2; OpenScoring is AGPL. We use
JPMML in Spark and couldn't otherwise because the Affero license is
not Apache compatible.)

On Fri, Jul 1, 2016 at 8:16 PM, Nick Pentreath  wrote:
> I believe open-scoring is one of the well-known PMML serving frameworks in
> Java land (https://github.com/jpmml/openscoring). One can also use the raw
> https://github.com/jpmml/jpmml-evaluator for embedding in apps.
>
> (Note the license on both of these is AGPL - the older version of JPMML used
> to be Apache2 if I recall correctly).
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Deploying ML Pipeline Model

2016-07-01 Thread Nick Pentreath
I believe open-scoring is one of the well-known PMML serving frameworks in
Java land (https://github.com/jpmml/openscoring). One can also use the raw
https://github.com/jpmml/jpmml-evaluator for embedding in apps.

(Note the license on both of these is AGPL - the older version of JPMML
used to be Apache2 if I recall correctly).

On Fri, 1 Jul 2016 at 20:15 Jacek Laskowski <ja...@japila.pl> wrote:

> Hi Nick,
>
> Thanks a lot for the exhaustive and prompt response! (In the meantime
> I watched a video about PMML to get a better understanding of the
> topic).
>
> What are the tools that could "consume" PMML exports (after running
> JPMML)? What tools would be the endpoint to deliver low-latency
> predictions by doing this "a bit of linear algebra and some basic
> transformations"?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Fri, Jul 1, 2016 at 6:47 PM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
> > Generally there are 2 ways to use a trained pipeline model - (offline)
> batch
> > scoring, and real-time online scoring.
> >
> > For batch (or even "mini-batch" e.g. on Spark streaming data), then yes
> > certainly loading the model back in Spark and feeding new data through
> the
> > pipeline for prediction works just fine, and this is essentially what is
> > supported in 1.6 (and more or less full coverage in 2.0). For large batch
> > cases this can be quite efficient.
> >
> > However, usually for real-time use cases, the latency required is fairly
> low
> > - of the order of a few ms to a few 100ms for a request (some examples
> > include recommendations, ad-serving, fraud detection etc).
> >
> > In these cases, using Spark has 2 issues: (1) latency for prediction on
> the
> > pipeline, which is based on DataFrames and therefore distributed
> execution,
> > is usually fairly high "per request"; (2) this requires pulling in all of
> > Spark for your real-time serving layer (or running a full Spark cluster),
> > which is usually way too much overkill - all you really need for serving
> is
> > a bit of linear algebra and some basic transformations.
> >
> > So for now, unfortunately there is not much in the way of options for
> > exporting your pipelines and serving them outside of Spark - the
> JPMML-based
> > project mentioned on this thread is one option. The other option at this
> > point is to write your own export functionality and your own serving
> layer.
> >
> > There is (very initial) movement towards improving the local serving
> > possibilities (see https://issues.apache.org/jira/browse/SPARK-13944
> which
> > was the "first step" in this process).
> >
> > On Fri, 1 Jul 2016 at 19:24 Jacek Laskowski <ja...@japila.pl> wrote:
> >>
> >> Hi Rishabh,
> >>
> >> I've just today had similar conversation about how to do a ML Pipeline
> >> deployment and couldn't really answer this question and more because I
> >> don't really understand the use case.
> >>
> >> What would you expect from ML Pipeline model deployment? You can save
> >> your model to a file by model.write.overwrite.save("model_v1").
> >>
> >> model_v1
> >> |-- metadata
> >> |   |-- _SUCCESS
> >> |   `-- part-0
> >> `-- stages
> >> |-- 0_regexTok_b4265099cc1c
> >> |   `-- metadata
> >> |   |-- _SUCCESS
> >> |   `-- part-0
> >> |-- 1_hashingTF_8de997cf54ba
> >> |   `-- metadata
> >> |   |-- _SUCCESS
> >> |   `-- part-0
> >> `-- 2_linReg_3942a71d2c0e
> >> |-- data
> >> |   |-- _SUCCESS
> >> |   |-- _common_metadata
> >> |   |-- _metadata
> >> |   `--
> >> part-r-0-2096c55a-d654-42b2-90d3-5a310101cba5.gz.parquet
> >> `-- metadata
> >> |-- _SUCCESS
> >> `-- part-0
> >>
> >> 9 directories, 12 files
> >>
> >> What would you like to have outside SparkContext? What's wrong with
> >> using Spark? Just curious hoping to understand the use case better.
> >> Thanks.
> >>
> >> Pozdrawiam,
> >> Jacek Laskowski
> >> 
> >> https://medium.com/@jaceklaskowski/
> >> Mastering Apache Spark http://bit.ly/mastering-apach

Re: Deploying ML Pipeline Model

2016-07-01 Thread Jacek Laskowski
Hi Nick,

Thanks a lot for the exhaustive and prompt response! (In the meantime
I watched a video about PMML to get a better understanding of the
topic).

What are the tools that could "consume" PMML exports (after running
JPMML)? What tools would be the endpoint to deliver low-latency
predictions by doing this "a bit of linear algebra and some basic
transformations"?

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Fri, Jul 1, 2016 at 6:47 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
> Generally there are 2 ways to use a trained pipeline model - (offline) batch
> scoring, and real-time online scoring.
>
> For batch (or even "mini-batch" e.g. on Spark streaming data), then yes
> certainly loading the model back in Spark and feeding new data through the
> pipeline for prediction works just fine, and this is essentially what is
> supported in 1.6 (and more or less full coverage in 2.0). For large batch
> cases this can be quite efficient.
>
> However, usually for real-time use cases, the latency required is fairly low
> - of the order of a few ms to a few 100ms for a request (some examples
> include recommendations, ad-serving, fraud detection etc).
>
> In these cases, using Spark has 2 issues: (1) latency for prediction on the
> pipeline, which is based on DataFrames and therefore distributed execution,
> is usually fairly high "per request"; (2) this requires pulling in all of
> Spark for your real-time serving layer (or running a full Spark cluster),
> which is usually way too much overkill - all you really need for serving is
> a bit of linear algebra and some basic transformations.
>
> So for now, unfortunately there is not much in the way of options for
> exporting your pipelines and serving them outside of Spark - the JPMML-based
> project mentioned on this thread is one option. The other option at this
> point is to write your own export functionality and your own serving layer.
>
> There is (very initial) movement towards improving the local serving
> possibilities (see https://issues.apache.org/jira/browse/SPARK-13944 which
> was the "first step" in this process).
>
> On Fri, 1 Jul 2016 at 19:24 Jacek Laskowski <ja...@japila.pl> wrote:
>>
>> Hi Rishabh,
>>
>> I've just today had similar conversation about how to do a ML Pipeline
>> deployment and couldn't really answer this question and more because I
>> don't really understand the use case.
>>
>> What would you expect from ML Pipeline model deployment? You can save
>> your model to a file by model.write.overwrite.save("model_v1").
>>
>> model_v1
>> |-- metadata
>> |   |-- _SUCCESS
>> |   `-- part-0
>> `-- stages
>> |-- 0_regexTok_b4265099cc1c
>> |   `-- metadata
>> |   |-- _SUCCESS
>> |   `-- part-0
>> |-- 1_hashingTF_8de997cf54ba
>> |   `-- metadata
>> |   |-- _SUCCESS
>> |   `-- part-0
>> `-- 2_linReg_3942a71d2c0e
>> |-- data
>> |   |-- _SUCCESS
>> |   |-- _common_metadata
>> |   |-- _metadata
>> |   `--
>> part-r-0-2096c55a-d654-42b2-90d3-5a310101cba5.gz.parquet
>> `-- metadata
>> |-- _SUCCESS
>> `-- part-0
>>
>> 9 directories, 12 files
>>
>> What would you like to have outside SparkContext? What's wrong with
>> using Spark? Just curious hoping to understand the use case better.
>> Thanks.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Fri, Jul 1, 2016 at 12:54 PM, Rishabh Bhardwaj <rbnex...@gmail.com>
>> wrote:
>> > Hi All,
>> >
>> > I am looking for ways to deploy a ML Pipeline model in production .
>> > Spark has already proved to be a one of the best framework for model
>> > training and creation, but once the ml pipeline model is ready how can I
>> > deploy it outside spark context ?
>> > MLlib model has toPMML method but today Pipeline model can not be saved
>> > to
>> > PMML. There are some frameworks like MLeap which are trying to abstract
>> > Pipeline Model and provide ML Pipeline Model deployment outside spark
>> > context,but currently they don't have most of the ml transformers and
>> > estimators.
>> > I am looking for related work going on this area.
>> > Any pointers will be helpful.
>> >
>> > Thanks,
>> > Rishabh.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Deploying ML Pipeline Model

2016-07-01 Thread Nick Pentreath
Generally there are 2 ways to use a trained pipeline model - (offline)
batch scoring, and real-time online scoring.

For batch (or even "mini-batch" e.g. on Spark streaming data), then yes
certainly loading the model back in Spark and feeding new data through the
pipeline for prediction works just fine, and this is essentially what is
supported in 1.6 (and more or less full coverage in 2.0). For large batch
cases this can be quite efficient.

However, usually for real-time use cases, the latency required is fairly
low - of the order of a few ms to a few 100ms for a request (some examples
include recommendations, ad-serving, fraud detection etc).

In these cases, using Spark has 2 issues: (1) latency for prediction on the
pipeline, which is based on DataFrames and therefore distributed execution,
is usually fairly high "per request"; (2) this requires pulling in all of
Spark for your real-time serving layer (or running a full Spark cluster),
which is usually way too much overkill - all you really need for serving is
a bit of linear algebra and some basic transformations.

So for now, unfortunately there is not much in the way of options for
exporting your pipelines and serving them outside of Spark - the
JPMML-based project mentioned on this thread is one option. The other
option at this point is to write your own export functionality and your own
serving layer.

There is (very initial) movement towards improving the local serving
possibilities (see https://issues.apache.org/jira/browse/SPARK-13944 which
was the "first step" in this process).

On Fri, 1 Jul 2016 at 19:24 Jacek Laskowski <ja...@japila.pl> wrote:

> Hi Rishabh,
>
> I've just today had similar conversation about how to do a ML Pipeline
> deployment and couldn't really answer this question and more because I
> don't really understand the use case.
>
> What would you expect from ML Pipeline model deployment? You can save
> your model to a file by model.write.overwrite.save("model_v1").
>
> model_v1
> |-- metadata
> |   |-- _SUCCESS
> |   `-- part-0
> `-- stages
> |-- 0_regexTok_b4265099cc1c
> |   `-- metadata
> |   |-- _SUCCESS
> |   `-- part-0
> |-- 1_hashingTF_8de997cf54ba
> |   `-- metadata
> |   |-- _SUCCESS
> |   `-- part-0
> `-- 2_linReg_3942a71d2c0e
> |-- data
> |   |-- _SUCCESS
> |   |-- _common_metadata
> |   |-- _metadata
> |   `--
> part-r-0-2096c55a-d654-42b2-90d3-5a310101cba5.gz.parquet
> `-- metadata
> |-- _SUCCESS
> `-- part-0
>
> 9 directories, 12 files
>
> What would you like to have outside SparkContext? What's wrong with
> using Spark? Just curious hoping to understand the use case better.
> Thanks.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Fri, Jul 1, 2016 at 12:54 PM, Rishabh Bhardwaj <rbnex...@gmail.com>
> wrote:
> > Hi All,
> >
> > I am looking for ways to deploy a ML Pipeline model in production .
> > Spark has already proved to be a one of the best framework for model
> > training and creation, but once the ml pipeline model is ready how can I
> > deploy it outside spark context ?
> > MLlib model has toPMML method but today Pipeline model can not be saved
> to
> > PMML. There are some frameworks like MLeap which are trying to abstract
> > Pipeline Model and provide ML Pipeline Model deployment outside spark
> > context,but currently they don't have most of the ml transformers and
> > estimators.
> > I am looking for related work going on this area.
> > Any pointers will be helpful.
> >
> > Thanks,
> > Rishabh.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Deploying ML Pipeline Model

2016-07-01 Thread Jacek Laskowski
Hi Rishabh,

I've just today had a similar conversation about how to do an ML Pipeline
deployment and couldn't really answer this question (and more), because I
don't really understand the use case.

What would you expect from ML Pipeline model deployment? You can save
your model to a file by model.write.overwrite.save("model_v1").

model_v1
|-- metadata
|   |-- _SUCCESS
|   `-- part-0
`-- stages
|-- 0_regexTok_b4265099cc1c
|   `-- metadata
|   |-- _SUCCESS
|   `-- part-0
|-- 1_hashingTF_8de997cf54ba
|   `-- metadata
|   |-- _SUCCESS
|   `-- part-0
`-- 2_linReg_3942a71d2c0e
|-- data
|   |-- _SUCCESS
|   |-- _common_metadata
|   |-- _metadata
|   `-- part-r-0-2096c55a-d654-42b2-90d3-5a310101cba5.gz.parquet
`-- metadata
|-- _SUCCESS
`-- part-0

9 directories, 12 files

What would you like to have outside SparkContext? What's wrong with
using Spark? Just curious hoping to understand the use case better.
Thanks.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Fri, Jul 1, 2016 at 12:54 PM, Rishabh Bhardwaj <rbnex...@gmail.com> wrote:
> Hi All,
>
> I am looking for ways to deploy a ML Pipeline model in production .
> Spark has already proved to be a one of the best framework for model
> training and creation, but once the ml pipeline model is ready how can I
> deploy it outside spark context ?
> MLlib model has toPMML method but today Pipeline model can not be saved to
> PMML. There are some frameworks like MLeap which are trying to abstract
> Pipeline Model and provide ML Pipeline Model deployment outside spark
> context,but currently they don't have most of the ml transformers and
> estimators.
> I am looking for related work going on this area.
> Any pointers will be helpful.
>
> Thanks,
> Rishabh.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Deploying ML Pipeline Model

2016-07-01 Thread Silvio Fiorito
Hi Rishabh,

My colleague, Richard Garris from Databricks, actually just gave a talk last 
night at the Bay Area Spark Meetup on ML model deployment. The slides and 
recording should be up soon, you should be able to find a link here: 
http://www.meetup.com/spark-users/events/231574440/

Thanks,
Silvio

From: Rishabh Bhardwaj <rbnex...@gmail.com>
Date: Friday, July 1, 2016 at 7:54 AM
To: user <user@spark.apache.org>
Cc: "d...@spark.apache.org" <d...@spark.apache.org>
Subject: Deploying ML Pipeline Model

Hi All,

I am looking for ways to deploy a ML Pipeline model in production .
Spark has already proved to be a one of the best framework for model training 
and creation, but once the ml pipeline model is ready how can I deploy it 
outside spark context ?
MLlib model has toPMML method but today Pipeline model can not be saved to 
PMML. There are some frameworks like MLeap which are trying to abstract 
Pipeline Model and provide ML Pipeline Model deployment outside spark 
context,but currently they don't have most of the ml transformers and 
estimators.
I am looking for related work going on this area.
Any pointers will be helpful.

Thanks,
Rishabh.


Re: Deploying ML Pipeline Model

2016-07-01 Thread Steve Goodman
Hi Rishabh,

I have a similar use case and have struggled to find the best solution. As
I understand it, 1.6 provides pipeline persistence in Scala, and that will
be expanded in 2.x. This project, https://github.com/jpmml/jpmml-sparkml,
claims to support about a dozen pipeline transformers and 6 or 7 different
model types, although I have not yet used it myself.

Looking forward to hearing better suggestions.

Steve


On Fri, Jul 1, 2016 at 12:54 PM, Rishabh Bhardwaj <rbnex...@gmail.com>
wrote:

> Hi All,
>
> I am looking for ways to deploy a ML Pipeline model in production .
> Spark has already proved to be a one of the best framework for model
> training and creation, but once the ml pipeline model is ready how can I
> deploy it outside spark context ?
> MLlib model has toPMML method but today Pipeline model can not be saved to
> PMML. There are some frameworks like MLeap which are trying to abstract
> Pipeline Model and provide ML Pipeline Model deployment outside spark
> context,but currently they don't have most of the ml transformers and
> estimators.
> I am looking for related work going on this area.
> Any pointers will be helpful.
>
> Thanks,
> Rishabh.
>


Deploying ML Pipeline Model

2016-07-01 Thread Rishabh Bhardwaj
Hi All,

I am looking for ways to deploy an ML Pipeline model in production.
Spark has already proved to be one of the best frameworks for model
training and creation, but once the ML pipeline model is ready, how can I
deploy it outside the Spark context?
An MLlib model has a toPMML method, but today a Pipeline model cannot be saved
to PMML. There are some frameworks, like MLeap, which are trying to abstract
the Pipeline Model and provide ML Pipeline Model deployment outside the Spark
context, but currently they don't cover most of the ML transformers and
estimators.
I am looking for related work going on in this area.
Any pointers will be helpful.

Thanks,
Rishabh.


Clear Threshold in Logistic Regression ML Pipeline

2016-05-03 Thread Abhishek Anand
Hi All,

I am trying to build a logistic regression pipeline in ML.

How can I clear the threshold, which by default is 0.5? In MLlib I am able
to clear the threshold to get the raw predictions using the
model.clearThreshold() function.
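
A minimal sketch of one way to get the raw scores on the spark.ml side (an assumption
on my part, not something confirmed in this thread; column names are the defaults, and
train/test are placeholders): a fitted LogisticRegressionModel emits rawPrediction and
probability columns, so the scores can be read directly instead of clearing a threshold.

import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")

val model = lr.fit(train)              // train: a DataFrame with "features" and "label"

// Read the scores the model already outputs alongside the thresholded prediction.
model.transform(test)
  .select("probability", "rawPrediction", "prediction")
  .show(5, truncate = false)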


Regards,
Abhi


Re: Strange ML pipeline errors from HashingTF using v1.6.1

2016-03-29 Thread Timothy Potter
FWIW - I synchronized access to the transformer and the problem went
away, so this looks like some type of concurrent access issue when
dealing with UDFs.
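
A sketch of that workaround (the model and input dataframe are placeholders): route all
concurrent calls to the shared pipeline model through one lock.

import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.DataFrame

object PredictionService {
  private val lock = new Object

  // Serialize access to the shared model; trades throughput for avoiding the
  // concurrent-access failure described above.
  def predict(model: PipelineModel, df: DataFrame): DataFrame =
    lock.synchronized {
      model.transform(df)
    }
}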

On Tue, Mar 29, 2016 at 9:19 AM, Timothy Potter <thelabd...@gmail.com> wrote:
> It's a local spark master, no cluster. I'm not sure what you mean
> about assembly or package? all of the Spark dependencies are on my
> classpath and this sometimes works.
>
> On Mon, Mar 28, 2016 at 11:45 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>> Hi,
>>
>> How do you run the pipeline? Do you assembly or package? Is this on
>> local or spark or other cluster manager? What's the build
>> configuration?
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Mon, Mar 28, 2016 at 7:11 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>>> I'm seeing the following error when trying to generate a prediction
>>> from a very simple ML pipeline based model. I've verified that the raw
>>> data sent to the tokenizer is valid (not null). It seems like this is
>>> some sort of weird classpath or class loading type issue. Any help you
>>> can provide in trying to troubleshoot this further would be
>>> appreciated.
>>>
>>>  Error in machine-learning, docId=20news-18828/alt.atheism/51176
>>> scala.reflect.internal.Symbols$CyclicReference: illegal cyclic
>>> reference involving package 
>>> at scala.reflect.internal.Symbols$TypeSymbol.tpe(Symbols.scala:2768)
>>> ~[scala-reflect-2.10.5.jar:?]
>>> at 
>>> scala.reflect.internal.Mirrors$Roots$RootPackage$.(Mirrors.scala:268)
>>> ~[scala-reflect-2.10.5.jar:?]
>>> at 
>>> scala.reflect.internal.Mirrors$Roots.RootPackage$lzycompute(Mirrors.scala:267)
>>> ~[scala-reflect-2.10.5.jar:?]
>>> at scala.reflect.internal.Mirrors$Roots.RootPackage(Mirrors.scala:267)
>>> ~[scala-reflect-2.10.5.jar:?]
>>> at 
>>> scala.reflect.runtime.JavaMirrors$JavaMirror.scala$reflect$runtime$JavaMirrors$$makeScalaPackage(JavaMirrors.scala:902)
>>> ~[scala-reflect-2.10.5.jar:?]
>>> at 
>>> scala.reflect.runtime.JavaMirrors$class.missingHook(JavaMirrors.scala:1299)
>>> ~[scala-reflect-2.10.5.jar:?]
>>> at scala.reflect.runtime.JavaUniverse.missingHook(JavaUniverse.scala:12)
>>> ~[scala-reflect-2.10.5.jar:?]
>>> at 
>>> scala.reflect.internal.Mirrors$RootsBase.universeMissingHook(Mirrors.scala:77)
>>> ~[scala-reflect-2.10.5.jar:?]
>>> at 
>>> scala.reflect.internal.Mirrors$RootsBase.missingHook(Mirrors.scala:79)
>>> ~[scala-reflect-2.10.5.jar:?]
>>> at 
>>> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
>>> ~[scala-reflect-2.10.5.jar:?]
>>> at 
>>> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
>>> ~[scala-reflect-2.10.5.jar:?]
>>> at 
>>> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
>>> ~[scala-reflect-2.10.5.jar:?]
>>> at 
>>> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
>>> ~[scala-reflect-2.10.5.jar:?]
>>> at 
>>> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
>>> ~[scala-reflect-2.10.5.jar:?]
>>> at 
>>> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
>>> ~[scala-reflect-2.10.5.jar:?]
>>> at 
>>> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
>>> ~[scala-reflect-2.10.5.jar:?]
>>> at 
>>> scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
>>> ~[scala-reflect-2.10.5.jar:?]
>>> at 
>>> scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
>>> ~[scala-reflect-2.10.5.jar:?]
>>> at 
>>> scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
>>> ~[scala-reflect-2.10.5.jar:?]
>>> at 
>>> org.apache.spark.ml.feature.HashingTF$$typecreator1$1.apply(HashingTF.scala:66)
>>> ~[spark-mllib_2.10-1.6.1.jar:1.6.1]
>>> at 
>>> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
>>> ~[scala-reflect-2.10.5.jar:?]
>>> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
>>> ~[scala-

Re: Strange ML pipeline errors from HashingTF using v1.6.1

2016-03-29 Thread Timothy Potter
It's a local Spark master, no cluster. I'm not sure what you mean
by assembly or package? All of the Spark dependencies are on my
classpath, and this sometimes works.

On Mon, Mar 28, 2016 at 11:45 PM, Jacek Laskowski <ja...@japila.pl> wrote:
> Hi,
>
> How do you run the pipeline? Do you assembly or package? Is this on
> local or spark or other cluster manager? What's the build
> configuration?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Mon, Mar 28, 2016 at 7:11 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>> I'm seeing the following error when trying to generate a prediction
>> from a very simple ML pipeline based model. I've verified that the raw
>> data sent to the tokenizer is valid (not null). It seems like this is
>> some sort of weird classpath or class loading type issue. Any help you
>> can provide in trying to troubleshoot this further would be
>> appreciated.
>>
>>  Error in machine-learning, docId=20news-18828/alt.atheism/51176
>> scala.reflect.internal.Symbols$CyclicReference: illegal cyclic
>> reference involving package 
>> at scala.reflect.internal.Symbols$TypeSymbol.tpe(Symbols.scala:2768)
>> ~[scala-reflect-2.10.5.jar:?]
>> at 
>> scala.reflect.internal.Mirrors$Roots$RootPackage$.(Mirrors.scala:268)
>> ~[scala-reflect-2.10.5.jar:?]
>> at 
>> scala.reflect.internal.Mirrors$Roots.RootPackage$lzycompute(Mirrors.scala:267)
>> ~[scala-reflect-2.10.5.jar:?]
>> at scala.reflect.internal.Mirrors$Roots.RootPackage(Mirrors.scala:267)
>> ~[scala-reflect-2.10.5.jar:?]
>> at 
>> scala.reflect.runtime.JavaMirrors$JavaMirror.scala$reflect$runtime$JavaMirrors$$makeScalaPackage(JavaMirrors.scala:902)
>> ~[scala-reflect-2.10.5.jar:?]
>> at 
>> scala.reflect.runtime.JavaMirrors$class.missingHook(JavaMirrors.scala:1299)
>> ~[scala-reflect-2.10.5.jar:?]
>> at scala.reflect.runtime.JavaUniverse.missingHook(JavaUniverse.scala:12)
>> ~[scala-reflect-2.10.5.jar:?]
>> at 
>> scala.reflect.internal.Mirrors$RootsBase.universeMissingHook(Mirrors.scala:77)
>> ~[scala-reflect-2.10.5.jar:?]
>> at scala.reflect.internal.Mirrors$RootsBase.missingHook(Mirrors.scala:79)
>> ~[scala-reflect-2.10.5.jar:?]
>> at 
>> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
>> ~[scala-reflect-2.10.5.jar:?]
>> at 
>> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
>> ~[scala-reflect-2.10.5.jar:?]
>> at 
>> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
>> ~[scala-reflect-2.10.5.jar:?]
>> at 
>> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
>> ~[scala-reflect-2.10.5.jar:?]
>> at 
>> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
>> ~[scala-reflect-2.10.5.jar:?]
>> at 
>> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
>> ~[scala-reflect-2.10.5.jar:?]
>> at 
>> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
>> ~[scala-reflect-2.10.5.jar:?]
>> at 
>> scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
>> ~[scala-reflect-2.10.5.jar:?]
>> at 
>> scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
>> ~[scala-reflect-2.10.5.jar:?]
>> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
>> ~[scala-reflect-2.10.5.jar:?]
>> at 
>> org.apache.spark.ml.feature.HashingTF$$typecreator1$1.apply(HashingTF.scala:66)
>> ~[spark-mllib_2.10-1.6.1.jar:1.6.1]
>> at 
>> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
>> ~[scala-reflect-2.10.5.jar:?]
>> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
>> ~[scala-reflect-2.10.5.jar:?]
>> at 
>> org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:654)
>> ~[spark-catalyst_2.10-1.6.1.jar:1.6.1]
>> at 
>> org.apache.spark.sql.catalyst.ScalaReflection$.localTypeOf(ScalaReflection.scala:30)
>> ~[spark-catalyst_2.10-1.6.1.jar:1.6.1]
>> at 
>> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:642)
>> ~[spark-catalyst_2.10-1.6.1.jar:1.6.1]
>> at 
>> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflecti

Re: Strange ML pipeline errors from HashingTF using v1.6.1

2016-03-28 Thread Jacek Laskowski
Hi,

How do you run the pipeline? Do you assembly or package? Is this on
local or spark or other cluster manager? What's the build
configuration?

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Mon, Mar 28, 2016 at 7:11 PM, Timothy Potter <thelabd...@gmail.com> wrote:
> I'm seeing the following error when trying to generate a prediction
> from a very simple ML pipeline based model. I've verified that the raw
> data sent to the tokenizer is valid (not null). It seems like this is
> some sort of weird classpath or class loading type issue. Any help you
> can provide in trying to troubleshoot this further would be
> appreciated.
>
>  Error in machine-learning, docId=20news-18828/alt.atheism/51176
> scala.reflect.internal.Symbols$CyclicReference: illegal cyclic
> reference involving package 
> at scala.reflect.internal.Symbols$TypeSymbol.tpe(Symbols.scala:2768)
> ~[scala-reflect-2.10.5.jar:?]
> at 
> scala.reflect.internal.Mirrors$Roots$RootPackage$.(Mirrors.scala:268)
> ~[scala-reflect-2.10.5.jar:?]
> at 
> scala.reflect.internal.Mirrors$Roots.RootPackage$lzycompute(Mirrors.scala:267)
> ~[scala-reflect-2.10.5.jar:?]
> at scala.reflect.internal.Mirrors$Roots.RootPackage(Mirrors.scala:267)
> ~[scala-reflect-2.10.5.jar:?]
> at 
> scala.reflect.runtime.JavaMirrors$JavaMirror.scala$reflect$runtime$JavaMirrors$$makeScalaPackage(JavaMirrors.scala:902)
> ~[scala-reflect-2.10.5.jar:?]
> at 
> scala.reflect.runtime.JavaMirrors$class.missingHook(JavaMirrors.scala:1299)
> ~[scala-reflect-2.10.5.jar:?]
> at scala.reflect.runtime.JavaUniverse.missingHook(JavaUniverse.scala:12)
> ~[scala-reflect-2.10.5.jar:?]
> at 
> scala.reflect.internal.Mirrors$RootsBase.universeMissingHook(Mirrors.scala:77)
> ~[scala-reflect-2.10.5.jar:?]
> at scala.reflect.internal.Mirrors$RootsBase.missingHook(Mirrors.scala:79)
> ~[scala-reflect-2.10.5.jar:?]
> at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
> ~[scala-reflect-2.10.5.jar:?]
> at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
> ~[scala-reflect-2.10.5.jar:?]
> at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
> ~[scala-reflect-2.10.5.jar:?]
> at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
> ~[scala-reflect-2.10.5.jar:?]
> at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
> ~[scala-reflect-2.10.5.jar:?]
> at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
> ~[scala-reflect-2.10.5.jar:?]
> at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
> ~[scala-reflect-2.10.5.jar:?]
> at 
> scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
> ~[scala-reflect-2.10.5.jar:?]
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
> ~[scala-reflect-2.10.5.jar:?]
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
> ~[scala-reflect-2.10.5.jar:?]
> at 
> org.apache.spark.ml.feature.HashingTF$$typecreator1$1.apply(HashingTF.scala:66)
> ~[spark-mllib_2.10-1.6.1.jar:1.6.1]
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
> ~[scala-reflect-2.10.5.jar:?]
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
> ~[scala-reflect-2.10.5.jar:?]
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:654)
> ~[spark-catalyst_2.10-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$.localTypeOf(ScalaReflection.scala:30)
> ~[spark-catalyst_2.10-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:642)
> ~[spark-catalyst_2.10-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
> ~[spark-catalyst_2.10-1.6.1.jar:1.6.1]
> at org.apache.spark.sql.functions$.udf(functions.scala:2576)
> ~[spark-sql_2.10-1.6.1.jar:1.6.1]
> at org.apache.spark.ml.feature.HashingTF.transform(HashingTF.scala:66)
> ~[spark-mllib_2.10-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.ml.PipelineModel$$anonfun$transform$1.apply(Pipeline.scala:297)
> ~[spark-mllib_2.10-1.6.1.jar:1.6.1]
> at 
> org.apache.spark.ml.PipelineModel$$anonfun$transform$1.apply(Pipeline.scala:297)
> ~[spark-mllib_2.10-1.6.1.jar:1.6.1]
> at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSe

Strange ML pipeline errors from HashingTF using v1.6.1

2016-03-28 Thread Timothy Potter
I'm seeing the following error when trying to generate a prediction
from a very simple ML pipeline based model. I've verified that the raw
data sent to the tokenizer is valid (not null). It seems like this is
some sort of weird classpath or class loading type issue. Any help you
can provide in trying to troubleshoot this further would be
appreciated.

 Error in machine-learning, docId=20news-18828/alt.atheism/51176
scala.reflect.internal.Symbols$CyclicReference: illegal cyclic
reference involving package 
at scala.reflect.internal.Symbols$TypeSymbol.tpe(Symbols.scala:2768)
~[scala-reflect-2.10.5.jar:?]
at 
scala.reflect.internal.Mirrors$Roots$RootPackage$.(Mirrors.scala:268)
~[scala-reflect-2.10.5.jar:?]
at 
scala.reflect.internal.Mirrors$Roots.RootPackage$lzycompute(Mirrors.scala:267)
~[scala-reflect-2.10.5.jar:?]
at scala.reflect.internal.Mirrors$Roots.RootPackage(Mirrors.scala:267)
~[scala-reflect-2.10.5.jar:?]
at 
scala.reflect.runtime.JavaMirrors$JavaMirror.scala$reflect$runtime$JavaMirrors$$makeScalaPackage(JavaMirrors.scala:902)
~[scala-reflect-2.10.5.jar:?]
at 
scala.reflect.runtime.JavaMirrors$class.missingHook(JavaMirrors.scala:1299)
~[scala-reflect-2.10.5.jar:?]
at scala.reflect.runtime.JavaUniverse.missingHook(JavaUniverse.scala:12)
~[scala-reflect-2.10.5.jar:?]
at 
scala.reflect.internal.Mirrors$RootsBase.universeMissingHook(Mirrors.scala:77)
~[scala-reflect-2.10.5.jar:?]
at scala.reflect.internal.Mirrors$RootsBase.missingHook(Mirrors.scala:79)
~[scala-reflect-2.10.5.jar:?]
at 
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
~[scala-reflect-2.10.5.jar:?]
at 
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
~[scala-reflect-2.10.5.jar:?]
at 
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
~[scala-reflect-2.10.5.jar:?]
at 
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
~[scala-reflect-2.10.5.jar:?]
at 
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
~[scala-reflect-2.10.5.jar:?]
at 
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
~[scala-reflect-2.10.5.jar:?]
at 
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
~[scala-reflect-2.10.5.jar:?]
at 
scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
~[scala-reflect-2.10.5.jar:?]
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
~[scala-reflect-2.10.5.jar:?]
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
~[scala-reflect-2.10.5.jar:?]
at 
org.apache.spark.ml.feature.HashingTF$$typecreator1$1.apply(HashingTF.scala:66)
~[spark-mllib_2.10-1.6.1.jar:1.6.1]
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
~[scala-reflect-2.10.5.jar:?]
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
~[scala-reflect-2.10.5.jar:?]
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:654)
~[spark-catalyst_2.10-1.6.1.jar:1.6.1]
at 
org.apache.spark.sql.catalyst.ScalaReflection$.localTypeOf(ScalaReflection.scala:30)
~[spark-catalyst_2.10-1.6.1.jar:1.6.1]
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:642)
~[spark-catalyst_2.10-1.6.1.jar:1.6.1]
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
~[spark-catalyst_2.10-1.6.1.jar:1.6.1]
at org.apache.spark.sql.functions$.udf(functions.scala:2576)
~[spark-sql_2.10-1.6.1.jar:1.6.1]
at org.apache.spark.ml.feature.HashingTF.transform(HashingTF.scala:66)
~[spark-mllib_2.10-1.6.1.jar:1.6.1]
at 
org.apache.spark.ml.PipelineModel$$anonfun$transform$1.apply(Pipeline.scala:297)
~[spark-mllib_2.10-1.6.1.jar:1.6.1]
at 
org.apache.spark.ml.PipelineModel$$anonfun$transform$1.apply(Pipeline.scala:297)
~[spark-mllib_2.10-1.6.1.jar:1.6.1]
at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
~[scala-library-2.10.5.jar:?]
at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
~[scala-library-2.10.5.jar:?]
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
~[scala-library-2.10.5.jar:?]
at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:297)
~[spark-mllib_2.10-1.6.1.jar:1.6.1]
at 
org.apache.spark.ml.tuning.CrossValidatorModel.transform(CrossValidator.scala:338)
~[spark-mllib_2.10-1.6.1.jar:1.6.1]


I've also seen similar errors such as:

java.lang.AssertionError: assertion failed: List(package linalg, package linalg)
at scala.reflect.internal.Symbols$Symbol.suchThat(Symbols.scala:1678)
~[scala-reflect-2.10.5.jar:?]
at 
scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:44)
~[scala-reflect-2.10.5.jar:?]
at 
scala.reflect.internal.Mirrors

Trying to serialize/deserialize Spark ML Pipeline (RandomForest) Spark 1.6

2016-03-13 Thread Mario Lazaro
Hi!

I have a PipelineModel (using RandomForestClassifier) that I am trying to
save locally. I can save it using:

//save locally
val fileOut = new FileOutputStream("file:///home/user/forest.model")
val out  = new ObjectOutputStream(fileOut)
out.writeObject(model)
out.close()
fileOut.close()

Then I deserialize it using:

val fileIn = new FileInputStream("/home/forest.model")
val in  = new ObjectInputStream(fileIn)
val cvModel =
in.readObject().asInstanceOf[org.apache.spark.ml.PipelineModel]
in.close()
fileIn.close()

but when I try to use it:

val predictions2 = cvModel.transform(testingData)

It throws an exception:

java.lang.IllegalArgumentException: Field "browser_index" does not exist.

at
org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:212)
at
org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:212)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) at
scala.collection.AbstractMap.getOrElse(Map.scala:58) at
org.apache.spark.sql.types.StructType.apply(StructType.scala:211) at
org.apache.spark.ml.feature.VectorAssembler$$anonfun$5.apply(VectorAssembler.scala:111)
at
org.apache.spark.ml.feature.VectorAssembler$$anonfun$5.apply(VectorAssembler.scala:111)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at
scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at
org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:111)
at
org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:301)
at
org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:301)
at
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108) at
org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:301) at
org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:68) at
org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:296) at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53) at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:58) at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:60) at
$iwC$$iwC$$iwC$$iwC$$iwC.(:62) at
$iwC$$iwC$$iwC$$iwC.(:64) at
$iwC$$iwC$$iwC.(:66) at $iwC$$iwC.(:68) at
$iwC.(:70) at (:72) at .(:76)
at .() at .(:7) at .() at
$print() at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606) at
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at
org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at
org.apache.zeppelin.spark.SparkInterpreter.interpretInput(SparkInterpreter.java:664)
at
org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:629)
at
org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:622)
at
org.apache.zeppelin.interpreter.ClassloaderInterpreter.interpret(ClassloaderInterpreter.java:57)
at
org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
at
org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:276)
at org.apache.zeppelin.scheduler.Job.run(Job.java:170) at
org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:118)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262) at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


I am not using .save and .load because they do not work in Spark 1.6 for
RandomForest.

Any idea how to do this? any alternatives?

Thanks!
-- 
*Mario Lazaro*  |  Software Engineer, Big Data
*GumGum*   |  *Ads that stick*
310-985-3792 |  ma...@gumgum.com
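
(A slightly cleaned-up sketch of the same Java-serialization round trip; note
that FileOutputStream/FileInputStream take plain filesystem paths, not
"file://" URIs, and the path below is illustrative:)

import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import org.apache.spark.ml.PipelineModel

val path = "/home/user/forest.model" // plain local path, no "file://" prefix

// serialize the fitted PipelineModel
val out = new ObjectOutputStream(new FileOutputStream(path))
out.writeObject(model)
out.close()

// deserialize it again
val in = new ObjectInputStream(new FileInputStream(path))
val cvModel = in.readObject().asInstanceOf[PipelineModel]
in.close()

The "Field "browser_index" does not exist" error above is thrown by
VectorAssembler.transformSchema, so it is worth checking that testingData
(together with the columns the earlier pipeline stages add) really contains
every input column the assembler was configured with.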


Re: Logistic Regression using ML Pipeline

2016-02-19 Thread Ajinkya Kale
Please take a look at the example here
http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline
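
(A minimal sketch, reusing `lr` and `data` from the quoted code below and the
default ml output column names — the fitted model is a Transformer, so the
per-row predictions and probabilities come from calling transform() on it:)

val model = lr.fit(data)

model.transform(data)
  .select("features", "probability", "prediction")
  .show(5)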

On Thu, Feb 18, 2016 at 9:27 PM Arunkumar Pillai <arunkumar1...@gmail.com>
wrote:

> Hi
>
> I'm trying to build logistic regression using ML Pipeline
>
>  val lr = new LogisticRegression()
>
> lr.setFitIntercept(true)
> lr.setMaxIter(100)
> val model = lr.fit(data)
>
> println(model.summary)
>
> I'm getting coefficients but not able to get the predicted and probability
> values.
>
> Please help
>
> --
> Thanks and Regards
> Arun
>


Logistic Regression using ML Pipeline

2016-02-18 Thread Arunkumar Pillai
Hi

I'm trying to build logistic regression using ML Pipeline

 val lr = new LogisticRegression()

lr.setFitIntercept(true)
lr.setMaxIter(100)
val model = lr.fit(data)

println(model.summary)

I'm getting coefficients but not able to get the predicted and probability
values.

Please help

-- 
Thanks and Regards
Arun


Re: AIC in Linear Regression in ml pipeline

2016-01-15 Thread Yanbo Liang
Hi Arunkumar,

It does not support outputting the AIC value for Linear Regression currently.
This feature is under development and will be released in Spark 2.0.

Thanks
Yanbo
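
(For reference, a sketch against the then-upcoming Spark 2.0
GeneralizedLinearRegression API, assuming a DataFrame `data` with "label" and
"features" columns; the AIC is exposed on the training summary:)

import org.apache.spark.ml.regression.GeneralizedLinearRegression

val glr = new GeneralizedLinearRegression()
  .setFamily("gaussian")
  .setLink("identity")

val model = glr.fit(data)
println(model.summary.aic) // the summary also exposes deviance, residuals(), ...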

2016-01-15 17:20 GMT+08:00 Arunkumar Pillai <arunkumar1...@gmail.com>:

> Hi
>
> Is it possible to get the AIC value in Linear Regression using the ml pipeline?
> If so, please help me
>
> --
> Thanks and Regards
> Arun
>


AIC in Linear Regression in ml pipeline

2016-01-15 Thread Arunkumar Pillai
Hi

Is it possible to get the AIC value in Linear Regression using the ml
pipeline? If so, please help me

-- 
Thanks and Regards
Arun


Re: LogisticRegression in ML pipeline help page

2016-01-06 Thread Wen Pei Yu

You can find documentation for older releases under
http://spark.apache.org/documentation.html

And the linear-methods docs for 1.5.2 are here:

http://spark.apache.org/docs/1.5.2/mllib-linear-methods.html#logistic-regression
http://spark.apache.org/docs/1.5.2/ml-linear-methods.html


Regards.
Yu Wenpei.


From:   Arunkumar Pillai <arunkumar1...@gmail.com>
To: user@spark.apache.org
Date:   01/07/2016 12:54 PM
Subject: LogisticRegression in ML pipeline help page



Hi

I need the help page for Logistic Regression in the ML pipeline. When I
browse, I get the 1.6 help; please help me.

--
Thanks and Regards
Arun


LogisticRegression in ML pipeline help page

2016-01-06 Thread Arunkumar Pillai
Hi

I need the help page for Logistic Regression in the ML pipeline. When I
browse, I get the 1.6 help; please help me.

-- 
Thanks and Regards
Arun


Re: GLM in ml pipeline

2016-01-03 Thread Yanbo Liang
AFAIK, Spark MLlib will improve and support most GLM functions in the next
release (Spark 2.0).

2016-01-03 23:02 GMT+08:00 :

> keyStoneML could be an alternative.
>
> Ardo.
>
> On 03 Jan 2016, at 15:50, Arunkumar Pillai 
> wrote:
>
> Is there any road map for glm in pipeline?
>
>


Re: GLM in ml pipeline

2016-01-03 Thread Arunkumar Pillai
Thanks! Eagerly waiting for the next Spark release.

On Mon, Jan 4, 2016 at 7:36 AM, Yanbo Liang  wrote:

> AFAIK, Spark MLlib will improve and support most GLM functions in the next
> release (Spark 2.0).
>
> 2016-01-03 23:02 GMT+08:00 :
>
>> keyStoneML could be an alternative.
>>
>> Ardo.
>>
>> On 03 Jan 2016, at 15:50, Arunkumar Pillai 
>> wrote:
>>
>> Is there any road map for glm in pipeline?
>>
>>
>


-- 
Thanks and Regards
Arun


GLM in ml pipeline

2016-01-03 Thread Arunkumar Pillai
Is there any road map for glm in pipeline?


Re: GLM in ml pipeline

2016-01-03 Thread ndjido
keyStoneML could be an alternative. 

Ardo.

> On 03 Jan 2016, at 15:50, Arunkumar Pillai  wrote:
> 
> Is there any road map for glm in pipeline?


No documentation for how to write custom Transformer in ml pipeline ?

2015-11-30 Thread Jeff Zhang
Although writing a custom UnaryTransformer is not difficult, writing a
non-UnaryTransformer is a little tricky (you have to check the source code).
And I can't find any documentation about how to write a custom Transformer in
an ml pipeline, even though writing a custom Transformer is a very basic
requirement. Is this because the interface is still considered unstable?


-- 
Best Regards

Jeff Zhang
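
(Until such documentation exists, a minimal sketch of a non-Unary Transformer
against the 1.6-era API; the class name, the hard-coded input columns "a" and
"b", and the added "sum" column are purely illustrative:)

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Adds a "sum" column computed from two (assumed numeric) input columns.
class ColumnSum(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("columnSum"))

  override def transform(df: DataFrame): DataFrame =
    df.withColumn("sum", col("a") + col("b"))

  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField("sum", DoubleType, nullable = true))

  override def copy(extra: ParamMap): ColumnSum = defaultCopy(extra)
}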


pyspark ML pipeline with shared data

2015-11-17 Thread Dominik Dahlem
Hi all,

I'm working on a pipeline for collaborative filtering. Taking the movielens
example, I have a data frame with the columns 'userID', 'movieID', and
'rating'. I would like to transform the ratings before calling ALS and
denormalise after. I implemented two transformers to do this, but I'm
wondering whether there is a better way than using a global variable to
hold the rowMeans of the utility matrix to share the normalisation vector
across both transformers? The pipeline gets more complicated if the
normalisation is done over columns and clustering of the userIDs is used
prior to calling ALS. Any help is greatly appreciated.

Thank you,
Dominik


The code snippets are as follows below. I'm using the pyspark.ml APIs:

# global variable
rowMeans = None

# Transformers
class Normaliser(Transformer, HasInputCol, HasOutputCol):
@keyword_only
def __init__(self, inputCol='rating', outputCol='normalised'):
super(Normaliser, self).__init__()
kwargs = self.__init__._input_kwargs
self.setParams(**kwargs)

@keyword_only
def setParams(self, inputCol=None, outputCol=None):
kwargs = self.setParams._input_kwargs
return self._set(**kwargs)

def _transform(self, df):
global rowMeans
rowMeans = df.groupBy('userID') \
 .agg(F.mean(self.inputCol).alias('mean'))
return df.join(rowMeans, 'userID') \
 .select(df['*'], (df[self.inputCol] -
rowMeans['mean']).alias(self.outputCol))


class DeNormaliser(Transformer, HasInputCol, HasOutputCol):
@keyword_only
def __init__(self, inputCol='normalised', outputCol='rating'):
super(DeNormaliser, self).__init__()
kwargs = self.__init__._input_kwargs
self.setParams(**kwargs)

@keyword_only
def setParams(self, inputCol=None, outputCol=None):
kwargs = self.setParams._input_kwargs
return self._set(**kwargs)

def _transform(self, df):
return df.join(rowMeans, 'userID') \
 .select(df['*'], (df[self.getInputCol()] +
rowMeans['mean']) \
 .alias(self.getOutputCol()))


# setting up the ML pipeline
rowNormaliser = Normaliser(inputCol='rating', outputCol='rowNorm')
als = ALS(userCol='userID', itemCol='movieID', ratingCol='rowNorm')
rowDeNormaliser = DeNormaliser(inputCol='prediction',
outputCol='denormPrediction')

pipeline = Pipeline(stages=[rowNormaliser, als, rowDeNormaliser])
evaluator = RegressionEvaluator(predictionCol='denormPrediction',
labelCol='rating')


ML Pipeline

2015-09-28 Thread Yasemin Kaya
Hi,

I am using Spark 1.5 and ML Pipeline. I create the model and then give the
model unlabeled data to find the probabilities and predictions. When I want
to see the results, it returns an error.

//creating model
final PipelineModel model = pipeline.fit(trainingData);

JavaRDD rowRDD1 = unlabeledTest
.map(new Function<Tuple2<String, Vector>, Row>() {

@Override
public Row call(Tuple2<String, Vector> arg0)
throws Exception {
return RowFactory.create(arg0._1(), arg0._2());
}
});
// creating dataframe from row
DataFrame production = sqlContext.createDataFrame(rowRDD1,
new StructType(new StructField[] {
new StructField("id", DataTypes.StringType, false,
Metadata.empty()),
new StructField("features", (new VectorUDT()), false,
Metadata.empty()) }));

DataFrame predictionsProduction = model.transform(production);
*//produces the error*
*predictionsProduction.select("id","features","probability").show(5);*

Here is my code; am I doing something wrong when creating rowRDD1 or
production? The error is: java.util.NoSuchElementException: key not found: 1.0
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:58)
.

How can I solve it ? Thanks.

Have a nice day,
yasemin

-- 
hiç ender hiç


Re: spark 1.5, ML Pipeline Decision Tree Dataframe Problem

2015-09-18 Thread Yasemin Kaya
Thanks, I tried it but I can't make it work.
unlabeledTest is a JavaPairRDD<String, Vector>, and the Vector is a dense
vector. I added import org.apache.spark.sql.SQLContext.implicits$ but there
is no toDF() method; I am using Java, not Scala.

2015-09-18 20:02 GMT+03:00 Feynman Liang <fli...@databricks.com>:

> What is the type of unlabeledTest?
>
> SQL should be using the VectorUDT we've defined for Vectors
> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala#L183>
>  so
> you should be able to just "import sqlContext.implicits._" and then call
> "rdd.toDf()" on your RDD to convert it into a dataframe.
>
> On Fri, Sep 18, 2015 at 7:32 AM, Yasemin Kaya <godo...@gmail.com> wrote:
>
>> Hi,
>>
>> I am using *spark 1.5, ML Pipeline Decision Tree
>> <http://spark.apache.org/docs/latest/ml-decision-tree.html#output-columns>*
>> to get tree's probability. But I have to convert my data to Dataframe type.
>> While creating model there is no problem but when I am using model on my
>> data there is a problem about converting to data frame type. My data type
>> is *JavaPairRDD<String, Vector>* , when I am creating dataframe
>>
>> DataFrame production = sqlContext.createDataFrame(
>> unlabeledTest.values(), Vector.class);
>>
>> *Error says me: *
>> Exception in thread "main" java.lang.ClassCastException:
>> org.apache.spark.mllib.linalg.VectorUDT cannot be cast to
>> org.apache.spark.sql.types.StructType
>>
>> I know if I give LabeledPoint type, there will be no problem. But the
>> data have no label, I wanna predict the label because of this reason I use
>> model on it.
>>
>> Is there way to handle my problem?
>> Thanks.
>>
>>
>> Best,
>> yasemin
>> --
>> hiç ender hiç
>>
>
>


-- 
hiç ender hiç


spark 1.5, ML Pipeline Decision Tree Dataframe Problem

2015-09-18 Thread Yasemin Kaya
Hi,

I am using *spark 1.5, ML Pipeline Decision Tree
<http://spark.apache.org/docs/latest/ml-decision-tree.html#output-columns>*
to get the tree's probabilities. But I have to convert my data to DataFrame
type. There is no problem while creating the model, but when I use the model
on my own data there is a problem converting it to a DataFrame. My data type
is *JavaPairRDD<String, Vector>*, and this is how I create the dataframe:

DataFrame production = sqlContext.createDataFrame(
unlabeledTest.values(), Vector.class);

*The error says: *
Exception in thread "main" java.lang.ClassCastException:
org.apache.spark.mllib.linalg.VectorUDT cannot be cast to
org.apache.spark.sql.types.StructType

I know that if I give it LabeledPoint data there will be no problem. But the
data has no label; I want to predict the label, which is why I use the model
on it.

Is there a way to handle my problem?
Thanks.


Best,
yasemin
-- 
hiç ender hiç


Re: Caching intermediate results in Spark ML pipeline?

2015-09-18 Thread Jingchu Liu
Thanks buddy I'll try it out in my project.

Best,
Lewis

2015-09-16 13:29 GMT+08:00 Feynman Liang <fli...@databricks.com>:

> If you're doing hyperparameter grid search, consider using
> ml.tuning.CrossValidator which does cache the dataset
> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L85>
> .
>
> Otherwise, perhaps you can elaborate more on your particular use case for
> caching intermediate results and if the current API doesn't support it we
> can create a JIRA for it.
>
> On Tue, Sep 15, 2015 at 10:26 PM, Jingchu Liu <liujing...@gmail.com>
> wrote:
>
>> Yeah I understand on the low-level we should do as you said.
>>
>> But since ML pipeline is a high-level API, it is pretty natural to expect
>> the ability to recognize overlapping parameters between successive runs.
>> (Actually, this happen A LOT when we have lots of hyper-params to search
>> for)
>>
>> I can also imagine the implementation by appending parameter information
>> to the cached results. Let's say if we implemented an "equal" method for
>> param1. By comparing param1 with the previous run, the program will know
>> data1 is reusable. And time used for generating data1 can be saved.
>>
>> Best,
>> Lewis
>>
>> 2015-09-15 23:05 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>>
>>> Nope, and that's intentional. There is no guarantee that rawData did not
>>> change between intermediate calls to searchRun, so reusing a cached data1
>>> would be incorrect.
>>>
>>> If you want data1 to be cached between multiple runs, you have a few
>>> options:
>>> * cache it first and pass it in as an argument to searchRun
>>> * use a creational pattern like singleton to ensure only one
>>> instantiation
>>>
>>> On Tue, Sep 15, 2015 at 12:49 AM, Jingchu Liu <liujing...@gmail.com>
>>> wrote:
>>>
>>>> Hey Feynman,
>>>>
>>>> I doubt DF persistence will work in my case. Let's use the following
>>>> code:
>>>> ==
>>>> def searchRun( params = [param1, param2] )
>>>>   data1 = hashing1.transform(rawData, param1)
>>>>   data1.cache()
>>>>   data2 = hashing2.transform(data1, param2)
>>>>   data2.someAction()
>>>> ==
>>>> Say if we run "searchRun()" for 2 times with the same "param1" but
>>>> different "param2". Will spark recognize that the two local variables
>>>> "data1" in consecutive runs has the same content?
>>>>
>>>>
>>>> Best,
>>>> Lewis
>>>>
>>>> 2015-09-15 13:58 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>>>>
>>>>> You can persist the transformed Dataframes, for example
>>>>>
>>>>> val data : DF = ...
>>>>> val hashedData = hashingTF.transform(data)
>>>>> hashedData.cache() // to cache DataFrame in memory
>>>>>
>>>>> Future usage of hashedData read from an in-memory cache now.
>>>>>
>>>>> You can also persist to disk, eg:
>>>>>
>>>>> hashedData.write.parquet(FilePath) // to write DataFrame in Parquet
>>>>> format to disk
>>>>> ...
>>>>> val savedHashedData = sqlContext.read.parquet(FilePath)
>>>>>
>>>>> Future uses of hash
>>>>>
>>>>> Like my earlier response, this will still require you call each
>>>>> PipelineStage's `transform` method (i.e. to NOT use the overall
>>>>> Pipeline.setStages API)
>>>>>
>>>>> On Mon, Sep 14, 2015 at 10:45 PM, Jingchu Liu <liujing...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hey Feynman,
>>>>>>
>>>>>> Thanks for your response, but I'm afraid "model save/load" is not
>>>>>> exactly the feature I'm looking for.
>>>>>>
>>>>>> What I need to cache and reuse are the intermediate outputs of
>>>>>> transformations, not transformer themselves. Do you know any related dev.
>>>>>> activities or plans?
>>>>>>
>>>>>> Best,
>>>>>> Lewis
>>>>>>
>>>>>> 2015-09-15 13:03 GMT+08:00 Feyn

Re: spark 1.5, ML Pipeline Decision Tree Dataframe Problem

2015-09-18 Thread Feynman Liang
What is the type of unlabeledTest?

SQL should be using the VectorUDT we've defined for Vectors
<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala#L183>
so
you should be able to just "import sqlContext.implicits._" and then call
"rdd.toDf()" on your RDD to convert it into a dataframe.

On Fri, Sep 18, 2015 at 7:32 AM, Yasemin Kaya <godo...@gmail.com> wrote:

> Hi,
>
> I am using *spark 1.5, ML Pipeline Decision Tree
> <http://spark.apache.org/docs/latest/ml-decision-tree.html#output-columns>*
> to get tree's probability. But I have to convert my data to Dataframe type.
> While creating model there is no problem but when I am using model on my
> data there is a problem about converting to data frame type. My data type
> is *JavaPairRDD<String, Vector>* , when I am creating dataframe
>
> DataFrame production = sqlContext.createDataFrame(
> unlabeledTest.values(), Vector.class);
>
> *Error says me: *
> Exception in thread "main" java.lang.ClassCastException:
> org.apache.spark.mllib.linalg.VectorUDT cannot be cast to
> org.apache.spark.sql.types.StructType
>
> I know if I give LabeledPoint type, there will be no problem. But the data
> have no label, I wanna predict the label because of this reason I use model
> on it.
>
> Is there way to handle my problem?
> Thanks.
>
>
> Best,
> yasemin
> --
> hiç ender hiç
>


Re: Caching intermediate results in Spark ML pipeline?

2015-09-15 Thread Feynman Liang
If you're doing hyperparameter grid search, consider using
ml.tuning.CrossValidator which does cache the dataset
<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L85>
.

Otherwise, perhaps you can elaborate more on your particular use case for
caching intermediate results and if the current API doesn't support it we
can create a JIRA for it.

On Tue, Sep 15, 2015 at 10:26 PM, Jingchu Liu <liujing...@gmail.com> wrote:

> Yeah I understand on the low-level we should do as you said.
>
> But since ML pipeline is a high-level API, it is pretty natural to expect
> the ability to recognize overlapping parameters between successive runs.
> (Actually, this happen A LOT when we have lots of hyper-params to search
> for)
>
> I can also imagine the implementation by appending parameter information
> to the cached results. Let's say if we implemented an "equal" method for
> param1. By comparing param1 with the previous run, the program will know
> data1 is reusable. And time used for generating data1 can be saved.
>
> Best,
> Lewis
>
> 2015-09-15 23:05 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>
>> Nope, and that's intentional. There is no guarantee that rawData did not
>> change between intermediate calls to searchRun, so reusing a cached data1
>> would be incorrect.
>>
>> If you want data1 to be cached between multiple runs, you have a few
>> options:
>> * cache it first and pass it in as an argument to searchRun
>> * use a creational pattern like singleton to ensure only one instantiation
>>
>> On Tue, Sep 15, 2015 at 12:49 AM, Jingchu Liu <liujing...@gmail.com>
>> wrote:
>>
>>> Hey Feynman,
>>>
>>> I doubt DF persistence will work in my case. Let's use the following
>>> code:
>>> ==
>>> def searchRun( params = [param1, param2] )
>>>   data1 = hashing1.transform(rawData, param1)
>>>   data1.cache()
>>>   data2 = hashing2.transform(data1, param2)
>>>   data2.someAction()
>>> ==
>>> Say if we run "searchRun()" for 2 times with the same "param1" but
>>> different "param2". Will spark recognize that the two local variables
>>> "data1" in consecutive runs has the same content?
>>>
>>>
>>> Best,
>>> Lewis
>>>
>>> 2015-09-15 13:58 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>>>
>>>> You can persist the transformed Dataframes, for example
>>>>
>>>> val data : DF = ...
>>>> val hashedData = hashingTF.transform(data)
>>>> hashedData.cache() // to cache DataFrame in memory
>>>>
>>>> Future usage of hashedData read from an in-memory cache now.
>>>>
>>>> You can also persist to disk, eg:
>>>>
>>>> hashedData.write.parquet(FilePath) // to write DataFrame in Parquet
>>>> format to disk
>>>> ...
>>>> val savedHashedData = sqlContext.read.parquet(FilePath)
>>>>
>>>> Future uses of hash
>>>>
>>>> Like my earlier response, this will still require you call each
>>>> PipelineStage's `transform` method (i.e. to NOT use the overall
>>>> Pipeline.setStages API)
>>>>
>>>> On Mon, Sep 14, 2015 at 10:45 PM, Jingchu Liu <liujing...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hey Feynman,
>>>>>
>>>>> Thanks for your response, but I'm afraid "model save/load" is not
>>>>> exactly the feature I'm looking for.
>>>>>
>>>>> What I need to cache and reuse are the intermediate outputs of
>>>>> transformations, not transformer themselves. Do you know any related dev.
>>>>> activities or plans?
>>>>>
>>>>> Best,
>>>>> Lewis
>>>>>
>>>>> 2015-09-15 13:03 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>>>>>
>>>>>> Lewis,
>>>>>>
>>>>>> Many pipeline stages implement save/load methods, which can be used
>>>>>> if you instantiate and call the underlying pipeline stages `transform`
>>>>>> methods individually (instead of using the Pipeline.setStages API). See
>>>>>> associated JIRAs <https://issues.apache.org/jira/browse/SPARK-4587>.
>>>>>>
>>>>>> 

Re: Caching intermediate results in Spark ML pipeline?

2015-09-15 Thread Jingchu Liu
Yeah, I understand that at the low level we should do as you said.

But since the ML pipeline is a high-level API, it is pretty natural to expect
the ability to recognize overlapping parameters between successive runs.
(Actually, this happens a lot when we have lots of hyper-params to search
for.)

I can also imagine an implementation that appends parameter information to
the cached results. Let's say we implemented an "equal" method for
param1. By comparing param1 with the previous run, the program will know
that data1 is reusable, and the time used for generating data1 can be saved.

Best,
Lewis

2015-09-15 23:05 GMT+08:00 Feynman Liang <fli...@databricks.com>:

> Nope, and that's intentional. There is no guarantee that rawData did not
> change between intermediate calls to searchRun, so reusing a cached data1
> would be incorrect.
>
> If you want data1 to be cached between multiple runs, you have a few
> options:
> * cache it first and pass it in as an argument to searchRun
> * use a creational pattern like singleton to ensure only one instantiation
>
> On Tue, Sep 15, 2015 at 12:49 AM, Jingchu Liu <liujing...@gmail.com>
> wrote:
>
>> Hey Feynman,
>>
>> I doubt DF persistence will work in my case. Let's use the following code:
>> ==
>> def searchRun( params = [param1, param2] )
>>   data1 = hashing1.transform(rawData, param1)
>>   data1.cache()
>>   data2 = hashing2.transform(data1, param2)
>>   data2.someAction()
>> ==
>> Say if we run "searchRun()" for 2 times with the same "param1" but
>> different "param2". Will spark recognize that the two local variables
>> "data1" in consecutive runs has the same content?
>>
>>
>> Best,
>> Lewis
>>
>> 2015-09-15 13:58 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>>
>>> You can persist the transformed Dataframes, for example
>>>
>>> val data : DF = ...
>>> val hashedData = hashingTF.transform(data)
>>> hashedData.cache() // to cache DataFrame in memory
>>>
>>> Future usage of hashedData read from an in-memory cache now.
>>>
>>> You can also persist to disk, eg:
>>>
>>> hashedData.write.parquet(FilePath) // to write DataFrame in Parquet
>>> format to disk
>>> ...
>>> val savedHashedData = sqlContext.read.parquet(FilePath)
>>>
>>> Future uses of hash
>>>
>>> Like my earlier response, this will still require you call each
>>> PipelineStage's `transform` method (i.e. to NOT use the overall
>>> Pipeline.setStages API)
>>>
>>> On Mon, Sep 14, 2015 at 10:45 PM, Jingchu Liu <liujing...@gmail.com>
>>> wrote:
>>>
>>>> Hey Feynman,
>>>>
>>>> Thanks for your response, but I'm afraid "model save/load" is not
>>>> exactly the feature I'm looking for.
>>>>
>>>> What I need to cache and reuse are the intermediate outputs of
>>>> transformations, not transformer themselves. Do you know any related dev.
>>>> activities or plans?
>>>>
>>>> Best,
>>>> Lewis
>>>>
>>>> 2015-09-15 13:03 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>>>>
>>>>> Lewis,
>>>>>
>>>>> Many pipeline stages implement save/load methods, which can be used if
>>>>> you instantiate and call the underlying pipeline stages `transform` 
>>>>> methods
>>>>> individually (instead of using the Pipeline.setStages API). See associated
>>>>> JIRAs <https://issues.apache.org/jira/browse/SPARK-4587>.
>>>>>
>>>>> Pipeline persistence is on the 1.6 roadmap, JIRA here
>>>>> <https://issues.apache.org/jira/browse/SPARK-6725>.
>>>>>
>>>>> Feynman
>>>>>
>>>>> On Mon, Sep 14, 2015 at 9:20 PM, Jingchu Liu <liujing...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I have a question regarding the ability of ML pipeline to cache
>>>>>> intermediate results. I've posted this question on stackoverflow
>>>>>> <http://stackoverflow.com/questions/32561687/caching-intermediate-results-in-spark-ml-pipeline>
>>>>>> but got no answer, hope someone here can help me out.
>>>>>>
>>>>>> ===
>>>>>> Late

Re: Caching intermediate results in Spark ML pipeline?

2015-09-15 Thread Feynman Liang
Nope, and that's intentional. There is no guarantee that rawData did not
change between intermediate calls to searchRun, so reusing a cached data1
would be incorrect.

If you want data1 to be cached between multiple runs, you have a few
options:
* cache it first and pass it in as an argument to searchRun
* use a creational pattern like singleton to ensure only one instantiation
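
(A minimal sketch of the first option, following the pseudocode in the quoted
message below; hashing1/hashing2 are assumed ml Transformers, param1 a fixed
ParamMap, and param2Grid a hypothetical sequence of ParamMaps:)

// Cache the expensive intermediate once, outside the search loop, and only
// re-run the stages whose parameters actually change.
val data1 = hashing1.transform(rawData, param1)
data1.cache()

for (param2 <- param2Grid) {
  val data2 = hashing2.transform(data1, param2)
  data2.count() // stand-in for someAction()
}

data1.unpersist()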

On Tue, Sep 15, 2015 at 12:49 AM, Jingchu Liu <liujing...@gmail.com> wrote:

> Hey Feynman,
>
> I doubt DF persistence will work in my case. Let's use the following code:
> ==
> def searchRun( params = [param1, param2] )
>   data1 = hashing1.transform(rawData, param1)
>   data1.cache()
>   data2 = hashing2.transform(data1, param2)
>   data2.someAction()
> ==
> Say if we run "searchRun()" for 2 times with the same "param1" but
> different "param2". Will spark recognize that the two local variables
> "data1" in consecutive runs has the same content?
>
>
> Best,
> Lewis
>
> 2015-09-15 13:58 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>
>> You can persist the transformed Dataframes, for example
>>
>> val data : DF = ...
>> val hashedData = hashingTF.transform(data)
>> hashedData.cache() // to cache DataFrame in memory
>>
>> Future usage of hashedData read from an in-memory cache now.
>>
>> You can also persist to disk, eg:
>>
>> hashedData.write.parquet(FilePath) // to write DataFrame in Parquet
>> format to disk
>> ...
>> val savedHashedData = sqlContext.read.parquet(FilePath)
>>
>> Future uses of hash
>>
>> Like my earlier response, this will still require you call each
>> PipelineStage's `transform` method (i.e. to NOT use the overall
>> Pipeline.setStages API)
>>
>> On Mon, Sep 14, 2015 at 10:45 PM, Jingchu Liu <liujing...@gmail.com>
>> wrote:
>>
>>> Hey Feynman,
>>>
>>> Thanks for your response, but I'm afraid "model save/load" is not
>>> exactly the feature I'm looking for.
>>>
>>> What I need to cache and reuse are the intermediate outputs of
>>> transformations, not transformer themselves. Do you know any related dev.
>>> activities or plans?
>>>
>>> Best,
>>> Lewis
>>>
>>> 2015-09-15 13:03 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>>>
>>>> Lewis,
>>>>
>>>> Many pipeline stages implement save/load methods, which can be used if
>>>> you instantiate and call the underlying pipeline stages `transform` methods
>>>> individually (instead of using the Pipeline.setStages API). See associated
>>>> JIRAs <https://issues.apache.org/jira/browse/SPARK-4587>.
>>>>
>>>> Pipeline persistence is on the 1.6 roadmap, JIRA here
>>>> <https://issues.apache.org/jira/browse/SPARK-6725>.
>>>>
>>>> Feynman
>>>>
>>>> On Mon, Sep 14, 2015 at 9:20 PM, Jingchu Liu <liujing...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a question regarding the ability of ML pipeline to cache
>>>>> intermediate results. I've posted this question on stackoverflow
>>>>> <http://stackoverflow.com/questions/32561687/caching-intermediate-results-in-spark-ml-pipeline>
>>>>> but got no answer, hope someone here can help me out.
>>>>>
>>>>> ===
>>>>> Lately I'm planning to migrate my standalone python ML code to spark.
>>>>> The ML pipeline in spark.ml turns out quite handy, with streamlined
>>>>> API for chaining up algorithm stages and hyper-parameter grid search.
>>>>>
>>>>> Still, I found its support for one important feature obscure in
>>>>> existing documents: caching of intermediate results. The importance of 
>>>>> this
>>>>> feature arise when the pipeline involves computation intensive stages.
>>>>>
>>>>> For example, in my case I use a huge sparse matrix to perform multiple
>>>>> moving averages on time series data in order to form input features. The
>>>>> structure of the matrix is determined by some hyper-parameter. This step
>>>>> turns out to be a bottleneck for the entire pipeline because I have to
>>>>> construct the matrix in runtime.
>>>>>
>>>>> During parameter search, I usually have other parameters to examine in
>>>>> addition to this "structure parameter". So if I can reuse the huge matrix
>>>>> when the "structure parameter" is unchanged, I can save tons of time. For
>>>>> this reason, I intentionally formed my code to cache and reuse these
>>>>> intermediate results.
>>>>>
>>>>> So my question is: can Spark's ML pipeline handle intermediate caching
>>>>> automatically? Or do I have to manually form code to do so? If so, is 
>>>>> there
>>>>> any best practice to learn from?
>>>>>
>>>>> P.S. I have looked into the official document and some other material,
>>>>> but none of them seems to discuss this topic.
>>>>>
>>>>>
>>>>>
>>>>> Best,
>>>>> Lewis
>>>>>
>>>>
>>>>
>>>
>>
>


Caching intermediate results in Spark ML pipeline?

2015-09-14 Thread Jingchu Liu
Hi all,

I have a question regarding the ability of ML pipeline to cache
intermediate results. I've posted this question on stackoverflow
<http://stackoverflow.com/questions/32561687/caching-intermediate-results-in-spark-ml-pipeline>
but got no answer, hope someone here can help me out.

===
Lately I'm planning to migrate my standalone Python ML code to Spark. The
ML pipeline in spark.ml turns out to be quite handy, with a streamlined API
for chaining up algorithm stages and hyper-parameter grid search.

Still, I found its support for one important feature obscure in the existing
documents: caching of intermediate results. The importance of this feature
arises when the pipeline involves computation-intensive stages.

For example, in my case I use a huge sparse matrix to perform multiple
moving averages on time series data in order to form input features. The
structure of the matrix is determined by some hyper-parameter. This step
turns out to be a bottleneck for the entire pipeline because I have to
construct the matrix in runtime.

During parameter search, I usually have other parameters to examine in
addition to this "structure parameter". So if I can reuse the huge matrix
when the "structure parameter" is unchanged, I can save tons of time. For
this reason, I intentionally formed my code to cache and reuse these
intermediate results.

So my question is: can Spark's ML pipeline handle intermediate caching
automatically? Or do I have to manually form code to do so? If so, is there
any best practice to learn from?

P.S. I have looked into the official document and some other material, but
none of them seems to discuss this topic.



Best,
Lewis


Re: Caching intermediate results in Spark ML pipeline?

2015-09-14 Thread Feynman Liang
Lewis,

Many pipeline stages implement save/load methods, which can be used if you
instantiate and call the underlying pipeline stages `transform` methods
individually (instead of using the Pipeline.setStages API). See associated
JIRAs <https://issues.apache.org/jira/browse/SPARK-4587>.

Pipeline persistence is on the 1.6 roadmap, JIRA here
<https://issues.apache.org/jira/browse/SPARK-6725>.

Feynman

On Mon, Sep 14, 2015 at 9:20 PM, Jingchu Liu <liujing...@gmail.com> wrote:

> Hi all,
>
> I have a question regarding the ability of ML pipeline to cache
> intermediate results. I've posted this question on stackoverflow
> <http://stackoverflow.com/questions/32561687/caching-intermediate-results-in-spark-ml-pipeline>
> but got no answer, hope someone here can help me out.
>
> ===
> Lately I'm planning to migrate my standalone python ML code to spark. The
> ML pipeline in spark.ml turns out quite handy, with streamlined API for
> chaining up algorithm stages and hyper-parameter grid search.
>
> Still, I found its support for one important feature obscure in existing
> documents: caching of intermediate results. The importance of this feature
> arise when the pipeline involves computation intensive stages.
>
> For example, in my case I use a huge sparse matrix to perform multiple
> moving averages on time series data in order to form input features. The
> structure of the matrix is determined by some hyper-parameter. This step
> turns out to be a bottleneck for the entire pipeline because I have to
> construct the matrix in runtime.
>
> During parameter search, I usually have other parameters to examine in
> addition to this "structure parameter". So if I can reuse the huge matrix
> when the "structure parameter" is unchanged, I can save tons of time. For
> this reason, I intentionally formed my code to cache and reuse these
> intermediate results.
>
> So my question is: can Spark's ML pipeline handle intermediate caching
> automatically? Or do I have to manually form code to do so? If so, is there
> any best practice to learn from?
>
> P.S. I have looked into the official document and some other material, but
> none of them seems to discuss this topic.
>
>
>
> Best,
> Lewis
>


Re: Caching intermediate results in Spark ML pipeline?

2015-09-14 Thread Feynman Liang
You can persist the transformed Dataframes, for example

val data : DF = ...
val hashedData = hashingTF.transform(data)
hashedData.cache() // to cache DataFrame in memory

Future usage of hashedData read from an in-memory cache now.

You can also persist to disk, eg:

hashedData.write.parquet(FilePath) // to write DataFrame in Parquet format
to disk
...
val savedHashedData = sqlContext.read.parquet(FilePath)

Future uses of hash

Like my earlier response, this will still require you call each
PipelineStage's `transform` method (i.e. to NOT use the overall
Pipeline.setStages API)

On Mon, Sep 14, 2015 at 10:45 PM, Jingchu Liu <liujing...@gmail.com> wrote:

> Hey Feynman,
>
> Thanks for your response, but I'm afraid "model save/load" is not exactly
> the feature I'm looking for.
>
> What I need to cache and reuse are the intermediate outputs of
> transformations, not transformer themselves. Do you know any related dev.
> activities or plans?
>
> Best,
> Lewis
>
> 2015-09-15 13:03 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>
>> Lewis,
>>
>> Many pipeline stages implement save/load methods, which can be used if
>> you instantiate and call the underlying pipeline stages `transform` methods
>> individually (instead of using the Pipeline.setStages API). See associated
>> JIRAs <https://issues.apache.org/jira/browse/SPARK-4587>.
>>
>> Pipeline persistence is on the 1.6 roadmap, JIRA here
>> <https://issues.apache.org/jira/browse/SPARK-6725>.
>>
>> Feynman
>>
>> On Mon, Sep 14, 2015 at 9:20 PM, Jingchu Liu <liujing...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I have a question regarding the ability of ML pipeline to cache
>>> intermediate results. I've posted this question on stackoverflow
>>> <http://stackoverflow.com/questions/32561687/caching-intermediate-results-in-spark-ml-pipeline>
>>> but got no answer, hope someone here can help me out.
>>>
>>> ===
>>> Lately I'm planning to migrate my standalone python ML code to spark.
>>> The ML pipeline in spark.ml turns out quite handy, with streamlined API
>>> for chaining up algorithm stages and hyper-parameter grid search.
>>>
>>> Still, I found its support for one important feature obscure in existing
>>> documents: caching of intermediate results. The importance of this feature
>>> arise when the pipeline involves computation intensive stages.
>>>
>>> For example, in my case I use a huge sparse matrix to perform multiple
>>> moving averages on time series data in order to form input features. The
>>> structure of the matrix is determined by some hyper-parameter. This step
>>> turns out to be a bottleneck for the entire pipeline because I have to
>>> construct the matrix in runtime.
>>>
>>> During parameter search, I usually have other parameters to examine in
>>> addition to this "structure parameter". So if I can reuse the huge matrix
>>> when the "structure parameter" is unchanged, I can save tons of time. For
>>> this reason, I intentionally formed my code to cache and reuse these
>>> intermediate results.
>>>
>>> So my question is: can Spark's ML pipeline handle intermediate caching
>>> automatically? Or do I have to manually form code to do so? If so, is there
>>> any best practice to learn from?
>>>
>>> P.S. I have looked into the official document and some other material,
>>> but none of them seems to discuss this topic.
>>>
>>>
>>>
>>> Best,
>>> Lewis
>>>
>>
>>
>


Re: Caching intermediate results in Spark ML pipeline?

2015-09-14 Thread Jingchu Liu
Hey Feynman,

Thanks for your response, but I'm afraid "model save/load" is not exactly
the feature I'm looking for.

What I need to cache and reuse are the intermediate outputs of
transformations, not transformer themselves. Do you know any related dev.
activities or plans?

Best,
Lewis

2015-09-15 13:03 GMT+08:00 Feynman Liang <fli...@databricks.com>:

> Lewis,
>
> Many pipeline stages implement save/load methods, which can be used if you
> instantiate and call the underlying pipeline stages `transform` methods
> individually (instead of using the Pipeline.setStages API). See associated
> JIRAs <https://issues.apache.org/jira/browse/SPARK-4587>.
>
> Pipeline persistence is on the 1.6 roadmap, JIRA here
> <https://issues.apache.org/jira/browse/SPARK-6725>.
>
> Feynman
>
> On Mon, Sep 14, 2015 at 9:20 PM, Jingchu Liu <liujing...@gmail.com> wrote:
>
>> Hi all,
>>
>> I have a question regarding the ability of ML pipeline to cache
>> intermediate results. I've posted this question on stackoverflow
>> <http://stackoverflow.com/questions/32561687/caching-intermediate-results-in-spark-ml-pipeline>
>> but got no answer, hope someone here can help me out.
>>
>> ===
>> Lately I'm planning to migrate my standalone python ML code to spark. The
>> ML pipeline in spark.ml turns out quite handy, with streamlined API for
>> chaining up algorithm stages and hyper-parameter grid search.
>>
>> Still, I found its support for one important feature obscure in existing
>> documents: caching of intermediate results. The importance of this feature
>> arises when the pipeline involves computation-intensive stages.
>>
>> For example, in my case I use a huge sparse matrix to perform multiple
>> moving averages on time series data in order to form input features. The
>> structure of the matrix is determined by some hyper-parameter. This step
>> turns out to be a bottleneck for the entire pipeline because I have to
>> construct the matrix at runtime.
>>
>> During parameter search, I usually have other parameters to examine in
>> addition to this "structure parameter". So if I can reuse the huge matrix
>> when the "structure parameter" is unchanged, I can save tons of time. For
>> this reason, I intentionally formed my code to cache and reuse these
>> intermediate results.
>>
>> So my question is: can Spark's ML pipeline handle intermediate caching
>> automatically? Or do I have to manually form code to do so? If so, is there
>> any best practice to learn from?
>>
>> P.S. I have looked into the official document and some other material,
>> but none of them seems to discuss this topic.
>>
>>
>>
>> Best,
>> Lewis
>>
>
>


Re: Random Forest and StringIndexer in pyspark ML Pipeline

2015-08-21 Thread Yanbo Liang
ML plans to provide a Machine Learning pipeline API so that users can do machine
learning more efficiently.
It's more general to have StringIndexer chain with any kind of Estimator.
I think we can make the StringIndexer and the reverse process automatic in the
future.
If you want to recover your original labels, you can use IndexToString.
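
A small sketch of that round trip, shown in Scala (it assumes a DataFrame df with a string "label" column and a downstream model whose output, here called predictions, has a "prediction" column):

  import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

  // Indices are assigned by label frequency: the most frequent label gets index 0.0.
  val indexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel")
  val indexerModel = indexer.fit(df)
  val indexed = indexerModel.transform(df)

  // ... train on "indexedLabel", producing a DataFrame `predictions` with a "prediction" column ...

  // Map predicted indices back to the original label values.
  val converter = new IndexToString()
    .setInputCol("prediction")
    .setOutputCol("predictedLabel")
    .setLabels(indexerModel.labels)
  val restored = converter.transform(predictions)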

2015-08-11 6:56 GMT+08:00 pkphlam pkph...@gmail.com:

 Hi,

 If I understand the RandomForest model in the ML Pipeline implementation in
 the ml package correctly, I have to first run my outcome label variable
 through the StringIndexer, even if my labels are numeric. The StringIndexer
 then converts the labels into numeric indices based on frequency of the
 label.

 This could create situations where, if I'm classifying binary outcomes and
 my original labels are simply 0 and 1, the StringIndexer may actually flip
 my labels such that 0s become 1s and 1s become 0s if my original 1s were
 more frequent. This transformation would then extend itself to the
 predictions. In the old mllib implementation, the RF does not require the
 labels to be changed and I could use 0/1 labels without worrying about them
 being transformed.

 I was wondering:
 1. Why is this the default implementation for the Pipeline RF? This seems
 like it could cause a lot of confusion in cases like the one I outlined
 above.
 2. Is there a way to avoid this by either controlling how the indices are
 created in StringIndexer or bypassing StringIndexer altogether?
 3. If 2 is not possible, is there an easy way to see how my original labels
 mapped onto the indices so that I can revert the predictions back to the
 original labels rather than the transformed labels? I suppose I could do
 this by counting the original labels and mapping by frequency, but it seems
 like there should be a more straightforward way to get it out of the
 StringIndexer.

 Thanks!



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Random-Forest-and-StringIndexer-in-pyspark-ML-Pipeline-tp24200.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Random Forest and StringIndexer in pyspark ML Pipeline

2015-08-10 Thread pkphlam
Hi,

If I understand the RandomForest model in the ML Pipeline implementation in
the ml package correctly, I have to first run my outcome label variable
through the StringIndexer, even if my labels are numeric. The StringIndexer
then converts the labels into numeric indices based on frequency of the
label. 

This could create situations where, if I'm classifying binary outcomes and
my original labels are simply 0 and 1, the StringIndexer may actually flip
my labels such that 0s become 1s and 1s become 0s if my original 1s were
more frequent. This transformation would then extend itself to the
predictions. In the old mllib implementation, the RF does not require the
labels to be changed and I could use 0/1 labels without worrying about them
being transformed.

I was wondering:
1. Why is this the default implementation for the Pipeline RF? This seems
like it could cause a lot of confusion in cases like the one I outlined
above.
2. Is there a way to avoid this by either controlling how the indices are
created in StringIndexer or bypassing StringIndexer altogether?
3. If 2 is not possible, is there an easy way to see how my original labels
mapped onto the indices so that I can revert the predictions back to the
original labels rather than the transformed labels? I suppose I could do
this by counting the original labels and mapping by frequency, but it seems
like there should be a more straightforward way to get it out of the
StringIndexer.

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Random-Forest-and-StringIndexer-in-pyspark-ML-Pipeline-tp24200.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to implement an Evaluator for a ML pipeline?

2015-05-20 Thread Stefan H.

Thanks, Xiangrui, for clarifying the metric and creating that JIRA issue.

I made an error while composing my earlier mail: 
paramMap.get(als.regParam) in my Evaluator actually returns None. I 
just happened to use getOrElse(1.0) in my tests, which explains why 
negating the metric did not change anything.


-Stefan

PS: I got an error while sending my previous mail via the web interface, 
and did not think it got through to the list. So I did not follow up on 
my problem myself. Sorry for the confusion.



On 19.05.2015 at 21:54, Xiangrui Meng wrote:

The documentation needs to be updated to state that higher metric
values are better (https://issues.apache.org/jira/browse/SPARK-7740).
I don't know why if you negate the return value of the Evaluator you
still get the highest regularization parameter candidate. Maybe you
should check the log messages from CrossValidator and see the average
metric values during cross validation. -Xiangrui

On Sat, May 9, 2015 at 12:15 PM, Stefan H. twel...@gmx.de wrote:

Hello everyone,

I am stuck with the (experimental, I think) API for machine learning
pipelines. I have a pipeline with just one estimator (ALS) and I want it to
try different values for the regularization parameter. Therefore I need to
supply an Evaluator that returns a value of type Double. I guess this could
be something like accuracy or mean squared error? The only implementation I
found is BinaryClassificationEvaluator, and I did not understand the
computation there.

I could not find detailed documentation so I implemented a dummy Evaluator
that just returns the regularization parameter:

   new Evaluator {
 def evaluate(dataset: DataFrame, paramMap: ParamMap): Double =
   paramMap.get(als.regParam).getOrElse(throw new Exception)
   }

I just wanted to see whether the lower or higher value wins. On the
resulting model I inspected the chosen regularization parameter this way:

   cvModel.bestModel.fittingParamMap.get(als.regParam)

And it was the highest of my three regularization parameter candidates.
Strange thing is, if I negate the return value of the Evaluator, that line
still returns the highest regularization parameter candidate.

So I am probably working with false assumptions. I'd be grateful if someone
could point me to some documentation or examples, or has a few hints to
share.

Cheers,
Stefan



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-implement-an-Evaluator-for-a-ML-pipeline-tp22830.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to implement an Evaluator for a ML pipeline?

2015-05-19 Thread Xiangrui Meng
The documentation needs to be updated to state that higher metric
values are better (https://issues.apache.org/jira/browse/SPARK-7740).
I don't know why if you negate the return value of the Evaluator you
still get the highest regularization parameter candidate. Maybe you
should check the log messages from CrossValidator and see the average
metric values during cross validation. -Xiangrui
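
For a concrete starting point, a rough sketch of an Evaluator that returns negated RMSE, so that larger values are better. It is written against the same Spark 1.3-era evaluate(dataset, paramMap) signature as the dummy evaluator quoted below, and it assumes the predictions DataFrame has "rating" and "prediction" columns (package locations changed in later releases):

  import org.apache.spark.ml.Evaluator
  import org.apache.spark.ml.param.ParamMap
  import org.apache.spark.sql.DataFrame

  val rmseEvaluator = new Evaluator {
    def evaluate(dataset: DataFrame, paramMap: ParamMap): Double = {
      // Cast up front so getDouble below is safe even if the columns are Float.
      val mse = dataset
        .selectExpr("CAST(rating AS DOUBLE)", "CAST(prediction AS DOUBLE)")
        .map { row =>
          val err = row.getDouble(0) - row.getDouble(1)
          err * err
        }
        .mean()
      -math.sqrt(mse)  // negated RMSE: larger is better
    }
  }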

On Sat, May 9, 2015 at 12:15 PM, Stefan H. twel...@gmx.de wrote:
 Hello everyone,

 I am stuck with the (experimental, I think) API for machine learning
 pipelines. I have a pipeline with just one estimator (ALS) and I want it to
 try different values for the regularization parameter. Therefore I need to
 supply an Evaluator that returns a value of type Double. I guess this could
 be something like accuracy or mean squared error? The only implementation I
 found is BinaryClassificationEvaluator, and I did not understand the
 computation there.

 I could not find detailed documentation so I implemented a dummy Evaluator
 that just returns the regularization parameter:

   new Evaluator {
 def evaluate(dataset: DataFrame, paramMap: ParamMap): Double =
   paramMap.get(als.regParam).getOrElse(throw new Exception)
   }

 I just wanted to see whether the lower or higher value wins. On the
 resulting model I inspected the chosen regularization parameter this way:

   cvModel.bestModel.fittingParamMap.get(als.regParam)

 And it was the highest of my three regularization parameter candidates.
 Strange thing is, if I negate the return value of the Evaluator, that line
 still returns the highest regularization parameter candidate.

 So I am probably working with false assumptions. I'd be grateful if someone
 could point me to some documentation or examples, or has a few hints to
 share.

 Cheers,
 Stefan



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-implement-an-Evaluator-for-a-ML-pipeline-tp22830.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to implement an Evaluator for a ML pipeline?

2015-05-09 Thread Stefan H.
Hello everyone,

I am stuck with the (experimental, I think) API for machine learning
pipelines. I have a pipeline with just one estimator (ALS) and I want it to
try different values for the regularization parameter. Therefore I need to
supply an Evaluator that returns a value of type Double. I guess this could
be something like accuracy or mean squared error? The only implementation I
found is BinaryClassificationEvaluator, and I did not understand the
computation there.

I could not find detailed documentation so I implemented a dummy Evaluator
that just returns the regularization parameter:

  new Evaluator {
def evaluate(dataset: DataFrame, paramMap: ParamMap): Double = 
  paramMap.get(als.regParam).getOrElse(throw new Exception)
  }

I just wanted to see whether the lower or higher value wins. On the
resulting model I inspected the chosen regularization parameter this way:

  cvModel.bestModel.fittingParamMap.get(als.regParam)

And it was the highest of my three regularization parameter candidates.
Strange thing is, if I negate the return value of the Evaluator, that line
still returns the highest regularization parameter candidate.

So I am probably working with false assumptions. I'd be grateful if someone
could point me to some documentation or examples, or has a few hints to
share.

Cheers,
Stefan



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-implement-an-Evaluator-for-a-ML-pipeline-tp22830.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [Ml][Dataframe] Ml pipeline dataframe repartitioning

2015-04-26 Thread Joseph Bradley
Hi Peter,

As far as setting the parallelism, I would recommend setting it as early as
possible.  Ideally, that would mean specifying the number of partitions
when loading the initial data (rather than repartitioning later on).

In general, working with Vector columns should be better since the Vector
can be stored as a native array, rather than a bunch of objects.

I suspect the OOM is from Parquet's very large default buffer sizes.  This
is a problem with ML model import/export as well.  I have a JIRA for that:
https://issues.apache.org/jira/browse/SPARK-7148
I'm not yet sure if there's a good way to set the buffer size
automatically, though.

Joseph
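
A minimal sketch of the load-time parallelism suggestion above (the path and partition count are illustrative, using the RDD and SQLContext APIs of that era):

  // Ask for the desired number of partitions when the data is first read...
  val numPartitions = 200  // illustrative; size to your cluster
  val raw = sc.textFile("hdfs:///data/train.csv", numPartitions)

  // ...or, if the source API has no partition knob, repartition once right after loading.
  val df = sqlContext.parquetFile("hdfs:///data/train.parquet").repartition(numPartitions)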

On Fri, Apr 24, 2015 at 8:20 AM, Peter Rudenko petro.rude...@gmail.com
wrote:

  Hi, I have the following problem. I have a dataset with 30 columns (15 numeric,
 15 categorical) and I am using ml transformers/estimators to transform each
 column (StringIndexer for categorical & MeanImputor for numeric). This
 creates 30 more columns in the dataframe. Then I'm using VectorAssembler to
 combine the 30 transformed columns into 1 vector.
 Then when I do df.select(“vector, Label”).saveAsParquetFile it fails with
 an OOM error.

 15/04/24 16:33:05 ERROR Executor: Exception in task 2.0 in stage 52.0 (TID 
 2238)
 15/04/24 16:33:05 DEBUG LocalActor: [actor] received message 
 StatusUpdate(2238,FAILED,java.nio.HeapByteBuffer[pos=0 lim=4167 cap=4167]) 
 from Actor[akka://sparkDriver/deadLetters]
 15/04/24 16:33:05 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
 thread Thread[Executor task launch worker-1,5,main]
 java.lang.OutOfMemoryError: Java heap space
 at java.io.ObjectInputStream$HandleTable.grow(ObjectInputStream.java:3468)
 at 
 java.io.ObjectInputStream$HandleTable.assign(ObjectInputStream.java:3275)
 at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1792)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
 at 
 scala.collection.mutable.HashMap$$anonfun$readObject$1.apply(HashMap.scala:142)
 at 
 scala.collection.mutable.HashMap$$anonfun$readObject$1.apply(HashMap.scala:142)
 at scala.collection.mutable.HashTable$class.init(HashTable.scala:105)
 at scala.collection.mutable.HashMap.init(HashMap.scala:39)
 at scala.collection.mutable.HashMap.readObject(HashMap.scala:142)
 at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896)
 at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
 at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
 at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
 at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
 at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
 at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
 at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
 at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
 at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
 15/04/24 16:33:05 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_52, 
 runningTasks: 3
 15/04/24 16:33:05 DEBUG Utils: Shutdown hook called
 15/04/24 16:33:05 DEBUG DiskBlockManager: Shutdown hook called
 15/04/24 16:33:05 DEBUG TaskSetManager: Moving to NODE_LOCAL after waiting 
 for 3000ms
 15/04/24 16:33:05 DEBUG TaskSetManager: Moving to ANY after waiting for 0ms
 15/04/24 16:33:05 INFO TaskSetManager: Starting task 4.0 in stage 52.0 (TID 
 2240, localhost, PROCESS_LOCAL, 1979 bytes)
 15/04/24 16:33:05 DEBUG LocalActor: [actor] handled message (12.488047 ms) 
 StatusUpdate(2238,FAILED,java.nio.HeapByteBuffer[pos=4167 lim=4167 cap=4167]) 
 from Actor[akka://sparkDriver/deadLetters]
 15/04/24 16:33:05 INFO Executor: Running task 4.0 in stage 52.0 (TID 2240)
 15/04/24 16:33:05 DEBUG LocalActor: [actor] received message 
 

[Ml][Dataframe] Ml pipeline dataframe repartitioning

2015-04-24 Thread Peter Rudenko
Hi, I have the following problem. I have a dataset with 30 columns (15 numeric, 
15 categorical) and I am using ml transformers/estimators to transform each 
column (StringIndexer for categorical & MeanImputor for numeric). This 
creates 30 more columns in the dataframe. Then I'm using VectorAssembler 
to combine the 30 transformed columns into 1 vector.
Then when I do df.select(“vector, Label”).saveAsParquetFile it fails 
with an OOM error.


15/04/24 16:33:05 ERROR Executor: Exception in task 2.0 in stage 52.0 (TID 2238)
15/04/24 16:33:05 DEBUG LocalActor: [actor] received message StatusUpdate(2238,FAILED,java.nio.HeapByteBuffer[pos=0 lim=4167 cap=4167]) from Actor[akka://sparkDriver/deadLetters]
15/04/24 16:33:05 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-1,5,main]
java.lang.OutOfMemoryError: Java heap space
at java.io.ObjectInputStream$HandleTable.grow(ObjectInputStream.java:3468)
at java.io.ObjectInputStream$HandleTable.assign(ObjectInputStream.java:3275)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1792)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at scala.collection.mutable.HashMap$$anonfun$readObject$1.apply(HashMap.scala:142)
at scala.collection.mutable.HashMap$$anonfun$readObject$1.apply(HashMap.scala:142)
at scala.collection.mutable.HashTable$class.init(HashTable.scala:105)
at scala.collection.mutable.HashMap.init(HashMap.scala:39)
at scala.collection.mutable.HashMap.readObject(HashMap.scala:142)
at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
15/04/24 16:33:05 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_52, runningTasks: 3
15/04/24 16:33:05 DEBUG Utils: Shutdown hook called
15/04/24 16:33:05 DEBUG DiskBlockManager: Shutdown hook called
15/04/24 16:33:05 DEBUG TaskSetManager: Moving to NODE_LOCAL after waiting for 3000ms
15/04/24 16:33:05 DEBUG TaskSetManager: Moving to ANY after waiting for 0ms
15/04/24 16:33:05 INFO TaskSetManager: Starting task 4.0 in stage 52.0 (TID 2240, localhost, PROCESS_LOCAL, 1979 bytes)
15/04/24 16:33:05 DEBUG LocalActor: [actor] handled message (12.488047 ms) StatusUpdate(2238,FAILED,java.nio.HeapByteBuffer[pos=4167 lim=4167 cap=4167]) from Actor[akka://sparkDriver/deadLetters]
15/04/24 16:33:05 INFO Executor: Running task 4.0 in stage 52.0 (TID 2240)
15/04/24 16:33:05 DEBUG LocalActor: [actor] received message StatusUpdate(2240,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from Actor[akka://sparkDriver/deadLetters]
15/04/24 16:33:05 DEBUG Executor: Task 2240's epoch is 13
15/04/24 16:33:05 DEBUG BlockManager: Getting local block broadcast_53 ...
15/04/24 16:33:05 DEBUG BlockManager: Level for block broadcast_53 is StorageLevel(true, true, false, true, 1)
15/04/24 16:33:05 DEBUG BlockManager: Getting block broadcast_53 from memory
15/04/24 16:33:05 ERROR TaskSetManager: Task 2 in stage 52.0 failed 1 times; aborting job
15/04/24 16:33:05 DEBUG LocalActor: [actor] handled message (7.195529 ms) StatusUpdate(2240,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from Actor[akka://sparkDriver/deadLetters]
15/04/24 16:33:05 INFO TaskSchedulerImpl: Cancelling stage 52


If after the last step I manually repartition the data, I get a GC overhead error:

java.lang.OutOfMemoryError: GC overhead limit exceeded
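
For reference, a minimal sketch of the indexing and assembling steps described above (column names are illustrative; MeanImputor is assumed to be a custom transformer and is omitted, and df is the input DataFrame):

  import org.apache.spark.ml.{Pipeline, PipelineStage}
  import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

  val categoricalCols = Array("cat1", "cat2")   // illustrative names
  val numericCols = Array("num1", "num2")       // assumed already imputed

  // One StringIndexer per categorical column, producing <col>_idx.
  val indexers = categoricalCols.map { c =>
    new StringIndexer().setInputCol(c).setOutputCol(c + "_idx"): PipelineStage
  }

  // Combine the indexed categorical columns and the numeric columns into one vector column.
  val assembler = new VectorAssembler()
    .setInputCols(categoricalCols.map(_ + "_idx") ++ numericCols)
    .setOutputCol("vector")

  val pipeline = new Pipeline().setStages(indexers :+ assembler)
  val transformed = pipeline.fit(df).transform(df)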

Re: Spark ML Pipeline inaccessible types

2015-03-27 Thread Xiangrui Meng
Hi Martin,

Could you attach the code snippet and the stack trace? The default
implementation of some methods uses reflection, which may be the
cause.

Best,
Xiangrui

On Wed, Mar 25, 2015 at 3:18 PM,  zapletal-mar...@email.cz wrote:
 Thanks Peter,

 I ended up doing something similar. I however consider both the approaches
 you mentioned bad practices which is why I was looking for a solution
 directly supported by the current code.

 I can work with that now, but it does not seem to be the proper solution.

 Regards,
 Martin

 -- Original message --
 From: Peter Rudenko petro.rude...@gmail.com
 To: zapletal-mar...@email.cz, Sean Owen so...@cloudera.com
 Date: 25. 3. 2015 13:28:38


 Subject: Re: Spark ML Pipeline inaccessible types


 Hi Martin, here’s 2 possibilities to overcome this:

 1) Put your logic into org.apache.spark package in your project - then
 everything would be accessible.
 2) Dirty trick:

  object SparkVector extends HashingTF {
   val VectorUDT: DataType = outputDataType
 }

 then you can do like this:

  StructType(vectorTypeColumn, SparkVector.VectorUDT, false))

 Thanks,
 Peter Rudenko

 On 2015-03-25 13:14, zapletal-mar...@email.cz wrote:

 Sean,

 thanks for your response. I am familiar with NoSuchMethodException in
 general, but I think it is not the case this time. The code actually
 attempts to get parameter by name using val m =
 this.getClass.getMethodName(paramName).

 This may be a bug, but it is only a side effect caused by the real problem I
 am facing. My issue is that VectorUDT is not accessible by user code and
 therefore it is not possible to use custom ML pipeline with the existing
 Predictors (see the last two paragraphs in my first email).

 Best Regards,
 Martin

 -- Original message --
 From: Sean Owen so...@cloudera.com
 To: zapletal-mar...@email.cz
 Date: 25. 3. 2015 11:05:54
 Subject: Re: Spark ML Pipeline inaccessible types


 NoSuchMethodError in general means that your runtime and compile-time
 environments are different. I think you need to first make sure you
 don't have mismatching versions of Spark.

 On Wed, Mar 25, 2015 at 11:00 AM, zapletal-mar...@email.cz wrote:
 Hi,

 I have started implementing a machine learning pipeline using Spark 1.3.0
 and the new pipelining API and DataFrames. I got to a point where I have
 my
 training data set prepared using a sequence of Transformers, but I am
 struggling to actually train a model and use it for predictions.

 I am getting a java.lang.NoSuchMethodException:
 org.apache.spark.ml.regression.LinearRegression.myFeaturesColumnName()
 exception thrown at checkInputColumn method in Params trait when using a
 Predictor (LinearRegression in my case, but that should not matter). This
 looks like a bug - the exception is thrown when executing
 getParam(colName)
 when the require(actualDataType.equals(datatype), ...) requirement is not
 met so the expected requirement failed exception is not thrown and is
 hidden
 by the unexpected NoSuchMethodException instead. I can raise a bug if this
 really is an issue and I am not using something incorrectly.

 The problem I am facing however is that the Predictor expects features to
 have VectorUDT type as defined in Predictor class (protected def
 featuresDataType: DataType = new VectorUDT). But since this type is
 private[spark] my Transformer can not prepare features with this type
 which
 then correctly results in the exception above when I use a different type.

 Is there a way to define a custom Pipeline that would be able to use the
 existing Predictors without having to bypass the access modifiers or
 reimplement something or is the pipelining API not yet expected to be used
 in this way?

 Thanks,
 Martin



 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark ML Pipeline inaccessible types

2015-03-27 Thread Joseph Bradley
Hi Martin,

In the short term: Would you be able to work with a different type other
than Vector?  If so, then you can override the Predictor class's protected
def featuresDataType: DataType with a DataFrame type which fits your
purpose.  If you need Vector, then you might have to do a hack like Peter
suggested.

In the long term: VectorUDT should indeed be made public, but that will
have to wait until the next release.

Thanks for the feedback,
Joseph

On Fri, Mar 27, 2015 at 11:12 AM, Xiangrui Meng men...@gmail.com wrote:

 Hi Martin,

 Could you attach the code snippet and the stack trace? The default
 implementation of some methods uses reflection, which may be the
 cause.

 Best,
 Xiangrui

 On Wed, Mar 25, 2015 at 3:18 PM,  zapletal-mar...@email.cz wrote:
  Thanks Peter,
 
  I ended up doing something similar. I however consider both the
 approaches
  you mentioned bad practices which is why I was looking for a solution
  directly supported by the current code.
 
  I can work with that now, but it does not seem to be the proper solution.
 
  Regards,
  Martin
 
  -- Original message --
  From: Peter Rudenko petro.rude...@gmail.com
  To: zapletal-mar...@email.cz, Sean Owen so...@cloudera.com
  Date: 25. 3. 2015 13:28:38


  Subject: Re: Spark ML Pipeline inaccessible types
 
 
  Hi Martin, here’s 2 possibilities to overcome this:
 
  1) Put your logic into org.apache.spark package in your project - then
  everything would be accessible.
  2) Dirty trick:
 
   object SparkVector extends HashingTF {
val VectorUDT: DataType = outputDataType
  }
 
  then you can do like this:
 
   StructType(vectorTypeColumn, SparkVector.VectorUDT, false))
 
  Thanks,
  Peter Rudenko
 
  On 2015-03-25 13:14, zapletal-mar...@email.cz wrote:
 
  Sean,
 
  thanks for your response. I am familiar with NoSuchMethodException in
  general, but I think it is not the case this time. The code actually
  attempts to get parameter by name using val m =
  this.getClass.getMethodName(paramName).
 
  This may be a bug, but it is only a side effect caused by the real
 problem I
  am facing. My issue is that VectorUDT is not accessible by user code and
  therefore it is not possible to use custom ML pipeline with the existing
  Predictors (see the last two paragraphs in my first email).
 
  Best Regards,
  Martin
 
  -- Original message --
  From: Sean Owen so...@cloudera.com
  To: zapletal-mar...@email.cz
  Date: 25. 3. 2015 11:05:54
  Subject: Re: Spark ML Pipeline inaccessible types
 
 
  NoSuchMethodError in general means that your runtime and compile-time
  environments are different. I think you need to first make sure you
  don't have mismatching versions of Spark.
 
  On Wed, Mar 25, 2015 at 11:00 AM, zapletal-mar...@email.cz wrote:
  Hi,
 
  I have started implementing a machine learning pipeline using Spark
 1.3.0
  and the new pipelining API and DataFrames. I got to a point where I have
  my
  training data set prepared using a sequence of Transformers, but I am
  struggling to actually train a model and use it for predictions.
 
  I am getting a java.lang.NoSuchMethodException:
  org.apache.spark.ml.regression.LinearRegression.myFeaturesColumnName()
  exception thrown at checkInputColumn method in Params trait when using a
  Predictor (LinearRegression in my case, but that should not matter).
 This
  looks like a bug - the exception is thrown when executing
  getParam(colName)
  when the require(actualDataType.equals(datatype), ...) requirement is
 not
  met so the expected requirement failed exception is not thrown and is
  hidden
  by the unexpected NoSuchMethodException instead. I can raise a bug if
 this
  really is an issue and I am not using something incorrectly.
 
  The problem I am facing however is that the Predictor expects features
 to
  have VectorUDT type as defined in Predictor class (protected def
  featuresDataType: DataType = new VectorUDT). But since this type is
  private[spark] my Transformer can not prepare features with this type
  which
  then correctly results in the exception above when I use a different
 type.
 
  Is there a way to define a custom Pipeline that would be able to use the
  existing Predictors without having to bypass the access modifiers or
  reimplement something or is the pipelining API not yet expected to be
 used
  in this way?
 
  Thanks,
  Martin
 
 
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark ML Pipeline inaccessible types

2015-03-25 Thread Peter Rudenko

Hi Martin, here’s 2 possibilities to overcome this:

1) Put your logic into org.apache.spark package in your project - then 
everything would be accessible.

2) Dirty trick:

object SparkVector extends HashingTF {
  val VectorUDT: DataType = outputDataType
}

then you can do like this:

StructType(vectorTypeColumn, SparkVector.VectorUDT, false))

Thanks,
Peter Rudenko

On 2015-03-25 13:14, zapletal-mar...@email.cz wrote:


Sean,

thanks for your response. I am familiar with NoSuchMethodException 
in general, but I think it is not the case this time. The code 
actually attempts to get parameter by name using val m = 
this.getClass.getMethodName(paramName).


This may be a bug, but it is only a side effect caused by the real 
problem I am facing. My issue is that VectorUDT is not accessible by 
user code and therefore it is not possible to use custom ML pipeline 
with the existing Predictors (see the last two paragraphs in my first 
email).


Best Regards,
Martin

-- Original message --
From: Sean Owen so...@cloudera.com
To: zapletal-mar...@email.cz
Date: 25. 3. 2015 11:05:54
Subject: Re: Spark ML Pipeline inaccessible types


NoSuchMethodError in general means that your runtime and compile-time
environments are different. I think you need to first make sure you
don't have mismatching versions of Spark.

On Wed, Mar 25, 2015 at 11:00 AM, zapletal-mar...@email.cz wrote:
 Hi,

 I have started implementing a machine learning pipeline using
Spark 1.3.0
 and the new pipelining API and DataFrames. I got to a point
where I have my
 training data set prepared using a sequence of Transformers, but
I am
 struggling to actually train a model and use it for predictions.

 I am getting a java.lang.NoSuchMethodException:

org.apache.spark.ml.regression.LinearRegression.myFeaturesColumnName()
 exception thrown at checkInputColumn method in Params trait when
using a
 Predictor (LinearRegression in my case, but that should not
matter). This
 looks like a bug - the exception is thrown when executing
getParam(colName)
 when the require(actualDataType.equals(datatype), ...)
requirement is not
 met so the expected requirement failed exception is not thrown
and is hidden
 by the unexpected NoSuchMethodException instead. I can raise a
bug if this
 really is an issue and I am not using something incorrectly.

 The problem I am facing however is that the Predictor expects
features to
 have VectorUDT type as defined in Predictor class (protected def
 featuresDataType: DataType = new VectorUDT). But since this type is
 private[spark] my Transformer can not prepare features with this
type which
 then correctly results in the exception above when I use a
different type.

 Is there a way to define a custom Pipeline that would be able to
use the
 existing Predictors without having to bypass the access modifiers or
 reimplement something or is the pipelining API not yet expected
to be used
 in this way?

 Thanks,
 Martin



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


​


Re: Spark ML Pipeline inaccessible types

2015-03-25 Thread zapletal-martin
Thanks Peter,



I ended up doing something similar. I however consider both the approaches 
you mentioned bad practices which is why I was looking for a solution 
directly supported by the current code.




I can work with that now, but it does not seem to be the proper solution.




Regards,

Martin





-- Original message --
From: Peter Rudenko petro.rude...@gmail.com
To: zapletal-mar...@email.cz, Sean Owen so...@cloudera.com
Date: 25. 3. 2015 13:28:38
Subject: Re: Spark ML Pipeline inaccessible types




 Hi Martin, here’s 2 possibilities to overcome this:

 1) Put your logic into org.apache.spark package in your project - then 
 everything would be accessible.
 2) Dirty trick:

 object SparkVector extends HashingTF {
   val VectorUDT: DataType = outputDataType
 }

 then you can do like this:

 StructType(vectorTypeColumn, SparkVector.VectorUDT, false))


 Thanks,
 Peter Rudenko

 On 2015-03-25 13:14, zapletal-mar...@email.cz wrote:

 



Sean, 



thanks for your response. I am familiar with NoSuchMethodException in 
general, but I think it is not the case this time. The code actually 
attempts to get parameter by name using val m = this.getClass.getMethodName
(paramName).




This may be a bug, but it is only a side effect caused by the real problem I
am facing. My issue is that VectorUDT is not accessible by user code and 
therefore it is not possible to use custom ML pipeline with the existing 
Predictors (see the last two paragraphs in my first email).




Best Regards,

Martin



-- Original message --
From: Sean Owen so...@cloudera.com
To: zapletal-mar...@email.cz
Date: 25. 3. 2015 11:05:54
Subject: Re: Spark ML Pipeline inaccessible types

NoSuchMethodError in general means that your runtime and compile-time
environments are different. I think you need to first make sure you
don't have mismatching versions of Spark.

On Wed, Mar 25, 2015 at 11:00 AM, zapletal-mar...@email.cz wrote:
 Hi,

 I have started implementing a machine learning pipeline using Spark 1.3.0
 and the new pipelining API and DataFrames. I got to a point where I have 
my
 training data set prepared using a sequence of Transformers, but I am
 struggling to actually train a model and use it for predictions.

 I am getting a java.lang.NoSuchMethodException:
 org.apache.spark.ml.regression.LinearRegression.myFeaturesColumnName()
 exception thrown at checkInputColumn method in Params trait when using a
 Predictor (LinearRegression in my case, but that should not matter). This
 looks like a bug - the exception is thrown when executing getParam
(colName)
 when the require(actualDataType.equals(datatype), ...) requirement is not
 met so the expected requirement failed exception is not thrown and is 
hidden
 by the unexpected NoSuchMethodException instead. I can raise a bug if this
 really is an issue and I am not using something incorrectly.

 The problem I am facing however is that the Predictor expects features to
 have VectorUDT type as defined in Predictor class (protected def
 featuresDataType: DataType = new VectorUDT). But since this type is
 private[spark] my Transformer can not prepare features with this type 
which
 then correctly results in the exception above when I use a different type.

 Is there a way to define a custom Pipeline that would be able to use the
 existing Predictors without having to bypass the access modifiers or
 reimplement something or is the pipelining API not yet expected to be used
 in this way?

 Thanks,
 Martin



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
 



 

​





Re: Spark ML Pipeline inaccessible types

2015-03-25 Thread zapletal-martin
Sean,



thanks for your response. I am familiar with NoSuchMethodException in 
general, but I think it is not the case this time. The code actually 
attempts to get parameter by name using val m = this.getClass.getMethodName
(paramName).




This may be a bug, but it is only a side effect caused by the real problem I
am facing. My issue is that VectorUDT is not accessible by user code and 
therefore it is not possible to use custom ML pipeline with the existing 
Predictors (see the last two paragraphs in my first email).




Best Regards,

Martin



-- Original message --
From: Sean Owen so...@cloudera.com
To: zapletal-mar...@email.cz
Date: 25. 3. 2015 11:05:54
Subject: Re: Spark ML Pipeline inaccessible types

NoSuchMethodError in general means that your runtime and compile-time
environments are different. I think you need to first make sure you
don't have mismatching versions of Spark.

On Wed, Mar 25, 2015 at 11:00 AM, zapletal-mar...@email.cz wrote:
 Hi,

 I have started implementing a machine learning pipeline using Spark 1.3.0
 and the new pipelining API and DataFrames. I got to a point where I have 
my
 training data set prepared using a sequence of Transformers, but I am
 struggling to actually train a model and use it for predictions.

 I am getting a java.lang.NoSuchMethodException:
 org.apache.spark.ml.regression.LinearRegression.myFeaturesColumnName()
 exception thrown at checkInputColumn method in Params trait when using a
 Predictor (LinearRegression in my case, but that should not matter). This
 looks like a bug - the exception is thrown when executing getParam
(colName)
 when the require(actualDataType.equals(datatype), ...) requirement is not
 met so the expected requirement failed exception is not thrown and is 
hidden
 by the unexpected NoSuchMethodException instead. I can raise a bug if this
 really is an issue and I am not using something incorrectly.

 The problem I am facing however is that the Predictor expects features to
 have VectorUDT type as defined in Predictor class (protected def
 featuresDataType: DataType = new VectorUDT). But since this type is
 private[spark] my Transformer can not prepare features with this type 
which
 then correctly results in the exception above when I use a different type.

 Is there a way to define a custom Pipeline that would be able to use the
 existing Predictors without having to bypass the access modifiers or
 reimplement something or is the pipelining API not yet expected to be used
 in this way?

 Thanks,
 Martin



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


Spark ML Pipeline inaccessible types

2015-03-25 Thread zapletal-martin
Hi,



I have started implementing a machine learning pipeline using Spark 1.3.0 
and the new pipelining API and DataFrames. I got to a point where I have my 
training data set prepared using a sequence of Transformers, but I am 
struggling to actually train a model and use it for predictions.




I am getting a java.lang.NoSuchMethodException:
org.apache.spark.ml.regression.LinearRegression.myFeaturesColumnName() exception thrown at
checkInputColumn method in Params trait when using a Predictor
(LinearRegression in my case, but that should not matter). This looks like a
bug - the exception is thrown when executing getParam(colName) when the
require(actualDataType.equals(datatype), ...) requirement is not met so the
expected requirement failed exception is not thrown and is hidden by the
unexpected NoSuchMethodException instead. I can raise a bug if this really
is an issue and I am not using something incorrectly.

The problem I am facing however is that the Predictor expects features to
have VectorUDT type as defined in Predictor class (protected def
featuresDataType: DataType = new VectorUDT). But since this type is
private[spark] my Transformer can not prepare features with this type which then
correctly results in the exception above when I use a different type.




Is there a way to define a custom Pipeline that would be able to use the 
existing Predictors without having to bypass the access modifiers or 
reimplement something or is the pipelining API not yet expected to be used 
in this way?




Thanks,

Martin








Re: Spark ML Pipeline inaccessible types

2015-03-25 Thread Sean Owen
NoSuchMethodError in general means that your runtime and compile-time
environments are different. I think you need to first make sure you
don't have mismatching versions of Spark.

On Wed, Mar 25, 2015 at 11:00 AM,  zapletal-mar...@email.cz wrote:
 Hi,

 I have started implementing a machine learning pipeline using Spark 1.3.0
 and the new pipelining API and DataFrames. I got to a point where I have my
 training data set prepared using a sequence of Transformers, but I am
 struggling to actually train a model and use it for predictions.

 I am getting a java.lang.NoSuchMethodException:
 org.apache.spark.ml.regression.LinearRegression.myFeaturesColumnName()
 exception thrown at checkInputColumn method in Params trait when using a
 Predictor (LinearRegression in my case, but that should not matter). This
 looks like a bug - the exception is thrown when executing getParam(colName)
 when the require(actualDataType.equals(datatype), ...) requirement is not
 met so the expected requirement failed exception is not thrown and is hidden
 by the unexpected NoSuchMethodException instead. I can raise a bug if this
 really is an issue and I am not using something incorrectly.

 The problem I am facing however is that the Predictor expects features to
 have VectorUDT type as defined in Predictor class (protected def
 featuresDataType: DataType = new VectorUDT). But since this type is
 private[spark] my Transformer can not prepare features with this type which
 then correctly results in the exception above when I use a different type.

 Is there a way to define a custom Pipeline that would be able to use the
 existing Predictors without having to bypass the access modifiers or
 reimplement something or is the pipelining API not yet expected to be used
 in this way?

 Thanks,
 Martin



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



ML Pipeline question about caching

2015-03-17 Thread Cesar Flores
Hello all:

I am using the ML Pipeline, which I consider very powerful. I have the
following use case:

   - I have three transformers, which I will call A,B,C, that basically
   extract features from text files, with no parameters.
   - I have a final stage D, which is the logistic regression estimator.
   - I am creating a pipeline with the sequence A,B,C,D.
   - Finally, I am using this pipeline as estimator parameter of the
   CrossValidator class.

I have some concerns about how data persistence inside the cross validator
works. For example, if only D has multiple parameters to tune using the
cross validator, my concern is that the transformation A-B-C is being
performed multiple times. Is that the case, or is Spark smart enough to
realize that it is possible to persist the output of C? Would it be
better to leave A, B, and C outside the cross validator pipeline?

Thanks a lot
-- 
Cesar Flores


Re: ML Pipeline question about caching

2015-03-17 Thread Peter Rudenko

Hi Cesar,
I had a similar issue. Yes, for now it’s better to do A, B, C outside the 
crossvalidator. Take a look at my comment 
https://issues.apache.org/jira/browse/SPARK-4766?focusedCommentId=14320038page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14320038 
and this jira https://issues.apache.org/jira/browse/SPARK-5844. The 
problem is that transformers could also have hyperparameters in the 
future (like a word2vec transformer). Then the crossvalidator would need to 
find the best parameters for both the transformer and the estimator. 
That would blow up the number of combinations (number of transformer parameters 
x number of estimator parameters x number of folds).


Thanks,
Peter Rudenko
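
A rough sketch of that workaround (here a, b and c stand for the parameter-free stages A, B, C from the question, train is the training DataFrame, and the logistic regression grid is just a placeholder; the cached DataFrame is assumed to end up with the usual "features" and "label" columns):

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
  import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

  // Fit and apply the parameter-free feature stages once, and cache the result.
  val featurePipeline = new Pipeline().setStages(Array(a, b, c))
  val features = featurePipeline.fit(train).transform(train).cache()

  // Cross-validate only the estimator stage (D) on the cached features.
  val lr = new LogisticRegression()
  val grid = new ParamGridBuilder()
    .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
    .build()
  val cv = new CrossValidator()
    .setEstimator(lr)
    .setEvaluator(new BinaryClassificationEvaluator())
    .setEstimatorParamMaps(grid)
    .setNumFolds(3)
  val cvModel = cv.fit(features)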

On 2015-03-18 00:26, Cesar Flores wrote:



Hello all:

I am using the ML Pipeline, which I consider very powerful. I have the 
next use case:


  * I have three transformers, which I will call A,B,C, that basically
extract features from text files, with no parameters.
  * I have a final stage D, which is the logistic regression estimator.
  * I am creating a pipeline with the sequence A,B,C,D.
  * Finally, I am using this pipeline as estimator parameter of the
CrossValidator class.

I have some concerns about how data persistence inside the cross 
validator works. For example, if only D has multiple parameters to 
tune using the cross validator, my concern is that the transformation 
A-B-C is being performed multiple times. Is that the case, or is 
Spark smart enough to realize that it is possible to persist the 
output of C? Would it be better to leave A, B, and C outside the 
cross validator pipeline?


Thanks a lot
--
Cesar Flores


​


Re: Need some help to create user defined type for ML pipeline

2015-02-23 Thread Jaonary Rabarisoa
Hi Joseph,

Thank you for your feedback. I've managed to define an image type by
following the VectorUDT implementation.

I have another question about the definition of a user-defined transformer.
The UnaryTransformer is private to spark.ml. Do you plan
to provide a developer API for transformers?



On Sun, Jan 25, 2015 at 2:26 AM, Joseph Bradley jos...@databricks.com
wrote:

 Hi Jao,

 You're right that defining serialize and deserialize is the main task in
 implementing a UDT.  They are basically translating between your native
 representation (ByteImage) and SQL DataTypes.  The sqlType you defined
 looks correct, and you're correct to use a row of length 4.  Other than
 that, it should just require copying data to and from SQL Rows.  There are
 quite a few examples of that in the codebase; I'd recommend searching based
 on the particular DataTypes you're using.

 Are there particular issues you're running into?

 Joseph

 On Mon, Jan 19, 2015 at 12:59 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Hi all,

 I'm trying to implement a pipeline for computer vision based on the
 latest ML package in spark. The first step of my pipeline is to decode
 image (jpeg for instance) stored in a parquet file.
 For this, I begin to create a UserDefinedType that represents a decoded
 image stored in a array of byte. Here is my first attempt :























 @SQLUserDefinedType(udt = classOf[ByteImageUDT])
 class ByteImage(channels: Int, width: Int, height: Int, data: Array[Byte])

 private[spark] class ByteImageUDT extends UserDefinedType[ByteImage] {

   override def sqlType: StructType = {
     // type: 0 = sparse, 1 = dense
     // We only use values for dense vectors, and size, indices, and values for sparse
     // vectors. The values field is nullable because we might want to add binary vectors later,
     // which uses size and indices, but not values.
     StructType(Seq(
       StructField(channels, IntegerType, nullable = false),
       StructField(width, IntegerType, nullable = false),
       StructField(height, IntegerType, nullable = false),
       StructField(data, BinaryType, nullable = false)
   }

   override def serialize(obj: Any): Row = {
     val row = new GenericMutableRow(4)
     val img = obj.asInstanceOf[ByteImage]
     ...
   }

   override def deserialize(datum: Any): Vector = {
     ...
   }

   override def pyUDT: String = pyspark.mllib.linalg.VectorUDT

   override def userClass: Class[Vector] = classOf[Vector]
 }


 I take the VectorUDT as a starting point but there's a lot of thing that I 
 don't really understand. So any help on defining serialize and deserialize 
 methods will be appreciated.

 Best Regards,

 Jao





Re: Need some help to create user defined type for ML pipeline

2015-02-23 Thread Xiangrui Meng
Yes, we are going to expose the developer API. There was a long
discussion in the PR: https://github.com/apache/spark/pull/3637. So we
marked them package private and look for feedback on how to improve
it. Please implement your classes under `spark.ml` for now and let us
know your feedback. Thanks! -Xiangrui

On Mon, Feb 23, 2015 at 8:10 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
 Hi Joseph,

 Thank you for your feedback. I've managed to define an image type by
 following the VectorUDT implementation.

 I have another question about the definition of a user-defined transformer.
 The UnaryTransformer is private to spark.ml. Do you plan
 to provide a developer API for transformers?



 On Sun, Jan 25, 2015 at 2:26 AM, Joseph Bradley jos...@databricks.com
 wrote:

 Hi Jao,

 You're right that defining serialize and deserialize is the main task in
 implementing a UDT.  They are basically translating between your native
 representation (ByteImage) and SQL DataTypes.  The sqlType you defined looks
 correct, and you're correct to use a row of length 4.  Other than that, it
 should just require copying data to and from SQL Rows.  There are quite a
 few examples of that in the codebase; I'd recommend searching based on the
 particular DataTypes you're using.

 Are there particular issues you're running into?

 Joseph

 On Mon, Jan 19, 2015 at 12:59 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Hi all,

 I'm trying to implement a pipeline for computer vision based on the
 latest ML package in spark. The first step of my pipeline is to decode image
 (jpeg for instance) stored in a parquet file.
 For this, I begin to create a UserDefinedType that represents a decoded
 image stored in a array of byte. Here is my first attempt :


 @SQLUserDefinedType(udt = classOf[ByteImageUDT])
 class ByteImage(channels: Int, width: Int, height: Int, data:
 Array[Byte])


 private[spark] class ByteImageUDT extends UserDefinedType[ByteImage] {

   override def sqlType: StructType = {
 // type: 0 = sparse, 1 = dense
 // We only use values for dense vectors, and size, indices, and
 values for sparse
 // vectors. The values field is nullable because we might want to
 add binary vectors later,
 // which uses size and indices, but not values.
 StructType(Seq(
   StructField(channels, IntegerType, nullable = false),
   StructField(width, IntegerType, nullable = false),
   StructField(height, IntegerType, nullable = false),
   StructField(data, BinaryType, nullable = false)
   }

   override def serialize(obj: Any): Row = {

 val row = new GenericMutableRow(4)
 val img = obj.asInstanceOf[ByteImage]


 ...
   }

   override def deserialize(datum: Any): Vector = {


 


 }
   }

   override def pyUDT: String = pyspark.mllib.linalg.VectorUDT

   override def userClass: Class[Vector] = classOf[Vector]
 }


 I take the VectorUDT as a starting point but there's a lot of thing that
 I don't really understand. So any help on defining serialize and deserialize
 methods will be appreciated.

 Best Regards,

 Jao




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark ML pipeline

2015-02-11 Thread Reynold Xin
Yes. Next release (Spark 1.3) is coming out end of Feb / early Mar.

On Wed, Feb 11, 2015 at 7:22 AM, Jianguo Li flyingfromch...@gmail.com
wrote:

 Hi,

 I really like the pipeline in spark.ml in the Spark 1.2 release. Will
 there be more machine learning algorithms implemented for the pipeline
 framework in the next major release? Any idea when the next major release
 comes out?

 Thanks,

 Jianguo



Re: Need some help to create user defined type for ML pipeline

2015-01-24 Thread Joseph Bradley
Hi Jao,

You're right that defining serialize and deserialize is the main task in
implementing a UDT.  They are basically translating between your native
representation (ByteImage) and SQL DataTypes.  The sqlType you defined
looks correct, and you're correct to use a row of length 4.  Other than
that, it should just require copying data to and from SQL Rows.  There are
quite a few examples of that in the codebase; I'd recommend searching based
on the particular DataTypes you're using.

Are there particular issues you're running into?

Joseph
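
To make that concrete, here is a rough, purely illustrative sketch of such a UDT for a ByteImage-like type. It is not the original poster's code: it assumes the constructor parameters are exposed as vals so they can be read back, and it uses the same 1.3-era GenericMutableRow / UDT API as the snippet quoted below.

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.catalyst.expressions.GenericMutableRow  // 1.3-era internal class
  import org.apache.spark.sql.types._

  @SQLUserDefinedType(udt = classOf[ByteImageUDT])
  class ByteImage(val channels: Int, val width: Int, val height: Int, val data: Array[Byte])

  class ByteImageUDT extends UserDefinedType[ByteImage] {

    override def sqlType: StructType = StructType(Seq(
      StructField("channels", IntegerType, nullable = false),
      StructField("width", IntegerType, nullable = false),
      StructField("height", IntegerType, nullable = false),
      StructField("data", BinaryType, nullable = false)))

    // Copy each field of the native object into a 4-column Row.
    override def serialize(obj: Any): Row = {
      val img = obj.asInstanceOf[ByteImage]
      val row = new GenericMutableRow(4)
      row.setInt(0, img.channels)
      row.setInt(1, img.width)
      row.setInt(2, img.height)
      row.update(3, img.data)
      row
    }

    // Rebuild the native object from the Row produced by serialize.
    override def deserialize(datum: Any): ByteImage = datum match {
      case row: Row =>
        new ByteImage(row.getInt(0), row.getInt(1), row.getInt(2), row.getAs[Array[Byte]](3))
    }

    override def userClass: Class[ByteImage] = classOf[ByteImage]
  }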

On Mon, Jan 19, 2015 at 12:59 AM, Jaonary Rabarisoa jaon...@gmail.com
wrote:

 Hi all,

 I'm trying to implement a pipeline for computer vision based on the latest
 ML package in spark. The first step of my pipeline is to decode image (jpeg
 for instance) stored in a parquet file.
 For this, I begin to create a UserDefinedType that represents a decoded
 image stored in a array of byte. Here is my first attempt :























 @SQLUserDefinedType(udt = classOf[ByteImageUDT])
 class ByteImage(channels: Int, width: Int, height: Int, data: Array[Byte])

 private[spark] class ByteImageUDT extends UserDefinedType[ByteImage] {

   override def sqlType: StructType = {
     // type: 0 = sparse, 1 = dense
     // We only use values for dense vectors, and size, indices, and values for sparse
     // vectors. The values field is nullable because we might want to add binary vectors later,
     // which uses size and indices, but not values.
     StructType(Seq(
       StructField(channels, IntegerType, nullable = false),
       StructField(width, IntegerType, nullable = false),
       StructField(height, IntegerType, nullable = false),
       StructField(data, BinaryType, nullable = false)
   }

   override def serialize(obj: Any): Row = {
     val row = new GenericMutableRow(4)
     val img = obj.asInstanceOf[ByteImage]
     ...
   }

   override def deserialize(datum: Any): Vector = {
     ...
   }

   override def pyUDT: String = pyspark.mllib.linalg.VectorUDT

   override def userClass: Class[Vector] = classOf[Vector]
 }


 I take the VectorUDT as a starting point but there's a lot of thing that I 
 don't really understand. So any help on defining serialize and deserialize 
 methods will be appreciated.

 Best Regards,

 Jao




Need some help to create user defined type for ML pipeline

2015-01-19 Thread Jaonary Rabarisoa
Hi all,

I'm trying to implement a pipeline for computer vision based on the latest
ML package in Spark. The first step of my pipeline is to decode images (jpeg
for instance) stored in a parquet file.
For this, I began by creating a UserDefinedType that represents a decoded
image stored in an array of bytes. Here is my first attempt:























@SQLUserDefinedType(udt = classOf[ByteImageUDT])
class ByteImage(channels: Int, width: Int, height: Int, data: Array[Byte])

private[spark] class ByteImageUDT extends UserDefinedType[ByteImage] {

  override def sqlType: StructType = {
    // type: 0 = sparse, 1 = dense
    // We only use values for dense vectors, and size, indices, and values for sparse
    // vectors. The values field is nullable because we might want to add binary vectors later,
    // which uses size and indices, but not values.
    StructType(Seq(
      StructField(channels, IntegerType, nullable = false),
      StructField(width, IntegerType, nullable = false),
      StructField(height, IntegerType, nullable = false),
      StructField(data, BinaryType, nullable = false)
  }

  override def serialize(obj: Any): Row = {
    val row = new GenericMutableRow(4)
    val img = obj.asInstanceOf[ByteImage]
    ...
  }

  override def deserialize(datum: Any): Vector = {
    ...
  }

  override def pyUDT: String = pyspark.mllib.linalg.VectorUDT

  override def userClass: Class[Vector] = classOf[Vector]
}


I take the VectorUDT as a starting point but there are a lot of things
that I don't really understand. So any help on defining serialize and
deserialize methods will be appreciated.

Best Regards,

Jao