Re: How To Save TF-IDF Model In PySpark

2016-01-15 Thread Andy Davidson
Are you using 1.6.0 or an older version?

I think I remember something in 1.5.1 saying save was not implemented in
Python.


The current doc does not say anything about save()
http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf

http://spark.apache.org/docs/latest/ml-guide.html#saving-and-loading-pipelines
"Often times it is worth it to save a model or a pipeline to disk for later
use. In Spark 1.6, a model import/export functionality was added to the
Pipeline API. Most basic transformers are supported as well as some of the
more basic ML models. Please refer to the algorithm's API documentation to
see if saving and loading is supported."

andy




From:  Asim Jalis <asimja...@gmail.com>
Date:  Friday, January 15, 2016 at 4:02 PM
To:  "user @spark" <user@spark.apache.org>
Subject:  How To Save TF-IDF Model In PySpark

> Hi,
> 
> I am trying to save a TF-IDF model in PySpark. Looks like this is not
> supported. 
> 
> Using `model.save()` causes:
> 
> AttributeError: 'IDFModel' object has no attribute 'save'
> 
> Using `pickle` causes:
> 
> TypeError: can't pickle lock objects
> 
> Does anyone have suggestions?
> 
> Thanks!
> 
> Asim
> 
> Here is the full repro. Start pyspark shell and then run this code in
> it.
> 
> ```
> # Imports
> from pyspark import SparkContext
> from pyspark.mllib.feature import HashingTF
> 
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.linalg import Vectors
> from pyspark.mllib.feature import IDF
> 
> # Create some data
> n = 4
> freqs = [
> Vectors.sparse(n, (1, 3), (1.0, 2.0)),
> Vectors.dense([0.0, 1.0, 2.0, 3.0]),
> Vectors.sparse(n, [1], [1.0])]
> data = sc.parallelize(freqs)
> idf = IDF()
> model = idf.fit(data)
> tfidf = model.transform(data)
> 
> # View
> for r in tfidf.collect(): print(r)
> 
> # Try to save it
> model.save("foo.model")
> 
> # Try to save it with Pickle
> import pickle
> pickle.dump(model, open("model.p", "wb"))
> pickle.dumps(model)
> ```




Re: How To Save TF-IDF Model In PySpark

2016-01-15 Thread Jerry Lam
Can you save it to parquet with the vector in one field?

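One way to read this suggestion: the fitted model is, in effect, a vector of IDF weights, so the weights can be stored as ordinary data and reapplied later. The sketch below is plain Python rather than PySpark, and it writes JSON instead of Parquet, purely so that it runs without a cluster. `fit_idf`, `transform`, and the file name are hypothetical helpers, and the weighting log((m + 1) / (df + 1)) is the formula given in the MLlib feature-extraction guide. In PySpark the vector would presumably be read off the fitted model (for example via `model.idf()`, if your version exposes it).

```python
import json
import math
import os
import tempfile


def fit_idf(docs):
    """Compute IDF weights as log((m + 1) / (df + 1)) for each term index."""
    m = len(docs)      # number of documents
    n = len(docs[0])   # vocabulary size
    # Document frequency: how many documents contain each term at all.
    df = [sum(1 for d in docs if d[j] > 0) for j in range(n)]
    return [math.log((m + 1) / (df[j] + 1)) for j in range(n)]


def transform(idf, tf):
    """Scale a term-frequency vector elementwise by the IDF weights."""
    return [w * x for w, x in zip(idf, tf)]


# Dense equivalents of the three vectors in the repro.
docs = [
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 1.0, 2.0, 3.0],
    [0.0, 1.0, 0.0, 0.0],
]
idf = fit_idf(docs)

# Persist only the weights, not the model object.
path = os.path.join(tempfile.mkdtemp(), "idf_weights.json")
with open(path, "w") as f:
    json.dump({"idf": idf}, f)

# Later (possibly in another process): reload and reapply.
with open(path) as f:
    restored = json.load(f)["idf"]

tfidf = [transform(restored, d) for d in docs]
```

The same idea should carry over to Parquet by writing a one-row table with the weights in a single column, which is what the question above proposes.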

> On 15 Jan, 2016, at 7:33 pm, Andy Davidson <a...@santacruzintegration.com> 
> wrote:
> 
> Are you using 1.6.0 or an older version?
> 
> I think I remember something in 1.5.1 saying save was not implemented in 
> python.
> 
> 
> The current doc does not say anything about save()
> http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
> 
> http://spark.apache.org/docs/latest/ml-guide.html#saving-and-loading-pipelines
> "Often times it is worth it to save a model or a pipeline to disk for later 
> use. In Spark 1.6, a model import/export functionality was added to the 
> Pipeline API. Most basic transformers are supported as well as some of the 
> more basic ML models. Please refer to the algorithm’s API documentation to 
> see if saving and loading is supported."
> 
> andy


How To Save TF-IDF Model In PySpark

2016-01-15 Thread Asim Jalis
Hi,

I am trying to save a TF-IDF model in PySpark. Looks like this is not
supported.

Using `model.save()` causes:

AttributeError: 'IDFModel' object has no attribute 'save'

Using `pickle` causes:

TypeError: can't pickle lock objects

Does anyone have suggestions?

Thanks!

Asim

Here is the full repro. Start pyspark shell and then run this code in
it.

```
# Imports (run inside the pyspark shell, which provides `sc`)
from pyspark.mllib.feature import IDF
from pyspark.mllib.linalg import Vectors

# Create some term-frequency vectors
n = 4
freqs = [
    Vectors.sparse(n, (1, 3), (1.0, 2.0)),
    Vectors.dense([0.0, 1.0, 2.0, 3.0]),
    Vectors.sparse(n, [1], [1.0]),
]
data = sc.parallelize(freqs)

# Fit the IDF model and compute the TF-IDF vectors
idf = IDF()
model = idf.fit(data)
tfidf = model.transform(data)

# View
for r in tfidf.collect():
    print(r)

# Try to save it: raises AttributeError ('IDFModel' object has no attribute 'save')
model.save("foo.model")

# Try to save it with pickle: raises TypeError (can't pickle lock objects)
import pickle
with open("model.p", "wb") as f:
    pickle.dump(model, f)
pickle.dumps(model)
```
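
The pickle failure reported above is not specific to TF-IDF: `pickle` rejects any object graph that contains a lock. The PySpark model presumably holds a reference to a live JVM connection that includes one; that is an assumption, not something checked against the Spark source. The sketch below reproduces the same class of error with a plain dictionary and shows that the numeric state pickles fine once the lock is left out. The `state` dictionary and its keys are hypothetical stand-ins, not PySpark structures.

```python
import pickle
import threading

# Hypothetical stand-in for a model: plain weights plus a lock such as a
# live connection might hold.
state = {"weights": [1.0, 0.0, 0.5], "lock": threading.Lock()}

# Pickling the whole thing fails because of the lock.
try:
    pickle.dumps(state)
    lock_error = None
except TypeError as exc:
    lock_error = exc

# Pickling only the plain data works and round-trips unchanged.
plain = {key: value for key, value in state.items() if key != "lock"}
restored = pickle.loads(pickle.dumps(plain))
```

In other words, the workaround is to serialize the data the model carries rather than the wrapper object itself.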