Hi,
I am trying to save a TF-IDF model in PySpark. Looks like this is not
supported.
Using `model.save()` causes:
AttributeError: 'IDFModel' object has no attribute 'save'
Using `pickle` causes:
TypeError: can't pickle lock objects
Does anyone have suggestions?
Thanks!
Asim
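(For anyone hitting this later: a workaround sketch, not a supported API. mllib's IDFModel can't be pickled because it wraps a JVM object, but, assuming a Spark version where PySpark exposes IDFModel.idf(), the fitted IDF vector itself is a plain local vector and can be saved; file names here are illustrative.)

import pickle

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF

sc = SparkContext(appName="tfidf-save-sketch")
docs = sc.textFile("docs.txt").map(lambda line: line.split(" "))

tf = HashingTF().transform(docs)
tf.cache()
model = IDF().fit(tf)

# The model holds a JVM handle and can't be pickled, but the fitted
# idf vector is local, so it can be saved and reapplied later:
# transform() is an elementwise multiply of the TF vector by this vector.
with open("idf.pkl", "wb") as f:
    pickle.dump(model.idf().toArray(), f)

I believe the spark.ml (DataFrame-based) IDFModel gained save()/load() in later releases, so moving to the pipeline API may be the cleaner fix.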
Here is the example:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/RandomForestClassifierExample.scala

Yanbo

2015-12-17 13:41 GMT+08:00 Asim Jalis <asimja...@gmail.com>:

> I wanted to get feature importances related to a Random Forest as
I wanted to get feature importances related to a Random Forest as
described in this JIRA: https://issues.apache.org/jira/browse/SPARK-5133
However, I don't see how to call this. I don't see any methods exposed on
org.apache.spark.mllib.tree.RandomForest.
How can I get featureImportances when using this API?
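(For later readers: featureImportances ended up exposed on the spark.ml DataFrame-based models rather than on mllib.tree.RandomForest. A sketch, assuming a release where PySpark exposes it, 2.0+ I believe; the toy data is illustrative.)

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("rf-importances-sketch").getOrCreate()

# Toy training data; real features would come from your own pipeline.
train = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.0]), 0.0),
     (Vectors.dense([1.0, 0.0]), 1.0)],
    ["features", "label"])

model = RandomForestClassifier(numTrees=10).fit(train)

# Per-feature importances, normalized to sum to 1.
print(model.featureImportances)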
When I use reduceByKeyAndWindow with func and invFunc (in PySpark) the size
of the window keeps growing. I am appending the code that reproduces this
issue. It prints out the count() of the DStream, which goes up by 10
elements every batch.
Is this a bug in the Python version, or is this expected behavior?
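(The attached repro did not survive in this excerpt; the sketch below is my guess at its shape: a queue stream feeding 10 pairs per batch into an inverse-reduce window, printing count(). Durations and names are illustrative.)

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="window-growth-repro")
ssc = StreamingContext(sc, batchDuration=1)
ssc.checkpoint("/tmp/ckpt")  # required when an invFunc is used

# 10 (key, 1) pairs per batch, as described in the post.
batches = [sc.parallelize([(i % 5, 1) for i in range(10)])
           for _ in range(60)]
stream = ssc.queueStream(batches)

windowed = stream.reduceByKeyAndWindow(
    lambda a, b: a + b,   # func: fold in values entering the window
    lambda a, b: a - b,   # invFunc: subtract values leaving the window
    windowDuration=5,
    slideDuration=1)

# Per the post, this count keeps growing every batch instead of
# stabilizing once the window fills.
windowed.count().pprint()

ssc.start()
ssc.awaitTermination()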
I want to test some Spark Streaming code that is using
reduceByKeyAndWindow. If I do not enable checkpointing, I get the error:
java.lang.IllegalArgumentException: requirement failed: The checkpoint
directory has not been set. Please set it by StreamingContext.checkpoint().
But if I enable
something similar for Spark Streaming unit tests.
TD
On Fri, Aug 14, 2015 at 1:04 PM, Asim Jalis asimja...@gmail.com wrote:
I want to test some Spark Streaming code that is using
reduceByKeyAndWindow. If I do not enable checkpointing, I get the error:
java.lang.IllegalArgumentException
Another fix might be to remove the exception that is thrown when windowing
and other stateful operations are used without checkpointing.
On Fri, Aug 14, 2015 at 5:43 PM, Asim Jalis asimja...@gmail.com wrote:
I feel the real fix here is to remove the exception from the
QueueInputDStream class.
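In the meantime, a workaround that I believe works for unit tests is to point the checkpoint at a throwaway temp directory; a sketch:

import shutil
import tempfile

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

checkpoint_dir = tempfile.mkdtemp()
sc = SparkContext(appName="streaming-test-sketch")
ssc = StreamingContext(sc, batchDuration=1)
ssc.checkpoint(checkpoint_dir)  # satisfies the requirement check

try:
    ...  # build and exercise the windowed DStream under test
finally:
    ssc.stop(stopSparkContext=True)
    shutil.rmtree(checkpoint_dir, ignore_errors=True)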
In Spark Streaming apps if I enable ssc.checkpoint(dir) does this
checkpoint all RDDs? Or is it just checkpointing windowing and state RDDs?
For example, if in a DStream I am using an iterative algorithm on a
non-state non-window RDD, do I have to checkpoint it explicitly myself, or
can I assume Spark handles it automatically?
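My understanding (worth verifying against the docs) is that ssc.checkpoint(dir) sets the directory and covers metadata plus the RDDs of stateful DStreams; an arbitrary RDD you build up iteratively still needs an explicit checkpoint. A sketch, with illustrative names, of what that might look like inside a transform()/foreachRDD():

def process(rdd):
    result = rdd
    for i in range(100):                  # some iterative algorithm
        result = result.map(lambda x: x + 1)
        if i % 10 == 0:
            result.checkpoint()           # explicit data checkpoint
            result.count()                # materialize to truncate lineage
    return result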
Since checkpointing in streaming apps happens every checkpoint duration, in
the event of failure, how is the system able to recover the state changes
that happened after the last checkpoint?
Is there a way to join non-DStream RDDs with DStream RDDs?
Here is the use case. I have a lookup table stored in HDFS that I want to
read as an RDD. Then I want to join it with the RDDs that are coming in
through the DStream. How can I do this?
Thanks.
Asim
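One way that I believe works: DStream.transform() hands you each batch as a plain RDD, so an ordinary RDD join applies. A sketch with illustrative paths and parsing:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-join-sketch")
ssc = StreamingContext(sc, batchDuration=10)

# Static lookup table read once from HDFS as (key, value) pairs.
lookup = sc.textFile("hdfs:///tables/lookup.tsv") \
           .map(lambda line: tuple(line.split("\t", 1)))
lookup.cache()

events = ssc.socketTextStream("localhost", 9999) \
            .map(lambda line: tuple(line.split("\t", 1)))

# transform() exposes each batch as an RDD, so a plain RDD join works.
joined = events.transform(lambda rdd: rdd.join(lookup))
joined.pprint()

ssc.start()
ssc.awaitTermination()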
In Spark Streaming, is there a way to initialize the state
of updateStateByKey before it starts processing RDDs? I noticed that there
is an overload of updateStateByKey that takes an initialRDD in the latest
sources (although not in the 1.2.0 release). Is there another way to do
this until that overload is released?
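One stopgap until that overload lands (a sketch, assuming a Spark version where an exhausted queueStream yields empty batches): inject the initial state as a one-shot first batch and let the update function fold it in. All names are illustrative.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="initial-state-sketch")
ssc = StreamingContext(sc, batchDuration=5)
ssc.checkpoint("/tmp/ckpt")  # updateStateByKey requires checkpointing

# Pre-existing per-key counts, injected as a single synthetic batch.
seed = ssc.queueStream([sc.parallelize([("user1", 10), ("user2", 3)])])

live = ssc.socketTextStream("localhost", 9999).map(lambda w: (w, 1))

def update(new_values, state):
    return (state or 0) + sum(new_values)

counts = seed.union(live).updateStateByKey(update)
counts.pprint()

ssc.start()
ssc.awaitTermination()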
Another option is to make a copy of log4j.properties in the current
directory where you start spark-shell from, and modify
log4j.rootCategory=INFO, console to log4j.rootCategory=ERROR, console.
Then start the shell.
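That is, a log4j.properties in the working directory along these lines (mirroring Spark's conf/log4j.properties.template):

# Only show ERROR and above on the console.
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n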
On Wed, Jan 7, 2015 at 3:39 AM, Akhil <ak...@sigmoidanalytics.com> wrote:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala#L43
On Tue, Jan 6, 2015 at 5:25 PM, Asim Jalis asimja...@gmail.com wrote:
Is there an easy way to do a moving average across a single RDD (in a
non-streaming app). Here is the use case. I have an RDD made up of stock
prices. I want to calculate a moving average using a window size of N.
this most generic method, like the
aggregate methods.
On Tue, Jan 6, 2015 at 8:34 PM, Asim Jalis asimja...@gmail.com wrote:
Thanks. Another question. I have event data with timestamps. I want to
create a sliding window using timestamps. Some windows will have a lot of
events in them, others will have fewer.
Here is code implementing 1-minute buckets, sliding by 10
seconds:

tickerRDD.flatMap(ticker =>
  (ticker.timestamp - 60000 to ticker.timestamp by 15000).map(ts => (ts, ticker))
).map { case (ts, ticker) =>
  ((ts / 60000) * 60000, ticker)
}.groupByKey
On Tue, Jan 6, 2015 at 8:47 PM, Asim Jalis <asimja...@gmail.com> wrote:
, not one value.
On Wed, Jan 7, 2015 at 12:10 AM, Paolo Platter paolo.plat...@agilelab.it
wrote:
In my opinion you should use the fold pattern, obviously after a sortBy
transformation.
Paolo
Sent from my Windows Phone
--
From: Asim Jalis <asimja...@gmail.com>
Sent:
I guess I can use a similar groupBy approach. Map each event to all the
windows that it can belong to. Then do a groupBy, etc. I was wondering if
there was a more elegant approach.
On Tue, Jan 6, 2015 at 3:45 PM, Asim Jalis asimja...@gmail.com wrote:
Except I want it to be a sliding window.
Is there an easy way to do a moving average across a single RDD (in a
non-streaming app). Here is the use case. I have an RDD made up of stock
prices. I want to calculate a moving average using a window size of N.
Thanks.
Asim
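Since the mllib sliding() helper Akhil pointed at above isn't exposed in Python, here is a self-contained PySpark sketch of the same idea using zipWithIndex: each price is keyed to every window it participates in, then full windows are averaged. Assumes N is small relative to the data; values are illustrative.

from pyspark import SparkContext

sc = SparkContext(appName="moving-average-sketch")

N = 3
prices = sc.parallelize([10.0, 11.0, 12.0, 13.0, 14.0])
count = prices.count()

# (index, price) pairs, preserving order.
indexed = prices.zipWithIndex().map(lambda pair: (pair[1], pair[0]))

# The price at index i belongs to the windows ending at i .. i + N - 1.
contributions = indexed.flatMap(
    lambda kv: [(kv[0] + j, kv[1]) for j in range(N)])

# Keep only complete windows, then average each one.
averages = (contributions
            .groupByKey()
            .filter(lambda kv: N - 1 <= kv[0] < count)
            .mapValues(lambda vs: sum(vs) / len(vs))
            .sortByKey())

print(averages.collect())  # [(2, 11.0), (3, 12.0), (4, 13.0)]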
Q: In Spark Streaming if your DStream transformation and output action take
longer than the batch duration will the system process the next batch in
another thread? Or will it just wait until the first batch’s RDD is
processed? In other words, does it build up a queue of buffered RDDs
awaiting processing?
Is there a way I can keep a JDBC connection open through a streaming job? I
have a foreach which is running once per batch. However, I don’t want to
open the connection for each batch but would rather have a persistent
connection that I can reuse. How can I do this?
Thanks.
Asim
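The pattern the Spark Streaming guide recommends for this (under "design patterns for using foreachRDD") is a per-partition connection, ideally borrowed from a pool on the executor, rather than one long-lived driver-side connection. A sketch: get_connection() and conn.send() stand in for your own JDBC/pool code, and stream is assumed to be the DStream.

def send_partition(records):
    conn = get_connection()    # hypothetical: borrow from a pool on the executor
    try:
        for record in records:
            conn.send(record)  # hypothetical write call
    finally:
        conn.close()           # or return the connection to the pool

stream.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))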