Re: PySpark + Streaming + DataFrames

Jason White Mon, 19 Oct 2015 15:09:13 -0700

Ah, that makes sense then, thanks TD.

The conversion from RDD -> DF involves a `.take(10)` in PySpark, even if you 
provide the schema, so I was avoiding back-and-forth conversions. I’ll see if I 
can create a ‘trusted’ conversion that doesn’t involve the `take`.


-- 
Jason

On October 19, 2015 at 5:23:59 PM, Tathagata Das (t...@databricks.com) wrote:

RDD and DF are not compatible data types. So you cannot return a DF when you 
have to return an RDD. What rather you can do is return the underlying RDD of 
the dataframe by dataframe.rdd(). 


On Fri, Oct 16, 2015 at 12:07 PM, Jason White <jason.wh...@shopify.com> wrote:
Hi Ken, thanks for replying.

Unless I'm misunderstanding something, I don't believe that's correct.
Dstream.transform() accepts a single argument, func. func should be a
function that accepts a single RDD, and returns a single RDD. That's what
transform_to_df does, except the RDD it returns is a DF.

I've used Dstream.transform() successfully in the past when transforming
RDDs, so I don't think my problem is there.

I haven't tried this in Scala yet, and all of the examples I've seen on the
website seem to use foreach instead of transform. Does this approach work in
Scala?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-Streaming-DataFrames-tp25095p25099.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: PySpark + Streaming + DataFrames

Reply via email to