Re: Contributed to spark

2017-04-08 Thread Shuai Lin
Links that were helpful to me while learning about the Spark source code: - Articles with the "spark" tag in this blog: http://hydronitrogen.com/tag/spark.html - Jacek's "Mastering Apache Spark" GitBook: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/ Hope those can help. On Sat, Apr 8,

Re: Structured streaming and writing output to Cassandra

2017-04-08 Thread shyla deshpande
Thanks Jules. It was helpful. On Fri, Apr 7, 2017 at 8:32 PM, Jules Damji wrote: > This blog shows how to write a custom sink: https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html

Re: Why dataframe can be more efficient than dataset?

2017-04-08 Thread Koert Kuipers
how would you use only relational transformations on dataset? On Sat, Apr 8, 2017 at 2:15 PM, Shiyuan wrote: > Hi Spark-users, > I came across a few sources which mentioned DataFrame can be more > efficient than Dataset. I can understand this is true because Dataset >

Re: Why dataframe can be more efficient than dataset?

2017-04-08 Thread Koert Kuipers
let me try that again. i left some crap at the bottom of my previous email as i was editing it. sorry about that. here it goes: it is because you use Dataset[X] but the actual computations are still done in Dataset[Row] (so DataFrame). well... the actual computations are done in RDD[InternalRow]

Re: Why dataframe can be more efficient than dataset?

2017-04-08 Thread Koert Kuipers
it is because you use Dataset[X] but the actual computations are still done in Dataset[Row] (so DataFrame). well... the actual computations are done in RDD[InternalRow] with spark's internal types to represent String, Map, Seq, structs, etc. so for example if you do: scala> val x:
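[Editor's note] The round-trip Koert describes can be sketched outside Spark. This is a conceptual plain-Python analogy, not Spark's actual code: `to_internal`/`from_internal` are hypothetical stand-ins for an Encoder's serializer and deserializer, and the point is only that a typed functional transformation pays an object round-trip per operation while a relational expression can stay in the internal representation.

```python
# Conceptual sketch (plain Python, NOT Spark internals): why a typed
# Dataset[X] map can cost more than an equivalent DataFrame expression.
# All names here are illustrative stand-ins.

def to_internal(obj):
    # Stand-in for an Encoder's serializer: user object -> internal row.
    return (obj["name"], obj["age"])

def from_internal(row):
    # Stand-in for the deserializer: internal row -> user object.
    return {"name": row[0], "age": row[1]}

# Spark keeps data in an internal row format (think RDD[InternalRow]).
internal_rows = [("ann", 30), ("bob", 25)]

# Typed "Dataset" path: each functional transformation deserializes to a
# user object, applies the closure (opaque to Catalyst), and serializes back.
def bump_age(person):
    return {**person, "age": person["age"] + 1}

typed = [to_internal(bump_age(from_internal(r))) for r in internal_rows]

# "DataFrame" path: a relational expression works on the internal
# representation directly, with no object round-trip.
relational = [(name, age + 1) for name, age in internal_rows]

assert typed == relational  # same answer; the typed path did extra work
```

The results match, but the typed path performed a serialize/deserialize cycle per row that the relational path avoided entirely, which is the overhead being discussed.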

Re: Why dataframe can be more efficient than dataset?

2017-04-08 Thread Jörn Franke
As far as I am aware, in newer Spark versions a DataFrame is the same as Dataset[Row]. In fact, performance depends on so many factors that I am not sure such a comparison makes sense. > On 8. Apr 2017, at 20:15, Shiyuan wrote: > > Hi Spark-users, > I came across a few

Re: Assigning a unique row ID

2017-04-08 Thread Everett Anderson
On Fri, Apr 7, 2017 at 8:04 PM, Subhash Sriram wrote: > Hi, > > We use monotonically_increasing_id() as well, but just cache the table > first like Ankur suggested. With that method, we get the same keys in all > derived tables. > Ah, okay, awesome. Let me give that a
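[Editor's note] The cache-first advice fits how monotonically_increasing_id is documented to build its ids: the upper 31 bits hold the partition id and the lower 33 bits hold the record number within the partition, so ids are unique but not consecutive, and re-evaluating an uncached DataFrame under different partitioning can yield different ids. A minimal sketch of that bit layout (plain Python, outside Spark, for illustration only):

```python
# Sketch of the documented id layout of monotonically_increasing_id:
# upper 31 bits = partition id, lower 33 bits = record number within
# the partition. This is an illustration, not Spark's implementation.

RECORD_BITS = 33  # per the function's documentation

def generated_id(partition_id: int, record_index: int) -> int:
    """Compose an id by placing the partition id above 33 record bits."""
    return (partition_id << RECORD_BITS) | record_index

# Partition 0 starts at 0; partition 1 jumps to 2**33, hence the gaps.
first_in_p0 = generated_id(0, 0)
sixth_in_p0 = generated_id(0, 5)
first_in_p1 = generated_id(1, 0)
```

Because the id depends on which partition a row lands in, caching before deriving tables pins the assignment, which is why the derived tables in the thread get consistent keys.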

Why dataframe can be more efficient than dataset?

2017-04-08 Thread Shiyuan
Hi Spark-users, I came across a few sources which mentioned DataFrame can be more efficient than Dataset. I can understand this is true because Dataset allows functional transformation which Catalyst cannot look into and hence cannot optimize well. But can DataFrame be more efficient than