Links that were helpful to me while learning about the Spark source code:
- Articles with "spark" tag in this blog:
http://hydronitrogen.com/tag/spark.html
- Jacek's "Mastering Apache Spark" GitBook:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Hope those can help.
On Sat, Apr 8,
Thanks Jules. It was helpful.
On Fri, Apr 7, 2017 at 8:32 PM, Jules Damji wrote:
> This blog shows how to write a custom sink: https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
>
>
How would you use only relational transformations on a Dataset?
On Sat, Apr 8, 2017 at 2:15 PM, Shiyuan wrote:
> Hi Spark-users,
> I came across a few sources which mentioned DataFrame can be more
> efficient than Dataset. I can understand this is true because Dataset
>
let me try that again. i left some crap at the bottom of my previous email
as i was editing it. sorry about that. here it goes:
It is because you use Dataset[X], but the actual computations are still done
in Dataset[Row] (so DataFrame). Well... actually the computations are done in
RDD[InternalRow], with Spark's internal types representing String, Map, Seq,
structs, etc.
so for example if you do:
scala> val x:
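A minimal sketch of the contrast being described (names and data are assumed, using a local SparkSession): a typed `map` on a Dataset forces each InternalRow to be deserialized into the case class before the opaque lambda runs, while the equivalent relational `select` is expressed as Catalyst expressions and stays in the internal representation.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("typed-vs-relational").getOrCreate()
import spark.implicits._

// Hypothetical case class and data, just for illustration.
case class Person(name: String, age: Int)
val ds = Seq(Person("a", 30), Person("b", 40)).toDS()

// Typed transformation: the lambda is a black box to Catalyst, so each
// row is deserialized into a Person before the function is applied.
val typed = ds.map(p => p.age + 1)

// Relational transformation: built from Catalyst expressions, so it
// operates directly on the internal rows and can be fully optimized.
val relational = ds.select(($"age" + 1).as[Int])

// Same results either way; compare the plans with .explain(true) to see
// the extra serialization step in the typed version.
typed.show()
relational.show()
```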
As far as I am aware, in newer Spark versions a DataFrame is the same as
Dataset[Row].
In fact, performance depends on so many factors that I am not sure such a
comparison makes sense.
> On 8. Apr 2017, at 20:15, Shiyuan wrote:
>
> Hi Spark-users,
> I came across a few
On Fri, Apr 7, 2017 at 8:04 PM, Subhash Sriram
wrote:
> Hi,
>
> We use monotonically_increasing_id() as well, but just cache the table
> first like Ankur suggested. With that method, we get the same keys in all
> derived tables.
>
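A minimal sketch of the approach Subhash describes (column and table names are assumed): because monotonically_increasing_id() is non-deterministic, caching the table right after adding the id column fixes the generated keys, so every derived table sees the same values instead of a fresh recomputation.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id

val spark = SparkSession.builder().master("local[*]").appName("stable-keys").getOrCreate()
import spark.implicits._

val df = Seq("a", "b", "c").toDF("value")

// Without caching, each action could re-evaluate the non-deterministic
// id column and assign different ids to the same rows.
val withId = df.withColumn("id", monotonically_increasing_id()).cache()
withId.count()  // materialize the cache so the ids are pinned

// Derived tables now share identical keys.
val left  = withId.select("id", "value")
val right = withId.select("id")
```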
Ah, okay, awesome. Let me give that a
Hi Spark-users,
I came across a few sources which mentioned that DataFrame can be more
efficient than Dataset. I can understand this is true because Dataset
allows functional transformations, which Catalyst cannot look into and hence
cannot optimize well. But can DataFrame be more efficient than