Awesome! I just read the design docs. That is EXACTLY what I was talking about!
Looking forward to it!
Thanks
On Wed, Feb 3, 2016 at 7:40 AM, Koert Kuipers wrote:
> yeah, there was some discussion about adding them to RDD, but it would
> break a lot, so Dataset was born.
>
> yes
On Wed, Feb 3, 2016 at 1:42 PM, Nirav Patel wrote:
> Awesome! I just read the design docs. That is EXACTLY what I was talking
> about! Looking forward to it!
>
Great :)
Most of the API is there in 1.6. For the next release I would like to
unify DataFrame <-> Dataset.
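Roughly, with what is in 1.6 already (a sketch; Person and people.json are
made-up examples, not from this thread):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Long)

    val sc = new SparkContext(new SparkConf().setAppName("ds").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // DataFrame -> Dataset: .as[T] attaches an encoder for the case class
    val df = sqlContext.read.json("people.json")
    val ds = df.as[Person]

    // typed lambdas over the Dataset, and back to a DataFrame with toDF()
    val adults = ds.filter(_.age >= 18).toDF()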
I think the Spark DataFrame supports more than just SQL. It is more like a
pandas DataFrame. (I rarely use the SQL feature.)
There are a lot of novelties in DataFrame, so I think it is quite optimized
for many tasks. The in-memory data structure is very memory efficient. I
just changed a very slow RDD job to DataFrame.
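For example, everything below goes through the column DSL, with no SQL
strings anywhere (a sketch; the file and column names are made up, and a
sqlContext is assumed to be in scope):

    import org.apache.spark.sql.functions.{avg, max}

    // pandas-style manipulation through DataFrame operators
    val events = sqlContext.read.parquet("events.parquet")   // hypothetical input
    events.filter(events("status") === "ok")
          .groupBy("country")
          .agg(avg("latency"), max("latency"))
          .show()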
>
> A principal difference between RDDs and DataFrames/Datasets is that the
> latter have a schema associated with them. This means that they support
> only certain types (primitives, case classes and more) and that they are
> uniform, whereas RDDs can contain any serializable object and need not be
> uniform.
>
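Concretely, the distinction in the quote looks like this (a sketch; Point
and Blob are made up, and sc/sqlContext are assumed to be in scope):

    import sqlContext.implicits._

    // a case class has a structure Spark understands, so an encoder is derived:
    case class Point(x: Double, y: Double)
    val points = sqlContext.createDataset(Seq(Point(1.0, 2.0), Point(3.0, 4.0)))

    // an arbitrary serializable class fits in an RDD, but it has no schema,
    // so there is no implicit Encoder for it:
    class Blob(val bytes: Array[Byte]) extends Serializable
    val blobs = sc.parallelize(Seq(new Blob(Array[Byte](1, 2, 3))))   // fine
    // sqlContext.createDataset(Seq(new Blob(...)))    // does not compile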
Sure, having a common distributed query and compute engine for all kinds of
data sources is an alluring concept to market and advertise, and to attract
potential customers (non-engineers, analysts, data scientists). But it's
nothing new! It's darn old school, taking bits and pieces from existing
SQL.
To address one specific question:
> The docs say it uses sun.misc.Unsafe to convert the physical RDD structure
into byte arrays at some point, for optimized GC and memory. My question is:
why is this only applicable to SQL/DataFrame and not RDD? RDDs have types too!
A principal difference between RDDs and DataFrames/Datasets is that the
latter have a schema associated with them.
Hi Michael,
Is there a section in the Spark documentation that demonstrates how to
serialize arbitrary objects in a DataFrame? The last time I did this, I used
a User Defined Type (copied from VectorUDT).
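What I had looked roughly like this (names are made up, and UserDefinedType
is a developer API whose internals vary by version, so treat this as a
sketch rather than official guidance):

    import java.io._
    import org.apache.spark.sql.types._

    @SQLUserDefinedType(udt = classOf[MyThingUDT])
    class MyThing(val payload: String) extends Serializable

    class MyThingUDT extends UserDefinedType[MyThing] {
      // store the object as an opaque byte blob inside the row
      override def sqlType: DataType = BinaryType

      override def serialize(obj: Any): Any = {
        val bytes = new ByteArrayOutputStream()
        val out = new ObjectOutputStream(bytes)
        out.writeObject(obj)
        out.close()
        bytes.toByteArray
      }

      override def deserialize(datum: Any): MyThing = {
        val in = new ObjectInputStream(
          new ByteArrayInputStream(datum.asInstanceOf[Array[Byte]]))
        in.readObject().asInstanceOf[MyThing]
      }

      override def userClass: Class[MyThing] = classOf[MyThing]
    }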
Best Regards,
Jerry
On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust wrote:
Hi Jerry,
Yes, I read that benchmark, and it doesn't help in most cases. I'll give you
an example from one of our applications. It's a memory hogger by nature,
since it works on groupByKey and performs combinatorics on the Iterator, so
it maintains a few structures inside each task. It works on MapReduce with
half the memory.
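Schematically the hot path looks like this (simplified; Rec is a made-up
stand-in for our record type, and sc is assumed to be in scope):

    case class Rec(key: String, value: Int)
    val rdd = sc.parallelize(Seq(Rec("a", 1), Rec("a", 2), Rec("a", 3)))

    // every group's Iterable has to be materialized in memory before the
    // combinatorics can run over it, which is what makes tasks memory hoggers
    val pairsPerKey = rdd.map(r => (r.key, r.value))
      .groupByKey()
      .flatMap { case (k, vs) => vs.toSeq.combinations(2).map(c => (k, c)) }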
With respect to joins, unfortunately not all implementations are available.
For example, I would like to use joins where one side is streamed (and the
other cached). This seems to be available for DataFrame but not for RDD.
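For DataFrames you can at least hint this today, roughly like so (a sketch;
bigDf and smallDf are made-up names):

    import org.apache.spark.sql.functions.broadcast

    // the small (cached) side is shipped to every executor; the big side
    // just streams through it, with no shuffle of the large table
    val joined = bigDf.join(broadcast(smallDf), "key")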
On Wed, Feb 3, 2016 at 12:19 AM, Nirav Patel wrote:
So the latest optimizations done in the Spark 1.4 and 1.5 releases are
mostly from project Tungsten. The docs say it uses sun.misc.Unsafe to
convert the physical RDD structure into byte arrays at some point, for
optimized GC and memory. My question is: why is this only applicable to
SQL/DataFrame and not RDD? RDDs have types too!
Dataset will have access to some of the Catalyst/Tungsten optimizations
while also giving you Scala and types. However, it is currently experimental
and not yet as efficient as it could be.
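The encoder is what makes this possible. A rough sketch (Pt is just an
illustration, not from the thread):

    import org.apache.spark.sql.{Encoder, Encoders}

    case class Pt(x: Double, y: Double)

    // the encoder carries a fixed schema, which is what lets Tungsten lay
    // instances out as raw bytes instead of as Java objects on the heap
    val enc: Encoder[Pt] = Encoders.product[Pt]
    println(enc.schema)   // StructType(StructField(x,DoubleType,false), ...)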
On Feb 2, 2016 7:50 PM, "Nirav Patel" wrote:
> Sure, having a common distributed
I don't understand why one thinks an RDD of case objects doesn't have types
(a schema). If Spark can convert an RDD to a DataFrame, that means it
understood the schema. So from that point on, why does one have to use SQL
features to do further processing? If all Spark needs for its optimizations
is a schema, then what stops it from optimizing RDDs of case classes?
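That is, this already works (assuming sc and sqlContext are in scope;
Person is a made-up example):

    import sqlContext.implicits._

    case class Person(name: String, age: Int)
    val rdd = sc.parallelize(Seq(Person("a", 30), Person("b", 40)))

    val df = rdd.toDF()   // schema inferred from the case class by reflection
    df.printSchema()      // name: string, age: integer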
Hi Nirav,
I'm sure you read this?
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
There is a benchmark in the article showing that a DataFrame "can"
outperform an RDD implementation by 2x. Of course, benchmarks can be "made".
Hi,
Perhaps I should write a blog about why Spark is focusing more on making
Spark jobs easier to write while hiding the underlying performance
optimization details from seasoned Spark users. It's one thing to provide an
abstract framework that does the optimization for you so you don't have to.
What do you think is preventing you from optimizing your own RDD-level
transformations and actions? AFAIK, nothing that has been added in
Catalyst precludes you from doing that. The fact of the matter is, though,
that there is less type and semantic information available to Spark from
the raw objects in an RDD than there is from a DataFrame with a schema.
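For example, nothing stops the classic hand optimizations, and the
closure-vs-expression difference is easy to see. A sketch (Person is made
up; sc and sqlContext are assumed to be in scope):

    import sqlContext.implicits._
    import org.apache.spark.sql.functions.col

    case class Person(name: String, age: Int)
    val rdd = sc.parallelize(Seq(Person("a", 30), Person("b", 30)))

    // nothing stops hand optimization at the RDD level, e.g. map-side
    // partial aggregation instead of materializing whole groups:
    val counts = rdd.map(p => (p.age, 1L)).reduceByKey(_ + _)

    // but a lambda like this is an opaque closure Spark cannot look inside...
    val bumped = rdd.map(p => p.age + 1)
    // ...whereas this is an expression tree Catalyst can analyze and rewrite:
    val bumpedDf = rdd.toDF().select(col("age") + 1)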
I haven't gone through many details of the Spark Catalyst optimizer and the
Tungsten project, but we have been advised by Databricks support to use
DataFrame to resolve OOM errors that we are getting during Join and GroupBy
operations. We use Spark 1.3.1, and it looks like it cannot perform these
operations without running out of memory.
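As I understand the advice, it amounts to a change of this shape (a sketch
with made-up names; sc and sqlContext are assumed to be in scope):

    import sqlContext.implicits._
    import org.apache.spark.sql.functions.sum

    case class Rec(key: String, value: Long)
    val rdd = sc.parallelize(Seq(Rec("a", 1L), Rec("a", 2L)))
    val df = rdd.toDF()

    // what we do today: groupByKey materializes every group (OOM-prone)
    val totals = rdd.map(r => (r.key, r.value)).groupByKey().mapValues(_.sum)

    // the suggested shape: a DataFrame aggregation, which releases newer
    // than our 1.3.1 can execute with partial, spillable aggregation
    val totalsDf = df.groupBy("key").agg(sum("value"))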