During the summit, I also had many discussions on similar topics with
multiple committers and active users. I heard many fantastic ideas. I
believe Spark improvement proposals are a good channel for collecting
requirements and designs.
IMO, we also need to consider the priority when working on
At Spark Summit this week, everyone from PMC members to users I had
never met before was asking me about the Spark improvement proposals
idea. It's clear that it's a real community need.
But it's been almost half a year, and nothing visible has been done.
Reynold, are you going to do this?
Thanks, Jorn, for the input. Our users want to run queries that perform large
aggregations of data from different tables as well as simple ad hoc queries
over a single table. The tables are all in ORC format. They're currently using
the Hive plus Tez architecture that you mention but experiencing
I think this is a rather simplistic view. All the tools do computation
in memory in the end. For certain types of computation and usage patterns it
makes sense to keep the data in memory. For example, most machine learning
approaches need to include the same data in several iterative