I think it's very likely we'll be able to contribute the Spark adapter back to 
Calcite. If we don't, the existing adapter in Calcite will undoubtedly be rewritten 
eventually anyway, considering how much interest there has been in it. I'll see 
what I can do this week.

> On Jan 20, 2017, at 4:41 PM, Julian Hyde <jh...@apache.org> wrote:
> 
> I agree with Jacques.
> 
> Jordan, If your company decides to open-source your work, I would be
> willing - nay, delighted - to replace Calcite's existing Spark
> adapter. As you say, the current adapter is unusable. There is
> considerable appetite for a real distributed compute engine for
> Calcite (not counting Drill and Hive, because they embed Calcite, not
> the other way around) and that would convert into people using,
> hardening and extending your code.
> 
> Julian
> 
> 
>> On Fri, Jan 20, 2017 at 4:24 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>> Jordan, super interesting work you've shared. It would be very cool to get
>> this incorporated back into Spark mainline. That would continue to broaden
>> Calcite's reach :)
>> 
>> On Fri, Jan 20, 2017 at 1:36 PM, jordan.halter...@gmail.com <
>> jordan.halter...@gmail.com> wrote:
>> 
>>> So, AFAIK the Spark adapter that's inside Calcite is in an unusable state
>>> right now. It's still built against Spark 1.x, and the last time I tried it I
>>> couldn't get it to run. It probably needs to be either removed or completely
>>> rewritten. But I can certainly offer some guidance on working with Spark
>>> and Calcite.
>>> 
>>> As we were discussing on the other thread, I've been doing research on
>>> optimizing Spark queries with Calcite at my company. I don't yet know whether
>>> it will be open sourced in the near future.
>>> 
>>> So, there are really a couple of ways to go about optimizing Spark queries
>>> using Calcite. The first option is the approach the current code in Calcite
>>> takes: run Calcite on RDDs. The code you see in Calcite appears to have been
>>> developed before Spark SQL existed, or at least as an alternative to Spark SQL.
>>> It lets you run Calcite SQL queries on Spark by converting optimized Calcite
>>> plans into Spark RDD operations, using RDD methods for relational expressions
>>> and Calcite's Enumerables for row expressions.
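>>> 
>>> If you just want to experiment with that path, the adapter is switched on
>>> through a Calcite JDBC connection property - if I remember right it's
>>> spark=true. Something along these lines (an untested sketch; the model.json
>>> path and the emps table are just placeholders):
>>> 
>>>   import java.sql.DriverManager
>>>   import java.util.Properties
>>> 
>>>   // Placeholder model file describing the schemas/tables to expose.
>>>   val props = new Properties()
>>>   props.setProperty("model", "/path/to/model.json")
>>>   // Ask Calcite to implement the physical plan as Spark RDD operations.
>>>   props.setProperty("spark", "true")
>>> 
>>>   val conn = DriverManager.getConnection("jdbc:calcite:", props)
>>>   val stmt = conn.createStatement()
>>>   val rs = stmt.executeQuery("select * from emps")  // placeholder table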
>>> 
>>> Alternatively, what we wanted to do when we started our project was to
>>> integrate Calcite directly into Spark SQL. Spark SQL/DataFrames/Datasets
>>> are widely used APIs, and we wanted to see if we could apply Calcite's
>>> significantly better optimization techniques to Spark's plans without
>>> breaking those APIs. That's the second way to go about it. What we did,
>>> essentially, was implement a custom Optimizer (a Spark interface) that
>>> converts Spark logical plans to Calcite logical plans, uses Calcite to
>>> optimize the plan, and then converts from Calcite back to Spark. In effect,
>>> this is a complete replacement of the optimization phase of Catalyst
>>> (Spark's optimizer). But converting from Spark plans to Calcite plans and
>>> back is admittedly a major challenge; it has taken months to get right for
>>> more complex expressions like aggregations and grouping sets.
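>>> 
>>> In our case we swapped out the Optimizer itself, but a lighter-weight way to
>>> prototype the same idea is to append a rule through
>>> spark.experimental.extraOptimizations. Very roughly - with the converters left
>>> as stubs, since the Spark <-> Calcite conversion is where all the real work
>>> is, and the names below are made up:
>>> 
>>>   import org.apache.calcite.rel.RelNode
>>>   import org.apache.spark.sql.SparkSession
>>>   import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
>>>   import org.apache.spark.sql.catalyst.rules.Rule
>>> 
>>>   // Hypothetical rule that hands the whole logical plan to Calcite and back.
>>>   object CalciteOptimize extends Rule[LogicalPlan] {
>>>     // Stubs standing in for the Spark <-> Calcite converters and the
>>>     // Calcite planner invocation.
>>>     private def toCalcite(plan: LogicalPlan): RelNode = ???
>>>     private def optimizeWithCalcite(rel: RelNode): RelNode = ???
>>>     private def toSpark(rel: RelNode): LogicalPlan = ???
>>> 
>>>     override def apply(plan: LogicalPlan): LogicalPlan =
>>>       toSpark(optimizeWithCalcite(toCalcite(plan)))
>>>   }
>>> 
>>>   val spark = SparkSession.builder().master("local[*]").getOrCreate()
>>>   // User-provided rules run after Catalyst's built-in optimizer batches.
>>>   spark.experimental.extraOptimizations = Seq(CalciteOptimize)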
>>> 
>>> So, the two options are really: replace Spark SQL with Calcite, or
>>> integrate Calcite into Spark SQL. The former is a fairly straightforward
>>> use case for Calcite. The latter requires a deep understanding of both
>>> Calcite's and Spark's relational algebra and writing algorithms to convert
>>> between the two. But I can say that it has been very successful. We've been
>>> able to improve Spark's performance quite significantly on all different
>>> types of data - including flat files - and have seen improvements of 1-2
>>> orders of magnitude in Spark's performance against databases like
>>> Postgres, Redshift, Mongo, etc. in TPC-DS benchmarks.
>>> 
>>>> On Jan 18, 2017, at 12:25 PM, Riccardo Tommasini <
>>> riccardo.tommas...@polimi.it> wrote:
>>>> 
>>>> Hello,
>>>> I'm trying to understand how to use the spark adapter.
>>>> 
>>>> Does anyone have any example?
>>>> 
>>>> Thanks in advance
>>>> 
>>>> Riccardo Tommasini
>>>> Master Degree Computer Science
>>>> PhD Student at Politecnico di Milano (Italy)
>>>> streamreasoning.org<http://streamreasoning.org/>
>>>> 
>>>> Submitted from an iPhone, I apologise for typos.
>>> 
