Hi Xavier,

Regarding Arrow usage in Spark, using the Arrow format to transfer data
between Python and Java has been the focus so far because this area stood to
benefit the most.  It's possible that the scope of Arrow usage could broaden
in the future, but that still needs to be discussed.

Bryan

On Mon, May 14, 2018 at 9:55 AM, Pierce Lamb <richard.pierce.l...@gmail.com>
wrote:

> Hi Xavier,
>
> Along the lines of connecting to multiple sources of data and replacing
> ETL tools, you may want to check out Confluent's blog on building a
> real-time streaming ETL pipeline on Kafka
> <https://www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes/>
> as well as SnappyData's blog on Real-Time Streaming ETL with SnappyData
> <http://www.snappydata.io/blog/real-time-streaming-etl-with-snappydata>, where
> Spark is central to connecting to multiple data sources, executing SQL on
> streams, etc. These should provide nice comparisons to your ideas about
> Dremio + Spark as ETL tools.
>
> Disclaimer: I am a SnappyData employee
>
> Hope this helps,
>
> Pierce
>
> On Mon, May 14, 2018 at 2:24 AM, xmehaut <xavier.meh...@gmail.com> wrote:
>
>> Hi Michaël,
>>
>> I'm not an expert on Dremio; I'm just trying to evaluate the potential of
>> this technology, what impact it could have on Spark, how the two could
>> work together, and how Spark could make even further use of Arrow
>> internally in its existing algorithms.
>>
>> Dremio already has a fairly rich API set that enables access to metadata
>> and SQL queries, and even allows creating virtual datasets
>> programmatically.
>> They also have a lot of predefined functions, and I imagine there will be
>> more and more functions in the future, e.g. machine learning functions
>> like the ones found in Azure SQL Server, which lets you mix SQL and ML
>> functions.  Access to Dremio is made through JDBC, and we could imagine
>> accessing virtual datasets through Spark and dynamically creating new
>> datasets from the API, connected to Parquet files written dynamically by
>> Spark to HDFS, Azure Data Lake, or S3... Of course, a tighter integration
>> between the two would be better, with a Spark read/write connector for
>> Dremio :)
>>
>> regards
>> xavier
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
