Hi Xavier,

Along the lines of connecting to multiple data sources and replacing ETL tools, you may want to check out Confluent's blog on building a real-time streaming ETL pipeline on Kafka <https://www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes/> as well as SnappyData's blog on real-time streaming ETL with SnappyData <http://www.snappydata.io/blog/real-time-streaming-etl-with-snappydata>, where Spark is central to connecting to multiple data sources, executing SQL on streams, and so on. These should provide useful comparisons to your ideas about Dremio + Spark as ETL tools.
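As a rough sketch of the Dremio-over-JDBC access pattern being discussed: Spark's generic JDBC reader just needs a URL, driver class, and table name, so the Dremio-specific part is only the option map. The host, port, credentials, and dataset name below are placeholders, not real endpoints:

```python
# Sketch: option map Spark's JDBC reader would need to pull a Dremio
# virtual dataset. Host/credentials/dataset are placeholder values.

def dremio_jdbc_options(host, port, user, password, dataset):
    """Build the options dict for spark.read.format("jdbc") against Dremio."""
    return {
        "url": f"jdbc:dremio:direct={host}:{port}",  # Dremio direct-connect URL form
        "driver": "com.dremio.jdbc.Driver",          # Dremio JDBC driver class
        "user": user,
        "password": password,
        "dbtable": dataset,                          # e.g. a virtual dataset path
    }

opts = dremio_jdbc_options("dremio-host", 31010, "analyst", "secret",
                           '"space"."virtual_dataset"')

# With a live Dremio endpoint and the driver on Spark's classpath,
# the round trip Xavier describes would look roughly like:
#
#   df = spark.read.format("jdbc").options(**opts).load()
#   df.write.parquet("s3a://bucket/path/")   # or HDFS / Azure Data Lake
```

This is only a sketch of the wiring, not a tested integration; a dedicated Spark-Dremio connector would of course be cleaner than generic JDBC.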
Disclaimer: I am a SnappyData employee.

Hope this helps,
Pierce

On Mon, May 14, 2018 at 2:24 AM, xmehaut <xavier.meh...@gmail.com> wrote:
> Hi Michaël,
>
> I'm not an expert on Dremio; I'm just trying to evaluate the potential of
> this technology, what impact it could have on Spark, how the two can work
> together, and how Spark could make further use of Arrow internally in its
> existing algorithms.
>
> Dremio already has a fairly rich API set that gives access to, for
> instance, metadata and SQL queries, and even lets you create virtual
> datasets programmatically. It also has a lot of predefined functions, and
> I imagine there will be more and more functions in the future, e.g.
> machine learning functions like the ones we find in Azure SQL Server,
> which let you mix SQL and ML functions. Access to Dremio is made through
> JDBC, and we can imagine accessing virtual datasets through Spark and
> dynamically creating new datasets from the API, connected to Parquet
> files stored dynamically by Spark on HDFS, Azure Data Lake, or S3. Of
> course, a tighter integration between the two would be better, with a
> Spark read/write connector to Dremio :)
>
> regards
> xavier
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>