What should we do for that? I think we can start by writing the requirements and flows.
On 2020/03/05 20:19:38, Julian Jaffe <jja...@pinterest.com.INVALID> wrote:
> Yeah, I think the primary objective here is a standalone writer from Spark
> to Druid.
>
> On Thu, Mar 5, 2020 at 11:43 AM itai yaffe <itai.ya...@gmail.com> wrote:
> >
> > Thanks Julian!
> > I'm actually aiming for this connector to provide write capabilities (at
> > least as a first phase), rather than focusing on read capabilities.
> > Having said that, I definitely see the value (even for the use-cases in
> > my company) of having a reader that queries S3 segments directly! Funny,
> > we too have implemented a mechanism (although a very simple one) to get
> > the locations of the segments through SegmentMetadataQueries, to allow
> > batch-oriented queries to work directly against the deep storage :)
> >
> > Anyway, as I said, I think we can focus on write capabilities for now,
> > and worry about read capabilities later (if that's OK).
> >
> > On 2020/03/05 18:29:09, Julian Jaffe <jja...@pinterest.com.INVALID> wrote:
> > > The spark-druid-connector you shared brings up another design decision
> > > we should probably talk through. That connector effectively wraps an
> > > HTTP query client with Spark plumbing. An alternative approach (and the
> > > one I ended up building due to our business requirements) is to build a
> > > reader that operates directly over the S3 segments, shifting load for
> > > what are likely very large and non-interactive queries off
> > > Druid-specific hardware (with the exception of a few
> > > SegmentMetadataQueries to get location info).
> > >
> > > On Thu, Mar 5, 2020 at 8:04 AM itai yaffe <itai.ya...@gmail.com> wrote:
> > > >
> > > > I'll let Julian answer, but in the meantime, I just wanted to point
> > > > out we might be able to draw some inspiration from this
> > > > Spark-Redshift connector
> > > > (https://github.com/databricks/spark-redshift#scala).
> > > > Though it's somewhat outdated, it can probably be used as a reference
> > > > for this new Spark-Druid connector we're planning.
> > > > Another project to look at is
> > > > https://github.com/SharpRay/spark-druid-connector.
> > > >
> > > > On 2020/03/02 14:31:27, Oğuzhan Mangır <sosyalmedya.oguz...@gmail.com> wrote:
> > > > > I think the second option would be better. Many people use Spark
> > > > > for batch operations with isolated clusters. My friends and I will
> > > > > set aside time for that. Julian, can you share your experiences
> > > > > with that? After that, we can write our aims, requirements and
> > > > > flows easily.
> > > > >
> > > > > On 2020/02/26 13:26:13, itai yaffe <itai.ya...@gmail.com> wrote:
> > > > > > Hey,
> > > > > > Per Gian's proposal, and following this thread in the Druid user
> > > > > > group (https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM)
> > > > > > and this thread in the Druid Slack channel
> > > > > > (https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600),
> > > > > > I'd like to start discussing the options for having Spark-based
> > > > > > ingestion into Druid.
> > > > > >
> > > > > > There's already an old project
> > > > > > (https://github.com/metamx/druid-spark-batch) for that, so
> > > > > > perhaps we can use that as a starting point.
> > > > > >
> > > > > > The thread on Slack suggested 2 approaches:
> > > > > >
> > > > > > 1. *Simply replacing the Hadoop MapReduce ingestion task* -
> > > > > > having a Spark batch job that ingests data into Druid, as a
> > > > > > simple replacement of the Hadoop MapReduce ingestion task.
> > > > > > Meaning - your data pipeline will have a Spark job to pre-process
> > > > > > the data (similar to what some of us have today), and another
> > > > > > Spark job to read the output of the previous job and create
> > > > > > Druid segments (again - following the same pattern as the Hadoop
> > > > > > MapReduce ingestion task).
> > > > > > 2. *Druid output sink for Spark* - rather than having 2 separate
> > > > > > Spark jobs, 1 for pre-processing the data and 1 for ingesting
> > > > > > the data into Druid, you'll have a single Spark job that
> > > > > > pre-processes the data and creates Druid segments directly, e.g.
> > > > > > sparkDataFrame.write.format("druid") (as suggested by omngr on
> > > > > > Slack).
> > > > > >
> > > > > > I personally prefer the 2nd approach - while it might be harder
> > > > > > to implement, it seems the benefits are greater in this approach.
> > > > > >
> > > > > > I'd like to hear your thoughts and to start getting this ball
> > > > > > rolling.
> > > > > >
> > > > > > Thanks,
> > > > > > Itai
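To make the 2nd approach concrete, here is a minimal sketch of what the single-job flow could look like from a user's perspective. Everything below is hypothetical: the "druid" format name, its options, and the input path are placeholders for an API that hasn't been designed yet.

    // Sketch of approach 2: one Spark job that pre-processes raw data and
    // writes Druid segments directly through a (hypothetical) "druid" sink.
    // The format name, options, and paths below are all placeholders.
    import org.apache.spark.sql.{SaveMode, SparkSession}
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder()
      .appName("spark-druid-sink-sketch")
      .getOrCreate()

    // Pre-processing step - stands in for whatever ETL runs today.
    val events = spark.read
      .parquet("s3://example-bucket/raw/events/")  // placeholder input path
      .filter(col("timestamp").isNotNull)

    // Hypothetical Druid output sink, per sparkDataFrame.write.format("druid"):
    events.write
      .format("druid")                       // hypothetical format name
      .option("dataSource", "events")        // target Druid datasource
      .option("timestampColumn", "timestamp")
      .option("segmentGranularity", "DAY")
      .mode(SaveMode.Overwrite)
      .save()

Compared to the 1st approach, this removes the hand-off between the pre-processing job and a separate ingestion task - the segments are built inside the same Spark application that cleans the data.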
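And for the reader approach Julian describes, a rough sketch of the first step - asking the broker which segments cover an interval via a segmentMetadata query (a native Druid query type), before scanning the segment files in deep storage directly. The broker URL, datasource, and interval are placeholders:

    import java.net.{HttpURLConnection, URL}
    import java.nio.charset.StandardCharsets
    import scala.io.Source

    // segmentMetadata is Druid's native metadata query; it returns the ids,
    // intervals, and sizes of the segments covering the queried interval.
    val query =
      """{
        |  "queryType": "segmentMetadata",
        |  "dataSource": "events",
        |  "intervals": ["2020-01-01/2020-02-01"]
        |}""".stripMargin

    val url = new URL("http://broker.example.com:8082/druid/v2/")  // placeholder broker
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(query.getBytes(StandardCharsets.UTF_8))

    // A reader would map the returned segment ids to their deep-storage
    // locations and scan those files with Spark, bypassing the historicals.
    println(Source.fromInputStream(conn.getInputStream, "UTF-8").mkString)
    conn.disconnect()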