GitHub proposal: https://github.com/apache/druid/issues/9780. I'll send a separate email to the dev list in the morning as well.
On Thu, Apr 2, 2020 at 11:04 AM Julian Jaffe <jja...@pinterest.com> wrote:

> I had a few hours last night, so I worked up a rough cut of a Spark reader
> <https://github.com/JulianJaffePinterest/druid/tree/spark_druid_connector>
> to help think through some design decisions. I've only run it locally and
> haven't attempted to hook it up to a real cluster or metadata instance yet,
> but in keeping with the Apache way I figured I'd share it now instead of
> waiting until I'd fully built it and then asking for feedback. Hopefully
> this can help us all reach consensus on what these connectors should look
> like.
>
> On Sun, Mar 22, 2020 at 3:19 AM itai yaffe <itai.ya...@gmail.com> wrote:
>
>> Hey everyone,
>> I created the initial design doc:
>> https://docs.google.com/document/d/112VsrCKhtqtUTph5yXMzsaoxtz9wX1U2poi1vxuDswY/edit?usp=sharing
>> It lays out the motivation and a few more details (as discussed on the
>> different channels).
>> Let's start working on it together, and then we can get Gian's review.
>>
>> BTW - the doc is currently open for everyone to edit; let me know if you
>> think I should change that.
>>
>> On 2020/03/11 22:33:19, itai yaffe <itai.ya...@gmail.com> wrote:
>>> Hey Rajiv,
>>> Can you please provide some details on the use case of querying Druid
>>> from Spark (e.g. what type of queries, how big the result set is, and
>>> any other information you think is relevant)?
>>>
>>> Thanks!
>>>
>>> On Tue, Mar 10, 2020 at 6:08 PM Rajiv Mordani <rmord...@vmware.com.invalid> wrote:
>>>
>>>> As part of the requirements, please include querying/reading from Spark
>>>> as well. This is a high priority for us.
>>>>
>>>> - Rajiv
>>>>
>>>> On 3/10/20, 1:26 AM, "Oguzhan Mangir" <sosyalmedya.oguz...@gmail.com> wrote:
>>>>
>>>>> What should we do for that? I think we can start to write requirements
>>>>> and flows.
>>>>>
>>>>> On 2020/03/05 20:19:38, Julian Jaffe <jja...@pinterest.com.INVALID> wrote:
>>>>>> Yeah, I think the primary objective here is a standalone writer from
>>>>>> Spark to Druid.
>>>>>>
>>>>>> On Thu, Mar 5, 2020 at 11:43 AM itai yaffe <itai.ya...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Julian!
>>>>>>> I'm actually targeting for this connector to provide write
>>>>>>> capabilities (at least as a first phase), rather than focusing on
>>>>>>> read capabilities.
>>>>>>> Having said that, I definitely see the value (even for the use cases
>>>>>>> in my company) of having a reader that queries S3 segments directly!
>>>>>>> Funny, we too have implemented a mechanism (although a very simple
>>>>>>> one) to get the locations of the segments through
>>>>>>> SegmentMetadataQueries, to allow batch-oriented queries to work
>>>>>>> against the deep storage :)
>>>>>>>
>>>>>>> Anyway, as I said, I think we can focus on write capabilities for
>>>>>>> now, and worry about read capabilities later (if that's OK).
>>>>>>>
>>>>>>> On 2020/03/05 18:29:09, Julian Jaffe <jja...@pinterest.com.INVALID> wrote:
>>>>>>>> The spark-druid-connector you shared brings up another design
>>>>>>>> decision we should probably talk through. That connector effectively
>>>>>>>> wraps an HTTP query client with Spark plumbing. An alternative
>>>>>>>> approach (and the one I ended up building due to our business
>>>>>>>> requirements) is to build a reader that operates directly over the
>>>>>>>> S3 segments, shifting load for what are likely very large and
>>>>>>>> non-interactive queries off Druid-specific hardware (with the
>>>>>>>> exception of a few SegmentMetadataQueries to get location info).
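[Editor's note: for context, a SegmentMetadataQuery is a native Druid query type; a minimal request body looks roughly like the sketch below. The datasource name and interval are illustrative only.]

```json
{
  "queryType": "segmentMetadata",
  "dataSource": "wikipedia",
  "intervals": ["2020-01-01/2020-02-01"]
}
```

POSTed to a Broker at /druid/v2/, it returns per-segment metadata (segment ids, intervals, column info), which a reader can use to locate the segments it needs to scan.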
>>>>>>>>
>>>>>>>> On Thu, Mar 5, 2020 at 8:04 AM itai yaffe <itai.ya...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I'll let Julian answer, but in the meantime, I just wanted to point
>>>>>>>>> out we might be able to draw some inspiration from this
>>>>>>>>> Spark-Redshift connector
>>>>>>>>> (https://github.com/databricks/spark-redshift#scala).
>>>>>>>>> Though it's somewhat outdated, it can probably be used as a
>>>>>>>>> reference for this new Spark-Druid connector we're planning.
>>>>>>>>> Another project to look at is
>>>>>>>>> https://github.com/SharpRay/spark-druid-connector.
>>>>>>>>>
>>>>>>>>> On 2020/03/02 14:31:27, Oğuzhan Mangır <sosyalmedya.oguz...@gmail.com> wrote:
>>>>>>>>>> I think the second option would be better. Many people use Spark
>>>>>>>>>> for batch operations with isolated clusters. My friends and I will
>>>>>>>>>> set aside time for that. Julian, can you share your experience
>>>>>>>>>> with that? After that, we can easily write up our aims,
>>>>>>>>>> requirements, and flows.
>>>>>>>>>>
>>>>>>>>>> On 2020/02/26 13:26:13, itai yaffe <itai.ya...@gmail.com> wrote:
>>>>>>>>>>> Hey,
>>>>>>>>>>> Per Gian's proposal, and following this thread in the Druid user
>>>>>>>>>>> group
>>>>>>>>>>> (https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM)
>>>>>>>>>>> and this thread in the Druid Slack channel
>>>>>>>>>>> (https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600),
>>>>>>>>>>> I'd like to start discussing the options for having Spark-based
>>>>>>>>>>> ingestion into Druid.
>>>>>>>>>>>
>>>>>>>>>>> There's already an old project
>>>>>>>>>>> (https://github.com/metamx/druid-spark-batch) for that, so
>>>>>>>>>>> perhaps we can use it as a starting point.
>>>>>>>>>>>
>>>>>>>>>>> The thread on Slack suggested 2 approaches:
>>>>>>>>>>>
>>>>>>>>>>> 1.
*Simply replacing the Hadoop MapReduce ingestion task* - having a
>>>>>>>>>>> Spark batch job that ingests data into Druid, as a simple
>>>>>>>>>>> replacement for the Hadoop MapReduce ingestion task.
>>>>>>>>>>> Meaning - your data pipeline will have one Spark job to
>>>>>>>>>>> pre-process the data (similar to what some of us have today),
>>>>>>>>>>> and another Spark job to read the output of the previous job
>>>>>>>>>>> and create Druid segments (again - following the same pattern
>>>>>>>>>>> as the Hadoop MapReduce ingestion task).
>>>>>>>>>>> 2. *Druid output sink for Spark* - rather than having 2 separate
>>>>>>>>>>> Spark jobs, 1 for pre-processing the data and 1 for ingesting
>>>>>>>>>>> the data into Druid, you'd have a single Spark job that
>>>>>>>>>>> pre-processes the data and creates Druid segments directly, e.g.
>>>>>>>>>>> sparkDataFrame.write.format("druid") (as suggested by omngr on
>>>>>>>>>>> Slack).
>>>>>>>>>>>
>>>>>>>>>>> I personally prefer the 2nd approach - while it might be harder
>>>>>>>>>>> to implement, its benefits seem greater.
>>>>>>>>>>>
>>>>>>>>>>> I'd like to hear your thoughts and to start getting this ball
>>>>>>>>>>> rolling.
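[Editor's note: a rough sketch of what approach 2 could look like from the user's side. The "druid" format registration and all option names below are hypothetical - no such connector exists at the time of this thread - so this is purely illustrative of the proposed API shape, not a working example.]

```scala
// Hypothetical Druid output sink for Spark (approach 2 above).
// The "druid" format and every option name here are assumptions for
// illustration; the actual connector would define its own.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("druid-ingest").getOrCreate()

// Pre-processing and ingestion in one job: read raw input, transform it,
// then write Druid segments directly instead of handing off to a second job.
val events = spark.read.parquet("s3://bucket/raw-events/")  // illustrative path

events.write
  .format("druid")                     // hypothetical data source registration
  .option("dataSource", "events")      // target Druid datasource (assumed option)
  .option("timestampColumn", "ts")     // column to map to __time (assumed option)
  .option("segmentGranularity", "DAY") // segment interval size (assumed option)
  .mode(SaveMode.Overwrite)
  .save()
```

Under Spark's DataSource API, registering a "druid" format like this would let the single job own both the transformation and the segment creation/publication, which is the appeal of approach 2.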
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Itai

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
For additional commands, e-mail: dev-h...@druid.apache.org