Re: Spark-based ingestion into Druid
Hey Rajiv,
Can you please provide some details on the use case of querying Druid from Spark (e.g., what type of queries, how big the result set is, and any other information you think is relevant)?

Thanks!

On Tue, Mar 10, 2020 at 6:08 PM Rajiv Mordani wrote:
> As part of the requirements, please include querying / reading from Spark
> as well. This is a high priority for us.
>
> - Rajiv
Re: Spark-based ingestion into Druid
As part of the requirements, please include querying / reading from Spark as well. This is a high priority for us.

- Rajiv

On 3/10/20, 1:26 AM, "Oguzhan Mangir" wrote:
> What should we do for that? I think we can start writing the requirements
> and flows.
Re: Spark-based ingestion into Druid
We also have a use case of reading from Spark. However, we are using HDFS (an on-prem solution) and not S3. While writing would also be needed, our first requirement is really to query the data from Spark. We ingest into Druid via Kafka today.

- Rajiv

On 3/5/20, 11:43 AM, "itai yaffe" wrote:
> Thanks Julian!
> I'm actually aiming for this connector to provide write capabilities (at
> least as a first phase), rather than focusing on read capabilities.
Re: Spark-based ingestion into Druid
What should we do for that? I think we can start writing the requirements and flows.

On 2020/03/05 20:19:38, Julian Jaffe wrote:
> Yeah, I think the primary objective here is a standalone writer from Spark
> to Druid.
Re: Spark-based ingestion into Druid
Yeah, I think the primary objective here is a standalone writer from Spark to Druid.

On Thu, Mar 5, 2020 at 11:43 AM itai yaffe wrote:
> Anyway, as I said, I think we can focus on write capabilities for now, and
> worry about read capabilities later (if that's OK).
Re: Spark-based ingestion into Druid
Thanks Julian!
I'm actually aiming for this connector to provide write capabilities (at least as a first phase), rather than focusing on read capabilities.
Having said that, I definitely see the value (even for the use cases in my company) of having a reader that queries S3 segments directly! Funny - we too have implemented a mechanism (albeit a very simple one) to get the locations of the segments through SegmentMetadataQueries, to allow batch-oriented queries to work against the deep storage :)

Anyway, as I said, I think we can focus on write capabilities for now, and worry about read capabilities later (if that's OK).

On 2020/03/05 18:29:09, Julian Jaffe wrote:
> The spark-druid-connector you shared brings up another design decision we
> should probably talk through.
Re: Spark-based ingestion into Druid
The spark-druid-connector you shared brings up another design decision we should probably talk through. That connector effectively wraps an HTTP query client with Spark plumbing. An alternative approach (and the one I ended up building, due to our business requirements) is to build a reader that operates directly on the S3 segments, shifting the load of what are likely very large, non-interactive queries off Druid-specific hardware (with the exception of a few SegmentMetadataQueries to get location info).

On Thu, Mar 5, 2020 at 8:04 AM itai yaffe wrote:
> I'll let Julian answer, but in the meantime, I just wanted to point out we
> might be able to draw some inspiration from this Spark-Redshift connector
> (https://github.com/databricks/spark-redshift#scala).
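For context on the direct-reader approach described above: before a Spark reader can open segment files from deep storage, it has to discover which segments cover the requested intervals. Below is a minimal sketch of that discovery step in Scala, assuming a Broker at druid-broker:8082 and a datasource named "wikipedia" (both placeholders). The query shape follows Druid's documented segmentMetadata native query; the surrounding code is illustrative only.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object SegmentDiscovery {
  def main(args: Array[String]): Unit = {
    // segmentMetadata is a documented Druid native query type; per segment,
    // it returns the segment id (which encodes interval and version), size,
    // and per-column metadata.
    val query =
      """{
        |  "queryType": "segmentMetadata",
        |  "dataSource": "wikipedia",
        |  "intervals": ["2020-01-01/2020-02-01"],
        |  "merge": false
        |}""".stripMargin

    val request = HttpRequest.newBuilder()
      .uri(URI.create("http://druid-broker:8082/druid/v2/")) // assumed Broker address
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(query))
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())

    // Each result's "id" identifies a segment; the actual deep-storage
    // location (e.g. the S3/HDFS loadSpec) would then be resolved, for
    // example via the Coordinator's metadata API, before Spark tasks open
    // the segment files directly.
    println(response.body())
  }
}
```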
Re: Spark-based ingestion into Druid
I'll let Julian answer, but in the meantime, I just wanted to point out we might be able to draw some inspiration from this Spark-Redshift connector (https://github.com/databricks/spark-redshift#scala).
Though it's somewhat outdated, it can probably be used as a reference for this new Spark-Druid connector we're planning.
Another project to look at is https://github.com/SharpRay/spark-druid-connector.

On 2020/03/02 14:31:27, Oğuzhan Mangır wrote:
> I think the second option would be better. Many people use Spark for batch
> operations with isolated clusters.
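As a concrete note on what drawing inspiration from spark-redshift might mean: that connector exposes writes through Spark's standard DataFrameWriter, which is presumably the surface a Druid sink would imitate. A rough sketch of the usage pattern from the spark-redshift README (connection values are placeholders, and the README may have drifted since):

```scala
import org.apache.spark.sql.SparkSession

object RedshiftWritePattern {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("redshift-write-example").getOrCreate()
    val df = spark.read.parquet("s3n://bucket/input/") // any DataFrame

    // The spark-redshift write pattern: Spark's standard DataFrameWriter,
    // keyed off a format name plus connector options. A Druid sink could
    // mirror this exact shape.
    df.write
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://host:5439/db?user=u&password=p")
      .option("dbtable", "my_table")          // destination table
      .option("tempdir", "s3n://bucket/tmp/") // S3 staging area used for COPY
      .mode("error")                          // fail if the table already exists
      .save()

    spark.stop()
  }
}
```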
Re: Spark-based ingestion into Druid
I've submitted https://github.com/apache/druid/pull/9454 today to add an `OnHeapMemorySegmentWriteOutMediumFactory`.
Re: Spark-based ingestion into Druid
We should be organized for that. This is a big problem for all batch operations. My friends and I will be taking time for that. Julian, can you share your experience with that? After that, we can write our aims, requirements, and flows.

On 2020/02/26 13:26:13, itai yaffe wrote:
> Per Gian's proposal, and following this thread in the Druid user group
> (https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM) and this
> thread in the Druid Slack channel
> (https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600), I'd like
> to start discussing the options for Spark-based ingestion into Druid.
Re: Spark-based ingestion into Druid
I think the second option would be better. Many people use Spark for batch operations with isolated clusters. My friends and I will be taking time for that. Julian, can you share your experience with that? After that, we can write our aims, requirements, and flows easily.

On 2020/02/26 13:26:13, itai yaffe wrote:
> Per Gian's proposal, and following this thread in the Druid user group
> (https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM) and this
> thread in the Druid Slack channel
> (https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600), I'd like
> to start discussing the options for Spark-based ingestion into Druid.
Re: Spark-based ingestion into Druid
I think that, whatever approach we take, we'll need to expose an OnHeapMemorySegmentWriteOutMediumFactory for OnHeapMemorySegmentWriteOutMedium that parallels OffHeapMemorySegmentWriteOutMediumFactory. Although off-heap index building will be faster, it's very difficult to get most schedulers to allocate off-heap resources correctly for Spark containers. I can likely get a diff up in the next day or two.

On Wed, Feb 26, 2020 at 5:26 AM itai yaffe wrote:
> Per Gian's proposal, and following this thread in the Druid user group
> (https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM) and this
> thread in the Druid Slack channel
> (https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600), I'd like
> to start discussing the options for Spark-based ingestion into Druid.
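For readers unfamiliar with the classes mentioned above: Druid decides where segment-building scratch buffers live through a SegmentWriteOutMediumFactory. A rough sketch of what an on-heap parallel to the off-heap factory could look like, written here in Scala against Druid's Java interface; the interface shape is based on the existing org.apache.druid.segment.writeout package, and the details are assumptions rather than the final code from the PR:

```scala
import java.io.File
import org.apache.druid.segment.writeout.{
  OnHeapMemorySegmentWriteOutMedium,
  SegmentWriteOutMedium,
  SegmentWriteOutMediumFactory
}

// Sketch: a singleton factory returning the existing on-heap medium, so
// Spark containers (whose schedulers size the JVM heap, not direct memory)
// can keep all segment-building buffers on heap.
object OnHeapMemorySegmentWriteOutMediumFactory
    extends SegmentWriteOutMediumFactory {

  override def makeSegmentWriteOutMedium(outDir: File): SegmentWriteOutMedium =
    // outDir is ignored: nothing is spilled to disk or direct memory.
    new OnHeapMemorySegmentWriteOutMedium()
}
```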
Spark-based ingestion into Druid
Hey,
Per Gian's proposal, and following this thread in the Druid user group (https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM) and this thread in the Druid Slack channel (https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600), I'd like to start discussing the options for Spark-based ingestion into Druid.

There's already an old project (https://github.com/metamx/druid-spark-batch) for that, so perhaps we can use it as a starting point.

The thread on Slack suggested 2 approaches:

   1. *Simply replacing the Hadoop MapReduce ingestion task* - having a
   Spark batch job that ingests data into Druid, as a simple replacement for
   the Hadoop MapReduce ingestion task.
   Meaning - your data pipeline will have a Spark job to pre-process the
   data (similar to what some of us have today), and another Spark job to
   read the output of the previous job and create Druid segments (again,
   following the same pattern as the Hadoop MapReduce ingestion task).
   2. *Druid output sink for Spark* - rather than having 2 separate Spark
   jobs, 1 for pre-processing the data and 1 for ingesting the data into
   Druid, you'll have a single Spark job that pre-processes the data and
   creates Druid segments directly, e.g. sparkDataFrame.write.format("druid")
   (as suggested by omngr on Slack); see the sketch after this message.

I personally prefer the 2nd approach - while it might be harder to implement, its benefits seem greater.

I'd like to hear your thoughts and start getting this ball rolling.

Thanks,
   Itai
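To make the 2nd approach concrete (the sketch referenced in the list above): from the pipeline author's point of view, pre-processing and segment creation would collapse into a single job ending in a DataFrameWriter call. This is a hypothetical sketch of the intended usage, not an existing API - the "druid" format name comes from the suggestion above, but every option key and its semantics are assumptions:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object DruidIngestJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("druid-ingest").getOrCreate()

    // Pre-process in the same job (approach 2): no intermediate output
    // needs to be written for a second ingestion job to pick up.
    val events = spark.read.parquet("s3://bucket/raw-events/")
      .filter("country IS NOT NULL")

    // Hypothetical Druid sink: all option names below are illustrative only.
    events.write
      .format("druid")
      .option("dataSource", "events")          // target Druid datasource
      .option("timestampColumn", "ts")         // column mapped to __time
      .option("segmentGranularity", "DAY")     // one segment shard set per day
      .option("deepStorage", "s3://bucket/druid/segments/")
      .mode(SaveMode.Overwrite)
      .save()

    spark.stop()
  }
}
```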