Re: Spark-based ingestion into Druid

2020-03-11 Thread itai yaffe
Hey Rajiv,
Can you please provide some details on the use-case of querying Druid from
Spark (e.g. what type of queries, how big the result set is, and any other
information you think is relevant)?

Thanks!


Re: Spark-based ingestion into Druid

2020-03-10 Thread Rajiv Mordani
As part of the requirements, please include querying / reading from Spark as
well. This is a high priority for us.

- Rajiv


Re: Spark-based ingestion into Druid

2020-03-10 Thread Rajiv Mordani
We also have a use case of reading from Spark. However, we are using HDFS (an
on-prem solution) and not S3. While writing would also be needed, our first
requirement is really to query the data from Spark. We ingest into Druid via
Kafka today.

- Rajiv
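
To make the requirement concrete, below is a minimal sketch of what reading Druid from Spark could eventually look like if the planned connector exposes a Spark DataSource. The "druid" format name and every option key are assumptions about an API that does not exist yet:

```scala
// Hypothetical reader usage; nothing here is an existing API.
import org.apache.spark.sql.SparkSession

object DruidReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("druid-read-sketch").getOrCreate()

    val df = spark.read
      .format("druid")                              // assumed format name
      .option("broker", "broker-host:8082")         // assumed: Broker endpoint
      .option("dataSource", "events")               // assumed: Druid datasource
      .option("intervals", "2020-01-01/2020-02-01") // assumed: interval to scan
      .load()

    df.printSchema()
  }
}
```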


Re: Spark-based ingestion into Druid

2020-03-10 Thread Oğuzhan Mangır
What will we do about that? I think we can start writing the requirements and flows.


Re: Spark-based ingestion into Druid

2020-03-05 Thread Julian Jaffe
Yeah, I think the primary objective here is a standalone writer from Spark
to Druid.



Re: Spark-based ingestion into Druid

2020-03-05 Thread itai yaffe
Thanks Julian!
I'm actually targeting write capabilities for this connector (at least as a
first phase), rather than focusing on read capabilities.
Having said that, I definitely see the value (even for the use-cases in my
company) of having a reader that queries S3 segments directly! Funny, we too
have implemented a mechanism (although a very simple one) to get the locations
of the segments through SegmentMetadataQueries, to allow batch-oriented queries
to work directly against the deep storage :)

Anyway, as I said, I think we can focus on write capabilities for now, and
worry about read capabilities later (if that's OK).



Re: Spark-based ingestion into Druid

2020-03-05 Thread Julian Jaffe
The spark-druid-connector you shared brings up another design decision we
should probably talk through. That connector effectively wraps an HTTP
query client with Spark plumbing. An alternative approach (and the one I
ended up building due to our business requirements) is to build a reader
that operates directly over the S3 segments, shifting load for what are
likely very large and non-interactive queries off Druid-specific hardware
(with the exception of a few SegmentMetadataQueries to get location info).
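
For reference, here is a minimal sketch of the SegmentMetadataQuery step described above, assuming a Broker at broker-host:8082 and a placeholder datasource named "events". Turning the returned segment ids into deep-storage locations is setup-specific and omitted:

```scala
// Sketch: ask the Broker for segment metadata over the interval to be read.
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object SegmentMetadataSketch {
  def main(args: Array[String]): Unit = {
    // Druid native segmentMetadata query (dataSource and interval are placeholders).
    val query =
      """{
        |  "queryType": "segmentMetadata",
        |  "dataSource": "events",
        |  "intervals": ["2020-01-01/2020-02-01"]
        |}""".stripMargin

    val request = HttpRequest.newBuilder()
      .uri(URI.create("http://broker-host:8082/druid/v2/"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(query))
      .build()

    // The response is a JSON array with one entry per segment (id, size, etc.),
    // which a reader could map to the corresponding deep-storage objects.
    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body())
  }
}
```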



Re: Spark-based ingestion into Druid

2020-03-05 Thread itai yaffe
I'll let Julian answer, but in the meantime, I just wanted to point out we 
might be able to draw some inspiration from this Spark-Redshift connector 
(https://github.com/databricks/spark-redshift#scala).
Though it's somewhat outdated, it probably can be used as a reference for this 
new Spark-Druid connector we're planning.
Another project to look at is https://github.com/SharpRay/spark-druid-connector.
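
For reference, the usage pattern spark-redshift exposes looks roughly like this (paraphrased from its README, with placeholder values); a Spark-Druid connector could mirror the same shape:

```scala
import org.apache.spark.sql.SparkSession

object SparkRedshiftPattern {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("redshift-pattern").getOrCreate()

    // Read: the connector unloads the table to S3 and reads it from there.
    val df = spark.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://host:5439/db?user=u&password=p")
      .option("dbtable", "my_table")
      .option("tempdir", "s3n://bucket/tmp/")
      .load()

    // Write: the reverse path, staging through S3 and issuing a COPY.
    df.write
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://host:5439/db?user=u&password=p")
      .option("dbtable", "my_table_copy")
      .option("tempdir", "s3n://bucket/tmp/")
      .mode("error")
      .save()
  }
}
```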




Re: Spark-based ingestion into Druid

2020-03-03 Thread Julian Jaffe
I've submitted https://github.com/apache/druid/pull/9454 today to add an
`OnHeapMemorySegmentWriteOutMediumFactory`.
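
Assuming the PR lands in roughly that shape, selecting the new medium from an ingestion spec might look like the sketch below; the "onHeapMemory" type name is an assumption pending the PR's final form:

```scala
// Sketch: a tuningConfig fragment choosing the on-heap writeout medium,
// embedded in a Scala string for illustration.
val tuningConfigFragment: String =
  """{
    |  "type": "index_parallel",
    |  "segmentWriteOutMediumFactory": { "type": "onHeapMemory" }
    |}""".stripMargin
```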



Re: Spark-based ingestion into Druid

2020-03-02 Thread Oğuzhan Mangır
We should get organized for this. This is a big problem for all batch
operations. My friends and I will be setting aside time for it. Julian, can you
share your experiences with it? After that, we can write our aims,
requirements, and flows.


Re: Spark-based ingestion into Druid

2020-03-02 Thread Oğuzhan Mangır
I think the second option would be better. Many people use Spark for batch
operations with isolated clusters. My friends and I will be taking time for this.
Julian, can you share your experiences with it? After that, we can easily write
our aims, requirements, and flows.




Re: Spark-based ingestion into Druid

2020-02-27 Thread Julian Jaffe
I think for whatever approach we take, we'll need to expose an
OnHeapMemorySegmentWriteOutMediumFactory for OnHeapMemorySegmentWriteOutMedium
that parallels OffHeapMemorySegmentWriteOutMediumFactory. Although off-heap
index building will be faster, it's very difficult to get most schedulers
to allocate off-heap resources correctly for Spark containers. I can likely
get a diff up in the next day or two.
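
To illustrate the scheduling pain point (this is Spark configuration, not Druid code): off-heap writeout allocates direct buffers in native memory, which cluster managers only account for when it is provisioned by hand, whereas on-heap writeout stays inside the ordinary executor heap. Values below are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object WriteOutSizingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("writeout-sizing-sketch")
      .config("spark.executor.memory", "4g")          // ordinary JVM heap
      // Off-heap writeout needs native headroom in the container request,
      // typically sized by hand per workload:
      .config("spark.executor.memoryOverhead", "3g")
      .config("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=2g")
      .getOrCreate()
    // With an on-heap medium, the two settings above can usually stay at their
    // defaults and only spark.executor.memory needs tuning.
    spark.stop()
  }
}
```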



Spark-based ingestion into Druid

2020-02-26 Thread itai yaffe
Hey,
Per Gian's proposal, and following this thread in Druid user group (
https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM) and this
thread in Druid Slack channel (
https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600), I'd like
to start discussing the options of having Spark-based ingestion into Druid.

There's already an old project (https://github.com/metamx/druid-spark-batch)
for that, so perhaps we can use that as a starting point.

The thread on Slack suggested 2 approaches:

   1. *Simply replacing the Hadoop MapReduce ingestion task* - having a
   Spark batch job that ingests data into Druid, as a simple replacement of
   the Hadoop MapReduce ingestion task.
   Meaning - your data pipeline will have a Spark job to pre-process the
   data (similar to what some of us have today), and another Spark job to read
   the output of the previous job, and create Druid segments (again -
   following the same pattern as the Hadoop MapReduce ingestion task).
   2. *Druid output sink for Spark* - rather than having 2 separate Spark
   jobs, 1 for pre-processing the data and 1 for ingesting the data into
   Druid, you'll have a single Spark job that pre-processes the data and
    creates Druid segments directly, e.g. sparkDataFrame.write.format("druid")
    (as suggested by omngr on Slack; see the sketch below).
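
For illustration, here is a minimal sketch of what approach 2 could look like from a pipeline author's point of view; the "druid" format name and all option keys are assumptions about an API still to be designed:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object DruidSinkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("druid-sink-sketch").getOrCreate()

    // Pre-processing and segment creation happen in a single job.
    val events = spark.read.parquet("s3://bucket/raw/2020-01-01/")
      .filter("country = 'US'")

    events.write
      .format("druid")                                     // assumed format name
      .option("dataSource", "events")                      // assumed: target datasource
      .option("timestampColumn", "ts")                     // assumed: primary timestamp
      .option("segmentGranularity", "DAY")                 // assumed: segment granularity
      .option("deepStorage", "s3://bucket/druid/segments") // assumed: deep storage path
      .mode(SaveMode.Overwrite)
      .save()
  }
}
```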


I personally prefer the 2nd approach - while it might be harder to
implement, it seems the benefits are greater in this approach.

I'd like to hear your thoughts and to start getting this ball rolling.

Thanks,
   Itai