Re: REST Structured Streaming Sink

2020-07-03 Thread Sam Elamin
Hi Folks,

Great discussion! I will take rate-limiting into account and make it
configurable for the HTTP requests.
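To sketch what "configurable" could mean (everything here is
hypothetical - no "rest" format or these option keys exist in Spark
today, and df stands for a streaming DataFrame):

// Hypothetical usage; the format name and every option key are made
// up for illustration only.
df.writeStream
  .format("rest")
  .option("url", "https://example.com/api/events")   // target endpoint
  .option("method", "POST")                          // HTTP verb
  .option("maxRequestsPerSecond", "10")              // the rate limit above
  .option("checkpointLocation", "/tmp/rest-sink-cp")
  .start()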

Is there anything I might have overlooked that would make this
technically impossible, or at least difficult enough not to warrant the
effort? And would it be useful to people?

My thinking, from a business perspective: why make people wait until
the next scheduled batch run for data that is already available from an
API? You could run a job every minute or hour, but that in itself
sounds like a streaming use case.

Thoughts?

Regards
Sam



Re: REST Structured Streaming Sink

2020-07-01 Thread Burak Yavuz
Well, the difference is that a technical user writes the UDF, while a
non-technical user may use this built-in thing, misconfigure it, and
shoot themselves in the foot.

On Wed, Jul 1, 2020, 6:40 PM Andrew Melo  wrote:

> If you control the machines and can run arbitrary code, you can DDoS
> whatever you want. What's the difference between this proposal and
> writing a UDF that opens 1,000 connections to a target machine?


Re: REST Structured Streaming Sink

2020-07-01 Thread Andrew Melo
On Wed, Jul 1, 2020 at 8:13 PM Burak Yavuz  wrote:
>
> I'm not sure having a built-in sink that allows you to DDoS servers is the
> best idea either. foreachWriter is typically used for such use cases, not 
> foreachBatch. It's also pretty hard to guarantee exactly-once, rate limiting, 
> etc.

If you control the machines and can run arbitrary code, you can DDoS
whatever you want. What's the difference between this proposal and
writing a UDF that opens 1,000 connections to a target machine?
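To make the comparison concrete, a sketch of such a UDF (the endpoint
is a placeholder; nothing in Spark prevents this today):

import java.net.{HttpURLConnection, URL}
import org.apache.spark.sql.functions.udf

// Sketch: an ordinary UDF opening a connection per row.
val poke = udf { (payload: String) =>
  val conn = new URL("https://example.com/api")
    .openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setDoOutput(true)
  conn.getOutputStream.write(payload.getBytes("UTF-8"))
  conn.getResponseCode // fires the request
}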




Re: REST Structured Streaming Sink

2020-07-01 Thread Holden Karau
On Wed, Jul 1, 2020 at 6:13 PM Burak Yavuz  wrote:

> I'm not sure having a built-in sink that allows you to DDoS servers is the
> best idea either
>
Do you think it would be used accidentally? If so, we could ship it
with default per-server rate limits that people would have to
explicitly tune.
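For illustration, that kind of default could be enforced with
something like Guava's RateLimiter inside a foreachBatch writer - a
sketch only, where df is an assumed streaming DataFrame and the
10 requests/sec default is invented:

import com.google.common.util.concurrent.RateLimiter
import org.apache.spark.sql.{DataFrame, Row}

// Sketch: a conservative built-in ceiling users must raise explicitly.
// RateLimiter is not serializable, so build it inside each partition.
def rateLimitedPost(rows: Iterator[Row]): Unit = {
  val limiter = RateLimiter.create(10.0) // illustrative: 10 req/s per task
  rows.foreach { row =>
    limiter.acquire()                    // blocks until a permit is available
    // ... issue the HTTP request for `row` here ...
  }
}

df.writeStream
  .foreachBatch { (batch: DataFrame, _: Long) =>
    batch.foreachPartition(rateLimitedPost _)
  }
  .start()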

> . foreachWriter is typically used for such use cases, not foreachBatch.
> It's also pretty hard to guarantee exactly-once, rate limiting, etc.
>

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: REST Structured Streaming Sink

2020-07-01 Thread Burak Yavuz
I'm not sure having a built-in sink that allows you to DDoS servers is the
best idea either. foreachWriter is typically used for such use cases, not
foreachBatch. It's also pretty hard to guarantee exactly-once, rate
limiting, etc.
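
For reference, that per-row path is the ForeachWriter API; a minimal
REST writer on top of it might look like the sketch below (Java 11+
HttpClient; the endpoint and the single string column "payload" are
assumptions):

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import org.apache.spark.sql.{ForeachWriter, Row}

// Sketch of a per-row REST writer for the foreach sink.
class RestWriter extends ForeachWriter[Row] {
  @transient private var client: HttpClient = _

  override def open(partitionId: Long, epochId: Long): Boolean = {
    client = HttpClient.newHttpClient()
    true // true = process this partition for this epoch
  }

  override def process(row: Row): Unit = {
    val request = HttpRequest
      .newBuilder(URI.create("https://example.com/api/events")) // placeholder
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(row.getAs[String]("payload")))
      .build()
    client.send(request, HttpResponse.BodyHandlers.ofString())
  }

  override def close(errorOrNull: Throwable): Unit = () // nothing to release
}

// usage: df.writeStream.foreach(new RestWriter).start()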

Best,
Burak



Re: REST Structured Streaming Sink

2020-07-01 Thread Holden Karau
I think adding something like this (if it doesn't already exist) could help
make structured streaming easier to use; foreachBatch is not the best API.


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: REST Structured Streaming Sink

2020-07-01 Thread Jungtaek Lim
I guess the method, query parameters, headers, and payload would all be
different for almost every use case - that makes it hard to generalize,
and the implementation would have to be quite complicated to be
flexible enough.

I'm not aware of any custom sink implementing REST, so your best bet
would be simply implementing your own with foreachBatch - but someone
might jump in and provide a pointer if there is something in the Spark
ecosystem.
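
For a starting point, the foreachBatch route might look roughly like
the sketch below - assuming a streaming DataFrame df with a single
string column "payload"; the endpoint URL and checkpoint path are
placeholders, and the HTTP client is the Java 11+ java.net.http one:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import org.apache.spark.sql.{DataFrame, Row}

// Sketch: POST every row of each micro-batch to a fixed endpoint.
def postPartition(rows: Iterator[Row]): Unit = {
  val client = HttpClient.newHttpClient() // one client per partition
  rows.foreach { row =>
    val request = HttpRequest
      .newBuilder(URI.create("https://example.com/api/events"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(row.getAs[String]("payload")))
      .build()
    client.send(request, HttpResponse.BodyHandlers.ofString())
  }
}

df.writeStream
  .foreachBatch { (batch: DataFrame, epochId: Long) =>
    // epochId could be used to key idempotent requests on retries
    batch.foreachPartition(postPartition _)
  }
  .option("checkpointLocation", "/tmp/rest-sink-cp")
  .start()

Note this is at-least-once: a replayed batch re-sends its rows, so the
requests would have to be idempotent to get close to exactly-once.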

Thanks,
Jungtaek Lim (HeartSaVioR)



REST Structured Streaming Sink

2020-07-01 Thread Sam Elamin
Hi All,


We ingest a lot of RESTful APIs into our lake, and I'm wondering if it
is at all possible to create a REST sink in structured streaming?

For now I'm only focusing on RESTful services that have an incremental
ID, so my sink can just poll for new data and then ingest it.
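
The polling side would be roughly this shape (a sketch; the endpoint
and its since_id parameter are invented):

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Sketch of the poll-and-ingest loop.
val client = HttpClient.newHttpClient()
var lastId = 0L
while (true) {
  val request = HttpRequest
    .newBuilder(URI.create(s"https://example.com/api/items?since_id=$lastId"))
    .GET()
    .build()
  val body = client.send(request, HttpResponse.BodyHandlers.ofString()).body()
  // ... parse body, append the new records to the lake, advance lastId ...
  Thread.sleep(60 * 1000L) // poll once a minute
}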

I can't seem to find a connector that does this, and my gut instinct
tells me it's probably because it isn't possible due to something
completely obvious that I am missing.

I know some RESTful APIs obfuscate the IDs into hashed strings, and
that could be a problem, but since I'm planning to focus on just
numerical IDs that get incremented, I don't think I'll be facing that
issue.


Can anyone let me know if this sounds like a daft idea? Will I need
something like Kafka or Kinesis as a buffer and for redundancy, or am I
overthinking this?


I would love to bounce ideas around with people who run structured
streaming jobs in production.


Kind regards
Sam