Hi Tamas,

Thanks a lot for your suggestion! I will also investigate this one later.

Best regards,
Yang

2017-01-03 12:38 GMT+01:00 Tamas Szuromi <tamas.szur...@odigeo.com>:

>
> You can also try https://github.com/zendesk/maxwell
>
> Tamas
>
> On 3 January 2017 at 12:25, Amrit Jangid <amrit.jan...@goibibo.com> wrote:
>
>> You can try out *debezium*: https://github.com/debezium. It reads data
>> from the binlogs, provides structure, and streams it into Kafka.
>>
>> Now Kafka can be your new source for streaming.
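>>
>> For example, once Debezium is writing change events to a Kafka topic, Spark
>> Structured Streaming can consume them directly. A minimal sketch in Scala
>> (Spark 2.1 with the spark-sql-kafka-0-10 package; the topic name below is
>> just a placeholder for whatever Debezium produces for one of your tables):
>>
>> import org.apache.spark.sql.SparkSession
>>
>> val spark = SparkSession.builder.appName("cdc-from-kafka").getOrCreate()
>>
>> // Subscribe to the (hypothetical) topic holding the table's change events.
>> val changes = spark.readStream
>>   .format("kafka")
>>   .option("kafka.bootstrap.servers", "localhost:9092")
>>   .option("subscribe", "dbserver1.inventory.orders")
>>   .option("startingOffsets", "earliest")
>>   .load()
>>
>> // Debezium ships each change event as JSON in the Kafka message value.
>> val events = changes.selectExpr("CAST(value AS STRING) AS json")
>>
>> // The console sink is only for demonstration; any sink works here.
>> val query = events.writeStream.format("console").outputMode("append").start()
>> query.awaitTermination()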
>>
>> On Tue, Jan 3, 2017 at 4:36 PM, Yuanzhe Yang <yyz1...@gmail.com> wrote:
>>
>>> Hi Hongdi,
>>>
>>> Thanks a lot for your suggestion. The data is truly immutable and the
>>> table is append-only. But there are actually several different databases
>>> involved, so the only feature they have in common, and that I can depend
>>> on, is JDBC...
>>>
>>> Best regards,
>>> Yang
>>>
>>>
>>> 2016-12-30 6:45 GMT+01:00 任弘迪 <ryan.hd....@gmail.com>:
>>>
>>>> Why not sync the MySQL binlog (hopefully the data is immutable and the
>>>> table is append-only), send the log through Kafka, and then consume it
>>>> with Spark Streaming?
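>>>>
>>>> Roughly like the sketch below (a rough outline only, using the
>>>> spark-streaming-kafka-0-10 DStream API; the topic name "mysql-binlog" is a
>>>> placeholder for whatever your binlog shipper writes to):
>>>>
>>>> import org.apache.kafka.common.serialization.StringDeserializer
>>>> import org.apache.spark.SparkConf
>>>> import org.apache.spark.streaming.{Seconds, StreamingContext}
>>>> import org.apache.spark.streaming.kafka010.KafkaUtils
>>>> import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
>>>> import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
>>>>
>>>> val ssc = new StreamingContext(new SparkConf().setAppName("binlog-consumer"), Seconds(10))
>>>>
>>>> val kafkaParams = Map[String, Object](
>>>>   "bootstrap.servers" -> "localhost:9092",
>>>>   "key.deserializer" -> classOf[StringDeserializer],
>>>>   "value.deserializer" -> classOf[StringDeserializer],
>>>>   "group.id" -> "binlog-consumer",
>>>>   "auto.offset.reset" -> "earliest")
>>>>
>>>> // Each record value is one binlog/change event; parse and transform it here.
>>>> val stream = KafkaUtils.createDirectStream[String, String](
>>>>   ssc, PreferConsistent, Subscribe[String, String](Seq("mysql-binlog"), kafkaParams))
>>>> stream.map(_.value).print()
>>>>
>>>> ssc.start()
>>>> ssc.awaitTermination()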
>>>>
>>>> On Fri, Dec 30, 2016 at 9:01 AM, Michael Armbrust <
>>>> mich...@databricks.com> wrote:
>>>>
>>>>> We don't support this yet, but I've opened this JIRA as it sounds
>>>>> generally useful: https://issues.apache.org/jira/browse/SPARK-19031
>>>>>
>>>>> In the meantime you could try implementing your own Source, but that
>>>>> is pretty low-level and not yet a stable API.
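>>>>>
>>>>> For reference, the internal trait lives in
>>>>> org.apache.spark.sql.execution.streaming and, as of 2.1, looks roughly
>>>>> like the skeleton below; a JDBC-backed source would have to express its
>>>>> high-water mark (e.g. the max timestamp already read) as the Offset. This
>>>>> is only a sketch of the shape of the API, with every body left
>>>>> unimplemented, and the details may change between releases:
>>>>>
>>>>> import org.apache.spark.sql.{DataFrame, SQLContext}
>>>>> import org.apache.spark.sql.execution.streaming.{Offset, Source}
>>>>> import org.apache.spark.sql.types.StructType
>>>>>
>>>>> // Hypothetical incremental JDBC source; method bodies deliberately omitted.
>>>>> class JdbcIncrementalSource(sqlContext: SQLContext, url: String, table: String)
>>>>>     extends Source {
>>>>>
>>>>>   // Schema of the rows this source produces.
>>>>>   override def schema: StructType = ???
>>>>>
>>>>>   // Latest position visible in the table (e.g. max(event_time)) as an Offset.
>>>>>   override def getOffset: Option[Offset] = ???
>>>>>
>>>>>   // Rows between the two offsets, read with a bounded JDBC query.
>>>>>   override def getBatch(start: Option[Offset], end: Offset): DataFrame = ???
>>>>>
>>>>>   override def stop(): Unit = ()
>>>>> }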
>>>>>
>>>>> On Thu, Dec 29, 2016 at 4:05 AM, "Yuanzhe Yang (杨远哲)" <
>>>>> yyz1...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Thanks a lot for your contributions to bringing us new technologies.
>>>>>>
>>>>>> I don't want to waste your time, so before writing to you I googled and
>>>>>> checked Stack Overflow and the mailing list archive with the keywords
>>>>>> "streaming" and "jdbc", but I was not able to find any solution to my
>>>>>> use case. I hope I can get some clarification from you.
>>>>>>
>>>>>> The use case is quite straightforward: I need to harvest a relational
>>>>>> database via JDBC, do something with the data, and store the result in
>>>>>> Kafka. I am stuck at the first step, and the difficulty is as follows:
>>>>>>
>>>>>> 1. The database is too large to ingest with one thread.
>>>>>> 2. The database is dynamic, and time-series data comes in constantly.
>>>>>>
>>>>>> The ideal workflow would be that multiple workers process partitions of
>>>>>> data incrementally according to a time window. For example, the
>>>>>> processing starts from the earliest data, with each batch containing
>>>>>> data for one hour. If the ingestion speed is faster than the data
>>>>>> production speed, then eventually the entire database will be harvested,
>>>>>> the workers will start to "tail" the database for new data, and the
>>>>>> processing becomes real-time.
>>>>>>
>>>>>> With Spark SQL I can ingest data from a JDBC source with partitions
>>>>>> divided by time windows, but how can I increment the time windows
>>>>>> dynamically during execution? Assume there are two workers ingesting the
>>>>>> data of 2017-01-01 and 2017-01-02; the one that finishes first should
>>>>>> get the next task, for 2017-01-03. But I have not been able to find out
>>>>>> how to increment those values during execution.
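>>>>>>
>>>>>> To make the question concrete: what I can write today is a driver-side
>>>>>> loop like the sketch below (table, column and connection details are
>>>>>> made up), where each hourly window is split into several JDBC predicates
>>>>>> so multiple workers share it. What I do not see is how to make the
>>>>>> window advance inside one long-running streaming job rather than a
>>>>>> hand-rolled loop:
>>>>>>
>>>>>> import java.util.Properties
>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>
>>>>>> val spark = SparkSession.builder.appName("jdbc-incremental").getOrCreate()
>>>>>> val props = new Properties()
>>>>>> props.setProperty("user", "...")
>>>>>> props.setProperty("password", "...")
>>>>>>
>>>>>> var windowStart = java.sql.Timestamp.valueOf("2017-01-01 00:00:00")
>>>>>> val oneHourMs = 60L * 60 * 1000
>>>>>>
>>>>>> while (true) { // in reality: stop once caught up, then poll for new rows
>>>>>>   val windowEnd = new java.sql.Timestamp(windowStart.getTime + oneHourMs)
>>>>>>   // One predicate per partition, so several workers ingest the same hour.
>>>>>>   val predicates = (0 until 4).map { p =>
>>>>>>     s"event_time >= '$windowStart' AND event_time < '$windowEnd' AND id % 4 = $p"
>>>>>>   }.toArray
>>>>>>   val batch = spark.read.jdbc("jdbc:mysql://host/db", "events", predicates, props)
>>>>>>   // ... transform `batch` and write the result to Kafka here ...
>>>>>>   windowStart = windowEnd
>>>>>> }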
>>>>>>
>>>>>> Then I looked into Structured Streaming. It looks much more promising,
>>>>>> because window operations based on event time are supported during
>>>>>> streaming, which could be the solution to my use case. However, in the
>>>>>> documentation and code examples I did not find anything related to
>>>>>> streaming data from a growing database. Is there anything I can read to
>>>>>> achieve my goal?
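>>>>>>
>>>>>> For instance, I understand that event-time windowing along the lines of
>>>>>> the small sketch below already exists; the part I am missing is how to
>>>>>> obtain such a streaming DataFrame from a growing JDBC table in the first
>>>>>> place:
>>>>>>
>>>>>> import org.apache.spark.sql.DataFrame
>>>>>> import org.apache.spark.sql.functions.{col, window}
>>>>>>
>>>>>> // streamingDf would be a streaming DataFrame with an event_time column,
>>>>>> // which is exactly what I do not know how to create from a JDBC source.
>>>>>> def hourlyCounts(streamingDf: DataFrame): DataFrame =
>>>>>>   streamingDf.groupBy(window(col("event_time"), "1 hour")).count()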
>>>>>>
>>>>>> Any suggestion is highly appreciated. Thank you very much and have a
>>>>>> nice day.
>>>>>>
>>>>>> Best regards,
>>>>>> Yang
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>>
>> Regards,
>> Amrit
>> Data Team
>>
>
>
