You can try out *Debezium*: https://github.com/debezium. It reads data from binlogs, gives the changes structure, and streams them into Kafka. Kafka can then be your new source for streaming.
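A minimal sketch of what consuming such a Debezium change topic with Spark Structured Streaming could look like (the broker address and the topic name "dbserver1.inventory.orders" are assumptions; Debezium names topics <server>.<database>.<table>):

import org.apache.spark.sql.SparkSession

object DebeziumCdcConsumer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("debezium-cdc-consumer")
      .getOrCreate()

    // Subscribe to the change topic that Debezium writes for one table.
    val changes = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "dbserver1.inventory.orders")
      .option("startingOffsets", "earliest")
      .load()

    // Each record value is a JSON change event carrying before/after row
    // images, the operation type, and a timestamp.
    val events = changes.selectExpr("CAST(value AS STRING) AS json")

    // Print to the console for illustration; any sink works here.
    events.writeStream
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination()
  }
}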
On Tue, Jan 3, 2017 at 4:36 PM, Yuanzhe Yang <yyz1...@gmail.com> wrote:

> Hi Hongdi,
>
> Thanks a lot for your suggestion. The data is truly immutable and the
> table is append-only, but there are actually several different databases
> involved, so the only feature they all share and that I can depend on is
> JDBC...
>
> Best regards,
> Yang
>
> 2016-12-30 6:45 GMT+01:00 任弘迪 <ryan.hd....@gmail.com>:
>
>> Why not sync the binlog of MySQL (hopefully the data is immutable and
>> the table is append-only), send the log through Kafka, and then consume
>> it with Spark Streaming?
>>
>> On Fri, Dec 30, 2016 at 9:01 AM, Michael Armbrust
>> <mich...@databricks.com> wrote:
>>
>>> We don't support this yet, but I've opened this JIRA as it sounds
>>> generally useful: https://issues.apache.org/jira/browse/SPARK-19031
>>>
>>> In the meantime you could try implementing your own Source, but that
>>> is pretty low level and is not yet a stable API.
>>>
>>> On Thu, Dec 29, 2016 at 4:05 AM, "Yuanzhe Yang (杨远哲)"
>>> <yyz1...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Thanks a lot for your contributions that bring us new technologies.
>>>>
>>>> I don't want to waste your time, so before writing to you I googled
>>>> and checked Stack Overflow and the mailing-list archive with the
>>>> keywords "streaming" and "jdbc", but I was not able to find a solution
>>>> to my use case. I hope I can get some clarification from you.
>>>>
>>>> The use case is quite straightforward: I need to harvest a relational
>>>> database via JDBC, do something with the data, and store the result in
>>>> Kafka. I am stuck at the first step, and the difficulty is as follows:
>>>>
>>>> 1. The database is too large to ingest with one thread.
>>>> 2. The database is dynamic, and time-series data comes in constantly.
>>>>
>>>> An ideal workflow, then, is for multiple workers to process partitions
>>>> of the data incrementally according to a time window. For example, the
>>>> processing starts from the earliest data, with each batch containing
>>>> data for one hour. If the ingestion speed is faster than the data
>>>> production speed, eventually the entire database will be harvested,
>>>> the workers will start to "tail" the database for new data, and the
>>>> processing becomes real time.
>>>>
>>>> With Spark SQL I can ingest data from a JDBC source with partitions
>>>> divided by time windows, but how can I dynamically increment the time
>>>> windows during execution? Assume two workers are ingesting the data
>>>> for 2017-01-01 and 2017-01-02; the one that finishes more quickly gets
>>>> the next task, for 2017-01-03. But I have not been able to find out
>>>> how to increment those values during execution.
>>>>
>>>> Then I looked into Structured Streaming. It looks much more promising,
>>>> because window operations based on event time are supported during
>>>> streaming, which could be the solution to my use case. However, in the
>>>> documentation and code examples I did not find anything related to
>>>> streaming data from a growing database. Is there anything I can read
>>>> to achieve my goal?
>>>>
>>>> Any suggestion is highly appreciated. Thank you very much and have a
>>>> nice day.
>>>>
>>>> Best regards,
>>>> Yang

--
Regards,
Amrit
Data Team
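Since Structured Streaming does not have a JDBC source yet (see SPARK-19031 above), one interim way to realize the incremental time-window idea from this thread is a plain driver-side loop over spark.read.jdbc that advances the window after each pass and forwards the rows to Kafka. A minimal sketch, assuming a Spark build with the Kafka batch sink; the JDBC URL, credentials, table and column names, topic, and window size are all illustrative:

import java.sql.Timestamp
import java.util.Properties

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, to_json}

// Driver-side loop that harvests one time window per pass and starts
// to "tail" the table once it has caught up with the present.
object IncrementalJdbcHarvest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("jdbc-incremental-harvest")
      .getOrCreate()

    val url = "jdbc:mysql://dbhost:3306/mydb"   // assumed connection details
    val props = new Properties()
    props.setProperty("user", "reader")
    props.setProperty("password", "secret")

    val windowMs   = 60 * 60 * 1000L   // one hour of data per batch
    val partitions = 4                 // parallel JDBC readers per window
    var lowerMs    = Timestamp.valueOf("2017-01-01 00:00:00").getTime
    // In practice, persist this offset so a restarted driver can resume.

    while (true) {
      val upperMs = lowerMs + windowMs
      if (upperMs > System.currentTimeMillis()) {
        Thread.sleep(60 * 1000L)       // caught up: wait for new data
      } else {
        // One predicate per partition, so several workers share the window.
        val step = windowMs / partitions
        val predicates = (0 until partitions).map { i =>
          val lo = new Timestamp(lowerMs + i * step)
          val hi = new Timestamp(lowerMs + (i + 1) * step)
          s"event_time >= '$lo' AND event_time < '$hi'"
        }.toArray

        val batch = spark.read.jdbc(url, "events", predicates, props)

        // Forward the rows to Kafka as JSON; "id" and the topic are assumed.
        batch.select(
            col("id").cast("string").as("key"),
            to_json(struct(batch.columns.map(col): _*)).as("value"))
          .write
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("topic", "harvested-events")
          .save()

        lowerMs = upperMs              // advance the window
      }
    }
  }
}

Only the window's lower bound needs to be tracked between passes; persisting it externally would let a restarted driver resume where it left off.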