Re: [External] Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-04 Thread Ben Teeuwen
Another option: https://github.com/mysql-time-machine/replicator. From the readme: "Replicates data changes from MySQL binlog to HBase or Kafka. In case of HBase, preserves the previous data versions. HBase storage is intended for auditing purposes of historical data. In addition, special

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Ayan, This "inline view" idea is really awesome and enlightening! Finally I have a plan to move on. I greatly appreciate your help! Best regards, Yang 2017-01-03 18:14 GMT+01:00 ayan guha : > Ahh, I see what you mean... I confused two terminologies... because we were >

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread ayan guha
Ahh, I see what you mean... I confused two terminologies... because we were talking about partitioning and then changed topic to identifying changed data. For that, you can "construct" a dbtable as an inline view - viewSQL = "(select * from table where >
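A minimal sketch of that inline-view trick, assuming Spark 2.x against MySQL; the table name, column name, watermark value, and connection details below are all illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("inline-view").getOrCreate()

    // Watermark captured after the previous run (illustrative value).
    val lastSeen = "2017-01-03 00:00:00"

    // A parenthesized subquery with an alias is accepted wherever a table
    // name is; the database then returns only the matching rows.
    val viewSQL = s"(select * from events where inserted_on > '$lastSeen') t"

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://host:3306/db")
      .option("dbtable", viewSQL)
      .option("user", "user")
      .option("password", "password")
      .load()

Because the filter runs inside the database, only the new rows cross the wire.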

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Ayan, Yeah, I understand your proposal, but according to http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases, it says "Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in table." So all rows in
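For contrast with the inline view above, a sketch of the partitioned read the quoted passage describes; the four options only control how the read is split across tasks (names and bounds are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("partitioned-read").getOrCreate()

    // numPartitions parallel queries are issued, each covering one stride of
    // [lowerBound, upperBound] on the partition column. No rows are filtered:
    // rows outside the bounds simply land in the first or last partition.
    val partitioned = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://host:3306/db")
      .option("dbtable", "events")
      .option("partitionColumn", "id")
      .option("lowerBound", "0")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")
      .load()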

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread ayan guha
Hi You need to capture and store the max of the column you intend to use for identifying new records (e.g. INSERTED_ON) after every successful run of your job. Then, use that value in the lowerBound option. Essentially, you want to create a query like select * from table where INSERTED_ON >
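A sketch of that store-the-max pattern, combined with the inline-view correction further up the thread (the filter goes into the query itself, since lowerBound alone does not filter); loadWatermark and saveWatermark are hypothetical helpers for whatever durable store holds the last value:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.max

    val spark = SparkSession.builder().appName("incremental-jdbc").getOrCreate()

    var lastMax = loadWatermark()  // hypothetical helper: read last stored max

    val fresh = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://host:3306/db")
      .option("dbtable", s"(select * from events where INSERTED_ON > '$lastMax') t")
      .load()

    if (fresh.count() > 0) {
      fresh.write.mode("append").parquet("/data/events")
      // Persist the new high-watermark only after the write succeeded.
      lastMax = fresh.agg(max("INSERTED_ON")).head().get(0).toString
      saveWatermark(lastMax)       // hypothetical helper: store for next run
    }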

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Ayan, Thanks a lot for your suggestion. I am currently looking into sqoop. Concerning your suggestion for Spark, it is indeed parallelized with multiple workers, but the job is one-off and cannot keep streaming. Moreover, I cannot specify any "start row" in the job; it will always ingest the

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread ayan guha
Hi While the solutions provided by others look promising and I'd like to try out a few of them, our old pal sqoop already "does" the job. It has an incremental mode where you can provide a --check-column and --last-value combination to grab the data - and yes, sqoop essentially does it by
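For reference, the shape of such an incremental sqoop run; connection details and the starting value are illustrative, and the value passed to --last-value would be the maximum captured by the previous run:

    sqoop import \
      --connect jdbc:mysql://host:3306/db \
      --username user -P \
      --table events \
      --incremental append \
      --check-column id \
      --last-value 42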

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Tamas, Thanks a lot for your suggestion! I will also investigate this one later. Best regards, Yang 2017-01-03 12:38 GMT+01:00 Tamas Szuromi : > > You can also try https://github.com/zendesk/maxwell > > Tamas > > On 3 January 2017 at 12:25, Amrit Jangid

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Amrit, Thanks a lot for your suggestion! I will investigate it later. Best regards, Yang 2017-01-03 12:25 GMT+01:00 Amrit Jangid : > You can try out *debezium*: https://github.com/debezium. It reads data > from binlogs, provides structure, and streams it into Kafka. >

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Ayan, Thanks a lot for such a detailed response. I really appreciate it! I think this use case can be generalized, because the data is immutable and append-only. We only need to find one column or timestamp to track the last row consumed in the previous ingestion. This pattern should be

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Tamas Szuromi
You can also try https://github.com/zendesk/maxwell Tamas On 3 January 2017 at 12:25, Amrit Jangid wrote: > You can try out *debezium*: https://github.com/debezium. It reads data > from binlogs, provides structure, and streams it into Kafka. > > Now Kafka can be your new

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Amrit Jangid
You can try out *debezium*: https://github.com/debezium. It reads data from binlogs, provides structure, and streams it into Kafka. Now Kafka can be your new source for streaming. On Tue, Jan 3, 2017 at 4:36 PM, Yuanzhe Yang wrote: > Hi Hongdi, > > Thanks a lot for your
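Once the change events land in Kafka, Structured Streaming can pick them up directly; a sketch assuming Spark 2.x with the spark-sql-kafka-0-10 package on the classpath (broker and topic names are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("cdc-from-kafka").getOrCreate()

    // Each Kafka record's value carries one change event from the CDC tool.
    val changes = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "dbserver.db.events")
      .load()
      .selectExpr("CAST(value AS STRING) AS json")

    // Echo to the console; a real job would parse the JSON and write to a sink.
    val query = changes.writeStream.format("console").start()
    query.awaitTermination()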

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Hongdi, Thanks a lot for your suggestion. The data is truly immutable and the table is append-only. But there are actually different databases involved, so the only feature they have in common, and that I can depend on, is jdbc... Best regards, Yang 2016-12-30 6:45 GMT+01:00 任弘迪

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Michael, Thanks a lot for your ticket. At least it is the first step. Best regards, Yang 2016-12-30 2:01 GMT+01:00 Michael Armbrust : > We don't support this yet, but I've opened this JIRA as it sounds > generally useful:

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2016-12-29 Thread ayan guha
"If data ingestion speed is faster than data production speed, then eventually the entire database will be harvested and those workers will start to "tail" the database for new data streams and the processing becomes real time." This part is really database dependent. So it will be hard to

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2016-12-29 Thread 任弘迪
Why not sync the binlog of MySQL (hopefully the data is immutable and the table is append-only), send the log through Kafka, and then consume it with Spark Streaming? On Fri, Dec 30, 2016 at 9:01 AM, Michael Armbrust wrote: > We don't support this yet, but I've opened this JIRA

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2016-12-29 Thread Michael Armbrust
We don't support this yet, but I've opened this JIRA as it sounds generally useful: https://issues.apache.org/jira/browse/SPARK-19031 In the meantime you could try implementing your own Source, but that is pretty low level and not yet a stable API. On Thu, Dec 29, 2016 at 4:05 AM, "Yuanzhe
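For anyone curious what implementing your own Source entailed at the time, a rough sketch of the shape of that internal API in Spark 2.1; the class and its parameters are hypothetical, it assumes a monotonically increasing numeric tracking column, and a production source would also need offset-log integration, streaming DataFrame construction, and error handling:

    import java.util.Properties
    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.sql.execution.streaming.{LongOffset, Offset, Source}
    import org.apache.spark.sql.types.StructType

    // Hypothetical incremental JDBC source tracking a monotonically
    // increasing numeric column as the stream offset.
    class IncrementalJdbcSource(
        sqlContext: SQLContext,
        url: String,
        table: String,
        offsetColumn: String) extends Source {

      override def schema: StructType =
        sqlContext.read.jdbc(url, table, new Properties).schema

      // The highest value of the tracking column currently in the table.
      override def getOffset: Option[Offset] = {
        val row = sqlContext.read
          .jdbc(url, s"(select max($offsetColumn) m from $table) t", new Properties)
          .head()
        if (row.isNullAt(0)) None else Some(LongOffset(row.getLong(0)))
      }

      // Rows with offsetColumn in (start, end]; called once per micro-batch.
      override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
        val lower = start.collect { case LongOffset(v) => v }.getOrElse(-1L)
        val upper = end.asInstanceOf[LongOffset].offset
        sqlContext.read.jdbc(url,
          s"(select * from $table where $offsetColumn > $lower and $offsetColumn <= $upper) t",
          new Properties)
      }

      override def stop(): Unit = ()
    }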

[Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2016-12-29 Thread Yuanzhe Yang (杨远哲)
Hi all, Thanks a lot for your contributions in bringing us new technologies. I don't want to waste your time, so before writing to you I googled and checked Stack Overflow and the mailing list archive with the keywords "streaming" and "jdbc". But I was not able to find any solution to my use case. I hope I