[GitHub] spark pull request #17190: [SPARK-19478][SS] JDBC Sink [WIP]

GaalDornick Tue, 07 Mar 2017 05:47:48 -0800

GitHub user GaalDornick opened a pull request:

    https://github.com/apache/spark/pull/17190


    [SPARK-19478][SS] JDBC Sink [WIP]

    ## What changes were proposed in this pull request?
    
    Implements a Sink that stores Structured Streaming data in a 
JDBC-compliant RDBMS. It supports the Overwrite and Append modes. By default 
it provides _at-least-once_ semantics, and it can be configured to provide 
_exactly-once_ semantics.
    
    To keep track of batches that have been written to a table, it creates a 
_log_ table named <tablename>$_SINK_LOG. This table has two columns: the 
batchID and the status of the batch. The status is either COMMITTED or 
UNCOMMITTED. When the JDBC Sink receives a batch, it checks whether the sink 
log table has an entry for that batch with status = COMMITTED. If it does, the 
sink ignores the batch; otherwise it attempts the append/overwrite operation.
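
The log-table check described above can be sketched roughly as follows. This is not the code from the pull request: it is a minimal Python illustration that uses the standard-library `sqlite3` module as a stand-in for a JDBC connection. The function name `write_batch`, the single `value` column, and the point at which the UNCOMMITTED row is written are all assumptions made for the example, and `INSERT OR REPLACE` is SQLite-specific syntax.

```python
import sqlite3


def write_batch(conn, table, batch_id, rows):
    """Append rows unless the sink log already marks batch_id as COMMITTED.

    Returns True if the batch was written, False if it was skipped.
    Hypothetical helper; the target table is assumed to have one column, value.
    """
    log = f'"{table}$_SINK_LOG"'
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {log} "
        "(batchId INTEGER PRIMARY KEY, status TEXT NOT NULL)"
    )
    found = conn.execute(
        f"SELECT status FROM {log} WHERE batchId = ?", (batch_id,)
    ).fetchone()
    if found is not None and found[0] == "COMMITTED":
        return False  # batch already stored; ignore the replay

    # Assumption: the batch is marked UNCOMMITTED before any data is written.
    conn.execute(
        f"INSERT OR REPLACE INTO {log} (batchId, status) "
        "VALUES (?, 'UNCOMMITTED')",
        (batch_id,),
    )
    conn.commit()

    # Append the data, then flip the log entry to COMMITTED.
    conn.executemany(
        f'INSERT INTO "{table}" (value) VALUES (?)', [(r,) for r in rows]
    )
    conn.commit()
    conn.execute(
        f"UPDATE {log} SET status = 'COMMITTED' WHERE batchId = ?", (batch_id,)
    )
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "events" (value TEXT)')
write_batch(conn, "events", 0, ["a", "b"])  # first delivery: written
write_batch(conn, "events", 0, ["a", "b"])  # replay of a COMMITTED batch: skipped
```

In this sketch a crash between the data write and the status update leaves the batch UNCOMMITTED, so a retry appends the rows again. That is the at-least-once behavior the description refers to.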
    
    To enable _exactly-once_ mode, the client should create a column in the 
original table that stores the batchID. This column should be of LongType. The 
name of the column should be passed in the options under the name 
_batchIdCol_. If the JDBC Sink finds that this option is set, it uses 
_exactly-once_ mode. In this mode, it sets the _batchIdCol_ column of each 
record to the id of the batch that inserts or overwrites it. Also, at the 
beginning of a batch, if it finds a batch with status = UNCOMMITTED, it 
deletes the records in the original table that match that batchID.
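
A rough sketch of the exactly-once path, under the same caveats: this is an illustration in Python with `sqlite3` standing in for JDBC, not the pull request's implementation. The names `write_batch_exactly_once` and `fail_after` are hypothetical, the target table is assumed to have a `value` column plus the batch id column, and the sketch assumes the UNCOMMITTED entry being cleaned up belongs to the batch that is being retried.

```python
import sqlite3


def write_batch_exactly_once(conn, table, batch_id_col, batch_id, rows,
                             fail_after=None):
    """Write rows tagged with batch_id, cleaning up a previous partial attempt.

    fail_after is a hypothetical test hook: raise after that many rows
    to simulate a crash in the middle of a batch.
    """
    log = f'"{table}$_SINK_LOG"'
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {log} "
        "(batchId INTEGER PRIMARY KEY, status TEXT NOT NULL)"
    )
    found = conn.execute(
        f"SELECT status FROM {log} WHERE batchId = ?", (batch_id,)
    ).fetchone()
    if found is not None and found[0] == "COMMITTED":
        return False  # already stored; ignore the replay
    if found is not None:
        # UNCOMMITTED: an earlier attempt may have left partial rows behind.
        conn.execute(
            f'DELETE FROM "{table}" WHERE "{batch_id_col}" = ?', (batch_id,)
        )
    else:
        conn.execute(
            f"INSERT INTO {log} (batchId, status) VALUES (?, 'UNCOMMITTED')",
            (batch_id,),
        )
    conn.commit()

    for i, value in enumerate(rows):
        if fail_after is not None and i == fail_after:
            raise RuntimeError("simulated crash mid-batch")
        conn.execute(
            f'INSERT INTO "{table}" (value, "{batch_id_col}") VALUES (?, ?)',
            (value, batch_id),
        )
        conn.commit()  # commit per row so a crash leaves a partial batch

    conn.execute(
        f"UPDATE {log} SET status = 'COMMITTED' WHERE batchId = ?", (batch_id,)
    )
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "events" (value TEXT, batch_id INTEGER)')
try:
    write_batch_exactly_once(conn, "events", "batch_id", 7, ["a", "b"],
                             fail_after=1)
except RuntimeError:
    pass  # one row of batch 7 was stored before the simulated crash
write_batch_exactly_once(conn, "events", "batch_id", 7, ["a", "b"])  # retry
```

Because every record carries its batch id, the retry can delete the leftover rows of the failed attempt before writing the batch again, so the table ends up with each record once.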
    
    ## How was this patch tested?
    
    Added JDBCSinkSuite, which is modeled on the other Sink tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/GaalDornick/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17190.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17190
    
----
commit 28c8bebadbb7a800c94ba7321af7d144d4678e73
Author: Jayesh Lalwani <lalwani.jay...@gmail.com>
Date:   2017-02-26T07:13:39Z

    Implemented JDBCSink

commit f838c4974d435cc19b7589e198d152b362227959
Author: Jayesh Lalwani <lalwani.jay...@gmail.com>
Date:   2017-02-26T07:15:07Z

    Merge remote-tracking branch 'upstream/master'

commit 7ac0d7899c06e7f35a3253ee57fa31f30aa946a4
Author: Jayesh Lalwani <lalwani.jay...@gmail.com>
Date:   2017-02-28T13:26:49Z

    Formatting code

commit 12086becb1ab882738349b5bb959b4b536832f12
Author: Jayesh Lalwani <lalwani.jay...@gmail.com>
Date:   2017-02-28T14:04:05Z

    Merge remote-tracking branch 'upstream/master'

commit 2a43d29a329afa27f4238d61c681fa918cd84d40
Author: Jayesh Lalwani <lalwani.jay...@gmail.com>
Date:   2017-03-01T13:04:26Z

    Merge remote-tracking branch 'upstream/master'

commit 756ea2cb32c8a85ccb98cd84d85962e1b5d37154
Author: Jayesh Lalwani <lalwani.jay...@gmail.com>
Date:   2017-03-06T14:36:23Z

    Merge remote-tracking branch 'upstream/master'

commit dde8b0b15f11c4e19361e8af485c113ef1a5b422
Author: Jayesh Lalwani <lalwani.jay...@gmail.com>
Date:   2017-03-07T13:35:48Z

    Merge remote-tracking branch 'upstream/master'

----

