[ https://issues.apache.org/jira/browse/BAHIR-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101212#comment-16101212 ]
ASF subversion and git services commented on BAHIR-110:
-------------------------------------------------------
Commit c7f158d86634d602a19a4abfd873809f8ece9d03 in bahir's branch
refs/heads/master from [~emlaver]
[ https://git-wip-us.apache.org/repos/asf?p=bahir.git;h=c7f158d ]
[BAHIR-110] Implement _changes API for sql-cloudant
- support loading Cloudant data into Spark DataFrames and SQL tables
using '_changes' endpoint
- update README to explain the new config options and differences
between '_all_docs' and '_changes' endpoints when loading data
- add test suite to test Spark DataFrames using the '_all_docs' and
'_changes' endpoints, assert Cloudant config options, and test Spark
SQL temporary views
Closes #45
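To make the difference between the two endpoints concrete, here is a minimal sketch of the request URLs involved. It is not taken from the sql-cloudant source: the function names, the base URL, and the database name are hypothetical, and the way the connector actually partitions reads is an assumption. Only the `include_docs`, `limit`, and `skip` query parameters are standard Cloudant/CouchDB parameters.

```python
from urllib.parse import urlencode


def all_docs_urls(base_url, db, total_docs, partitions):
    """Split an _all_docs read into pages that can be fetched in parallel.

    Hypothetical helper: each page is addressed with skip (offset) and
    limit (range), which is what makes parallel reads possible.
    """
    page_size = -(-total_docs // partitions)  # ceiling division
    urls = []
    for i in range(partitions):
        skip = i * page_size
        if skip >= total_docs:
            break  # no documents left for the remaining partitions
        query = urlencode({"include_docs": "true", "limit": page_size, "skip": skip})
        urls.append(f"{base_url}/{db}/_all_docs?{query}")
    return urls


def changes_url(base_url, db):
    """Build the single _changes request, read sequentially as one feed."""
    query = urlencode({"include_docs": "true"})
    return f"{base_url}/{db}/_changes?{query}"


if __name__ == "__main__":
    # Assumed local instance and database name, for illustration only.
    for url in all_docs_urls("http://localhost:5984", "flights", 10, 3):
        print(url)
    print(changes_url("http://localhost:5984", "flights"))
```

The sketch only shows why '_all_docs' can be read by several workers at once while '_changes' is consumed as a single feed; it says nothing about which one is faster in practice.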
> Implement _changes API for non-streaming receiver
> -------------------------------------------------
>
> Key: BAHIR-110
> URL: https://issues.apache.org/jira/browse/BAHIR-110
> Project: Bahir
> Issue Type: Improvement
> Reporter: Esteban Laver
> Original Estimate: 216h
> Remaining Estimate: 216h
>
> Today we use the _changes API for the Spark streaming receiver and the
> _all_docs API for the non-streaming receiver. The _all_docs API supports
> parallel reads (using offset and range), but the _changes API still performs
> better in most cases (even with only single-threaded support).
> With this ticket we want to:
> a) implement _changes API for non-streaming receivers
> b) allow customers to pick either _all_docs (default) or _changes API
> endpoint, with documentation about pros and cons
> _changes performance details:
> Successfully loaded Cloudant (using local cloudant-developer docker image)
> docs into Spark (local standalone) with the following database sizes: 15GB
> (time: 8 1/2 mins), 20GB (17 mins), 46GB (25 mins), and 75GB (48 1/2 mins).
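The load times quoted above can be converted to an approximate throughput for easier comparison. This is a small arithmetic sketch using only the four reported data points; the variable and function names are illustrative, and the figures say nothing about hardware or document shape.

```python
# Database size in GB mapped to load time in minutes, as reported in the issue.
reported_loads = {15: 8.5, 20: 17.0, 46: 25.0, 75: 48.5}


def throughput_gb_per_min(size_gb, minutes):
    """Average load rate for one run, in GB per minute."""
    return size_gb / minutes


# Rate for each reported run, rounded to two decimal places.
rates = {
    size: round(throughput_gb_per_min(size, minutes), 2)
    for size, minutes in reported_loads.items()
}

if __name__ == "__main__":
    for size, rate in rates.items():
        print(f"{size} GB: {rate} GB/min")
```

The four runs do not scale linearly with database size, so these rates are best read as rough per-run averages rather than a predictable constant.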