[
https://issues.apache.org/jira/browse/BAHIR-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075334#comment-16075334
]
ASF GitHub Bot commented on BAHIR-110:
--------------------------------------
Github user mayya-sharipova commented on a diff in the pull request:
https://github.com/apache/bahir/pull/45#discussion_r125746153
--- Diff: sql-cloudant/README.md ---
@@ -52,39 +51,61 @@ Here each subsequent configuration overrides the previous one. Thus, configurati
### Configuration in application.conf
-Default values are defined in [here](cloudant-spark-sql/src/main/resources/application.conf).
+Default values are defined in [here](src/main/resources/application.conf).
### Configuration on SparkConf
Name | Default | Meaning
--- |:---:| ---
+cloudant.apiReceiver|"_all_docs"| API endpoint for RelationProvider when loading or saving data from Cloudant to DataFrames or SQL temporary tables. Select between "_all_docs" or "_changes" endpoint.
cloudant.protocol|https|protocol to use to transfer data: http or https
-cloudant.host||cloudant host url
-cloudant.username||cloudant userid
-cloudant.password||cloudant password
+cloudant.host| |cloudant host url
+cloudant.username| |cloudant userid
+cloudant.password| |cloudant password
cloudant.useQuery|false|By default, _all_docs endpoint is used if configuration 'view' and 'index' (see below) are not set. When useQuery is enabled, _find endpoint will be used in place of _all_docs when query condition is not on primary key field (_id), so that query predicates may be driven into datastore.
cloudant.queryLimit|25|The maximum number of results returned when querying the _find endpoint.
jsonstore.rdd.partitions|10|the number of partitions intended to drive JsonStoreRDD to load query results in parallel. The actual number is calculated based on total rows returned and satisfying maxInPartition and minInPartition
jsonstore.rdd.maxInPartition|-1|the max rows in a partition. -1 means unlimited
jsonstore.rdd.minInPartition|10|the min rows in a partition.
jsonstore.rdd.requestTimeout|900000| the request timeout in milliseconds
bulkSize|200| the bulk save size
-schemaSampleSize| "-1" | the sample size for RDD schema discovery. 1 means we are using only first document for schema discovery; -1 means all documents; 0 will be treated as 1; any number N means min(N, total) docs
-createDBOnSave|"false"| whether to create a new database during save operation. If false, a database should already exist. If true, a new database will be created. If true, and a database with a provided name already exists, an error will be raised.
+schemaSampleSize|-1| the sample size for RDD schema discovery. 1 means we are using only first document for schema discovery; -1 means all documents; 0 will be treated as 1; any number N means min(N, total) docs
+createDBOnSave|false| whether to create a new database during save operation. If false, a database should already exist. If true, a new database will be created. If true, and a database with a provided name already exists, an error will be raised.
+
+The `cloudant.apiReceiver` option allows the _changes or _all_docs API endpoint to be called while loading Cloudant data into Spark DataFrames or SQL Tables,
+or saving data from DataFrames or SQL Tables to a Cloudant database.
+
+**Note:** When using the `_changes` API, please consider:
+1. Results are partially ordered and may not be presented in the order in which documents were updated.
+2. In case of shard unavailability, you may see duplicate results (changes that have already been seen).
+3. You can use the `selector` option to retrieve all revisions for docs.
+4. Only single-threaded loading is supported.
+
--- End diff --
You can add here that `_changes` supports a real snapshot of the database, representing it at a single point in time. With `_all_docs`, using partitions may not represent a true snapshot of the database, since some docs may be added or deleted in the database between loading data into different Spark partitions.
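To illustrate how the partition-related settings in the table above interact, here is a hedged Python sketch. This is not the actual JsonStoreRDD code; the function name and the exact clamping rules are assumptions based on the documented meanings of `jsonstore.rdd.partitions`, `jsonstore.rdd.maxInPartition`, and `jsonstore.rdd.minInPartition`:

```python
import math

def estimate_partitions(total_rows, partitions=10,
                        max_in_partition=-1, min_in_partition=10):
    """Illustrative only: derive a partition count from the total rows
    returned, honoring maxInPartition (-1 = unlimited) and minInPartition."""
    if total_rows <= 0:
        return 1
    n = partitions
    # Shrink n so each partition holds at least min_in_partition rows.
    n = min(n, max(1, math.ceil(total_rows / min_in_partition)))
    # If a max is set, grow n so n partitions can hold all rows.
    if max_in_partition > 0:
        n = max(n, math.ceil(total_rows / max_in_partition))
    return n

print(estimate_partitions(1000))                          # → 10
print(estimate_partitions(35))                            # → 4
print(estimate_partitions(10_000, max_in_partition=200))  # → 50
```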
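Similarly, the `schemaSampleSize` semantics from the table (1 = first document only, -1 = all documents, 0 treated as 1, N = min(N, total)) can be sketched as a hypothetical helper, again not the connector's own code:

```python
def docs_to_sample(schema_sample_size, total_docs):
    """Illustrative mapping of schemaSampleSize to the number of
    documents scanned for schema discovery, per the README table."""
    if schema_sample_size == -1:   # -1 means all documents
        return total_docs
    if schema_sample_size <= 0:    # 0 is treated as 1
        return min(1, total_docs)
    # any number N means min(N, total) docs
    return min(schema_sample_size, total_docs)

print(docs_to_sample(-1, 500))  # → 500
print(docs_to_sample(0, 500))   # → 1
print(docs_to_sample(25, 500))  # → 25
```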
> Replace use of _all_docs API with _changes API in all receivers
> ---------------------------------------------------------------
>
> Key: BAHIR-110
> URL: https://issues.apache.org/jira/browse/BAHIR-110
> Project: Bahir
> Issue Type: Improvement
> Reporter: Esteban Laver
> Original Estimate: 216h
> Remaining Estimate: 216h
>
> Today we use the _changes API for Spark streaming receiver and _all_docs API
> for non-streaming receiver. _all_docs API supports parallel reads (using
> offset and range) but performance of _changes API is still better in most
> cases (even with single threaded support).
> With this ticket we want to:
> a) re-implement all receivers using _changes API
> b) compare performance between the two implementations based on _changes and
> _all_docs
> Based on the results in b) we could decide to either
> - replace _all_docs implementation with _changes based implementation OR
> - allow customers to pick one (with a solid documentation about pros and
> cons)
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)