[
https://issues.apache.org/jira/browse/BAHIR-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094720#comment-16094720
]
ASF GitHub Bot commented on BAHIR-110:
--------------------------------------
Github user mayya-sharipova commented on the issue:
https://github.com/apache/bahir/pull/45
@emlaver
I am getting the following unexpected behaviour:
I have a database with 13 docs and 1 deleted doc. When displaying
`df.count`, I get `14`, which is incorrect. When displaying the dataframe,
the last record contains NULL values:
+--------+---+--------------------+-----------+
|_deleted|_id| _rev|airportName|
+--------+---+--------------------+-----------+
| null|DEL|1-67f14f8891a9f32...| Delhi|
| null|JFK|1-ee8206c8e56a114...| New York|
| null|SVO|1-7d18769b68f6099...| Moscow|
| null|FRA|1-f358b62b0499340...| Frankfurt|
| null|HKG|1-b040e40df5d0080...| Hong Kong|
| null|CDG|1-8c51e401185272e...| Paris|
| null|FCO|1-89431c8db8aa8e4...| Rome|
| null|NRT|1-dce312ac1414110...| Tokyo|
| null|LHR|1-303c622ad8380c9...| London|
| null|BOM|2-a3f39a0741938c4...| Mumbaii|
| null|YUL|1-19a9fe9cace23ec...| Montreal|
| null|IKA|1-3dea74452ca86af...| Tehran|
| null|SIN|1-67037272289432e...| Singapore|
| true|SYD|2-1cc4f2c62db144a...| null|
+--------+---+--------------------+-----------+
We should NOT load any deleted documents into the dataframe. A user may have
thousands or millions of deleted documents. We should load only undeleted docs,
and the dataframe should NOT have a `"_deleted"` column.
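A minimal sketch of the filtering being asked for, assuming the standard CouchDB/Cloudant `_changes` response shape (the field names in the sample payload below are illustrative, not the connector's actual internals): tombstone entries carry `"deleted": true` and no usable document body, so they can be dropped before any rows are built, and the `_deleted` field stripped from the surviving docs.

```python
# Hypothetical _changes response; shape follows the CouchDB _changes API,
# but the concrete values here are made up for illustration.
changes_response = {
    "results": [
        {"id": "DEL", "changes": [{"rev": "1-67f1"}],
         "doc": {"_id": "DEL", "airportName": "Delhi"}},
        {"id": "SYD", "changes": [{"rev": "2-1cc4"}],
         "deleted": True},  # tombstone: should never reach the dataframe
    ],
    "last_seq": "14-g1AAAA",
}

def undeleted_docs(response):
    """Yield only documents whose change entry is not a tombstone."""
    for row in response["results"]:
        if row.get("deleted"):       # skip deleted docs entirely
            continue
        doc = dict(row.get("doc", {}))
        doc.pop("_deleted", None)    # never surface a _deleted column
        yield doc

rows = list(undeleted_docs(changes_response))
```

With this filter applied, the sample feed above yields one row (`DEL`) and no `_deleted` key, which matches the expected `df.count` of 13 for the database described above.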
________
Another error:
Occasionally, when running the `CloudantDF.py` example, I get an error:
File "/Cloudant/bahir/sql-cloudant/examples/python/CloudantDF.py", line 45,
in <module>
df.filter(df.airportName >= 'Moscow').select("_id",'airportName').show()
File
"..spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/dataframe.py",
line 1020, in __getattr__
AttributeError: 'DataFrame' object has no attribute 'airportName'
For this PR, we can disregard this error and investigate it further in
follow-up PRs.
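For context on the traceback above: PySpark's `DataFrame` resolves unknown attributes against `df.columns` via `__getattr__`, so if schema inference occasionally misses the `airportName` field, `df.airportName` fails exactly this way. The toy class below is a sketch of that lookup mechanism (not Spark itself) plus the defensive `in df.columns` check that avoids the crash:

```python
# Sketch of PySpark's attribute-based column lookup. FakeDataFrame is a
# stand-in for pyspark.sql.DataFrame; only the __getattr__ behaviour is
# modelled here.
class FakeDataFrame:
    def __init__(self, columns):
        self.columns = list(columns)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, i.e. for
        # column-style access such as df.airportName.
        if name in self.columns:
            return f"Column<{name}>"
        raise AttributeError(
            f"'DataFrame' object has no attribute {name!r}")

df_ok = FakeDataFrame(["_id", "airportName"])
df_bad = FakeDataFrame(["_id"])  # schema inferred without the field

# Defensive pattern for the example script: check before dereferencing.
has_airport = "airportName" in df_bad.columns
```

This also suggests why the failure is intermittent: it depends on which documents the schema was sampled from.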
_____________
> Replace use of _all_docs API with _changes API in all receivers
> ---------------------------------------------------------------
>
> Key: BAHIR-110
> URL: https://issues.apache.org/jira/browse/BAHIR-110
> Project: Bahir
> Issue Type: Improvement
> Reporter: Esteban Laver
> Original Estimate: 216h
> Remaining Estimate: 216h
>
> Today we use the _changes API for the Spark streaming receiver and the
> _all_docs API for the non-streaming receiver. The _all_docs API supports
> parallel reads (using offset and range), but the performance of the _changes
> API is still better in most cases (even with single-threaded support).
> With this ticket we want to:
> a) re-implement all receivers using _changes API
> b) compare performance between the two implementations based on _changes and
> _all_docs
> Based on the results in b) we could decide to either
> - replace the _all_docs implementation with the _changes-based
> implementation, OR
> - allow customers to pick one (with solid documentation about the pros and
> cons)
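The contrast the ticket draws can be sketched as request construction (endpoint paths follow the CouchDB HTTP API; the base URL and page size below are made up): `_all_docs` lets each partition fetch its own window concurrently, while `_changes` is consumed sequentially, each request resuming from the previous `last_seq`.

```python
# Illustrative only: hypothetical database URL, not a real account.
BASE = "https://example.cloudant.com/airports"

def all_docs_partitions(total_docs, page_size):
    """_all_docs supports parallel reads: each partition gets its own
    skip/limit window, so workers can fetch pages concurrently."""
    return [
        f"{BASE}/_all_docs?include_docs=true&limit={page_size}&skip={skip}"
        for skip in range(0, total_docs, page_size)
    ]

def changes_request(since="0"):
    """_changes is read sequentially: each request continues from the
    last_seq value returned by the previous response."""
    return f"{BASE}/_changes?include_docs=true&since={since}"

urls = all_docs_partitions(total_docs=13, page_size=5)
```

The benchmark proposed in b) would then measure whether the parallelism of the first strategy actually beats the single-threaded throughput of the second.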
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)