Hi Chris,
When schemaSampleSize is set to -1, the connector scans all documents in the database, so this value adds the most overhead. A value of 1 scans only the first document. A positive integer N scans the first N documents (if N is greater than the number of documents in the database, -1 is applied, i.e. a full scan). A value of 0 or any non-integer is not permitted and will result in an error. Below is an example of adding the setting directly to your Spark context:
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("Multiple schema test")\
    .config("cloudant.host", "ACCOUNT.cloudant.com")\
    .config("cloudant.username", "USERNAME")\
    .config("cloudant.password", "PASSWORD")\
    .config("jsonstore.rdd.schemaSampleSize", -1)\
    .getOrCreate()
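To see why the sample size matters, here is a small, self-contained sketch (plain Python, no Spark or Cloudant required; the documents and the infer_schema helper are made up for illustration) of how inferring a schema from only the first N documents can miss fields that appear later:

```python
# Hypothetical documents with varying schemas: the "discount" field
# appears only in the last document.
docs = [
    {"_id": "1", "item": "pen",  "price": 2},
    {"_id": "2", "item": "book", "price": 12},
    {"_id": "3", "item": "lamp", "price": 30, "discount": 0.1},
]

def infer_schema(documents, sample_size):
    """Return the union of field names over the sampled documents.

    sample_size == -1 means scan everything; a sample_size larger
    than the number of documents also scans everything, mirroring
    the connector's described behavior.
    """
    sample = documents if sample_size == -1 else documents[:sample_size]
    fields = set()
    for doc in sample:
        fields.update(doc.keys())
    return sorted(fields)

print(infer_schema(docs, 1))   # first document only: "discount" is missed
print(infer_schema(docs, -1))  # full scan: all fields are present
```

With sample_size=1 the inferred schema is ['_id', 'item', 'price']; only a larger sample (or -1) picks up 'discount'. The same trade-off applies in the connector: a larger sample gives a more complete schema at the cost of scanning more documents.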
And here is how the option can be used as a local setting, applied to a single temporary table:
spark.sql("CREATE TEMPORARY TABLE `schema-test` USING com.cloudant.spark OPTIONS ( schemaSampleSize '10', database 'schema-test')")
schemaTestTable = spark.sql("SELECT * FROM `schema-test`")
This and some additional information can be found here: https://github.com/cloudant-labs/spark-cloudant#schema-variance. The same documentation will soon be added to the bahir/sql-cloudant project.
Thanks,
Esteban