Hi Chris,
 
When schemaSampleSize is set to -1, the connector scans every document in the database; this adds the most overhead. A value of 1 scans only the first document. A positive integer N scans N documents (if N is greater than the number of documents in the database, the connector falls back to the -1 behavior and scans them all). 0 or any non-integer value is not permitted and results in an error. Below is an example of adding the setting directly to your Spark session:

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("Multiple schema test")\
    .config("cloudant.host", "ACCOUNT.cloudant.com")\
    .config("cloudant.username", "USERNAME")\
    .config("cloudant.password", "PASSWORD")\
    .config("jsonstore.rdd.schemaSampleSize", -1)\
    .getOrCreate()

And here is how the option can be used as a local setting applied to a single RDD:
spark.sql("CREATE TEMPORARY TABLE schema_test USING com.cloudant.spark OPTIONS ( schemaSampleSize '10', database 'schema-test')")
schemaTestTable = spark.sql("SELECT * FROM schema_test")
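To make the sampling rules concrete, here is a small, hypothetical Python sketch (the function name and signature are my own, not connector code) of how a schemaSampleSize value maps to the number of documents scanned:

```python
def docs_to_scan(schema_sample_size, total_docs):
    """Illustrative only: mirrors the sampling rules described above."""
    if not isinstance(schema_sample_size, int) or schema_sample_size == 0:
        # 0 or any non-integer value is not permitted
        raise ValueError("schemaSampleSize must be a non-zero integer")
    if schema_sample_size == -1:
        return total_docs  # -1: scan every document in the database
    # Positive N: scan N documents, capped at the database size
    return min(schema_sample_size, total_docs)

print(docs_to_scan(-1, 500))    # 500: full scan
print(docs_to_scan(1, 500))     # 1: first document only
print(docs_to_scan(1000, 500))  # 500: N exceeds the document count
```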
 
This and some additional information can be found at https://github.com/cloudant-labs/spark-cloudant#schema-variance, and will soon be added to the bahir/sql-cloudant project.
 
Thanks,
Esteban
