Nicholas Chammas created SPARK-15193:
----------------------------------------

             Summary: samplingRatio should default to 1.0 across the board
                 Key: SPARK-15193
                 URL: https://issues.apache.org/jira/browse/SPARK-15193
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, SQL
            Reporter: Nicholas Chammas
            Priority: Minor


The default sampling ratio for {{jsonRDD}} is 
[1.0|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.jsonRDD],
 whereas for {{createDataFrame}} it's 
[{{None}}|http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame].

I think the default sampling ratio should be 1.0 across the board. Users who 
know their dataset has a consistent structure can explicitly supply a lower 
sampling ratio; otherwise, the safer default is to check all the data when 
inferring the schema.
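To illustrate why a sampling ratio below 1.0 is risky as a default, here is a 
minimal sketch in plain Python (no Spark dependency; {{infer_schema}} is a 
hypothetical stand-in for Spark's schema inference): a field that appears in 
only a few records can be absent from the sampled subset, so the inferred 
schema silently drops it.

```python
import random

# 1000 records; exactly one carries an extra field.
records = [{"id": i} for i in range(1000)]
records[999]["rare_field"] = True

def infer_schema(rows, sampling_ratio=1.0, seed=42):
    """Stand-in for schema inference: union of keys over a sampled subset."""
    rng = random.Random(seed)
    if sampling_ratio < 1.0:
        rows = [r for r in rows if rng.random() < sampling_ratio]
    schema = set()
    for row in rows:
        schema.update(row.keys())
    return schema

full = infer_schema(records, sampling_ratio=1.0)
sampled = infer_schema(records, sampling_ratio=0.1)

print("rare_field" in full)     # always True: every record was checked
print("rare_field" in sampled)  # may be False: the one record may be skipped
```

With {{sampling_ratio=1.0}} the rare field is always found; with a lower ratio 
the result depends on which records the sample happens to include, which is 
exactly the surprise a conservative default avoids.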

Targeting this for 2.0 in case we consider it a breaking change that would be 
more difficult to get in later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
