[jira] [Created] (SPARK-15867) TABLESAMPLE BUCKET semantics don't match Hive's

Andrew Or (JIRA) Fri, 10 Jun 2016 00:43:43 -0700

Andrew Or created SPARK-15867:
---------------------------------

             Summary: TABLESAMPLE BUCKET semantics don't match Hive's
                 Key: SPARK-15867
                 URL: https://issues.apache.org/jira/browse/SPARK-15867
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Andrew Or



{code}
SELECT * FROM boxes TABLESAMPLE (BUCKET 3 OUT OF 16)
{code}

In Hive, this selects the 3rd bucket out of every 16 buckets in the table. 
E.g. if the table is clustered into 32 buckets, then this samples the 3rd and 
the 19th buckets. (See 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling)
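
To make the Hive behavior concrete, here is a minimal sketch of the bucket selection. The function name hive_sampled_buckets is made up for illustration and is not part of Hive. It assumes buckets are numbered from 1 and only covers the case where the table's bucket count is a multiple of the denominator.

```python
def hive_sampled_buckets(x, y, num_buckets):
    """Return the 1-based bucket numbers that TABLESAMPLE (BUCKET x OUT OF y)
    would pick on a table clustered into num_buckets buckets.

    Hypothetical helper: it only handles num_buckets being a multiple of y.
    """
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    if num_buckets % y != 0:
        raise ValueError("sketch only covers num_buckets being a multiple of y")
    # Take the x-th bucket out of every consecutive group of y buckets.
    return list(range(x, num_buckets + 1, y))


print(hive_sampled_buckets(3, 16, 32))  # [3, 19]
```

With 32 buckets this returns the 3rd and the 19th bucket, matching the example above. With exactly 16 buckets it returns only the 3rd.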

In Spark, however, we simply sample a 3/16 fraction of the input rows, 
regardless of how the table is bucketed.
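
For contrast, here is a rough sketch of what fraction-based row sampling looks like. This is not Spark's actual code. The name fraction_sample and the row-to-bucket mapping (row id modulo 32) are assumptions made only for illustration.

```python
import random


def fraction_sample(rows, x, y, seed=None):
    """Keep each row independently with probability x / y.

    Hypothetical stand-in for row-level sampling; bucket layout is ignored.
    """
    rng = random.Random(seed)
    fraction = x / y
    return [row for row in rows if rng.random() < fraction]


rows = list(range(100_000))
sample = fraction_sample(rows, 3, 16, seed=42)

# Assumed layout for illustration: row id modulo 32 plays the role of the bucket.
buckets_hit = {row % 32 for row in sample}

print(len(sample) / len(rows))  # close to 3/16 = 0.1875
print(len(buckets_hit))         # far more than the 2 buckets Hive would read
```

The sampled rows come from essentially every bucket, so the result is roughly 3/16 of the table rather than the contents of two specific buckets.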

We should either not support this syntax in Spark at all, or implement it in a 
way that's consistent with Hive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

