[jira] [Commented] (SPARK-19628) Duplicate Spark jobs in 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-19628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226559#comment-16226559 ] Jork Zijlstra commented on SPARK-19628:
---
Hello [~guilhermeslucas], I'm no longer employed at the company where we encountered the problem. [~skoning], do you still have the problem and could you help?

How much more code do you need? Usually you want to scale the test down to isolate the problem, and this is pretty much the minimal version:
{code}
spark.read.orc(...).show(20)
// or
spark.read.orc(...).collect()
{code}
Both trigger the duplicate jobs.

Regards, Jork

> Duplicate Spark jobs in 2.1.0
> -
>
> Key: SPARK-19628
> URL: https://issues.apache.org/jira/browse/SPARK-19628
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Jork Zijlstra
> Attachments: spark2.0.1.png, spark2.1.0-examplecode.png, spark2.1.0.png
>
> After upgrading to Spark 2.1.0 we noticed that there are duplicate jobs being executed. Going back to Spark 2.0.1, they are gone again.
> {code}
> import org.apache.spark.sql._
>
> object DoubleJobs {
>   def main(args: Array[String]) {
>     System.setProperty("hadoop.home.dir", "/tmp");
>
>     val sparkSession: SparkSession = SparkSession.builder
>       .master("local[4]")
>       .appName("spark session example")
>       .config("spark.driver.maxResultSize", "6G")
>       .config("spark.sql.orc.filterPushdown", true)
>       .config("spark.sql.hive.metastorePartitionPruning", true)
>       .getOrCreate()
>
>     sparkSession.sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
>
>     val paths = Seq(
>       "" //some orc source
>     )
>
>     def dataFrame(path: String): DataFrame = {
>       sparkSession.read.orc(path)
>     }
>
>     paths.foreach(path => {
>       dataFrame(path).show(20)
>     })
>   }
> }
> {code}

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20799) Unable to infer schema for ORC on S3N when secrets are in the URL
[ https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029297#comment-16029297 ] Jork Zijlstra commented on SPARK-20799:
---
Hi [~dongjoon],

Sorry that it took some time to test with a Parquet file. Our Spark cluster for the notebook got updated to Spark 2.1.1, but it wouldn't play nice with the notebook version, especially when using s3a paths. Using s3n paths I could generate the unpartitioned Parquet file.

It turns out to be a problem with Parquet files as well. It throws the same error:
{code}Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;{code}

[~ste...@apache.org] Thanks for the settings. I'm playing with and exploring the options for the s3a paths now. Don't you mean
{code}
fs.s3a.bucket.site-2.access.key=my access key
fs.s3a.bucket.site-2.secret.key=my access secret
{code}

Regards, Jork

> Unable to infer schema for ORC on S3N when secrets are in the URL
> -
>
> Key: SPARK-20799
> URL: https://issues.apache.org/jira/browse/SPARK-20799
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.1
> Reporter: Jork Zijlstra
> Priority: Minor
>
> We are getting the following exception:
> {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.{code}
> Combining the following factors will cause it:
> - Use S3
> - Use format ORC
> - Don't apply partitioning to the data
> - Embed AWS credentials in the path
> The problem is in PartitioningAwareFileIndex, def allFiles():
> {code}
> leafDirToChildrenFiles.get(qualifiedPath)
>   .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
>   .getOrElse(Array.empty)
> {code}
> leafDirToChildrenFiles uses the path WITHOUT credentials as its key, while the qualifiedPath contains the path WITH credentials.
> So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, no data is read, and the schema cannot be inferred.
> Spark does log "S3xLoginHelper:90 - The Filesystem URI contains login details. This is insecure and may be unsupported in future.", but that should not mean it stops working.
> Workaround: move the AWS credentials from the path to the SparkSession:
> {code}
> SparkSession.builder
>   .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
>   .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
> {code}
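The key/lookup mismatch described above can be illustrated with a minimal plain-Scala sketch. This is not Spark's actual code, and the bucket and file names are hypothetical; it only shows that a URI with embedded credentials (userInfo) is a different map key from the credential-free URI, which is the failure mode reported here:

```scala
import java.net.URI

object CredentialPathMismatch {
  // Hypothetical helper: drop the userInfo (credentials) component of a URI.
  def stripUserInfo(u: URI): URI =
    new URI(u.getScheme, null, u.getHost, u.getPort, u.getPath, u.getQuery, u.getFragment)

  def main(args: Array[String]): Unit = {
    // The same S3 object, once with embedded credentials and once without.
    val withCreds    = new URI("s3n://ACCESS:SECRET@bucket/data/file.orc")
    val withoutCreds = new URI("s3n://bucket/data/file.orc")

    // The index is keyed by the credential-free path...
    val leafDirToChildrenFiles = Map(withoutCreds -> Array("file.orc"))

    // ...but looked up with the credentialed qualifiedPath, so nothing is found.
    assert(leafDirToChildrenFiles.get(withCreds).isEmpty)

    // Normalizing the lookup key would make it succeed.
    assert(leafDirToChildrenFiles.get(stripUserInfo(withCreds)).isDefined)
  }
}
```

This also motivates the workaround above: with the credentials in the SparkSession configuration instead of the path, both sides of the lookup use the same credential-free URI.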
[jira] [Comment Edited] (SPARK-20799) Unable to infer schema for ORC on S3N when secrets are in the URL
[ https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029297#comment-16029297 ] Jork Zijlstra edited comment on SPARK-20799 at 5/30/17 11:50 AM:
---
Hi [~dongjoon],

Sorry that it took some time to test with a Parquet file. Our Spark cluster for the notebook got updated to Spark 2.1.1, but it wouldn't play nice with the notebook version, especially when using s3a paths. Using s3n paths I could generate the unpartitioned Parquet file.

It turns out to be a problem with Parquet files as well. It throws the same error:
{code}Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;{code}

[~ste...@apache.org] Thanks for the settings. I'm trying to get the notebook to play nice with s3a paths and am exploring the options now. Don't you mean
{code}
fs.s3a.bucket.site-2.access.key=my access key
fs.s3a.bucket.site-2.secret.key=my access secret
{code}

Regards, Jork

was (Author: jzijlstra):
Hi [~dongjoon],

Sorry that it took some time to test with a Parquet file. Our Spark cluster for the notebook got updated to Spark 2.1.1, but it wouldn't play nice with the notebook version, especially when using s3a paths. Using s3n paths I could generate the unpartitioned Parquet file.

It turns out to be a problem with Parquet files as well. It throws the same error:
{code}Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;{code}

[~ste...@apache.org] Thanks for the settings. I'm playing with and exploring the options for the s3a paths now. Don't you mean
{code}
fs.s3a.bucket.site-2.access.key=my access key
fs.s3a.bucket.site-2.secret.key=my access secret
{code}

Regards, Jork
[jira] [Commented] (SPARK-20799) Unable to infer schema for ORC on reading ORC from S3
[ https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021450#comment-16021450 ] Jork Zijlstra commented on SPARK-20799:
---
[~dongjoon] I don't know, since we don't use Parquet files. But I can of course generate one from the ORC data. I will try this tomorrow and let you know.
[jira] [Commented] (SPARK-20799) Unable to infer schema for ORC on reading ORC from S3
[ https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017278#comment-16017278 ] Jork Zijlstra commented on SPARK-20799:
---
Hi Steve,

Thanks for the quick response. We indeed no longer need the credentials to be in the path.

I indeed forgot to mention the versions we are running: Spark 2.1.1 with Hadoop 2.8.0, as you guessed. Any other information you need?

Regards, Jork
[jira] [Created] (SPARK-20799) Unable to infer schema for ORC on reading ORC from S3
Jork Zijlstra created SPARK-20799:
---
Summary: Unable to infer schema for ORC on reading ORC from S3
Key: SPARK-20799
URL: https://issues.apache.org/jira/browse/SPARK-20799
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.1.1
Reporter: Jork Zijlstra

We are getting the following exception:
{code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.{code}
Combining the following factors will cause it:
- Use S3
- Use format ORC
- Don't apply partitioning to the data
- Embed AWS credentials in the path

The problem is in PartitioningAwareFileIndex, def allFiles():
{code}
leafDirToChildrenFiles.get(qualifiedPath)
  .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
  .getOrElse(Array.empty)
{code}
leafDirToChildrenFiles uses the path WITHOUT credentials as its key, while the qualifiedPath contains the path WITH credentials. So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, no data is read, and the schema cannot be inferred.

Spark does log "S3xLoginHelper:90 - The Filesystem URI contains login details. This is insecure and may be unsupported in future.", but that should not mean it stops working.

Workaround: move the AWS credentials from the path to the SparkSession:
{code}
SparkSession.builder
  .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
  .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
{code}
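Why the failure is silent can be seen by replaying the allFiles() fallback chain with plain Scala maps. This is only a sketch with hypothetical keys, not Spark's real index, but it shows how both lookups miss and the chain falls through to an empty array instead of an error:

```scala
object AllFilesLookupSketch {
  // Hypothetical stand-ins for the two index maps in PartitioningAwareFileIndex.
  val leafDirToChildrenFiles: Map[String, Array[String]] =
    Map("s3n://bucket/data" -> Array("part-00000.orc"))
  val leafFiles: Map[String, String] = Map.empty

  // Mirrors the getOrElse chain quoted in the issue description.
  def allFiles(qualifiedPath: String): Array[String] =
    leafDirToChildrenFiles.get(qualifiedPath)
      .orElse(leafFiles.get(qualifiedPath).map(Array(_)))
      .getOrElse(Array.empty[String])

  def main(args: Array[String]): Unit = {
    // Credential-free key: the files are found.
    assert(allFiles("s3n://bucket/data").nonEmpty)
    // Credentialed key: both lookups miss, yielding an empty result
    // with no error, so schema inference fails downstream instead.
    assert(allFiles("s3n://KEY:SECRET@bucket/data").isEmpty)
  }
}
```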
[jira] [Comment Edited] (SPARK-19628) Duplicate Spark jobs in 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-19628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869880#comment-15869880 ] Jork Zijlstra edited comment on SPARK-19628 at 2/16/17 1:07 PM:
---
I have just attached a screenshot that shows the duplicate jobs when executing the example code given above. The example code uses show(), but in our application we use collect(). Both seem to trigger this duplication.

The issue is that both jobs take time (they are executed sequentially), so the execution time has doubled for the same action.

was (Author: jzijlstra):
I have just attached a screenshot that shows the duplicate jobs when executing the example code given above. The example code uses show(), but in our application we use collect(). Both seem to trigger this duplication.

The issue is that both jobs take time, so the execution time has doubled for the same action.
[jira] [Commented] (SPARK-19628) Duplicate Spark jobs in 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-19628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869880#comment-15869880 ] Jork Zijlstra commented on SPARK-19628:
---
I have just attached a screenshot that shows the duplicate jobs when executing the example code given above. The example code uses show(), but in our application we use collect(). Both seem to trigger this duplication.

The issue is that both jobs take time, so the execution time has doubled for the same action.
[jira] [Updated] (SPARK-19628) Duplicate Spark jobs in 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-19628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jork Zijlstra updated SPARK-19628:
---
Attachment: spark2.1.0-examplecode.png
[jira] [Commented] (SPARK-19628) Duplicate Spark jobs in 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-19628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869847#comment-15869847 ] Jork Zijlstra commented on SPARK-19628:
---
The attached screenshots are from our application. The code example provided is from an isolated example in which the issue also persists.
[jira] [Updated] (SPARK-19628) Duplicate Spark jobs in 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-19628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jork Zijlstra updated SPARK-19628:
---
Attachment: spark2.0.1.png
            spark2.1.0.png
[jira] [Created] (SPARK-19628) Duplicate Spark jobs in 2.1.0
Jork Zijlstra created SPARK-19628:
---
Summary: Duplicate Spark jobs in 2.1.0
Key: SPARK-19628
URL: https://issues.apache.org/jira/browse/SPARK-19628
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.1.0
Reporter: Jork Zijlstra
Fix For: 2.0.1
Attachments: spark2.0.1.png, spark2.1.0.png

After upgrading to Spark 2.1.0 we noticed that there are duplicate jobs being executed. Going back to Spark 2.0.1, they are gone again.
{code}
import org.apache.spark.sql._

object DoubleJobs {
  def main(args: Array[String]) {
    System.setProperty("hadoop.home.dir", "/tmp");

    val sparkSession: SparkSession = SparkSession.builder
      .master("local[4]")
      .appName("spark session example")
      .config("spark.driver.maxResultSize", "6G")
      .config("spark.sql.orc.filterPushdown", true)
      .config("spark.sql.hive.metastorePartitionPruning", true)
      .getOrCreate()

    sparkSession.sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

    val paths = Seq(
      "" //some orc source
    )

    def dataFrame(path: String): DataFrame = {
      sparkSession.read.orc(path)
    }

    paths.foreach(path => {
      dataFrame(path).show(20)
    })
  }
}
{code}
[jira] [Commented] (SPARK-19012) CreateOrReplaceTempView throws org.apache.spark.sql.catalyst.parser.ParseException when viewName first char is numerical
[ https://issues.apache.org/jira/browse/SPARK-19012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15787194#comment-15787194 ] Jork Zijlstra commented on SPARK-19012:
---
[~hvanhovell] It's already working for me; I was already prefixing the tableOrViewName. I thought you needed an example of how a developer's mind works in (mis)using other people's code. It's nice to see that it's been resolved in just 2 days.

> CreateOrReplaceTempView throws
> org.apache.spark.sql.catalyst.parser.ParseException when viewName first char
> is numerical
>
> Key: SPARK-19012
> URL: https://issues.apache.org/jira/browse/SPARK-19012
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.1, 2.0.2, 2.1.0
> Reporter: Jork Zijlstra
> Assignee: Dongjoon Hyun
> Fix For: 2.2.0
>
> Using a viewName whose first char is a numerical value on
> dataframe.createOrReplaceTempView(viewName: String) causes:
> {code}
> Exception in thread "main"
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '1468079114' expecting {'SELECT', 'FROM', 'ADD', 'AS',
> 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE',
> 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO',
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS',
> 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END',
> 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL',
> 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS',
> 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES',
> 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE',
> 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES',
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION',
> 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME',
> 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START',
> 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT',
> 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE',
> 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER',
> 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS',
> 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH',
> 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS',
> 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED',
> 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT',
> 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT',
> 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE,
> DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS',
> 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK',
> 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES',
> 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS',
> 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP',
> IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0)
>
> == SQL ==
> 1
> {code}
> {code}
> val tableOrViewName = "1" //fails
> val tableOrViewName = "a" //works
> sparkSession.read.orc(path).createOrReplaceTempView(tableOrViewName)
> {code}
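The prefixing workaround mentioned in the comment can be captured as a tiny helper. This is a hypothetical sketch, not Spark API: it assumes that prefixing an underscore is enough to turn a digit-leading name into a valid unquoted identifier before passing it to createOrReplaceTempView.

```scala
object TempViewNames {
  // Hypothetical helper: Spark's SQL parser rejects unquoted identifiers
  // that start with a digit, so prefix such view names with an underscore.
  def sanitizeViewName(name: String): String =
    if (name.nonEmpty && name.head.isDigit) "_" + name else name

  def main(args: Array[String]): Unit = {
    // A hash-derived name like "1468079114" would trigger the ParseException;
    // the sanitized form "_1468079114" is a valid identifier.
    assert(sanitizeViewName("1468079114") == "_1468079114")
    // Names that already start with a letter are left untouched.
    assert(sanitizeViewName("a1") == "a1")
  }
}
```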
[jira] [Comment Edited] (SPARK-19012) CreateOrReplaceTempView throws org.apache.spark.sql.catalyst.parser.ParseException when viewName first char is numerical
[ https://issues.apache.org/jira/browse/SPARK-19012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784796#comment-15784796 ] Jork Zijlstra edited comment on SPARK-19012 at 12/29/16 7:56 AM: - Good to see that its already being discussed. MSSQL also has some limitation in tableOrViewNames which is described in the documentation. Maybe updating the annotation of the method would also be enough. Having an Exception with a clear reason would definitely already a fix. [~hvanhovell] We specify our queries inside a configuration not the code. So we have this in our config: dataPath = "hdfs://" dataQuery: "SELECT column1, column2 FROM \[TABLE] WHERE 1 = 1" Since we have one SparkSession for the application and the tableOrViewName is coupled to that and we don't want to specify an extra config option for the tableOrViewname, I though I'd just use the hashcode from the dataquery as the tableOrViewName. Use that in the createOrReplaceTempView and replace \[TABLE] inside the query with that. {code} val path = "hdfs://{path}" val dataQuery = "SELECT * FROM [TABLE] LIMIT 1" val tableOrViewName = "_" + Math.abs(path.hashCode).toString + Math.abs(qry.hashCode).toString val df = sparkSession.read.orc(path) df.createOrReplaceTempView(tableOrViewName) val result = sparkSession.sqlContext.sql(qry.replace("[TABLE]", tableOrViewName)).collect {code} Later I want to check If the tableOrViewName has already been created and not call createOrReplaceTempView everytime, but this is just performance improvement. was (Author: jzijlstra): Good to see that its already being discussed. MSSQL also has some limitation in tableOrViewNames which is described in the documentation. Maybe updating the annotation of the method would also be enough. Having an Exception with a clear reason would definitely already a fix. [~hvanhovell] We specify our queries inside a configuration not the code. 
So we have this in our config: dataPath = "hdfs://" dataQuery: "SELECT column1, column2 FROM \[TABLE] WHERE 1 = 1" Since we have one SparkSession for the application and the tableOrViewName is coupled to that and we don't want to specify an extra config option for the tableOrViewname, I though I'd just use the hashcode from the dataquery as the tableOrViewName. Use that in the createOrReplaceTempView and replace \[TABLE] inside the query with that. {code} val path = "hdfs://{path}" val dataQuery = "SELECT * FROM [TABLE] LIMIT 1" val tableOrViewName = "_" + Math.abs(path.hashCode).toString + Math.abs(qry.hashCode).toString val df = sparkSession.read.orc(path) df.createOrReplaceTempView(tableOrViewName) val result = sparkSession.sqlContext.sql(qry.replace("[TABLE]", tableOrViewName)).collect {code} > CreateOrReplaceTempView throws > org.apache.spark.sql.catalyst.parser.ParseException when viewName first char > is numerical > > > Key: SPARK-19012 > URL: https://issues.apache.org/jira/browse/SPARK-19012 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1, 2.0.2 >Reporter: Jork Zijlstra > > Using a viewName where the the fist char is a numerical value on > dataframe.createOrReplaceTempView(viewName: String) causes: > {code} > Exception in thread "main" > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '1468079114' expecting {'SELECT', 'FROM', 'ADD', 'AS', > 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', > 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', > 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', > 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', > 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', > 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', > 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', > 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 
'INSERT', 'DELETE', 'INTO', 'DESCRIBE', > 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', > 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', > 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', > 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', > 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', > 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', > 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', > 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', > 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', > 'CLEAR', 'CACHE', 'UNCACHE',
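The hash-based naming scheme described in the comment above can be condensed into a small helper. This is a sketch with an assumed name (viewNameFor is not a Spark API); its only guarantee is a digit-free first character via the "_" prefix.

```scala
// Sketch of the naming scheme from the comment above. The name viewNameFor is
// an assumption for illustration, not a Spark API. The "_" prefix ensures the
// generated temp view name never starts with a digit, which is what trips the
// SQL parser (SPARK-19012).
def viewNameFor(path: String, query: String): String =
  "_" + Math.abs(path.hashCode).toString + Math.abs(query.hashCode).toString

// Intended use against an existing SparkSession (assumed `sparkSession`):
//   val name = viewNameFor(path, dataQuery)
//   sparkSession.read.orc(path).createOrReplaceTempView(name)
//   sparkSession.sql(dataQuery.replace("[TABLE]", name))
```

Note that Math.abs(Int.MinValue) is still negative and hashCode values can collide, so this is illustrative rather than collision-proof.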
[jira] [Commented] (SPARK-19012) CreateOrReplaceTempView throws org.apache.spark.sql.catalyst.parser.ParseException when viewName first char is numerical
[ https://issues.apache.org/jira/browse/SPARK-19012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784796#comment-15784796 ] Jork Zijlstra commented on SPARK-19012: --- Good to see that it's already being discussed. MSSQL also has some limitations on tableOrViewNames, which are described in its documentation. Maybe updating the method's documentation would also be enough. Having an exception with a clear reason would definitely already be a fix. [~hvanhovell] We specify our queries inside a configuration, not in the code. So we have this in our config: dataPath = "hdfs://" dataQuery: "SELECT column1, column2 FROM \[TABLE] WHERE 1 = 1" Since we have one SparkSession for the application, the tableOrViewName is coupled to that, and we don't want to specify an extra config option for the tableOrViewName, I thought I'd just use the hash code of the data query as the tableOrViewName: use it in createOrReplaceTempView and replace \[TABLE] inside the query with it. {code} val path = "hdfs://{path}" val dataQuery = "SELECT * FROM [TABLE] LIMIT 1" val tableOrViewName = "_" + Math.abs(path.hashCode).toString + Math.abs(dataQuery.hashCode).toString val df = sparkSession.read.orc(path) df.createOrReplaceTempView(tableOrViewName) val result = sparkSession.sqlContext.sql(dataQuery.replace("[TABLE]", tableOrViewName)).collect {code}
[jira] [Commented] (SPARK-19012) CreateOrReplaceTempView throws org.apache.spark.sql.catalyst.parser.ParseException when viewName first char is numerical
[ https://issues.apache.org/jira/browse/SPARK-19012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15780545#comment-15780545 ] Jork Zijlstra commented on SPARK-19012: --- The type is viewName: String, so you would expect any string to work. When executing createOrReplaceTempView(tableOrViewName: String), a ParseException of {code} == SQL == {tableOrViewName} {code} is thrown. You might not see that these are related, because it goes on about a SQL ParseException and you just defined a temp view; you might not expect that the tableOrViewName triggers some SQL parsing. A slightly clearer exception message ("viewName not supported") could be more helpful, or the identifier rule set could be added to the documentation. As you said, prefixing the tableOrViewName with a non-numerical value does the trick (although to me this feels more like a workaround).
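For readers hitting the same ParseException, the prefixing workaround mentioned above can be captured in a hypothetical guard (not part of Spark):

```scala
// Hypothetical helper (not part of Spark) implementing the workaround above:
// prefix any view name that does not start with a letter or underscore, so it
// always satisfies the parser's IDENTIFIER rule.
def safeViewName(name: String): String =
  if (name.headOption.exists(c => c.isLetter || c == '_')) name
  else "v_" + name
```

Names containing other characters the parser rejects (spaces, dashes) would still fail; this only handles the leading-digit case reported here.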
[jira] [Updated] (SPARK-19012) CreateOrReplaceTempView throws org.apache.spark.sql.catalyst.parser.ParseException when viewName first char is numerical
[ https://issues.apache.org/jira/browse/SPARK-19012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jork Zijlstra updated SPARK-19012: -- Affects Version/s: 2.0.2
[jira] [Updated] (SPARK-19012) CreateOrReplaceTempView throws org.apache.spark.sql.catalyst.parser.ParseException when viewName first char is numerical
[ https://issues.apache.org/jira/browse/SPARK-19012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jork Zijlstra updated SPARK-19012: -- Description: Using a viewName where the first char is a numerical value on dataframe.createOrReplaceTempView(viewName: String) causes: {code} Exception in thread "main" org.apache.spark.sql.catalyst.parser.ParseException: mismatched input '1468079114' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0) == SQL == 1 {code} {code} val tableOrViewName = "1" //fails val tableOrViewName = "a" //works sparkSession.read.orc(path).createOrReplaceTempView(tableOrViewName) {code}
[jira] [Created] (SPARK-19012) CreateOrReplaceTempView throws org.apache.spark.sql.catalyst.parser.ParseException when viewName first char is numerical
Jork Zijlstra created SPARK-19012: - Summary: CreateOrReplaceTempView throws org.apache.spark.sql.catalyst.parser.ParseException when viewName first char is numerical Key: SPARK-19012 URL: https://issues.apache.org/jira/browse/SPARK-19012 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.1 Reporter: Jork Zijlstra Using a viewName where the first char is a numerical value on dataframe.createOrReplaceTempView(viewName: String) causes: {code} Exception in thread "main" org.apache.spark.sql.catalyst.parser.ParseException: mismatched input '1468079114' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS',
'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0) == SQL == 1468079114 {code} {code} val tableOrViewName = "1" //fails val tableOrViewName = "a" //works sparkSession.read.orc(path).createOrReplaceTempView(tableOrViewName) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18269) NumberFormatException when reading csv for a nullable column
[ https://issues.apache.org/jira/browse/SPARK-18269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15712431#comment-15712431 ] Jork Zijlstra commented on SPARK-18269: --- Thanks for the quick response. Eagerly awaiting the Spark 2.1 release. > NumberFormatException when reading csv for a nullable column > > > Key: SPARK-18269 > URL: https://issues.apache.org/jira/browse/SPARK-18269 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Jork Zijlstra >Assignee: Hyukjin Kwon > Fix For: 2.1.0 > > > Having a schema with a nullable column throws a > java.lang.NumberFormatException: null when the data + delimiter isn't > specified in the csv. > Specifying the schema: > {code} > StructType(Array( > StructField("id", IntegerType, nullable = false), > StructField("underlyingId", IntegerType, true) > )) > {code} > Data (without trailing delimiter to specify the second column): > {code} > 1 > {code} > Read the data: > {code} > sparkSession.read > .schema(sourceSchema) > .option("header", "false") > .option("delimiter", """\t""") > .csv(files(dates): _*) > .rdd > {code} > Actual Result: > {code} > java.lang.NumberFormatException: null > at java.lang.Integer.parseInt(Integer.java:542) > at java.lang.Integer.parseInt(Integer.java:615) > at > scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272) > at scala.collection.immutable.StringOps.toInt(StringOps.scala:29) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244) > {code} > Reason: > The csv line is parsed into a Map (indexSafeTokens), which is short of one > value. So indexSafeTokens(index) throws a NullPointerException reading the > optional value which isn't in the Map. > The NullPointerException is then given to the CSVTypeCast.castTo(datum: > String, .) as the datum value. > The subsequent NumberFormatException is thrown due to the fact that a > NullPointerException cannot be cast to the type. 
> Possible fix: > - Use the provided schema to parse the line with the correct number of columns > - Since it's nullable, implement a try/catch on CSVRelation.csvParser > indexSafeTokens(index) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
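The second fix suggested above (guarding the indexSafeTokens(index) lookup) could look roughly like this; the names are assumptions for illustration, not the actual CSVRelation internals:

```scala
// Sketch of the suggested guard. `tokenOrNull` and the Map-based token store
// are assumptions for illustration; Spark's real CSV parser is structured
// differently. A short row yields null for a nullable column instead of
// feeding a failed lookup into castTo.
def tokenOrNull(tokens: Map[Int, String], index: Int, nullable: Boolean): String =
  tokens.get(index) match {
    case Some(value)      => value
    case None if nullable => null // missing optional field becomes a real null
    case None             => sys.error(s"row is missing required column $index")
  }
```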
[jira] [Comment Edited] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is
[ https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15704736#comment-15704736 ] Jork Zijlstra edited comment on SPARK-17916 at 11/29/16 9:05 AM: - I also have the same issue in 2.0.1. This code seems to be the problem: {code} private def rowToString(row: InternalRow): Seq[String] = { var i = 0 val values = new Array[String](row.numFields) while (i < row.numFields) { if (!row.isNullAt(i)) { values(i) = valueConverters(i).apply(row, i) } else { values(i) = params.nullValue } i += 1 } values } def castTo( datum: String, castType: DataType, nullable: Boolean = true, options: CSVOptions = CSVOptions()): Any = { if (nullable && datum == options.nullValue) { null } else { }{code} So first the missing value in the data is transformed into the nullValue. Then in castTo the value is checked against the nullValue, which is always true for a missing value, and cast to null. was (Author: jzijlstra): I also have the same issue in 2.0.1. This code seems to be the problem: {code} private def rowToString(row: InternalRow): Seq[String] = { var i = 0 val values = new Array[String](row.numFields) while (i < row.numFields) { if (!row.isNullAt(i)) { values(i) = valueConverters(i).apply(row, i) } else { values(i) = params.nullValue } i += 1 } values } def castTo( datum: String, castType: DataType, nullable: Boolean = true, options: CSVOptions = CSVOptions()): Any = { if (nullable && datum == options.nullValue) { null } else { }{code} So first the missing value in the data is transformed into the nullValue. Then in castTo the value is checked against the nullValue, which is always true for a missing value. 
> CSV data source treats empty string as null no matter what nullValue option is > -- > > Key: SPARK-17916 > URL: https://issues.apache.org/jira/browse/SPARK-17916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > When user configures {{nullValue}} in CSV data source, in addition to those > values, all empty string values are also converted to null. > {code} > data: > col1,col2 > 1,"-" > 2,"" > {code} > {code} > spark.read.format("csv").option("nullValue", "-") > {code} > We will find a null in both rows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
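The two-step flow quoted in the comment above can be reproduced outside Spark. This standalone sketch (simplified signatures, not the real CSVRelation/CSVTypeCast code) shows why a missing field and a field that genuinely contained nullValue become indistinguishable:

```scala
// Simplified model of the quoted logic (assumed signatures, not Spark code).
val nullValue = "-"

// Step 1 (rowToString): a missing field is rewritten to nullValue.
def fillMissing(field: Option[String]): String = field.getOrElse(nullValue)

// Step 2 (castTo): anything equal to nullValue is turned into null.
def castTo(datum: String): Any =
  if (datum == nullValue) null else datum

// Both a real "-" and a missing field end up as null; with nullValue == ""
// (the reported bug) every empty string is nulled the same way.
val fromRealDash = castTo(fillMissing(Some("-")))
val fromShortRow = castTo(fillMissing(None))
```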
[jira] [Comment Edited] (SPARK-18269) NumberFormatException when reading csv for a nullable column
[ https://issues.apache.org/jira/browse/SPARK-18269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636192#comment-15636192 ]

Jork Zijlstra edited comment on SPARK-18269 at 11/4/16 12:22 PM:
----------------------------------------------------------------

The error that is thrown is java.lang.NumberFormatException: null. In this case "null" is a NullPointerException, not the string value "null". I did try this before submitting this issue, but using "null" as the nullValue doesn't work, since the string "null" != NullPointerException. Apparently passing a NullPointerException into a parameter of type String works.

> NumberFormatException when reading csv for a nullable column
> ------------------------------------------------------------
>
> Key: SPARK-18269
> URL: https://issues.apache.org/jira/browse/SPARK-18269
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.1
> Reporter: Jork Zijlstra
>
> Having a schema with a nullable column throws a java.lang.NumberFormatException: null when the data + delimiter isn't specified in the csv.
> Specifying the schema:
> {code}
> StructType(Array(
>   StructField("id", IntegerType, nullable = false),
>   StructField("underlyingId", IntegerType, true)
> ))
> {code}
> Data (without a trailing delimiter to specify the second column):
> {code}
> 1
> {code}
> Read the data:
> {code}
> sparkSession.read
>   .schema(sourceSchema)
>   .option("header", "false")
>   .option("delimiter", """\t""")
>   .csv(files(dates): _*)
>   .rdd
> {code}
> Actual result:
> {code}
> java.lang.NumberFormatException: null
>   at java.lang.Integer.parseInt(Integer.java:542)
>   at java.lang.Integer.parseInt(Integer.java:615)
>   at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
>   at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
>   at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244)
> {code}
> Reason:
> The csv line is parsed into a Map (indexSafeTokens), which is short one value. So indexSafeTokens(index) throws a NullPointerException reading the optional value which isn't in the Map.
> The NullPointerException is then given to CSVTypeCast.castTo(datum: String, ...) as the datum value.
> The subsequent NumberFormatException is thrown because a NullPointerException cannot be cast into the type.
> Possible fixes:
> - Use the provided schema to parse the line with the correct number of columns
> - Since it's nullable, implement a try/catch around indexSafeTokens(index) in CSVRelation.csvParser
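The first suggested fix can be sketched without Spark: size the token lookup from the user-provided schema, so a short line yields null for the missing nullable columns instead of handing a null datum to Integer.parseInt. `Field` and `parseLine` are hypothetical names for illustration, not Spark code:

```scala
// Hypothetical sketch of the suggested fix: parse against the schema's
// column count so missing trailing columns become None when nullable.
case class Field(name: String, nullable: Boolean)

def parseLine(line: String, delimiter: Char, schema: Seq[Field]): Seq[Option[Int]] = {
  val tokens = line.split(delimiter)
  schema.zipWithIndex.map { case (field, i) =>
    if (i < tokens.length && tokens(i).nonEmpty)
      Some(tokens(i).toInt) // value present: cast it
    else if (field.nullable)
      None // column missing from the line, but nullable: no exception
    else
      throw new IllegalArgumentException(
        s"missing value for non-nullable column ${field.name}")
  }
}
```

With the schema from the report, the line {{1}} (no trailing tab) then parses to {{Seq(Some(1), None)}} rather than failing.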
[jira] [Created] (SPARK-18270) User's schema with non-nullable properties is overridden with true
Jork Zijlstra created SPARK-18270:
----------------------------------

Summary: User's schema with non-nullable properties is overridden with true
Key: SPARK-18270
URL: https://issues.apache.org/jira/browse/SPARK-18270
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.1
Reporter: Jork Zijlstra

A user's schema with non-nullable properties is overridden with true in CSVRelation.csvParser. The schema that is given to CSVRelation.csvParser(schema: StructType) isn't the version that the user specifies: all nullable options are set to true.

Specifying the schema:
{code}
StructType(Array(
  StructField("id", IntegerType, nullable = false),
  StructField("underlyingId", IntegerType, true)
))
{code}

Read the data:
{code}
sparkSession.read
  .schema(sourceSchema)
  .option("header", "false")
  .option("delimiter", """\t""")
  .csv(files(dates): _*)
  .rdd
{code}

Actual result: the schema inside csvParser contains only nullable = true values.
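The observed behaviour amounts to the read path relaxing every field to nullable, which can be shown in a Spark-free sketch (`SimpleField` and `relaxToNullable` are illustrative names, not Spark's API):

```scala
// Sketch of the reported behaviour: regardless of what the user
// specified, every field comes out of the parser with nullable = true.
case class SimpleField(name: String, nullable: Boolean)

def relaxToNullable(schema: Seq[SimpleField]): Seq[SimpleField] =
  schema.map(_.copy(nullable = true))

// The user's nullable = false flag on "id" does not survive:
val userSchema = Seq(
  SimpleField("id", nullable = false),
  SimpleField("underlyingId", nullable = true))
val parserSchema = relaxToNullable(userSchema)
```

The user's schema still records one non-nullable field, but the schema the parser sees is all-nullable, which is exactly the mismatch the report describes.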
[jira] [Created] (SPARK-18269) NumberFormatException when reading csv for a nullable column
Jork Zijlstra created SPARK-18269:
----------------------------------

Summary: NumberFormatException when reading csv for a nullable column
Key: SPARK-18269
URL: https://issues.apache.org/jira/browse/SPARK-18269
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.1
Reporter: Jork Zijlstra

Having a schema with a nullable column throws a java.lang.NumberFormatException: null when the data + delimiter isn't specified in the csv.

Specifying the schema:
{code}
StructType(Array(
  StructField("id", IntegerType, nullable = false),
  StructField("underlyingId", IntegerType, true)
))
{code}

Data (without a trailing delimiter to specify the second column):
{code}
1
{code}

Read the data:
{code}
sparkSession.read
  .schema(sourceSchema)
  .option("header", "false")
  .option("delimiter", """\t""")
  .csv(files(dates): _*)
  .rdd
{code}

Actual result:
{code}
java.lang.NumberFormatException: null
  at java.lang.Integer.parseInt(Integer.java:542)
  at java.lang.Integer.parseInt(Integer.java:615)
  at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
  at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
  at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244)
{code}

Reason:
The csv line is parsed into a Map (indexSafeTokens), which is short one value. So indexSafeTokens(index) throws a NullPointerException reading the optional value which isn't in the Map.
The NullPointerException is then given to CSVTypeCast.castTo(datum: String, ...) as the datum value.
The subsequent NumberFormatException is thrown because a NullPointerException cannot be cast into the type.

Possible fixes:
- Use the provided schema to parse the line with the correct number of columns
- Since it's nullable, implement a try/catch around indexSafeTokens(index) in CSVRelation.csvParser
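The stack trace above bottoms out in Integer.parseInt receiving a null datum, and that failure mode reproduces in plain Scala with no Spark involved (`castToInt` is an illustrative stand-in for the castTo call):

```scala
// Calling toInt on a null String reproduces the reported error:
// StringOps.toInt delegates to java.lang.Integer.parseInt, which throws
// a NumberFormatException when handed a null reference.
def castToInt(datum: String): Int = datum.toInt
```

Passing a null reference where a String is expected compiles fine, which matches the comment above: the null datum is only rejected once parseInt tries to read it.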