[jira] [Commented] (SPARK-19628) Duplicate Spark jobs in 2.1.0

2017-10-31 Thread Jork Zijlstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226559#comment-16226559
 ] 

Jork Zijlstra commented on SPARK-19628:
---

Hello [~guilhermeslucas],

I'm currently no longer employed at the company where we encountered the 
problem. 
[~skoning] Do you still have the problem and could you help?

How much more code do you need? Usually you want to scale down the test to find 
the problem and this is pretty much the minimal version.

{code}
spark.read.orc(...).show(20) or spark.read.orc(...).collect()
{code}
Both trigger the duplicate jobs.
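
In case it helps to confirm the doubling, here is a minimal sketch (the ORC path is a placeholder, and the listener is just a diagnostic idea, not part of the original report) that counts how many jobs a single show() triggers:
{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
import org.apache.spark.sql.SparkSession

object CountJobs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[4]")
      .appName("count jobs per action")
      .getOrCreate()

    // count every job submitted on this SparkContext
    var jobs = 0
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit = jobs += 1
    })

    spark.read.orc("/path/to/some/orc").show(20) // placeholder path

    Thread.sleep(1000) // listener events are delivered asynchronously
    println(s"jobs triggered by one show(): $jobs")
    spark.stop()
  }
}
{code}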

Regards, Jork

> Duplicate Spark jobs in 2.1.0
> -
>
> Key: SPARK-19628
> URL: https://issues.apache.org/jira/browse/SPARK-19628
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jork Zijlstra
> Attachments: spark2.0.1.png, spark2.1.0-examplecode.png, 
> spark2.1.0.png
>
>
> After upgrading to Spark 2.1.0 we noticed that duplicate jobs are 
> executed. Going back to Spark 2.0.1 they are gone again.
> {code}
> import org.apache.spark.sql._
> object DoubleJobs {
>   def main(args: Array[String]) {
> System.setProperty("hadoop.home.dir", "/tmp");
> val sparkSession: SparkSession = SparkSession.builder
>   .master("local[4]")
>   .appName("spark session example")
>   .config("spark.driver.maxResultSize", "6G")
>   .config("spark.sql.orc.filterPushdown", true)
>   .config("spark.sql.hive.metastorePartitionPruning", true)
>   .getOrCreate()
> sparkSession.sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
> val paths = Seq(
>   ""//some orc source
> )
> def dataFrame(path: String): DataFrame = {
>   sparkSession.read.orc(path)
> }
> paths.foreach(path => {
>   dataFrame(path).show(20)
> })
>   }
> }
> {code}






[jira] [Commented] (SPARK-20799) Unable to infer schema for ORC on S3N when secrets are in the URL

2017-05-30 Thread Jork Zijlstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029297#comment-16029297
 ] 

Jork Zijlstra commented on SPARK-20799:
---

Hi [~dongjoon],

Sorry that it took some time to test the Parquet file. Our Spark cluster for 
the notebook got updated to Spark 2.1.1, but it wouldn't play nice with the 
notebook version, especially when using s3a paths. Using s3n paths I could 
generate the non-partitioned Parquet file.

It also seems to be a problem with Parquet files. It throws the same error.
{code}Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable 
to infer schema for Parquet. It must be specified manually.;{code}

[~ste...@apache.org] 
Thanks for the settings. I'm now playing with and exploring the options for the 
s3a paths.

Don't you mean {code}
fs.s3a.bucket.site-2.access.key=my access key
fs.s3a.bucket.site-2.secret.key=my access secret
{code}
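
A minimal sketch of how those per-bucket keys could be passed through the SparkSession instead of the URL (assuming a bucket named site-2, Hadoop 2.8+, and placeholder environment variables for the secrets):
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[4]")
  .appName("s3a per-bucket credentials")
  // "spark.hadoop." prefixed options are copied into the Hadoop configuration
  .config("spark.hadoop.fs.s3a.bucket.site-2.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.bucket.site-2.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()

// the keys above only apply to the site-2 bucket; the path itself stays credential-free
val df = spark.read.orc("s3a://site-2/path/to/orc") // placeholder path
df.printSchema()
{code}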

Regards, jork

> Unable to infer schema for ORC on S3N when secrets are in the URL
> -
>
> Key: SPARK-20799
> URL: https://issues.apache.org/jira/browse/SPARK-20799
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Jork Zijlstra
>Priority: Minor
>
> We are getting the following exception: 
> {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
> It must be specified manually.{code}
> Combining the following factors will cause it:
> - Use S3
> - Use the ORC format
> - Don't apply partitioning on the data
> - Embed AWS credentials in the path
> The problem is in PartitioningAwareFileIndex.allFiles()
> {code}
> leafDirToChildrenFiles.get(qualifiedPath)
>   .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
>   .getOrElse(Array.empty)
> {code}
> leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the 
> qualifiedPath contains the path WITH credentials.
> So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, so no 
> data is read and the schema cannot be defined.
> Spark does log the S3xLoginHelper:90 warning - "The Filesystem URI contains 
> login details. This is insecure and may be unsupported in future." - but that 
> should not mean it stops working altogether.
> Workaround:
> Move the AWS credentials from the path to the SparkSession
> {code}
> SparkSession.builder
>   .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
>   .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
> {code}






[jira] [Comment Edited] (SPARK-20799) Unable to infer schema for ORC on S3N when secrets are in the URL

2017-05-30 Thread Jork Zijlstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029297#comment-16029297
 ] 

Jork Zijlstra edited comment on SPARK-20799 at 5/30/17 11:50 AM:
-

Hi [~dongjoon],

Sorry that it took some time to test the Parquet file. Our Spark cluster for 
the notebook got updated to Spark 2.1.1, but it wouldn't play nice with the 
notebook version, especially when using s3a paths. Using s3n paths I could 
generate the non-partitioned Parquet file.

It also seems to be a problem with Parquet files. It throws the same error.
{code}Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable 
to infer schema for Parquet. It must be specified manually.;{code}

[~ste...@apache.org] 
Thanks for the settings. I'm trying to get the notebook to play nice with s3a 
paths and am exploring the options now.

Don't you mean {code}
fs.s3a.bucket.site-2.access.key=my access key
fs.s3a.bucket.site-2.secret.key=my access secret
{code}

Regards, jork


was (Author: jzijlstra):
Hi [~dongjoon],

Sorry that it took some time to test the Parquet file. Our Spark cluster for 
the notebook got updated to Spark 2.1.1, but it wouldn't play nice with the 
notebook version, especially when using s3a paths. Using s3n paths I could 
generate the non-partitioned Parquet file.

It also seems to be a problem with Parquet files. It throws the same error.
{code}Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable 
to infer schema for Parquet. It must be specified manually.;{code}

[~ste...@apache.org] 
Thanks for the settings. I'm now playing with and exploring the options for the 
s3a paths.

Don't you mean {code}
fs.s3a.bucket.site-2.access.key=my access key
fs.s3a.bucket.site-2.secret.key=my access secret
{code}

Regards, jork

> Unable to infer schema for ORC on S3N when secrets are in the URL
> -
>
> Key: SPARK-20799
> URL: https://issues.apache.org/jira/browse/SPARK-20799
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Jork Zijlstra
>Priority: Minor
>
> We are getting the following exception: 
> {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
> It must be specified manually.{code}
> Combining the following factors will cause it:
> - Use S3
> - Use the ORC format
> - Don't apply partitioning on the data
> - Embed AWS credentials in the path
> The problem is in PartitioningAwareFileIndex.allFiles()
> {code}
> leafDirToChildrenFiles.get(qualifiedPath)
>   .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
>   .getOrElse(Array.empty)
> {code}
> leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the 
> qualifiedPath contains the path WITH credentials.
> So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, so no 
> data is read and the schema cannot be defined.
> Spark does log the S3xLoginHelper:90 warning - "The Filesystem URI contains 
> login details. This is insecure and may be unsupported in future." - but that 
> should not mean it stops working altogether.
> Workaround:
> Move the AWS credentials from the path to the SparkSession
> {code}
> SparkSession.builder
>   .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
>   .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
> {code}






[jira] [Commented] (SPARK-20799) Unable to infer schema for ORC on reading ORC from S3

2017-05-23 Thread Jork Zijlstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021450#comment-16021450
 ] 

Jork Zijlstra commented on SPARK-20799:
---

[~dongjoon]
I don't know, since we don't use Parquet files. But I can of course generate 
one from the ORC. I will try this tomorrow and let you know.

> Unable to infer schema for ORC on reading ORC from S3
> -
>
> Key: SPARK-20799
> URL: https://issues.apache.org/jira/browse/SPARK-20799
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Jork Zijlstra
>
> We are getting the following exception: 
> {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
> It must be specified manually.{code}
> Combining the following factors will cause it:
> - Use S3
> - Use the ORC format
> - Don't apply partitioning on the data
> - Embed AWS credentials in the path
> The problem is in PartitioningAwareFileIndex.allFiles()
> {code}
> leafDirToChildrenFiles.get(qualifiedPath)
>   .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
>   .getOrElse(Array.empty)
> {code}
> leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the 
> qualifiedPath contains the path WITH credentials.
> So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, so no 
> data is read and the schema cannot be defined.
> Spark does log the S3xLoginHelper:90 warning - "The Filesystem URI contains 
> login details. This is insecure and may be unsupported in future." - but that 
> should not mean it stops working altogether.
> Workaround:
> Move the AWS credentials from the path to the SparkSession
> {code}
> SparkSession.builder
>   .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
>   .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
> {code}






[jira] [Commented] (SPARK-20799) Unable to infer schema for ORC on reading ORC from S3

2017-05-19 Thread Jork Zijlstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017278#comment-16017278
 ] 

Jork Zijlstra commented on SPARK-20799:
---

Hi Steve, 

Thanks for the quick response. We indeed no longer need the credentials to be 
in the path.

I forgot to mention the versions we are running: we are using Spark 2.1.1 with 
Hadoop 2.8.0. Is there any other information you need?

Regards, Jork

> Unable to infer schema for ORC on reading ORC from S3
> -
>
> Key: SPARK-20799
> URL: https://issues.apache.org/jira/browse/SPARK-20799
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Jork Zijlstra
>
> We are getting the following exception: 
> {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
> It must be specified manually.{code}
> Combining the following factors will cause it:
> - Use S3
> - Use the ORC format
> - Don't apply partitioning on the data
> - Embed AWS credentials in the path
> The problem is in PartitioningAwareFileIndex.allFiles()
> {code}
> leafDirToChildrenFiles.get(qualifiedPath)
>   .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
>   .getOrElse(Array.empty)
> {code}
> leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the 
> qualifiedPath contains the path WITH credentials.
> So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, so no 
> data is read and the schema cannot be defined.
> Spark does log the S3xLoginHelper:90 warning - "The Filesystem URI contains 
> login details. This is insecure and may be unsupported in future." - but that 
> should not mean it stops working altogether.
> Workaround:
> Move the AWS credentials from the path to the SparkSession
> {code}
> SparkSession.builder
>   .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
>   .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
> {code}






[jira] [Created] (SPARK-20799) Unable to infer schema for ORC on reading ORC from S3

2017-05-18 Thread Jork Zijlstra (JIRA)
Jork Zijlstra created SPARK-20799:
-

 Summary: Unable to infer schema for ORC on reading ORC from S3
 Key: SPARK-20799
 URL: https://issues.apache.org/jira/browse/SPARK-20799
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.1
Reporter: Jork Zijlstra


We are getting the following exception: 
{code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
It must be specified manually.{code}

Combining the following factors will cause it:
- Use S3
- Use the ORC format
- Don't apply partitioning on the data
- Embed AWS credentials in the path

The problem is in PartitioningAwareFileIndex.allFiles()

{code}
leafDirToChildrenFiles.get(qualifiedPath)
  .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
  .getOrElse(Array.empty)
{code}

leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the 
qualifiedPath contains the path WITH credentials.
So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, so no data 
is read and the schema cannot be defined.


Spark does log the S3xLoginHelper:90 warning - "The Filesystem URI contains 
login details. This is insecure and may be unsupported in future." - but that 
should not mean it stops working altogether.

Workaround:
Move the AWS credentials from the path to the SparkSession
{code}
SparkSession.builder
.config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
.config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
{code}
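
For completeness, a minimal end-to-end sketch of this workaround (the bucket, path, and environment variable names are placeholders):
{code}
import org.apache.spark.sql.SparkSession

val awsAccessKeyId = sys.env("AWS_ACCESS_KEY_ID")         // placeholder
val awsSecretAccessKey = sys.env("AWS_SECRET_ACCESS_KEY") // placeholder

val sparkSession = SparkSession.builder
  .master("local[4]")
  .appName("orc on s3 without credentials in the path")
  .config("spark.hadoop.fs.s3n.awsAccessKeyId", awsAccessKeyId)
  .config("spark.hadoop.fs.s3n.awsSecretAccessKey", awsSecretAccessKey)
  .getOrCreate()

// with a credential-free path, leafDirToChildrenFiles and qualifiedPath line up again
val df = sparkSession.read.orc("s3n://bucket/path/to/orc") // placeholder path
df.printSchema()
{code}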






[jira] [Comment Edited] (SPARK-19628) Duplicate Spark jobs in 2.1.0

2017-02-16 Thread Jork Zijlstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869880#comment-15869880
 ] 

Jork Zijlstra edited comment on SPARK-19628 at 2/16/17 1:07 PM:


I have just attached a screenshot which shows the duplicate jobs when executing 
the example code given above.

The example code uses show(), but in our application we use collect(). Both 
seem to trigger this duplication.
The issue is that both jobs take time (they are executed sequentially), so the 
execution time has doubled for the same action.


was (Author: jzijlstra):
I have just attached a screenshot which shows the duplicate jobs when executing 
the example code given above.

The example code uses show(), but in our application we use collect(). Both 
seem to trigger this duplication.
The issue is that both jobs take time, so the execution time has doubled for 
the same action.

> Duplicate Spark jobs in 2.1.0
> -
>
> Key: SPARK-19628
> URL: https://issues.apache.org/jira/browse/SPARK-19628
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Jork Zijlstra
> Fix For: 2.0.1
>
> Attachments: spark2.0.1.png, spark2.1.0-examplecode.png, 
> spark2.1.0.png
>
>
> After upgrading to Spark 2.1.0 we noticed that duplicate jobs are 
> executed. Going back to Spark 2.0.1 they are gone again.
> {code}
> import org.apache.spark.sql._
> object DoubleJobs {
>   def main(args: Array[String]) {
> System.setProperty("hadoop.home.dir", "/tmp");
> val sparkSession: SparkSession = SparkSession.builder
>   .master("local[4]")
>   .appName("spark session example")
>   .config("spark.driver.maxResultSize", "6G")
>   .config("spark.sql.orc.filterPushdown", true)
>   .config("spark.sql.hive.metastorePartitionPruning", true)
>   .getOrCreate()
> sparkSession.sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
> val paths = Seq(
>   ""//some orc source
> )
> def dataFrame(path: String): DataFrame = {
>   sparkSession.read.orc(path)
> }
> paths.foreach(path => {
>   dataFrame(path).show(20)
> })
>   }
> }
> {code}






[jira] [Commented] (SPARK-19628) Duplicate Spark jobs in 2.1.0

2017-02-16 Thread Jork Zijlstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869880#comment-15869880
 ] 

Jork Zijlstra commented on SPARK-19628:
---

I have just attached a screenshot which shows the duplicate jobs when executing 
the example code given above.

The example code uses show(), but in our application we use collect(). Both 
seem to trigger this duplication.
The issue is that both jobs take time, so the execution time has doubled for 
the same action.

> Duplicate Spark jobs in 2.1.0
> -
>
> Key: SPARK-19628
> URL: https://issues.apache.org/jira/browse/SPARK-19628
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Jork Zijlstra
> Fix For: 2.0.1
>
> Attachments: spark2.0.1.png, spark2.1.0-examplecode.png, 
> spark2.1.0.png
>
>
> After upgrading to Spark 2.1.0 we noticed that duplicate jobs are 
> executed. Going back to Spark 2.0.1 they are gone again.
> {code}
> import org.apache.spark.sql._
> object DoubleJobs {
>   def main(args: Array[String]) {
> System.setProperty("hadoop.home.dir", "/tmp");
> val sparkSession: SparkSession = SparkSession.builder
>   .master("local[4]")
>   .appName("spark session example")
>   .config("spark.driver.maxResultSize", "6G")
>   .config("spark.sql.orc.filterPushdown", true)
>   .config("spark.sql.hive.metastorePartitionPruning", true)
>   .getOrCreate()
> sparkSession.sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
> val paths = Seq(
>   ""//some orc source
> )
> def dataFrame(path: String): DataFrame = {
>   sparkSession.read.orc(path)
> }
> paths.foreach(path => {
>   dataFrame(path).show(20)
> })
>   }
> }
> {code}






[jira] [Updated] (SPARK-19628) Duplicate Spark jobs in 2.1.0

2017-02-16 Thread Jork Zijlstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jork Zijlstra updated SPARK-19628:
--
Attachment: spark2.1.0-examplecode.png

> Duplicate Spark jobs in 2.1.0
> -
>
> Key: SPARK-19628
> URL: https://issues.apache.org/jira/browse/SPARK-19628
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Jork Zijlstra
> Fix For: 2.0.1
>
> Attachments: spark2.0.1.png, spark2.1.0-examplecode.png, 
> spark2.1.0.png
>
>
> After upgrading to Spark 2.1.0 we noticed that duplicate jobs are 
> executed. Going back to Spark 2.0.1 they are gone again.
> {code}
> import org.apache.spark.sql._
> object DoubleJobs {
>   def main(args: Array[String]) {
> System.setProperty("hadoop.home.dir", "/tmp");
> val sparkSession: SparkSession = SparkSession.builder
>   .master("local[4]")
>   .appName("spark session example")
>   .config("spark.driver.maxResultSize", "6G")
>   .config("spark.sql.orc.filterPushdown", true)
>   .config("spark.sql.hive.metastorePartitionPruning", true)
>   .getOrCreate()
> sparkSession.sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
> val paths = Seq(
>   ""//some orc source
> )
> def dataFrame(path: String): DataFrame = {
>   sparkSession.read.orc(path)
> }
> paths.foreach(path => {
>   dataFrame(path).show(20)
> })
>   }
> }
> {code}






[jira] [Commented] (SPARK-19628) Duplicate Spark jobs in 2.1.0

2017-02-16 Thread Jork Zijlstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869847#comment-15869847
 ] 

Jork Zijlstra commented on SPARK-19628:
---

The attached screenshots are from our application. The code example provided is 
from an isolated example where the issue also persisted.

> Duplicate Spark jobs in 2.1.0
> -
>
> Key: SPARK-19628
> URL: https://issues.apache.org/jira/browse/SPARK-19628
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Jork Zijlstra
> Fix For: 2.0.1
>
> Attachments: spark2.0.1.png, spark2.1.0.png
>
>
> After upgrading to Spark 2.1.0 we noticed that duplicate jobs are 
> executed. Going back to Spark 2.0.1 they are gone again.
> {code}
> import org.apache.spark.sql._
> object DoubleJobs {
>   def main(args: Array[String]) {
> System.setProperty("hadoop.home.dir", "/tmp");
> val sparkSession: SparkSession = SparkSession.builder
>   .master("local[4]")
>   .appName("spark session example")
>   .config("spark.driver.maxResultSize", "6G")
>   .config("spark.sql.orc.filterPushdown", true)
>   .config("spark.sql.hive.metastorePartitionPruning", true)
>   .getOrCreate()
> sparkSession.sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
> val paths = Seq(
>   ""//some orc source
> )
> def dataFrame(path: String): DataFrame = {
>   sparkSession.read.orc(path)
> }
> paths.foreach(path => {
>   dataFrame(path).show(20)
> })
>   }
> }
> {code}






[jira] [Updated] (SPARK-19628) Duplicate Spark jobs in 2.1.0

2017-02-16 Thread Jork Zijlstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jork Zijlstra updated SPARK-19628:
--
Attachment: spark2.0.1.png
spark2.1.0.png

> Duplicate Spark jobs in 2.1.0
> -
>
> Key: SPARK-19628
> URL: https://issues.apache.org/jira/browse/SPARK-19628
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Jork Zijlstra
> Fix For: 2.0.1
>
> Attachments: spark2.0.1.png, spark2.1.0.png
>
>
> After upgrading to Spark 2.1.0 we noticed that duplicate jobs are 
> executed. Going back to Spark 2.0.1 they are gone again.
> {code}
> import org.apache.spark.sql._
> object DoubleJobs {
>   def main(args: Array[String]) {
> System.setProperty("hadoop.home.dir", "/tmp");
> val sparkSession: SparkSession = SparkSession.builder
>   .master("local[4]")
>   .appName("spark session example")
>   .config("spark.driver.maxResultSize", "6G")
>   .config("spark.sql.orc.filterPushdown", true)
>   .config("spark.sql.hive.metastorePartitionPruning", true)
>   .getOrCreate()
> sparkSession.sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
> val paths = Seq(
>   ""//some orc source
> )
> def dataFrame(path: String): DataFrame = {
>   sparkSession.read.orc(path)
> }
> paths.foreach(path => {
>   dataFrame(path).show(20)
> })
>   }
> }
> {code}






[jira] [Created] (SPARK-19628) Duplicate Spark jobs in 2.1.0

2017-02-16 Thread Jork Zijlstra (JIRA)
Jork Zijlstra created SPARK-19628:
-

 Summary: Duplicate Spark jobs in 2.1.0
 Key: SPARK-19628
 URL: https://issues.apache.org/jira/browse/SPARK-19628
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Jork Zijlstra
 Fix For: 2.0.1
 Attachments: spark2.0.1.png, spark2.1.0.png

After upgrading to Spark 2.1.0 we noticed that duplicate jobs are 
executed. Going back to Spark 2.0.1 they are gone again.

{code}
import org.apache.spark.sql._

object DoubleJobs {
  def main(args: Array[String]) {

System.setProperty("hadoop.home.dir", "/tmp");

val sparkSession: SparkSession = SparkSession.builder
  .master("local[4]")
  .appName("spark session example")
  .config("spark.driver.maxResultSize", "6G")
  .config("spark.sql.orc.filterPushdown", true)
  .config("spark.sql.hive.metastorePartitionPruning", true)
  .getOrCreate()

sparkSession.sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

val paths = Seq(
  ""//some orc source
)

def dataFrame(path: String): DataFrame = {
  sparkSession.read.orc(path)
}

paths.foreach(path => {
  dataFrame(path).show(20)
})
  }
}
{code}






[jira] [Commented] (SPARK-19012) CreateOrReplaceTempView throws org.apache.spark.sql.catalyst.parser.ParseException when viewName first char is numerical

2016-12-30 Thread Jork Zijlstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15787194#comment-15787194
 ] 

Jork Zijlstra commented on SPARK-19012:
---

[~hvanhovell] It's already working for me; I was already prefixing the 
tableOrViewName.
I thought you needed an example of how a developer's mind works in (mis)using 
other people's code.

It's nice to see that it's been resolved in just 2 days.



> CreateOrReplaceTempView throws 
> org.apache.spark.sql.catalyst.parser.ParseException when viewName first char 
> is numerical
> 
>
> Key: SPARK-19012
> URL: https://issues.apache.org/jira/browse/SPARK-19012
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.0.2, 2.1.0
>Reporter: Jork Zijlstra
>Assignee: Dongjoon Hyun
> Fix For: 2.2.0
>
>
> Using a viewName where the first char is a numerical value in 
> dataframe.createOrReplaceTempView(viewName: String) causes:
> {code}
> Exception in thread "main" 
> org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input '1468079114' expecting {'SELECT', 'FROM', 'ADD', 'AS', 
> 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 
> 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 
> 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 
> 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 
> 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 
> 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 
> 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 
> 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 
> 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 
> 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 
> 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 
> 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 
> 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 
> 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 
> 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', 
> 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 
> 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 
> 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 
> 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, 
> DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
> 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 
> 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 
> 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 
> 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', 
> IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0)
> == SQL ==
> 1
> {code}
> {code}
> val tableOrViewName = "1" //fails
> val tableOrViewName = "a" //works
> sparkSession.read.orc(path).createOrReplaceTempView(tableOrViewName)
> {code}






[jira] [Comment Edited] (SPARK-19012) CreateOrReplaceTempView throws org.apache.spark.sql.catalyst.parser.ParseException when viewName first char is numerical

2016-12-28 Thread Jork Zijlstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784796#comment-15784796
 ] 

Jork Zijlstra edited comment on SPARK-19012 at 12/29/16 7:56 AM:
-

Good to see that it's already being discussed.

MSSQL also has some limitations on tableOrViewNames, which are described in its 
documentation. Maybe updating the annotation of the method would also be 
enough. Having an Exception with a clear reason would definitely already be a fix.

[~hvanhovell]
We specify our queries inside a configuration, not in the code. So we have this in 
our config:
dataPath = "hdfs://"
dataQuery: "SELECT column1, column2 FROM \[TABLE] WHERE 1 = 1"

Since we have one SparkSession for the application, the tableOrViewName is 
coupled to that, and we don't want to specify an extra config option for the 
tableOrViewName, I thought I'd just use the hashcode of the data query as the 
tableOrViewName, use that in createOrReplaceTempView, and replace \[TABLE] 
inside the query with it.

{code}
val path = "hdfs://{path}"
val dataQuery = "SELECT * FROM [TABLE] LIMIT 1"

val tableOrViewName = "_" + Math.abs(path.hashCode).toString + Math.abs(dataQuery.hashCode).toString

val df = sparkSession.read.orc(path)
df.createOrReplaceTempView(tableOrViewName)

val result = sparkSession.sqlContext.sql(dataQuery.replace("[TABLE]", tableOrViewName)).collect
{code}

Later I want to check if the tableOrViewName has already been created and not 
call createOrReplaceTempView every time, but that is just a performance 
improvement.


was (Author: jzijlstra):
Good to see that it's already being discussed.

MSSQL also has some limitations on tableOrViewNames, which are described in its 
documentation. Maybe updating the annotation of the method would also be 
enough. Having an Exception with a clear reason would definitely already be a fix.

[~hvanhovell]
We specify our queries inside a configuration, not in the code. So we have this in 
our config:
dataPath = "hdfs://"
dataQuery: "SELECT column1, column2 FROM \[TABLE] WHERE 1 = 1"

Since we have one SparkSession for the application, the tableOrViewName is 
coupled to that, and we don't want to specify an extra config option for the 
tableOrViewName, I thought I'd just use the hashcode of the data query as the 
tableOrViewName, use that in createOrReplaceTempView, and replace \[TABLE] 
inside the query with it.

{code}
val path = "hdfs://{path}"
val dataQuery = "SELECT * FROM [TABLE] LIMIT 1"

val tableOrViewName = "_" + Math.abs(path.hashCode).toString + Math.abs(dataQuery.hashCode).toString

val df = sparkSession.read.orc(path)
df.createOrReplaceTempView(tableOrViewName)

val result = sparkSession.sqlContext.sql(dataQuery.replace("[TABLE]", tableOrViewName)).collect
{code}

> CreateOrReplaceTempView throws 
> org.apache.spark.sql.catalyst.parser.ParseException when viewName first char 
> is numerical
> 
>
> Key: SPARK-19012
> URL: https://issues.apache.org/jira/browse/SPARK-19012
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Jork Zijlstra
>
> Using a viewName where the first char is a numerical value in 
> dataframe.createOrReplaceTempView(viewName: String) causes:
> {code}
> Exception in thread "main" 
> org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input '1468079114' expecting {'SELECT', 'FROM', 'ADD', 'AS', 
> 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 
> 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 
> 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 
> 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 
> 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 
> 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 
> 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 
> 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 
> 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 
> 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 
> 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 
> 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 
> 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 
> 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 
> 'CLEAR', 'CACHE', 'UNCACHE', 

[jira] [Commented] (SPARK-19012) CreateOrReplaceTempView throws org.apache.spark.sql.catalyst.parser.ParseException when viewName first char is numerical

2016-12-28 Thread Jork Zijlstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784796#comment-15784796
 ] 

Jork Zijlstra commented on SPARK-19012:
---

Good to see that it's already being discussed.

MSSQL also has some limitations on tableOrViewNames, which are described in its 
documentation. Maybe updating the annotation of the method would also be 
enough. Having an Exception with a clear reason would definitely already be a fix.

[~hvanhovell]
We specify our queries inside a configuration, not in the code. So we have this in 
our config:
dataPath = "hdfs://"
dataQuery: "SELECT column1, column2 FROM \[TABLE] WHERE 1 = 1"

Since we have one SparkSession for the application, the tableOrViewName is 
coupled to that, and we don't want to specify an extra config option for the 
tableOrViewName, I thought I'd just use the hashcode of the data query as the 
tableOrViewName, use that in createOrReplaceTempView, and replace \[TABLE] 
inside the query with it.

{code}
val path = "hdfs://{path}"
val dataQuery = "SELECT * FROM [TABLE] LIMIT 1"

val tableOrViewName = "_" + Math.abs(path.hashCode).toString + Math.abs(dataQuery.hashCode).toString

val df = sparkSession.read.orc(path)
df.createOrReplaceTempView(tableOrViewName)

val result = sparkSession.sqlContext.sql(dataQuery.replace("[TABLE]", tableOrViewName)).collect
{code}

> CreateOrReplaceTempView throws 
> org.apache.spark.sql.catalyst.parser.ParseException when viewName first char 
> is numerical
> 
>
> Key: SPARK-19012
> URL: https://issues.apache.org/jira/browse/SPARK-19012
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Jork Zijlstra
>
> Using a viewName where the first char is a numerical value in 
> dataframe.createOrReplaceTempView(viewName: String) causes:
> {code}
> Exception in thread "main" 
> org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input '1468079114' expecting {'SELECT', 'FROM', 'ADD', 'AS', 
> 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 
> 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 
> 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 
> 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 
> 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 
> 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 
> 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 
> 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 
> 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 
> 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 
> 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 
> 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 
> 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 
> 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 
> 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', 
> 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 
> 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 
> 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 
> 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, 
> DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
> 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 
> 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 
> 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 
> 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', 
> IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0)
> == SQL ==
> 1
> {code}
> {code}
> val tableOrViewName = "1" //fails
> val tableOrViewName = "a" //works
> sparkSession.read.orc(path).createOrReplaceTempView(tableOrViewName)
> {code}






[jira] [Commented] (SPARK-19012) CreateOrReplaceTempView throws org.apache.spark.sql.catalyst.parser.ParseException when viewName first char is numerical

2016-12-27 Thread Jork Zijlstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15780545#comment-15780545
 ] 

Jork Zijlstra commented on SPARK-19012:
---

The type is viewName: String, so you would expect any string to work.

When executing createOrReplaceTempView(tableOrViewName: String), a 
ParseException of
{code}
== SQL ==
{tableOrViewName}
{code}
is thrown. You might not see that these are related, because it complains about 
an SQL parse error while you only defined a temp view.

You might not expect that the tableOrViewName triggers SQL parsing. A slightly 
clearer exception message (viewName not supported) would be more helpful, or 
the identifier rules could be added to the documentation.

As you said, prefixing the tableOrViewName with a non-numerical value does the 
trick (although to me this feels more like a workaround).
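
A small sketch of that workaround, with a hypothetical helper that prefixes names starting with a digit (the path is a placeholder):
{code}
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder
  .master("local[4]")
  .appName("temp view name workaround")
  .getOrCreate()

// hypothetical helper: prefix the name when its first character is a digit,
// so the parser sees a valid identifier
def safeViewName(name: String): String =
  if (name.nonEmpty && name.head.isDigit) "v_" + name else name

val path = "hdfs://some/orc/path" // placeholder
val tableOrViewName = safeViewName(Math.abs(path.hashCode).toString)

sparkSession.read.orc(path).createOrReplaceTempView(tableOrViewName)
{code}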

> CreateOrReplaceTempView throws 
> org.apache.spark.sql.catalyst.parser.ParseException when viewName first char 
> is numerical
> 
>
> Key: SPARK-19012
> URL: https://issues.apache.org/jira/browse/SPARK-19012
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Jork Zijlstra
>
> Using a viewName where the first char is a numerical value in 
> dataframe.createOrReplaceTempView(viewName: String) causes:
> {code}
> Exception in thread "main" 
> org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input '1468079114' expecting {'SELECT', 'FROM', 'ADD', 'AS', 
> 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 
> 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 
> 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 
> 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 
> 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 
> 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 
> 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 
> 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 
> 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 
> 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 
> 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 
> 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 
> 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 
> 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 
> 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', 
> 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 
> 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 
> 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 
> 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, 
> DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
> 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 
> 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 
> 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 
> 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', 
> IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0)
> == SQL ==
> 1
> {code}
> {code}
> val tableOrViewName = "1" //fails
> val tableOrViewName = "a" //works
> sparkSession.read.orc(path).createOrReplaceTempView(tableOrViewName)
> {code}






[jira] [Updated] (SPARK-19012) CreateOrReplaceTempView throws org.apache.spark.sql.catalyst.parser.ParseException when viewName first char is numerical

2016-12-27 Thread Jork Zijlstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jork Zijlstra updated SPARK-19012:
--
Affects Version/s: 2.0.2

> CreateOrReplaceTempView throws 
> org.apache.spark.sql.catalyst.parser.ParseException when viewName first char 
> is numerical
> 
>
> Key: SPARK-19012
> URL: https://issues.apache.org/jira/browse/SPARK-19012
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Jork Zijlstra
>
> Using a viewName where the first char is a numerical value in 
> dataframe.createOrReplaceTempView(viewName: String) causes:
> {code}
> Exception in thread "main" 
> org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input '1468079114' expecting {'SELECT', 'FROM', 'ADD', 'AS', 
> 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 
> 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 
> 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 
> 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 
> 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 
> 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 
> 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 
> 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 
> 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 
> 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 
> 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 
> 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 
> 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 
> 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 
> 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', 
> 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 
> 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 
> 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 
> 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, 
> DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
> 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 
> 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 
> 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 
> 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', 
> IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0)
> == SQL ==
> 1
> {code}
> {code}
> val tableOrViewName = "1" //fails
> val tableOrViewName = "a" //works
> sparkSession.read.orc(path).createOrReplaceTempView(tableOrViewName)
> {code}






[jira] [Updated] (SPARK-19012) CreateOrReplaceTempView throws org.apache.spark.sql.catalyst.parser.ParseException when viewName first char is numerical

2016-12-27 Thread Jork Zijlstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jork Zijlstra updated SPARK-19012:
--
Description: 
Using a viewName where the first char is a numerical value in 
dataframe.createOrReplaceTempView(viewName: String) causes:

{code}
Exception in thread "main" org.apache.spark.sql.catalyst.parser.ParseException: 
mismatched input '1468079114' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 
'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 
'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 
'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 
'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 
'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 
'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 
'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', 
'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 
'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 
'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'INTERSECT', 'TO', 
'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 
'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 
'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 
'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 
'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 
'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 
'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 
'FORMATTED', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 
'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 
'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 
'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 
'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 
'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 
'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 
'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 
'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 
'CURRENT_TIMESTAMP', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0)

== SQL ==
1
{code}

{code}
val tableOrViewName = "1" //fails
val tableOrViewName = "a" //works
sparkSession.read.orc(path).createOrReplaceTempView(tableOrViewName)
{code}



  was:
Using a viewName where the first char is a numerical value in 
dataframe.createOrReplaceTempView(viewName: String) causes:

{code}
Exception in thread "main" org.apache.spark.sql.catalyst.parser.ParseException: 
mismatched input '1468079114' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 
'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 
'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 
'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 
'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 
'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 
'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 
'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', 
'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 
'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 
'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'INTERSECT', 'TO', 
'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 
'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 
'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 
'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 
'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 
'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 
'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 
'FORMATTED', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 
'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 
'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 
'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 
'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 
'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 
'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 
'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 
'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 

[jira] [Created] (SPARK-19012) CreateOrReplaceTempView throws org.apache.spark.sql.catalyst.parser.ParseException when viewName first char is numerical

2016-12-27 Thread Jork Zijlstra (JIRA)
Jork Zijlstra created SPARK-19012:
-

 Summary: CreateOrReplaceTempView throws 
org.apache.spark.sql.catalyst.parser.ParseException when viewName first char is 
numerical
 Key: SPARK-19012
 URL: https://issues.apache.org/jira/browse/SPARK-19012
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.1
Reporter: Jork Zijlstra


Using a viewName where the first char is a numerical value in 
dataframe.createOrReplaceTempView(viewName: String) causes:

{code}
Exception in thread "main" org.apache.spark.sql.catalyst.parser.ParseException: 
mismatched input '1468079114' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 
'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 
'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 
'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 
'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 
'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 
'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 
'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', 
'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 
'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 
'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'INTERSECT', 'TO', 
'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 
'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 
'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 
'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 
'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 
'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 
'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 
'FORMATTED', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 
'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 
'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 
'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 
'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 
'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 
'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 
'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 
'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 
'CURRENT_TIMESTAMP', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0)

== SQL ==
1468079114
{code}

{code}
val tableOrViewName = "1" //fails
val tableOrViewName = "a" //works
sparkSession.read.orc(path).createOrReplaceTempView(tableOrViewName)
{code}








[jira] [Commented] (SPARK-18269) NumberFormatException when reading csv for a nullable column

2016-12-01 Thread Jork Zijlstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15712431#comment-15712431
 ] 

Jork Zijlstra commented on SPARK-18269:
---

Thanks for the quick response. Eagerly awaiting the Spark 2.1 release.

> NumberFormatException when reading csv for a nullable column
> 
>
> Key: SPARK-18269
> URL: https://issues.apache.org/jira/browse/SPARK-18269
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Jork Zijlstra
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> Having a schema with a nullable column throws a 
> java.lang.NumberFormatException: null when the data + delimiter isn't 
> specified in the CSV.
> Specifying the schema:
> {code}
> StructType(Array(
>   StructField("id", IntegerType, nullable = false),
>   StructField("underlyingId", IntegerType, true)
> ))
> {code}
> Data (without a trailing delimiter to specify the second column):
> {code}
> 1
> {code}
> Read the data:
> {code}
> sparkSession.read
> .schema(sourceSchema)
> .option("header", "false")
> .option("delimiter", """\t""")
> .csv(files(dates): _*)
> .rdd
> {code}
> Actual Result: 
> {code}
> java.lang.NumberFormatException: null
>   at java.lang.Integer.parseInt(Integer.java:542)
>   at java.lang.Integer.parseInt(Integer.java:615)
>   at 
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
>   at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244)
> {code}
> Reason:
> The csv line is parsed into a Map (indexSafeTokens), which is short of one 
> value. So indexSafeTokens(index) throws a NullPointerException reading the 
> optional value which isn't in the Map.
> The resulting null is then passed to CSVTypeCast.castTo(datum: 
> String, .) as the datum value.
> The subsequent NumberFormatException is thrown because a 
> null datum cannot be cast to the type.
> Possible fix:
> - Use the provided schema to parse the line with the correct number of columns
> - Since it's nullable, implement a try catch on CSVRelation.csvParser 
> indexSafeTokens(index)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-11-29 Thread Jork Zijlstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15704736#comment-15704736
 ] 

Jork Zijlstra edited comment on SPARK-17916 at 11/29/16 9:05 AM:
-

I also have the same issue in 2.0.1. This code seems to be the problem:

{code}
private def rowToString(row: InternalRow): Seq[String] = {
var i = 0
val values = new Array[String](row.numFields)
while (i < row.numFields) {
  if (!row.isNullAt(i)) {
values(i) = valueConverters(i).apply(row, i)
  } else {
values(i) = params.nullValue
  }
  i += 1
}
values
  }

def castTo(
  datum: String,
  castType: DataType,
  nullable: Boolean = true,
  options: CSVOptions = CSVOptions()): Any = {

if (nullable && datum == options.nullValue) {
  null
} else {

}{code}


So first the missing value in the data is transformed into the nullValue. Then 
in castTo the value is checked against the nullValue, which is always true 
for a missing value, and cast to null.
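
A minimal repro of the user-visible effect, assuming the sample data from the 
issue description below sits in /tmp/data.csv (the path is illustrative):

{code}
// With nullValue = "-", the empty string in the second row also comes back as
// null, because castTo only compares the datum against nullValue.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("nullValue", "-")
  .load("/tmp/data.csv") // contents: col1,col2 / 1,"-" / 2,""

df.show() // col2 is null in both rows
{code}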


was (Author: jzijlstra):
I also have the same issue in 2.0.1. This code seems to be the problem:

{code}
private def rowToString(row: InternalRow): Seq[String] = {
var i = 0
val values = new Array[String](row.numFields)
while (i < row.numFields) {
  if (!row.isNullAt(i)) {
values(i) = valueConverters(i).apply(row, i)
  } else {
values(i) = params.nullValue
  }
  i += 1
}
values
  }

def castTo(
  datum: String,
  castType: DataType,
  nullable: Boolean = true,
  options: CSVOptions = CSVOptions()): Any = {

if (nullable && datum == options.nullValue) {
  null
} else {

}{code}


So first the missing value in the data in transformed into the nullValue. Then 
in the castTo the value is checked against the nullValue, which is always true 
for a missing value. 

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-11-29 Thread Jork Zijlstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15704736#comment-15704736
 ] 

Jork Zijlstra edited comment on SPARK-17916 at 11/29/16 9:05 AM:
-

I also have the same issue in 2.0.1. This code seems to be the problem:

{code}
private def rowToString(row: InternalRow): Seq[String] = {
var i = 0
val values = new Array[String](row.numFields)
while (i < row.numFields) {
  if (!row.isNullAt(i)) {
values(i) = valueConverters(i).apply(row, i)
  } else {
values(i) = params.nullValue
  }
  i += 1
}
values
  }

def castTo(
  datum: String,
  castType: DataType,
  nullable: Boolean = true,
  options: CSVOptions = CSVOptions()): Any = {

if (nullable && datum == options.nullValue) {
  null
} else {

}{code}


So first the missing value in the data is transformed into the nullValue. Then 
in castTo the value is checked against the nullValue, which is always true 
for a missing value. 


was (Author: jzijlstra):
I also have the same issue in 2.0.1. This code seems to be the problem:

```private def rowToString(row: InternalRow): Seq[String] = {
var i = 0
val values = new Array[String](row.numFields)
while (i < row.numFields) {
  if (!row.isNullAt(i)) {
values(i) = valueConverters(i).apply(row, i)
  } else {
values(i) = params.nullValue
  }
  i += 1
}
values
  }


def castTo(
  datum: String,
  castType: DataType,
  nullable: Boolean = true,
  options: CSVOptions = CSVOptions()): Any = {

if (nullable && datum == options.nullValue) {
  null
} else {

}```

So first the missing value in the data in transformed into the nullValue. Then 
in the castTo the value is checked against the nullValue, which is always true 
for a missing value. 

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-11-29 Thread Jork Zijlstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15704736#comment-15704736
 ] 

Jork Zijlstra edited comment on SPARK-17916 at 11/29/16 9:04 AM:
-

I also have the same issue in 2.0.1. This code seems to be the problem:

```private def rowToString(row: InternalRow): Seq[String] = {
var i = 0
val values = new Array[String](row.numFields)
while (i < row.numFields) {
  if (!row.isNullAt(i)) {
values(i) = valueConverters(i).apply(row, i)
  } else {
values(i) = params.nullValue
  }
  i += 1
}
values
  }


def castTo(
  datum: String,
  castType: DataType,
  nullable: Boolean = true,
  options: CSVOptions = CSVOptions()): Any = {

if (nullable && datum == options.nullValue) {
  null
} else {

}```

So first the missing value in the data is transformed into the nullValue. Then 
in castTo the value is checked against the nullValue, which is always true 
for a missing value. 


was (Author: jzijlstra):
I also have the same issue in 2.0.1. This code seems to be the problem:

private def rowToString(row: InternalRow): Seq[String] = {
var i = 0
val values = new Array[String](row.numFields)
while (i < row.numFields) {
  if (!row.isNullAt(i)) {
values(i) = valueConverters(i).apply(row, i)
  } else {
values(i) = params.nullValue
  }
  i += 1
}
values
  }


def castTo(
  datum: String,
  castType: DataType,
  nullable: Boolean = true,
  options: CSVOptions = CSVOptions()): Any = {

if (nullable && datum == options.nullValue) {
  null
} else {

}

So first the missing value in the data in transformed into the nullValue. Then 
in the castTo the value is checked against the nullValue, which is always true 
for a missing value. 

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-11-29 Thread Jork Zijlstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15704736#comment-15704736
 ] 

Jork Zijlstra commented on SPARK-17916:
---

I also have the same issue in 2.0.1. This code seems to be the problem:

{code}
private def rowToString(row: InternalRow): Seq[String] = {
var i = 0
val values = new Array[String](row.numFields)
while (i < row.numFields) {
  if (!row.isNullAt(i)) {
values(i) = valueConverters(i).apply(row, i)
  } else {
values(i) = params.nullValue
  }
  i += 1
}
values
  }

def castTo(
  datum: String,
  castType: DataType,
  nullable: Boolean = true,
  options: CSVOptions = CSVOptions()): Any = {

if (nullable && datum == options.nullValue) {
  null
} else {

}
{code}

So first the missing value in the data is transformed into the nullValue. Then 
in castTo the value is checked against the nullValue, which is always true 
for a missing value. 

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18269) NumberFormatException when reading csv for a nullable column

2016-11-04 Thread Jork Zijlstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636192#comment-15636192
 ] 

Jork Zijlstra edited comment on SPARK-18269 at 11/4/16 12:22 PM:
-

The error that is thrown is java.lang.NumberFormatException: null. In this case 
null is a null reference (the result of the NullPointerException) and not the value "null".
I did try this before submitting the issue, but setting "null" as the 
nullValue doesn't work, since the string "null" != a null reference.

Apparently a null reference can end up being passed as a parameter of type String.
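
A one-liner that shows where the "null" in the message comes from (this is plain 
Java behaviour, independent of Spark):

{code}
// Parsing a null String reference produces exactly the message seen in the
// stack trace: java.lang.NumberFormatException: null
val datum: String = null
java.lang.Integer.parseInt(datum)
{code}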


was (Author: jzijlstra):
The error that is thrown is java.lang.NumberFormatException: null. In this case 
null is a NullPointerException and not the value "null".
I did try this before submitting this issue but having the value "null" as 
nullValue doesn't work since "null" != NullPointerException.

Apparently putting a NullpointterException in a parameter of type String works.

> NumberFormatException when reading csv for a nullable column
> 
>
> Key: SPARK-18269
> URL: https://issues.apache.org/jira/browse/SPARK-18269
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Jork Zijlstra
>
> Having a schema with a nullable column throws a 
> java.lang.NumberFormatException: null when the data + delimiter isn't 
> specified in the csv.
> Specifying the schema:
> StructType(Array(
>   StructField("id", IntegerType, nullable = false),
>   StructField("underlyingId", IntegerType, true)
> ))
> Data (without trailing delimiter to specify the second column):
> 1
> Read the data:
> sparkSession.read
> .schema(sourceSchema)
> .option("header", "false")
> .option("delimiter", """\t""")
> .csv(files(dates): _*)
> .rdd
> Actual Result: 
> java.lang.NumberFormatException: null
>   at java.lang.Integer.parseInt(Integer.java:542)
>   at java.lang.Integer.parseInt(Integer.java:615)
>   at 
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
>   at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244)
> Reason:
> The csv line is parsed into a Map (indexSafeTokens), which is short of one 
> value. So indexSafeTokens(index) throws a NullPointerException reading the 
> optional value which isn't in the Map.
> The resulting null is then passed to CSVTypeCast.castTo(datum: 
> String, .) as the datum value.
> The subsequent NumberFormatException is thrown because a 
> null datum cannot be cast to the type.
> Possible fix:
> - Use the provided schema to parse the line with the correct number of columns
> - Since it's nullable, implement a try catch on CSVRelation.csvParser 
> indexSafeTokens(index)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18269) NumberFormatException when reading csv for a nullable column

2016-11-04 Thread Jork Zijlstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636192#comment-15636192
 ] 

Jork Zijlstra commented on SPARK-18269:
---

The error that is thrown is java.lang.NumberFormatException: null. In this case 
null is a null reference (the result of the NullPointerException) and not the value "null".
I did try this before submitting the issue, but setting "null" as the 
nullValue doesn't work, since the string "null" != a null reference.

Apparently a null reference can end up being passed as a parameter of type String.

> NumberFormatException when reading csv for a nullable column
> 
>
> Key: SPARK-18269
> URL: https://issues.apache.org/jira/browse/SPARK-18269
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Jork Zijlstra
>
> Having a schema with a nullable column throws a 
> java.lang.NumberFormatException: null when the data + delimiter isn't 
> specified in the csv.
> Specifying the schema:
> StructType(Array(
>   StructField("id", IntegerType, nullable = false),
>   StructField("underlyingId", IntegerType, true)
> ))
> Data (without trailing delimiter to specify the second column):
> 1
> Read the data:
> sparkSession.read
> .schema(sourceSchema)
> .option("header", "false")
> .option("delimiter", """\t""")
> .csv(files(dates): _*)
> .rdd
> Actual Result: 
> java.lang.NumberFormatException: null
>   at java.lang.Integer.parseInt(Integer.java:542)
>   at java.lang.Integer.parseInt(Integer.java:615)
>   at 
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
>   at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244)
> Reason:
> The csv line is parsed into a Map (indexSafeTokens), which is short of one 
> value. So indexSafeTokens(index) throws a NullPointerException reading the 
> optional value which isn't in the Map.
> The resulting null is then passed to CSVTypeCast.castTo(datum: 
> String, .) as the datum value.
> The subsequent NumberFormatException is thrown because a 
> null datum cannot be cast to the type.
> Possible fix:
> - Use the provided schema to parse the line with the correct number of columns
> - Since it's nullable, implement a try catch on CSVRelation.csvParser 
> indexSafeTokens(index)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18270) User's schema with non-nullable properties is overridden with true

2016-11-04 Thread Jork Zijlstra (JIRA)
Jork Zijlstra created SPARK-18270:
-

 Summary: User's schema with non-nullable properties is overridden 
with true
 Key: SPARK-18270
 URL: https://issues.apache.org/jira/browse/SPARK-18270
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
Reporter: Jork Zijlstra


The user's schema with non-nullable properties is overridden with true in 
CSVRelation.csvParser.

The schema that is given to CSVRelation.csvParser(schema: StructType) isn't 
the version that the user specifies.
All nullable options are set to true.

Specifying the schema:
{code}
StructType(Array(
  StructField("id", IntegerType, nullable = false),
  StructField("underlyingId", IntegerType, true)
))
{code}

Read the data:
{code}
sparkSession.read
  .schema(sourceSchema)
  .option("header", "false")
  .option("delimiter", """\t""")
  .csv(files(dates): _*)
  .rdd
{code}

Actual Result: 
schema inside csvParser contains only nullable = true values.
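
A quick way to check from the outside whether the nullable flags survive on the 
resulting DataFrame (reusing the snippet above; sourceSchema, files and dates are 
assumed to be defined as in this report):

{code}
// Inspect the schema Spark actually ends up with after reading; if the flags
// are overridden, both fields print nullable = true.
val df = sparkSession.read
  .schema(sourceSchema)
  .option("header", "false")
  .option("delimiter", """\t""")
  .csv(files(dates): _*)

df.schema.fields.foreach(f => println(s"${f.name} nullable = ${f.nullable}"))
{code}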



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18269) NumberFormatException when reading csv for a nullable column

2016-11-04 Thread Jork Zijlstra (JIRA)
Jork Zijlstra created SPARK-18269:
-

 Summary: NumberFormatException when reading csv for a nullable 
column
 Key: SPARK-18269
 URL: https://issues.apache.org/jira/browse/SPARK-18269
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
Reporter: Jork Zijlstra


Having a schema with a nullable column throws a 
java.lang.NumberFormatException: null when the data + delimiter isn't specified 
in the csv.

Specifying the schema:
{code}
StructType(Array(
  StructField("id", IntegerType, nullable = false),
  StructField("underlyingId", IntegerType, true)
))
{code}

Data (without trailing delimiter to specify the second column):
1

Read the data:
{code}
sparkSession.read
  .schema(sourceSchema)
  .option("header", "false")
  .option("delimiter", """\t""")
  .csv(files(dates): _*)
  .rdd
{code}

Actual Result: 
{code}
java.lang.NumberFormatException: null
  at java.lang.Integer.parseInt(Integer.java:542)
  at java.lang.Integer.parseInt(Integer.java:615)
  at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
  at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
  at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244)
{code}

Reason:
The csv line is parsed into a Map (indexSafeTokens), which is short of one 
value. So indexSafeTokens(index) throws a NullPointerException reading the 
optional value which isn't in the Map.

The resulting null is then passed to CSVTypeCast.castTo(datum: String, 
.) as the datum value.
The subsequent NumberFormatException is thrown because a null datum cannot be 
cast to the type.

Possible fix:
- Use the provided schema to parse the line with the correct number of columns
- Since it's nullable, implement a try catch on CSVRelation.csvParser 
indexSafeTokens(index) (a sketch of the idea follows below)
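
A sketch of the idea behind the second suggestion, using a safe lookup instead of 
a literal try catch (the helper name is invented; this is not the actual Spark 
internals):

{code}
// Return null for a column whose token is missing from the parsed line, so a
// nullable column ends up as null instead of a missing value reaching castTo.
def tokenOrNull(indexSafeTokens: Map[Int, String], index: Int): String =
  indexSafeTokens.getOrElse(index, null)
{code}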



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org