[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-08-12 Thread huangyu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419798#comment-15419798
 ] 

huangyu commented on SPARK-15044:
-

Hi, I know it isn't Spark's fault. However, I think it may be better to just log 
an error rather than throw an exception, because I am often in this situation: a 
table has many partitions (maybe one per hour), and someone deletes the paths of 
many of them (I really don't know why they do this; maybe there are bugs in their 
program). Spark SQL then can't work until I fix the Hive metadata, but I can't 
run "alter table drop partition .." for every missing partition (there are too 
many). So I have no choice but to catch the exception and rebuild Spark.
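
One way to avoid patching Spark is to reconcile the metastore with the filesystem 
before querying. Below is a minimal sketch of that cleanup, assuming a 
Hive-enabled sqlContext and the layout from the report below (table "test", 
partition column "p", data under /warehouse/test; all of these names are 
assumptions here). Some builds also expose a spark.sql.hive.verifyPartitionPath 
option that makes the scan skip missing partition directories.

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// Drop every metastore partition whose directory no longer exists on HDFS,
// so the metastore and the filesystem agree again.
val fs = FileSystem.get(sc.hadoopConfiguration)
val specs = sqlContext.sql("SHOW PARTITIONS test").collect().map(_.getString(0)) // e.g. "p=1"

specs.foreach { spec =>
  val dir = new Path(s"/warehouse/test/$spec")
  if (!fs.exists(dir)) {
    val value = spec.stripPrefix("p=")
    sqlContext.sql(s"ALTER TABLE test DROP IF EXISTS PARTITION (p='$value')")
  }
}
{code}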

> spark-sql will throw "input path does not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually
> -
>
> Key: SPARK-15044
> URL: https://issues.apache.org/jira/browse/SPARK-15044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: huangyu
>
> Spark SQL will throw an "input path does not exist" exception if it handles a 
> partition which exists in the Hive table but whose path has been removed 
> manually. The situation is as follows:
> 1) Create a table "test": "create table test (n string) partitioned by (p 
> string)"
> 2) Load some data into partition(p='1')
> 3) Remove the path related to partition(p='1') of table test manually: "hadoop 
> fs -rmr /warehouse//test/p=1"
> 4) Run Spark SQL: spark-sql -e "select n from test where p='1';"
> Then it throws exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> ./test/p=1
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
> at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is in Spark 1.6.1; with Spark 1.4.0 it is OK.
> I think Spark SQL should ignore the missing path, just as Hive does, or as it 
> did in earlier versions, rather than throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Closed] (SPARK-16716) calling cache on joined dataframe can lead to data being blanked

2016-08-12 Thread PJ Fanning (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning closed SPARK-16716.
--
Resolution: Duplicate

This looks like it was fixed by SPARK-16664

> calling cache on joined dataframe can lead to data being blanked
> 
>
> Key: SPARK-16716
> URL: https://issues.apache.org/jira/browse/SPARK-16716
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: PJ Fanning
>
> I have reproduced the issue in Spark 1.6.2 and latest 1.6.3-SNAPSHOT code.
> The code works ok on Spark 1.6.1.
> I have a notebook up on Databricks Community Edition that demonstrates the 
> issue. The notebook depends on the library com.databricks:spark-csv_2.10:1.4.0
> The code uses some custom code to join 4 dataframes.
> It calls show on this dataframe and the data is as expected.
> After calling .cache, the data is blanked.
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5458351705459939/3760010872339805/5521341683971298/latest.html
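
For reference, the reported pattern boils down to roughly the following 
self-contained sketch (Spark 1.6.x API). The column names and toy rows are 
invented, and the original notebook loads its four DataFrames through spark-csv, 
so this minimal version may or may not trigger the bug; it only illustrates the 
shape of the scenario.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("cache-after-join"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df1 = Seq((1, "a")).toDF("id", "c1")
val df2 = Seq((1, "b")).toDF("id", "c2")
val df3 = Seq((1, "c")).toDF("id", "c3")
val df4 = Seq((1, "d")).toDF("id", "c4")

val joined = df1.join(df2, "id").join(df3, "id").join(df4, "id")
joined.show()  // values are as expected here
joined.cache()
joined.show()  // reported symptom on 1.6.2 / 1.6.3-SNAPSHOT: rows come back blanked
{code}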



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17039) cannot read null dates from csv file

2016-08-12 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419768#comment-15419768
 ] 

Barry Becker commented on SPARK-17039:
--

I read the comments in SPARK-16462. It looks like it would fix this issue, but 
I have not tried it yet. I will look into doing a local build of the tip of 
2.0.1 later next week and trying it out.
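
In the meantime, a possible workaround is to declare the date column as a string 
in the schema and parse it after the read, so the CSV parser never tries to turn 
"?" into a date. A minimal sketch follows; the column names, stringSchema, and 
the reuse of dataFile are placeholders rather than names from the report.

{code}
import org.apache.spark.sql.functions.{col, unix_timestamp}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Read the troublesome column as a plain string first.
val stringSchema = StructType(Seq(
  StructField("id", StringType),
  StructField("eventDate", StringType)))

val raw = sqlContext.read
  .format("csv")
  .option("header", "false")
  .option("nullValue", "?")
  .schema(stringSchema)
  .csv(dataFile)

// Then parse it; unix_timestamp returns null for values it cannot parse,
// so any leftover "?" tokens simply become null timestamps.
val withDates = raw.withColumn(
  "eventDate",
  unix_timestamp(col("eventDate"), "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp"))
{code}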

> cannot read null dates from csv file
> 
>
> Key: SPARK-17039
> URL: https://issues.apache.org/jira/browse/SPARK-17039
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Barry Becker
>
> I see this exact same bug as reported in this [stack overflow 
> post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column]
>   using Spark 2.0.0 (released version).
> In Scala, I read a csv using:
> sqlContext.read
>   .format("csv")
>   .option("header", "false")
>   .option("inferSchema", "false")
>   .option("nullValue", "?")
>   .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss")
>   .schema(dfSchema)
>   .csv(dataFile)
> The data contains some null dates (represented with ?).
> The error I get is:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 
> (TID 10, localhost): java.text.ParseException: Unparseable date: "?"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16519) Handle SparkR RDD generics that create warnings in R CMD check

2016-08-12 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419762#comment-15419762
 ] 

Shivaram Venkataraman commented on SPARK-16519:
---

FYI [~aloknsingh] [~clarkfitzg] some of the RDD methods are being renamed in 
this JIRA

> Handle SparkR RDD generics that create warnings in R CMD check
> --
>
> Key: SPARK-16519
> URL: https://issues.apache.org/jira/browse/SPARK-16519
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> One of the warnings we get from R CMD check is that RDD implementations of 
> some of the generics are not documented. These generics are shared between 
> RDD, DataFrames in SparkR. The list includes
> {quote}
> WARNING
> Undocumented S4 methods:
>   generic 'cache' and siglist 'RDD'
>   generic 'collect' and siglist 'RDD'
>   generic 'count' and siglist 'RDD'
>   generic 'distinct' and siglist 'RDD'
>   generic 'first' and siglist 'RDD'
>   generic 'join' and siglist 'RDD,RDD'
>   generic 'length' and siglist 'RDD'
>   generic 'partitionBy' and siglist 'RDD'
>   generic 'persist' and siglist 'RDD,character'
>   generic 'repartition' and siglist 'RDD'
>   generic 'show' and siglist 'RDD'
>   generic 'take' and siglist 'RDD,numeric'
>   generic 'unpersist' and siglist 'RDD'
> {quote}
> As described in 
> https://stat.ethz.ch/pipermail/r-devel/2003-September/027490.html this looks 
> like a limitation of R where exporting a generic from a package also exports 
> all the implementations of that generic. 
> One way to get around this is to remove the RDD API or rename the methods in 
> Spark 2.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419755#comment-15419755
 ] 

Apache Spark commented on SPARK-16975:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/14627

> Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2
> --
>
> Key: SPARK-16975
> URL: https://issues.apache.org/jira/browse/SPARK-16975
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Ubuntu Linux 14.04
>Reporter: immerrr again
>Assignee: Dongjoon Hyun
>  Labels: parquet
> Fix For: 2.0.1, 2.1.0
>
>
> Spark-2.0.0 seems to have some problems reading a parquet dataset generated 
> by 1.6.2. 
> {code}
> In [80]: spark.read.parquet('/path/to/data')
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data. It must be specified manually;'
> {code}
> The dataset is ~150G and partitioned by _locality_code column. None of the 
> partitions are empty. I have narrowed the failing dataset to the first 32 
> partitions of the data:
> {code}
> In [82]: spark.read.parquet(*subdirs[:32])
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI. It must be 
> specified manually;'
> {code}
> Interestingly, it works OK if you remove any of the partitions from the list:
> {code}
> In [83]: for i in range(32): spark.read.parquet(*(subdirs[:i] + 
> subdirs[i+1:32]))
> {code}
> Another strange thing is that the schemas for the first and the last 31 
> partitions of the subset are identical:
> {code}
> In [84]: spark.read.parquet(*subdirs[:31]).schema.fields == 
> spark.read.parquet(*subdirs[1:32]).schema.fields
> Out[84]: True
> {code}
> That got me interested, so I tried this:
> {code}
> In [87]: spark.read.parquet(*([subdirs[0]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AQ. It must be 
> specified manually;'
> In [88]: spark.read.parquet(*([subdirs[15]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AX,/path/to/data/_locality_code=AX. It must be 
> specified manually;'
> In [89]: spark.read.parquet(*([subdirs[31]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=BE,/path/to/data/_locality_code=BE. It must be 
> specified manually;'
> {code}
> If I read the first partition, save it in 2.0 and try to read in the same 
> manner, everything is fine:
> {code}
> In [100]: spark.read.parquet(subdirs[0]).write.parquet('spark-2.0-test')
> 16/08/09 11:03:37 WARN ParquetRecordReader: Can not initialize counter due to 
> context is not a instance of TaskInputOutputContext, but is 
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
> In [101]: df = spark.read.parquet(*(['spark-2.0-test'] * 32))
> {code}
> I originally posted this to the user mailing list, but given the latest 
> discoveries this clearly seems like a bug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17039) cannot read null dates from csv file

2016-08-12 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419730#comment-15419730
 ] 

Liwei Lin edited comment on SPARK-17039 at 8/13/16 12:36 AM:
-

Thanks [~barrybecker4] for reporting this! Please also see 
https://issues.apache.org/jira/browse/SPARK-16462. It'd be great if you could 
try this patch https://github.com/apache/spark/pull/14118 and provide some 
feedback.


was (Author: proflin):
Thanks [~barrybecker4] for reporting this. Please also see 
https://issues.apache.org/jira/browse/SPARK-16462.

> cannot read null dates from csv file
> 
>
> Key: SPARK-17039
> URL: https://issues.apache.org/jira/browse/SPARK-17039
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Barry Becker
>
> I see this exact same bug as reported in this [stack overflow 
> post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column]
>   using Spark 2.0.0 (released version).
> In Scala, I read a csv using:
> sqlContext.read
>   .format("csv")
>   .option("header", "false")
>   .option("inferSchema", "false")
>   .option("nullValue", "?")
>   .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss")
>   .schema(dfSchema)
>   .csv(dataFile)
> The data contains some null dates (represented with ?).
> The error I get is:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 
> (TID 10, localhost): java.text.ParseException: Unparseable date: "?"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17039) cannot read null dates from csv file

2016-08-12 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419730#comment-15419730
 ] 

Liwei Lin commented on SPARK-17039:
---

Thanks [~barrybecker4] for reporting this. Please also see 
https://issues.apache.org/jira/browse/SPARK-16462.

> cannot read null dates from csv file
> 
>
> Key: SPARK-17039
> URL: https://issues.apache.org/jira/browse/SPARK-17039
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Barry Becker
>
> I see this exact same bug as reported in this [stack overflow 
> post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column]
>   using Spark 2.0.0 (released version).
> In Scala, I read a csv using:
> sqlContext.read
>   .format("csv")
>   .option("header", "false")
>   .option("inferSchema", "false")
>   .option("nullValue", "?")
>   .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss")
>   .schema(dfSchema)
>   .csv(dataFile)
> The data contains some null dates (represented with ?).
> The error I get is:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 
> (TID 10, localhost): java.text.ParseException: Unparseable date: "?"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16519) Handle SparkR RDD generics that create warnings in R CMD check

2016-08-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16519:


Assignee: (was: Apache Spark)

> Handle SparkR RDD generics that create warnings in R CMD check
> --
>
> Key: SPARK-16519
> URL: https://issues.apache.org/jira/browse/SPARK-16519
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> One of the warnings we get from R CMD check is that RDD implementations of 
> some of the generics are not documented. These generics are shared between 
> RDD, DataFrames in SparkR. The list includes
> {quote}
> WARNING
> Undocumented S4 methods:
>   generic 'cache' and siglist 'RDD'
>   generic 'collect' and siglist 'RDD'
>   generic 'count' and siglist 'RDD'
>   generic 'distinct' and siglist 'RDD'
>   generic 'first' and siglist 'RDD'
>   generic 'join' and siglist 'RDD,RDD'
>   generic 'length' and siglist 'RDD'
>   generic 'partitionBy' and siglist 'RDD'
>   generic 'persist' and siglist 'RDD,character'
>   generic 'repartition' and siglist 'RDD'
>   generic 'show' and siglist 'RDD'
>   generic 'take' and siglist 'RDD,numeric'
>   generic 'unpersist' and siglist 'RDD'
> {quote}
> As described in 
> https://stat.ethz.ch/pipermail/r-devel/2003-September/027490.html this looks 
> like a limitation of R where exporting a generic from a package also exports 
> all the implementations of that generic. 
> One way to get around this is to remove the RDD API or rename the methods in 
> Spark 2.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16519) Handle SparkR RDD generics that create warnings in R CMD check

2016-08-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419716#comment-15419716
 ] 

Apache Spark commented on SPARK-16519:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/14626

> Handle SparkR RDD generics that create warnings in R CMD check
> --
>
> Key: SPARK-16519
> URL: https://issues.apache.org/jira/browse/SPARK-16519
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> One of the warnings we get from R CMD check is that RDD implementations of 
> some of the generics are not documented. These generics are shared between 
> RDD, DataFrames in SparkR. The list includes
> {quote}
> WARNING
> Undocumented S4 methods:
>   generic 'cache' and siglist 'RDD'
>   generic 'collect' and siglist 'RDD'
>   generic 'count' and siglist 'RDD'
>   generic 'distinct' and siglist 'RDD'
>   generic 'first' and siglist 'RDD'
>   generic 'join' and siglist 'RDD,RDD'
>   generic 'length' and siglist 'RDD'
>   generic 'partitionBy' and siglist 'RDD'
>   generic 'persist' and siglist 'RDD,character'
>   generic 'repartition' and siglist 'RDD'
>   generic 'show' and siglist 'RDD'
>   generic 'take' and siglist 'RDD,numeric'
>   generic 'unpersist' and siglist 'RDD'
> {quote}
> As described in 
> https://stat.ethz.ch/pipermail/r-devel/2003-September/027490.html this looks 
> like a limitation of R where exporting a generic from a package also exports 
> all the implementations of that generic. 
> One way to get around this is to remove the RDD API or rename the methods in 
> Spark 2.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16519) Handle SparkR RDD generics that create warnings in R CMD check

2016-08-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16519:


Assignee: Apache Spark

> Handle SparkR RDD generics that create warnings in R CMD check
> --
>
> Key: SPARK-16519
> URL: https://issues.apache.org/jira/browse/SPARK-16519
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Apache Spark
>
> One of the warnings we get from R CMD check is that RDD implementations of 
> some of the generics are not documented. These generics are shared between 
> RDD, DataFrames in SparkR. The list includes
> {quote}
> WARNING
> Undocumented S4 methods:
>   generic 'cache' and siglist 'RDD'
>   generic 'collect' and siglist 'RDD'
>   generic 'count' and siglist 'RDD'
>   generic 'distinct' and siglist 'RDD'
>   generic 'first' and siglist 'RDD'
>   generic 'join' and siglist 'RDD,RDD'
>   generic 'length' and siglist 'RDD'
>   generic 'partitionBy' and siglist 'RDD'
>   generic 'persist' and siglist 'RDD,character'
>   generic 'repartition' and siglist 'RDD'
>   generic 'show' and siglist 'RDD'
>   generic 'take' and siglist 'RDD,numeric'
>   generic 'unpersist' and siglist 'RDD'
> {quote}
> As described in 
> https://stat.ethz.ch/pipermail/r-devel/2003-September/027490.html this looks 
> like a limitation of R where exporting a generic from a package also exports 
> all the implementations of that generic. 
> One way to get around this is to remove the RDD API or rename the methods in 
> Spark 2.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2016-08-12 Thread Matt Sicker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419705#comment-15419705
 ] 

Matt Sicker commented on SPARK-6305:


Perhaps [composite 
configuration|https://logging.apache.org/log4j/2.x/manual/configuration.html#CompositeConfiguration]
 could be helpful here? This way you can override some default loggers and 
whatnot while still allowing user-defined logging config files.

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17001) Enable standardScaler to standardize sparse vectors when withMean=True

2016-08-12 Thread Tobi Bosede (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419592#comment-15419592
 ] 

Tobi Bosede commented on SPARK-17001:
-

This can be implemented in a similar fashion to scikit-learn's maxabs_scale. 
See 
http://scikit-learn.org/stable/modules/preprocessing.html#scaling-sparse-data 
for more info.

> Enable standardScaler to standardize sparse vectors when withMean=True
> --
>
> Key: SPARK-17001
> URL: https://issues.apache.org/jira/browse/SPARK-17001
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Tobi Bosede
>Priority: Minor
>
> When withMean = true, StandardScaler will not handle sparse vectors, and 
> instead throw an exception. This is presumably because subtracting the mean 
> makes a sparse vector dense, and this can be undesirable. 
> However, VectorAssembler generates vectors that may be a mix of sparse and 
> dense, even when vectors are smallish, depending on their values. It's common 
> to feed this into StandardScaler, but it would fail sometimes depending on 
> the input if withMean = true. This is kind of surprising.
> StandardScaler should go ahead and operate on sparse vectors and subtract the 
> mean, if explicitly asked to do so with withMean, on the theory that the user 
> knows what he/she is doing, and there is otherwise no way to make this work.
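
In the meantime, a common workaround sketch is to densify the assembled vectors 
before scaling. This assumes the spark.ml API in 2.0 and a SparkSession named 
`spark` (as in the shell); the toy rows and column names are invented.

{code}
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}

// Toy input: a mix of sparse and dense vectors, as VectorAssembler often produces.
val assembled = spark.createDataFrame(Seq(
  (0, Vectors.sparse(3, Array(1), Array(4.0))),
  (1, Vectors.dense(1.0, 2.0, 3.0))
)).toDF("id", "features")

// Convert possibly-sparse vectors to dense so withMean = true is accepted.
val toDense = udf((v: Vector) => v.toDense: Vector)
val densified = assembled.withColumn("denseFeatures", toDense(col("features")))

val scaler = new StandardScaler()
  .setInputCol("denseFeatures")
  .setOutputCol("scaledFeatures")
  .setWithMean(true)
  .setWithStd(true)

scaler.fit(densified).transform(densified).show(truncate = false)
{code}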



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17038) StreamingSource reports metrics for lastCompletedBatch instead of lastReceivedBatch

2016-08-12 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419571#comment-15419571
 ] 

Shixiong Zhu commented on SPARK-17038:
--

Good catch. Could you submit a PR to fix it, please?

> StreamingSource reports metrics for lastCompletedBatch instead of 
> lastReceivedBatch
> ---
>
> Key: SPARK-17038
> URL: https://issues.apache.org/jira/browse/SPARK-17038
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Oz Ben-Ami
>Priority: Minor
>  Labels: metrics
>
> StreamingSource's lastReceivedBatch_submissionTime, 
> lastReceivedBatch_processingTimeStart, and 
> lastReceivedBatch_processingTimeEnd all use data from lastCompletedBatch 
> instead of lastReceivedBatch. In particular, this makes it impossible to 
> match lastReceivedBatch_records with a batchID/submission time.
> This is apparent when looking at StreamingSource.scala, lines 89-94.
> I would guess that just replacing Completed->Received in those lines would 
> fix the issue, but I haven't tested it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0

2016-08-12 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419458#comment-15419458
 ] 

Sital Kedia commented on SPARK-16922:
-

I am using the fix in https://github.com/apache/spark/pull/14464/files, but the 
issue still remains. The joining key lies in the range [43304915L, 
10150946266075397L].
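
For context, the span of that key range is far larger than Int.MaxValue, which 
may be why the range was asked about in relation to the integer-overflow fixes. 
A quick check (values copied from the comment above):

{code}
val lo = 43304915L
val hi = 10150946266075397L
val span = hi - lo
println(span)                 // 10150946222770482, roughly 1.0e16
println(Int.MaxValue)         // 2147483647
println(span > Int.MaxValue)  // true: the key span does not fit in an Int
{code}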

> Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
> -
>
> Key: SPARK-16922
> URL: https://issues.apache.org/jira/browse/SPARK-16922
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> A query which used to work in Spark 1.6 fails with executor OOM in 2.0.
> Stack trace - 
> {code}
>   at 
> org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Query plan in Spark 1.6
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3])
> +- TungstenExchange hashpartitioning(field1#101,200), None
>+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111])
>   +- Project [field1#101,field2#74]
>  +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as 
> decimal(20,0)) as bigint)], BuildRight
> :- ConvertToUnsafe
> :  +- HiveTableScan [field2#74,field5#63L], MetastoreRelation 
> foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)]
> +- ConvertToUnsafe
>+- HiveTableScan [field1#101,field4#97], MetastoreRelation 
> foo, table2, Some(b)
> {code}
> Query plan in 2.0
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))])
> +- Exchange hashpartitioning(field1#160, 200)
>+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / 
> 100.0))])
>   +- *Project [field2#133, field1#160]
>  +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as 
> decimal(20,0)) as bigint)], Inner, BuildRight
> :- *Filter isnotnull(field5#122L)
> :  +- HiveTableScan [field5#122L, field2#133], MetastoreRelation 
> foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= 
> 2013-12-31)]
> +- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as 
> decimal(20,0)) as bigint)))
>+- *Filter isnotnull(field4#156)
>   +- HiveTableScan [field4#156, field1#160], 
> MetastoreRelation foo, table2, b
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0

2016-08-12 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419438#comment-15419438
 ] 

Davies Liu commented on SPARK-16922:


I think it's fixed by https://github.com/apache/spark/pull/14464/files

> Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
> -
>
> Key: SPARK-16922
> URL: https://issues.apache.org/jira/browse/SPARK-16922
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> A query which used to work in Spark 1.6 fails with executor OOM in 2.0.
> Stack trace - 
> {code}
>   at 
> org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Query plan in Spark 1.6
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3])
> +- TungstenExchange hashpartitioning(field1#101,200), None
>+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111])
>   +- Project [field1#101,field2#74]
>  +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as 
> decimal(20,0)) as bigint)], BuildRight
> :- ConvertToUnsafe
> :  +- HiveTableScan [field2#74,field5#63L], MetastoreRelation 
> foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)]
> +- ConvertToUnsafe
>+- HiveTableScan [field1#101,field4#97], MetastoreRelation 
> foo, table2, Some(b)
> {code}
> Query plan in 2.0
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))])
> +- Exchange hashpartitioning(field1#160, 200)
>+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / 
> 100.0))])
>   +- *Project [field2#133, field1#160]
>  +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as 
> decimal(20,0)) as bigint)], Inner, BuildRight
> :- *Filter isnotnull(field5#122L)
> :  +- HiveTableScan [field5#122L, field2#133], MetastoreRelation 
> foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= 
> 2013-12-31)]
> +- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as 
> decimal(20,0)) as bigint)))
>+- *Filter isnotnull(field4#156)
>   +- HiveTableScan [field4#156, field1#160], 
> MetastoreRelation foo, table2, b
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0

2016-08-12 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419434#comment-15419434
 ] 

Davies Liu commented on SPARK-16922:


[~sitalke...@gmail.com] There are two integer overflow bugs in LongHashedRelation 
that were fixed recently; could you test with the latest master? What is the 
range of your joining key?

> Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
> -
>
> Key: SPARK-16922
> URL: https://issues.apache.org/jira/browse/SPARK-16922
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> A query which used to work in Spark 1.6 fails with executor OOM in 2.0.
> Stack trace - 
> {code}
>   at 
> org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Query plan in Spark 1.6
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3])
> +- TungstenExchange hashpartitioning(field1#101,200), None
>+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111])
>   +- Project [field1#101,field2#74]
>  +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as 
> decimal(20,0)) as bigint)], BuildRight
> :- ConvertToUnsafe
> :  +- HiveTableScan [field2#74,field5#63L], MetastoreRelation 
> foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)]
> +- ConvertToUnsafe
>+- HiveTableScan [field1#101,field4#97], MetastoreRelation 
> foo, table2, Some(b)
> {code}
> Query plan in 2.0
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))])
> +- Exchange hashpartitioning(field1#160, 200)
>+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / 
> 100.0))])
>   +- *Project [field2#133, field1#160]
>  +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as 
> decimal(20,0)) as bigint)], Inner, BuildRight
> :- *Filter isnotnull(field5#122L)
> :  +- HiveTableScan [field5#122L, field2#133], MetastoreRelation 
> foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= 
> 2013-12-31)]
> +- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as 
> decimal(20,0)) as bigint)))
>+- *Filter isnotnull(field4#156)
>   +- HiveTableScan [field4#156, field1#160], 
> MetastoreRelation foo, table2, b
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16716) calling cache on joined dataframe can lead to data being blanked

2016-08-12 Thread PJ Fanning (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419430#comment-15419430
 ] 

PJ Fanning commented on SPARK-16716:


I set up an equivalent notebook for Spark 2.0 in Databricks Community Edition, 
and the join and cache worked correctly. It appears that the issue is just in 
Spark 1.6.2.

> calling cache on joined dataframe can lead to data being blanked
> 
>
> Key: SPARK-16716
> URL: https://issues.apache.org/jira/browse/SPARK-16716
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: PJ Fanning
>
> I have reproduced the issue in Spark 1.6.2 and latest 1.6.3-SNAPSHOT code.
> The code works ok on Spark 1.6.1.
> I have a notebook up on Databricks Community Edition that demonstrates the 
> issue. The notebook depends on the library com.databricks:spark-csv_2.10:1.4.0
> The code uses some custom code to join 4 dataframes.
> It calls show on this dataframe and the data is as expected.
> After calling .cache, the data is blanked.
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5458351705459939/3760010872339805/5521341683971298/latest.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0

2016-08-12 Thread Sital Kedia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia updated SPARK-16922:

Summary: Query with Broadcast Hash join fails due to executor OOM in Spark 
2.0  (was: Query failure due to executor OOM in Spark 2.0)

> Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
> -
>
> Key: SPARK-16922
> URL: https://issues.apache.org/jira/browse/SPARK-16922
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> A query which used to work in Spark 1.6 fails with executor OOM in 2.0.
> Stack trace - 
> {code}
>   at 
> org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Query plan in Spark 1.6
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3])
> +- TungstenExchange hashpartitioning(field1#101,200), None
>+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111])
>   +- Project [field1#101,field2#74]
>  +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as 
> decimal(20,0)) as bigint)], BuildRight
> :- ConvertToUnsafe
> :  +- HiveTableScan [field2#74,field5#63L], MetastoreRelation 
> foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)]
> +- ConvertToUnsafe
>+- HiveTableScan [field1#101,field4#97], MetastoreRelation 
> foo, table2, Some(b)
> {code}
> Query plan in 2.0
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))])
> +- Exchange hashpartitioning(field1#160, 200)
>+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / 
> 100.0))])
>   +- *Project [field2#133, field1#160]
>  +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as 
> decimal(20,0)) as bigint)], Inner, BuildRight
> :- *Filter isnotnull(field5#122L)
> :  +- HiveTableScan [field5#122L, field2#133], MetastoreRelation 
> foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= 
> 2013-12-31)]
> +- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as 
> decimal(20,0)) as bigint)))
>+- *Filter isnotnull(field4#156)
>   +- HiveTableScan [field4#156, field1#160], 
> MetastoreRelation foo, table2, b
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16922) Query failure due to executor OOM in Spark 2.0

2016-08-12 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419383#comment-15419383
 ] 

Sital Kedia edited comment on SPARK-16922 at 8/12/16 8:06 PM:
--

I found that the regression was introduced in 
https://github.com/apache/spark/pull/12278, which introduced a new data 
structure (LongHashedRelation) for long types. I made a hack to use 
UnsafeHashedRelation instead of LongHashedRelation in 
https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L105
 and things started working fine.  This might be due to some data corruption 
happening in LongHashedRelation. 

cc - [~davies]


was (Author: sitalke...@gmail.com):
I found that the regression was introduced in 
https://github.com/apache/spark/pull/12278, whichintroduced a new data 
structure (LongHashedRelation) for long types. I made a hack to use 
UnsafeHashedRelation instead of LongHashedRelation in 
https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L105
 and things started working fine.  This might be due to some data corruption 
happening in LongHashedRelation. 

cc - [~davies]

> Query failure due to executor OOM in Spark 2.0
> --
>
> Key: SPARK-16922
> URL: https://issues.apache.org/jira/browse/SPARK-16922
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> A query which used to work in Spark 1.6 fails with executor OOM in 2.0.
> Stack trace - 
> {code}
>   at 
> org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Query plan in Spark 1.6
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3])
> +- TungstenExchange hashpartitioning(field1#101,200), None
>+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111])
>   +- Project [field1#101,field2#74]
>  +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as 
> decimal(20,0)) as bigint)], BuildRight
> :- ConvertToUnsafe
> :  +- HiveTableScan [field2#74,field5#63L], MetastoreRelation 
> foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)]
> +- ConvertToUnsafe
>+- HiveTableScan [field1#101,field4#97], MetastoreRelation 
> foo, table2, Some(b)
> {code}
> Query plan in 2.0
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))])
> +- Exchange hashpartitioning(field1#160, 200)
>+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / 
> 100.0))])
>   +- *Project [field2#133, field1#160]
>  +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as 
> decimal(20,0)) as bigint)], Inner, BuildRight
> :- *Filter isnotnull(field5#122L)
> :  +- HiveTableScan [field5#122L, field2#133], MetastoreRelation 
> foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= 
> 2013-12-31)]
> +- BroadcastExchange 
> 

[jira] [Commented] (SPARK-16922) Query failure due to executor OOM in Spark 2.0

2016-08-12 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419383#comment-15419383
 ] 

Sital Kedia commented on SPARK-16922:
-

I found that the regression was introduced in 
https://github.com/apache/spark/pull/12278, which introduced a new data 
structure (LongHashedRelation) for long types. I made a hack to use 
UnsafeHashedRelation instead of LongHashedRelation in 
https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L105
 and things started working fine. This might be due to some data corruption 
happening in LongHashedRelation.

cc - [~davies]

> Query failure due to executor OOM in Spark 2.0
> --
>
> Key: SPARK-16922
> URL: https://issues.apache.org/jira/browse/SPARK-16922
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> A query which used to work in Spark 1.6 fails with executor OOM in 2.0.
> Stack trace - 
> {code}
>   at 
> org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Query plan in Spark 1.6
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3])
> +- TungstenExchange hashpartitioning(field1#101,200), None
>+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111])
>   +- Project [field1#101,field2#74]
>  +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as 
> decimal(20,0)) as bigint)], BuildRight
> :- ConvertToUnsafe
> :  +- HiveTableScan [field2#74,field5#63L], MetastoreRelation 
> foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)]
> +- ConvertToUnsafe
>+- HiveTableScan [field1#101,field4#97], MetastoreRelation 
> foo, table2, Some(b)
> {code}
> Query plan in 2.0
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))])
> +- Exchange hashpartitioning(field1#160, 200)
>+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / 
> 100.0))])
>   +- *Project [field2#133, field1#160]
>  +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as 
> decimal(20,0)) as bigint)], Inner, BuildRight
> :- *Filter isnotnull(field5#122L)
> :  +- HiveTableScan [field5#122L, field2#133], MetastoreRelation 
> foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= 
> 2013-12-31)]
> +- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as 
> decimal(20,0)) as bigint)))
>+- *Filter isnotnull(field4#156)
>   +- HiveTableScan [field4#156, field1#160], 
> MetastoreRelation foo, table2, b
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17045) Moving Auto_Joins from HiveCompatibilitySuite to SQLQueryTestSuite

2016-08-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17045:


Assignee: Apache Spark

> Moving Auto_Joins from HiveCompatibilitySuite to SQLQueryTestSuite
> --
>
> Key: SPARK-17045
> URL: https://issues.apache.org/jira/browse/SPARK-17045
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> https://github.com/apache/spark/pull/14498 plans to remove Hive Built-in Hash 
> Functions. 10+ test cases are broken because the results are different from 
> the Hive golden answer files. These broken test cases are not Hive specific. 
> Thus, it makes more sense to move them to `SQLQueryTestSuite`



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17045) Moving Auto_Joins from HiveCompatibilitySuite to SQLQueryTestSuite

2016-08-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17045:


Assignee: (was: Apache Spark)

> Moving Auto_Joins from HiveCompatibilitySuite to SQLQueryTestSuite
> --
>
> Key: SPARK-17045
> URL: https://issues.apache.org/jira/browse/SPARK-17045
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> https://github.com/apache/spark/pull/14498 plans to remove Hive Built-in Hash 
> Functions. 10+ test cases are broken because the results are different from 
> the Hive golden answer files. These broken test cases are not Hive specific. 
> Thus, it makes more sense to move them to `SQLQueryTestSuite`



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17045) Moving Auto_Joins from HiveCompatibilitySuite to SQLQueryTestSuite

2016-08-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419320#comment-15419320
 ] 

Apache Spark commented on SPARK-17045:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14625

> Moving Auto_Joins from HiveCompatibilitySuite to SQLQueryTestSuite
> --
>
> Key: SPARK-17045
> URL: https://issues.apache.org/jira/browse/SPARK-17045
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> https://github.com/apache/spark/pull/14498 plans to remove Hive Built-in Hash 
> Functions. 10+ test cases are broken because the results are different from 
> the Hive golden answer files. These broken test cases are not Hive specific. 
> Thus, it makes more sense to move them to `SQLQueryTestSuite`



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17045) Moving Auto_Joins from HiveCompatibilitySuite to SQLQueryTestSuite

2016-08-12 Thread Xiao Li (JIRA)
Xiao Li created SPARK-17045:
---

 Summary: Moving Auto_Joins from HiveCompatibilitySuite to 
SQLQueryTestSuite
 Key: SPARK-17045
 URL: https://issues.apache.org/jira/browse/SPARK-17045
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


https://github.com/apache/spark/pull/14498 plans to remove Hive Built-in Hash 
Functions. 10+ test cases are broken because the results are different from the 
Hive golden answer files. These broken test cases are not Hive specific. Thus, 
it makes more sense to move them to `SQLQueryTestSuite`




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17042) Repl-defined classes cannot be replicated

2016-08-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419267#comment-15419267
 ] 

Sean Owen commented on SPARK-17042:
---

Scala 2.10 or 2.11? I'm pretty sure this is a duplicate.

> Repl-defined classes cannot be replicated
> -
>
> Key: SPARK-17042
> URL: https://issues.apache.org/jira/browse/SPARK-17042
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Reporter: Eric Liang
>
> A simple fix is to erase the classTag when using the default serializer, 
> since it's not needed in that case, and the classTag was failing to 
> deserialize on the remote end.
> The proper fix is actually to use the right classloader when deserializing 
> the classtags, but that is a much more invasive change for 2.0.
> The following test can be added to ReplSuite to reproduce the bug:
> {code}
>   test("replicating blocks of object with class defined in repl") {
> val output = runInterpreter("local-cluster[2,1,1024]",
>   """
> |import org.apache.spark.storage.StorageLevel._
> |case class Foo(i: Int)
> |val ret = sc.parallelize((1 to 100).map(Foo), 
> 10).persist(MEMORY_ONLY_2)
> |ret.count()
> |sc.getExecutorStorageStatus.map(s => 
> s.rddBlocksById(ret.id).size).sum
>   """.stripMargin)
> assertDoesNotContain("error:", output)
> assertDoesNotContain("Exception", output)
> assertContains(": Int = 20", output)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17043) Cannot call zipWithIndex on RDD with more than 200 columns (get wrong result)

2016-08-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17043.
---
Resolution: Duplicate

> Cannot call zipWithIndex on RDD with more than 200 columns (get wrong result)
> -
>
> Key: SPARK-17043
> URL: https://issues.apache.org/jira/browse/SPARK-17043
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Barry Becker
>
> I have a method that adds a row index column to a dataframe. It only works 
> correctly if the dataframe has fewer than 200 columns. With more than 200 
> columns, nearly all the data becomes empty (""'s for values).
> {code}
> def zipWithIndex(df: DataFrame, rowIdxColName: String): DataFrame = {
>   val nullable = false
>   df.sparkSession.createDataFrame(
>     df.rdd.zipWithIndex.map { case (row, i) => Row.fromSeq(row.toSeq :+ i) },
>     StructType(df.schema.fields :+ StructField(rowIdxColName, LongType, nullable))
>   )
> }
> {code}
> This might be related to https://issues.apache.org/jira/browse/SPARK-16664 
> but I'm not sure. I saw the 200 column threshold and it made me think it 
> might be related. I saw this problem in spark 1.6.2 and 2.0.0. Maybe it is 
> fixed in 2.0.1 (have not tried yet). I have no idea why the 200 column 
> threshold is significant.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17044) Add window function test in SQLQueryTestSuite

2016-08-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17044:


Assignee: (was: Apache Spark)

> Add window function test in SQLQueryTestSuite
> -
>
> Key: SPARK-17044
> URL: https://issues.apache.org/jira/browse/SPARK-17044
> Project: Spark
>  Issue Type: Improvement
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> New `SQLQueryTestSuite` simplifies SQL testcases.
> This issue aims to replace `WindowQuerySuite.scala` of `sql/hive` module with 
> `window_functions.sql` in `sql/core` module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17044) Add window function test in SQLQueryTestSuite

2016-08-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419238#comment-15419238
 ] 

Apache Spark commented on SPARK-17044:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/14623

> Add window function test in SQLQueryTestSuite
> -
>
> Key: SPARK-17044
> URL: https://issues.apache.org/jira/browse/SPARK-17044
> Project: Spark
>  Issue Type: Improvement
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> New `SQLQueryTestSuite` simplifies SQL testcases.
> This issue aims to replace `WindowQuerySuite.scala` of `sql/hive` module with 
> `window_functions.sql` in `sql/core` module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17044) Add window function test in SQLQueryTestSuite

2016-08-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17044:


Assignee: Apache Spark

> Add window function test in SQLQueryTestSuite
> -
>
> Key: SPARK-17044
> URL: https://issues.apache.org/jira/browse/SPARK-17044
> Project: Spark
>  Issue Type: Improvement
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> New `SQLQueryTestSuite` simplifies SQL testcases.
> This issue aims to replace `WindowQuerySuite.scala` of `sql/hive` module with 
> `window_functions.sql` in `sql/core` module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17044) Add window function test in SQLQueryTestSuite

2016-08-12 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-17044:
-

 Summary: Add window function test in SQLQueryTestSuite
 Key: SPARK-17044
 URL: https://issues.apache.org/jira/browse/SPARK-17044
 Project: Spark
  Issue Type: Improvement
Reporter: Dongjoon Hyun
Priority: Minor


New `SQLQueryTestSuite` simplifies SQL testcases.
This issue aims to replace `WindowQuerySuite.scala` of `sql/hive` module with 
`window_functions.sql` in `sql/core` module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17042) Repl-defined classes cannot be replicated

2016-08-12 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-17042:
---
Description: 
A simple fix is to erase the classTag when using the default serializer, since 
it's not needed in that case, and the classTag was failing to deserialize on 
the remote end.

The proper fix is actually to use the right classloader when deserializing the 
classtags, but that is a much more invasive change for 2.0.

The following test can be added to ReplSuite to reproduce the bug:

{code}
  test("replicating blocks of object with class defined in repl") {
val output = runInterpreter("local-cluster[2,1,1024]",
  """
|import org.apache.spark.storage.StorageLevel._
|case class Foo(i: Int)
|val ret = sc.parallelize((1 to 100).map(Foo), 
10).persist(MEMORY_ONLY_2)
|ret.count()
|sc.getExecutorStorageStatus.map(s => s.rddBlocksById(ret.id).size).sum
  """.stripMargin)
assertDoesNotContain("error:", output)
assertDoesNotContain("Exception", output)
assertContains(": Int = 20", output)
  }
{code}

  was:
The following test can be added to ReplSuite to reproduce the bug:

{code}
  test("replicating blocks of object with class defined in repl") {
val output = runInterpreter("local-cluster[2,1,1024]",
  """
|import org.apache.spark.storage.StorageLevel._
|case class Foo(i: Int)
|val ret = sc.parallelize((1 to 100).map(Foo), 
10).persist(MEMORY_ONLY_2)
|ret.count()
|sc.getExecutorStorageStatus.map(s => s.rddBlocksById(ret.id).size).sum
  """.stripMargin)
assertDoesNotContain("error:", output)
assertDoesNotContain("Exception", output)
assertContains(": Int = 20", output)
  }
{code}


> Repl-defined classes cannot be replicated
> -
>
> Key: SPARK-17042
> URL: https://issues.apache.org/jira/browse/SPARK-17042
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Reporter: Eric Liang
>
> A simple fix is to erase the classTag when using the default serializer, 
> since it's not needed in that case, and the classTag was failing to 
> deserialize on the remote end.
> The proper fix is actually to use the right classloader when deserializing 
> the classtags, but that is a much more invasive change for 2.0.
> The following test can be added to ReplSuite to reproduce the bug:
> {code}
>   test("replicating blocks of object with class defined in repl") {
> val output = runInterpreter("local-cluster[2,1,1024]",
>   """
> |import org.apache.spark.storage.StorageLevel._
> |case class Foo(i: Int)
> |val ret = sc.parallelize((1 to 100).map(Foo), 
> 10).persist(MEMORY_ONLY_2)
> |ret.count()
> |sc.getExecutorStorageStatus.map(s => 
> s.rddBlocksById(ret.id).size).sum
>   """.stripMargin)
> assertDoesNotContain("error:", output)
> assertDoesNotContain("Exception", output)
> assertContains(": Int = 20", output)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17043) Cannot call zipWithIndex on RDD with more than 200 columns (get wrong result)

2016-08-12 Thread Barry Becker (JIRA)
Barry Becker created SPARK-17043:


 Summary: Cannot call zipWithIndex on RDD with more than 200 
columns (get wrong result)
 Key: SPARK-17043
 URL: https://issues.apache.org/jira/browse/SPARK-17043
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0, 1.6.2
Reporter: Barry Becker


I have a method that adds a row index column to a dataframe. It only works 
correctly if the dataframe has fewer than 200 columns. With more than 200 
columns, nearly all the data becomes empty (""'s for values).

{code}
def zipWithIndex(df: DataFrame, rowIdxColName: String): DataFrame = {
  val nullable = false
  df.sparkSession.createDataFrame(
    df.rdd.zipWithIndex.map { case (row, i) => Row.fromSeq(row.toSeq :+ i) },
    StructType(df.schema.fields :+ StructField(rowIdxColName, LongType, nullable))
  )
}
{code}
This might be related to https://issues.apache.org/jira/browse/SPARK-16664 but 
I'm not sure. I saw the 200 column threshold and it made me think it might be 
related. I saw this problem in spark 1.6.2 and 2.0.0. Maybe it is fixed in 
2.0.1 (have not tried yet). I have no idea why the 200 column threshold is 
significant.
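
A possible interim workaround (a minimal sketch only; it assigns unique but not necessarily consecutive ids, which may or may not be acceptable, and avoids RDD.zipWithIndex entirely):
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.monotonically_increasing_id

// Add a unique (but non-consecutive) row id column without round-tripping through
// RDD.zipWithIndex, sidestepping the column-count-dependent behavior described above.
def withRowId(df: DataFrame, rowIdxColName: String): DataFrame =
  df.withColumn(rowIdxColName, monotonically_increasing_id())
{code}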



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17042) Repl-defined classes cannot be replicated

2016-08-12 Thread Eric Liang (JIRA)
Eric Liang created SPARK-17042:
--

 Summary: Repl-defined classes cannot be replicated
 Key: SPARK-17042
 URL: https://issues.apache.org/jira/browse/SPARK-17042
 Project: Spark
  Issue Type: Sub-task
Reporter: Eric Liang


The following test can be added to ReplSuite to reproduce the bug:

{code}
  test("replicating blocks of object with class defined in repl") {
val output = runInterpreter("local-cluster[2,1,1024]",
  """
|import org.apache.spark.storage.StorageLevel._
|case class Foo(i: Int)
|val ret = sc.parallelize((1 to 100).map(Foo), 
10).persist(MEMORY_ONLY_2)
|ret.count()
|sc.getExecutorStorageStatus.map(s => s.rddBlocksById(ret.id).size).sum
  """.stripMargin)
assertDoesNotContain("error:", output)
assertDoesNotContain("Exception", output)
assertContains(": Int = 20", output)
  }
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17003) release-build.sh is missing hive-thriftserver for scala 2.11

2016-08-12 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-17003.
--
   Resolution: Fixed
Fix Version/s: 1.6.3

Issue resolved by pull request 14586
[https://github.com/apache/spark/pull/14586]

> release-build.sh is missing hive-thriftserver for scala 2.11
> 
>
> Key: SPARK-17003
> URL: https://issues.apache.org/jira/browse/SPARK-17003
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.6.2
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 1.6.3
>
>
> The same issue as SPARK-16453. But for branch 1.6, we are missing the profile 
> for scala 2.11 build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17041) Columns in schema are no longer case sensitive when reading csv file

2016-08-12 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419158#comment-15419158
 ] 

Barry Becker commented on SPARK-17041:
--

I'm not sure either. How can we find out? I think it would be better if column 
names were case sensitive. Was the change intentional? This was a dataset from a 
client, and for whatever reason they thought it was reasonable to have columns 
that varied only by case. Since this worked before, I thought it might be 
considered a regression, but maybe it should be a feature request.

In our product, someone may rename a column from "output" to "Output". If 
column names are not case sensitive, I'm not sure what problems this might 
cause. At a minimum, the rename will probably not work.
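
If the immediate goal is just to keep columns that differ only by case usable, one thing worth trying is the analyzer's case-sensitivity flag. This is a hedged sketch only: whether the 2.0.0 CSV path honors this setting is exactly what would need confirming, and the app name and file path here are placeholders.
{code}
import org.apache.spark.sql.SparkSession

// Build a session with case-sensitive analysis so "output" and "Output" are distinct.
val spark = SparkSession.builder()
  .appName("case-sensitive-csv")               // illustrative name
  .master("local[*]")                          // illustrative; normally set by spark-submit
  .config("spark.sql.caseSensitive", "true")
  .getOrCreate()

// Placeholder path; the read itself is the same as in the report above.
val df = spark.read
  .format("csv")
  .option("header", "false")
  .csv("/path/to/dataFile.csv")
{code}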

> Columns in schema are no longer case sensitive when reading csv file
> 
>
> Key: SPARK-17041
> URL: https://issues.apache.org/jira/browse/SPARK-17041
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Barry Becker
>
> It used to be (in spark 1.6.2) that I could read a csv file that had columns 
> with  names that differed only by case. For example, one column may be 
> "output" and another called "Output". Now (with spark 2.0.0) if I try to read 
> such a file, I get an error like this:
> {code}
> org.apache.spark.sql.AnalysisException: Reference 'Output' is ambiguous, 
> could be: Output#1263, Output#1295.;
> {code}
> The schema (dfSchema below) that I pass to the csv read looks like this:
> {code}
> StructType( StructField(Output,StringType,true), ... 
> StructField(output,StringType,true), ...)
> {code}
> The code that does the read is this
> {code}
> sqlContext.read
>   .format("csv")
>   .option("header", "false") // Use first line of all files as header
>   .option("inferSchema", "false") // Automatically infer data types
>   .schema(dfSchema)
>   .csv(dataFile)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6235) Address various 2G limits

2016-08-12 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419151#comment-15419151
 ] 

Herman van Hovell commented on SPARK-6235:
--

[~gq] it might be a good idea to share a design before pressing ahead. This 
seems to be a complex issue that probably needs some discussion of the approach 
before a PR is opened.

If we don't take that precaution, you might end up putting a lot of time into a 
very complex and very difficult-to-review PR.

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>
> An umbrella ticket to track the various 2G limit we have in Spark, due to the 
> use of byte arrays and ByteBuffers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16771) Infinite recursion loop in org.apache.spark.sql.catalyst.trees.TreeNode when table name collides.

2016-08-12 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-16771.
---
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.1.0

> Infinite recursion loop in org.apache.spark.sql.catalyst.trees.TreeNode when 
> table name collides.
> -
>
> Key: SPARK-16771
> URL: https://issues.apache.org/jira/browse/SPARK-16771
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Furcy Pin
>Assignee: Dongjoon Hyun
> Fix For: 2.1.0
>
>
> How to reproduce: 
> In spark-sql on Hive
> {code}
> DROP TABLE IF EXISTS t1 ;
> CREATE TABLE test.t1(col1 string) ;
> WITH t1 AS (
>   SELECT  col1
>   FROM t1 
> )
> SELECT col1
> FROM t1
> LIMIT 2
> ;
> {code}
> This make a nice StackOverflowError:
> {code}
> java.lang.StackOverflowError
>   at 
> scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:230)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:348)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:170)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution$$anonfun$substituteCTE$1.applyOrElse(Analyzer.scala:147)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution$$anonfun$substituteCTE$1.applyOrElse(Analyzer.scala:133)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$2.apply(TreeNode.scala:286)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$2.apply(TreeNode.scala:286)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:286)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
> ...
> {code}
> This does not happen if I change 

[jira] [Commented] (SPARK-17041) Columns in schema are no longer case sensitive when reading csv file

2016-08-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419133#comment-15419133
 ] 

Sean Owen commented on SPARK-17041:
---

Behavior changes across major versions. I'm not sure this is a bug just because 
behavior is different.

> Columns in schema are no longer case sensitive when reading csv file
> 
>
> Key: SPARK-17041
> URL: https://issues.apache.org/jira/browse/SPARK-17041
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Barry Becker
>
> It used to be (in spark 1.6.2) that I could read a csv file that had columns 
> with  names that differed only by case. For example, one column may be 
> "output" and another called "Output". Now (with spark 2.0.0) if I try to read 
> such a file, I get an error like this:
> {code}
> org.apache.spark.sql.AnalysisException: Reference 'Output' is ambiguous, 
> could be: Output#1263, Output#1295.;
> {code}
> The schema (dfSchema below) that I pass to the csv read looks like this:
> {code}
> StructType( StructField(Output,StringType,true), ... 
> StructField(output,StringType,true), ...)
> {code}
> The code that does the read is this
> {code}
> sqlContext.read
>   .format("csv")
>   .option("header", "false") // Use first line of all files as header
>   .option("inferSchema", "false") // Automatically infer data types
>   .schema(dfSchema)
>   .csv(dataFile)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17041) Columns in schema are no longer case sensitive when reading csv file

2016-08-12 Thread Barry Becker (JIRA)
Barry Becker created SPARK-17041:


 Summary: Columns in schema are no longer case sensitive when 
reading csv file
 Key: SPARK-17041
 URL: https://issues.apache.org/jira/browse/SPARK-17041
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.0.0
Reporter: Barry Becker


It used to be (in spark 1.6.2) that I could read a csv file that had columns 
with  names that differed only by case. For example, one column may be "output" 
and another called "Output". Now (with spark 2.0.0) if I try to read such a 
file, I get an error like this:
{code}
org.apache.spark.sql.AnalysisException: Reference 'Output' is ambiguous, could 
be: Output#1263, Output#1295.;
{code}

The schema (dfSchema below) that I pass to the csv read looks like this:
{code}
StructType( StructField(Output,StringType,true), ... 
StructField(output,StringType,true), ...)
{code}
The code that does the read is this
{code}
sqlContext.read
  .format("csv")
  .option("header", "false") // Use first line of all files as header
  .option("inferSchema", "false") // Automatically infer data types
  .schema(dfSchema)
  .csv(dataFile)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-08-12 Thread Artur (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419079#comment-15419079
 ] 

Artur commented on SPARK-15044:
---

I'm not sure what we should do with this issue. The root cause is invalid Hive 
metadata, so it isn't Spark's fault.

I've built Spark with a change that catches InvalidInputException in HadoopRDD 
and just logs it, and here are the results. Spark no longer fails:
{code}
spark-sql> select * from test;
16/08/13 00:58:08 ERROR HadoopRDD: Input path does not exist: .../test/p=1
2   2
3   3
Time taken: 4.494 seconds, Fetched 2 row(s)

hive> select * from test;
OK
2   2
3   3
Time taken: 0.041 seconds, Fetched: 2 row(s)
{code}
However, I'm not sure whether not failing when the file doesn't exist in HDFS 
is the correct behavior.
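
For reference, a conceptual sketch of the catch-and-log approach described above (this is not the actual HadoopRDD patch; the helper name and log message are illustrative):
{code}
import org.apache.hadoop.mapred.{InputSplit, InvalidInputException}

// Compute input splits, but turn a missing input path into an empty split list
// plus an error log instead of an exception that kills the whole query.
def splitsOrEmpty(listSplits: => Array[InputSplit])(logError: String => Unit): Array[InputSplit] =
  try {
    listSplits
  } catch {
    case e: InvalidInputException =>
      logError(s"Input path does not exist, returning no splits: ${e.getMessage}")
      Array.empty[InputSplit]
  }
{code}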


> spark-sql will throw "input path does not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually
> -
>
> Key: SPARK-15044
> URL: https://issues.apache.org/jira/browse/SPARK-15044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: huangyu
>
> spark-sql will throw "input path not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually.The 
> situation is as follows:
> 1) Create a table "test". "create table test (n string) partitioned by (p 
> string)"
> 2) Load some data into partition(p='1')
> 3)Remove the path related to partition(p='1') of table test manually. "hadoop 
> fs -rmr /warehouse//test/p=1"
> 4)Run spark sql, spark-sql -e "select n from test where p='1';"
> Then it throws exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> ./test/p=1
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
> at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is in Spark 1.6.1; if I use Spark 1.4.0, it is OK.
> I think spark-sql should ignore the path, just like Hive does (and as Spark 
> did in earlier versions), rather than throw an exception.



--
This message was sent by Atlassian JIRA

[jira] [Commented] (SPARK-17039) cannot read null dates from csv file

2016-08-12 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419073#comment-15419073
 ] 

Barry Becker commented on SPARK-17039:
--

There are literal ?'s in the datafile. The "nullValue" option indicates that 
those ?'s should be read as null values. I also added the "dateFormat" option 
which describes how the dates in the file should be read.

Let me try to provide more information so you can reproduce.

Here is the schema that I am specifying (dfSchema above):
{code}
StructType(StructField(string normal,StringType,true), 
StructField(Years,TimestampType,true), StructField(Months,TimestampType,true), 
StructField(WeekDays,TimestampType,true), StructField(Days,TimestampType,true), 
StructField(DaysWithNull,TimestampType,true), 
StructField(Hours,TimestampType,true), StructField(Minutes,TimestampType,true), 
StructField(normal dates,TimestampType,true), StructField(Wide Range 
Dates,TimestampType,true), StructField(Narrow,TimestampType,true), 
StructField(Far Future,TimestampType,true), StructField(Mostly 
Null,TimestampType,true), StructField(All Same Date,TimestampType,true), 
StructField(Past/Future,TimestampType,true), StructField(All 
nulls,TimestampType,true), StructField(Seconds,TimestampType,true))
{code}

and here are the contents of the csv datafile (note that there are lots of 
nulls). This worked using the databricks spark-csv lib as a dependency in Spark 
1.6.2:
{code}
foo 2015-03-09T00:00:00 2015-03-09T00:00:00 2015-03-09T00:00:00 
2015-03-09T00:00:00 2015-03-09T00:00:00 2015-03-09T00:00:00 
2015-03-09T00:01:00 2007-11-09T00:00:00 1967-11-09T00:00:00 
2015-03-09T12:00:00 2700-01-01T00:00:00 2015-03-09T00:00:00 
2015-03-09T00:00:00 1983-03-09T00:00:00 ?   2015-03-09T12:01:00
bar 2016-03-09T00:00:00 2015-04-09T00:00:00 2015-03-10T00:00:00 
2015-03-10T00:00:00 ?   2015-03-09T01:00:00 2015-03-09T00:03:00 
2007-10-02T00:00:00 1987-10-02T00:00:00 2015-03-09T12:03:00 
3701-01-01T00:00:00 2015-04-09T00:00:00 2015-03-09T00:00:00 
1865-04-09T00:00:00 ?   2015-03-09T12:01:01
baz 2017-03-09T00:00:00 2015-05-09T00:00:00 2015-03-11T00:00:00 
2015-03-11T00:00:00 2015-03-11T00:00:00 2015-03-09T02:00:00 
2015-03-09T00:05:00 1999-04-04T03:00:00 1999-02-03T00:00:00 
2015-03-09T12:08:00 4702-01-01T00:00:00 ?   2015-03-09T00:00:00 
1777-05-09T00:00:00 ?   2015-03-09T12:01:03
but 2018-03-09T00:00:00 2015-06-09T00:00:00 2015-03-12T00:00:00 
2015-03-12T00:00:00 2015-03-12T00:00:00 2015-03-09T03:00:00 
2015-03-09T00:08:00 2025-10-10T00:00:00 2025-10-10T00:00:00 
2015-03-09T12:10:00 4103-01-01T00:00:00 2015-06-09T00:00:00 
2015-03-09T00:00:00 2089-06-09T00:00:00 ?   2015-03-09T12:01:05
fooo2019-03-09T00:00:00 2015-07-09T00:00:00 2015-03-13T00:00:00 
2015-03-13T00:00:00 2015-03-13T00:00:00 2015-03-09T04:00:00 
2015-03-09T00:09:00 2004-02-23T00:00:00 2004-02-23T00:00:00 
2015-03-09T12:15:00 4204-01-01T00:00:00 ?   2015-03-09T00:00:00 
2125-07-09T00:00:00 ?   2015-03-09T12:01:07
bar 2020-03-09T00:00:00 2015-08-09T00:00:00 2015-03-16T00:00:00 
2015-03-14T00:00:00 2015-03-14T00:00:00 2015-03-09T05:00:00 
2015-03-09T00:12:00 2019-03-04T00:00:00 3019-03-04T00:00:00 
2015-03-09T12:20:00 4305-01-01T00:00:00 2015-08-09T00:00:00 
2015-03-09T00:00:00 2215-08-09T00:00:00 ?   2015-03-09T12:01:09
baz 2021-03-09T00:00:00 2015-09-09T00:00:00 2015-03-17T00:00:00 
2015-03-15T00:00:00 2015-03-15T00:00:00 2015-03-09T06:00:00 
2015-03-09T00:20:00 1999-04-04T02:34:00 ?   2015-03-09T12:25:00 
4406-01-01T00:00:00 2015-09-09T00:00:00 2015-03-09T00:00:00 
1754-09-09T00:00:00 ?   2015-03-09T12:01:11
but 2022-03-09T00:00:00 2015-10-09T00:00:00 2015-03-18T00:00:00 
2015-03-16T00:00:00 ?   2015-03-09T07:00:00 2015-03-09T00:30:00 
1999-03-01T00:00:00 1909-03-01T00:00:00 2015-03-09T12:30:00 
4507-01-01T00:00:00 ?   2015-03-09T00:00:00 1958-10-09T00:00:00 
?   2015-03-09T12:01:00
bar 2023-03-09T00:00:00 2015-11-09T00:00:00 2015-03-19T00:00:00 
2015-03-17T00:00:00 2015-03-17T00:00:00 2015-03-09T08:00:00 
2015-03-09T00:35:00 2001-02-12T00:00:00 ?   2015-03-09T12:35:00 
4608-01-01T00:00:00 2015-11-09T00:00:00 2015-03-09T00:00:00 
3000-11-09T00:00:00 ?   2015-03-09T12:01:00
here is a really really really long string value2024-03-09T00:00:00 
2015-12-09T00:00:00 2015-03-20T00:00:00 2015-03-18T00:00:00 
2015-03-18T00:00:00 2015-03-09T09:00:00 

[jira] [Updated] (SPARK-17039) cannot read null dates from csv file

2016-08-12 Thread Barry Becker (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barry Becker updated SPARK-17039:
-
Description: 
I see this exact same bug as reported in this [stack overflow 
post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column]
  using Spark 2.0.0 (released version).
In scala, I read a csv using 

sqlContext.read
  .format("csv")
  .option("header", "false")
  .option("inferSchema", "false") 
  .option("nullValue", "?")
  .option("dateFormat", "-MM-dd'T'HH:mm:ss")
  .schema(dfSchema)
  .csv(dataFile)
The data contains some null dates (represented with ?).
The error I get is:
{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 
10, localhost): java.text.ParseException: Unparseable date: "?"
{code}

  was:
I see this exact same bug as reported in this [stack overflow 
post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column]
  using Spark 2.0.0 (released version).
In scala, I read a csv using 

sqlContext.read
  .format("csv")
  .option("header", "false")
  .option("inferSchema", "false") 
  .option("nullValue", "?")
  .schema(dfSchema)
  .csv(dataFile)
The data contains some null dates (represented with ?).
The error I get is:
{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 
10, localhost): java.text.ParseException: Unparseable date: "?"
{code}


> cannot read null dates from csv file
> 
>
> Key: SPARK-17039
> URL: https://issues.apache.org/jira/browse/SPARK-17039
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Barry Becker
>
> I see this exact same bug as reported in this [stack overflow 
> post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column]
>   using Spark 2.0.0 (released version).
> In scala, I read a csv using 
> sqlContext.read
>   .format("csv")
>   .option("header", "false")
>   .option("inferSchema", "false") 
>   .option("nullValue", "?")
>   .option("dateFormat", "-MM-dd'T'HH:mm:ss")
>   .schema(dfSchema)
>   .csv(dataFile)
> The data contains some null dates (represented with ?).
> The error I get is:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 
> (TID 10, localhost): java.text.ParseException: Unparseable date: "?"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17039) cannot read null dates from csv file

2016-08-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419062#comment-15419062
 ] 

Sean Owen commented on SPARK-17039:
---

Oh right, I looked right past that. But if a null date is represented as "?" per 
that config, then "?" wouldn't be a valid date, right?
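
If it turns out the "?" marker is only mishandled for date/timestamp columns, one workaround to try is to read those columns as strings and convert afterwards. This is a sketch only: it assumes a SparkSession named spark, a placeholder file path and column name, and the timestamp pattern implied by the sample data.
{code}
import org.apache.spark.sql.functions.unix_timestamp

// Read everything as strings first; "?" becomes null via the nullValue option.
val raw = spark.read
  .format("csv")
  .option("header", "false")
  .option("nullValue", "?")
  .csv("/path/to/dataFile.csv")          // placeholder path

// Parse a string column (here the default name _c1) into a timestamp; nulls stay null.
val parsed = raw.withColumn(
  "ts",
  unix_timestamp(raw("_c1"), "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp"))
{code}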

> cannot read null dates from csv file
> 
>
> Key: SPARK-17039
> URL: https://issues.apache.org/jira/browse/SPARK-17039
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Barry Becker
>
> I see this exact same bug as reported in this [stack overflow 
> post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column]
>   using Spark 2.0.0 (released version).
> In scala, I read a csv using 
> sqlContext.read
>   .format("csv")
>   .option("header", "false")
>   .option("inferSchema", "false") 
>   .option("nullValue", "?")
>   .schema(dfSchema)
>   .csv(dataFile)
> The data contains some null dates (represented with ?).
> The error I get is:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 
> (TID 10, localhost): java.text.ParseException: Unparseable date: "?"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17039) cannot read null dates from csv file

2016-08-12 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419040#comment-15419040
 ] 

Barry Becker commented on SPARK-17039:
--

I do specify a schema (.schema(dfSchema)), and it says that the column is a 
date column. I left it out because there were lots of other columns, and I need 
to spend some time simplifying the example. This is from a unit test that worked 
fine using Spark 1.6.2 but fails using Spark 2.0.0. I'm pretty sure it's a real 
bug. The example in the stack overflow post may provide a better reproducible 
case.

> cannot read null dates from csv file
> 
>
> Key: SPARK-17039
> URL: https://issues.apache.org/jira/browse/SPARK-17039
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Barry Becker
>
> I see this exact same bug as reported in this [stack overflow 
> post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column]
>   using Spark 2.0.0 (released version).
> In scala, I read a csv using 
> sqlContext.read
>   .format("csv")
>   .option("header", "false")
>   .option("inferSchema", "false") 
>   .option("nullValue", "?")
>   .schema(dfSchema)
>   .csv(dataFile)
> The data contains some null dates (represented with ?).
> The error I get is:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 
> (TID 10, localhost): java.text.ParseException: Unparseable date: "?"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17039) cannot read null dates from csv file

2016-08-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419017#comment-15419017
 ] 

Sean Owen commented on SPARK-17039:
---

Hm, how are they being parsed as dates -- or is that the issue? You don't infer 
or specify a schema, but you say the column is indeed a date column. If it's a 
date column, "?" is not valid, and the error is correct.

> cannot read null dates from csv file
> 
>
> Key: SPARK-17039
> URL: https://issues.apache.org/jira/browse/SPARK-17039
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Barry Becker
>
> I see this exact same bug as reported in this [stack overflow 
> post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column]
>   using Spark 2.0.0 (released version).
> In scala, I read a csv using 
> sqlContext.read
>   .format("csv")
>   .option("header", "false")
>   .option("inferSchema", "false") 
>   .option("nullValue", "?")
>   .schema(dfSchema)
>   .csv(dataFile)
> The data contains some null dates (represented with ?).
> The error I get is:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 
> (TID 10, localhost): java.text.ParseException: Unparseable date: "?"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17040) cannot read null dates from csv file

2016-08-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17040.
---
Resolution: Duplicate

> cannot read null dates from csv file
> 
>
> Key: SPARK-17040
> URL: https://issues.apache.org/jira/browse/SPARK-17040
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Barry Becker
>
> I see this exact same bug as reported in this [stack overflow 
> post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column]
>   using Spark 2.0.0 (released version).
> In scala, I read a csv using 
> sqlContext.read
>   .format("csv")
>   .option("header", "false")
>   .option("inferSchema", "false") 
>   .option("nullValue", "?")
>   .schema(dfSchema)
>   .csv(dataFile)
> The data contains some null dates (represented with ?).
> The error I get is:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 
> (TID 10, localhost): java.text.ParseException: Unparseable date: "?"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17040) cannot read null dates from csv file

2016-08-12 Thread Barry Becker (JIRA)
Barry Becker created SPARK-17040:


 Summary: cannot read null dates from csv file
 Key: SPARK-17040
 URL: https://issues.apache.org/jira/browse/SPARK-17040
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.0.0
Reporter: Barry Becker


I see this exact same bug as reported in this [stack overflow 
post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column]
  using Spark 2.0.0 (released version).
In scala, I read a csv using 

sqlContext.read
  .format("csv")
  .option("header", "false")
  .option("inferSchema", "false") 
  .option("nullValue", "?")
  .schema(dfSchema)
  .csv(dataFile)
The data contains some null dates (represented with ?).
The error I get is:
{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 
10, localhost): java.text.ParseException: Unparseable date: "?"
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17039) cannot read null dates from csv file

2016-08-12 Thread Barry Becker (JIRA)
Barry Becker created SPARK-17039:


 Summary: cannot read null dates from csv file
 Key: SPARK-17039
 URL: https://issues.apache.org/jira/browse/SPARK-17039
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.0.0
Reporter: Barry Becker


I see this exact same bug as reported in this [stack overflow 
post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column]
  using Spark 2.0.0 (released version).
In scala, I read a csv using 

sqlContext.read
  .format("csv")
  .option("header", "false")
  .option("inferSchema", "false") 
  .option("nullValue", "?")
  .schema(dfSchema)
  .csv(dataFile)
The data contains some null dates (represented with ?).
The error I get is:
{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 
10, localhost): java.text.ParseException: Unparseable date: "?"
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17038) StreamingSource reports metrics for lastCompletedBatch instead of lastReceivedBatch

2016-08-12 Thread Oz Ben-Ami (JIRA)
Oz Ben-Ami created SPARK-17038:
--

 Summary: StreamingSource reports metrics for lastCompletedBatch 
instead of lastReceivedBatch
 Key: SPARK-17038
 URL: https://issues.apache.org/jira/browse/SPARK-17038
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 2.0.0, 1.6.2
Reporter: Oz Ben-Ami
Priority: Minor


StreamingSource's lastReceivedBatch_submissionTime, 
lastReceivedBatch_processingTimeStart, and lastReceivedBatch_processingTimeEnd 
all use data from lastCompletedBatch instead of lastReceivedBatch. In 
particular, this makes it impossible to match lastReceivedBatch_records with a 
batchID/submission time.
This is apparent when looking at StreamingSource.scala, lines 89-94.
I would guess that just replacing Completed->Received in those lines would fix 
the issue, but I haven't tested it.
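
For illustration only, a gauge backed by a received-batch accessor might look like the sketch below (made-up names on top of the Dropwizard metrics API, not the actual StreamingSource code):
{code}
import com.codahale.metrics.{Gauge, MetricRegistry}

// Hypothetical registration: the gauge pulls from a "last received batch" accessor
// instead of the completed-batch one, so records and submission time line up.
def registerReceivedBatchGauge(
    registry: MetricRegistry,
    lastReceivedSubmissionTime: () => Long): Unit = {
  registry.register("streaming.lastReceivedBatch_submissionTime", new Gauge[Long] {
    override def getValue: Long = lastReceivedSubmissionTime()
  })
}
{code}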



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-08-12 Thread saurabh paliwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418994#comment-15418994
 ] 

saurabh paliwal commented on SPARK-15044:
-

Hi!
Sorry, I mixed it up. So the use case should be more like: I created some 
partition p1 in Hive, and a long time later I deleted its data from HDFS, and 
the query range contains p1.
Do you suggest clearing the Hive metastore too?

> spark-sql will throw "input path does not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually
> -
>
> Key: SPARK-15044
> URL: https://issues.apache.org/jira/browse/SPARK-15044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: huangyu
>
> spark-sql will throw "input path not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually.The 
> situation is as follows:
> 1) Create a table "test". "create table test (n string) partitioned by (p 
> string)"
> 2) Load some data into partition(p='1')
> 3)Remove the path related to partition(p='1') of table test manually. "hadoop 
> fs -rmr /warehouse//test/p=1"
> 4)Run spark sql, spark-sql -e "select n from test where p='1';"
> Then it throws exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> ./test/p=1
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
> at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is in Spark 1.6.1; if I use Spark 1.4.0, it is OK.
> I think spark-sql should ignore the path, just like Hive does (and as Spark 
> did in earlier versions), rather than throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals

2016-08-12 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16955:
-
Assignee: Peter Lee

> Using ordinals in ORDER BY causes an analysis error when the query has a 
> GROUP BY clause using ordinals
> ---
>
> Key: SPARK-16955
> URL: https://issues.apache.org/jira/browse/SPARK-16955
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>Assignee: Peter Lee
> Fix For: 2.0.1, 2.1.0
>
>
> The following queries work
> {code}
> select a from (select 1 as a) tmp order by 1
> select a, count(*) from (select 1 as a) tmp group by 1
> select a, count(*) from (select 1 as a) tmp group by 1 order by a
> {code}
> However, the following query does not
> {code}
> select a, count(*) from (select 1 as a) tmp group by 1 order by 1
> {code}
> {code}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> Group by position: '1' exceeds the size of the select list '0'. on unresolved 
> object, tree:
> Aggregate [1]
> +- SubqueryAlias tmp
>+- Project [1 AS a#82]
>   +- OneRowRelation$
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:749)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:739)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:739)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:715)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:715)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:714)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1237)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1182)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1182)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1181)
>   at 
> 

[jira] [Issue Comment Deleted] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-08-12 Thread saurabh paliwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

saurabh paliwal updated SPARK-15044:

Comment: was deleted

(was: Hi!
I am so sorry, I mixed it up.
So the premise is querying for data that is not in HDFS and for which no partition was
created in the metastore either. I think throwing an exception when the range of
partitions in a query includes such non-existent "partitions" is not acceptable in
some cases.
For example, I queried a table for p=1 to p=100, but the table was created and only
started making partitions from p=10. Now the query should give me the results for
10-100, no?)

> spark-sql will throw "input path does not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually
> -
>
> Key: SPARK-15044
> URL: https://issues.apache.org/jira/browse/SPARK-15044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: huangyu
>
> spark-sql will throw an "input path does not exist" exception if it handles a
> partition which exists in the Hive table but whose path has been removed manually.
> The situation is as follows:
> 1) Create a table "test": "create table test (n string) partitioned by (p
> string)"
> 2) Load some data into partition(p='1')
> 3) Remove the path related to partition(p='1') of table test manually: "hadoop
> fs -rmr /warehouse//test/p=1"
> 4) Run Spark SQL: spark-sql -e "select n from test where p='1';"
> Then it throws an exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> ./test/p=1
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
> at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is present in Spark 1.6.1; with Spark 1.4.0 it works fine.
> I think spark-sql should ignore the missing path, just as Hive does (and as Spark did
> in earlier versions), rather than throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


[jira] [Comment Edited] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-08-12 Thread saurabh paliwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418935#comment-15418935
 ] 

saurabh paliwal edited comment on SPARK-15044 at 8/12/16 2:45 PM:
--

Hi!
I am so sorry, I mixed it up.
So the premise is querying for data that is not in HDFS and for which no partition was
created in the metastore either. I think throwing an exception when the range of
partitions in a query includes such non-existent "partitions" is not acceptable in
some cases.
For example, I queried a table for p=1 to p=100, but the table was created and only
started making partitions from p=10. Now the query should give me the results for
10-100, no?


was (Author: saurai3h):
Hi!
I am so sorry, I mixed it up.
So the premise is querying for data that is not in HDFS and for which no partition was
created in the metastore either. I think throwing an exception when the range of
partitions in a query includes such non-existent "partitions" is not acceptable in
some cases.
For example, I queried a table for p=1 to p=100, but the table was created and only
started making partitions from p=10. Now a query for p = 5 to p = 50 should give me
the results for 10-50, no?

> spark-sql will throw "input path does not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually
> -
>
> Key: SPARK-15044
> URL: https://issues.apache.org/jira/browse/SPARK-15044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: huangyu
>
> spark-sql will throw an "input path does not exist" exception if it handles a
> partition which exists in the Hive table but whose path has been removed manually.
> The situation is as follows:
> 1) Create a table "test": "create table test (n string) partitioned by (p
> string)"
> 2) Load some data into partition(p='1')
> 3) Remove the path related to partition(p='1') of table test manually: "hadoop
> fs -rmr /warehouse//test/p=1"
> 4) Run Spark SQL: spark-sql -e "select n from test where p='1';"
> Then it throws an exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> ./test/p=1
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
> at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>

[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-08-12 Thread saurabh paliwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418935#comment-15418935
 ] 

saurabh paliwal commented on SPARK-15044:
-

Hi!
I am so sorry, I mixed it up.
So the premise is querying for data that is not in HDFS and for which no partition was
created in the metastore either. I think throwing an exception when the range of
partitions in a query includes such non-existent "partitions" is not acceptable in
some cases.
For example, I queried a table for p=1 to p=100, but the table was created and only
started making partitions from p=10. Now a query for p = 5 to p = 50 should give me
the results for 10-50, no?
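
For concreteness, a minimal pyspark sketch of the scenario described above (illustrative
only; it assumes a HiveContext bound to sqlContext and a partitioned table like the
"test" table from this issue, and is not taken from Spark itself):

{code}
# Illustrative sketch only: assumes a HiveContext bound to sqlContext and a
# Hive-backed table like the "test" table from this issue (names are examples).
sqlContext.sql(
    "CREATE TABLE IF NOT EXISTS test (n STRING) PARTITIONED BY (p STRING)")

# Partitions are only ever created from p=10 onwards; p=1..9 never exist,
# neither in the metastore nor on HDFS.
for p in range(10, 101):
    sqlContext.sql("ALTER TABLE test ADD IF NOT EXISTS PARTITION (p='%d')" % p)

# Querying a wider range than what exists: the expectation argued for above is
# that this simply returns rows for p=10..100 instead of failing the query.
rows = sqlContext.sql(
    "SELECT n FROM test WHERE CAST(p AS INT) BETWEEN 1 AND 100").collect()
{code}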

> spark-sql will throw "input path does not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually
> -
>
> Key: SPARK-15044
> URL: https://issues.apache.org/jira/browse/SPARK-15044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: huangyu
>
> spark-sql will throw an "input path does not exist" exception if it handles a
> partition which exists in the Hive table but whose path has been removed manually.
> The situation is as follows:
> 1) Create a table "test": "create table test (n string) partitioned by (p
> string)"
> 2) Load some data into partition(p='1')
> 3) Remove the path related to partition(p='1') of table test manually: "hadoop
> fs -rmr /warehouse//test/p=1"
> 4) Run Spark SQL: spark-sql -e "select n from test where p='1';"
> Then it throws an exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> ./test/p=1
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
> at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is present in Spark 1.6.1; with Spark 1.4.0 it works fine.
> I think spark-sql should ignore the missing path, just as Hive does (and as Spark did
> in earlier versions), rather than throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-08-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418876#comment-15418876
 ] 

Sean Owen commented on SPARK-15044:
---

The partition would still exist in this case, no?

> spark-sql will throw "input path does not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually
> -
>
> Key: SPARK-15044
> URL: https://issues.apache.org/jira/browse/SPARK-15044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: huangyu
>
> spark-sql will throw an "input path does not exist" exception if it handles a
> partition which exists in the Hive table but whose path has been removed manually.
> The situation is as follows:
> 1) Create a table "test": "create table test (n string) partitioned by (p
> string)"
> 2) Load some data into partition(p='1')
> 3) Remove the path related to partition(p='1') of table test manually: "hadoop
> fs -rmr /warehouse//test/p=1"
> 4) Run Spark SQL: spark-sql -e "select n from test where p='1';"
> Then it throws an exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> ./test/p=1
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
> at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is present in Spark 1.6.1; with Spark 1.4.0 it works fine.
> I think spark-sql should ignore the missing path, just as Hive does (and as Spark did
> in earlier versions), rather than throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-08-12 Thread saurabh paliwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418872#comment-15418872
 ] 

saurabh paliwal commented on SPARK-15044:
-

Hi!
I agree with Artur. Suppose there is a sparse table with hourly partitions. If for some
hour x there was indeed no data but the partition exists (for example, because all
partitions were pre-created rather than added with ADD PARTITION or MSCK REPAIR TABLE),
and you query a range of partitions that includes x, you would be better off with no
results than with a query failure.
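
For illustration, a small pyspark sketch of such a sparse table (hypothetical names; it
assumes a HiveContext bound to sqlContext and is not taken from any Spark test):

{code}
# Sketch of the sparse-table case (hypothetical names, HiveContext as sqlContext):
# hourly partitions are pre-created up front, so some of them never receive data.
sqlContext.sql(
    "CREATE TABLE IF NOT EXISTS events (msg STRING) PARTITIONED BY (hr INT)")

# Pre-create a full day of partitions; only a few will ever get data files.
for h in range(24):
    sqlContext.sql("ALTER TABLE events ADD IF NOT EXISTS PARTITION (hr=%d)" % h)

# A range query that covers empty hours; the argument above is that an hour
# with no data (or even a missing directory) should contribute zero rows
# rather than failing the whole query.
sqlContext.sql(
    "SELECT hr, count(*) FROM events WHERE hr BETWEEN 0 AND 23 GROUP BY hr").show()
{code}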

> spark-sql will throw "input path does not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually
> -
>
> Key: SPARK-15044
> URL: https://issues.apache.org/jira/browse/SPARK-15044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: huangyu
>
> spark-sql will throw an "input path does not exist" exception if it handles a
> partition which exists in the Hive table but whose path has been removed manually.
> The situation is as follows:
> 1) Create a table "test": "create table test (n string) partitioned by (p
> string)"
> 2) Load some data into partition(p='1')
> 3) Remove the path related to partition(p='1') of table test manually: "hadoop
> fs -rmr /warehouse//test/p=1"
> 4) Run Spark SQL: spark-sql -e "select n from test where p='1';"
> Then it throws an exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> ./test/p=1
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
> at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is present in Spark 1.6.1; with Spark 1.4.0 it works fine.
> I think spark-sql should ignore the missing path, just as Hive does (and as Spark did
> in earlier versions), rather than throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17037) distinct() operator fails on Dataframe with column names containing periods

2016-08-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17037.
---
Resolution: Duplicate

> distinct() operator fails on Dataframe with column names containing periods
> ---
>
> Key: SPARK-17037
> URL: https://issues.apache.org/jira/browse/SPARK-17037
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Michael Styles
>
> Using the distinct() operator on a Dataframe with column names containing 
> periods results in an AnalysisException. For example:
> {noformat}
> d = [{'pageview.count': 100, 'exit_page': 'example.com/landing'}]
> df = sqlContext.createDataFrame(d)
> df.distinct()
> {noformat}
> results in the following error:
> pyspark.sql.utils.AnalysisException: u'Cannot resolve column name 
> "pageview.count" among (exit_page, pageview.count);'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-08-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418862#comment-15418862
 ] 

Sean Owen commented on SPARK-15044:
---

That's the problem. If the semantics were "query anything that happens to exist on
HDFS", then this would be the right behavior. But the operation that created this
(immutable) result set said there were, say, 10 partitions, and so does the metastore.
Is it correct to label as "correct" a result that could not read all 10 partitions?
Granted, you show that Hive happily does so. I think that's probably bad, though. Or:
what's the use case for this behavior?

> spark-sql will throw "input path does not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually
> -
>
> Key: SPARK-15044
> URL: https://issues.apache.org/jira/browse/SPARK-15044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: huangyu
>
> spark-sql will throw an "input path does not exist" exception if it handles a
> partition which exists in the Hive table but whose path has been removed manually.
> The situation is as follows:
> 1) Create a table "test": "create table test (n string) partitioned by (p
> string)"
> 2) Load some data into partition(p='1')
> 3) Remove the path related to partition(p='1') of table test manually: "hadoop
> fs -rmr /warehouse//test/p=1"
> 4) Run Spark SQL: spark-sql -e "select n from test where p='1';"
> Then it throws an exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> ./test/p=1
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
> at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is present in Spark 1.6.1; with Spark 1.4.0 it works fine.
> I think spark-sql should ignore the missing path, just as Hive does (and as Spark did
> in earlier versions), rather than throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-08-12 Thread Artur (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418841#comment-15418841
 ] 

Artur commented on SPARK-15044:
---

Why does the result miss something?
If the data doesn't exist in HDFS, it should not show up in the result (the partition
still exists in the metadata).

> spark-sql will throw "input path does not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually
> -
>
> Key: SPARK-15044
> URL: https://issues.apache.org/jira/browse/SPARK-15044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: huangyu
>
> spark-sql will throw an "input path does not exist" exception if it handles a
> partition which exists in the Hive table but whose path has been removed manually.
> The situation is as follows:
> 1) Create a table "test": "create table test (n string) partitioned by (p
> string)"
> 2) Load some data into partition(p='1')
> 3) Remove the path related to partition(p='1') of table test manually: "hadoop
> fs -rmr /warehouse//test/p=1"
> 4) Run Spark SQL: spark-sql -e "select n from test where p='1';"
> Then it throws an exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> ./test/p=1
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
> at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is present in Spark 1.6.1; with Spark 1.4.0 it works fine.
> I think spark-sql should ignore the missing path, just as Hive does (and as Spark did
> in earlier versions), rather than throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17036) Hadoop config caching could lead to memory pressure and high CPU usage in thrift server

2016-08-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418828#comment-15418828
 ] 

Sean Owen commented on SPARK-17036:
---

[~rajesh.balamohan] please summarize the issue here.

> Hadoop config caching could lead to memory pressure and high CPU usage in 
> thrift server
> ---
>
> Key: SPARK-17036
> URL: https://issues.apache.org/jira/browse/SPARK-17036
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
>Priority: Minor
>
> Creating this as a follow-up JIRA to SPARK-12920. Profiler output on the
> caching is attached in SPARK-12920.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17037) distinct() operator fails on Dataframe with column names containing periods

2016-08-12 Thread Michael Styles (JIRA)
Michael Styles created SPARK-17037:
--

 Summary: distinct() operator fails on Dataframe with column names 
containing periods
 Key: SPARK-17037
 URL: https://issues.apache.org/jira/browse/SPARK-17037
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.0
Reporter: Michael Styles


Using the distinct() operator on a Dataframe with column names containing 
periods results in an AnalysisException. For example:

{noformat}
d = [{'pageview.count': 100, 'exit_page': 'example.com/landing'}]
df = sqlContext.createDataFrame(d)
df.distinct()
{noformat}

results in the following error:
pyspark.sql.utils.AnalysisException: u'Cannot resolve column name 
"pageview.count" among (exit_page, pageview.count);'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12920) Honor "spark.ui.retainedStages" to reduce mem-pressure

2016-08-12 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418827#comment-15418827
 ] 

Rajesh Balamohan commented on SPARK-12920:
--

Thanks [~vanzin]. I have created SPARK-17036 for the caching issue.

> Honor "spark.ui.retainedStages" to reduce mem-pressure
> --
>
> Key: SPARK-12920
> URL: https://issues.apache.org/jira/browse/SPARK-12920
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
> Fix For: 2.1.0
>
> Attachments: SPARK-12920.profiler.png, 
> SPARK-12920.profiler_job_progress_listner.png
>
>
> - Configured with fair-share-scheduler.
> - 4-5 users submitting/running jobs concurrently via spark-thrift-server
> - Spark thrift server spikes to 1600+% CPU and stays there for a long time



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17036) Hadoop config caching could lead to memory pressure and high CPU usage in thrift server

2016-08-12 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-17036:


 Summary: Hadoop config caching could lead to memory pressure and 
high CPU usage in thrift server
 Key: SPARK-17036
 URL: https://issues.apache.org/jira/browse/SPARK-17036
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Rajesh Balamohan
Priority: Minor


Creating this as a follow-up JIRA to SPARK-12920. Profiler output on the
caching is attached in SPARK-12920.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals

2016-08-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-16955.
-
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

> Using ordinals in ORDER BY causes an analysis error when the query has a 
> GROUP BY clause using ordinals
> ---
>
> Key: SPARK-16955
> URL: https://issues.apache.org/jira/browse/SPARK-16955
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Yin Huai
> Fix For: 2.0.1, 2.1.0
>
>
> The following queries work
> {code}
> select a from (select 1 as a) tmp order by 1
> select a, count(*) from (select 1 as a) tmp group by 1
> select a, count(*) from (select 1 as a) tmp group by 1 order by a
> {code}
> However, the following query does not
> {code}
> select a, count(*) from (select 1 as a) tmp group by 1 order by 1
> {code}
> {code}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> Group by position: '1' exceeds the size of the select list '0'. on unresolved 
> object, tree:
> Aggregate [1]
> +- SubqueryAlias tmp
>+- Project [1 AS a#82]
>   +- OneRowRelation$
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:749)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:739)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:739)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:715)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:715)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:714)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1237)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1182)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1182)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1181)
>   at 
> 

[jira] [Commented] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals

2016-08-12 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418817#comment-15418817
 ] 

Wenchen Fan commented on SPARK-16955:
-

This bug was already fixed, by accident, by
https://github.com/apache/spark/pull/14595.

> Using ordinals in ORDER BY causes an analysis error when the query has a 
> GROUP BY clause using ordinals
> ---
>
> Key: SPARK-16955
> URL: https://issues.apache.org/jira/browse/SPARK-16955
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>
> The following queries work
> {code}
> select a from (select 1 as a) tmp order by 1
> select a, count(*) from (select 1 as a) tmp group by 1
> select a, count(*) from (select 1 as a) tmp group by 1 order by a
> {code}
> However, the following query does not
> {code}
> select a, count(*) from (select 1 as a) tmp group by 1 order by 1
> {code}
> {code}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> Group by position: '1' exceeds the size of the select list '0'. on unresolved 
> object, tree:
> Aggregate [1]
> +- SubqueryAlias tmp
>+- Project [1 AS a#82]
>   +- OneRowRelation$
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:749)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:739)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:739)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:715)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:715)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:714)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1237)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1182)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1182)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1181)
>   at 
> 

[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-08-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418788#comment-15418788
 ] 

Sean Owen commented on SPARK-15044:
---

That seems like worse behavior, because it silently 'succeeds' when the result 
is missing part of the result set.

> spark-sql will throw "input path does not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually
> -
>
> Key: SPARK-15044
> URL: https://issues.apache.org/jira/browse/SPARK-15044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: huangyu
>
> spark-sql will throw an "input path does not exist" exception if it handles a
> partition which exists in the Hive table but whose path has been removed manually.
> The situation is as follows:
> 1) Create a table "test": "create table test (n string) partitioned by (p
> string)"
> 2) Load some data into partition(p='1')
> 3) Remove the path related to partition(p='1') of table test manually: "hadoop
> fs -rmr /warehouse//test/p=1"
> 4) Run Spark SQL: spark-sql -e "select n from test where p='1';"
> Then it throws an exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> ./test/p=1
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
> at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is present in Spark 1.6.1; with Spark 1.4.0 it works fine.
> I think spark-sql should ignore the missing path, just as Hive does (and as Spark did
> in earlier versions), rather than throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-08-12 Thread Artur (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418784#comment-15418784
 ] 

Artur commented on SPARK-15044:
---

I know that you should not do this.
But if someone does, Spark will not work until the metadata is fixed.

However, Hive handles it without an error:
hive> select * from test;
OK
Time taken: 0.035 seconds


In this case, after running:
hive> alter table test drop partition (p=1);

spark-sql> select * from test;
works without problems.
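
If many partition directories have been removed, the same clean-up can be scripted
rather than typed per partition. A hedged pyspark sketch follows; it assumes a
HiveContext as sqlContext, a SparkContext as sc, and an example table location, and it
leans on pyspark's internal _jvm/_jsc bridges, which are not a public API.

{code}
# Hedged sketch for scripting the drop-partition workaround when many partition
# directories have been deleted: list the partitions the metastore knows about,
# check whether each directory still exists on HDFS, and drop the ones that are gone.
from py4j.java_gateway import java_import

java_import(sc._jvm, "org.apache.hadoop.fs.Path")
java_import(sc._jvm, "org.apache.hadoop.fs.FileSystem")

fs = sc._jvm.FileSystem.get(sc._jsc.hadoopConfiguration())
table_location = "/warehouse/test"   # example only; use the table's real location

for row in sqlContext.sql("SHOW PARTITIONS test").collect():
    part = row[0]                    # e.g. "p=1"
    if not fs.exists(sc._jvm.Path(table_location + "/" + part)):
        key, value = part.split("=", 1)
        sqlContext.sql(
            "ALTER TABLE test DROP IF EXISTS PARTITION (%s='%s')" % (key, value))
{code}

Here the DROP PARTITION only has metastore metadata left to remove, since the data
directories are already gone.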



> spark-sql will throw "input path does not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually
> -
>
> Key: SPARK-15044
> URL: https://issues.apache.org/jira/browse/SPARK-15044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: huangyu
>
> spark-sql will throw an "input path does not exist" exception if it handles a
> partition which exists in the Hive table but whose path has been removed manually.
> The situation is as follows:
> 1) Create a table "test": "create table test (n string) partitioned by (p
> string)"
> 2) Load some data into partition(p='1')
> 3) Remove the path related to partition(p='1') of table test manually: "hadoop
> fs -rmr /warehouse//test/p=1"
> 4) Run Spark SQL: spark-sql -e "select n from test where p='1';"
> Then it throws an exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> ./test/p=1
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
> at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is present in Spark 1.6.1; with Spark 1.4.0 it works fine.
> I think spark-sql should ignore the missing path, just as Hive does (and as Spark did
> in earlier versions), rather than throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16917) Spark streaming kafka version compatibility.

2016-08-12 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418747#comment-15418747
 ] 

Cody Koeninger commented on SPARK-16917:


It sounds to me like the documentation is clear, because you have
interpreted things correctly.

As it says, the 0.10 version works with broker version 0.10 and higher.
The 0.8 version works with broker version 0.8 and higher.

There is no version specifically for 0.9, nor do I expect there ever will
be.
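
In dependency terms, a sketch only (the coordinates assume the Spark 2.0.0 
artifacts and a Scala 2.11 build):

{code}
// build.sbt sketch: pick the integration that matches the broker, per the rule above.
// spark-streaming-kafka-0-8 works against brokers 0.8 and higher (so it also covers 0.9);
// spark-streaming-kafka-0-10 requires brokers 0.10 or higher.
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"
// or, for 0.10+ brokers only:
// libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0"
{code}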



> Spark streaming kafka version compatibility. 
> -
>
> Key: SPARK-16917
> URL: https://issues.apache.org/jira/browse/SPARK-16917
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.0
>Reporter: Sudev
>Priority: Trivial
>  Labels: documentation
>
> It would be nice to have Kafka version compatibility information in the 
> official documentation. 
> It's very confusing now. 
> * If you look at this JIRA[1], it seems like Kafka is supported in Spark 
> 2.0.0.
> * The documentation lists the artifact for Kafka 0.8: 
> spark-streaming-kafka-0-8_2.11
> Is Kafka 0.9 supported by Spark 2.0.0?
> Since I'm still confused after an hour's effort googling on the same, I 
> think someone should help add the compatibility matrix.
> [1] https://issues.apache.org/jira/browse/SPARK-12177



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-08-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418746#comment-15418746
 ] 

Sean Owen commented on SPARK-15044:
---

Is it really an error? The files were manually deleted, and unsurprisingly this 
causes an exception. You shouldn't delete the files.
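
When the directories are already gone, two workarounds come up repeatedly. The 
sketch below assumes Spark 1.6 with a HiveContext, and treats the 
{{spark.sql.hive.verifyPartitionPath}} setting as something to verify against 
your own version rather than a guaranteed fix:

{code}
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)  // sc: an existing SparkContext

// Option 1: ask Spark to check partition paths before reading, skipping missing ones.
sqlContext.setConf("spark.sql.hive.verifyPartitionPath", "true")

// Option 2: bring the metastore back in sync by dropping the dangling partition.
sqlContext.sql("ALTER TABLE test DROP IF EXISTS PARTITION (p='1')")
{code}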

> spark-sql will throw "input path does not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually
> -
>
> Key: SPARK-15044
> URL: https://issues.apache.org/jira/browse/SPARK-15044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: huangyu
>
> spark-sql will throw an "input path not exist" exception if it handles a 
> partition which exists in the hive table, but the path is removed manually. 
> The situation is as follows:
> 1) Create a table "test": "create table test (n string) partitioned by (p 
> string)"
> 2) Load some data into partition(p='1')
> 3) Remove the path related to partition(p='1') of table test manually: "hadoop 
> fs -rmr /warehouse//test/p=1"
> 4) Run spark-sql: spark-sql -e "select n from test where p='1';"
> Then it throws exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> ./test/p=1
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
> at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is in Spark 1.6.1; if I use Spark 1.4.0, it is OK.
> I think spark-sql should ignore the missing path, just like Hive does (and as 
> it did in earlier versions), rather than throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-08-12 Thread Artur (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418735#comment-15418735
 ] 

Artur commented on SPARK-15044:
---

Tested on Spark 2.0.x (master branch, latest commit 
685b08e2611b69f8db60a00c0c94aecd315e2a3e, Wed Aug 3 13:15:13 2016):
{noformat}
hive> create table test(n string) partitioned by (p string);
spark-sql> insert into table test PARTITION(p=1) VALUES (1);
hadoop fs -rmr /user/hive/warehouse/test/p=1
spark-sql> select * from test;
{noformat}
{code:java}
16/08/12 20:48:45 ERROR SparkSQLDriver: Failed in [select * from test]
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
./test/p=1
at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:289)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:317)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:82)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:82)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:82)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1908)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:899)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.collect(RDD.scala:898)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:290)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:310)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$3.apply(QueryExecution.scala:131)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$3.apply(QueryExecution.scala:130)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at 
org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:130)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:331)
at 

[jira] [Updated] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-08-12 Thread Artur (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artur updated SPARK-15044:
--
Affects Version/s: 2.0.0

> spark-sql will throw "input path does not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually
> -
>
> Key: SPARK-15044
> URL: https://issues.apache.org/jira/browse/SPARK-15044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: huangyu
>
> spark-sql will throw an "input path not exist" exception if it handles a 
> partition which exists in the hive table, but the path is removed manually. 
> The situation is as follows:
> 1) Create a table "test": "create table test (n string) partitioned by (p 
> string)"
> 2) Load some data into partition(p='1')
> 3) Remove the path related to partition(p='1') of table test manually: "hadoop 
> fs -rmr /warehouse//test/p=1"
> 4) Run spark-sql: spark-sql -e "select n from test where p='1';"
> Then it throws exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> ./test/p=1
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
> at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is in Spark 1.6.1; if I use Spark 1.4.0, it is OK.
> I think spark-sql should ignore the missing path, just like Hive does (and as 
> it did in earlier versions), rather than throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17035) Conversion of datetime.max to microseconds produces incorrect value

2016-08-12 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-17035:
---
Description: 
Conversion of datetime.max to microseconds produces incorrect value. For 
example,

{noformat}
from datetime import datetime
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, TimestampType

schema = StructType([StructField("dt", TimestampType(), False)])
data = [{"dt": datetime.max}]

# convert python objects to sql data
sql_data = [schema.toInternal(row) for row in data]

# Value is wrong.
sql_data
[(2.534023188e+17,)]
{noformat}

This value should be [(2534023187,)].

  was:
Conversion of datetime.max to microseconds produces incorrect value. For 
example,

from datetime import datetime
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, TimestampType

schema = StructType([StructField("dt", TimestampType(), False)])
data = [{"dt": datetime.max}]

# convert python objects to sql data
sql_data = [schema.toInternal(row) for row in data]

# Value is wrong.
sql_data
[(2.534023188e+17,)]

This value should be [(2534023187,)].


> Conversion of datetime.max to microseconds produces incorrect value
> ---
>
> Key: SPARK-17035
> URL: https://issues.apache.org/jira/browse/SPARK-17035
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Michael Styles
>Priority: Minor
>
> Conversion of datetime.max to microseconds produces incorrect value. For 
> example,
> {noformat}
> from datetime import datetime
> from pyspark.sql import Row
> from pyspark.sql.types import StructType, StructField, TimestampType
> schema = StructType([StructField("dt", TimestampType(), False)])
> data = [{"dt": datetime.max}]
> # convert python objects to sql data
> sql_data = [schema.toInternal(row) for row in data]
> # Value is wrong.
> sql_data
> [(2.534023188e+17,)]
> {noformat}
> This value should be [(2534023187,)].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17035) Conversion of datetime.max to microseconds produces incorrect value

2016-08-12 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-17035:
---
Priority: Minor  (was: Major)

> Conversion of datetime.max to microseconds produces incorrect value
> ---
>
> Key: SPARK-17035
> URL: https://issues.apache.org/jira/browse/SPARK-17035
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Michael Styles
>Priority: Minor
>
> Conversion of datetime.max to microseconds produces incorrect value. For 
> example,
> from datetime import datetime
> from pyspark.sql import Row
> from pyspark.sql.types import StructType, StructField, TimestampType
> schema = StructType([StructField("dt", TimestampType(), False)])
> data = [{"dt": datetime.max}]
> # convert python objects to sql data
> sql_data = [schema.toInternal(row) for row in data]
> # Value is wrong.
> sql_data
> [(2.534023188e+17,)]
> This value should be [(2534023187,)].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17035) Conversion of datetime.max to microseconds produces incorrect value

2016-08-12 Thread Michael Styles (JIRA)
Michael Styles created SPARK-17035:
--

 Summary: Conversion of datetime.max to microseconds produces 
incorrect value
 Key: SPARK-17035
 URL: https://issues.apache.org/jira/browse/SPARK-17035
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.0
Reporter: Michael Styles


Conversion of datetime.max to microseconds produces incorrect value. For 
example,

from datetime import datetime
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, TimestampType

schema = StructType([StructField("dt", TimestampType(), False)])
data = [{"dt": datetime.max}]

# convert python objects to sql data
sql_data = [schema.toInternal(row) for row in data]

# Value is wrong.
sql_data
[(2.534023188e+17,)]

This value should be [(2534023187,)].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17032) Add test cases for methods in ParserUtils

2016-08-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17032:


Assignee: Apache Spark

> Add test cases for methods in ParserUtils
> -
>
> Key: SPARK-17032
> URL: https://issues.apache.org/jira/browse/SPARK-17032
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jiang Xingbo
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, methods in `ParserUtils` are only tested indirectly; we should add 
> test cases in `ParserUtilsSuite` to verify their integrity directly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17032) Add test cases for methods in ParserUtils

2016-08-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418677#comment-15418677
 ] 

Apache Spark commented on SPARK-17032:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/14620

> Add test cases for methods in ParserUtils
> -
>
> Key: SPARK-17032
> URL: https://issues.apache.org/jira/browse/SPARK-17032
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jiang Xingbo
>Priority: Minor
>
> Currently, methods in `ParserUtils` are only tested indirectly; we should add 
> test cases in `ParserUtilsSuite` to verify their integrity directly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17032) Add test cases for methods in ParserUtils

2016-08-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17032:


Assignee: (was: Apache Spark)

> Add test cases for methods in ParserUtils
> -
>
> Key: SPARK-17032
> URL: https://issues.apache.org/jira/browse/SPARK-17032
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jiang Xingbo
>Priority: Minor
>
> Currently, methods in `ParserUtils` are only tested indirectly; we should add 
> test cases in `ParserUtilsSuite` to verify their integrity directly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17034) Ordinal in ORDER BY or GROUP BY should be treated as an unresolved expression

2016-08-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17034:


Assignee: (was: Apache Spark)

> Ordinal in ORDER BY or GROUP BY should be treated as an unresolved expression
> -
>
> Key: SPARK-17034
> URL: https://issues.apache.org/jira/browse/SPARK-17034
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>
> Ordinals in GROUP BY or ORDER BY, like the "1" in "order by 1" or "group by 1", 
> should be considered unresolved before analysis. But the current code uses a 
> "Literal" expression to store the ordinal. This is inappropriate because 
> "Literal" itself is a resolved expression, which gives the user the wrong 
> message that the ordinal has already been resolved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17034) Ordinal in ORDER BY or GROUP BY should be treated as an unresolved expression

2016-08-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418623#comment-15418623
 ] 

Apache Spark commented on SPARK-17034:
--

User 'clockfly' has created a pull request for this issue:
https://github.com/apache/spark/pull/14616

> Ordinal in ORDER BY or GROUP BY should be treated as an unresolved expression
> -
>
> Key: SPARK-17034
> URL: https://issues.apache.org/jira/browse/SPARK-17034
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>
> Ordinals in GROUP BY or ORDER BY, like the "1" in "order by 1" or "group by 1", 
> should be considered unresolved before analysis. But the current code uses a 
> "Literal" expression to store the ordinal. This is inappropriate because 
> "Literal" itself is a resolved expression, which gives the user the wrong 
> message that the ordinal has already been resolved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17034) Ordinal in ORDER BY or GROUP BY should be treated as an unresolved expression

2016-08-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17034:


Assignee: Apache Spark

> Ordinal in ORDER BY or GROUP BY should be treated as an unresolved expression
> -
>
> Key: SPARK-17034
> URL: https://issues.apache.org/jira/browse/SPARK-17034
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>Assignee: Apache Spark
>
> Ordinals in GROUP BY or ORDER BY, like the "1" in "order by 1" or "group by 1", 
> should be considered unresolved before analysis. But the current code uses a 
> "Literal" expression to store the ordinal. This is inappropriate because 
> "Literal" itself is a resolved expression, which gives the user the wrong 
> message that the ordinal has already been resolved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17034) Ordinal in ORDER BY or GROUP BY should be treated as an unresolved expression

2016-08-12 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-17034:
--

 Summary: Ordinal in ORDER BY or GROUP BY should be treated as an 
unresolved expression
 Key: SPARK-17034
 URL: https://issues.apache.org/jira/browse/SPARK-17034
 Project: Spark
  Issue Type: Bug
Reporter: Sean Zhong


Ordinals in GROUP BY or ORDER BY, like the "1" in "order by 1" or "group by 1", 
should be considered unresolved before analysis. But the current code uses a 
"Literal" expression to store the ordinal. This is inappropriate because 
"Literal" itself is a resolved expression, which gives the user the wrong 
message that the ordinal has already been resolved.
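
As a minimal illustration (the session, table, and column names below are 
assumed for the example only):

{code}
import org.apache.spark.sql.SparkSession

// "GROUP BY 1" and "ORDER BY 2" are positions into the select list; the analyzer
// must resolve those literals into the expressions `dept` and `cnt` before the
// plan can genuinely be considered resolved.
val spark = SparkSession.builder().appName("ordinal-example").getOrCreate()
spark.sql("SELECT dept, count(*) AS cnt FROM employees GROUP BY 1 ORDER BY 2").show()
{code}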



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15882) Discuss distributed linear algebra in spark.ml package

2016-08-12 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418585#comment-15418585
 ] 

Jeff Zhang commented on SPARK-15882:


I think it is better to keep the RDD API underneath, as I don't see much benefit 
from Dataset here. Although the linear algebra API is public, most of the time 
it is used by spark.ml internally. Regarding the interface between linear 
algebra and the spark.ml Transformers/Estimators, I would suggest using Dataset 
(e.g. PCA.fit), because the user-facing interface is mostly on 
Transformer/Estimator, and we should keep those on Dataset.
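
A rough sketch of that split (hypothetical class and column names, assuming the 
"features" column holds old-style mllib vectors; an illustration of Dataset at 
the boundary with RDDs underneath, not a proposed API):

{code}
import org.apache.spark.mllib.linalg.{Matrix, Vector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.{DataFrame, Row}

class DatasetFacingPCA(k: Int) {
  // The user-facing entry point takes a DataFrame/Dataset...
  def fit(df: DataFrame): Matrix = {
    // ...but drops to the RDD API internally for the distributed linear algebra.
    val rows = df.select("features").rdd.map { case Row(v: Vector) => v }
    new RowMatrix(rows).computePrincipalComponents(k)
  }
}
{code}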



> Discuss distributed linear algebra in spark.ml package
> --
>
> Key: SPARK-15882
> URL: https://issues.apache.org/jira/browse/SPARK-15882
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA is for discussing how org.apache.spark.mllib.linalg.distributed.* 
> should be migrated to org.apache.spark.ml.
> Initial questions:
> * Should we use Datasets or RDDs underneath?
> * If Datasets, are there missing features needed for the migration?
> * Do we want to redesign any aspects of the distributed matrices during this 
> move?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing

2016-08-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418526#comment-15418526
 ] 

胡振宇 commented on SPARK-14850:
-

I tried to run your code on Spark 1.6.1, but I found that "toDF" cannot be used 
in this example.
Here is my code:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object Example {
  def main(args: Array[String]) {
    case class Test(num: Int, vector: Vector)
    val conf = new SparkConf().setAppName("Example")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val temp = sqlContext.sparkContext.parallelize(0 until 1e4.toInt, 1)
      .map(i => Test(i, Vectors.dense(Array.fill(1e6.toInt)(1.0))))
      .toDF() // at this step toDF cannot be used
  }
}
{code}

The snippet from this ticket that I was trying to run:
{code}
sc.parallelize(0 until 1e4.toInt, 1).map { i =>
  (i, Vectors.dense(Array.fill(1e6.toInt)(1.0)))
}.toDF.rdd.count()
{code}

I even used the SparkContext directly, but toDF could not be used there either.

Do you have a solution to run the example on Spark 1.6.1? Thank you.
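
One likely explanation, though it is only a guess from the snippet above rather 
than something confirmed in this thread: in Scala/Spark 1.6, a case class 
defined inside {{main}} cannot get the {{TypeTag}} that {{toDF}} relies on, so 
the conversion is never found. A minimal sketch with the case class moved to the 
top level:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Case class at the top level so the TypeTag/implicit machinery behind toDF can see it.
case class Test(num: Int, vector: Vector)

object Example {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Example")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val count = sc.parallelize(0 until 1e4.toInt, 1)
      .map(i => Test(i, Vectors.dense(Array.fill(1e6.toInt)(1.0))))
      .toDF()
      .rdd
      .count()
    println(count)
  }
}
{code}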

> VectorUDT/MatrixUDT should take primitive arrays without boxing
> ---
>
> Key: SPARK-14850
> URL: https://issues.apache.org/jira/browse/SPARK-14850
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>
> In SPARK-9390, we switched to use GenericArrayData to store indices and 
> values in vector/matrix UDTs. However, GenericArrayData is not specialized 
> for primitive types. This might hurt MLlib performance badly. We should 
> consider either specialize GenericArrayData or use a different container.
> cc: [~cloud_fan] [~yhuai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8717) Update mllib-data-types docs to include missing "matrix" Python examples

2016-08-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8717.
--
Resolution: Duplicate

This was actually a duplicate of an issue already fixed. Look at the docs in 
master.

> Update mllib-data-types docs to include missing "matrix" Python examples
> 
>
> Key: SPARK-8717
> URL: https://issues.apache.org/jira/browse/SPARK-8717
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib, PySpark
>Reporter: Rosstin Murphy
>Priority: Minor
>
> Currently, the documentation for MLLib Data Types (docs/mllib-data-types.md 
> in the repo, https://spark.apache.org/docs/latest/mllib-data-types.html in 
> the latest online docs) stops listing Python examples at "Labeled point".
> "Local vector" and "Labeled point" have Python examples, however none of the 
> "matrix" entries have Python examples.
> The "matrix" entries could be updated to include python examples.
> I'm not 100% sure that all the matrices currently have implemented Python 
> equivalents, but I'm pretty sure that at least the first one ("Local matrix") 
> could have an entry.
> from pyspark.mllib.linalg import DenseMatrix
> dm = DenseMatrix(3, 2, [1.0, 3.0, 5.0, 2.0, 4.0, 6.0])



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16598) Added a test case for verifying the table identifier parsing

2016-08-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16598.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14244
[https://github.com/apache/spark/pull/14244]

> Added a test case for verifying the table identifier parsing
> 
>
> Key: SPARK-16598
> URL: https://issues.apache.org/jira/browse/SPARK-16598
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
> Fix For: 2.1.0
>
>
> So far, the test cases of TableIdentifierParserSuite do not cover the quoted 
> cases. We should add one to avoid regressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16985) SQL Output maybe overrided

2016-08-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16985.
---
   Resolution: Fixed
 Assignee: Hong Shen
Fix Version/s: 2.1.0

Resolved by https://github.com/apache/spark/pull/14574

> SQL Output maybe overrided
> --
>
> Key: SPARK-16985
> URL: https://issues.apache.org/jira/browse/SPARK-16985
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hong Shen
>Assignee: Hong Shen
> Fix For: 2.1.0
>
>
> In our cluster, the SQL output is sometimes overwritten. This happens when I 
> submit several SQL statements that all insert into the same table and each 
> statement takes less than one minute. Here is the detail:
> 1 sql1, 11:03 insert into table.
> 2 sql2, 11:04:11 insert into table.
> 3 sql3, 11:04:48 insert into table.
> 4 sql4, 11:05 insert into table.
> 5 sql5, 11:06 insert into table.
> sql3's output file will overwrite sql2's output file. Here is the log:
> {code}
> 16/05/04 11:04:11 INFO hive.SparkHiveHadoopWriter: 
> XXfinalPath=hdfs://tl-sng-gdt-nn-tdw.tencent-distribute.com:54310/tmp/assorz/tdw-tdwadmin/20160504/04559505496526517_-1_1204544348/1/_tmp.p_20160428/attempt_201605041104_0001_m_00_1
> 16/05/04 11:04:48 INFO hive.SparkHiveHadoopWriter: 
> XXfinalPath=hdfs://tl-sng-gdt-nn-tdw.tencent-distribute.com:54310/tmp/assorz/tdw-tdwadmin/20160504/04559505496526517_-1_212180468/1/_tmp.p_20160428/attempt_201605041104_0001_m_00_1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2

2016-08-12 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-16975.

   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14585
[https://github.com/apache/spark/pull/14585]

> Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2
> --
>
> Key: SPARK-16975
> URL: https://issues.apache.org/jira/browse/SPARK-16975
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Ubuntu Linux 14.04
>Reporter: immerrr again
>Assignee: Dongjoon Hyun
>  Labels: parquet
> Fix For: 2.0.1, 2.1.0
>
>
> Spark-2.0.0 seems to have some problems reading a parquet dataset generated 
> by 1.6.2. 
> {code}
> In [80]: spark.read.parquet('/path/to/data')
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data. It must be specified manually;'
> {code}
> The dataset is ~150G and partitioned by _locality_code column. None of the 
> partitions are empty. I have narrowed the failing dataset to the first 32 
> partitions of the data:
> {code}
> In [82]: spark.read.parquet(*subdirs[:32])
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI. It must be 
> specified manually;'
> {code}
> Interestingly, it works OK if you remove any of the partitions from the list:
> {code}
> In [83]: for i in range(32): spark.read.parquet(*(subdirs[:i] + 
> subdirs[i+1:32]))
> {code}
> Another strange thing is that the schemas for the first and the last 31 
> partitions of the subset are identical:
> {code}
> In [84]: spark.read.parquet(*subdirs[:31]).schema.fields == 
> spark.read.parquet(*subdirs[1:32]).schema.fields
> Out[84]: True
> {code}
> Which got me interested and I tried this:
> {code}
> In [87]: spark.read.parquet(*([subdirs[0]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AQ. It must be 
> specified manually;'
> In [88]: spark.read.parquet(*([subdirs[15]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=AX,/path/to/data/_locality_code=AX. It must be 
> specified manually;'
> In [89]: spark.read.parquet(*([subdirs[31]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at 
> /path/to/data/_locality_code=BE,/path/to/data/_locality_code=BE. It must be 
> specified manually;'
> {code}
> If I read the first partition, save it in 2.0 and try to read in the same 
> manner, everything is fine:
> {code}
> In [100]: spark.read.parquet(subdirs[0]).write.parquet('spark-2.0-test')
> 16/08/09 11:03:37 WARN ParquetRecordReader: Can not initialize counter due to 
> context is not a instance of TaskInputOutputContext, but is 
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
> In [101]: df = spark.read.parquet(*(['spark-2.0-test'] * 32))
> {code}
> I originally posted this to the user mailing list, but with these latest 
> discoveries it clearly seems like a bug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16598) Added a test case for verifying the table identifier parsing

2016-08-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16598:
--
Assignee: Xiao Li
Priority: Minor  (was: Major)

> Added a test case for verifying the table identifier parsing
> 
>
> Key: SPARK-16598
> URL: https://issues.apache.org/jira/browse/SPARK-16598
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Minor
> Fix For: 2.1.0
>
>
> So far, the test cases of TableIdentifierParserSuite do not cover the quoted 
> cases. We should add one to avoid regressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing

2016-08-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418547#comment-15418547
 ] 

胡振宇 edited comment on SPARK-14850 at 8/12/16 9:02 AM:
--

/*code is for spark 1.6.1*/ 
object Example{
def main (args:Array[String]){

val conf = new SparkConf.setAppname("Example")
val sc=new sparkContext(conf)
val sqlContext=new SQLContext(sc) 
import sqlContext.implicts._ 
val count=sqlContext.sparkContext.parallelize(0,until 1e4.toInt,1).map{
i=>(i,Vectors.dense(Array.fill(1e6.toInt)(1.0)))
 }.toDF().rdd.count() //at this step toDF can be used on 
Spark1.6.1
}
} 

so I am not able to  test the simple serialization example 



was (Author: fox19960207):
/*code is for spark 1.6.1*/ 
object Example{
def main (args:Array[String]){

val conf = new SparkConf.setAppname("Example")
val sc=new sparkContext(conf)
val sqlContext=new SQLContext(sc) 
import sqlContext.implicts._ 
val count=sqlContext.sparkContext.parallelize(0,until 1e4.toInt,1).map{
i=>Test(i,Vectors.dense(Array.fill(1e6.toInt)(1.0)))
 }.toDF().rdd.count() //at this step toDF can be used on 
Spark1.6.1
}
} 

so I am not able to  test the simple serialization example 


> VectorUDT/MatrixUDT should take primitive arrays without boxing
> ---
>
> Key: SPARK-14850
> URL: https://issues.apache.org/jira/browse/SPARK-14850
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>
> In SPARK-9390, we switched to use GenericArrayData to store indices and 
> values in vector/matrix UDTs. However, GenericArrayData is not specialized 
> for primitive types. This might hurt MLlib performance badly. We should 
> consider either specialize GenericArrayData or use a different container.
> cc: [~cloud_fan] [~yhuai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8717) Update mllib-data-types docs to include missing "matrix" Python examples

2016-08-12 Thread Jagadeesan A S (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418553#comment-15418553
 ] 

Jagadeesan A S commented on SPARK-8717:
---

I would like to raise a PR for this issue. [~srowen], can you add your input?

> Update mllib-data-types docs to include missing "matrix" Python examples
> 
>
> Key: SPARK-8717
> URL: https://issues.apache.org/jira/browse/SPARK-8717
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib, PySpark
>Reporter: Rosstin Murphy
>Priority: Minor
>
> Currently, the documentation for MLLib Data Types (docs/mllib-data-types.md 
> in the repo, https://spark.apache.org/docs/latest/mllib-data-types.html in 
> the latest online docs) stops listing Python examples at "Labeled point".
> "Local vector" and "Labeled point" have Python examples, however none of the 
> "matrix" entries have Python examples.
> The "matrix" entries could be updated to include python examples.
> I'm not 100% sure that all the matrices currently have implemented Python 
> equivalents, but I'm pretty sure that at least the first one ("Local matrix") 
> could have an entry.
> from pyspark.mllib.linalg import DenseMatrix
> dm = DenseMatrix(3, 2, [1.0, 3.0, 5.0, 2.0, 4.0, 6.0])



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing

2016-08-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418547#comment-15418547
 ] 

胡振宇 commented on SPARK-14850:
-

The code below is for Spark 1.6.1:
{code}
object Example {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Example")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    // Test is the case class from my earlier comment
    val count = sqlContext.sparkContext.parallelize(0 until 1e4.toInt, 1).map {
      i => Test(i, Vectors.dense(Array.fill(1e6.toInt)(1.0)))
    }.toDF().rdd.count() // toDF cannot be used here on Spark 1.6.1
  }
}
{code}

So I am not able to test the simple serialization example.


> VectorUDT/MatrixUDT should take primitive arrays without boxing
> ---
>
> Key: SPARK-14850
> URL: https://issues.apache.org/jira/browse/SPARK-14850
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>
> In SPARK-9390, we switched to use GenericArrayData to store indices and 
> values in vector/matrix UDTs. However, GenericArrayData is not specialized 
> for primitive types. This might hurt MLlib performance badly. We should 
> consider either specialize GenericArrayData or use a different container.
> cc: [~cloud_fan] [~yhuai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance

2016-08-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418537#comment-15418537
 ] 

Apache Spark commented on SPARK-17033:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/14621

> GaussianMixture should use treeAggregate to improve performance
> ---
>
> Key: SPARK-17033
> URL: https://issues.apache.org/jira/browse/SPARK-17033
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>Priority: Minor
>
> {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to 
> improve performance and scalability. In my test of dataset with 200 features 
> and 1M instance, I found there is 20% increased performance.
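
For context, a minimal sketch of the aggregate-to-treeAggregate swap 
(illustrative only, not the GaussianMixture code; the sum-of-squares function is 
just a stand-in for the per-sample statistics):

{code}
import org.apache.spark.rdd.RDD

// treeAggregate merges partial results in a multi-level tree instead of sending every
// partition's result straight to the driver, which matters when there are many partitions.
def sumOfSquares(data: RDD[Double]): Double =
  data.treeAggregate(0.0)(
    seqOp = (acc, x) => acc + x * x,  // fold within each partition
    combOp = _ + _,                   // merge partial sums tree-wise
    depth = 2)
{code}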



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17027) PolynomialExpansion.choose is prone to integer overflow

2016-08-12 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418071#comment-15418071
 ] 

Maciej Szymkiewicz commented on SPARK-17027:


Yes, this is exactly the problem.


{code}
choose(14, 10)
// res0: Int = -182
{code}
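
A minimal sketch of the kind of overflow-free computation being suggested (my 
own illustration, not the Spark patch): compute the binomial coefficient 
incrementally so intermediate values never exceed the final result.

{code}
def chooseSafe(n: Int, k: Int): Long = {
  require(k >= 0 && k <= n)
  val kk = math.min(k, n - k)        // use the symmetry C(n, k) = C(n, n - k)
  var acc = 1L
  var i = 1
  while (i <= kk) {
    acc = acc * (n - kk + i) / i     // exact at every step: acc is C(n - kk + i, i)
    i += 1
  }
  acc
}

// chooseSafe(14, 10) == 1001, instead of wrapping around to a negative Int
{code}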

> PolynomialExpansion.choose is prone to integer overflow 
> 
>
> Key: SPARK-17027
> URL: https://issues.apache.org/jira/browse/SPARK-17027
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> The current implementation computes the power of k directly and because of that 
> it is susceptible to integer overflow on relatively small input (4 features, 
> degree equal to 10). It would be better to use a recursive formula instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance

2016-08-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17033:


Assignee: (was: Apache Spark)

> GaussianMixture should use treeAggregate to improve performance
> ---
>
> Key: SPARK-17033
> URL: https://issues.apache.org/jira/browse/SPARK-17033
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>Priority: Minor
>
> {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to 
> improve performance and scalability. In my test of dataset with 200 features 
> and 1M instance, I found there is 20% increased performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance

2016-08-12 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17033:

Description: {{GaussianMixture}} should use {{treeAggregate}} rather than 
{{aggregate}} to improve performance and scalability. In my test of dataset 
with 200 features and 1M instance, I found there are 20% increased performance. 
 (was: {{GaussianMixture}} should use {{treeAggregate}} rather than 
{{aggregate}} to improve performance and scalability. In my test of dataset 
with 200 features and 1M instance, I found there are 15% increased performance.)

> GaussianMixture should use treeAggregate to improve performance
> ---
>
> Key: SPARK-17033
> URL: https://issues.apache.org/jira/browse/SPARK-17033
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>Priority: Minor
>
> {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to 
> improve performance and scalability. In my test of dataset with 200 features 
> and 1M instance, I found there are 20% increased performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance

2016-08-12 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17033:

Description: {{GaussianMixture}} should use {{treeAggregate}} rather than 
{{aggregate}} to improve performance and scalability. In my test of dataset 
with 200 features and 1M instance, I found there are 15% increased performance. 
 (was: {{GaussianMixture}} should use {{treeAggregate}} rather than 
{{aggregate}} to improve performance and scalability. In my test of dataset 
with 200 features and 1M instance, I found there are 20% increased performance.)

> GaussianMixture should use treeAggregate to improve performance
> ---
>
> Key: SPARK-17033
> URL: https://issues.apache.org/jira/browse/SPARK-17033
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>Priority: Minor
>
> {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to 
> improve performance and scalability. In my test of dataset with 200 features 
> and 1M instance, I found there are 15% increased performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance

2016-08-12 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17033:

Description: {{GaussianMixture}} should use {{treeAggregate}} rather than 
{{aggregate}} to improve performance and scalability. In my test of dataset 
with 200 features and 1M instance, I found there is 20% increased performance.  
(was: {{GaussianMixture}} should use {{treeAggregate}} rather than 
{{aggregate}} to improve performance and scalability. In my test of dataset 
with 200 features and 1M instance, I found there are 20% increased performance.)

> GaussianMixture should use treeAggregate to improve performance
> ---
>
> Key: SPARK-17033
> URL: https://issues.apache.org/jira/browse/SPARK-17033
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>Priority: Minor
>
> {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to 
> improve performance and scalability. In my test of dataset with 200 features 
> and 1M instance, I found there is 20% increased performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance

2016-08-12 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-17033:
---

 Summary: GaussianMixture should use treeAggregate to improve 
performance
 Key: SPARK-17033
 URL: https://issues.apache.org/jira/browse/SPARK-17033
 Project: Spark
  Issue Type: Improvement
Reporter: Yanbo Liang
Priority: Minor


{{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to 
improve performance and scalability. In my test of dataset with 200 features 
and 1M instance, I found there are 20% increased performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing

2016-08-12 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418534#comment-15418534
 ] 

Wenchen Fan commented on SPARK-14850:
-

Please format your code; it's unreadable.

> VectorUDT/MatrixUDT should take primitive arrays without boxing
> ---
>
> Key: SPARK-14850
> URL: https://issues.apache.org/jira/browse/SPARK-14850
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>
> In SPARK-9390, we switched to use GenericArrayData to store indices and 
> values in vector/matrix UDTs. However, GenericArrayData is not specialized 
> for primitive types. This might hurt MLlib performance badly. We should 
> consider either specialize GenericArrayData or use a different container.
> cc: [~cloud_fan] [~yhuai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance

2016-08-12 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17033:

Component/s: MLlib
 ML

> GaussianMixture should use treeAggregate to improve performance
> ---
>
> Key: SPARK-17033
> URL: https://issues.apache.org/jira/browse/SPARK-17033
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>Priority: Minor
>
> {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to 
> improve performance and scalability. In my test of dataset with 200 features 
> and 1M instance, I found there are 20% increased performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


