[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually
[ https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419798#comment-15419798 ]

huangyu commented on SPARK-15044:
---------------------------------

Hi, I know this isn't Spark's fault. However, I think it would be better to log the error rather than throw an exception, because I am often in this situation: a table has many partitions (perhaps one per hour), and someone deletes the paths of many of them (I really don't know why they do this; perhaps there are bugs in their program). spark-sql then cannot work until I fix the Hive metadata, but I cannot run "alter table drop partition.." for every missing partition (there are too many). So I have no choice but to catch the exception and rebuild Spark.

> spark-sql will throw "input path does not exist" exception if it handles a
> partition which exists in hive table, but the path is removed manually
> --------------------------------------------------------------------------
>
>                 Key: SPARK-15044
>                 URL: https://issues.apache.org/jira/browse/SPARK-15044
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1, 2.0.0
>            Reporter: huangyu
>
> spark-sql will throw an "input path does not exist" exception if it handles a
> partition which exists in the Hive table but whose path has been removed
> manually. The situation is as follows:
> 1) Create a table "test": "create table test (n string) partitioned by (p string)"
> 2) Load some data into partition (p='1')
> 3) Remove the path of partition (p='1') of table test manually: "hadoop fs -rmr /warehouse//test/p=1"
> 4) Run spark-sql: spark-sql -e "select n from test where p='1';"
> It then throws this exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: ./test/p=1
>         at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
>         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
>         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
>         at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>         at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>         at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.immutable.List.foreach(List.scala:318)
>         at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>         at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>         at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is present in spark 1.6.1; if I use spark 1.4.0, it is OK.
> I think spark-sql should ignore the path, just like hive does or as spark did
> in earlier versions, rather than throw an exception.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
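The behaviour the commenter asks for can be sketched outside Spark: check each partition directory before handing it to the input format, and log-and-skip missing paths instead of failing the whole query. This is only an illustration of the proposed behaviour (the function and flag names are made up, and this is not Spark's actual code path); note also that some Spark versions expose a related option, spark.sql.hive.verifyPartitionPath, which filters partition paths against the filesystem.

```python
import logging
import os

def existing_partition_paths(paths, ignore_missing=True):
    """Keep only partition directories that still exist on disk.

    ignore_missing=True logs and skips a missing path (the behaviour the
    commenter asks for); ignore_missing=False reproduces the current
    fail-fast behaviour of raising on the first missing path.
    """
    kept = []
    for path in paths:
        if os.path.isdir(path):
            kept.append(path)
        elif ignore_missing:
            logging.error("Input path does not exist, skipping: %s", path)
        else:
            raise IOError("Input path does not exist: %s" % path)
    return kept
```

With this shape, a table with thousands of partitions and a handful of manually deleted paths would still scan, with one error line per missing partition.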
[jira] [Closed] (SPARK-16716) calling cache on joined dataframe can lead to data being blanked
[ https://issues.apache.org/jira/browse/SPARK-16716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

PJ Fanning closed SPARK-16716.
------------------------------
    Resolution: Duplicate

This looks like it was fixed by SPARK-16664.

> calling cache on joined dataframe can lead to data being blanked
> ----------------------------------------------------------------
>
>                 Key: SPARK-16716
>                 URL: https://issues.apache.org/jira/browse/SPARK-16716
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.2
>            Reporter: PJ Fanning
>
> I have reproduced the issue in Spark 1.6.2 and the latest 1.6.3-SNAPSHOT code.
> The code works OK on Spark 1.6.1.
> I have a notebook up on Databricks Community Edition that demonstrates the
> issue. The notebook depends on the library com.databricks:spark-csv_2.10:1.4.0.
> The code uses some custom code to join 4 dataframes.
> It calls show on this dataframe and the data is as expected.
> After calling .cache, the data is blanked.
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5458351705459939/3760010872339805/5521341683971298/latest.html
[jira] [Commented] (SPARK-17039) cannot read null dates from csv file
[ https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419768#comment-15419768 ]

Barry Becker commented on SPARK-17039:
--------------------------------------

I read the comments in SPARK-16462. It looks like it would fix this issue, but I have not tried it yet. I will look into building the tip of 2.0.1 locally later next week and try it out.

> cannot read null dates from csv file
> ------------------------------------
>
>                 Key: SPARK-17039
>                 URL: https://issues.apache.org/jira/browse/SPARK-17039
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.0.0
>            Reporter: Barry Becker
>
> I see this exact same bug as reported in this [stack overflow
> post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column]
> using Spark 2.0.0 (released version).
> In Scala, I read a CSV using
> {code}
> sqlContext.read
>   .format("csv")
>   .option("header", "false")
>   .option("inferSchema", "false")
>   .option("nullValue", "?")
>   .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss")
>   .schema(dfSchema)
>   .csv(dataFile)
> {code}
> The data contains some null dates (represented with ?).
> The error I get is:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
> stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0
> (TID 10, localhost): java.text.ParseException: Unparseable date: "?"
> {code}
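The crash in the description can be reproduced in miniature without Spark: the nullValue sentinel "?" reaches the date parser instead of being converted to null first. A minimal plain-Python stand-in (not Spark's CSV code path; function names are illustrative) showing both the failing order of operations and the fixed one:

```python
from datetime import datetime

def parse_date_buggy(field, fmt="%Y-%m-%dT%H:%M:%S"):
    """Reported behaviour: the sentinel reaches the parser and raises."""
    return datetime.strptime(field, fmt)

def parse_date(field, fmt="%Y-%m-%dT%H:%M:%S", null_value="?"):
    """Desired behaviour: map the nullValue sentinel to None *before*
    attempting to parse the field as a date."""
    if field == null_value:
        return None
    return datetime.strptime(field, fmt)
```

parse_date_buggy("?") raises, mirroring the ParseException above, while parse_date("?") yields None.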
[jira] [Commented] (SPARK-16519) Handle SparkR RDD generics that create warnings in R CMD check
[ https://issues.apache.org/jira/browse/SPARK-16519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419762#comment-15419762 ]

Shivaram Venkataraman commented on SPARK-16519:
-----------------------------------------------

FYI [~aloknsingh] [~clarkfitzg]: some of the RDD methods are being renamed in this JIRA.

> Handle SparkR RDD generics that create warnings in R CMD check
> --------------------------------------------------------------
>
>                 Key: SPARK-16519
>                 URL: https://issues.apache.org/jira/browse/SPARK-16519
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SparkR
>            Reporter: Shivaram Venkataraman
>
> One of the warnings we get from R CMD check is that the RDD implementations of
> some of the generics are not documented. These generics are shared between
> RDD and DataFrame in SparkR. The list includes:
> {quote}
> WARNING
> Undocumented S4 methods:
> generic 'cache' and siglist 'RDD'
> generic 'collect' and siglist 'RDD'
> generic 'count' and siglist 'RDD'
> generic 'distinct' and siglist 'RDD'
> generic 'first' and siglist 'RDD'
> generic 'join' and siglist 'RDD,RDD'
> generic 'length' and siglist 'RDD'
> generic 'partitionBy' and siglist 'RDD'
> generic 'persist' and siglist 'RDD,character'
> generic 'repartition' and siglist 'RDD'
> generic 'show' and siglist 'RDD'
> generic 'take' and siglist 'RDD,numeric'
> generic 'unpersist' and siglist 'RDD'
> {quote}
> As described in
> https://stat.ethz.ch/pipermail/r-devel/2003-September/027490.html this looks
> like a limitation of R, where exporting a generic from a package also exports
> all the implementations of that generic.
> One way to get around this is to remove the RDD API or rename the methods in
> Spark 2.1.
[jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2
[ https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419755#comment-15419755 ]

Apache Spark commented on SPARK-16975:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/14627

> Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2
> --------------------------------------------------------------------------
>
>                 Key: SPARK-16975
>                 URL: https://issues.apache.org/jira/browse/SPARK-16975
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.0.0
>         Environment: Ubuntu Linux 14.04
>            Reporter: immerrr again
>            Assignee: Dongjoon Hyun
>              Labels: parquet
>             Fix For: 2.0.1, 2.1.0
>
> Spark-2.0.0 seems to have some problems reading a parquet dataset generated by 1.6.2.
> {code}
> In [80]: spark.read.parquet('/path/to/data')
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at /path/to/data. It must be specified manually;'
> {code}
> The dataset is ~150G and partitioned by the _locality_code column. None of the
> partitions are empty. I have narrowed the failing dataset down to the first 32
> partitions of the data:
> {code}
> In [82]: spark.read.parquet(*subdirs[:32])
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI. It must be specified manually;'
> {code}
> Interestingly, it works OK if you remove any one of the partitions from the list:
> {code}
> In [83]: for i in range(32): spark.read.parquet(*(subdirs[:i] + subdirs[i+1:32]))
> {code}
> Another strange thing is that the schemas for the first and the last 31
> partitions of the subset are identical:
> {code}
> In [84]: spark.read.parquet(*subdirs[:31]).schema.fields == spark.read.parquet(*subdirs[1:32]).schema.fields
> Out[84]: True
> {code}
> Which got me interested, so I tried this:
> {code}
> In [87]: spark.read.parquet(*([subdirs[0]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AQ. It must be specified manually;'
> In [88]: spark.read.parquet(*([subdirs[15]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at /path/to/data/_locality_code=AX,/path/to/data/_locality_code=AX. It must be specified manually;'
> In [89]: spark.read.parquet(*([subdirs[31]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at /path/to/data/_locality_code=BE,/path/to/data/_locality_code=BE. It must be specified manually;'
> {code}
> If I read the first partition, save it with 2.0 and try to read it in the same
> manner, everything is fine:
> {code}
> In [100]: spark.read.parquet(subdirs[0]).write.parquet('spark-2.0-test')
> 16/08/09 11:03:37 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
> In [101]: df = spark.read.parquet(*(['spark-2.0-test'] * 32))
> {code}
> I originally posted this to the user mailing list, but with these last
> discoveries it clearly seems like a bug.
[jira] [Comment Edited] (SPARK-17039) cannot read null dates from csv file
[ https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419730#comment-15419730 ]

Liwei Lin edited comment on SPARK-17039 at 8/13/16 12:36 AM:
-------------------------------------------------------------

Thanks [~barrybecker4] for reporting this! Please also see https://issues.apache.org/jira/browse/SPARK-16462. It'd be great if you could try the patch https://github.com/apache/spark/pull/14118 and provide some feedback.

was (Author: proflin):
Thanks [~barrybecker4] for reporting this. Please also see https://issues.apache.org/jira/browse/SPARK-16462.
[jira] [Commented] (SPARK-17039) cannot read null dates from csv file
[ https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419730#comment-15419730 ]

Liwei Lin commented on SPARK-17039:
-----------------------------------

Thanks [~barrybecker4] for reporting this. Please also see https://issues.apache.org/jira/browse/SPARK-16462.
[jira] [Assigned] (SPARK-16519) Handle SparkR RDD generics that create warnings in R CMD check
[ https://issues.apache.org/jira/browse/SPARK-16519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16519:
------------------------------------

    Assignee:     (was: Apache Spark)
[jira] [Commented] (SPARK-16519) Handle SparkR RDD generics that create warnings in R CMD check
[ https://issues.apache.org/jira/browse/SPARK-16519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419716#comment-15419716 ]

Apache Spark commented on SPARK-16519:
--------------------------------------

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/14626
[jira] [Assigned] (SPARK-16519) Handle SparkR RDD generics that create warnings in R CMD check
[ https://issues.apache.org/jira/browse/SPARK-16519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16519:
------------------------------------

    Assignee: Apache Spark
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419705#comment-15419705 ]

Matt Sicker commented on SPARK-6305:
------------------------------------

Perhaps [composite configuration|https://logging.apache.org/log4j/2.x/manual/configuration.html#CompositeConfiguration] could be helpful here? That way you can override some default loggers and whatnot while still allowing user-defined logging config files.

> Add support for log4j 2.x to Spark
> ----------------------------------
>
>                 Key: SPARK-6305
>                 URL: https://issues.apache.org/jira/browse/SPARK-6305
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>            Reporter: Tal Sliwowicz
>            Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars to the
> classpath. Since there are shaded jars, it must be done during the build.
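As a rough sketch of the composite-configuration idea: the project could ship a default Log4j 2 config, and a user-supplied file listed after it would be merged over it. The file names below are hypothetical, and the exact merge behaviour depends on the Log4j 2 version in use.

```xml
<!-- spark-log4j2-defaults.xml (hypothetical name): defaults the project could ship -->
<Configuration status="WARN">
  <Appenders>
    <Console name="console" target="SYSTEM_ERR">
      <PatternLayout pattern="%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n"/>
    </Console>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="console"/>
    </Root>
  </Loggers>
</Configuration>
```

A user config listed alongside it, e.g. via `-Dlog4j.configurationFile=spark-log4j2-defaults.xml,user-log4j2.xml`, could then raise or lower individual logger levels without replacing the defaults wholesale.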
[jira] [Commented] (SPARK-17001) Enable standardScaler to standardize sparse vectors when withMean=True
[ https://issues.apache.org/jira/browse/SPARK-17001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419592#comment-15419592 ]

Tobi Bosede commented on SPARK-17001:
-------------------------------------

This can be implemented in a fashion similar to scikit-learn's maxabs_scale. See http://scikit-learn.org/stable/modules/preprocessing.html#scaling-sparse-data for more information.

> Enable standardScaler to standardize sparse vectors when withMean=True
> ----------------------------------------------------------------------
>
>                 Key: SPARK-17001
>                 URL: https://issues.apache.org/jira/browse/SPARK-17001
>             Project: Spark
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Tobi Bosede
>            Priority: Minor
>
> When withMean = true, StandardScaler will not handle sparse vectors, and
> instead throws an exception. This is presumably because subtracting the mean
> makes a sparse vector dense, which can be undesirable.
> However, VectorAssembler generates vectors that may be a mix of sparse and
> dense, even when vectors are smallish, depending on their values. It's common
> to feed this into StandardScaler, but with withMean = true it would sometimes
> fail, depending on the input. This is kind of surprising.
> StandardScaler should go ahead and operate on sparse vectors and subtract the
> mean, if explicitly asked to do so with withMean, on the theory that the user
> knows what he/she is doing and there is otherwise no way to make this work.
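The reason withMean and sparse input clash is that subtracting a non-zero mean turns every implicit zero into a non-zero entry, so the centred result is inherently dense. A toy illustration (a plain-Python sparse vector as an index-to-value dict, not Spark's Vector types; the function name is made up):

```python
def center_sparse(values_by_index, size, mean):
    """Subtract a column mean from a sparse vector.

    Every implicit zero becomes -mean, so the centred result must be
    materialised as a dense list -- which is why StandardScaler refuses
    sparse input when withMean = true.
    """
    return [values_by_index.get(i, 0.0) - mean for i in range(size)]
```

For example, centring the sparse vector {2: 5.0} of size 5 with mean 1.0 produces a fully dense result in which four previously implicit zeros are now -1.0.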
[jira] [Commented] (SPARK-17038) StreamingSource reports metrics for lastCompletedBatch instead of lastReceivedBatch
[ https://issues.apache.org/jira/browse/SPARK-17038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419571#comment-15419571 ]

Shixiong Zhu commented on SPARK-17038:
--------------------------------------

Good catch. Could you submit a PR to fix it, please?

> StreamingSource reports metrics for lastCompletedBatch instead of
> lastReceivedBatch
> -----------------------------------------------------------------
>
>                 Key: SPARK-17038
>                 URL: https://issues.apache.org/jira/browse/SPARK-17038
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.6.2, 2.0.0
>            Reporter: Oz Ben-Ami
>            Priority: Minor
>              Labels: metrics
>
> StreamingSource's lastReceivedBatch_submissionTime,
> lastReceivedBatch_processingTimeStart, and
> lastReceivedBatch_processingTimeEnd all use data from lastCompletedBatch
> instead of lastReceivedBatch. In particular, this makes it impossible to
> match lastReceivedBatch_records with a batch ID/submission time.
> This is apparent when looking at StreamingSource.scala, lines 89-94.
> I would guess that just replacing Completed->Received in those lines would
> fix the issue, but I haven't tested it.
[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419458#comment-15419458 ]

Sital Kedia commented on SPARK-16922:
-------------------------------------

I am using the fix in https://github.com/apache/spark/pull/14464/files, but the issue remains. The joining key lies in the range [43304915L to 10150946266075397L].

> Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
> ---------------------------------------------------------------------
>
>                 Key: SPARK-16922
>                 URL: https://issues.apache.org/jira/browse/SPARK-16922
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 2.0.0
>            Reporter: Sital Kedia
>
> A query which used to work in Spark 1.6 fails with executor OOM in 2.0.
> Stack trace:
> {code}
>         at org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>         at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>         at org.apache.spark.scheduler.Task.run(Task.scala:85)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> Query plan in Spark 1.6:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3])
> +- TungstenExchange hashpartitioning(field1#101,200), None
>    +- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111])
>       +- Project [field1#101,field2#74]
>          +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as decimal(20,0)) as bigint)], BuildRight
>             :- ConvertToUnsafe
>             :  +- HiveTableScan [field2#74,field5#63L], MetastoreRelation foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)]
>             +- ConvertToUnsafe
>                +- HiveTableScan [field1#101,field4#97], MetastoreRelation foo, table2, Some(b)
> {code}
> Query plan in 2.0:
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))])
> +- Exchange hashpartitioning(field1#160, 200)
>    +- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / 100.0))])
>       +- *Project [field2#133, field1#160]
>          +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as decimal(20,0)) as bigint)], Inner, BuildRight
>             :- *Filter isnotnull(field5#122L)
>             :  +- HiveTableScan [field5#122L, field2#133], MetastoreRelation foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= 2013-12-31)]
>             +- BroadcastExchange HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as decimal(20,0)) as bigint)))
>                +- *Filter isnotnull(field4#156)
>                   +- HiveTableScan [field4#156, field1#160], MetastoreRelation foo, table2, b
> {code}
[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419438#comment-15419438 ]

Davies Liu commented on SPARK-16922:
------------------------------------

I think it's fixed by https://github.com/apache/spark/pull/14464/files
[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419434#comment-15419434 ] Davies Liu commented on SPARK-16922: [~sitalke...@gmail.com] There are two integer overflow bugs in LongHashedRelation that were fixed recently; could you test with the latest master? What is the range of your joining key? > Query with Broadcast Hash join fails due to executor OOM in Spark 2.0 > - > > Key: SPARK-16922 > URL: https://issues.apache.org/jira/browse/SPARK-16922 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > A query which used to work in Spark 1.6 fails with executor OOM in 2.0.
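The integer-overflow failure mode Davies asks about can be illustrated in a few lines of Scala. This is a toy sketch only, not Spark's actual LongHashedRelation code: the assumed shape is a dense long-keyed map that indexes a backing array by `(key - minKey)`, which silently wraps if that offset is ever narrowed to 32 bits.

```scala
// Toy illustration (assumed failure shape, not Spark source): narrowing a
// long key offset to Int wraps once the key range exceeds Int.MaxValue.
object OverflowSketch {
  def main(args: Array[String]): Unit = {
    val minKey = 0L
    val key = 3000000000L                // a join key beyond Int.MaxValue
    val safeOffset = key - minKey        // 3000000000: correct as a Long
    val narrowed = (key - minKey).toInt  // wraps to a negative Int
    println(s"safe=$safeOffset narrowed=$narrowed")
  }
}
```

This is why the range of the joining key matters: keys confined to a narrow range never cross the 32-bit wrap point, so the bug only surfaces on wide-ranging keys.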
[jira] [Commented] (SPARK-16716) calling cache on joined dataframe can lead to data being blanked
[ https://issues.apache.org/jira/browse/SPARK-16716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419430#comment-15419430 ] PJ Fanning commented on SPARK-16716: I set up an equivalent notebook for Spark 2.0 in Databricks Community Edition, and the join and cache worked correctly. It appears the issue is specific to the Spark 1.6 line. > calling cache on joined dataframe can lead to data being blanked > > > Key: SPARK-16716 > URL: https://issues.apache.org/jira/browse/SPARK-16716 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 >Reporter: PJ Fanning > > I have reproduced the issue in Spark 1.6.2 and the latest 1.6.3-SNAPSHOT code. > The code works correctly on Spark 1.6.1. > I have a notebook up on Databricks Community Edition that demonstrates the > issue. The notebook depends on the library com.databricks:spark-csv_2.10:1.4.0 > The code joins 4 dataframes using some custom code. > It calls show on the joined dataframe and the data is as expected. > After calling .cache, the data is blanked. > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5458351705459939/3760010872339805/5521341683971298/latest.html
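The repro lives in the linked Databricks notebook; its shape can be sketched as follows. All names here are hypothetical, since the notebook's custom join code is not quoted in the report:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical sketch of the notebook's repro: join several DataFrames on a
// shared key, show() the result, then cache() and show() again.
def repro(dfs: Seq[DataFrame], key: String): Unit = {
  val joined = dfs.reduce((l, r) => l.join(r, Seq(key)))
  joined.show()   // data is as expected here
  joined.cache()
  joined.show()   // on affected 1.6.2 / 1.6.3-SNAPSHOT builds, values come back blank
}
```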
[jira] [Updated] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sital Kedia updated SPARK-16922: Summary: Query with Broadcast Hash join fails due to executor OOM in Spark 2.0 (was: Query failure due to executor OOM in Spark 2.0) > Query with Broadcast Hash join fails due to executor OOM in Spark 2.0 > - > > Key: SPARK-16922 > URL: https://issues.apache.org/jira/browse/SPARK-16922 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > A query which used to work in Spark 1.6 fails with executor OOM in 2.0.
[jira] [Comment Edited] (SPARK-16922) Query failure due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419383#comment-15419383 ] Sital Kedia edited comment on SPARK-16922 at 8/12/16 8:06 PM: -- I found that the regression was introduced in https://github.com/apache/spark/pull/12278, which introduced a new data structure (LongHashedRelation) for long types. I made a hack to use UnsafeHashedRelation instead of LongHashedRelation in https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L105 and things started working fine. This might be due to some data corruption happening in LongHashedRelation. cc - [~davies] > Query failure due to executor OOM in Spark 2.0 > -- > > Key: SPARK-16922 > URL: https://issues.apache.org/jira/browse/SPARK-16922 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > A query which used to work in Spark 1.6 fails with executor OOM in 2.0.
[jira] [Commented] (SPARK-16922) Query failure due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419383#comment-15419383 ] Sital Kedia commented on SPARK-16922: - I found that the regression was introduced in https://github.com/apache/spark/pull/12278, which introduced a new data structure (LongHashedRelation) for long types. I made a hack to use UnsafeHashedRelation instead of LongHashedRelation in https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L105 and things started working fine. This might be due to some data corruption happening in LongHashedRelation. cc - [~davies] > Query failure due to executor OOM in Spark 2.0 > -- > > Key: SPARK-16922 > URL: https://issues.apache.org/jira/browse/SPARK-16922 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > A query which used to work in Spark 1.6 fails with executor OOM in 2.0.
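The hack Sital describes amounts to bypassing the LongType fast path in `HashedRelation.apply` so every broadcast relation is built as a generic `UnsafeHashedRelation`. The sketch below is illustrative only; the parameter list is an approximation of the Spark 2.0 internals, see the linked HashedRelation.scala for the real code:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Expression

// Approximate sketch of the workaround, not the actual Spark source:
def apply(input: Iterator[InternalRow], key: Seq[Expression],
          sizeEstimate: Int = 64): HashedRelation = {
  // The original 2.0 code picks the specialized long-keyed relation roughly like:
  //   if (key.length == 1 && key.head.dataType == LongType)
  //     LongHashedRelation(input, key, sizeEstimate)
  // The workaround always falls back to the generic implementation:
  UnsafeHashedRelation(input, key, sizeEstimate)
}
```

This trades the long-key fast path's speed and memory savings for the generic implementation's correctness, which is consistent with the corruption being confined to LongHashedRelation.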
[jira] [Assigned] (SPARK-17045) Moving Auto_Joins from HiveCompatibilitySuite to SQLQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-17045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17045: Assignee: Apache Spark > Moving Auto_Joins from HiveCompatibilitySuite to SQLQueryTestSuite > -- > > Key: SPARK-17045 > URL: https://issues.apache.org/jira/browse/SPARK-17045 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Apache Spark > > https://github.com/apache/spark/pull/14498 plans to remove Hive Built-in Hash > Functions. 10+ test cases are broken because the results are different from > the Hive golden answer files. These broken test cases are not Hive specific. > Thus, it makes more sense to move them to `SQLQueryTestSuite`
[jira] [Assigned] (SPARK-17045) Moving Auto_Joins from HiveCompatibilitySuite to SQLQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-17045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17045: Assignee: (was: Apache Spark) > Moving Auto_Joins from HiveCompatibilitySuite to SQLQueryTestSuite > -- > > Key: SPARK-17045 > URL: https://issues.apache.org/jira/browse/SPARK-17045 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > https://github.com/apache/spark/pull/14498 plans to remove Hive Built-in Hash > Functions. 10+ test cases are broken because the results are different from > the Hive golden answer files. These broken test cases are not Hive specific. > Thus, it makes more sense to move them to `SQLQueryTestSuite`
[jira] [Commented] (SPARK-17045) Moving Auto_Joins from HiveCompatibilitySuite to SQLQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-17045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419320#comment-15419320 ] Apache Spark commented on SPARK-17045: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/14625 > Moving Auto_Joins from HiveCompatibilitySuite to SQLQueryTestSuite > -- > > Key: SPARK-17045 > URL: https://issues.apache.org/jira/browse/SPARK-17045 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > https://github.com/apache/spark/pull/14498 plans to remove Hive Built-in Hash > Functions. 10+ test cases are broken because the results are different from > the Hive golden answer files. These broken test cases are not Hive specific. > Thus, it makes more sense to move them to `SQLQueryTestSuite`
[jira] [Created] (SPARK-17045) Moving Auto_Joins from HiveCompatibilitySuite to SQLQueryTestSuite
Xiao Li created SPARK-17045: --- Summary: Moving Auto_Joins from HiveCompatibilitySuite to SQLQueryTestSuite Key: SPARK-17045 URL: https://issues.apache.org/jira/browse/SPARK-17045 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li https://github.com/apache/spark/pull/14498 plans to remove Hive Built-in Hash Functions. 10+ test cases are broken because the results are different from the Hive golden answer files. These broken test cases are not Hive specific. Thus, it makes more sense to move them to `SQLQueryTestSuite`
[jira] [Commented] (SPARK-17042) Repl-defined classes cannot be replicated
[ https://issues.apache.org/jira/browse/SPARK-17042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419267#comment-15419267 ] Sean Owen commented on SPARK-17042: --- Scala 2.10 or 2.11? I'm pretty sure this is a duplicate. > Repl-defined classes cannot be replicated > - > > Key: SPARK-17042 > URL: https://issues.apache.org/jira/browse/SPARK-17042 > Project: Spark > Issue Type: Sub-task > Components: Block Manager, Spark Core >Reporter: Eric Liang > > A simple fix is to erase the classTag when using the default serializer, > since it's not needed in that case, and the classTag was failing to > deserialize on the remote end. > The proper fix is actually to use the right classloader when deserializing > the classtags, but that is a much more invasive change for 2.0. > The following test can be added to ReplSuite to reproduce the bug: > {code} > test("replicating blocks of object with class defined in repl") { > val output = runInterpreter("local-cluster[2,1,1024]", > """ > |import org.apache.spark.storage.StorageLevel._ > |case class Foo(i: Int) > |val ret = sc.parallelize((1 to 100).map(Foo), > 10).persist(MEMORY_ONLY_2) > |ret.count() > |sc.getExecutorStorageStatus.map(s => > s.rddBlocksById(ret.id).size).sum > """.stripMargin) > assertDoesNotContain("error:", output) > assertDoesNotContain("Exception", output) > assertContains(": Int = 20", output) > } > {code}
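The "simple fix" in the description can be sketched as follows. This is illustrative only; the helper name and `elementClassTag` parameter are assumptions, not actual Spark fields. The idea is that when the default `JavaSerializer` is in use, the block's `ClassTag` is never consulted, so it can be replaced with `ClassTag.Any` before metadata is shipped, sidestepping the failed deserialization of tags for REPL-defined classes:

```scala
import scala.reflect.ClassTag
import org.apache.spark.serializer.{JavaSerializer, Serializer}

// Illustrative sketch of "erase the classTag for the default serializer";
// elementClassTag is a hypothetical stand-in for the RDD's element tag.
def tagToShip(serializer: Serializer, elementClassTag: ClassTag[_]): ClassTag[_] =
  serializer match {
    case _: JavaSerializer => ClassTag.Any // tag unused; may not resolve remotely
    case _                 => elementClassTag
  }
```

The proper fix mentioned in the description, resolving the tag with the REPL's classloader on the remote end, would make the erasure unnecessary but touches the deserialization path itself.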
[jira] [Resolved] (SPARK-17043) Cannot call zipWithIndex on RDD with more than 200 columns (get wrong result)
[ https://issues.apache.org/jira/browse/SPARK-17043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17043. --- Resolution: Duplicate > Cannot call zipWithIndex on RDD with more than 200 columns (get wrong result) > - > > Key: SPARK-17043 > URL: https://issues.apache.org/jira/browse/SPARK-17043 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 >Reporter: Barry Becker > > I have a method that adds a row index column to a dataframe. It only works > correctly if the dataframe has less than 200 columns. When more than 200 > columns nearly all the data becomes empty (""'s for values). > {code} > def zipWithIndex(df: DataFrame, rowIdxColName: String): DataFrame = { > val nullable = false > df.sparkSession.createDataFrame( > df.rdd.zipWithIndex.map{case (row, i) => Row.fromSeq(row.toSeq :+ i)}, > StructType(df.schema.fields :+ StructField(rowIdxColName, LongType, > nullable)) > ) > } > {code} > This might be related to https://issues.apache.org/jira/browse/SPARK-16664 > but I'm not sure. I saw the 200 column threshold and it made me think it > might be related. I saw this problem in spark 1.6.2 and 2.0.0. Maybe it is > fixed in 2.0.1 (have not tried yet). I have no idea why the 200 column > threshold is significant.
[jira] [Assigned] (SPARK-17044) Add window function test in SQLQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-17044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17044: Assignee: (was: Apache Spark) > Add window function test in SQLQueryTestSuite > - > > Key: SPARK-17044 > URL: https://issues.apache.org/jira/browse/SPARK-17044 > Project: Spark > Issue Type: Improvement >Reporter: Dongjoon Hyun >Priority: Minor > > New `SQLQueryTestSuite` simplifies SQL testcases. > This issue aims to replace `WindowQuerySuite.scala` of `sql/hive` module with > `window_functions.sql` in `sql/core` module.
[jira] [Commented] (SPARK-17044) Add window function test in SQLQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-17044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419238#comment-15419238 ] Apache Spark commented on SPARK-17044: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/14623 > Add window function test in SQLQueryTestSuite > - > > Key: SPARK-17044 > URL: https://issues.apache.org/jira/browse/SPARK-17044 > Project: Spark > Issue Type: Improvement >Reporter: Dongjoon Hyun >Priority: Minor > > New `SQLQueryTestSuite` simplifies SQL testcases. > This issue aims to replace `WindowQuerySuite.scala` of `sql/hive` module with > `window_functions.sql` in `sql/core` module.
[jira] [Assigned] (SPARK-17044) Add window function test in SQLQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-17044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17044: Assignee: Apache Spark > Add window function test in SQLQueryTestSuite > - > > Key: SPARK-17044 > URL: https://issues.apache.org/jira/browse/SPARK-17044 > Project: Spark > Issue Type: Improvement >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > New `SQLQueryTestSuite` simplifies SQL testcases. > This issue aims to replace `WindowQuerySuite.scala` of `sql/hive` module with > `window_functions.sql` in `sql/core` module.
[jira] [Created] (SPARK-17044) Add window function test in SQLQueryTestSuite
Dongjoon Hyun created SPARK-17044: - Summary: Add window function test in SQLQueryTestSuite Key: SPARK-17044 URL: https://issues.apache.org/jira/browse/SPARK-17044 Project: Spark Issue Type: Improvement Reporter: Dongjoon Hyun Priority: Minor New `SQLQueryTestSuite` simplifies SQL testcases. This issue aims to replace `WindowQuerySuite.scala` of `sql/hive` module with `window_functions.sql` in `sql/core` module.
[jira] [Updated] (SPARK-17042) Repl-defined classes cannot be replicated
[ https://issues.apache.org/jira/browse/SPARK-17042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-17042: --- Description: A simple fix is to erase the classTag when using the default serializer, since it's not needed in that case, and the classTag was failing to deserialize on the remote end. The proper fix is actually to use the right classloader when deserializing the classtags, but that is a much more invasive change for 2.0. The following test can be added to ReplSuite to reproduce the bug: {code} test("replicating blocks of object with class defined in repl") { val output = runInterpreter("local-cluster[2,1,1024]", """ |import org.apache.spark.storage.StorageLevel._ |case class Foo(i: Int) |val ret = sc.parallelize((1 to 100).map(Foo), 10).persist(MEMORY_ONLY_2) |ret.count() |sc.getExecutorStorageStatus.map(s => s.rddBlocksById(ret.id).size).sum """.stripMargin) assertDoesNotContain("error:", output) assertDoesNotContain("Exception", output) assertContains(": Int = 20", output) } {code} was: The following test can be added to ReplSuite to reproduce the bug: {code} test("replicating blocks of object with class defined in repl") { val output = runInterpreter("local-cluster[2,1,1024]", """ |import org.apache.spark.storage.StorageLevel._ |case class Foo(i: Int) |val ret = sc.parallelize((1 to 100).map(Foo), 10).persist(MEMORY_ONLY_2) |ret.count() |sc.getExecutorStorageStatus.map(s => s.rddBlocksById(ret.id).size).sum """.stripMargin) assertDoesNotContain("error:", output) assertDoesNotContain("Exception", output) assertContains(": Int = 20", output) } {code} > Repl-defined classes cannot be replicated > - > > Key: SPARK-17042 > URL: https://issues.apache.org/jira/browse/SPARK-17042 > Project: Spark > Issue Type: Sub-task > Components: Block Manager, Spark Core >Reporter: Eric Liang
[jira] [Created] (SPARK-17043) Cannot call zipWithIndex on RDD with more than 200 columns (get wrong result)
Barry Becker created SPARK-17043: Summary: Cannot call zipWithIndex on RDD with more than 200 columns (get wrong result) Key: SPARK-17043 URL: https://issues.apache.org/jira/browse/SPARK-17043 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0, 1.6.2 Reporter: Barry Becker I have a method that adds a row index column to a dataframe. It only works correctly if the dataframe has fewer than 200 columns. With more than 200 columns, nearly all the data becomes empty (""'s for values).
{code}
def zipWithIndex(df: DataFrame, rowIdxColName: String): DataFrame = {
  val nullable = false
  df.sparkSession.createDataFrame(
    df.rdd.zipWithIndex.map { case (row, i) => Row.fromSeq(row.toSeq :+ i) },
    StructType(df.schema.fields :+ StructField(rowIdxColName, LongType, nullable))
  )
}
{code}
This might be related to https://issues.apache.org/jira/browse/SPARK-16664 but I'm not sure. I saw the 200 column threshold and it made me think it might be related. I saw this problem in spark 1.6.2 and 2.0.0. Maybe it is fixed in 2.0.1 (have not tried yet). I have no idea why the 200 column threshold is significant. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
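Stripped of Spark, the transformation the reporter's method performs is just an index append. A minimal plain-Scala sketch of the same logic, with invented sample data:

```scala
// Plain-Scala model of the zipWithIndex helper above: each "row" is a Seq of
// values; zip with the 0-based row index and append it as an extra Long column.
val rows: Seq[Seq[Any]] = Seq(Seq("a", 10), Seq("b", 20), Seq("c", 30))

val withIndex: Seq[Seq[Any]] =
  rows.zipWithIndex.map { case (row, i) => row :+ i.toLong }

// withIndex == Seq(Seq("a", 10, 0L), Seq("b", 20, 1L), Seq("c", 30, 2L))
```

Nothing in this logic depends on the column count, which supports the reporter's suspicion that the 200-column threshold comes from elsewhere in Spark's execution path rather than from the index-append itself.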
[jira] [Created] (SPARK-17042) Repl-defined classes cannot be replicated
Eric Liang created SPARK-17042: -- Summary: Repl-defined classes cannot be replicated Key: SPARK-17042 URL: https://issues.apache.org/jira/browse/SPARK-17042 Project: Spark Issue Type: Sub-task Reporter: Eric Liang The following test can be added to ReplSuite to reproduce the bug:
{code}
test("replicating blocks of object with class defined in repl") {
  val output = runInterpreter("local-cluster[2,1,1024]",
    """
      |import org.apache.spark.storage.StorageLevel._
      |case class Foo(i: Int)
      |val ret = sc.parallelize((1 to 100).map(Foo), 10).persist(MEMORY_ONLY_2)
      |ret.count()
      |sc.getExecutorStorageStatus.map(s => s.rddBlocksById(ret.id).size).sum
    """.stripMargin)
  assertDoesNotContain("error:", output)
  assertDoesNotContain("Exception", output)
  assertContains(": Int = 20", output)
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17003) release-build.sh is missing hive-thriftserver for scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-17003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-17003. -- Resolution: Fixed Fix Version/s: 1.6.3 Issue resolved by pull request 14586 [https://github.com/apache/spark/pull/14586] > release-build.sh is missing hive-thriftserver for scala 2.11 > > > Key: SPARK-17003 > URL: https://issues.apache.org/jira/browse/SPARK-17003 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.6.2 >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 1.6.3 > > > The same issue as SPARK-16453. But for branch 1.6, we are missing the profile > for scala 2.11 build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17041) Columns in schema are no longer case sensitive when reading csv file
[ https://issues.apache.org/jira/browse/SPARK-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419158#comment-15419158 ] Barry Becker commented on SPARK-17041: -- I'm not sure either. How can we find out? I think it would be better if columns were case sensitive. Was the change intentional? This was a dataset from a client, and for whatever reason, they thought it was reasonable to have columns that varied only by case. Since it was something that worked before, I thought it might be considered a regression, but maybe it should be a feature request. In our product, someone may rename a column from "output" to "Output". If the column names are not case sensitive, I'm not sure what problems this might cause. At a minimum, the rename will probably not work. > Columns in schema are no longer case sensitive when reading csv file > > > Key: SPARK-17041 > URL: https://issues.apache.org/jira/browse/SPARK-17041 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Barry Becker > > It used to be (in spark 1.6.2) that I could read a csv file that had columns > with names that differed only by case. For example, one column may be > "output" and another called "Output". Now (with spark 2.0.0) if I try to read > such a file, I get an error like this: > {code} > org.apache.spark.sql.AnalysisException: Reference 'Output' is ambiguous, > could be: Output#1263, Output#1295.; > {code} > The schema (dfSchema below) that I pass to the csv read looks like this: > {code} > StructType( StructField(Output,StringType,true), ... > StructField(output,StringType,true), ...) 
> {code} > The code that does the read is this > {code} > sqlContext.read > .format("csv") > .option("header", "false") // Use first line of all files as header > .option("inferSchema", "false") // Automatically infer data types > .schema(dfSchema) > .csv(dataFile) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
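As a possible workaround for this thread: Spark SQL exposes an analyzer case-sensitivity switch. This is a hedged sketch, not a confirmed fix — `spark.sql.caseSensitive` exists as a Spark SQL conf, but whether it restores the 1.6.2 CSV behavior here should be verified; `spark` (a SparkSession), `dfSchema`, and `dataFile` are assumed to be in scope:

```scala
// Sketch (assumptions noted above): enable case-sensitive column resolution
// so that "output" and "Output" are treated as distinct references.
spark.conf.set("spark.sql.caseSensitive", "true")

val df = spark.read
  .format("csv")
  .option("header", "false")
  .schema(dfSchema) // schema contains both "output" and "Output"
  .csv(dataFile)
```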
[jira] [Commented] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419151#comment-15419151 ] Herman van Hovell commented on SPARK-6235: -- [~gq] it might be a good idea to share some design before pressing ahead. This seems to be a complex issue that probably needs some discussion of the approach before a PR is opened. If we don't take this precaution, you might end up putting a lot of time into a very complex and very difficult-to-review PR. > Address various 2G limits > - > > Key: SPARK-6235 > URL: https://issues.apache.org/jira/browse/SPARK-6235 > Project: Spark > Issue Type: Umbrella > Components: Shuffle, Spark Core >Reporter: Reynold Xin > > An umbrella ticket to track the various 2G limits we have in Spark, due to the > use of byte arrays and ByteBuffers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16771) Infinite recursion loop in org.apache.spark.sql.catalyst.trees.TreeNode when table name collides.
[ https://issues.apache.org/jira/browse/SPARK-16771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-16771. --- Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.1.0 > Infinite recursion loop in org.apache.spark.sql.catalyst.trees.TreeNode when > table name collides. > - > > Key: SPARK-16771 > URL: https://issues.apache.org/jira/browse/SPARK-16771 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.2, 2.0.0 >Reporter: Furcy Pin >Assignee: Dongjoon Hyun > Fix For: 2.1.0 > > > How to reproduce: > In spark-sql on Hive > {code} > DROP TABLE IF EXISTS t1 ; > CREATE TABLE test.t1(col1 string) ; > WITH t1 AS ( > SELECT col1 > FROM t1 > ) > SELECT col1 > FROM t1 > LIMIT 2 > ; > {code} > This make a nice StackOverflowError: > {code} > java.lang.StackOverflowError > at > scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:230) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:233) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:348) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at 
scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:170) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution$$anonfun$substituteCTE$1.applyOrElse(Analyzer.scala:147) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution$$anonfun$substituteCTE$1.applyOrElse(Analyzer.scala:133) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$2.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$2.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) > ... > {code} > This does not happen if I change
[jira] [Commented] (SPARK-17041) Columns in schema are no longer case sensitive when reading csv file
[ https://issues.apache.org/jira/browse/SPARK-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419133#comment-15419133 ] Sean Owen commented on SPARK-17041: --- Behavior changes across major versions. I'm not sure this is a bug just because behavior is different. > Columns in schema are no longer case sensitive when reading csv file > > > Key: SPARK-17041 > URL: https://issues.apache.org/jira/browse/SPARK-17041 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Barry Becker > > It used to be (in spark 1.6.2) that I could read a csv file that had columns > with names that differed only by case. For example, one column may be > "output" and another called "Output". Now (with spark 2.0.0) if I try to read > such a file, I get an error like this: > {code} > org.apache.spark.sql.AnalysisException: Reference 'Output' is ambiguous, > could be: Output#1263, Output#1295.; > {code} > The schema (dfSchema below) that I pass to the csv read looks like this: > {code} > StructType( StructField(Output,StringType,true), ... > StructField(output,StringType,true), ...) > {code} > The code that does the read is this > {code} > sqlContext.read > .format("csv") > .option("header", "false") // Use first line of all files as header > .option("inferSchema", "false") // Automatically infer data types > .schema(dfSchema) > .csv(dataFile) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17041) Columns in schema are no longer case sensitive when reading csv file
Barry Becker created SPARK-17041: Summary: Columns in schema are no longer case sensitive when reading csv file Key: SPARK-17041 URL: https://issues.apache.org/jira/browse/SPARK-17041 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.0.0 Reporter: Barry Becker It used to be (in spark 1.6.2) that I could read a csv file that had columns with names that differed only by case. For example, one column may be "output" and another called "Output". Now (with spark 2.0.0) if I try to read such a file, I get an error like this:
{code}
org.apache.spark.sql.AnalysisException: Reference 'Output' is ambiguous, could be: Output#1263, Output#1295.;
{code}
The schema (dfSchema below) that I pass to the csv read looks like this:
{code}
StructType(
  StructField(Output,StringType,true), ...
  StructField(output,StringType,true), ...)
{code}
The code that does the read is this
{code}
sqlContext.read
  .format("csv")
  .option("header", "false")      // Use first line of all files as header
  .option("inferSchema", "false") // Automatically infer data types
  .schema(dfSchema)
  .csv(dataFile)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually
[ https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419079#comment-15419079 ] Artur commented on SPARK-15044: --- I don't know what we should do with this issue. The root cause is invalid Hive metadata, so it isn't Spark's fault. I've built Spark where I catch InvalidInputException in HadoopRDD and just log it; here are the results. Spark doesn't fail now:
{code}
spark-sql> select * from test;
16/08/13 00:58:08 ERROR HadoopRDD: Input path does not exist: .../test/p=1
2 2
3 3
Time taken: 4.494 seconds, Fetched 2 row(s)

hive> select * from test;
OK
2 2
3 3
Time taken: 0.041 seconds, Fetched: 2 row(s)
{code}
However, I don't know whether not failing when a file doesn't exist in HDFS is the correct behavior. > spark-sql will throw "input path does not exist" exception if it handles a > partition which exists in hive table, but the path is removed manually > - > > Key: SPARK-15044 > URL: https://issues.apache.org/jira/browse/SPARK-15044 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: huangyu > > spark-sql will throw "input path not exist" exception if it handles a > partition which exists in hive table, but the path is removed manually.The > situation is as follows: > 1) Create a table "test". "create table test (n string) partitioned by (p > string)" > 2) Load some data into partition(p='1') > 3)Remove the path related to partition(p='1') of table test manually. 
"hadoop > fs -rmr /warehouse//test/p=1" > 4)Run spark sql, spark-sql -e "select n from test where p='1';" > Then it throws exception: > {code} > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: > ./test/p=1 > at > org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285) > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at 
scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > {code} > The bug is in spark 1.6.1, if I use spark 1.4.0, It is OK > I think spark-sql should ignore the path, just like hive or it dose in early > versions, rather than throw an exception. -- This message was sent by Atlassian JIRA
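Regarding the catch-and-log patch discussed above: there may be an existing knob for this. Hedged sketch — `spark.sql.hive.verifyPartitionPath` (present since around Spark 1.5, default false) asks Spark to check partition directories against the filesystem and skip missing ones; whether it covers this exact code path in 1.6.1/2.0.0 should be verified before relying on it:

```scala
// Sketch, not a confirmed fix: with a HiveContext in hand (named `sqlContext`
// here, as in the spark-sql shell), enable partition-path verification so
// partitions whose directories were removed from HDFS are skipped instead of
// failing the job with InvalidInputException.
sqlContext.setConf("spark.sql.hive.verifyPartitionPath", "true")
sqlContext.sql("select n from test where p='1'").show()
```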
[jira] [Commented] (SPARK-17039) cannot read null dates from csv file
[ https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419073#comment-15419073 ] Barry Becker commented on SPARK-17039: -- There are literal ?'s in the datafile. The "nullValue" option indicates that those ?'s should be read as null values. I also added the "dateFormat" option which describes how the dates in the file should be read. Let me try to provide more information so you can reproduce. Here is the schema that I am specifiying (dfSchema above): {code} StructType(StructField(string normal,StringType,true), StructField(Years,TimestampType,true), StructField(Months,TimestampType,true), StructField(WeekDays,TimestampType,true), StructField(Days,TimestampType,true), StructField(DaysWithNull,TimestampType,true), StructField(Hours,TimestampType,true), StructField(Minutes,TimestampType,true), StructField(normal dates,TimestampType,true), StructField(Wide Range Dates,TimestampType,true), StructField(Narrow,TimestampType,true), StructField(Far Future,TimestampType,true), StructField(Mostly Null,TimestampType,true), StructField(All Same Date,TimestampType,true), StructField(Past/Future,TimestampType,true), StructField(All nulls,TimestampType,true), StructField(Seconds,TimestampType,true)) {code} and here is the contents of the csv datafile (note that there are lots of nulls). This worked using databricks spark-csv lib as a dependency in spark 1.6.2 {code} foo 2015-03-09T00:00:00 2015-03-09T00:00:00 2015-03-09T00:00:00 2015-03-09T00:00:00 2015-03-09T00:00:00 2015-03-09T00:00:00 2015-03-09T00:01:00 2007-11-09T00:00:00 1967-11-09T00:00:00 2015-03-09T12:00:00 2700-01-01T00:00:00 2015-03-09T00:00:00 2015-03-09T00:00:00 1983-03-09T00:00:00 ? 2015-03-09T12:01:00 bar 2016-03-09T00:00:00 2015-04-09T00:00:00 2015-03-10T00:00:00 2015-03-10T00:00:00 ? 
2015-03-09T01:00:00 2015-03-09T00:03:00 2007-10-02T00:00:00 1987-10-02T00:00:00 2015-03-09T12:03:00 3701-01-01T00:00:00 2015-04-09T00:00:00 2015-03-09T00:00:00 1865-04-09T00:00:00 ? 2015-03-09T12:01:01 baz 2017-03-09T00:00:00 2015-05-09T00:00:00 2015-03-11T00:00:00 2015-03-11T00:00:00 2015-03-11T00:00:00 2015-03-09T02:00:00 2015-03-09T00:05:00 1999-04-04T03:00:00 1999-02-03T00:00:00 2015-03-09T12:08:00 4702-01-01T00:00:00 ? 2015-03-09T00:00:00 1777-05-09T00:00:00 ? 2015-03-09T12:01:03 but 2018-03-09T00:00:00 2015-06-09T00:00:00 2015-03-12T00:00:00 2015-03-12T00:00:00 2015-03-12T00:00:00 2015-03-09T03:00:00 2015-03-09T00:08:00 2025-10-10T00:00:00 2025-10-10T00:00:00 2015-03-09T12:10:00 4103-01-01T00:00:00 2015-06-09T00:00:00 2015-03-09T00:00:00 2089-06-09T00:00:00 ? 2015-03-09T12:01:05 fooo2019-03-09T00:00:00 2015-07-09T00:00:00 2015-03-13T00:00:00 2015-03-13T00:00:00 2015-03-13T00:00:00 2015-03-09T04:00:00 2015-03-09T00:09:00 2004-02-23T00:00:00 2004-02-23T00:00:00 2015-03-09T12:15:00 4204-01-01T00:00:00 ? 2015-03-09T00:00:00 2125-07-09T00:00:00 ? 2015-03-09T12:01:07 bar 2020-03-09T00:00:00 2015-08-09T00:00:00 2015-03-16T00:00:00 2015-03-14T00:00:00 2015-03-14T00:00:00 2015-03-09T05:00:00 2015-03-09T00:12:00 2019-03-04T00:00:00 3019-03-04T00:00:00 2015-03-09T12:20:00 4305-01-01T00:00:00 2015-08-09T00:00:00 2015-03-09T00:00:00 2215-08-09T00:00:00 ? 2015-03-09T12:01:09 baz 2021-03-09T00:00:00 2015-09-09T00:00:00 2015-03-17T00:00:00 2015-03-15T00:00:00 2015-03-15T00:00:00 2015-03-09T06:00:00 2015-03-09T00:20:00 1999-04-04T02:34:00 ? 2015-03-09T12:25:00 4406-01-01T00:00:00 2015-09-09T00:00:00 2015-03-09T00:00:00 1754-09-09T00:00:00 ? 2015-03-09T12:01:11 but 2022-03-09T00:00:00 2015-10-09T00:00:00 2015-03-18T00:00:00 2015-03-16T00:00:00 ? 2015-03-09T07:00:00 2015-03-09T00:30:00 1999-03-01T00:00:00 1909-03-01T00:00:00 2015-03-09T12:30:00 4507-01-01T00:00:00 ? 2015-03-09T00:00:00 1958-10-09T00:00:00 ? 
2015-03-09T12:01:00 bar 2023-03-09T00:00:00 2015-11-09T00:00:00 2015-03-19T00:00:00 2015-03-17T00:00:00 2015-03-17T00:00:00 2015-03-09T08:00:00 2015-03-09T00:35:00 2001-02-12T00:00:00 ? 2015-03-09T12:35:00 4608-01-01T00:00:00 2015-11-09T00:00:00 2015-03-09T00:00:00 3000-11-09T00:00:00 ? 2015-03-09T12:01:00 here is a really really really long string value2024-03-09T00:00:00 2015-12-09T00:00:00 2015-03-20T00:00:00 2015-03-18T00:00:00 2015-03-18T00:00:00 2015-03-09T09:00:00
[jira] [Updated] (SPARK-17039) cannot read null dates from csv file
[ https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barry Becker updated SPARK-17039: - Description: I see this exact same bug as reported in this [stack overflow post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column] using Spark 2.0.0 (released version). In scala, I read a csv using sqlContext.read .format("csv") .option("header", "false") .option("inferSchema", "false") .option("nullValue", "?") .option("dateFormat", "-MM-dd'T'HH:mm:ss") .schema(dfSchema) .csv(dataFile) The data contains some null dates (represented with ?). The error I get is: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 10, localhost): java.text.ParseException: Unparseable date: "?" {code} was: I see this exact same bug as reported in this [stack overflow post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column] using Spark 2.0.0 (released version). In scala, I read a csv using sqlContext.read .format("csv") .option("header", "false") .option("inferSchema", "false") .option("nullValue", "?") .schema(dfSchema) .csv(dataFile) The data contains some null dates (represented with ?). The error I get is: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 10, localhost): java.text.ParseException: Unparseable date: "?" 
{code} > cannot read null dates from csv file > > > Key: SPARK-17039 > URL: https://issues.apache.org/jira/browse/SPARK-17039 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Barry Becker > > I see this exact same bug as reported in this [stack overflow > post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column] > using Spark 2.0.0 (released version). > In scala, I read a csv using > sqlContext.read > .format("csv") > .option("header", "false") > .option("inferSchema", "false") > .option("nullValue", "?") > .option("dateFormat", "-MM-dd'T'HH:mm:ss") > .schema(dfSchema) > .csv(dataFile) > The data contains some null dates (represented with ?). > The error I get is: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 > (TID 10, localhost): java.text.ParseException: Unparseable date: "?" > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17039) cannot read null dates from csv file
[ https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419062#comment-15419062 ] Sean Owen commented on SPARK-17039: --- Oh right looked right past that. But if a date is null, converted to "?" per config, that wouldn't be a valid date right? > cannot read null dates from csv file > > > Key: SPARK-17039 > URL: https://issues.apache.org/jira/browse/SPARK-17039 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Barry Becker > > I see this exact same bug as reported in this [stack overflow > post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column] > using Spark 2.0.0 (released version). > In scala, I read a csv using > sqlContext.read > .format("csv") > .option("header", "false") > .option("inferSchema", "false") > .option("nullValue", "?") > .schema(dfSchema) > .csv(dataFile) > The data contains some null dates (represented with ?). > The error I get is: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 > (TID 10, localhost): java.text.ParseException: Unparseable date: "?" > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17039) cannot read null dates from csv file
[ https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419040#comment-15419040 ] Barry Becker commented on SPARK-17039: -- I do specify a schema (.schema(dfSchema)), and it says that the column is a date column. I left it out because there were lots of other columns, and I need to spend some time to simplify the example. This is from a unit test that worked fine using spark 1.6.2, but fails using spark 2.0.0. I'm pretty sure it's a real bug. The example in the stack overflow post may provide a better reproducible case. > cannot read null dates from csv file > > > Key: SPARK-17039 > URL: https://issues.apache.org/jira/browse/SPARK-17039 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Barry Becker > > I see this exact same bug as reported in this [stack overflow > post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column] > using Spark 2.0.0 (released version). > In scala, I read a csv using > sqlContext.read > .format("csv") > .option("header", "false") > .option("inferSchema", "false") > .option("nullValue", "?") > .schema(dfSchema) > .csv(dataFile) > The data contains some null dates (represented with ?). > The error I get is: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 > (TID 10, localhost): java.text.ParseException: Unparseable date: "?" > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17039) cannot read null dates from csv file
[ https://issues.apache.org/jira/browse/SPARK-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419017#comment-15419017 ] Sean Owen commented on SPARK-17039: --- Hm, how are they being parsed as dates -- or is that the issue? you don't infer or specify a schema but say the col is indeed a date column. If it's a date column, "?" is not valid, and the error is correct. > cannot read null dates from csv file > > > Key: SPARK-17039 > URL: https://issues.apache.org/jira/browse/SPARK-17039 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Barry Becker > > I see this exact same bug as reported in this [stack overflow > post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column] > using Spark 2.0.0 (released version). > In scala, I read a csv using > sqlContext.read > .format("csv") > .option("header", "false") > .option("inferSchema", "false") > .option("nullValue", "?") > .schema(dfSchema) > .csv(dataFile) > The data contains some null dates (represented with ?). > The error I get is: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 > (TID 10, localhost): java.text.ParseException: Unparseable date: "?" > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
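The expectation in this thread — that the configured nullValue token is checked before date parsing is attempted — can be sketched outside Spark with a hypothetical helper. This is not Spark's actual CSV code path, and the date pattern assumes a `yyyy` year field (the report's pattern appears truncated):

```scala
import java.text.SimpleDateFormat
import java.sql.Timestamp

// Hypothetical helper illustrating the expected semantics: compare the raw
// token against the configured nullValue *before* attempting to parse it,
// so "?" yields null instead of a java.text.ParseException.
def parseTimestamp(token: String, nullValue: String, pattern: String): Option[Timestamp] = {
  if (token == nullValue) None
  else Some(new Timestamp(new SimpleDateFormat(pattern).parse(token).getTime))
}

val fmt = "yyyy-MM-dd'T'HH:mm:ss"
parseTimestamp("?", "?", fmt)                   // None, no exception thrown
parseTimestamp("2015-03-09T00:00:00", "?", fmt) // Some(2015-03-09 00:00:00.0)
```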
[jira] [Resolved] (SPARK-17040) cannot read null dates from csv file
[ https://issues.apache.org/jira/browse/SPARK-17040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17040. --- Resolution: Duplicate
[jira] [Created] (SPARK-17040) cannot read null dates from csv file
Barry Becker created SPARK-17040: Summary: cannot read null dates from csv file Key: SPARK-17040 URL: https://issues.apache.org/jira/browse/SPARK-17040 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.0.0 Reporter: Barry Becker
[jira] [Created] (SPARK-17039) cannot read null dates from csv file
Barry Becker created SPARK-17039: Summary: cannot read null dates from csv file Key: SPARK-17039 URL: https://issues.apache.org/jira/browse/SPARK-17039 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.0.0 Reporter: Barry Becker I see this exact same bug as reported in this [stack overflow post|http://stackoverflow.com/questions/38265640/spark-2-0-pre-csv-parsing-error-if-missing-values-in-date-column] using Spark 2.0.0 (released version). In Scala, I read a CSV using sqlContext.read .format("csv") .option("header", "false") .option("inferSchema", "false") .option("nullValue", "?") .schema(dfSchema) .csv(dataFile) The data contains some null dates (represented with ?). The error I get is: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 10, localhost): java.text.ParseException: Unparseable date: "?" {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17038) StreamingSource reports metrics for lastCompletedBatch instead of lastReceivedBatch
Oz Ben-Ami created SPARK-17038: -- Summary: StreamingSource reports metrics for lastCompletedBatch instead of lastReceivedBatch Key: SPARK-17038 URL: https://issues.apache.org/jira/browse/SPARK-17038 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 2.0.0, 1.6.2 Reporter: Oz Ben-Ami Priority: Minor StreamingSource's lastReceivedBatch_submissionTime, lastReceivedBatch_processingTimeStart, and lastReceivedBatch_processingTimeEnd all use data from lastCompletedBatch instead of lastReceivedBatch. In particular, this makes it impossible to match lastReceivedBatch_records with a batchID/submission time. This is apparent when looking at StreamingSource.scala, lines 89-94. I would guess that just replacing Completed->Received in those lines would fix the issue, but I haven't tested it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually
[ https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418994#comment-15418994 ] saurabh paliwal commented on SPARK-15044: - Hi! So sorry, I mixed it up. The use case should be more like this: I created some partition p1 in Hive, and after a long time the data was deleted from HDFS, but the query range still contains p1. Do you suggest clearing the Hive metastore too?
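When many partition directories have been removed, the metastore cleanup can at least be scripted instead of running ALTER TABLE ... DROP PARTITION by hand for each one. The sketch below is hypothetical: the table name, warehouse layout, and the `exists` stub are all assumptions. On a real cluster, `exists` would wrap `hadoop fs -test -e` or an HDFS client, and `partitions` would come from the output of `SHOW PARTITIONS`.

```python
# Hypothetical sketch: given partition specs and a path-existence check, emit
# one ALTER TABLE ... DROP PARTITION statement per partition whose directory
# has been deleted, so the statements can be piped back into hive/spark-sql.
def drop_statements(table, partitions, exists):
    stmts = []
    for spec in partitions:          # e.g. "p=1"
        key, value = spec.split("=", 1)
        if not exists(f"/warehouse/{table}/{spec}"):
            stmts.append(
                f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({key}='{value}');"
            )
    return stmts

# Stubbed example: only the p=2 directory still exists on the filesystem.
surviving = {"/warehouse/test/p=2"}
stmts = drop_statements("test", ["p=1", "p=2"], lambda path: path in surviving)
# stmts -> ["ALTER TABLE test DROP IF EXISTS PARTITION (p='1');"]
```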
[jira] [Updated] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals
[ https://issues.apache.org/jira/browse/SPARK-16955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16955: - Assignee: Peter Lee > Using ordinals in ORDER BY causes an analysis error when the query has a > GROUP BY clause using ordinals > --- > > Key: SPARK-16955 > URL: https://issues.apache.org/jira/browse/SPARK-16955 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Yin Huai >Assignee: Peter Lee > Fix For: 2.0.1, 2.1.0 > > > The following queries work > {code} > select a from (select 1 as a) tmp order by 1 > select a, count(*) from (select 1 as a) tmp group by 1 > select a, count(*) from (select 1 as a) tmp group by 1 order by a > {code} > However, the following query does not > {code} > select a, count(*) from (select 1 as a) tmp group by 1 order by 1 > {code} > {code} > org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > Group by position: '1' exceeds the size of the select list '0'. on unresolved > object, tree: > Aggregate [1] > +- SubqueryAlias tmp >+- Project [1 AS a#82] > +- OneRowRelation$ > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:749) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:739) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:739) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:715) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:715) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:714) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1237) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1182) > at > 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1182) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1181) > at >
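The queries in this report rely on the common SQL convention that a bare integer in ORDER BY names a select-list position, which Spark 2.0 extended to GROUP BY as well. A self-contained illustration of that convention (SQLite is used here only because it ships with Python; this does not reproduce the Spark analyzer bug):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tmp (a INTEGER)")
conn.executemany("INSERT INTO tmp VALUES (?)", [(2,), (1,), (1,)])

# ORDER BY 1 refers to the first select-list column (a), not the constant 1,
# so both spellings sort the grouped rows the same way.
by_name = conn.execute(
    "SELECT a, COUNT(*) FROM tmp GROUP BY a ORDER BY a").fetchall()
by_ordinal = conn.execute(
    "SELECT a, COUNT(*) FROM tmp GROUP BY a ORDER BY 1").fetchall()
# by_name == by_ordinal == [(1, 2), (2, 1)]
```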
[jira] [Issue Comment Deleted] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually
[ https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] saurabh paliwal updated SPARK-15044: Comment: was deleted (was: Hi! I am so sorry, I mixed it up. So the premise is towards querying for a data which is not there in hdfs and has no partition created either in metastore. I think throwing exception when your range of partitions in query consists of such non-existent "partitions" is not acceptable in some cases. For ex, I queried a table for p=1 to p=100 but the table was created and started making partitions from p=10. now the query should give me the results of 10-100, no?)
[jira] [Comment Edited] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually
[ https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418935#comment-15418935 ] saurabh paliwal edited comment on SPARK-15044 at 8/12/16 2:45 PM: -- Hi! I am so sorry, I mixed it up. So the premise is towards querying for a data which is not there in hdfs and has no partition created either in metastore. I think throwing exception when your range of partitions in query consists of such non-existent "partitions" is not acceptable in some cases. For ex, I queried a table for p=1 to p=100 but the table was created and started making partitions from p=10. now the query should give me the results of 10-100, no? was (Author: saurai3h): Hi! I am so sorry, I mixed it up. So the premise is towards querying for a data which is not there in hdfs and has no partition created either in metastore. I think throwing exception when your range of partitions in query consists of such non-existent "partitions" is not acceptable in some cases. For ex, I queried a table for p=1 to p=100 but the table was created and started making partitions from p=10. now the query for p = 5 to p = 50 should give me the results of 10-50, no?
[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually
[ https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418935#comment-15418935 ] saurabh paliwal commented on SPARK-15044: - Hi! I am so sorry, I mixed it up. So the premise is towards querying for a data which is not there in hdfs and has no partition created either in metastore. I think throwing exception when your range of partitions in query consists of such non-existent "partitions" is not acceptable in some cases. For ex, I queried a table for p=1 to p=100 but the table was created and started making partitions from p=10. now the query for p = 5 to p = 50 should give me the results of 10-50, no?
[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually
[ https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418876#comment-15418876 ] Sean Owen commented on SPARK-15044: --- The partition would still exist in this case, no?
[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually
[ https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418872#comment-15418872 ] saurabh paliwal commented on SPARK-15044:
---
Hi! I agree with Artur. Assume a sparse table with hourly partitions. If in some hour x there was no data even though the partition exists (for example, because all partitions were pre-created rather than added via "add partition" or "msck repair table"), and you query a range of partitions that includes x, you would be better off with an empty result than with a query failure.
[jira] [Resolved] (SPARK-17037) distinct() operator fails on Dataframe with column names containing periods
[ https://issues.apache.org/jira/browse/SPARK-17037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17037.
---
Resolution: Duplicate

> distinct() operator fails on Dataframe with column names containing periods
> ---
>
> Key: SPARK-17037
> URL: https://issues.apache.org/jira/browse/SPARK-17037
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.0
> Reporter: Michael Styles
>
> Using the distinct() operator on a Dataframe with column names containing periods results in an AnalysisException. For example:
> {noformat}
> d = [{'pageview.count': 100, 'exit_page': 'example.com/landing'}]
> df = sqlContext.createDataFrame(d)
> df.distinct()
> {noformat}
> results in the following error:
> pyspark.sql.utils.AnalysisException: u'Cannot resolve column name "pageview.count" among (exit_page, pageview.count);'
[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually
[ https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418862#comment-15418862 ] Sean Owen commented on SPARK-15044:
---
That's the problem. If the semantics were "query whatever happens to exist on HDFS", then this would be the right behavior. But the operation that created this (immutable) result set said there were, say, 10 partitions, and so does the metastore. Is it correct to treat any result that can't query all 10 partitions as "correct"? Granted, you show that Hive happily does so; I think that's probably bad, though. Or: what's the use case for this behavior?
[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually
[ https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418841#comment-15418841 ] Artur commented on SPARK-15044:
---
Why would the result be missing something? If the data doesn't exist in HDFS, it should not show up in the result (even though the partition still exists in the metadata).
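The behavior the commenters are asking for amounts to a pre-filter over the metastore's partition locations: log and skip paths that no longer exist instead of failing on the first missing one. A minimal illustration in plain Python with a stubbed existence check (the function name and warning format are inventions for this sketch, not Spark's actual code path; if memory serves, Spark has a `spark.sql.hive.verifyPartitionPath` setting in this area, but treat that name as an assumption to check against your version's documentation):

```python
def partition_paths_to_read(paths, exists, log=print):
    """Return only the partition paths that still exist, logging a warning
    for each missing one instead of raising an exception."""
    present = []
    for p in paths:
        if exists(p):
            present.append(p)
        else:
            log(f"Warning: input path does not exist, ignoring: {p}")
    return present

paths = ["/warehouse/test/p=1", "/warehouse/test/p=2"]
# Pretend p=1 was deleted out from under the metastore:
print(partition_paths_to_read(paths, exists=lambda p: p != "/warehouse/test/p=1"))
```

The design trade-off is exactly the one debated above: this makes a partially-deleted table queryable, at the cost of silently returning a subset of the data the metastore claims exists.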
[jira] [Commented] (SPARK-17036) Hadoop config caching could lead to memory pressure and high CPU usage in thrift server
[ https://issues.apache.org/jira/browse/SPARK-17036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418828#comment-15418828 ] Sean Owen commented on SPARK-17036:
---
[~rajesh.balamohan] please summarize the issue here.

> Hadoop config caching could lead to memory pressure and high CPU usage in thrift server
> ---
>
> Key: SPARK-17036
> URL: https://issues.apache.org/jira/browse/SPARK-17036
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Rajesh Balamohan
> Priority: Minor
>
> Creating this as a follow-up JIRA to SPARK-12920. Profiler output on the caching is attached in SPARK-12920.
[jira] [Created] (SPARK-17037) distinct() operator fails on Dataframe with column names containing periods
Michael Styles created SPARK-17037:
---
Summary: distinct() operator fails on Dataframe with column names containing periods
Key: SPARK-17037
URL: https://issues.apache.org/jira/browse/SPARK-17037
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.0.0
Reporter: Michael Styles

Using the distinct() operator on a Dataframe with column names containing periods results in an AnalysisException. For example:
{noformat}
d = [{'pageview.count': 100, 'exit_page': 'example.com/landing'}]
df = sqlContext.createDataFrame(d)
df.distinct()
{noformat}
results in the following error:
pyspark.sql.utils.AnalysisException: u'Cannot resolve column name "pageview.count" among (exit_page, pageview.count);'
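The error arises because name resolution can treat "." as a struct-field accessor, so a top-level column literally named pageview.count may never match. The toy resolver below, in plain Python, illustrates that failure mode; it is an illustration only, not Spark's actual analyzer. (In Spark SQL, backtick-quoting the name, e.g. `pageview.count` in backticks, or renaming the column is the usual way to disambiguate, though whether that helps distinct() specifically depends on the fix for this ticket's duplicate.)

```python
def resolve(name, columns):
    """Toy column resolution that, like an analyzer treating '.' as
    struct-field access, fails to find a literal dotted column name."""
    head, dot, field = name.partition(".")
    if not dot and name in columns:
        return name                  # simple name, direct hit
    if head in columns:
        return (head, field)         # resolves as struct-field access
    raise KeyError(
        f'Cannot resolve column name "{name}" among ({", ".join(columns)})')

cols = ["exit_page", "pageview.count"]
# "pageview.count" is split into ("pageview", "count"); neither matches,
# reproducing the shape of the AnalysisException above:
try:
    resolve("pageview.count", cols)
except KeyError as e:
    print(e)
```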
[jira] [Commented] (SPARK-12920) Honor "spark.ui.retainedStages" to reduce mem-pressure
[ https://issues.apache.org/jira/browse/SPARK-12920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418827#comment-15418827 ] Rajesh Balamohan commented on SPARK-12920:
---
Thanks [~vanzin]. I have created SPARK-17036 for the caching issue.

> Honor "spark.ui.retainedStages" to reduce mem-pressure
> ---
>
> Key: SPARK-12920
> URL: https://issues.apache.org/jira/browse/SPARK-12920
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Fix For: 2.1.0
>
> Attachments: SPARK-12920.profiler.png, SPARK-12920.profiler_job_progress_listner.png
>
> - Configured with the fair-share scheduler.
> - 4-5 users submitting/running jobs concurrently via spark-thrift-server
> - Spark thrift server spikes to 1600+% CPU and stays there for a long time
[jira] [Created] (SPARK-17036) Hadoop config caching could lead to memory pressure and high CPU usage in thrift server
Rajesh Balamohan created SPARK-17036:
---
Summary: Hadoop config caching could lead to memory pressure and high CPU usage in thrift server
Key: SPARK-17036
URL: https://issues.apache.org/jira/browse/SPARK-17036
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Rajesh Balamohan
Priority: Minor

Creating this as a follow-up JIRA to SPARK-12920. Profiler output on the caching is attached in SPARK-12920.
[jira] [Resolved] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals
[ https://issues.apache.org/jira/browse/SPARK-16955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-16955. - Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 > Using ordinals in ORDER BY causes an analysis error when the query has a > GROUP BY clause using ordinals > --- > > Key: SPARK-16955 > URL: https://issues.apache.org/jira/browse/SPARK-16955 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Yin Huai > Fix For: 2.0.1, 2.1.0 > > > The following queries work > {code} > select a from (select 1 as a) tmp order by 1 > select a, count(*) from (select 1 as a) tmp group by 1 > select a, count(*) from (select 1 as a) tmp group by 1 order by a > {code} > However, the following query does not > {code} > select a, count(*) from (select 1 as a) tmp group by 1 order by 1 > {code} > {code} > org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > Group by position: '1' exceeds the size of the select list '0'. on unresolved > object, tree: > Aggregate [1] > +- SubqueryAlias tmp >+- Project [1 AS a#82] > +- OneRowRelation$ > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:749) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$34.apply(Analyzer.scala:739) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:739) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11.applyOrElse(Analyzer.scala:715) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:715) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$.apply(Analyzer.scala:714) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1237) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$$anonfun$apply$20.applyOrElse(Analyzer.scala:1182) > at > 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1182) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions$.apply(Analyzer.scala:1181) > at >
[jira] [Commented] (SPARK-16955) Using ordinals in ORDER BY causes an analysis error when the query has a GROUP BY clause using ordinals
[ https://issues.apache.org/jira/browse/SPARK-16955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418817#comment-15418817 ] Wenchen Fan commented on SPARK-16955:
---
This bug was already fixed, by accident, by https://github.com/apache/spark/pull/14595.
[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually
[ https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418788#comment-15418788 ] Sean Owen commented on SPARK-15044:
---
That seems like worse behavior, because it silently "succeeds" when the result is missing part of the result set.
[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually
[ https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418784#comment-15418784 ] Artur commented on SPARK-15044: --- I know that you should not do this. But if someone does - spark will not work until the metadata is fixed. However hive does it without error: hive> select * from test; OK Time taken: 0.035 seconds In this case after: hive> alter table test drop partition (p=1); spark-sql> select * from test; works without problems. > spark-sql will throw "input path does not exist" exception if it handles a > partition which exists in hive table, but the path is removed manually > - > > Key: SPARK-15044 > URL: https://issues.apache.org/jira/browse/SPARK-15044 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: huangyu > > spark-sql will throw "input path not exist" exception if it handles a > partition which exists in hive table, but the path is removed manually.The > situation is as follows: > 1) Create a table "test". "create table test (n string) partitioned by (p > string)" > 2) Load some data into partition(p='1') > 3)Remove the path related to partition(p='1') of table test manually. 
"hadoop > fs -rmr /warehouse//test/p=1" > 4)Run spark sql, spark-sql -e "select n from test where p='1';" > Then it throws exception: > {code} > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: > ./test/p=1 > at > org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285) > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at 
scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > {code} > The bug is in Spark 1.6.1; with Spark 1.4.0 it is OK. > I think spark-sql should ignore the path, just like hive does and as it did in earlier > versions, rather than throw an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
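When many partitions are stale, Artur's workaround (dropping the partition in Hive) can be scripted rather than typed by hand. A minimal sketch in plain Python, assuming you have already listed the partition specs and their paths from the metastore and the paths that still exist on the filesystem — the table name, specs, and paths below are made up for illustration:

```python
# Sketch: emit "ALTER TABLE ... DROP PARTITION" DDL for partitions whose
# warehouse paths no longer exist. Pure Python; the dicts stand in for what
# you would actually read from the Hive metastore and from HDFS.

def missing_partition_ddl(table, partitions, existing_paths):
    """partitions maps a partition spec (e.g. "p='1'") to its warehouse path."""
    stale = {spec: path for spec, path in partitions.items()
             if path not in existing_paths}
    return ["ALTER TABLE {} DROP IF EXISTS PARTITION ({});".format(table, spec)
            for spec in sorted(stale)]

partitions = {"p='1'": "/warehouse/test/p=1", "p='2'": "/warehouse/test/p=2"}
existing = {"/warehouse/test/p=2"}  # p=1 was deleted manually
print(missing_partition_ddl("test", partitions, existing))
# ["ALTER TABLE test DROP IF EXISTS PARTITION (p='1');"]
```

The generated statements can then be fed to the Hive CLI or beeline in one batch instead of running one ALTER TABLE per missing partition.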
[jira] [Commented] (SPARK-16917) Spark streaming kafka version compatibility.
[ https://issues.apache.org/jira/browse/SPARK-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418747#comment-15418747 ] Cody Koeninger commented on SPARK-16917: It sounds to me like the documentation is clear, because you have interpreted things correctly. As it says, the 0.10 version works with broker version 0.10 and higher. The 0.8 version works with broker version 0.8 and higher. There is no version specifically for 0.9, nor do I expect there ever will be. > Spark streaming kafka version compatibility. > - > > Key: SPARK-16917 > URL: https://issues.apache.org/jira/browse/SPARK-16917 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.0.0 >Reporter: Sudev >Priority: Trivial > Labels: documentation > > It would be nice to have Kafka version compatibility information in the > official documentation. > It's very confusing now. > * If you look at this JIRA[1], it seems like Kafka is supported in Spark > 2.0.0. > * The documentation lists the artifact for (Kafka 0.8) > spark-streaming-kafka-0-8_2.11 > Is Kafka 0.9 supported by Spark 2.0.0? > Since I'm still confused after an hour's effort googling this, I > think someone should help add the compatibility matrix. > [1] https://issues.apache.org/jira/browse/SPARK-12177 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually
[ https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418746#comment-15418746 ] Sean Owen commented on SPARK-15044: --- Is it really an error? the files were manually deleted and unsurprisingly this causes an exception. You shouldn't delete the files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually
[ https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418735#comment-15418735 ] Artur commented on SPARK-15044: --- Tested on spark 2.0.X (master branch) (latest commit: 685b08e2611b69f8db60a00c0c94aecd315e2a3e Wed Aug 3 13:15:13 2016) : hive>create table test(n string) partitioned by (p string); spark-sql> insert into table test PARTITION(p=1) VALUES (1); hadoop fs -rmr /user/hive/warehouse/test/p=1 spark-sql> select * from test; {code:java} 16/08/12 20:48:45 ERROR SparkSQLDriver: Failed in [select * from test] org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: ./test/p=1 at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:289) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:317) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:246) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:246) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:246) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:82) at 
org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:82) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:82) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:246) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:246) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:246) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1908) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:899) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:358) at org.apache.spark.rdd.RDD.collect(RDD.scala:898) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:290) at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:310) at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$3.apply(QueryExecution.scala:131) at org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$3.apply(QueryExecution.scala:130) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) at org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:130) at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:331) at
[jira] [Updated] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually
[ https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artur updated SPARK-15044: -- Affects Version/s: 2.0.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17035) Conversion of datetime.max to microseconds produces incorrect value
[ https://issues.apache.org/jira/browse/SPARK-17035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Styles updated SPARK-17035: --- Description: Conversion of datetime.max to microseconds produces incorrect value. For example, {noformat} from datetime import datetime from pyspark.sql import Row from pyspark.sql.types import StructType, StructField, TimestampType schema = StructType([StructField("dt", TimestampType(), False)]) data = [{"dt": datetime.max}] # convert python objects to sql data sql_data = [schema.toInternal(row) for row in data] # Value is wrong. sql_data [(2.534023188e+17,)] {noformat} This value should be [(2534023187,)]. was: Conversion of datetime.max to microseconds produces incorrect value. For example, from datetime import datetime from pyspark.sql import Row from pyspark.sql.types import StructType, StructField, TimestampType schema = StructType([StructField("dt", TimestampType(), False)]) data = [{"dt": datetime.max}] # convert python objects to sql data sql_data = [schema.toInternal(row) for row in data] # Value is wrong. sql_data [(2.534023188e+17,)] This value should be [(2534023187,)]. > Conversion of datetime.max to microseconds produces incorrect value > --- > > Key: SPARK-17035 > URL: https://issues.apache.org/jira/browse/SPARK-17035 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Michael Styles >Priority: Minor > > Conversion of datetime.max to microseconds produces incorrect value. For > example, > {noformat} > from datetime import datetime > from pyspark.sql import Row > from pyspark.sql.types import StructType, StructField, TimestampType > schema = StructType([StructField("dt", TimestampType(), False)]) > data = [{"dt": datetime.max}] > # convert python objects to sql data > sql_data = [schema.toInternal(row) for row in data] > # Value is wrong. > sql_data > [(2.534023188e+17,)] > {noformat} > This value should be [(2534023187,)]. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
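The bad value has the signature of float arithmetic: converting to seconds as a float and multiplying by 1e6 rounds away the low digits, because the result exceeds 2**53, the largest range in which a double represents every integer exactly. A standalone sketch of the failing arithmetic in plain Python — no Spark involved, and the exact digits differ slightly from the report:

```python
from datetime import datetime, timezone

epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
delta = datetime.max.replace(tzinfo=timezone.utc) - epoch

# Exact integer microseconds since the epoch for datetime.max:
exact = (delta.days * 86400 + delta.seconds) * 10**6 + delta.microseconds

# Routing through float seconds, as a `seconds * 1e6` conversion does,
# loses the low-order digits: the value is far above 2**53.
approx = delta.total_seconds() * 1e6

print(exact)                 # 253402300799999999
print(int(approx) == exact)  # False -- the float rounded
```

Storing the value as an integer microsecond count end-to-end, as the fix does, avoids the rounding entirely.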
[jira] [Updated] (SPARK-17035) Conversion of datetime.max to microseconds produces incorrect value
[ https://issues.apache.org/jira/browse/SPARK-17035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Styles updated SPARK-17035: --- Priority: Minor (was: Major) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17035) Conversion of datetime.max to microseconds produces incorrect value
Michael Styles created SPARK-17035: -- Summary: Conversion of datetime.max to microseconds produces incorrect value Key: SPARK-17035 URL: https://issues.apache.org/jira/browse/SPARK-17035 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.0 Reporter: Michael Styles Conversion of datetime.max to microseconds produces incorrect value. For example, from datetime import datetime from pyspark.sql import Row from pyspark.sql.types import StructType, StructField, TimestampType schema = StructType([StructField("dt", TimestampType(), False)]) data = [{"dt": datetime.max}] # convert python objects to sql data sql_data = [schema.toInternal(row) for row in data] # Value is wrong. sql_data [(2.534023188e+17,)] This value should be [(2534023187,)]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17032) Add test cases for methods in ParserUtils
[ https://issues.apache.org/jira/browse/SPARK-17032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17032: Assignee: Apache Spark > Add test cases for methods in ParserUtils > - > > Key: SPARK-17032 > URL: https://issues.apache.org/jira/browse/SPARK-17032 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jiang Xingbo >Assignee: Apache Spark >Priority: Minor > > Currently, methods in `ParserUtils` are tested only indirectly; we should add test > cases in `ParserUtilsSuite` to verify them directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17032) Add test cases for methods in ParserUtils
[ https://issues.apache.org/jira/browse/SPARK-17032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418677#comment-15418677 ] Apache Spark commented on SPARK-17032: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/14620 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17032) Add test cases for methods in ParserUtils
[ https://issues.apache.org/jira/browse/SPARK-17032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17032: Assignee: (was: Apache Spark) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17034) Ordinal in ORDER BY or GROUP BY should be treated as an unresolved expression
[ https://issues.apache.org/jira/browse/SPARK-17034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17034: Assignee: (was: Apache Spark) > Ordinal in ORDER BY or GROUP BY should be treated as an unresolved expression > - > > Key: SPARK-17034 > URL: https://issues.apache.org/jira/browse/SPARK-17034 > Project: Spark > Issue Type: Bug >Reporter: Sean Zhong > > Ordinals in GROUP BY or ORDER BY, like "1" in "order by 1" or "group by 1", > should be considered unresolved before analysis, but the current code > uses a "Literal" expression to store the ordinal. This is inappropriate: as > "Literal" itself is a resolved expression, it gives the user the wrong message > that the ordinal has already been resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17034) Ordinal in ORDER BY or GROUP BY should be treated as an unresolved expression
[ https://issues.apache.org/jira/browse/SPARK-17034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418623#comment-15418623 ] Apache Spark commented on SPARK-17034: -- User 'clockfly' has created a pull request for this issue: https://github.com/apache/spark/pull/14616 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17034) Ordinal in ORDER BY or GROUP BY should be treated as an unresolved expression
[ https://issues.apache.org/jira/browse/SPARK-17034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17034: Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17034) Ordinal in ORDER BY or GROUP BY should be treated as an unresolved expression
Sean Zhong created SPARK-17034: -- Summary: Ordinal in ORDER BY or GROUP BY should be treated as an unresolved expression Key: SPARK-17034 URL: https://issues.apache.org/jira/browse/SPARK-17034 Project: Spark Issue Type: Bug Reporter: Sean Zhong Ordinals in GROUP BY or ORDER BY, like "1" in "order by 1" or "group by 1", should be considered unresolved before analysis, but the current code uses a "Literal" expression to store the ordinal. This is inappropriate: as "Literal" itself is a resolved expression, it gives the user the wrong message that the ordinal has already been resolved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
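The semantic distinction at stake — an ordinal in ORDER BY names an output column and is not the literal constant 1 — can be seen with sqlite3 from the Python standard library (used here only to illustrate the SQL semantics, not Spark's analyzer):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INT, b INT)")
con.executemany("INSERT INTO t VALUES (?, ?)", [(2, 10), (1, 20)])

# "ORDER BY 1" resolves the ordinal to the first output column (a) and sorts by it:
by_ordinal = con.execute("SELECT a, b FROM t ORDER BY 1").fetchall()
print(by_ordinal)  # [(1, 20), (2, 10)]

# "ORDER BY 1+0" is a genuine constant expression: every row compares equal,
# so it imposes no ordering. That gap between "the constant 1" and "column #1"
# is the resolution work that storing a bare Literal hides from the analyzer.
by_constant = con.execute("SELECT a, b FROM t ORDER BY 1+0").fetchall()
print(sorted(by_constant) == sorted(by_ordinal))  # True: same rows either way
```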
[jira] [Commented] (SPARK-15882) Discuss distributed linear algebra in spark.ml package
[ https://issues.apache.org/jira/browse/SPARK-15882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418585#comment-15418585 ] Jeff Zhang commented on SPARK-15882: I think it is better to keep the RDD API underneath, as I don't see much benefit from Dataset here. Although the linear algebra API is public, most of the time it is used by spark.ml internally. Regarding the interface between linear algebra and spark.ml Transformers/Estimators, I would suggest using Dataset (e.g. PCA.fit), because the user-facing interface is mostly on Transformer/Estimator and we should keep those on Dataset. > Discuss distributed linear algebra in spark.ml package > -- > > Key: SPARK-15882 > URL: https://issues.apache.org/jira/browse/SPARK-15882 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Joseph K. Bradley > > This JIRA is for discussing how org.apache.spark.mllib.linalg.distributed.* > should be migrated to org.apache.spark.ml. > Initial questions: > * Should we use Datasets or RDDs underneath? > * If Datasets, are there missing features needed for the migration? > * Do we want to redesign any aspects of the distributed matrices during this > move? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing
[ https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418526#comment-15418526 ] 胡振宇 commented on SPARK-14850: - I tried to run your code on Spark 1.6.1, but I found that "toDF" cannot be used in this example. Here is my code: object Example { def main(args: Array[String]) { case class Test(num: Int, vector: Vector) val conf = new SparkConf().setAppName("Example") val sqlContext = new SQLContext(sc) import sqlContext.implicits._ val temp = sqlContext.sparkContext.parallelize(0 until 1e4.toInt, 1).map(i => Test(i, Vectors.dense(Array.fill(1e6.toInt)(1.0.toDF() //at this step toDF can be used I do } } sc.parallelize(0 until 1e4.toInt, 1).map { i => (i, Vectors.dense(Array.fill(1e6.toInt)(1.0))) }.toDF.rdd.count() I even used SparkContext, but toDF could not be used either. Do you have a solution to run the example on Spark 1.6.1? Thank you > VectorUDT/MatrixUDT should take primitive arrays without boxing > --- > > Key: SPARK-14850 > URL: https://issues.apache.org/jira/browse/SPARK-14850 > Project: Spark > Issue Type: Improvement > Components: ML, SQL >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Xiangrui Meng >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.0.0 > > > In SPARK-9390, we switched to use GenericArrayData to store indices and > values in vector/matrix UDTs. However, GenericArrayData is not specialized > for primitive types. This might hurt MLlib performance badly. We should > consider either specializing GenericArrayData or using a different container. > cc: [~cloud_fan] [~yhuai] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
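The cost the ticket describes — a generic container holding boxed values instead of a primitive double[] — has a close analogue in plain Python, where a list stores pointers to individually boxed float objects while array.array('d') stores raw 8-byte doubles contiguously. A sketch of that overhead on CPython (an analogy only, not Spark's actual memory layout; exact byte counts vary by interpreter build):

```python
import array
import sys

n = 10_000
values = [float(i) for i in range(n)]

# Boxed: the list stores n pointers, and each float is its own heap object
# (about 24 bytes apiece on 64-bit CPython).
boxed_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# Primitive: array('d') stores the raw doubles back to back.
primitive_bytes = sys.getsizeof(array.array('d', values))

print(boxed_bytes // n, primitive_bytes // n)  # roughly 32+ vs ~8 bytes per element
```

The same factor-of-several gap (plus the allocation and cache-miss cost of chasing a pointer per element) is why specializing the container for primitives matters in hot MLlib code.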
[jira] [Resolved] (SPARK-8717) Update mllib-data-types docs to include missing "matrix" Python examples
[ https://issues.apache.org/jira/browse/SPARK-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8717. -- Resolution: Duplicate This was actually a duplicate of an issue already fixed. Look at the docs in master. > Update mllib-data-types docs to include missing "matrix" Python examples > > > Key: SPARK-8717 > URL: https://issues.apache.org/jira/browse/SPARK-8717 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib, PySpark >Reporter: Rosstin Murphy >Priority: Minor > > Currently, the documentation for MLLib Data Types (docs/mllib-data-types.md > in the repo, https://spark.apache.org/docs/latest/mllib-data-types.html in > the latest online docs) stops listing Python examples at "Labeled point". > "Local vector" and "Labeled point" have Python examples, however none of the > "matrix" entries have Python examples. > The "matrix" entries could be updated to include python examples. > I'm not 100% sure that all the matrices currently have implemented Python > equivalents, but I'm pretty sure that at least the first one ("Local matrix") > could have an entry. > from pyspark.mllib.linalg import DenseMatrix > dm = DenseMatrix(3, 2, [1.0, 3.0, 5.0, 2.0, 4.0, 6.0]) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
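The proposed DenseMatrix(3, 2, [1.0, 3.0, 5.0, 2.0, 4.0, 6.0]) example relies on MLlib's local matrices storing values in column-major order by default. A plain-Python sketch of that layout, with no PySpark required, shows which 3x2 matrix those six values describe:

```python
def from_column_major(num_rows, num_cols, values):
    """Rebuild row lists from a flat column-major array (MLlib's default layout)."""
    return [[values[c * num_rows + r] for c in range(num_cols)]
            for r in range(num_rows)]

# The 3x2 matrix from the proposed doc example: columns are [1,3,5] and [2,4,6].
rows = from_column_major(3, 2, [1.0, 3.0, 5.0, 2.0, 4.0, 6.0])
print(rows)  # [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
```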
[jira] [Resolved] (SPARK-16598) Added a test case for verifying the table identifier parsing
[ https://issues.apache.org/jira/browse/SPARK-16598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16598. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14244 [https://github.com/apache/spark/pull/14244] > Added a test case for verifying the table identifier parsing > > > Key: SPARK-16598 > URL: https://issues.apache.org/jira/browse/SPARK-16598 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > Fix For: 2.1.0 > > > So far, the test cases of TableIdentifierParserSuite do not cover the quoted > cases. We should add one for avoiding regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16985) SQL Output maybe overrided
[ https://issues.apache.org/jira/browse/SPARK-16985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16985. --- Resolution: Fixed Assignee: Hong Shen Fix Version/s: 2.1.0 Resolved by https://github.com/apache/spark/pull/14574 > SQL Output maybe overrided > -- > > Key: SPARK-16985 > URL: https://issues.apache.org/jira/browse/SPARK-16985 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hong Shen >Assignee: Hong Shen > Fix For: 2.1.0 > > > In our cluster, the SQL output is sometimes overwritten. When I submit several > SQL jobs that all insert into the same table, and each job takes less than one > minute, the detail is as follows: > 1 sql1, 11:03 insert into table. > 2 sql2, 11:04:11 insert into table. > 3 sql3, 11:04:48 insert into table. > 4 sql4, 11:05 insert into table. > 5 sql5, 11:06 insert into table. > sql3's output file overwrites sql2's output file. Here is the log: > {code} > 16/05/04 11:04:11 INFO hive.SparkHiveHadoopWriter: > XXfinalPath=hdfs://tl-sng-gdt-nn-tdw.tencent-distribute.com:54310/tmp/assorz/tdw-tdwadmin/20160504/04559505496526517_-1_1204544348/1/_tmp.p_20160428/attempt_201605041104_0001_m_00_1 > 16/05/04 11:04:48 INFO hive.SparkHiveHadoopWriter: > XXfinalPath=hdfs://tl-sng-gdt-nn-tdw.tencent-distribute.com:54310/tmp/assorz/tdw-tdwadmin/20160504/04559505496526517_-1_212180468/1/_tmp.p_20160428/attempt_201605041104_0001_m_00_1 > {code}
[jira] [Resolved] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2
[ https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-16975. Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 14585 [https://github.com/apache/spark/pull/14585] > Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2 > -- > > Key: SPARK-16975 > URL: https://issues.apache.org/jira/browse/SPARK-16975 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 > Environment: Ubuntu Linux 14.04 >Reporter: immerrr again >Assignee: Dongjoon Hyun > Labels: parquet > Fix For: 2.0.1, 2.1.0 > > > Spark-2.0.0 seems to have some problems reading a parquet dataset generated > by 1.6.2. > {code} > In [80]: spark.read.parquet('/path/to/data') > ... > AnalysisException: u'Unable to infer schema for ParquetFormat at > /path/to/data. It must be specified manually;' > {code} > The dataset is ~150G and partitioned by _locality_code column. None of the > partitions are empty. I have narrowed the failing dataset to the first 32 > partitions of the data: > {code} > In [82]: spark.read.parquet(*subdirs[:32]) > ... > AnalysisException: u'Unable to infer schema for ParquetFormat at > /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI. It must be > specified manually;' > {code} > Interestingly, it works OK if you remove any of the partitions from the list: > {code} > In [83]: for i in range(32): spark.read.parquet(*(subdirs[:i] + > subdirs[i+1:32])) > {code} > Another strange thing is that the schemas for the first and the last 31 > partitions of the subset are identical: > {code} > In [84]: spark.read.parquet(*subdirs[:31]).schema.fields == > spark.read.parquet(*subdirs[1:32]).schema.fields > Out[84]: True > {code} > Which got me interested and I tried this: > {code} > In [87]: spark.read.parquet(*([subdirs[0]] * 32)) > ... 
> AnalysisException: u'Unable to infer schema for ParquetFormat at > /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AQ. It must be > specified manually;' > In [88]: spark.read.parquet(*([subdirs[15]] * 32)) > ... > AnalysisException: u'Unable to infer schema for ParquetFormat at > /path/to/data/_locality_code=AX,/path/to/data/_locality_code=AX. It must be > specified manually;' > In [89]: spark.read.parquet(*([subdirs[31]] * 32)) > ... > AnalysisException: u'Unable to infer schema for ParquetFormat at > /path/to/data/_locality_code=BE,/path/to/data/_locality_code=BE. It must be > specified manually;' > {code} > If I read the first partition, save it in 2.0 and try to read in the same > manner, everything is fine: > {code} > In [100]: spark.read.parquet(subdirs[0]).write.parquet('spark-2.0-test') > 16/08/09 11:03:37 WARN ParquetRecordReader: Can not initialize counter due to > context is not a instance of TaskInputOutputContext, but is > org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl > In [101]: df = spark.read.parquet(*(['spark-2.0-test'] * 32)) > {code} > I have originally posted it to user mailing list, but with the last > discoveries this clearly seems like a bug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
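One detail worth noting when triaging this report: Hadoop-style listings conventionally treat path components beginning with {{_}} or {{.}} as hidden, and every failing partition directory here starts with an underscore ({{_locality_code=...}}). Whether that filter is the actual root cause is for the fix to confirm; the sketch below is only a hypothetical reproduction in plain Python (helper names invented) of how such a filter would drop every partition of this dataset:

```python
# Hypothetical Hadoop-style "hidden file" filter: names starting with
# '_' or '.' are skipped during directory listing.
def is_hidden(name):
    return name.startswith("_") or name.startswith(".")

def visible(paths):
    # Keep only entries whose final path component is not hidden.
    return [p for p in paths if not is_hidden(p.rsplit("/", 1)[-1])]

listing = [
    "/path/to/data/_locality_code=AQ",
    "/path/to/data/_locality_code=AI",
]
# Every partition directory is filtered out, leaving no files
# from which a schema could be inferred.
print(visible(listing))  # []
```

Under this hypothesis, a partition column whose name does not begin with an underscore would not trigger the failure, which would be an easy experiment for anyone reproducing the bug.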
[jira] [Updated] (SPARK-16598) Added a test case for verifying the table identifier parsing
[ https://issues.apache.org/jira/browse/SPARK-16598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16598: -- Assignee: Xiao Li Priority: Minor (was: Major) > Added a test case for verifying the table identifier parsing > > > Key: SPARK-16598 > URL: https://issues.apache.org/jira/browse/SPARK-16598 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Minor > Fix For: 2.1.0 > > > So far, the test cases of TableIdentifierParserSuite do not cover the quoted > cases. We should add one to avoid regressions.
[jira] [Comment Edited] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing
[ https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418547#comment-15418547 ] 胡振宇 edited comment on SPARK-14850 at 8/12/16 9:02 AM: --
{code}
// code is for Spark 1.6.1
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.mllib.linalg.Vectors

object Example {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Example")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val count = sc.parallelize(0 until 1e4.toInt, 1).map { i =>
      (i, Vectors.dense(Array.fill(1e6.toInt)(1.0)))
    }.toDF().rdd.count() // toDF can be used on Spark 1.6.1 at this step
  }
}
{code}
so I am not able to test the simple serialization example was (Author: fox19960207): /*code is for spark 1.6.1*/ object Example{ def main (args:Array[String]){ val conf = new SparkConf.setAppname("Example") val sc=new sparkContext(conf) val sqlContext=new SQLContext(sc) import sqlContext.implicts._ val count=sqlContext.sparkContext.parallelize(0,until 1e4.toInt,1).map{ i=>Test(i,Vectors.dense(Array.fill(1e6.toInt)(1.0))) }.toDF().rdd.count() //at this step toDF can be used on Spark1.6.1 } } so I am not able to test the simple serialization example > VectorUDT/MatrixUDT should take primitive arrays without boxing > --- > > Key: SPARK-14850 > URL: https://issues.apache.org/jira/browse/SPARK-14850 > Project: Spark > Issue Type: Improvement > Components: ML, SQL >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Xiangrui Meng >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.0.0 > > > In SPARK-9390, we switched to use GenericArrayData to store indices and > values in vector/matrix UDTs. However, GenericArrayData is not specialized > for primitive types. This might hurt MLlib performance badly. We should > consider either specialize GenericArrayData or use a different container.
> cc: [~cloud_fan] [~yhuai]
[jira] [Commented] (SPARK-8717) Update mllib-data-types docs to include missing "matrix" Python examples
[ https://issues.apache.org/jira/browse/SPARK-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418553#comment-15418553 ] Jagadeesan A S commented on SPARK-8717: --- I would like to raise a PR for this issue. [~srowen] could you add your input? > Update mllib-data-types docs to include missing "matrix" Python examples > > > Key: SPARK-8717 > URL: https://issues.apache.org/jira/browse/SPARK-8717 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib, PySpark >Reporter: Rosstin Murphy >Priority: Minor > > Currently, the documentation for MLLib Data Types (docs/mllib-data-types.md > in the repo, https://spark.apache.org/docs/latest/mllib-data-types.html in > the latest online docs) stops listing Python examples at "Labeled point". > "Local vector" and "Labeled point" have Python examples, however none of the > "matrix" entries have Python examples. > The "matrix" entries could be updated to include python examples. > I'm not 100% sure that all the matrices currently have implemented Python > equivalents, but I'm pretty sure that at least the first one ("Local matrix") > could have an entry. > from pyspark.mllib.linalg import DenseMatrix > dm = DenseMatrix(3, 2, [1.0, 3.0, 5.0, 2.0, 4.0, 6.0])
[jira] [Commented] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing
[ https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418547#comment-15418547 ] 胡振宇 commented on SPARK-14850: -
{code}
// code is for Spark 1.6.1
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// case class assumed from context; its definition was not shown in the original comment
case class Test(id: Int, vec: Vector)

object Example {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Example")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val count = sc.parallelize(0 until 1e4.toInt, 1).map { i =>
      Test(i, Vectors.dense(Array.fill(1e6.toInt)(1.0)))
    }.toDF().rdd.count() // toDF can be used on Spark 1.6.1 at this step
  }
}
{code}
so I am not able to test the simple serialization example > VectorUDT/MatrixUDT should take primitive arrays without boxing > --- > > Key: SPARK-14850 > URL: https://issues.apache.org/jira/browse/SPARK-14850 > Project: Spark > Issue Type: Improvement > Components: ML, SQL >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Xiangrui Meng >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.0.0 > > > In SPARK-9390, we switched to use GenericArrayData to store indices and > values in vector/matrix UDTs. However, GenericArrayData is not specialized > for primitive types. This might hurt MLlib performance badly. We should > consider either specialize GenericArrayData or use a different container. > cc: [~cloud_fan] [~yhuai]
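The boxing cost this issue describes (pointers to heap objects instead of a flat primitive buffer) is easy to see outside the JVM as well. The following plain-Python sketch, not Spark code, compares per-element storage of boxed floats against a packed double buffer, which is the same trade-off a specialized GenericArrayData would address:

```python
import sys
from array import array

n = 1000
values = [1.0 + i for i in range(n)]

# Boxed: the list holds pointers, and every element is a full float object.
boxed_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# Primitive: array('d') stores raw 8-byte doubles back to back,
# analogous to what an unboxed container buys on the JVM.
primitive_bytes = sys.getsizeof(array("d", values))

assert primitive_bytes < boxed_bytes
```

The exact byte counts vary by interpreter, but the packed buffer is consistently several times smaller, and the same indirection that inflates memory also hurts cache behavior during aggregation.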
[jira] [Commented] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance
[ https://issues.apache.org/jira/browse/SPARK-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418537#comment-15418537 ] Apache Spark commented on SPARK-17033: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/14621 > GaussianMixture should use treeAggregate to improve performance > --- > > Key: SPARK-17033 > URL: https://issues.apache.org/jira/browse/SPARK-17033 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Yanbo Liang >Priority: Minor > > {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to > improve performance and scalability. In my test of dataset with 200 features > and 1M instance, I found there is 20% increased performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17027) PolynomialExpansion.choose is prone to integer overflow
[ https://issues.apache.org/jira/browse/SPARK-17027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418071#comment-15418071 ] Maciej Szymkiewicz commented on SPARK-17027: Yes, this exactly the problem. {code} choose(14, 10) // res0: Int = -182 {code} > PolynomialExpansion.choose is prone to integer overflow > > > Key: SPARK-17027 > URL: https://issues.apache.org/jira/browse/SPARK-17027 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > Current implementation computes power of k directly and because of that it is > susceptible to integer overflow on relatively small input (4 features, degree > equal 10). It would be better to use recursive formula instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
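The -182 in the comment is consistent with a falling-factorial implementation overflowing a 32-bit Int before the division. The sketch below is plain Python (the real code is Scala; the 32-bit wrap-around is simulated here) contrasting that overflowing form with the overflow-free multiplicative recurrence the issue proposes:

```python
def to_int32(x):
    # Wrap a Python int to a signed 32-bit value, like a JVM Int.
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

def choose_naive_int32(n, k):
    # n * (n-1) * ... * (n-k+1) / k!, with every product wrapped to 32 bits.
    num = 1
    for i in range(n, n - k, -1):
        num = to_int32(num * i)
    den = 1
    for i in range(k, 1, -1):
        den = to_int32(den * i)
    # JVM integer division truncates toward zero.
    q = abs(num) // abs(den)
    return q if (num < 0) == (den < 0) else -q

def choose_safe(n, k):
    # Multiplicative recurrence: C(n, k) = C(n, k-1) * (n - k + 1) / k.
    # Each step divides exactly, so intermediates never exceed the result.
    result = 1
    for i in range(1, k + 1):
        result = result * (n - k + i) // i
    return result

print(choose_naive_int32(14, 10))  # -182, matching the reported value
print(choose_safe(14, 10))         # 1001
```

The numerator 14·13·…·5 = 3,632,428,800 already exceeds Int.MaxValue, wraps to -662,538,496, and truncated division by 10! then yields -182, exactly the value reported.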
[jira] [Assigned] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance
[ https://issues.apache.org/jira/browse/SPARK-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17033: Assignee: (was: Apache Spark) > GaussianMixture should use treeAggregate to improve performance > --- > > Key: SPARK-17033 > URL: https://issues.apache.org/jira/browse/SPARK-17033 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Yanbo Liang >Priority: Minor > > {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to > improve performance and scalability. In my test of dataset with 200 features > and 1M instance, I found there is 20% increased performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance
[ https://issues.apache.org/jira/browse/SPARK-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-17033: Description: {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to improve performance and scalability. In my test of dataset with 200 features and 1M instance, I found there are 20% increased performance. (was: {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to improve performance and scalability. In my test of dataset with 200 features and 1M instance, I found there are 15% increased performance.) > GaussianMixture should use treeAggregate to improve performance > --- > > Key: SPARK-17033 > URL: https://issues.apache.org/jira/browse/SPARK-17033 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Yanbo Liang >Priority: Minor > > {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to > improve performance and scalability. In my test of dataset with 200 features > and 1M instance, I found there are 20% increased performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance
[ https://issues.apache.org/jira/browse/SPARK-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-17033: Description: {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to improve performance and scalability. In my test of dataset with 200 features and 1M instance, I found there are 15% increased performance. (was: {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to improve performance and scalability. In my test of dataset with 200 features and 1M instance, I found there are 20% increased performance.) > GaussianMixture should use treeAggregate to improve performance > --- > > Key: SPARK-17033 > URL: https://issues.apache.org/jira/browse/SPARK-17033 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Yanbo Liang >Priority: Minor > > {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to > improve performance and scalability. In my test of dataset with 200 features > and 1M instance, I found there are 15% increased performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance
[ https://issues.apache.org/jira/browse/SPARK-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-17033: Description: {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to improve performance and scalability. In my test of dataset with 200 features and 1M instance, I found there is 20% increased performance. (was: {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to improve performance and scalability. In my test of dataset with 200 features and 1M instance, I found there are 20% increased performance.) > GaussianMixture should use treeAggregate to improve performance > --- > > Key: SPARK-17033 > URL: https://issues.apache.org/jira/browse/SPARK-17033 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Yanbo Liang >Priority: Minor > > {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to > improve performance and scalability. In my test of dataset with 200 features > and 1M instance, I found there is 20% increased performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance
Yanbo Liang created SPARK-17033: --- Summary: GaussianMixture should use treeAggregate to improve performance Key: SPARK-17033 URL: https://issues.apache.org/jira/browse/SPARK-17033 Project: Spark Issue Type: Improvement Reporter: Yanbo Liang Priority: Minor {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to improve performance and scalability. In my test of a dataset with 200 features and 1M instances, I found a 20% performance improvement.
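For readers unfamiliar with why this change helps: a flat {{aggregate}} sends every partition's partial result straight to the driver, while {{treeAggregate}} merges them in intermediate rounds so the driver combines only a handful of values. The combining pattern can be sketched in plain Python (this is an illustration of the idea, not Spark's implementation):

```python
def tree_combine(partials, fan_in=2):
    # Merge partition partial sums in rounds of `fan_in`, the way
    # treeAggregate pushes combining onto executors, instead of
    # folding all partials on the driver at once.
    while len(partials) > 1:
        partials = [sum(partials[i:i + fan_in])
                    for i in range(0, len(partials), fan_in)]
    return partials[0]

# 8 partitions, each contributing a partial sum.
partials = [sum(range(p * 100, (p + 1) * 100)) for p in range(8)]

flat = sum(partials)           # aggregate: driver folds all 8 at once
tree = tree_combine(partials)  # treeAggregate: 8 -> 4 -> 2 -> 1
assert tree == flat
```

Both orders give the same result for an associative combiner; the win is that with many partitions and large per-partition state (as in GaussianMixture's sufficient statistics), the driver no longer has to receive and fold every partial itself.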
[jira] [Commented] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing
[ https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418534#comment-15418534 ] Wenchen Fan commented on SPARK-14850: - format your code please, it's unreadable > VectorUDT/MatrixUDT should take primitive arrays without boxing > --- > > Key: SPARK-14850 > URL: https://issues.apache.org/jira/browse/SPARK-14850 > Project: Spark > Issue Type: Improvement > Components: ML, SQL >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Xiangrui Meng >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.0.0 > > > In SPARK-9390, we switched to use GenericArrayData to store indices and > values in vector/matrix UDTs. However, GenericArrayData is not specialized > for primitive types. This might hurt MLlib performance badly. We should > consider either specialize GenericArrayData or use a different container. > cc: [~cloud_fan] [~yhuai] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17033) GaussianMixture should use treeAggregate to improve performance
[ https://issues.apache.org/jira/browse/SPARK-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-17033: Component/s: MLlib ML > GaussianMixture should use treeAggregate to improve performance > --- > > Key: SPARK-17033 > URL: https://issues.apache.org/jira/browse/SPARK-17033 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Yanbo Liang >Priority: Minor > > {{GaussianMixture}} should use {{treeAggregate}} rather than {{aggregate}} to > improve performance and scalability. In my test of dataset with 200 features > and 1M instance, I found there are 20% increased performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org