[jira] [Resolved] (SPARK-23708) Comment of ShutdownHookManager.addShutdownHook is error
[ https://issues.apache.org/jira/browse/SPARK-23708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saisai Shao resolved SPARK-23708.
---------------------------------
    Resolution: Fixed
    Fix Version/s: 2.4.0

Issue resolved by pull request 20845
[https://github.com/apache/spark/pull/20845]

> Comment of ShutdownHookManager.addShutdownHook is error
> -------------------------------------------------------
>
>              Key: SPARK-23708
>              URL: https://issues.apache.org/jira/browse/SPARK-23708
>          Project: Spark
>       Issue Type: Improvement
>       Components: Spark Core
> Affects Versions: 2.3.0
>         Reporter: zhoukang
>         Assignee: zhoukang
>         Priority: Minor
>          Fix For: 2.4.0
>
> The comment below is not right:
> {code:java}
> /**
>  * Adds a shutdown hook with the given priority. Hooks with lower priority values run
>  * first.
>  *
>  * @param hook The code to run during shutdown.
>  * @return A handle that can be used to unregister the shutdown hook.
>  */
> def addShutdownHook(priority: Int)(hook: () => Unit): AnyRef = {
>   shutdownHooks.add(priority, hook)
> }
> {code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23708) Comment of ShutdownHookManager.addShutdownHook is error
[ https://issues.apache.org/jira/browse/SPARK-23708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saisai Shao reassigned SPARK-23708:
-----------------------------------
    Assignee: zhoukang

> Comment of ShutdownHookManager.addShutdownHook is error
> -------------------------------------------------------
>
>              Key: SPARK-23708
>              URL: https://issues.apache.org/jira/browse/SPARK-23708
>          Project: Spark
>       Issue Type: Improvement
>       Components: Spark Core
> Affects Versions: 2.3.0
>         Reporter: zhoukang
>         Assignee: zhoukang
>         Priority: Minor
>          Fix For: 2.4.0
>
> The comment below is not right:
> {code:java}
> /**
>  * Adds a shutdown hook with the given priority. Hooks with lower priority values run
>  * first.
>  *
>  * @param hook The code to run during shutdown.
>  * @return A handle that can be used to unregister the shutdown hook.
>  */
> def addShutdownHook(priority: Int)(hook: () => Unit): AnyRef = {
>   shutdownHooks.add(priority, hook)
> }
> {code}
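[Editor's note] For context on the error in the quoted Scaladoc: in Spark's ShutdownHookManager, hooks execute in order of *decreasing* priority, so the comment's claim that "lower priority values run first" is backwards, and the `@param priority` tag is missing. A sketch of what a corrected comment could look like (the exact wording merged in pull request 20845 may differ):

```
/**
 * Adds a shutdown hook with the given priority. Hooks with higher priority
 * values run first.
 *
 * @param priority The priority of the shutdown hook; higher values run earlier.
 * @param hook The code to run during shutdown.
 * @return A handle that can be used to unregister the shutdown hook.
 */
def addShutdownHook(priority: Int)(hook: () => Unit): AnyRef = {
  shutdownHooks.add(priority, hook)
}
```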
[jira] [Created] (SPARK-23734) InvalidSchemaException While Saving ALSModel
Stanley Poon created SPARK-23734:
--------------------------------
    Summary: InvalidSchemaException While Saving ALSModel
    Key: SPARK-23734
    URL: https://issues.apache.org/jira/browse/SPARK-23734
    Project: Spark
    Issue Type: Bug
    Components: ML
    Affects Versions: 2.3.0
    Environment: macOS 10.13.2, Scala 2.11.8, Spark 2.3.0
    Reporter: Stanley Poon

After fitting an ALSModel, I get the following error while saving the model:

Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can not be empty. Parquet does not support empty group without leaves. Empty group: spark_schema

Exactly the same code ran fine on 2.2.1. The same issue also occurs on other ALSModels we have.

h2. To reproduce

Get ALSExample: [https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala] and add the following line to save the model right before "spark.stop":
{quote}
model.write.overwrite().save("SparkExampleALSModel")
{quote}

h2. Stack Trace

{code}
Exception in thread "main" java.lang.ExceptionInInitializerError
  at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.setSchema(ParquetWriteSupport.scala:444)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite(ParquetFileFormat.scala:112)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:140)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
  at org.apache.spark.ml.recommendation.ALSModel$ALSModelWriter.saveImpl(ALS.scala:510)
  at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:103)
  at com.vitalmove.model.ALSExample$.main(ALSExample.scala:83)
  at com.vitalmove.model.ALSExample.main(ALSExample.scala)
Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can not be empty. Parquet does not support empty group without leaves. Empty group: spark_schema
  at org.apache.parquet.schema.GroupType.<init>(GroupType.java:92)
  at org.apache.parquet.schema.GroupType.<init>(GroupType.java:48)
  at org.apache.parquet.schema.MessageType.<init>(MessageType.java:50)
  at org.apache.parquet.schema.Types$MessageTypeBuilder.named(Types.java:1256)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.<init>(ParquetSchemaConverter.scala:567)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.<clinit>(ParquetSchemaConverter.scala)
{code}
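[Editor's note] The reproduction steps above can be condensed into a minimal self-contained sketch. This is illustrative only: the object name, toy ratings data, and save path are made up, and running it requires the spark-core/spark-sql/spark-mllib dependencies; on 2.3.0 the final `save` call is what triggers the InvalidSchemaException reported here.

```
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object ALSSaveRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("ALSSaveRepro")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Tiny illustrative ratings set; any (user, item, rating) data will do.
    val ratings = Seq((0, 0, 4.0f), (0, 1, 2.0f), (1, 1, 3.0f))
      .toDF("userId", "movieId", "rating")

    val model = new ALS()
      .setUserCol("userId")
      .setItemCol("movieId")
      .setRatingCol("rating")
      .fit(ratings)

    // Succeeds on 2.2.1; on 2.3.0 this is the line reported to fail with
    // "InvalidSchemaException: A group type can not be empty".
    model.write.overwrite().save("SparkExampleALSModel")

    spark.stop()
  }
}
```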
[jira] [Commented] (SPARK-23732) Broken link to scala source code in Spark Scala api Scaladoc
[ https://issues.apache.org/jira/browse/SPARK-23732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16404290#comment-16404290 ]

Yogesh Tewari commented on SPARK-23732:
---------------------------------------

Adding sourcepath to the code seems to fix the issue:
{code:java}
scalacOptions in (ScalaUnidoc, unidoc) ++= Seq(
  "-groups", // Group similar methods together based on the @group annotation.
  "-skip-packages", "org.apache.hadoop",
  "-sourcepath", (baseDirectory in ThisBuild).value.getAbsolutePath // Required for relative source links in scaladoc
) ++ (
{code}

> Broken link to scala source code in Spark Scala api Scaladoc
> ------------------------------------------------------------
>
>              Key: SPARK-23732
>              URL: https://issues.apache.org/jira/browse/SPARK-23732
>          Project: Spark
>       Issue Type: Bug
>       Components: Build, Documentation, Project Infra
> Affects Versions: 2.3.0, 2.3.1
>      Environment:
> {code:java}
> ~/spark/docs$ cat /etc/*release*
> DISTRIB_ID=Ubuntu
> DISTRIB_RELEASE=16.04
> DISTRIB_CODENAME=xenial
> DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"
> NAME="Ubuntu"
> VERSION="16.04.4 LTS (Xenial Xerus)"
> ID=ubuntu
> ID_LIKE=debian
> PRETTY_NAME="Ubuntu 16.04.4 LTS"
> VERSION_ID="16.04"
> HOME_URL="http://www.ubuntu.com/"
> SUPPORT_URL="http://help.ubuntu.com/"
> BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
> VERSION_CODENAME=xenial
> UBUNTU_CODENAME=xenial
> {code}
> Using spark packaged sbt.
> Other versions:
> {code:java}
> ~/spark/docs$ ruby -v
> ruby 2.3.1p112 (2016-04-26) [x86_64-linux-gnu]
> ~/spark/docs$ gem -v
> 2.5.2.1
> ~/spark/docs$ jekyll -v
> jekyll 3.7.3
> ~/spark/docs$ java -version
> java version "1.8.0_112"
> Java(TM) SE Runtime Environment (build 1.8.0_112-b15)
> Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode)
> {code}
>         Reporter: Yogesh Tewari
>         Priority: Trivial
>           Labels: build, documentation, scaladocs
>
> The Scala source code link in the Spark API scaladoc is broken.
> Turns out, instead of the relative path to the Scala files, the "€{FILE_PATH}.scala" expression in [https://github.com/apache/spark/blob/master/project/SparkBuild.scala] generates the absolute path from the developer's computer. In this case, if I try to access the source link on [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.Accumulable], it tries to take me to [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/Accumulable.scala], where the "/Users/sameera/dev/spark" portion of the URL comes from the developer's macOS home folder.
> There seems to be no change in the code responsible for generating this path during the build in /project/SparkBuild.scala:
> Line # 252:
> {code:java}
> scalacOptions in Compile ++= Seq(
>   s"-target:jvm-${scalacJVMVersion.value}",
>   "-sourcepath", (baseDirectory in ThisBuild).value.getAbsolutePath // Required for relative source links in scaladoc
> ),
> {code}
> Line # 726:
> {code:java}
> // Use GitHub repository for Scaladoc source links
> unidocSourceBase := s"https://github.com/apache/spark/tree/v${version.value}",
> scalacOptions in (ScalaUnidoc, unidoc) ++= Seq(
>   "-groups", // Group similar methods together based on the @group annotation.
>   "-skip-packages", "org.apache.hadoop"
> ) ++ (
>   // Add links to sources when generating Scaladoc for a non-snapshot release
>   if (!isSnapshot.value) {
>     Opts.doc.sourceUrl(unidocSourceBase.value + "€{FILE_PATH}.scala")
>   } else {
>     Seq()
>   }
> )
> {code}
> It seems more like a developer's dev environment issue. I was able to reproduce this in my dev environment. Environment details attached.
[jira] [Resolved] (SPARK-23665) Add adaptive algorithm to select query result collect method
[ https://issues.apache.org/jira/browse/SPARK-23665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhoukang resolved SPARK-23665.
------------------------------
    Resolution: Won't Fix

> Add adaptive algorithm to select query result collect method
> ------------------------------------------------------------
>
>              Key: SPARK-23665
>              URL: https://issues.apache.org/jira/browse/SPARK-23665
>          Project: Spark
>       Issue Type: Improvement
>       Components: SQL
> Affects Versions: 2.1.1, 2.3.0
>         Reporter: zhoukang
>         Priority: Major
>
> Currently, we use a configuration like
> {code:java}
> spark.sql.thriftServer.incrementalCollect
> {code}
> to specify the query result collect method.
> Actually, we could estimate the size of the result and select the collect method automatically.
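[Editor's note] For reference, the collect method discussed above is chosen statically by the user today rather than adaptively; the configuration is passed like any other Spark SQL conf when launching the Thrift server. A sketch (the flag value shown is illustrative, and the script path assumes a standard Spark distribution layout):

```
# Enable incremental collection: the Thrift server iterates over result
# partitions instead of collect()-ing the whole result set into the driver.
./sbin/start-thriftserver.sh \
  --conf spark.sql.thriftServer.incrementalCollect=true
```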
[jira] [Updated] (SPARK-23732) Broken link to scala source code in Spark Scala api Scaladoc
[ https://issues.apache.org/jira/browse/SPARK-23732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yogesh Tewari updated SPARK-23732:
----------------------------------
    Description:

Scala source code link in Spark api scaladoc is broken.

Turns out, instead of the relative path to the Scala files, the "€{FILE_PATH}.scala" expression in [https://github.com/apache/spark/blob/master/project/SparkBuild.scala] is generating the absolute path from the developer's computer. In this case, if I try to access the source link on [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.Accumulable], it tries to take me to [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/Accumulable.scala], where the "/Users/sameera/dev/spark" portion of the URL comes from the developer's macOS home folder.

There seems to be no change in the code responsible for generating this path during the build in /project/SparkBuild.scala:

Line # 252:
{code:java}
scalacOptions in Compile ++= Seq(
  s"-target:jvm-${scalacJVMVersion.value}",
  "-sourcepath", (baseDirectory in ThisBuild).value.getAbsolutePath // Required for relative source links in scaladoc
),
{code}

Line # 726:
{code:java}
// Use GitHub repository for Scaladoc source links
unidocSourceBase := s"https://github.com/apache/spark/tree/v${version.value}",
scalacOptions in (ScalaUnidoc, unidoc) ++= Seq(
  "-groups", // Group similar methods together based on the @group annotation.
  "-skip-packages", "org.apache.hadoop"
) ++ (
  // Add links to sources when generating Scaladoc for a non-snapshot release
  if (!isSnapshot.value) {
    Opts.doc.sourceUrl(unidocSourceBase.value + "€{FILE_PATH}.scala")
  } else {
    Seq()
  }
)
{code}

It seems more like a developer's dev environment issue. I was able to reproduce this in my dev environment. Environment details attached.

was:

Scala source code link in Spark api scaladoc is broken.

Turns out instead of the relative path to the scala files the "€{FILE_PATH}.scala" expression in [https://github.com/apache/spark/blob/master/project/SparkBuild.scala] is generating the absolute path from the developers computer. In this case, if I try to access the source link on [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.Accumulable], it tries to take me to [https://github.com/apache/spark/tree/v2.3.0{color:#ff}/Users/sameera/dev/spark{color}/core/src/main/scala/org/apache/spark/Accumulable.scala|https://github.com/apache/spark/tree/v2.3.0%3Cfont%20color=] where "/Users/sameera/dev/spark" portion of the URL is coming from the developers macos home folder.

There seems to be no change in the code responsible for generating this path during the build in /project/SparkBuild.scala:

Line # 252:
{code:java}
scalacOptions in Compile ++= Seq(
  s"-target:jvm-${scalacJVMVersion.value}",
  "-sourcepath", (baseDirectory in ThisBuild).value.getAbsolutePath // Required for relative source links in scaladoc
),
{code}

Line # 726:
{code:java}
// Use GitHub repository for Scaladoc source links
unidocSourceBase := s"https://github.com/apache/spark/tree/v${version.value}",
scalacOptions in (ScalaUnidoc, unidoc) ++= Seq(
  "-groups", // Group similar methods together based on the @group annotation.
  "-skip-packages", "org.apache.hadoop"
) ++ (
  // Add links to sources when generating Scaladoc for a non-snapshot release
  if (!isSnapshot.value) {
    Opts.doc.sourceUrl(unidocSourceBase.value + "€{FILE_PATH}.scala")
  } else {
    Seq()
  }
)
{code}

It seems more like a developers dev environment issue. I was successfully able to reproduce this in my dev environment. Environment details attached.

> Broken link to scala source code in Spark Scala api Scaladoc
> ------------------------------------------------------------
>
>              Key: SPARK-23732
>              URL: https://issues.apache.org/jira/browse/SPARK-23732
>          Project: Spark
>       Issue Type: Bug
>       Components: Build, Documentation, Project Infra
> Affects Versions: 2.3.0, 2.3.1
>      Environment:
> {code:java}
> ~/spark/docs$ cat /etc/*release*
> DISTRIB_ID=Ubuntu
> DISTRIB_RELEASE=16.04
> DISTRIB_CODENAME=xenial
> DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"
> NAME="Ubuntu"
> VERSION="16.04.4 LTS (Xenial Xerus)"
> ID=ubuntu
> ID_LIKE=debian
> PRETTY_NAME="Ubuntu 16.04.4 LTS"
> VERSION_ID="16.04"
> HOME_URL="http://www.ubuntu.com/"
> SUPPORT_URL="http://help.ubuntu.com/"
> BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
> VERSION_CODENAME=xenial
> UBUNTU_CODENAME=xenial
> {code}
> Using spark packaged sbt.
> Other versions:
> {code:java}
> ~/spark/docs$ ruby -v
> ruby 2.3.1p112 (2016-04-26) [x86_64-linux-gnu]
> ~/spark/docs$ gem -v
> 2.5.2.1
> ~/spark/docs$ jekyll -v
> jekyll 3.7.3
> ~/spark/docs$ java -version
> java version "1.8.0_112"
> Java(TM) SE Runtime Environment (build 1.8.0_112-b15)
> Java HotSpot(TM) 64-Bit
[jira] [Updated] (SPARK-23732) Broken link to scala source code in Spark Scala api Scaladoc
[ https://issues.apache.org/jira/browse/SPARK-23732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yogesh Tewari updated SPARK-23732:
----------------------------------
    Description:

Scala source code link in Spark api scaladoc is broken.

Turns out instead of the relative path to the scala files the "€{FILE_PATH}.scala" expression in [https://github.com/apache/spark/blob/master/project/SparkBuild.scala] is generating the absolute path from the developers computer. In this case, if I try to access the source link on [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.Accumulable], it tries to take me to [https://github.com/apache/spark/tree/v2.3.0{color:#ff}/Users/sameera/dev/spark{color}/core/src/main/scala/org/apache/spark/Accumulable.scala|https://github.com/apache/spark/tree/v2.3.0%3Cfont%20color=] where "/Users/sameera/dev/spark" portion of the URL is coming from the developers macos home folder.

There seems to be no change in the code responsible for generating this path during the build in /project/SparkBuild.scala:

Line # 252:
{code:java}
scalacOptions in Compile ++= Seq(
  s"-target:jvm-${scalacJVMVersion.value}",
  "-sourcepath", (baseDirectory in ThisBuild).value.getAbsolutePath // Required for relative source links in scaladoc
),
{code}

Line # 726:
{code:java}
// Use GitHub repository for Scaladoc source links
unidocSourceBase := s"https://github.com/apache/spark/tree/v${version.value}",
scalacOptions in (ScalaUnidoc, unidoc) ++= Seq(
  "-groups", // Group similar methods together based on the @group annotation.
  "-skip-packages", "org.apache.hadoop"
) ++ (
  // Add links to sources when generating Scaladoc for a non-snapshot release
  if (!isSnapshot.value) {
    Opts.doc.sourceUrl(unidocSourceBase.value + "€{FILE_PATH}.scala")
  } else {
    Seq()
  }
)
{code}

It seems more like a developers dev environment issue. I was successfully able to reproduce this in my dev environment. Environment details attached.

was:

Scala source code link in Spark api scaladoc is broken.

Turns out instead of the relative path to the scala files the "€{FILE_PATH}.scala" expression in [https://github.com/apache/spark/blob/master/project/SparkBuild.scala] is generating the absolute path from the developers computer. In this case, if I try to access the source link on [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.Accumulable], it tries to take me to [/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/Accumulable.scala" class="external-link" rel="nofollow">https://github.com/apache/spark/tree/v2.3.0{color:#ff}/Users/sameera/dev/spark{color}/core/src/main/scala/org/apache/spark/Accumulable.scala|https://github.com/apache/spark/tree/v2.3.0%3Cfont%20color=] where "/Users/sameera/dev/spark" portion of the URL is coming from the developers macos home folder.

There seems to be no change in the code responsible for generating this path during the build in /project/SparkBuild.scala:

Line # 252:
{code:java}
scalacOptions in Compile ++= Seq(
  s"-target:jvm-${scalacJVMVersion.value}",
  "-sourcepath", (baseDirectory in ThisBuild).value.getAbsolutePath // Required for relative source links in scaladoc
),
{code}

Line # 726:
{code:java}
// Use GitHub repository for Scaladoc source links
unidocSourceBase := s"https://github.com/apache/spark/tree/v${version.value}",
scalacOptions in (ScalaUnidoc, unidoc) ++= Seq(
  "-groups", // Group similar methods together based on the @group annotation.
  "-skip-packages", "org.apache.hadoop"
) ++ (
  // Add links to sources when generating Scaladoc for a non-snapshot release
  if (!isSnapshot.value) {
    Opts.doc.sourceUrl(unidocSourceBase.value + "€{FILE_PATH}.scala")
  } else {
    Seq()
  }
)
{code}

It seems more like a developers dev environment issue. I was successfully able to reproduce this in my dev environment. Environment details attached.

> Broken link to scala source code in Spark Scala api Scaladoc
> ------------------------------------------------------------
>
>              Key: SPARK-23732
>              URL: https://issues.apache.org/jira/browse/SPARK-23732
>          Project: Spark
>       Issue Type: Bug
>       Components: Build, Documentation, Project Infra
> Affects Versions: 2.3.0, 2.3.1
>      Environment:
> {code:java}
> ~/spark/docs$ cat /etc/*release*
> DISTRIB_ID=Ubuntu
> DISTRIB_RELEASE=16.04
> DISTRIB_CODENAME=xenial
> DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"
> NAME="Ubuntu"
> VERSION="16.04.4 LTS (Xenial Xerus)"
> ID=ubuntu
> ID_LIKE=debian
> PRETTY_NAME="Ubuntu 16.04.4 LTS"
> VERSION_ID="16.04"
> HOME_URL="http://www.ubuntu.com/"
> SUPPORT_URL="http://help.ubuntu.com/"
> BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
> VERSION_CODENAME=xenial
> UBUNTU_CODENAME=xenial
> {code}
> Using spark packaged sbt.
> Other versions:
> {code:java}
> ~/spark/docs$ ruby -v
> ruby 2.3.1p112 (2016-04-26) [x86_64-linux-gnu]
> ~/spark/docs$
[jira] [Updated] (SPARK-23733) Broken link to java source code in Spark Scala api Scaladoc
[ https://issues.apache.org/jira/browse/SPARK-23733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yogesh Tewari updated SPARK-23733:
----------------------------------
    Description:

Java source code link in Spark api scaladoc is broken.

The relative path expression "€{FILE_PATH}.scala" in [https://github.com/apache/spark/blob/master/project/SparkBuild.scala] has ".scala" hardcoded at the end. If I try to access the source link on [https://spark.apache.org/docs/1.6.3/api/scala/index.html#org.apache.spark.api.java.function.Function2], it tries to take me to [https://github.com/apache/spark/tree/v2.2.0/core/src/main/java/org/apache/spark/api/java/function/Function2.java.scala]

This is coming from /project/SparkBuild.scala:

Line # 720:
{code:java}
// Use GitHub repository for Scaladoc source links
unidocSourceBase := s"https://github.com/apache/spark/tree/v${version.value}",
scalacOptions in (ScalaUnidoc, unidoc) ++= Seq(
  "-groups", // Group similar methods together based on the @group annotation.
  "-skip-packages", "org.apache.hadoop"
) ++ (
  // Add links to sources when generating Scaladoc for a non-snapshot release
  if (!isSnapshot.value) {
    Opts.doc.sourceUrl(unidocSourceBase.value + "€{FILE_PATH}.scala")
  } else {
    Seq()
  }
)
{code}

was:

Java source code link in Spark api scaladoc is broken.

The relative path expression "€{FILE_PATH}.scala" in [https://github.com/apache/spark/blob/master/project/SparkBuild.scala] has ".scala" hardcoded in the end. If I try to access the source link on [https://spark.apache.org/docs/1.6.3/api/scala/index.html#org.apache.spark.api.java.function.Function2], it tries to take me to [/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/Accumulable.scala" class="external-link" rel="nofollow">https://github.com/apache/spark/tree/v2.3.0{color:#ff}/Users/sameera/dev/spark{color}/core/src/main/scala/org/apache/spark/Accumulable.scala|https://github.com/apache/spark/tree/v2.3.0%3Cfont%20color=] where "/Users/sameera/dev/spark" portion of the URL is coming from the developers macos home folder.

There seems to be no change in the code responsible for generating this path during the build in /project/SparkBuild.scala:

Line # 252:
{code:java}
scalacOptions in Compile ++= Seq(
  s"-target:jvm-${scalacJVMVersion.value}",
  "-sourcepath", (baseDirectory in ThisBuild).value.getAbsolutePath // Required for relative source links in scaladoc
),
{code}

Line # 726:
{code:java}
// Use GitHub repository for Scaladoc source links
unidocSourceBase := s"https://github.com/apache/spark/tree/v${version.value}",
scalacOptions in (ScalaUnidoc, unidoc) ++= Seq(
  "-groups", // Group similar methods together based on the @group annotation.
  "-skip-packages", "org.apache.hadoop"
) ++ (
  // Add links to sources when generating Scaladoc for a non-snapshot release
  if (!isSnapshot.value) {
    Opts.doc.sourceUrl(unidocSourceBase.value + "€{FILE_PATH}.scala")
  } else {
    Seq()
  }
)
{code}

It seems more like a developers dev environment issue. I was successfully able to reproduce this in my dev environment. Environment details attached.

> Broken link to java source code in Spark Scala api Scaladoc
> -----------------------------------------------------------
>
>              Key: SPARK-23733
>              URL: https://issues.apache.org/jira/browse/SPARK-23733
>          Project: Spark
>       Issue Type: Bug
>       Components: Build, Documentation, Project Infra
> Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0
>      Environment:
> {code:java}
> ~/spark/docs$ cat /etc/*release*
> DISTRIB_ID=Ubuntu
> DISTRIB_RELEASE=16.04
> DISTRIB_CODENAME=xenial
> DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"
> NAME="Ubuntu"
> VERSION="16.04.4 LTS (Xenial Xerus)"
> ID=ubuntu
> ID_LIKE=debian
> PRETTY_NAME="Ubuntu 16.04.4 LTS"
> VERSION_ID="16.04"
> HOME_URL="http://www.ubuntu.com/"
> SUPPORT_URL="http://help.ubuntu.com/"
> BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
> VERSION_CODENAME=xenial
> UBUNTU_CODENAME=xenial
> {code}
> Using spark packaged sbt.
> Other versions:
> {code:java}
> ~/spark/docs$ ruby -v
> ruby 2.3.1p112 (2016-04-26) [x86_64-linux-gnu]
> ~/spark/docs$ gem -v
> 2.5.2.1
> ~/spark/docs$ jekyll -v
> jekyll 3.7.3
> ~/spark/docs$ java -version
> java version "1.8.0_112"
> Java(TM) SE Runtime Environment (build 1.8.0_112-b15)
> Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode)
> {code}
>         Reporter: Yogesh Tewari
>         Priority: Trivial
>           Labels: build, documentation, scaladocs
>
> Java source code link in Spark api scaladoc is broken.
> The relative path expression "€{FILE_PATH}.scala" in [https://github.com/apache/spark/blob/master/project/SparkBuild.scala] has ".scala" hardcoded at the end. If I try to access the source link on [https://spark.apache.org/docs/1.6.3/api/scala/index.html#org.apache.spark.api.java.function.Function2], it tries to take me to
[jira] [Updated] (SPARK-23733) Broken link to java source code in Spark Scala api Scaladoc
[ https://issues.apache.org/jira/browse/SPARK-23733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yogesh Tewari updated SPARK-23733:
----------------------------------
    Affects Version/s:     (was: 2.3.0)

> Broken link to java source code in Spark Scala api Scaladoc
> -----------------------------------------------------------
>
>              Key: SPARK-23733
>              URL: https://issues.apache.org/jira/browse/SPARK-23733
>          Project: Spark
>       Issue Type: Bug
>       Components: Build, Documentation, Project Infra
> Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0
>      Environment:
> {code:java}
> ~/spark/docs$ cat /etc/*release*
> DISTRIB_ID=Ubuntu
> DISTRIB_RELEASE=16.04
> DISTRIB_CODENAME=xenial
> DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"
> NAME="Ubuntu"
> VERSION="16.04.4 LTS (Xenial Xerus)"
> ID=ubuntu
> ID_LIKE=debian
> PRETTY_NAME="Ubuntu 16.04.4 LTS"
> VERSION_ID="16.04"
> HOME_URL="http://www.ubuntu.com/"
> SUPPORT_URL="http://help.ubuntu.com/"
> BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
> VERSION_CODENAME=xenial
> UBUNTU_CODENAME=xenial
> {code}
> Using spark packaged sbt.
> Other versions:
> {code:java}
> ~/spark/docs$ ruby -v
> ruby 2.3.1p112 (2016-04-26) [x86_64-linux-gnu]
> ~/spark/docs$ gem -v
> 2.5.2.1
> ~/spark/docs$ jekyll -v
> jekyll 3.7.3
> ~/spark/docs$ java -version
> java version "1.8.0_112"
> Java(TM) SE Runtime Environment (build 1.8.0_112-b15)
> Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode)
> {code}
>         Reporter: Yogesh Tewari
>         Priority: Trivial
>           Labels: build, documentation, scaladocs
>
> Java source code link in Spark api scaladoc is broken.
> The relative path expression "€{FILE_PATH}.scala" in [https://github.com/apache/spark/blob/master/project/SparkBuild.scala] has ".scala" hardcoded at the end. If I try to access the source link on [https://spark.apache.org/docs/1.6.3/api/scala/index.html#org.apache.spark.api.java.function.Function2], it tries to take me to [/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/Accumulable.scala" class="external-link" rel="nofollow">https://github.com/apache/spark/tree/v2.3.0{color:#ff}/Users/sameera/dev/spark{color}/core/src/main/scala/org/apache/spark/Accumulable.scala|https://github.com/apache/spark/tree/v2.3.0%3Cfont%20color=] where "/Users/sameera/dev/spark" portion of the URL is coming from the developers macos home folder.
> There seems to be no change in the code responsible for generating this path during the build in /project/SparkBuild.scala:
> Line # 252:
> {code:java}
> scalacOptions in Compile ++= Seq(
>   s"-target:jvm-${scalacJVMVersion.value}",
>   "-sourcepath", (baseDirectory in ThisBuild).value.getAbsolutePath // Required for relative source links in scaladoc
> ),
> {code}
> Line # 726:
> {code:java}
> // Use GitHub repository for Scaladoc source links
> unidocSourceBase := s"https://github.com/apache/spark/tree/v${version.value}",
> scalacOptions in (ScalaUnidoc, unidoc) ++= Seq(
>   "-groups", // Group similar methods together based on the @group annotation.
>   "-skip-packages", "org.apache.hadoop"
> ) ++ (
>   // Add links to sources when generating Scaladoc for a non-snapshot release
>   if (!isSnapshot.value) {
>     Opts.doc.sourceUrl(unidocSourceBase.value + "€{FILE_PATH}.scala")
>   } else {
>     Seq()
>   }
> )
> {code}
> It seems more like a developer's dev environment issue. I was able to reproduce this in my dev environment. Environment details attached.
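[Editor's note] Since the root cause here is the hardcoded ".scala" suffix in the sourceUrl template, one possible direction for a fix is the €{FILE_PATH_EXT} scaladoc variable, which later Scala releases added precisely so build tools stop appending a fixed extension. This is a hypothetical sketch, not the change that was actually merged, and it assumes the scaladoc version in use supports that variable:

```
// Hypothetical: €{FILE_PATH_EXT} expands to the source path *with* its real
// extension (.scala or .java), so Java sources under src/main/java would
// link correctly instead of getting ".scala" appended.
scalacOptions in (ScalaUnidoc, unidoc) ++= (
  if (!isSnapshot.value) {
    Opts.doc.sourceUrl(unidocSourceBase.value + "€{FILE_PATH_EXT}")
  } else {
    Seq()
  }
)
```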
[jira] [Updated] (SPARK-23733) Broken link to java source code in Spark Scala api Scaladoc
[ https://issues.apache.org/jira/browse/SPARK-23733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yogesh Tewari updated SPARK-23733:
----------------------------------
    Description:

Java source code link in Spark api scaladoc is broken.

The relative path expression "€{FILE_PATH}.scala" in [https://github.com/apache/spark/blob/master/project/SparkBuild.scala] has ".scala" hardcoded in the end. If I try to access the source link on [https://spark.apache.org/docs/1.6.3/api/scala/index.html#org.apache.spark.api.java.function.Function2], it tries to take me to [/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/Accumulable.scala" class="external-link" rel="nofollow">https://github.com/apache/spark/tree/v2.3.0{color:#ff}/Users/sameera/dev/spark{color}/core/src/main/scala/org/apache/spark/Accumulable.scala|https://github.com/apache/spark/tree/v2.3.0%3Cfont%20color=] where "/Users/sameera/dev/spark" portion of the URL is coming from the developers macos home folder.

There seems to be no change in the code responsible for generating this path during the build in /project/SparkBuild.scala:

Line # 252:
{code:java}
scalacOptions in Compile ++= Seq(
  s"-target:jvm-${scalacJVMVersion.value}",
  "-sourcepath", (baseDirectory in ThisBuild).value.getAbsolutePath // Required for relative source links in scaladoc
),
{code}

Line # 726:
{code:java}
// Use GitHub repository for Scaladoc source links
unidocSourceBase := s"https://github.com/apache/spark/tree/v${version.value}",
scalacOptions in (ScalaUnidoc, unidoc) ++= Seq(
  "-groups", // Group similar methods together based on the @group annotation.
  "-skip-packages", "org.apache.hadoop"
) ++ (
  // Add links to sources when generating Scaladoc for a non-snapshot release
  if (!isSnapshot.value) {
    Opts.doc.sourceUrl(unidocSourceBase.value + "€{FILE_PATH}.scala")
  } else {
    Seq()
  }
)
{code}

It seems more like a developers dev environment issue. I was successfully able to reproduce this in my dev environment. Environment details attached.

was:

Scala source code link in Spark api scaladoc is broken.

Turns out instead of the relative path to the scala files the "€{FILE_PATH}.scala" expression in [https://github.com/apache/spark/blob/master/project/SparkBuild.scala] is generating the absolute path from the developers computer. In this case, if I try to access the source link on [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.Accumulable], it tries to take me to [/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/Accumulable.scala" class="external-link" rel="nofollow">https://github.com/apache/spark/tree/v2.3.0{color:#ff}/Users/sameera/dev/spark{color}/core/src/main/scala/org/apache/spark/Accumulable.scala|https://github.com/apache/spark/tree/v2.3.0%3Cfont%20color=] where "/Users/sameera/dev/spark" portion of the URL is coming from the developers macos home folder.

There seems to be no change in the code responsible for generating this path during the build in /project/SparkBuild.scala:

Line # 252:
{code:java}
scalacOptions in Compile ++= Seq(
  s"-target:jvm-${scalacJVMVersion.value}",
  "-sourcepath", (baseDirectory in ThisBuild).value.getAbsolutePath // Required for relative source links in scaladoc
),
{code}

Line # 726:
{code:java}
// Use GitHub repository for Scaladoc source links
unidocSourceBase := s"https://github.com/apache/spark/tree/v${version.value}",
scalacOptions in (ScalaUnidoc, unidoc) ++= Seq(
  "-groups", // Group similar methods together based on the @group annotation.
  "-skip-packages", "org.apache.hadoop"
) ++ (
  // Add links to sources when generating Scaladoc for a non-snapshot release
  if (!isSnapshot.value) {
    Opts.doc.sourceUrl(unidocSourceBase.value + "€{FILE_PATH}.scala")
  } else {
    Seq()
  }
)
{code}

It seems more like a developers dev environment issue. I was successfully able to reproduce this in my dev environment. Environment details attached.

> Broken link to java source code in Spark Scala api Scaladoc
> -----------------------------------------------------------
>
>              Key: SPARK-23733
>              URL: https://issues.apache.org/jira/browse/SPARK-23733
>          Project: Spark
>       Issue Type: Bug
>       Components: Build, Documentation, Project Infra
> Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0
>      Environment:
> {code:java}
> ~/spark/docs$ cat /etc/*release*
> DISTRIB_ID=Ubuntu
> DISTRIB_RELEASE=16.04
> DISTRIB_CODENAME=xenial
> DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"
> NAME="Ubuntu"
> VERSION="16.04.4 LTS (Xenial Xerus)"
> ID=ubuntu
> ID_LIKE=debian
> PRETTY_NAME="Ubuntu 16.04.4 LTS"
> VERSION_ID="16.04"
> HOME_URL="http://www.ubuntu.com/"
> SUPPORT_URL="http://help.ubuntu.com/"
> BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
> VERSION_CODENAME=xenial
> UBUNTU_CODENAME=xenial
> {code}
> Using spark packaged sbt.
> Other versions:
> {code:java}
> ~/spark/docs$ ruby -v
> ruby
[jira] [Updated] (SPARK-23733) Broken link to java source code in Spark Scala api Scaladoc
[ https://issues.apache.org/jira/browse/SPARK-23733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yogesh Tewari updated SPARK-23733: -- Affects Version/s: (was: 2.3.1) 1.6.3 2.0.2 2.1.2 2.2.0 > Broken link to java source code in Spark Scala api Scaladoc > --- > > Key: SPARK-23733 > URL: https://issues.apache.org/jira/browse/SPARK-23733 > Project: Spark > Issue Type: Bug > Components: Build, Documentation, Project Infra >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0 > Environment: {code:java} > ~/spark/docs$ cat /etc/*release* > DISTRIB_ID=Ubuntu > DISTRIB_RELEASE=16.04 > DISTRIB_CODENAME=xenial > DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS" > NAME="Ubuntu" > VERSION="16.04.4 LTS (Xenial Xerus)" > ID=ubuntu > ID_LIKE=debian > PRETTY_NAME="Ubuntu 16.04.4 LTS" > VERSION_ID="16.04" > HOME_URL="http://www.ubuntu.com/" > SUPPORT_URL="http://help.ubuntu.com/" > BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/" > VERSION_CODENAME=xenial > UBUNTU_CODENAME=xenial > {code} > Using spark packaged sbt. > Other versions: > {code:java} > ~/spark/docs$ ruby -v > ruby 2.3.1p112 (2016-04-26) [x86_64-linux-gnu] > ~/spark/docs$ gem -v > 2.5.2.1 > ~/spark/docs$ jekyll -v > jekyll 3.7.3 > ~/spark/docs$ java -version > java version "1.8.0_112" Java(TM) SE Runtime Environment (build > 1.8.0_112-b15) Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed > mode) > {code} >Reporter: Yogesh Tewari >Priority: Trivial > Labels: build, documentation, scaladocs > > Scala source code link in the Spark API scaladoc is broken. > Instead of the relative path to the Scala files, the > "€\{FILE_PATH}.scala" expression in > [https://github.com/apache/spark/blob/master/project/SparkBuild.scala] > generates the absolute path from the developer's computer. 
In this case, if I > try to access the source link on > [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.Accumulable], > it takes me to > [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/Accumulable.scala], > where the "/Users/sameera/dev/spark" portion of the URL comes from the > developer's macOS home folder. > The code in /project/SparkBuild.scala responsible for generating this path > during the build appears unchanged. > Line 252: > {code:java} > scalacOptions in Compile ++= Seq( > s"-target:jvm-${scalacJVMVersion.value}", > "-sourcepath", (baseDirectory in ThisBuild).value.getAbsolutePath // Required > for relative source links in scaladoc > ), > {code} > Line 726: > {code:java} > // Use GitHub repository for Scaladoc source links > unidocSourceBase := s"https://github.com/apache/spark/tree/v${version.value}", > scalacOptions in (ScalaUnidoc, unidoc) ++= Seq( > "-groups", // Group similar methods together based on the @group annotation. > "-skip-packages", "org.apache.hadoop" > ) ++ ( > // Add links to sources when generating Scaladoc for a non-snapshot release > if (!isSnapshot.value) { > Opts.doc.sourceUrl(unidocSourceBase.value + "€{FILE_PATH}.scala") > } else { > Seq() > } > ){code} > > It seems more like a developer environment issue. > I was able to reproduce it in my dev environment. Environment > details attached. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23733) Broken link to java source code in Spark Scala api Scaladoc
Yogesh Tewari created SPARK-23733: - Summary: Broken link to java source code in Spark Scala api Scaladoc Key: SPARK-23733 URL: https://issues.apache.org/jira/browse/SPARK-23733 Project: Spark Issue Type: Bug Components: Build, Documentation, Project Infra Affects Versions: 2.3.0, 2.3.1 Environment: {code:java} ~/spark/docs$ cat /etc/*release* DISTRIB_ID=Ubuntu DISTRIB_RELEASE=16.04 DISTRIB_CODENAME=xenial DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS" NAME="Ubuntu" VERSION="16.04.4 LTS (Xenial Xerus)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 16.04.4 LTS" VERSION_ID="16.04" HOME_URL="http://www.ubuntu.com/" SUPPORT_URL="http://help.ubuntu.com/" BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/" VERSION_CODENAME=xenial UBUNTU_CODENAME=xenial {code} Using spark packaged sbt. Other versions: {code:java} ~/spark/docs$ ruby -v ruby 2.3.1p112 (2016-04-26) [x86_64-linux-gnu] ~/spark/docs$ gem -v 2.5.2.1 ~/spark/docs$ jekyll -v jekyll 3.7.3 ~/spark/docs$ java -version java version "1.8.0_112" Java(TM) SE Runtime Environment (build 1.8.0_112-b15) Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode) {code} Reporter: Yogesh Tewari Scala source code link in the Spark API scaladoc is broken. Instead of the relative path to the Scala files, the "€\{FILE_PATH}.scala" expression in [https://github.com/apache/spark/blob/master/project/SparkBuild.scala] generates the absolute path from the developer's computer. 
In this case, if I try to access the source link on [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.Accumulable], it takes me to [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/Accumulable.scala], where the "/Users/sameera/dev/spark" portion of the URL comes from the developer's macOS home folder. The code in /project/SparkBuild.scala responsible for generating this path during the build appears unchanged. Line 252: {code:java} scalacOptions in Compile ++= Seq( s"-target:jvm-${scalacJVMVersion.value}", "-sourcepath", (baseDirectory in ThisBuild).value.getAbsolutePath // Required for relative source links in scaladoc ), {code} Line 726: {code:java} // Use GitHub repository for Scaladoc source links unidocSourceBase := s"https://github.com/apache/spark/tree/v${version.value}", scalacOptions in (ScalaUnidoc, unidoc) ++= Seq( "-groups", // Group similar methods together based on the @group annotation. "-skip-packages", "org.apache.hadoop" ) ++ ( // Add links to sources when generating Scaladoc for a non-snapshot release if (!isSnapshot.value) { Opts.doc.sourceUrl(unidocSourceBase.value + "€{FILE_PATH}.scala") } else { Seq() } ){code} It seems more like a developer environment issue. I was able to reproduce it in my dev environment. Environment details attached. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23732) Broken link to scala source code in Spark Scala api Scaladoc
[ https://issues.apache.org/jira/browse/SPARK-23732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yogesh Tewari updated SPARK-23732: -- Description: Scala source code link in the Spark API scaladoc is broken. Instead of the relative path to the Scala files, the "€\{FILE_PATH}.scala" expression in [https://github.com/apache/spark/blob/master/project/SparkBuild.scala] generates the absolute path from the developer's computer. In this case, if I try to access the source link on [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.Accumulable], it takes me to [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/Accumulable.scala], where the "/Users/sameera/dev/spark" portion of the URL comes from the developer's macOS home folder. The code in /project/SparkBuild.scala responsible for generating this path during the build appears unchanged. Line 252: {code:java} scalacOptions in Compile ++= Seq( s"-target:jvm-${scalacJVMVersion.value}", "-sourcepath", (baseDirectory in ThisBuild).value.getAbsolutePath // Required for relative source links in scaladoc ), {code} Line 726: {code:java} // Use GitHub repository for Scaladoc source links unidocSourceBase := s"https://github.com/apache/spark/tree/v${version.value}", scalacOptions in (ScalaUnidoc, unidoc) ++= Seq( "-groups", // Group similar methods together based on the @group annotation. "-skip-packages", "org.apache.hadoop" ) ++ ( // Add links to sources when generating Scaladoc for a non-snapshot release if (!isSnapshot.value) { Opts.doc.sourceUrl(unidocSourceBase.value + "€{FILE_PATH}.scala") } else { Seq() } ){code} It seems more like a developer environment issue. 
I was able to reproduce it in my dev environment. Environment details attached. was: Scala source code link in the Spark API scaladoc is broken. Instead of the relative path to the Scala files, the "€\{FILE_PATH}.scala" expression in [https://github.com/apache/spark/blob/master/project/SparkBuild.scala] generates the absolute path from the developer's computer. In this case, if I try to access the source link on [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.Accumulable], it takes me to [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/Accumulable.scala], where the "/Users/sameera/dev/spark" portion of the URL comes from the developer's macOS home folder. The code in /project/SparkBuild.scala responsible for generating this path during the build appears unchanged. Line 252: {code:java} scalacOptions in Compile ++= Seq( s"-target:jvm-${scalacJVMVersion.value}", "-sourcepath", (baseDirectory in ThisBuild).value.getAbsolutePath // Required for relative source links in scaladoc ), {code} Line 726: {code:java} // Use GitHub repository for Scaladoc source links unidocSourceBase := s"https://github.com/apache/spark/tree/v${version.value}", scalacOptions in (ScalaUnidoc, unidoc) ++= Seq( "-groups", // Group similar methods together based on the @group annotation. "-skip-packages", "org.apache.hadoop" ) ++ ( // Add links to sources when generating Scaladoc for a non-snapshot release if (!isSnapshot.value) { Opts.doc.sourceUrl(unidocSourceBase.value + "€{FILE_PATH}.scala") } else { Seq() } ){code} It seems more like a developer environment issue. I was able to reproduce it in my dev environment. 
> Broken link to scala source code in Spark Scala api Scaladoc > > > Key: SPARK-23732 > URL: https://issues.apache.org/jira/browse/SPARK-23732 > Project: Spark > Issue Type: Bug > Components: Build, Documentation, Project Infra >Affects Versions: 2.3.0, 2.3.1 > Environment: {code:java} > ~/spark/docs$ cat /etc/*release* > DISTRIB_ID=Ubuntu > DISTRIB_RELEASE=16.04 > DISTRIB_CODENAME=xenial > DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS" > NAME="Ubuntu" > VERSION="16.04.4 LTS (Xenial Xerus)" > ID=ubuntu > ID_LIKE=debian > PRETTY_NAME="Ubuntu 16.04.4 LTS" > VERSION_ID="16.04" > HOME_URL="http://www.ubuntu.com/" > SUPPORT_URL="http://help.ubuntu.com/" > BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/" > VERSION_CODENAME=xenial > UBUNTU_CODENAME=xenial > {code} > Using spark packaged sbt. > Other versions: > {code:java} > ~/spark/docs$ ruby -v > ruby 2.3.1p112 (2016-04-26) [x86_64-linux-gnu] > ~/spark/docs$ gem -v > 2.5.2.1 > ~/spark/docs$ jekyll -v > jekyll 3.7.3 > ~/spark/docs$ java -version
[jira] [Created] (SPARK-23732) Broken link to scala source code in Spark Scala api Scaladoc
Yogesh Tewari created SPARK-23732: - Summary: Broken link to scala source code in Spark Scala api Scaladoc Key: SPARK-23732 URL: https://issues.apache.org/jira/browse/SPARK-23732 Project: Spark Issue Type: Bug Components: Build, Documentation, Project Infra Affects Versions: 2.3.0, 2.3.1 Environment: {code:java} ~/spark/docs$ cat /etc/*release* DISTRIB_ID=Ubuntu DISTRIB_RELEASE=16.04 DISTRIB_CODENAME=xenial DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS" NAME="Ubuntu" VERSION="16.04.4 LTS (Xenial Xerus)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 16.04.4 LTS" VERSION_ID="16.04" HOME_URL="http://www.ubuntu.com/" SUPPORT_URL="http://help.ubuntu.com/" BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/" VERSION_CODENAME=xenial UBUNTU_CODENAME=xenial {code} Using spark packaged sbt. Other versions: {code:java} ~/spark/docs$ ruby -v ruby 2.3.1p112 (2016-04-26) [x86_64-linux-gnu] ~/spark/docs$ gem -v 2.5.2.1 ~/spark/docs$ jekyll -v jekyll 3.7.3 ~/spark/docs$ java -version java version "1.8.0_112" Java(TM) SE Runtime Environment (build 1.8.0_112-b15) Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode) {code} Reporter: Yogesh Tewari Scala source code link in the Spark API scaladoc is broken. Instead of the relative path to the Scala files, the "€\{FILE_PATH}.scala" expression in [https://github.com/apache/spark/blob/master/project/SparkBuild.scala] generates the absolute path from the developer's computer. In this case, if I try to access the source link on [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.Accumulable], it takes me to [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/Accumulable.scala], where the "/Users/sameera/dev/spark" portion of the URL comes from the developer's macOS home folder. 
The code in /project/SparkBuild.scala responsible for generating this path during the build appears unchanged. Line 252: {code:java} scalacOptions in Compile ++= Seq( s"-target:jvm-${scalacJVMVersion.value}", "-sourcepath", (baseDirectory in ThisBuild).value.getAbsolutePath // Required for relative source links in scaladoc ), {code} Line 726: {code:java} // Use GitHub repository for Scaladoc source links unidocSourceBase := s"https://github.com/apache/spark/tree/v${version.value}", scalacOptions in (ScalaUnidoc, unidoc) ++= Seq( "-groups", // Group similar methods together based on the @group annotation. "-skip-packages", "org.apache.hadoop" ) ++ ( // Add links to sources when generating Scaladoc for a non-snapshot release if (!isSnapshot.value) { Opts.doc.sourceUrl(unidocSourceBase.value + "€{FILE_PATH}.scala") } else { Seq() } ){code} It seems more like a developer environment issue. I was able to reproduce it in my dev environment. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
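The failure mode reported above can be modeled outside sbt. The sketch below is an illustration only, not Spark's or scalac's actual implementation: scaladoc's `-doc-source-url` substitutes `€{FILE_PATH}` with each source file's path made relative to `-sourcepath`, so if the published docs were generated against a different checkout root than the one requested, the prefix strip fails and the builder's absolute path (here the reported `/Users/sameera/dev/spark`) leaks into the GitHub URL. The `/build/spark` paths are invented for the example.

```python
# Hypothetical model of the FILE_PATH substitution (illustration only).
def source_url(base: str, sourcepath: str, abs_file: str) -> str:
    if abs_file.startswith(sourcepath + "/"):
        rel = abs_file[len(sourcepath):]   # normal case: path relative to -sourcepath
    else:
        rel = abs_file                     # mismatch: absolute path leaks into the URL
    return base + rel + ".scala"

base = "https://github.com/apache/spark/tree/v2.3.0"
good = source_url(base, "/build/spark",
                  "/build/spark/core/src/main/scala/org/apache/spark/Accumulable")
bad = source_url(base, "/build/spark",
                 "/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/Accumulable")
# good -> .../tree/v2.3.0/core/src/main/scala/org/apache/spark/Accumulable.scala
# bad  -> .../tree/v2.3.0/Users/sameera/dev/spark/core/.../Accumulable.scala
```

Under this reading, the fix is to ensure the `-sourcepath` baked into the published scaladoc matches the directory the release docs are actually built from.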
[jira] [Comment Edited] (SPARK-23650) Slow SparkR udf (dapply)
[ https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404213#comment-16404213 ] Deepansh edited comment on SPARK-23650 at 3/18/18 10:16 PM: I tried reading the model in UDF, but for every new stream, the model is being read which is adding an overhead (~2s). IMO The problem here is that R environment inside the thread for applying UDF is not getting cached. It is created and destroyed with each query. Attached - logs To overcome the problem, I was using broadcast, as technically broadcast is done only once to the executors. was (Author: litup): I tried reading the model in UDF, but for every new stream, the model is being read which is adding an overhead (~2s). IMO The problem here is the R environment is not getting cached. It is created and destroyed with each query. Attached - logs To overcome the problem, I was using broadcast, as technically broadcast is done only once to the executors. > Slow SparkR udf (dapply) > > > Key: SPARK-23650 > URL: https://issues.apache.org/jira/browse/SPARK-23650 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, SparkR, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Deepansh >Priority: Major > Attachments: read_model_in_udf.txt, sparkR_log2.txt, sparkRlag.txt > > > For eg, I am getting streams from Kafka and I want to implement a model made > in R for those streams. For this, I am using dapply. 
> My code is: 
> iris_model <- readRDS("./iris_model.rds") 
> randomBr <- SparkR:::broadcast(sc, iris_model) 
> kafka <- read.stream("kafka", subscribe = "source", kafka.bootstrap.servers = 
> "localhost:9092", topic = "source") 
> lines <- select(kafka, cast(kafka$value, "string")) 
> schema <- schema(lines) 
> df1 <- dapply(lines, function(x) { 
> i_model <- SparkR:::value(randomMatBr) 
> for (row in 1:nrow(x)) { 
> y <- fromJSON(as.character(x[row, "value"])) 
> y$predict = predict(i_model, y) 
> y <- toJSON(y) 
> x[row, "value"] = y 
> } 
> x 
> }, schema) 
> Every time Kafka streams are fetched, the dapply method creates a new 
> runner thread and ships the variables again, which causes a huge lag (~2s for 
> shipping the model) every time. I even tried without broadcast variables, but it 
> takes the same time to ship them. Can other techniques be applied to 
> improve its performance? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23650) Slow SparkR udf (dapply)
[ https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deepansh updated SPARK-23650: - Attachment: read_model_in_udf.txt > Slow SparkR udf (dapply) > > > Key: SPARK-23650 > URL: https://issues.apache.org/jira/browse/SPARK-23650 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, SparkR, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Deepansh >Priority: Major > Attachments: read_model_in_udf.txt, sparkR_log2.txt, sparkRlag.txt > > > For eg, I am getting streams from Kafka and I want to implement a model made > in R for those streams. For this, I am using dapply. > My code is: > iris_model <- readRDS("./iris_model.rds") > randomBr <- SparkR:::broadcast(sc, iris_model) > kafka <- read.stream("kafka",subscribe = "source", kafka.bootstrap.servers = > "localhost:9092", topic = "source") > lines<- select(kafka, cast(kafka$value, "string")) > schema<-schema(lines) > df1<-dapply(lines,function(x){ > i_model<-SparkR:::value(randomMatBr) > for (row in 1:nrow(x)) > { y<-fromJSON(as.character(x[row,"value"])) y$predict=predict(i_model,y) > y<-toJSON(y) x[row,"value"] = y } > x > },schema) > Every time when Kafka streams are fetched the dapply method creates new > runner thread and ships the variables again, which causes a huge lag(~2s for > shipping model) every time. I even tried without broadcast variables but it > takes same time to ship variables. Can some other techniques be applied to > improve its performance? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23650) Slow SparkR udf (dapply)
[ https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404213#comment-16404213 ] Deepansh commented on SPARK-23650: -- I tried reading the model in UDF, but for every new stream, the model is being read which is adding an overhead (~2s). IMO The problem here is the R environment is not getting cached. It is created and destroyed with each query. Attached - logs To overcome the problem, I was using broadcast, as technically broadcast is done only once to the executors. > Slow SparkR udf (dapply) > > > Key: SPARK-23650 > URL: https://issues.apache.org/jira/browse/SPARK-23650 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, SparkR, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Deepansh >Priority: Major > Attachments: sparkR_log2.txt, sparkRlag.txt > > > For eg, I am getting streams from Kafka and I want to implement a model made > in R for those streams. For this, I am using dapply. > My code is: > iris_model <- readRDS("./iris_model.rds") > randomBr <- SparkR:::broadcast(sc, iris_model) > kafka <- read.stream("kafka",subscribe = "source", kafka.bootstrap.servers = > "localhost:9092", topic = "source") > lines<- select(kafka, cast(kafka$value, "string")) > schema<-schema(lines) > df1<-dapply(lines,function(x){ > i_model<-SparkR:::value(randomMatBr) > for (row in 1:nrow(x)) > { y<-fromJSON(as.character(x[row,"value"])) y$predict=predict(i_model,y) > y<-toJSON(y) x[row,"value"] = y } > x > },schema) > Every time when Kafka streams are fetched the dapply method creates new > runner thread and ships the variables again, which causes a huge lag(~2s for > shipping model) every time. I even tried without broadcast variables but it > takes same time to ship variables. Can some other techniques be applied to > improve its performance? 
[jira] [Commented] (SPARK-23650) Slow SparkR udf (dapply)
[ https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404198#comment-16404198 ] Felix Cheung commented on SPARK-23650: -- Is there a reason for the broadcast? Could you instead distribute the .rds to all the executor and then call readRDS from within your UDF? I understand this approach has been done quite a bit. > Slow SparkR udf (dapply) > > > Key: SPARK-23650 > URL: https://issues.apache.org/jira/browse/SPARK-23650 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, SparkR, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Deepansh >Priority: Major > Attachments: sparkR_log2.txt, sparkRlag.txt > > > For eg, I am getting streams from Kafka and I want to implement a model made > in R for those streams. For this, I am using dapply. > My code is: > iris_model <- readRDS("./iris_model.rds") > randomBr <- SparkR:::broadcast(sc, iris_model) > kafka <- read.stream("kafka",subscribe = "source", kafka.bootstrap.servers = > "localhost:9092", topic = "source") > lines<- select(kafka, cast(kafka$value, "string")) > schema<-schema(lines) > df1<-dapply(lines,function(x){ > i_model<-SparkR:::value(randomMatBr) > for (row in 1:nrow(x)) > { y<-fromJSON(as.character(x[row,"value"])) y$predict=predict(i_model,y) > y<-toJSON(y) x[row,"value"] = y } > x > },schema) > Every time when Kafka streams are fetched the dapply method creates new > runner thread and ships the variables again, which causes a huge lag(~2s for > shipping model) every time. I even tried without broadcast variables but it > takes same time to ship variables. Can some other techniques be applied to > improve its performance? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23650) Slow SparkR udf (dapply)
[ https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404183#comment-16404183 ] Deepansh commented on SPARK-23650: -- Is there any other way to implement my use case with minimum(ms) overhead. Use case - Input data from Kafka streams and apply a native R model for them and return the prediction to Kafka sink/any other sink. > Slow SparkR udf (dapply) > > > Key: SPARK-23650 > URL: https://issues.apache.org/jira/browse/SPARK-23650 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, SparkR, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Deepansh >Priority: Major > Attachments: sparkR_log2.txt, sparkRlag.txt > > > For eg, I am getting streams from Kafka and I want to implement a model made > in R for those streams. For this, I am using dapply. > My code is: > iris_model <- readRDS("./iris_model.rds") > randomBr <- SparkR:::broadcast(sc, iris_model) > kafka <- read.stream("kafka",subscribe = "source", kafka.bootstrap.servers = > "localhost:9092", topic = "source") > lines<- select(kafka, cast(kafka$value, "string")) > schema<-schema(lines) > df1<-dapply(lines,function(x){ > i_model<-SparkR:::value(randomMatBr) > for (row in 1:nrow(x)) > { y<-fromJSON(as.character(x[row,"value"])) y$predict=predict(i_model,y) > y<-toJSON(y) x[row,"value"] = y } > x > },schema) > Every time when Kafka streams are fetched the dapply method creates new > runner thread and ships the variables again, which causes a huge lag(~2s for > shipping model) every time. I even tried without broadcast variables but it > takes same time to ship variables. Can some other techniques be applied to > improve its performance? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
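The overhead the thread is circling — the model being re-read on every micro-batch because the R worker environment is created and destroyed per query — can be sketched generically. This is a hedged illustration in plain Python, not SparkR's internals: `load_model`, the call counter, and the batch loop are invented stand-ins for `readRDS` and the per-query worker setup.

```python
import functools

CALLS = {"n": 0}  # counts how often the (expensive) deserialization runs

def load_model(path):
    # Stand-in for readRDS(): in the report this costs ~2s per call.
    CALLS["n"] += 1
    return {"weights": [1, 2, 3]}

@functools.lru_cache(maxsize=None)
def load_model_cached(path):
    # Process-level cache: pays the load cost once per worker, not per batch.
    return load_model(path)

def score_batch(rows, loader, path="iris_model.rds"):
    model = loader(path)  # runs at the start of every micro-batch
    return [sum(model["weights"]) * r for r in rows]

for batch in ([1, 2], [3], [4, 5]):
    score_batch(batch, load_model)          # uncached: one load per batch
n_uncached = CALLS["n"]

CALLS["n"] = 0
for batch in ([1, 2], [3], [4, 5]):
    score_batch(batch, load_model_cached)   # cached: a single load
n_cached = CALLS["n"]
```

If the worker environment survived across queries, the cached behavior is what the commenter is asking for; broadcast avoids re-reading from disk but, per the report, the value is still shipped to the fresh environment each time.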
[jira] [Commented] (SPARK-23645) pandas_udf can not be called with keyword arguments
[ https://issues.apache.org/jira/browse/SPARK-23645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404126#comment-16404126 ] Stu (Michael Stewart) commented on SPARK-23645: --- {quote}Sounds a good to do if the change is minimal but if the change is big, I doubt if this is something we should support. Documenting this might be good enough for now. {quote} Definitely a nontrivial change after digging all the way down. I've updated PR. > pandas_udf can not be called with keyword arguments > --- > > Key: SPARK-23645 > URL: https://issues.apache.org/jira/browse/SPARK-23645 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 > Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, > OpenJDK 64-Bit Server VM, 1.8.0_141 >Reporter: Stu (Michael Stewart) >Priority: Minor > > pandas_udf (all python udfs(?)) do not accept keyword arguments because > `pyspark/sql/udf.py` class `UserDefinedFunction` has __call__, and also > wrapper utility methods, that only accept args and not kwargs: > @ line 168: > {code:java} > ... > def __call__(self, *cols): > judf = self._judf > sc = SparkContext._active_spark_context > return Column(judf.apply(_to_seq(sc, cols, _to_java_column))) > # This function is for improving the online help system in the interactive > interpreter. > # For example, the built-in help / pydoc.help. It wraps the UDF with the > docstring and > # argument annotation. (See: SPARK-19161) > def _wrapped(self): > """ > Wrap this udf with a function and attach docstring from func > """ > # It is possible for a callable instance without __name__ attribute or/and > # __module__ attribute to be wrapped here. For example, > functools.partial. In this case, > # we should avoid wrapping the attributes from the wrapped function to > the wrapper > # function. So, we take out these attribute names from the default names > to set and > # then manually assign it after being wrapped. 
> assignments = tuple( > a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != > '__module__') > @functools.wraps(self.func, assigned=assignments) > def wrapper(*args): > return self(*args) > ...{code} > as seen in: > {code:java} > from pyspark.sql import SparkSession > from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit > spark = SparkSession.builder.getOrCreate() > df = spark.range(12).withColumn('b', col('id') * 2) > def ok(a,b): return a*b > df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show() > # no problems > df.withColumn('ok', pandas_udf(f=ok, > returnType='bigint')(a='id',b='b')).show() # fail with ~no stacktrace thanks > to wrapper helper > --- > TypeError Traceback (most recent call last) > in () > > 1 df.withColumn('ok', pandas_udf(f=ok, > returnType='bigint')(a='id',b='b')).show() > TypeError: wrapper() got an unexpected keyword argument 'a'{code} > > > *discourse*: it isn't difficult to swap back in the kwargs, allowing the UDF > to be called as such, but the cols tuple that gets passed in the call method: > {code:java} > _to_seq(sc, cols, _to_java_column{code} > has to be in the right order based on the functions defined argument inputs, > or the function will return incorrect results. so, the challenge here is to: > (a) make sure to reconstruct the proper order of the full args/kwargs > --> args first, and then kwargs (not in the order passed but in the order > requested by the fn) > (b) handle python2 and python3 `inspect` module inconsistencies -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
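The root cause is reproducible without PySpark. In this hedged sketch, `udf_wrap` mimics the `*args`-only wrapper quoted above, and `udf_wrap_kwargs` is one hypothetical way to accept keywords and restore the declared argument order via `inspect.signature` — the two challenges (a) and (b) the discourse raises — not the actual PySpark change.

```python
import functools
import inspect

def udf_wrap(func):
    # Mimics the quoted _wrapped helper: only positional args are forwarded,
    # so keyword calls raise TypeError before func is ever reached.
    @functools.wraps(func)
    def wrapper(*args):
        return func(*args)
    return wrapper

def udf_wrap_kwargs(func):
    # Hypothetical fix: bind args+kwargs against func's signature and forward
    # them positionally, in the order func declares its parameters.
    sig = inspect.signature(func)
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        return func(*bound.args)
    return wrapper

def ok(a, b):
    return (a, b)

# udf_wrap(ok)(a=1, b=2)        -> TypeError: wrapper() got an unexpected keyword argument 'a'
# udf_wrap_kwargs(ok)(b=2, a=1) -> (1, 2): keywords reordered to declaration order
```

`Signature.bind` also raises a clear error for missing or duplicate arguments, which is why reordering through it is safer than forwarding `**kwargs` directly into a positional call.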
[jira] [Commented] (SPARK-23731) FileSourceScanExec throws NullPointerException in subexpression elimination
[ https://issues.apache.org/jira/browse/SPARK-23731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404067#comment-16404067 ] Apache Spark commented on SPARK-23731: -- User 'jaceklaskowski' has created a pull request for this issue: https://github.com/apache/spark/pull/20856
[jira] [Assigned] (SPARK-23731) FileSourceScanExec throws NullPointerException in subexpression elimination
[ https://issues.apache.org/jira/browse/SPARK-23731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23731: Assignee: Apache Spark
[jira] [Assigned] (SPARK-23731) FileSourceScanExec throws NullPointerException in subexpression elimination
[ https://issues.apache.org/jira/browse/SPARK-23731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23731: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-23731) FileSourceScanExec throws NullPointerException in subexpression elimination
[ https://issues.apache.org/jira/browse/SPARK-23731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404062#comment-16404062 ] Apache Spark commented on SPARK-23731: -- User 'jaceklaskowski' has created a pull request for this issue: https://github.com/apache/spark/pull/20855
[jira] [Created] (SPARK-23731) FileSourceScanExec throws NullPointerException in subexpression elimination
Jacek Laskowski created SPARK-23731:
---

Summary: FileSourceScanExec throws NullPointerException in subexpression elimination
Key: SPARK-23731
URL: https://issues.apache.org/jira/browse/SPARK-23731
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.0, 2.2.1, 2.3.1
Reporter: Jacek Laskowski

While working with a SQL query with many {{CASE WHEN}} expressions and {{ScalarSubqueries}} I faced the following exception (in Spark 2.3.0):

{code:java}
Caused by: java.lang.NullPointerException
	at org.apache.spark.sql.execution.FileSourceScanExec.<init>(DataSourceScanExec.scala:167)
	at org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:502)
	at org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:158)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.immutable.List.map(List.scala:285)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.immutable.List.map(List.scala:285)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.immutable.List.map(List.scala:285)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.sameResult(QueryPlan.scala:257)
	at org.apache.spark.sql.execution.ScalarSubquery.semanticEquals(subquery.scala:58)
	at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions$Expr.equals(EquivalentExpressions.scala:36)
	at scala.collection.mutable.HashTable$class.elemEquals(HashTable.scala:358)
	at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:40)
	at scala.collection.mutable.HashTable$class.scala$collection$mutable$HashTable$$findEntry0(HashTable.scala:136)
	at scala.collection.mutable.HashTable$class.findEntry(HashTable.scala:132)
	at scala.collection.mutable.HashMap.findEntry(HashMap.scala:40)
	at scala.collection.mutable.HashMap.get(HashMap.scala:70)
	at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExpr(EquivalentExpressions.scala:54)
	at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExprTree(EquivalentExpressions.scala:95)
	at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions$$anonfun$addExprTree$1.apply(EquivalentExpressions.scala:96)
	at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions$$anonfun$addExprTree$1.apply(EquivalentExpressions.scala:96)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExprTree(EquivalentExpressions.scala:96)
	at
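The trace shows the failure surfacing from a plain hash-map lookup: {{EquivalentExpressions.addExpr}} deduplicates expressions via a map whose equality check ({{ScalarSubquery.semanticEquals}} via {{QueryPlan.sameResult}}) canonicalizes query plans, so a canonicalization bug in {{FileSourceScanExec}} escapes through {{HashMap.get}}. A toy Python model of that mechanism (not Spark's code; every name here is an illustrative stand-in):

```python
# Toy model of how subexpression elimination can surface an error from deep
# inside plan canonicalization: expressions are deduplicated via a hash map
# whose equality check canonicalizes plans, so a canonicalization bug turns
# into a failure inside an ordinary map lookup.

class Plan:
    """Stand-in for a query plan node; `state` is None to mimic a field
    that is unexpectedly missing during canonicalization."""
    def __init__(self, name, state=None):
        self.name = name
        self.state = state

    def canonicalized(self):
        # Mimics doCanonicalize touching a field that may be missing.
        return self.name + "/" + self.state.strip()

class SubqueryExpr:
    """Stand-in for an expression compared by semantic equality."""
    def __init__(self, plan):
        self.plan = plan

    def __hash__(self):
        return 0  # force equality checks, like hash-bucket collisions

    def __eq__(self, other):
        # Mimics semanticEquals -> sameResult: compare canonicalized plans.
        return self.plan.canonicalized() == other.plan.canonicalized()

def add_expr(seen, expr):
    """Mimics EquivalentExpressions.addExpr: a dict lookup drives __eq__."""
    if expr in seen:
        seen[expr] += 1
        return True
    seen[expr] = 1
    return False

seen = {}
add_expr(seen, SubqueryExpr(Plan("scan", state="ok")))
try:
    # Adding a second expression triggers __eq__ -> canonicalized() on a
    # plan whose field is None, analogous to the NPE in the stack trace.
    add_expr(seen, SubqueryExpr(Plan("scan")))
except AttributeError as e:
    print("failed during map lookup:", e)
```

The point of the model is that the exception type comes from the broken plan, not from the map code, which is why the real trace bottoms out in {{FileSourceScanExec}} while the caller is {{HashMap.get}}.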
[jira] [Assigned] (SPARK-23712) Investigate replacing Code Generated UnsafeRowJoiner with an Interpreted version
[ https://issues.apache.org/jira/browse/SPARK-23712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23712: Assignee: Herman van Hovell (was: Apache Spark) > Investigate replacing Code Generated UnsafeRowJoiner with an Interpreted > version > > > Key: SPARK-23712 > URL: https://issues.apache.org/jira/browse/SPARK-23712 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell >Priority: Major > > We currently have a code generated UnsafeRowJoiner. This does not make a lot > of sense since we can write a perfectly good 'interpreted' version. > We should definitely benchmark this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23712) Investigate replacing Code Generated UnsafeRowJoiner with an Interpreted version
[ https://issues.apache.org/jira/browse/SPARK-23712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23712: Assignee: Apache Spark (was: Herman van Hovell)
[jira] [Commented] (SPARK-23712) Investigate replacing Code Generated UnsafeRowJoiner with an Interpreted version
[ https://issues.apache.org/jira/browse/SPARK-23712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404001#comment-16404001 ] Apache Spark commented on SPARK-23712: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/20854
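The trade-off discussed in SPARK-23712 can be made concrete with a toy sketch (plain Python, not Spark's binary {{UnsafeRow}} layout or the actual {{UnsafeRowJoiner}} API): an "interpreted" joiner is one generic loop over the two schemas, while a "code-generated" joiner compiles a specialized function per schema pair.

```python
# Hypothetical sketch of "interpreted" vs. "code-generated" row joining.
# This is NOT Spark's UnsafeRowJoiner (which concatenates binary row
# buffers); it only illustrates the design choice the ticket discusses.

def interpreted_joiner(left_schema, right_schema):
    """One generic function handles any pair of schemas with a loop."""
    def join(left_row, right_row):
        assert len(left_row) == len(left_schema)
        assert len(right_row) == len(right_schema)
        return tuple(left_row) + tuple(right_row)
    return join

def codegen_joiner(left_schema, right_schema):
    """Mimics code generation: build specialized source per schema pair."""
    n, m = len(left_schema), len(right_schema)
    fields = [f"l[{i}]" for i in range(n)] + [f"r[{j}]" for j in range(m)]
    src = f"def join(l, r):\n    return ({', '.join(fields)},)"
    ns = {}
    exec(src, ns)  # compile the specialized joiner once per schema pair
    return ns["join"]

left, right = ("id", "name"), ("score",)
j1 = interpreted_joiner(left, right)
j2 = codegen_joiner(left, right)
print(j1((1, "a"), (0.5,)))  # (1, 'a', 0.5)
print(j2((1, "a"), (0.5,)))  # (1, 'a', 0.5)
```

Both produce identical results; the interesting question (which the ticket asks to benchmark) is whether the per-schema compilation cost of the generated version pays for itself over the generic loop.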
[jira] [Commented] (SPARK-23646) pyspark DataFrameWriter ignores customized settings?
[ https://issues.apache.org/jira/browse/SPARK-23646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403985#comment-16403985 ] Hyukjin Kwon commented on SPARK-23646: -- Thanks for logging here. It should be helpful if anyone faces the same problem.

> pyspark DataFrameWriter ignores customized settings?
> 
> Key: SPARK-23646
> URL: https://issues.apache.org/jira/browse/SPARK-23646
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.2.1
> Reporter: Chuan-Heng Hsiao
> Priority: Major
>
> I am using spark-2.2.1-bin-hadoop2.7 with standalone mode.
> (python version: 3.5.2 from ubuntu 16.04)
> I intended to have DataFrame write to hdfs with a customized block size but failed. However, the corresponding rdd can successfully write with the customized block size.
>
> The following is the test code:
> (dfs.namenode.fs-limits.min-block-size has been set as 131072 in hdfs)
>
> ##
> # init
> ##
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SparkSession
>
> import hdfs
> from hdfs import InsecureClient
> import os
>
> import numpy as np
> import pandas as pd
> import logging
>
> os.environ['SPARK_HOME'] = '/opt/spark-2.2.1-bin-hadoop2.7'
>
> block_size = 512 * 1024
>
> conf = SparkConf().setAppName("DCSSpark").setMaster("spark://spark1:7077").set('spark.cores.max', 20).set("spark.executor.cores", 10).set("spark.executor.memory", "10g").set("spark.hadoop.dfs.blocksize", str(block_size)).set("spark.hadoop.dfs.block.size", str(block_size))
>
> spark = SparkSession.builder.config(conf=conf).getOrCreate()
> spark.sparkContext._jsc.hadoopConfiguration().setInt("dfs.blocksize", block_size)
> spark.sparkContext._jsc.hadoopConfiguration().setInt("dfs.block.size", block_size)
>
> ##
> # main
> ##
> # create DataFrame
> df_txt = spark.createDataFrame([{'temp': "hello"}, {'temp': "world"}, {'temp': "!"}])
>
> # save using DataFrameWriter, resulting in 128MB block size
> df_txt.write.mode('overwrite').format('parquet').save('hdfs://spark1/tmp/temp_with_df')
>
> # save using rdd, resulting in 512k block size
> client = InsecureClient('http://spark1:50070')
> client.delete('/tmp/temp_with_rrd', recursive=True)
> df_txt.rdd.saveAsTextFile('hdfs://spark1/tmp/temp_with_rrd')

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23706) spark.conf.get(value, default=None) should produce None in PySpark
[ https://issues.apache.org/jira/browse/SPARK-23706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23706. -- Resolution: Fixed Fix Version/s: 2.4.0 2.3.1 Fixed in https://github.com/apache/spark/pull/20841 > spark.conf.get(value, default=None) should produce None in PySpark > -- > > Key: SPARK-23706 > URL: https://issues.apache.org/jira/browse/SPARK-23706 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > Fix For: 2.3.1, 2.4.0 > > > Scala: > {code} > scala> spark.conf.get("hey") > java.util.NoSuchElementException: hey > at > org.apache.spark.sql.internal.SQLConf$$anonfun$getConfString$2.apply(SQLConf.scala:1600) > at > org.apache.spark.sql.internal.SQLConf$$anonfun$getConfString$2.apply(SQLConf.scala:1600) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.sql.internal.SQLConf.getConfString(SQLConf.scala:1600) > at org.apache.spark.sql.RuntimeConfig.get(RuntimeConfig.scala:74) > ... 49 elided > scala> spark.conf.get("hey", null) > res1: String = null > scala> spark.conf.get("spark.sql.sources.partitionOverwriteMode", null) > res2: String = null > {code} > Python: > {code} > >>> spark.conf.get("hey") > ... > py4j.protocol.Py4JJavaError: An error occurred while calling o30.get. > : java.util.NoSuchElementException: hey > ... > >>> spark.conf.get("hey", None) > ... > py4j.protocol.Py4JJavaError: An error occurred while calling o30.get. > : java.util.NoSuchElementException: hey > ... > >>> spark.conf.get("spark.sql.sources.partitionOverwriteMode", None) > u'STATIC' > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
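The fix has to distinguish "no default supplied" (raise on a missing key, as in Scala) from "default is None" (return None). The standard Python idiom for this is a sentinel default value; the sketch below illustrates that idiom with hypothetical names and is not the actual PySpark patch.

```python
# Sentinel-default idiom: lets a conf.get-style API tell apart a caller
# who passed no default (raise on missing key) from one who passed None
# as the default (never raise, return None).

class _NoValue:
    """Unique marker meaning 'caller supplied no default'."""

class RuntimeConf:
    """Hypothetical stand-in for spark.conf, backed by a plain dict."""
    def __init__(self, settings):
        self._settings = dict(settings)

    def get(self, key, default=_NoValue):
        if default is _NoValue:
            # No default supplied: behave like Scala's conf.get(key).
            if key not in self._settings:
                raise KeyError(key)
            return self._settings[key]
        # A default was supplied (possibly None): never raise.
        return self._settings.get(key, default)

conf = RuntimeConf({"spark.sql.sources.partitionOverwriteMode": "STATIC"})
print(conf.get("spark.sql.sources.partitionOverwriteMode", None))  # STATIC
print(conf.get("hey", None))  # None
```

With a plain `default=None` signature these two call sites would be indistinguishable, which is exactly the bug reported above.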
[jira] [Created] (SPARK-23730) Save and expose "in bag" tracking for random forest model
Julian King created SPARK-23730: --- Summary: Save and expose "in bag" tracking for random forest model Key: SPARK-23730 URL: https://issues.apache.org/jira/browse/SPARK-23730 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.3.0 Reporter: Julian King In a random forest model, it is often useful to be able to keep track of which samples ended up in each of the bootstrap replications (and how many times this happened). For instance, in the R randomForest package this is accomplished through the option keep.inbag=TRUE Similar functionality in Spark ML's random forest would be helpful -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
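The requested bookkeeping can be sketched independently of Spark: for each bootstrap replicate, record how many times each training row was drawn; rows with a zero count are "out of bag" for that tree and usable for OOB error estimates. A minimal Python sketch assuming plain uniform sampling with replacement (all names hypothetical, not Spark ML API):

```python
# Sketch of "in bag" tracking for bagging, in the spirit of R
# randomForest's keep.inbag=TRUE. Not Spark ML code.

import random
from collections import Counter

def bootstrap_inbag(n_rows, n_trees, seed=0):
    """Return inbag[t][i] = number of times row i was sampled for tree t."""
    rng = random.Random(seed)
    inbag = []
    for _ in range(n_trees):
        # One bootstrap replicate: n_rows draws with replacement.
        draws = Counter(rng.randrange(n_rows) for _ in range(n_rows))
        inbag.append([draws.get(i, 0) for i in range(n_rows)])
    return inbag

inbag = bootstrap_inbag(n_rows=5, n_trees=3)
for t, counts in enumerate(inbag):
    # Rows with count 0 are "out of bag" for tree t.
    oob = [i for i, c in enumerate(counts) if c == 0]
    print(f"tree {t}: inbag={counts} oob={oob}")
```

In an actual implementation the matrix would be stored alongside the fitted model so OOB predictions and permutation importance can be computed after training.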
[jira] [Commented] (SPARK-23710) Upgrade Hive to 2.3.2
[ https://issues.apache.org/jira/browse/SPARK-23710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403886#comment-16403886 ] Yuming Wang commented on SPARK-23710: - Without {{-Phive}} it works fine, because {{$SPARK_HOME/jars}} contains {{hive-storage-api-2.4.0.jar}}. {{nohive}} is for compatibility with Hive 1.x; if we still use {{nohive}} after upgrading Hive to 2.3.2, there will be a lot of conflicts. > Upgrade Hive to 2.3.2 > - > > Key: SPARK-23710 > URL: https://issues.apache.org/jira/browse/SPARK-23710 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > h1. Main changes > * Maven dependency: > hive.version from {{1.2.1.spark2}} to {{2.3.2}} and change > {{hive.classifier}} to {{core}} > calcite.version from {{1.2.0-incubating}} to {{1.10.0}} > datanucleus-core.version from {{3.2.10}} to {{4.1.17}} > remove {{orc.classifier}}, which means ORC uses the {{hive.storage.api}}, see: > ORC-174 > add new dependencies {{avatica}} and {{hive.storage.api}} > * ORC compatibility changes: > OrcColumnVector.java, OrcColumnarBatchReader.java, OrcDeserializer.scala, > OrcFilters.scala, OrcSerializer.scala, OrcFilterSuite.scala > * hive-thriftserver java file updates: > update {{sql/hive-thriftserver/if/TCLIService.thrift}} to hive 2.3.2 > update {{sql/hive-thriftserver/src/main/java/org/apache/hive/service/*}} to > hive 2.3.2 > * Test suites that should be updated: > ||TestSuite||Reason|| > |StatisticsSuite|HIVE-16098| > |SessionCatalogSuite|Similar to [VersionsSuite.scala#L427|#L427]| > |CliSuite, HiveThriftServer2Suites, HiveSparkSubmitSuite, HiveQuerySuite, > SQLQuerySuite|Update hive-hcatalog-core-0.13.1.jar to > hive-hcatalog-core-2.3.2.jar| > |SparkExecuteStatementOperationSuite|Interface changed from > org.apache.hive.service.cli.Type.NULL_TYPE to > org.apache.hadoop.hive.serde2.thrift.Type.NULL_TYPE| > |ClasspathDependenciesSuite|org.apache.hive.com.esotericsoftware.kryo.Kryo > change to 
com.esotericsoftware.kryo.Kryo| > |HiveMetastoreCatalogSuite|Result format changed from Seq("1.1\t1", "2.1\t2") > to Seq("1.100\t1", "2.100\t2")| > |HiveOrcFilterSuite|Result format changed| > |HiveDDLSuite|Remove $ (This change needs to be reconsidered)| > |HiveExternalCatalogVersionsSuite| java.lang.ClassCastException: > org.datanucleus.identity.DatastoreIdImpl cannot be cast to > org.datanucleus.identity.OID| > * Other changes: > Close hive schema verification: > [HiveClientImpl.scala#L251|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L251] > and > [HiveExternalCatalog.scala#L58|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L58] > Update > [IsolatedClientLoader.scala#L189-L192|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L189-L192] > Because Hive 2.3.2's {{org.apache.hadoop.hive.ql.metadata.Hive}} can't > connect to Hive 1.x metastore, We should use > {{HiveMetaStoreClient.getDelegationToken}} instead of > {{Hive.getDelegationToken}} and update {{HiveClientImpl.toHiveTable}} > All changes can be found at > [PR-20659|https://github.com/apache/spark/pull/20659]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23729) Glob resolution breaks remote naming of files/archives
[ https://issues.apache.org/jira/browse/SPARK-23729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23729:
------------------------------------

Assignee: (was: Apache Spark)

> Glob resolution breaks remote naming of files/archives
> ------------------------------------------------------
>
> Key: SPARK-23729
> URL: https://issues.apache.org/jira/browse/SPARK-23729
> Project: Spark
> Issue Type: Bug
> Components: Spark Submit
> Affects Versions: 2.3.0
> Reporter: Mihaly Toth
> Priority: Major
>
> When {{spark-submit}} is used with either the {{\-\-archives}} or the {{\-\-files}} parameter and the file name contains glob patterns, the rename part ({{...#nameAs}}) of the file name ends up being ignored.
> If glob resolution yields multiple files, it does not make sense to upload all of them under the same remote name, so that case should result in an error.
[jira] [Assigned] (SPARK-23729) Glob resolution breaks remote naming of files/archives
[ https://issues.apache.org/jira/browse/SPARK-23729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23729:
------------------------------------

Assignee: Apache Spark
[jira] [Commented] (SPARK-23729) Glob resolution breaks remote naming of files/archives
[ https://issues.apache.org/jira/browse/SPARK-23729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403879#comment-16403879 ] Apache Spark commented on SPARK-23729:
--------------------------------------

User 'misutoth' has created a pull request for this issue: https://github.com/apache/spark/pull/20853
[jira] [Commented] (SPARK-23729) Glob resolution breaks remote naming of files/archives
[ https://issues.apache.org/jira/browse/SPARK-23729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403876#comment-16403876 ] Mihaly Toth commented on SPARK-23729:
-------------------------------------

Already working on this. Will submit a PR shortly.
[jira] [Created] (SPARK-23729) Glob resolution breaks remote naming of files/archives
Mihaly Toth created SPARK-23729:
-----------------------------------

Summary: Glob resolution breaks remote naming of files/archives
Key: SPARK-23729
URL: https://issues.apache.org/jira/browse/SPARK-23729
Project: Spark
Issue Type: Bug
Components: Spark Submit
Affects Versions: 2.3.0
Reporter: Mihaly Toth

When {{spark-submit}} is used with either the {{\-\-archives}} or the {{\-\-files}} parameter and the file name contains glob patterns, the rename part ({{...#nameAs}}) of the file name ends up being ignored.

If glob resolution yields multiple files, it does not make sense to upload all of them under the same remote name, so that case should result in an error.
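The proposed error case can be sketched as follows. This is a hypothetical helper, not Spark's actual {{spark-submit}} code; the class name, method signature, and in-memory candidate list are invented for illustration. It splits an argument of the form {{pattern#remoteName}}, resolves the glob, and refuses to rename multiple matches to one remote name:

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Sketch of the error handling proposed in SPARK-23729 (not Spark's code). */
public class GlobRename {
    /**
     * Resolves a "pattern#remoteName" argument against a list of candidate
     * local file names and returns the matches. Throws when a rename fragment
     * is present but the glob resolves to more than one file.
     */
    static List<String> resolve(String arg, List<String> candidates) {
        int hash = arg.indexOf('#');
        String pattern = hash >= 0 ? arg.substring(0, hash) : arg;
        String remoteName = hash >= 0 ? arg.substring(hash + 1) : null;

        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        List<String> matches = new ArrayList<>();
        for (String c : candidates) {
            if (matcher.matches(Paths.get(c))) {
                matches.add(c);
            }
        }
        // Renaming several files to the same remote name is ambiguous: fail.
        if (remoteName != null && matches.size() > 1) {
            throw new IllegalArgumentException(
                "Glob '" + pattern + "' matched " + matches.size()
                + " files; cannot upload all of them as '" + remoteName + "'");
        }
        return matches;
    }
}
```

With a hypothetical {{\-\-files "data-*.csv#data.csv"}} argument, a single local match would be uploaded as {{data.csv}}, while two or more matches would fail fast instead of silently dropping the rename.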