[jira] [Commented] (SPARK-18039) ReceiverTracker runs the dummy job too fast, causing unbalanced receiver scheduling
[ https://issues.apache.org/jira/browse/SPARK-18039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594254#comment-15594254 ] Liwei Lin commented on SPARK-18039: --- Hi [~astralidea], if I understand correctly, configuring `spark.scheduler.minRegisteredResourcesRatio` may help. > ReceiverTracker runs the dummy job too fast, causing unbalanced receiver scheduling > - > > Key: SPARK-18039 > URL: https://issues.apache.org/jira/browse/SPARK-18039 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.1 >Reporter: astralidea >Priority: Minor > > Receiver scheduling balance is important to me. > For instance, if I have 2 executors with 1 receiver each, the calc time is 0.1s per batch, > but if I have 2 executors where one has 2 receivers and the other has 0 receivers, the calc time increases by 3s per batch. > In my cluster executor init is slow and I need to wait about 30s, > but the dummy job only waits 4s; I set the conf > spark.scheduler.maxRegisteredResourcesWaitingTime but it does not work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
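For readers hitting the same imbalance, a minimal sketch of the suggested configuration, set before the StreamingContext is created (the 1.0 ratio and the 60s timeout are illustrative values chosen for this example, not recommendations from the thread):
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("receiver-balance-example")
  // Wait until all expected executors have registered before jobs
  // (including the receiver-scheduling dummy job) run; 1.0 means "all of them".
  .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")
  // Upper bound on how long to wait for registration; pick something larger
  // than the ~30s executor start-up time reported above.
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "60s")

val ssc = new StreamingContext(conf, Seconds(1))
{code}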
[jira] [Updated] (SPARK-18040) Improve R handling or messaging of JVM exception
[ https://issues.apache.org/jira/browse/SPARK-18040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18040: - Description: Similar to SPARK-17838, there are a few cases where an exception can be thrown from the JVM side when an action is performed (head, count, collect). For example, any error from the planner can only surface at that point. We need to have error handling for those cases to present the error more clearly in R instead of a long Java stacktrace. > Improve R handling or messaging of JVM exception > > > Key: SPARK-18040 > URL: https://issues.apache.org/jira/browse/SPARK-18040 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.2, 2.1.0 >Reporter: Felix Cheung >Priority: Minor > > Similar to SPARK-17838, there are a few cases where an exception can be > thrown from the JVM side when an action is performed (head, count, collect). > For example, any error from the planner can only surface at that point. We need to have > error handling for those cases to present the error more clearly in R instead > of a long Java stacktrace. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17254) Filter operator should have “stop if false” semantics for sorted data
[ https://issues.apache.org/jira/browse/SPARK-17254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-17254: Attachment: (was: stop-after-physical-plan.pdf) > Filter operator should have “stop if false” semantics for sorted data > - > > Key: SPARK-17254 > URL: https://issues.apache.org/jira/browse/SPARK-17254 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Tejas Patil > Attachments: stop-after-physical-plan.pdf > > > From > https://issues.apache.org/jira/secure/attachment/12778890/BucketedTables.pdf: > Filter on sorted data > If the data is sorted by a key, filters on the key could stop as soon as the > data is out of range. For example, WHERE ticker_id < “F” should stop as soon > as the first row starting with “F” is seen. This can be done adding a Filter > operator that has “stop if false” semantics. This is generally useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17254) Filter operator should have “stop if false” semantics for sorted data
[ https://issues.apache.org/jira/browse/SPARK-17254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-17254: Attachment: stop-after-physical-plan.pdf > Filter operator should have “stop if false” semantics for sorted data > - > > Key: SPARK-17254 > URL: https://issues.apache.org/jira/browse/SPARK-17254 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Tejas Patil > Attachments: stop-after-physical-plan.pdf > > > From > https://issues.apache.org/jira/secure/attachment/12778890/BucketedTables.pdf: > Filter on sorted data > If the data is sorted by a key, filters on the key could stop as soon as the > data is out of range. For example, WHERE ticker_id < “F” should stop as soon > as the first row starting with “F” is seen. This can be done adding a Filter > operator that has “stop if false” semantics. This is generally useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
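To make the proposed "stop if false" semantics concrete, here is a rough, non-authoritative sketch of the difference at the iterator level (plain Scala collections, not the actual physical operator):
{code}
// Rows already sorted by ticker_id.
val sorted = Seq("AAPL", "BAC", "CSCO", "FB", "GOOG", "MSFT")

// A plain filter keeps scanning every remaining row even after the
// predicate can never be true again.
val withFilter = sorted.filter(_ < "F")        // scans all 6 rows

// A "stop if false" filter on sorted input can terminate at the first
// row that fails the predicate, i.e. the first ticker starting with "F".
val stopIfFalse = sorted.takeWhile(_ < "F")    // stops after "CSCO"

assert(withFilter == stopIfFalse)              // same result, less work
{code}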
[jira] [Created] (SPARK-18041) activedrivers section in http://sparkMasterUrl/json is missing main class information
sudheesh k s created SPARK-18041: Summary: activedrivers section in http://sparkMasterUrl/json is missing main class information Key: SPARK-18041 URL: https://issues.apache.org/jira/browse/SPARK-18041 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 1.6.2 Reporter: sudheesh k s Priority: Minor http://sparkMasterUrl/json gives the status of running applications as well as drivers, but it is missing information such as the driver's main class. To identify what each driver is running, its main class information is needed. e.g.: "activedrivers" : [ { "id" : "driver-20161020173528-0032", "starttime" : "1476965128734", "state" : "RUNNING", "cores" : 1, "memory" : 1024 } ], -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
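As an illustration only, the enriched entry the reporter seems to be asking for would carry one extra field; the Scala sketch below is hypothetical (the field name mainClass and the case class itself are assumptions, not the actual Master JSON code):
{code}
// Hypothetical shape of an "activedrivers" entry with the missing field added.
case class ActiveDriverInfo(
    id: String,         // e.g. "driver-20161020173528-0032"
    starttime: String,  // e.g. "1476965128734"
    state: String,      // e.g. "RUNNING"
    cores: Int,
    memory: Int,
    mainClass: String)  // the requested addition, e.g. "com.example.MyDriverApp"
{code}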
[jira] [Created] (SPARK-18040) Improve R handling or messaging of JVM exception
Felix Cheung created SPARK-18040: Summary: Improve R handling or messaging of JVM exception Key: SPARK-18040 URL: https://issues.apache.org/jira/browse/SPARK-18040 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.0.2, 2.1.0 Reporter: Felix Cheung Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17275) Flaky test: org.apache.spark.deploy.RPackageUtilsSuite.jars that don't exist are skipped and print warning
[ https://issues.apache.org/jira/browse/SPARK-17275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594080#comment-15594080 ] Felix Cheung commented on SPARK-17275: -- is this still a problem? > Flaky test: org.apache.spark.deploy.RPackageUtilsSuite.jars that don't exist > are skipped and print warning > -- > > Key: SPARK-17275 > URL: https://issues.apache.org/jira/browse/SPARK-17275 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yin Huai > > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/1623/testReport/junit/org.apache.spark.deploy/RPackageUtilsSuite/jars_that_don_t_exist_are_skipped_and_print_warning/ > {code} > Error Message > java.io.IOException: Unable to delete directory > /home/jenkins/.ivy2/cache/a/mylib. > Stacktrace > sbt.ForkMain$ForkError: java.io.IOException: Unable to delete directory > /home/jenkins/.ivy2/cache/a/mylib. > at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1541) > at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2270) > at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653) > at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535) > at > org.apache.spark.deploy.IvyTestUtils$.purgeLocalIvyCache(IvyTestUtils.scala:394) > at > org.apache.spark.deploy.IvyTestUtils$.withRepository(IvyTestUtils.scala:384) > at > org.apache.spark.deploy.RPackageUtilsSuite$$anonfun$3.apply$mcV$sp(RPackageUtilsSuite.scala:103) > at > org.apache.spark.deploy.RPackageUtilsSuite$$anonfun$3.apply(RPackageUtilsSuite.scala:100) > at > org.apache.spark.deploy.RPackageUtilsSuite$$anonfun$3.apply(RPackageUtilsSuite.scala:100) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:57) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at > org.apache.spark.deploy.RPackageUtilsSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(RPackageUtilsSuite.scala:38) > at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) > at > org.apache.spark.deploy.RPackageUtilsSuite.runTest(RPackageUtilsSuite.scala:38) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:381) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) > at 
org.scalatest.FunSuite.runTests(FunSuite.scala:1555) > at org.scalatest.Suite$class.run(Suite.scala:1424) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at org.scalatest.SuperEngine.runImpl(Engine.scala:545) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:29) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:29) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:357) > at > o
[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is
[ https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594046#comment-15594046 ] Felix Cheung commented on SPARK-17916: -- So here's what happens. First, R's read.csv clearly documents that it treats empty/blank strings the same as NA under the following condition: "Blank fields are also considered to be missing values in logical, integer, numeric and complex fields." Second, in this example in R, the 2nd column is turned into "logical" instead of "character" (i.e. string) as expected:
{code}
> d <- "col1,col2
+ 1,\"-\"
+ 2,\"\""
> df <- read.csv(text=d, quote="\"", na.strings=c("-"))
> df
  col1 col2
1    1   NA
2    2   NA
> str(df)
'data.frame':   2 obs. of  2 variables:
 $ col1: int  1 2
 $ col2: logi  NA NA
{code}
And that is why the blank string is turned into NA. Whereas if the data.frame has a character/factor column instead, the blank field is retained as blank:
{code}
> d <- "col1,col2
+ 1,\"###\"
+ 2,\"\"
+ 3,\"this is a string\""
> df <- read.csv(text=d, quote="\"", na.strings=c("###"))
> df
  col1             col2
1    1             <NA>
2    2                 
3    3 this is a string
> str(df)
'data.frame':   3 obs. of  2 variables:
 $ col1: int  1 2 3
 $ col2: Factor w/ 2 levels "","this is a string": NA 1 2
{code}
IMO this behavior makes sense. > CSV data source treats empty string as null no matter what nullValue option is > -- > > Key: SPARK-17916 > URL: https://issues.apache.org/jira/browse/SPARK-17916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > When a user configures {{nullValue}} in the CSV data source, in addition to those > values, all empty string values are also converted to null. > {code} > data: > col1,col2 > 1,"-" > 2,"" > {code} > {code} > spark.read.format("csv").option("nullValue", "-") > {code} > We will find a null in both rows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
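For context, a minimal Spark-side sketch of the reproduction the issue describes (the header/quote options and the file path are illustrative; the reported behavior is that the empty-string cell comes back as null regardless of the nullValue setting):
{code}
// data.csv contains:
// col1,col2
// 1,"-"
// 2,""
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("quote", "\"")
  .option("nullValue", "-")   // only "-" is supposed to become null...
  .load("/tmp/data.csv")      // illustrative path

df.show()
// ...but, per this report, the empty string in row 2 is also read back as null.
{code}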
[jira] [Resolved] (SPARK-18029) PruneFileSourcePartitions should not change the output of LogicalRelation
[ https://issues.apache.org/jira/browse/SPARK-18029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-18029. - Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15569 [https://github.com/apache/spark/pull/15569] > PruneFileSourcePartitions should not change the output of LogicalRelation > - > > Key: SPARK-18029 > URL: https://issues.apache.org/jira/browse/SPARK-18029 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18022) java.lang.NullPointerException instead of real exception when saving DF to MySQL
[ https://issues.apache.org/jira/browse/SPARK-18022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593935#comment-15593935 ] Maciej Bryński edited comment on SPARK-18022 at 10/21/16 4:24 AM: -- I think the problem is in this PR. https://github.com/apache/spark/commit/811a2cef03647c5be29fef522c423921c79b1bc3 CC: [~davies] was (Author: maver1ck): I think the problem is in this PR. https://github.com/apache/spark/commit/811a2cef03647c5be29fef522c423921c79b1bc3 > java.lang.NullPointerException instead of real exception when saving DF to > MySQL > > > Key: SPARK-18022 > URL: https://issues.apache.org/jira/browse/SPARK-18022 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Maciej Bryński >Priority: Minor > > Hi, > I have found following issue. > When there is an exception while saving dataframe to MySQL I'm unable to get > it. > Instead of I'm getting following stacktrace. > {code} > 16/10/20 06:00:35 WARN TaskSetManager: Lost task 56.0 in stage 10.0 (TID > 3753, dwh-hn28.adpilot.co): java.lang.NullPointerException: Cannot suppress a > null exception. > at java.lang.Throwable.addSuppressed(Throwable.java:1046) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:256) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:314) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:313) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > The real exception could be for example duplicate on primary key etc. > With this it's very difficult to debugging apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18022) java.lang.NullPointerException instead of real exception when saving DF to MySQL
[ https://issues.apache.org/jira/browse/SPARK-18022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593935#comment-15593935 ] Maciej Bryński commented on SPARK-18022: I think the problem is in this PR. https://github.com/apache/spark/commit/811a2cef03647c5be29fef522c423921c79b1bc3 > java.lang.NullPointerException instead of real exception when saving DF to > MySQL > > > Key: SPARK-18022 > URL: https://issues.apache.org/jira/browse/SPARK-18022 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Maciej Bryński >Priority: Minor > > Hi, > I have found following issue. > When there is an exception while saving dataframe to MySQL I'm unable to get > it. > Instead of I'm getting following stacktrace. > {code} > 16/10/20 06:00:35 WARN TaskSetManager: Lost task 56.0 in stage 10.0 (TID > 3753, dwh-hn28.adpilot.co): java.lang.NullPointerException: Cannot suppress a > null exception. > at java.lang.Throwable.addSuppressed(Throwable.java:1046) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:256) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:314) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:313) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > The real exception could be for example duplicate on primary key etc. > With this it's very difficult to debugging apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
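As background on the "Cannot suppress a null exception" message in the stacktrace above: java.lang.Throwable.addSuppressed throws a NullPointerException when it is handed a null, which is how the original failure (e.g. a duplicate primary key error) ends up masked. A minimal, self-contained sketch of that mechanism (not the actual JdbcUtils code):
{code}
object SuppressNullDemo {
  def main(args: Array[String]): Unit = {
    // Stand-in for the real failure, e.g. a duplicate primary key error from MySQL.
    val realFailure = new RuntimeException("Duplicate entry '42' for key 'PRIMARY'")

    // If the cleanup path passes null here (because the variable holding the
    // secondary exception was never assigned), the JDK replaces the useful
    // error with this NPE:
    try {
      realFailure.addSuppressed(null)
    } catch {
      case npe: NullPointerException =>
        println(npe.getMessage) // prints: Cannot suppress a null exception.
    }
  }
}
{code}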
[jira] [Created] (SPARK-18039) ReceiverTracker runs the dummy job too fast, causing unbalanced receiver scheduling
astralidea created SPARK-18039: -- Summary: ReceiverTracker runs the dummy job too fast, causing unbalanced receiver scheduling Key: SPARK-18039 URL: https://issues.apache.org/jira/browse/SPARK-18039 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.0.1 Reporter: astralidea Priority: Minor Receiver scheduling balance is important to me. For instance, if I have 2 executors with 1 receiver each, the calc time is 0.1s per batch, but if I have 2 executors where one has 2 receivers and the other has 0 receivers, the calc time increases by 3s per batch. In my cluster executor init is slow and I need to wait about 30s, but the dummy job only waits 4s; I set the conf spark.scheduler.maxRegisteredResourcesWaitingTime but it does not work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18022) java.lang.NullPointerException instead of real exception when saving DF to MySQL
[ https://issues.apache.org/jira/browse/SPARK-18022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593846#comment-15593846 ] Maciej Bryński edited comment on SPARK-18022 at 10/21/16 3:56 AM: -- Only improvement in error handling. Because right now I'm getting only NPE and I have to guess whats the real reason of error. was (Author: maver1ck): Only improvement in error handling. Because right now I'm getting only NPE and have to guess whats the real reason of error. > java.lang.NullPointerException instead of real exception when saving DF to > MySQL > > > Key: SPARK-18022 > URL: https://issues.apache.org/jira/browse/SPARK-18022 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Maciej Bryński >Priority: Minor > > Hi, > I have found following issue. > When there is an exception while saving dataframe to MySQL I'm unable to get > it. > Instead of I'm getting following stacktrace. > {code} > 16/10/20 06:00:35 WARN TaskSetManager: Lost task 56.0 in stage 10.0 (TID > 3753, dwh-hn28.adpilot.co): java.lang.NullPointerException: Cannot suppress a > null exception. > at java.lang.Throwable.addSuppressed(Throwable.java:1046) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:256) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:314) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:313) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > The real exception could be for example duplicate on primary key etc. > With this it's very difficult to debugging apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18022) java.lang.NullPointerException instead of real exception when saving DF to MySQL
[ https://issues.apache.org/jira/browse/SPARK-18022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593846#comment-15593846 ] Maciej Bryński edited comment on SPARK-18022 at 10/21/16 3:39 AM: -- Only improvement in error handling. Because right now I'm getting only NPE and have to guess whats the real reason of error. was (Author: maver1ck): Only improvement in error handling. > java.lang.NullPointerException instead of real exception when saving DF to > MySQL > > > Key: SPARK-18022 > URL: https://issues.apache.org/jira/browse/SPARK-18022 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Maciej Bryński >Priority: Minor > > Hi, > I have found following issue. > When there is an exception while saving dataframe to MySQL I'm unable to get > it. > Instead of I'm getting following stacktrace. > {code} > 16/10/20 06:00:35 WARN TaskSetManager: Lost task 56.0 in stage 10.0 (TID > 3753, dwh-hn28.adpilot.co): java.lang.NullPointerException: Cannot suppress a > null exception. > at java.lang.Throwable.addSuppressed(Throwable.java:1046) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:256) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:314) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:313) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > The real exception could be for example duplicate on primary key etc. > With this it's very difficult to debugging apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18022) java.lang.NullPointerException instead of real exception when saving DF to MySQL
[ https://issues.apache.org/jira/browse/SPARK-18022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593846#comment-15593846 ] Maciej Bryński commented on SPARK-18022: Only improvement in error handling. > java.lang.NullPointerException instead of real exception when saving DF to > MySQL > > > Key: SPARK-18022 > URL: https://issues.apache.org/jira/browse/SPARK-18022 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Maciej Bryński >Priority: Minor > > Hi, > I have found following issue. > When there is an exception while saving dataframe to MySQL I'm unable to get > it. > Instead of I'm getting following stacktrace. > {code} > 16/10/20 06:00:35 WARN TaskSetManager: Lost task 56.0 in stage 10.0 (TID > 3753, dwh-hn28.adpilot.co): java.lang.NullPointerException: Cannot suppress a > null exception. > at java.lang.Throwable.addSuppressed(Throwable.java:1046) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:256) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:314) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:313) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > The real exception could be for example duplicate on primary key etc. > With this it's very difficult to debugging apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15765) Make continuous Parquet writes consistent with non-continuous Parquet writes
[ https://issues.apache.org/jira/browse/SPARK-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593782#comment-15593782 ] Liwei Lin edited comment on SPARK-15765 at 10/21/16 3:07 AM: - I'm closing this in favor of SPARK-17924 was (Author: lwlin): I'm closing this in favor of SPARK-18025 > Make continuous Parquet writes consistent with non-continuous Parquet writes > > > Key: SPARK-15765 > URL: https://issues.apache.org/jira/browse/SPARK-15765 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Liwei Lin(Inactive) > > Currently there are some code duplicates in continuous Parquet writes (as in > Structured Streaming) and non-continuous writes; see > [ParquetFileFormat#prepareWrite()|(https://github.com/apache/spark/blob/431542765785304edb76a19885fbc5f9b8ae7d64/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L68] > and > [ParquetFileFormat#ParquetOutputWriterFactory|https://github.com/apache/spark/blob/431542765785304edb76a19885fbc5f9b8ae7d64/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L414]. > This may lead to inconsistent behavior, when we only change one piece of code > but not the other. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15765) Make continuous Parquet writes consistent with non-continuous Parquet writes
[ https://issues.apache.org/jira/browse/SPARK-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593782#comment-15593782 ] Liwei Lin commented on SPARK-15765: --- I'm closing this in favor of SPARK-18025 > Make continuous Parquet writes consistent with non-continuous Parquet writes > > > Key: SPARK-15765 > URL: https://issues.apache.org/jira/browse/SPARK-15765 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Liwei Lin(Inactive) > > Currently there are some code duplicates in continuous Parquet writes (as in > Structured Streaming) and non-continuous writes; see > [ParquetFileFormat#prepareWrite()|(https://github.com/apache/spark/blob/431542765785304edb76a19885fbc5f9b8ae7d64/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L68] > and > [ParquetFileFormat#ParquetOutputWriterFactory|https://github.com/apache/spark/blob/431542765785304edb76a19885fbc5f9b8ae7d64/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L414]. > This may lead to inconsistent behavior, when we only change one piece of code > but not the other. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-15765) Make continuous Parquet writes consistent with non-continuous Parquet writes
[ https://issues.apache.org/jira/browse/SPARK-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin(Inactive) closed SPARK-15765. --- Resolution: Duplicate > Make continuous Parquet writes consistent with non-continuous Parquet writes > > > Key: SPARK-15765 > URL: https://issues.apache.org/jira/browse/SPARK-15765 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Liwei Lin(Inactive) > > Currently there are some code duplicates in continuous Parquet writes (as in > Structured Streaming) and non-continuous writes; see > [ParquetFileFormat#prepareWrite()|(https://github.com/apache/spark/blob/431542765785304edb76a19885fbc5f9b8ae7d64/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L68] > and > [ParquetFileFormat#ParquetOutputWriterFactory|https://github.com/apache/spark/blob/431542765785304edb76a19885fbc5f9b8ae7d64/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L414]. > This may lead to inconsistent behavior, when we only change one piece of code > but not the other. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593760#comment-15593760 ] Liwei Lin commented on SPARK-16845: --- Oh thanks for the feedback; it's helpful! The branch you're testing against is one way to fix this, and there's also an alternative way -- we're still discussing which would be better. I think this shall get merged in possibly after Spark Summit Europe. Thanks! > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > - > > Key: SPARK-16845 > URL: https://issues.apache.org/jira/browse/SPARK-16845 > Project: Spark > Issue Type: Bug > Components: Java API, ML, MLlib >Affects Versions: 2.0.0 >Reporter: hejie > > I have a wide table(400 columns), when I try fitting the traindata on all > columns, the fatal error occurs. > ... 46 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17829) Stable format for offset log
[ https://issues.apache.org/jira/browse/SPARK-17829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593706#comment-15593706 ] Tyson Condie commented on SPARK-17829: -- Had a conversation with Michael about how to handle offset serialization. When considering deserialization, the following three options seem possible.
1. Ask the source to deserialize the string into an offset (object).
2. Follow a formatting convention, e.g. the first line identifies an offset implementation class that accepts a string constructor argument; the string that is passed to the constructor comes from the second line.
3. Get rid of the Offset trait entirely and only deal with strings. This seems reasonable since we do not need to compare two offsets; we only care about the source's understanding of the offset, which it can interpret from whatever it embeds in the string, e.g. like option 2.
> Stable format for offset log > > > Key: SPARK-17829 > URL: https://issues.apache.org/jira/browse/SPARK-17829 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Tyson Condie > > Currently we use java serialization for the WAL that stores the offsets > contained in each batch. This has two main issues: > - It can break across spark releases (though this is not the only thing > preventing us from upgrading a running query) > - It is unnecessarily opaque to the user. > I'd propose we require offsets to provide a user readable serialization and > use that instead. JSON is probably a good option. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
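To make option 2 concrete, a rough sketch of that formatting convention (the trait, method names, and helper object are assumptions for illustration, not the eventual Spark API):
{code}
// Option 2 sketch: "<offset class name>\n<offset payload>"
trait SerializableOffset {
  def json: String                      // assumed user-readable payload, e.g. JSON
}

object OffsetLogFormat {
  def serialize(offset: SerializableOffset): String =
    s"${offset.getClass.getName}\n${offset.json}"

  def deserialize(text: String): SerializableOffset = {
    val Array(className, payload) = text.split("\n", 2)
    // Assumes every offset class exposes a single-String constructor.
    Class.forName(className)
      .getConstructor(classOf[String])
      .newInstance(payload)
      .asInstanceOf[SerializableOffset]
  }
}
{code}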
[jira] [Commented] (SPARK-17891) SQL-based three column join loses first column
[ https://issues.apache.org/jira/browse/SPARK-17891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593685#comment-15593685 ] Yuming Wang commented on SPARK-17891: - *Workaround:* # Disable BroadcastHashJoin by setting {{spark.sql.autoBroadcastJoinThreshold=-1}} # Convert join keys to StringType > SQL-based three column join loses first column > -- > > Key: SPARK-17891 > URL: https://issues.apache.org/jira/browse/SPARK-17891 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1 >Reporter: Eli Miller > Attachments: test.tgz > > > Hi all, > I hope that this is not a known issue (I haven't had any luck finding > anything similar in Jira or the mailing lists but I could be searching with > the wrong terms). I just started to experiment with Spark SQL and am seeing > what appears to be a bug. When using Spark SQL to join two tables with a > three column inner join, the first column join is ignored. The example code > that I have starts with two tables *T1*: > {noformat} > +---+---+---+---+ > | a| b| c| d| > +---+---+---+---+ > | 1| 2| 3| 4| > +---+---+---+---+ > {noformat} > and *T2*: > {noformat} > +---+---+---+---+ > | b| c| d| e| > +---+---+---+---+ > | 2| 3| 4| 5| > | -2| 3| 4| 6| > | 2| -3| 4| 7| > +---+---+---+---+ > {noformat} > Joining *T1* to *T2* on *b*, *c* and *d* (in that order): > {code:sql} > SELECT t1.a, t1.b, t2.b, t1.c,t2.c, t1.d, t2.d, t2.e > FROM t1, t2 > WHERE t1.b = t2.b AND t1.c = t2.c AND t1.d = t2.d > {code} > results in the following (note that *T1.b* != *T2.b* in the first row): > {noformat} > +---+---+---+---+---+---+---+---+ > | a| b| b| c| c| d| d| e| > +---+---+---+---+---+---+---+---+ > | 1| 2| -2| 3| 3| 4| 4| 6| > | 1| 2| 2| 3| 3| 4| 4| 5| > +---+---+---+---+---+---+---+---+ > {noformat} > Switching the predicate order to *c*, *b* and *d*: > {code:sql} > SELECT t1.a, t1.b, t2.b, t1.c,t2.c, t1.d, t2.d, t2.e > FROM t1, t2 > WHERE t1.c = t2.c AND t1.b = t2.b AND t1.d = t2.d > {code} > yields different results (now *T1.c* != *T2.c* in the first row): > {noformat} > +---+---+---+---+---+---+---+---+ > | a| b| b| c| c| d| d| e| > +---+---+---+---+---+---+---+---+ > | 1| 2| 2| 3| -3| 4| 4| 7| > | 1| 2| 2| 3| 3| 4| 4| 5| > +---+---+---+---+---+---+---+---+ > {noformat} > Is this expected? > I started to research this a bit and one thing that jumped out at me was the > ordering of the HashedRelationBroadcastMode concatenation in the plan (this > is from the *b*, *c*, *d* ordering): > {noformat} > ... 
> *Project [a#0, b#1, b#9, c#2, c#10, d#3, d#11, e#12] > +- *BroadcastHashJoin [b#1, c#2, d#3], [b#9, c#10, d#11], Inner, BuildRight >:- *Project [a#0, b#1, c#2, d#3] >: +- *Filter ((isnotnull(b#1) && isnotnull(c#2)) && isnotnull(d#3)) >: +- *Scan csv [a#0,b#1,c#2,d#3] Format: CSV, InputPaths: > file:/home/eli/git/IENG/what/target/classes/t1.csv, PartitionFilters: [], > PushedFilters: [IsNotNull(b), IsNotNull(c), IsNotNull(d)], ReadSchema: > struct >+- BroadcastExchange > HashedRelationBroadcastMode(List((shiftleft((shiftleft(cast(input[0, int, > true] as bigint), 32) | (cast(input[1, int, true] as bigint) & 4294967295)), > 32) | (cast(input[2, int, true] as bigint) & 4294967295 > +- *Project [b#9, c#10, d#11, e#12] > +- *Filter ((isnotnull(c#10) && isnotnull(b#9)) && isnotnull(d#11)) > +- *Scan csv [b#9,c#10,d#11,e#12] Format: CSV, InputPaths: > file:/home/eli/git/IENG/what/target/classes/t2.csv, PartitionFilters: [], > PushedFilters: [IsNotNull(c), IsNotNull(b), IsNotNull(d)], ReadSchema: > struct] > {noformat} > If this concatenated byte array is ever truncated to 64 bits in a comparison, > the leading column will be lost and could result in this behavior. > I will attach my example code and data. Please let me know if I can provide > any other details. > Best regards, > Eli -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
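For anyone needing the first workaround from the comment above until a fix lands, a small sketch of disabling the broadcast hash join before running the three-column join (table and column names follow the example in the report):
{code}
// Force a sort-merge join instead of BroadcastHashJoin by disabling
// the broadcast threshold for this session.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val result = spark.sql("""
  SELECT t1.a, t1.b, t2.b, t1.c, t2.c, t1.d, t2.d, t2.e
  FROM t1, t2
  WHERE t1.b = t2.b AND t1.c = t2.c AND t1.d = t2.d
""")
result.show()
{code}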
[jira] [Comment Edited] (SPARK-882) Have link for feedback/suggestions in docs
[ https://issues.apache.org/jira/browse/SPARK-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593573#comment-15593573 ] Deron Eriksson edited comment on SPARK-882 at 10/21/16 1:37 AM: I don't see any activity, so mind if I take a crack at this for the Spark documentation (link to open a pre-populated minor doc JIRA)? cc [~pwendell] [~srowen] was (Author: deron): I don't see any activity, so mind if I take a crack at this for the Spark documentation (link to open a pre-populated minor doc JIRA)? I think this is a great idea so I just implemented it for SystemML (http://apache.github.io/incubator-systemml/ under Issue Tracking on the top nav). cc [~pwendell] [~srowen] > Have link for feedback/suggestions in docs > -- > > Key: SPARK-882 > URL: https://issues.apache.org/jira/browse/SPARK-882 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Patrick Wendell >Assignee: Patrick Cogan > > It would be cool to have a link at the top of the docs for > feedback/suggestions/errors. I bet we'd get a lot of interesting stuff from > that and it could be a good way to crowdsource correctness checking, since a > lot of us that write them never have to use them. > Something to the right of the main top nav might be good. [~andyk] [~matei] - > what do you guys think? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18038) Move output partitioning definition from UnaryNodeExec to its children
[ https://issues.apache.org/jira/browse/SPARK-18038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593583#comment-15593583 ] Reynold Xin commented on SPARK-18038: - It definitely does. > Move output partitioning definition from UnaryNodeExec to its children > -- > > Key: SPARK-18038 > URL: https://issues.apache.org/jira/browse/SPARK-18038 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.1 >Reporter: Tejas Patil >Priority: Trivial > > This was a suggestion by [~rxin] over one of the dev list discussion : > http://apache-spark-developers-list.1001551.n3.nabble.com/Project-not-preserving-child-partitioning-td19417.html > {noformat} > I think this is very risky because preserving output partitioning should not > be a property of UnaryNodeExec (e.g. exchange). > It would be better (safer) to move the output partitioning definition into > each of the operator and remove it from UnaryExecNode. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
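A rough sketch of the direction being suggested, using a made-up operator for illustration (the real change would touch each concrete operator; this is not the actual patch):
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Attribute, NamedExpression}
import org.apache.spark.sql.catalyst.plans.physical.Partitioning
import org.apache.spark.sql.execution.{SparkPlan, UnaryExecNode}

// Instead of UnaryExecNode providing a blanket "preserve the child's
// partitioning" default, each operator that truly preserves it says so itself.
case class ExampleProjectLikeExec(
    projectList: Seq[NamedExpression],
    child: SparkPlan) extends UnaryExecNode {

  override def output: Seq[Attribute] = projectList.map(_.toAttribute)

  // Explicit, per-operator declaration rather than an inherited default.
  override def outputPartitioning: Partitioning = child.outputPartitioning

  override protected def doExecute(): RDD[InternalRow] =
    child.execute()   // placeholder body; a real operator does more here
}
{code}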
[jira] [Commented] (SPARK-882) Have link for feedback/suggestions in docs
[ https://issues.apache.org/jira/browse/SPARK-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593573#comment-15593573 ] Deron Eriksson commented on SPARK-882: -- I don't see any activity, so mind if I take a crack at this for the Spark documentation (link to open a pre-populated minor doc JIRA)? I think this is a great idea so I just implemented it for SystemML (http://apache.github.io/incubator-systemml/ under Issue Tracking on the top nav). cc [~pwendell] [~srowen] > Have link for feedback/suggestions in docs > -- > > Key: SPARK-882 > URL: https://issues.apache.org/jira/browse/SPARK-882 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Patrick Wendell >Assignee: Patrick Cogan > > It would be cool to have a link at the top of the docs for > feedback/suggestions/errors. I bet we'd get a lot of interesting stuff from > that and it could be a good way to crowdsource correctness checking, since a > lot of us that write them never have to use them. > Something to the right of the main top nav might be good. [~andyk] [~matei] - > what do you guys think? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7146) Should ML sharedParams be a public API?
[ https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593539#comment-15593539 ] Joseph K. Bradley edited comment on SPARK-7146 at 10/21/16 12:53 AM: - Update: We may need to make Java interfaces for these, rather than expecting users to depend upon the Scala traits. There's a Java binary compatibility issue which surfaced in the MiMa upgrade here: [https://github.com/apache/spark/pull/15571] We could probably also expose the corresponding Scala traits since they should be safe for Scala users to use outside of Spark. was (Author: josephkb): Update: We may need to make Java interfaces for these, rather than expecting users to depend upon the Scala traits. There's a Java binary compatibility issue which surfaced in the Mi > Should ML sharedParams be a public API? > --- > > Key: SPARK-7146 > URL: https://issues.apache.org/jira/browse/SPARK-7146 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Joseph K. Bradley > > Proposal: Make most of the Param traits in sharedParams.scala public. Mark > them as DeveloperApi. > Pros: > * Sharing the Param traits helps to encourage standardized Param names and > documentation. > Cons: > * Users have to be careful since parameters can have different meanings for > different algorithms. > * If the shared Params are public, then implementations could test for the > traits. It is unclear if we want users to rely on these traits, which are > somewhat experimental. > Currently, the shared params are private. > h3. UPDATED proposal > * Some Params are clearly safe to make public. We will do so. > * Some Params could be made public but may require caveats in the trait doc. > * Some Params have turned out not to be shared in practice. We can move > those Params to the classes which use them. > *Public shared params*: > * I/O column params > ** HasFeaturesCol > ** HasInputCol > ** HasInputCols > ** HasLabelCol > ** HasOutputCol > ** HasPredictionCol > ** HasProbabilityCol > ** HasRawPredictionCol > ** HasVarianceCol > ** HasWeightCol > * Algorithm settings > ** HasCheckpointInterval > ** HasElasticNetParam > ** HasFitIntercept > ** HasMaxIter > ** HasRegParam > ** HasSeed > ** HasStandardization (less common) > ** HasStepSize > ** HasTol > *Questionable params*: > * HasHandleInvalid (only used in StringIndexer, but might be more widely used > later on) > * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but > same meaning as Optimizer in LDA) > *Params to be removed from sharedParams*: > * HasThreshold (only used in LogisticRegression) > * HasThresholds (only used in ProbabilisticClassifier) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7146) Should ML sharedParams be a public API?
[ https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593539#comment-15593539 ] Joseph K. Bradley commented on SPARK-7146: -- Update: We may need to make Java interfaces for these, rather than expecting users to depend upon the Scala traits. There's a Java binary compatibility issue which surfaced in the Mi > Should ML sharedParams be a public API? > --- > > Key: SPARK-7146 > URL: https://issues.apache.org/jira/browse/SPARK-7146 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Joseph K. Bradley > > Proposal: Make most of the Param traits in sharedParams.scala public. Mark > them as DeveloperApi. > Pros: > * Sharing the Param traits helps to encourage standardized Param names and > documentation. > Cons: > * Users have to be careful since parameters can have different meanings for > different algorithms. > * If the shared Params are public, then implementations could test for the > traits. It is unclear if we want users to rely on these traits, which are > somewhat experimental. > Currently, the shared params are private. > h3. UPDATED proposal > * Some Params are clearly safe to make public. We will do so. > * Some Params could be made public but may require caveats in the trait doc. > * Some Params have turned out not to be shared in practice. We can move > those Params to the classes which use them. > *Public shared params*: > * I/O column params > ** HasFeaturesCol > ** HasInputCol > ** HasInputCols > ** HasLabelCol > ** HasOutputCol > ** HasPredictionCol > ** HasProbabilityCol > ** HasRawPredictionCol > ** HasVarianceCol > ** HasWeightCol > * Algorithm settings > ** HasCheckpointInterval > ** HasElasticNetParam > ** HasFitIntercept > ** HasMaxIter > ** HasRegParam > ** HasSeed > ** HasStandardization (less common) > ** HasStepSize > ** HasTol > *Questionable params*: > * HasHandleInvalid (only used in StringIndexer, but might be more widely used > later on) > * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but > same meaning as Optimizer in LDA) > *Params to be removed from sharedParams*: > * HasThreshold (only used in LogisticRegression) > * HasThresholds (only used in ProbabilisticClassifier) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
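As an illustration of why third-party authors care about this, a hedged sketch of the kind of custom Transformer that could reuse the shared param traits if they were public (the class and its logic are made up; only HasInputCol/HasOutputCol come from the proposal above):
{code}
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.upper
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Would only compile outside Spark once the shared param traits are public.
class UpperCaseTransformer(override val uid: String)
    extends Transformer with HasInputCol with HasOutputCol {

  def this() = this(Identifiable.randomUID("upperCase"))

  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn($(outputCol), upper(dataset($(inputCol))))

  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField($(outputCol), StringType))

  override def copy(extra: ParamMap): UpperCaseTransformer = defaultCopy(extra)
}
{code}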
[jira] [Assigned] (SPARK-18030) Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite
[ https://issues.apache.org/jira/browse/SPARK-18030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18030: Assignee: Apache Spark (was: Tathagata Das) > Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite > - > > Key: SPARK-18030 > URL: https://issues.apache.org/jira/browse/SPARK-18030 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Davies Liu >Assignee: Apache Spark > > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.streaming.FileStreamSourceSuite&test_name=when+schema+inference+is+turned+on%2C+should+read+partition+data -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18030) Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite
[ https://issues.apache.org/jira/browse/SPARK-18030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593493#comment-15593493 ] Apache Spark commented on SPARK-18030: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/15577 > Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite > - > > Key: SPARK-18030 > URL: https://issues.apache.org/jira/browse/SPARK-18030 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Davies Liu >Assignee: Tathagata Das > > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.streaming.FileStreamSourceSuite&test_name=when+schema+inference+is+turned+on%2C+should+read+partition+data -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18030) Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite
[ https://issues.apache.org/jira/browse/SPARK-18030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18030: Assignee: Tathagata Das (was: Apache Spark) > Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite > - > > Key: SPARK-18030 > URL: https://issues.apache.org/jira/browse/SPARK-18030 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Davies Liu >Assignee: Tathagata Das > > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.streaming.FileStreamSourceSuite&test_name=when+schema+inference+is+turned+on%2C+should+read+partition+data -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17674) Warnings from SparkR tests being ignored without redirecting to errors
[ https://issues.apache.org/jira/browse/SPARK-17674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593484#comment-15593484 ] Apache Spark commented on SPARK-17674: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/15576 > Warnings from SparkR tests being ignored without redirecting to errors > -- > > Key: SPARK-17674 > URL: https://issues.apache.org/jira/browse/SPARK-17674 > Project: Spark > Issue Type: Test > Components: SparkR >Reporter: Hyukjin Kwon > > For example, _currently_ we are having warnings as below: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65905/consoleFull > {code} > Warnings > --- > 1. spark.mlp (@test_mllib.R#400) - is.na() applied to non-(list or vector) of > type 'NULL' > 2. spark.mlp (@test_mllib.R#401) - is.na() applied to non-(list or vector) of > type 'NULL' > {code} > This should be errors as specified in > https://github.com/apache/spark/blob/master/R/pkg/tests/run-all.R#L22 > However, it seems passing the tests fine. > This seems related with the behaciour in `testhat` library. We should > invesigate and fix. This was also discussed in > https://github.com/apache/spark/pull/15232 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17674) Warnings from SparkR tests being ignored without redirecting to errors
[ https://issues.apache.org/jira/browse/SPARK-17674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17674: Assignee: Apache Spark > Warnings from SparkR tests being ignored without redirecting to errors > -- > > Key: SPARK-17674 > URL: https://issues.apache.org/jira/browse/SPARK-17674 > Project: Spark > Issue Type: Test > Components: SparkR >Reporter: Hyukjin Kwon >Assignee: Apache Spark > > For example, _currently_ we are having warnings as below: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65905/consoleFull > {code} > Warnings > --- > 1. spark.mlp (@test_mllib.R#400) - is.na() applied to non-(list or vector) of > type 'NULL' > 2. spark.mlp (@test_mllib.R#401) - is.na() applied to non-(list or vector) of > type 'NULL' > {code} > This should be errors as specified in > https://github.com/apache/spark/blob/master/R/pkg/tests/run-all.R#L22 > However, it seems passing the tests fine. > This seems related with the behaciour in `testhat` library. We should > invesigate and fix. This was also discussed in > https://github.com/apache/spark/pull/15232 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17674) Warnings from SparkR tests being ignored without redirecting to errors
[ https://issues.apache.org/jira/browse/SPARK-17674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17674: Assignee: (was: Apache Spark) > Warnings from SparkR tests being ignored without redirecting to errors > -- > > Key: SPARK-17674 > URL: https://issues.apache.org/jira/browse/SPARK-17674 > Project: Spark > Issue Type: Test > Components: SparkR >Reporter: Hyukjin Kwon > > For example, _currently_ we are having warnings as below: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65905/consoleFull > {code} > Warnings > --- > 1. spark.mlp (@test_mllib.R#400) - is.na() applied to non-(list or vector) of > type 'NULL' > 2. spark.mlp (@test_mllib.R#401) - is.na() applied to non-(list or vector) of > type 'NULL' > {code} > This should be errors as specified in > https://github.com/apache/spark/blob/master/R/pkg/tests/run-all.R#L22 > However, it seems passing the tests fine. > This seems related with the behaciour in `testhat` library. We should > invesigate and fix. This was also discussed in > https://github.com/apache/spark/pull/15232 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18038) Move output partitioning definition from UnaryNodeExec to its children
[ https://issues.apache.org/jira/browse/SPARK-18038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18038: Assignee: (was: Apache Spark) > Move output partitioning definition from UnaryNodeExec to its children > -- > > Key: SPARK-18038 > URL: https://issues.apache.org/jira/browse/SPARK-18038 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.1 >Reporter: Tejas Patil >Priority: Trivial > > This was a suggestion by [~rxin] over one of the dev list discussion : > http://apache-spark-developers-list.1001551.n3.nabble.com/Project-not-preserving-child-partitioning-td19417.html > {noformat} > I think this is very risky because preserving output partitioning should not > be a property of UnaryNodeExec (e.g. exchange). > It would be better (safer) to move the output partitioning definition into > each of the operator and remove it from UnaryExecNode. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18038) Move output partitioning definition from UnaryNodeExec to its children
[ https://issues.apache.org/jira/browse/SPARK-18038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593479#comment-15593479 ] Apache Spark commented on SPARK-18038: -- User 'tejasapatil' has created a pull request for this issue: https://github.com/apache/spark/pull/15575 > Move output partitioning definition from UnaryNodeExec to its children > -- > > Key: SPARK-18038 > URL: https://issues.apache.org/jira/browse/SPARK-18038 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.1 >Reporter: Tejas Patil >Priority: Trivial > > This was a suggestion by [~rxin] over one of the dev list discussion : > http://apache-spark-developers-list.1001551.n3.nabble.com/Project-not-preserving-child-partitioning-td19417.html > {noformat} > I think this is very risky because preserving output partitioning should not > be a property of UnaryNodeExec (e.g. exchange). > It would be better (safer) to move the output partitioning definition into > each of the operator and remove it from UnaryExecNode. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18038) Move output partitioning definition from UnaryNodeExec to its children
[ https://issues.apache.org/jira/browse/SPARK-18038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18038: Assignee: Apache Spark > Move output partitioning definition from UnaryNodeExec to its children > -- > > Key: SPARK-18038 > URL: https://issues.apache.org/jira/browse/SPARK-18038 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.1 >Reporter: Tejas Patil >Assignee: Apache Spark >Priority: Trivial > > This was a suggestion by [~rxin] over one of the dev list discussion : > http://apache-spark-developers-list.1001551.n3.nabble.com/Project-not-preserving-child-partitioning-td19417.html > {noformat} > I think this is very risky because preserving output partitioning should not > be a property of UnaryNodeExec (e.g. exchange). > It would be better (safer) to move the output partitioning definition into > each of the operator and remove it from UnaryExecNode. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18038) Move output partitioning definition from UnaryNodeExec to its children
[ https://issues.apache.org/jira/browse/SPARK-18038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593478#comment-15593478 ] Tejas Patil commented on SPARK-18038: - Not sure if this deserves a jira but created one. This is a small refactoring of code. > Move output partitioning definition from UnaryNodeExec to its children > -- > > Key: SPARK-18038 > URL: https://issues.apache.org/jira/browse/SPARK-18038 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.1 >Reporter: Tejas Patil >Priority: Trivial > > This was a suggestion by [~rxin] over one of the dev list discussion : > http://apache-spark-developers-list.1001551.n3.nabble.com/Project-not-preserving-child-partitioning-td19417.html > {noformat} > I think this is very risky because preserving output partitioning should not > be a property of UnaryNodeExec (e.g. exchange). > It would be better (safer) to move the output partitioning definition into > each of the operator and remove it from UnaryExecNode. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18038) Move output partitioning definition from UnaryNodeExec to its children
Tejas Patil created SPARK-18038: --- Summary: Move output partitioning definition from UnaryNodeExec to its children Key: SPARK-18038 URL: https://issues.apache.org/jira/browse/SPARK-18038 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.1 Reporter: Tejas Patil Priority: Trivial This was a suggestion by [~rxin] over one of the dev list discussion : http://apache-spark-developers-list.1001551.n3.nabble.com/Project-not-preserving-child-partitioning-td19417.html {noformat} I think this is very risky because preserving output partitioning should not be a property of UnaryNodeExec (e.g. exchange). It would be better (safer) to move the output partitioning definition into each of the operator and remove it from UnaryExecNode. {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
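To make the proposed refactoring concrete, here is a rough sketch (the trait and operator names are simplified stand-ins, not Spark's actual SparkPlan or UnaryExecNode classes): rather than the unary base class blanket-claiming that the child's partitioning is preserved, each unary operator states its own output partitioning, and only operators that truly preserve it say so.
{code}
// Illustrative model only, not Spark's real classes.
sealed trait Partitioning
case object UnknownPartitioning extends Partitioning

trait PlanLike {
  def outputPartitioning: Partitioning
}

trait UnaryPlanLike extends PlanLike {
  def child: PlanLike
  // No default outputPartitioning here: the base class no longer claims
  // that every unary operator preserves its child's partitioning.
}

// An operator that genuinely preserves partitioning declares it itself...
case class ProjectLike(child: PlanLike) extends UnaryPlanLike {
  override def outputPartitioning: Partitioning = child.outputPartitioning
}

// ...while an operator that redistributes data reports its own partitioning.
case class ExchangeLike(child: PlanLike, target: Partitioning) extends UnaryPlanLike {
  override def outputPartitioning: Partitioning = target
}
{code}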
[jira] [Commented] (SPARK-13955) Spark in yarn mode fails
[ https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593453#comment-15593453 ] Tzach Zohar commented on SPARK-13955: - [~saisai_shao] can you clarify regarding option #1: when you say bq. You need to zip all the jars and specify spark.yarn.archive with the path of zipped jars What exactly should that archive look like? We're upgrading from 1.6.2 and we keep getting the same error mentioned above: bq. Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher We've tried using {{spark.yarn.archive}} with: - The Spark binary downloaded from the download page (e.g. http://d3kbcqa49mib13.cloudfront.net/spark-2.0.0-bin-hadoop2.6.tgz) - Creating a {{.zip}} file with the contents of the {{jars/}} folder from the downloaded binary - Creating a {{.tgz}} file with the contents of the {{jars/}} folder from the downloaded binary - All of these options while placing the file either on HDFS or locally on the driver machine None of these resolve the issue. The only option that actually worked for us was the third one you mentioned - setting neither {{spark.yarn.jars}} nor {{spark.yarn.archive}} and making sure the right jars exist in {{SPARK_HOME/jars}} on each node - but since we run several applications with different Spark versions and want to simplify our provisioning, this isn't convenient for us. Any clarification would be greatly appreciated, Thanks! > Spark in yarn mode fails > > > Key: SPARK-13955 > URL: https://issues.apache.org/jira/browse/SPARK-13955 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Jeff Zhang >Assignee: Marcelo Vanzin > Fix For: 2.0.0 > > > I ran spark-shell in yarn client, but from the logs seems the spark assembly > jar is not uploaded to HDFS. This may be known issue in the process of > SPARK-11157, create this ticket to track this issue. [~vanzin] > {noformat} > 16/03/17 17:57:48 INFO Client: Will allocate AM container, with 896 MB memory > including 384 MB overhead > 16/03/17 17:57:48 INFO Client: Setting up container launch context for our AM > 16/03/17 17:57:48 INFO Client: Setting up the launch environment for our AM > container > 16/03/17 17:57:48 INFO Client: Preparing resources for our AM container > 16/03/17 17:57:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive > is set, falling back to uploading libraries under SPARK_HOME. 
> 16/03/17 17:57:48 INFO Client: Uploading resource > file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> > hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.10.jar > 16/03/17 17:57:49 INFO Client: Uploading resource > file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> > hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.11.jar > 16/03/17 17:57:49 INFO Client: Uploading resource > file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-abed04bf-6ac2-448b-91a9-dcc1c401a18f/__spark_conf__4163776487351314654.zip > -> > hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/__spark_conf__4163776487351314654.zip > 16/03/17 17:57:49 INFO SecurityManager: Changing view acls to: jzhang > 16/03/17 17:57:49 INFO SecurityManager: Changing modify acls to: jzhang > 16/03/17 17:57:49 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(jzhang); users > with modify permissions: Set(jzhang) > 16/03/17 17:57:49 INFO Client: Submitting application 6 to ResourceManager > {noformat} > message in AM container > {noformat} > Error: Could not find or load main class > org.apache.spark.deploy.yarn.ExecutorLauncher > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
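Not an authoritative answer to the question above, but as a working assumption for readers hitting the same error: the layout commonly reported to work is an archive whose top level directly contains the jar files from {{$SPARK_HOME/jars}} (no nested {{jars/}} directory inside the archive), uploaded to HDFS and referenced before the YARN client prepares resources, typically via {{--conf}} on spark-submit. A minimal Scala sketch with an assumed HDFS path:
{code}
// Sketch only: the archive layout and the HDFS path are assumptions, not verified here.
// spark-libs.zip is assumed to contain the jars from $SPARK_HOME/jars at its root.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("yarn-archive-example")
  .config("spark.yarn.archive", "hdfs:///apps/spark/spark-libs.zip") // assumed path
  .getOrCreate()
{code}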
[jira] [Commented] (SPARK-18037) Event listener should be aware of multiple tries of same stage
[ https://issues.apache.org/jira/browse/SPARK-18037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593347#comment-15593347 ] Josh Rosen commented on SPARK-18037: Ahhh, I remember there being other JIRAs related to a negative number of active tasks, but AFAIK we were never able to reproduce that issue. Thanks for getting to the bottom of this! I'll search JIRA and link those older issues here. > Event listener should be aware of multiple tries of same stage > -- > > Key: SPARK-18037 > URL: https://issues.apache.org/jira/browse/SPARK-18037 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Davies Liu > > A stage could be resubmitted before all the tasks from the previous attempt have > finished; the event listener will then mix them up, causing a confusing number of > active tasks (which can become negative). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18037) Event listener should be aware of multiple tries of same stage
Davies Liu created SPARK-18037: -- Summary: Event listener should be aware of multiple tries of same stage Key: SPARK-18037 URL: https://issues.apache.org/jira/browse/SPARK-18037 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Davies Liu A stage could be resubmitted before all the tasks from the previous attempt have finished; the event listener will then mix them up, causing a confusing number of active tasks (which can become negative). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
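A hedged sketch of the general fix direction (not the actual Spark patch): a listener can key its bookkeeping by (stageId, stageAttemptId) instead of stageId alone, so tasks from a resubmitted attempt cannot drive the count for the earlier attempt negative.
{code}
// Illustrative listener using only the public SparkListener API.
import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd, SparkListenerTaskStart}

class PerAttemptActiveTasks extends SparkListener {
  // Key by both the stage and its attempt, so retries are tracked separately.
  private val active = mutable.Map.empty[(Int, Int), Int].withDefaultValue(0)

  override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = synchronized {
    active((taskStart.stageId, taskStart.stageAttemptId)) += 1
  }

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized {
    active((taskEnd.stageId, taskEnd.stageAttemptId)) -= 1
  }

  def activeTasks(stageId: Int, attemptId: Int): Int = synchronized {
    active((stageId, attemptId))
  }
}
{code}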
[jira] [Assigned] (SPARK-18019) Log instrumentation in GBTs
[ https://issues.apache.org/jira/browse/SPARK-18019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18019: Assignee: (was: Apache Spark) > Log instrumentation in GBTs > --- > > Key: SPARK-18019 > URL: https://issues.apache.org/jira/browse/SPARK-18019 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > Sub-task for adding instrumentation to GBTs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18019) Log instrumentation in GBTs
[ https://issues.apache.org/jira/browse/SPARK-18019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593306#comment-15593306 ] Apache Spark commented on SPARK-18019: -- User 'sethah' has created a pull request for this issue: https://github.com/apache/spark/pull/15574 > Log instrumentation in GBTs > --- > > Key: SPARK-18019 > URL: https://issues.apache.org/jira/browse/SPARK-18019 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > Sub-task for adding instrumentation to GBTs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18019) Log instrumentation in GBTs
[ https://issues.apache.org/jira/browse/SPARK-18019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18019: Assignee: Apache Spark > Log instrumentation in GBTs > --- > > Key: SPARK-18019 > URL: https://issues.apache.org/jira/browse/SPARK-18019 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson >Assignee: Apache Spark > > Sub-task for adding instrumentation to GBTs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is
[ https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593300#comment-15593300 ] Hyukjin Kwon commented on SPARK-17916: -- Could I please ask what you think? cc [~felixcheung] > CSV data source treats empty string as null no matter what nullValue option is > -- > > Key: SPARK-17916 > URL: https://issues.apache.org/jira/browse/SPARK-17916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > When user configures {{nullValue}} in CSV data source, in addition to those > values, all empty string values are also converted to null. > {code} > data: > col1,col2 > 1,"-" > 2,"" > {code} > {code} > spark.read.format("csv").option("nullValue", "-") > {code} > We will find a null in both rows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is
[ https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593271#comment-15593271 ] Hyukjin Kwon commented on SPARK-17916: -- Oh, yes, sure. I just thought the root problem is being able to differentiate {{""}}. Once we can distinguish it, we can easily transform it. Also, another point I wanted to make is that we already have a great reference in R, but it does not seem to handle this case. > CSV data source treats empty string as null no matter what nullValue option is > -- > > Key: SPARK-17916 > URL: https://issues.apache.org/jira/browse/SPARK-17916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > When user configures {{nullValue}} in CSV data source, in addition to those > values, all empty string values are also converted to null. > {code} > data: > col1,col2 > 1,"-" > 2,"" > {code} > {code} > spark.read.format("csv").option("nullValue", "-") > {code} > We will find a null in both rows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
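For readers following along, a minimal Scala reproduction of the behaviour described in this ticket (the file path is an assumption; the data is the two-row example from the description):
{code}
// With nullValue set to "-", the empty string "" is reported to also come back
// as null, so the two cannot be told apart after loading.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("nullValue", "-")
  .load("/tmp/emptystring.csv") // assumed file containing: col1,col2 / 1,"-" / 2,""

df.filter(df("col2").isNull).count() // reported to return 2, i.e. "" was also nulled
{code}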
[jira] [Created] (SPARK-18036) Decision Trees do not handle edge cases
Seth Hendrickson created SPARK-18036: Summary: Decision Trees do not handle edge cases Key: SPARK-18036 URL: https://issues.apache.org/jira/browse/SPARK-18036 Project: Spark Issue Type: Bug Components: ML, MLlib Reporter: Seth Hendrickson Priority: Minor Decision trees/GBT/RF do not handle edge cases such as constant features or empty features. For example: {code} val dt = new DecisionTreeRegressor() val data = Seq(LabeledPoint(1.0, Vectors.dense(Array.empty[Double]))).toDF() dt.fit(data) java.lang.UnsupportedOperationException: empty.max at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229) at scala.collection.mutable.ArrayOps$ofInt.max(ArrayOps.scala:234) at org.apache.spark.ml.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:207) at org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:105) at org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:93) at org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:46) at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) ... 52 elided {code} as well as {code} val dt = new DecisionTreeRegressor() val data = Seq(LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 0.0))).toDF() dt.fit(data) java.lang.UnsupportedOperationException: empty.maxBy at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:236) at scala.collection.SeqViewLike$AbstractTransformed.maxBy(SeqViewLike.scala:37) at org.apache.spark.ml.tree.impl.RandomForest$.binsToBestSplit(RandomForest.scala:846) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
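Until these edge cases are handled inside the tree code itself, a caller-side guard is one way to fail with a clearer message; a small hedged sketch, assuming a DataFrame {{data}} with a vector column named "features":
{code}
// Pre-flight check: reject empty feature vectors before calling fit, to avoid
// the opaque "empty.max" error shown above. DataFrame and column name are assumed.
import org.apache.spark.ml.linalg.Vector

val numFeatures = data.select("features").head().getAs[Vector](0).size
require(numFeatures > 0, "training data must have at least one feature")
{code}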
[jira] [Commented] (SPARK-15777) Catalog federation
[ https://issues.apache.org/jira/browse/SPARK-15777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593229#comment-15593229 ] Yan commented on SPARK-15777: - One approach could be first tagging a subtree as specific to a data source, and then only applying the custom rules from that data source to the subtree so tagged. There could be other feasible approaches, and it is considered one of the details left open for future discussions. Thanks. > Catalog federation > -- > > Key: SPARK-15777 > URL: https://issues.apache.org/jira/browse/SPARK-15777 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Attachments: SparkFederationDesign.pdf > > > This is a ticket to track progress to support federating multiple external > catalogs. This would require establishing an API (similar to the current > ExternalCatalog API) for getting information about external catalogs, and > ability to convert a table into a data source table. > As part of this, we would also need to be able to support more than a > two-level table identifier (database.table). At the very least we would need > a three level identifier for tables (catalog.database.table). A possibly > direction is to support arbitrary level hierarchical namespaces similar to > file systems. > Once we have this implemented, we can convert the current Hive catalog > implementation into an external catalog that is "mounted" into an internal > catalog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18035) Unwrapping java maps in HiveInspectors allocates unnecessary buffer
[ https://issues.apache.org/jira/browse/SPARK-18035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18035: Assignee: Apache Spark > Unwrapping java maps in HiveInspectors allocates unnecessary buffer > --- > > Key: SPARK-18035 > URL: https://issues.apache.org/jira/browse/SPARK-18035 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Tejas Patil >Assignee: Apache Spark >Priority: Minor > > In HiveInspectors, I saw that converting Java map to Spark's > `ArrayBasedMapData` spent quite sometime in buffer copying : > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L658 > The reason being `map.toSeq` allocates a new buffer and copies the map > entries to it: > https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/MapLike.scala#L323 > This copy is not needed as we get rid of it once we extract the key and value > arrays. > Here is the call trace: > {noformat} > org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$41.apply(HiveInspectors.scala:664) > scala.collection.AbstractMap.toSeq(Map.scala:59) > scala.collection.MapLike$class.toSeq(MapLike.scala:323) > scala.collection.AbstractMap.toBuffer(Map.scala:59) > scala.collection.MapLike$class.toBuffer(MapLike.scala:326) > scala.collection.AbstractTraversable.copyToBuffer(Traversable.scala:104) > scala.collection.TraversableOnce$class.copyToBuffer(TraversableOnce.scala:275) > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > scala.collection.AbstractIterable.foreach(Iterable.scala:54) > scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > scala.collection.Iterator$class.foreach(Iterator.scala:893) > scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59) > scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18035) Unwrapping java maps in HiveInspectors allocates unnecessary buffer
[ https://issues.apache.org/jira/browse/SPARK-18035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593190#comment-15593190 ] Apache Spark commented on SPARK-18035: -- User 'tejasapatil' has created a pull request for this issue: https://github.com/apache/spark/pull/15573 > Unwrapping java maps in HiveInspectors allocates unnecessary buffer > --- > > Key: SPARK-18035 > URL: https://issues.apache.org/jira/browse/SPARK-18035 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Tejas Patil >Priority: Minor > > In HiveInspectors, I saw that converting Java map to Spark's > `ArrayBasedMapData` spent quite sometime in buffer copying : > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L658 > The reason being `map.toSeq` allocates a new buffer and copies the map > entries to it: > https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/MapLike.scala#L323 > This copy is not needed as we get rid of it once we extract the key and value > arrays. > Here is the call trace: > {noformat} > org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$41.apply(HiveInspectors.scala:664) > scala.collection.AbstractMap.toSeq(Map.scala:59) > scala.collection.MapLike$class.toSeq(MapLike.scala:323) > scala.collection.AbstractMap.toBuffer(Map.scala:59) > scala.collection.MapLike$class.toBuffer(MapLike.scala:326) > scala.collection.AbstractTraversable.copyToBuffer(Traversable.scala:104) > scala.collection.TraversableOnce$class.copyToBuffer(TraversableOnce.scala:275) > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > scala.collection.AbstractIterable.foreach(Iterable.scala:54) > scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > scala.collection.Iterator$class.foreach(Iterator.scala:893) > scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59) > scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18035) Unwrapping java maps in HiveInspectors allocates unnecessary buffer
[ https://issues.apache.org/jira/browse/SPARK-18035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18035: Assignee: (was: Apache Spark) > Unwrapping java maps in HiveInspectors allocates unnecessary buffer > --- > > Key: SPARK-18035 > URL: https://issues.apache.org/jira/browse/SPARK-18035 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Tejas Patil >Priority: Minor > > In HiveInspectors, I saw that converting Java map to Spark's > `ArrayBasedMapData` spent quite sometime in buffer copying : > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L658 > The reason being `map.toSeq` allocates a new buffer and copies the map > entries to it: > https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/MapLike.scala#L323 > This copy is not needed as we get rid of it once we extract the key and value > arrays. > Here is the call trace: > {noformat} > org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$41.apply(HiveInspectors.scala:664) > scala.collection.AbstractMap.toSeq(Map.scala:59) > scala.collection.MapLike$class.toSeq(MapLike.scala:323) > scala.collection.AbstractMap.toBuffer(Map.scala:59) > scala.collection.MapLike$class.toBuffer(MapLike.scala:326) > scala.collection.AbstractTraversable.copyToBuffer(Traversable.scala:104) > scala.collection.TraversableOnce$class.copyToBuffer(TraversableOnce.scala:275) > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > scala.collection.AbstractIterable.foreach(Iterable.scala:54) > scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > scala.collection.Iterator$class.foreach(Iterator.scala:893) > scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59) > scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18035) Unwrapping java maps in HiveInspectors allocates unnecessary buffer
Tejas Patil created SPARK-18035: --- Summary: Unwrapping java maps in HiveInspectors allocates unnecessary buffer Key: SPARK-18035 URL: https://issues.apache.org/jira/browse/SPARK-18035 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1 Reporter: Tejas Patil Priority: Minor In HiveInspectors, I saw that converting Java map to Spark's `ArrayBasedMapData` spent quite sometime in buffer copying : https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L658 The reason being `map.toSeq` allocates a new buffer and copies the map entries to it: https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/MapLike.scala#L323 This copy is not needed as we get rid of it once we extract the key and value arrays. Here is the call trace: ``` org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$41.apply(HiveInspectors.scala:664) scala.collection.AbstractMap.toSeq(Map.scala:59) scala.collection.MapLike$class.toSeq(MapLike.scala:323) scala.collection.AbstractMap.toBuffer(Map.scala:59) scala.collection.MapLike$class.toBuffer(MapLike.scala:326) scala.collection.AbstractTraversable.copyToBuffer(Traversable.scala:104) scala.collection.TraversableOnce$class.copyToBuffer(TraversableOnce.scala:275) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) scala.collection.AbstractIterable.foreach(Iterable.scala:54) scala.collection.IterableLike$class.foreach(IterableLike.scala:72) scala.collection.AbstractIterator.foreach(Iterator.scala:1336) scala.collection.Iterator$class.foreach(Iterator.scala:893) scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59) scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59) ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
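A rough sketch of the kind of change being suggested (illustrative only, not the actual HiveInspectors patch): extract the key and value arrays in a single pass over the Java map instead of going through {{map.toSeq}}, which first copies every entry into an intermediate buffer.
{code}
// Hypothetical helper: split a java.util.Map into key/value arrays without the
// intermediate Seq that map.toSeq would allocate.
def splitEntries[K, V](javaMap: java.util.Map[K, V]): (Array[Any], Array[Any]) = {
  val size = javaMap.size()
  val keys = new Array[Any](size)
  val values = new Array[Any](size)
  val it = javaMap.entrySet().iterator()
  var i = 0
  while (it.hasNext) {
    val entry = it.next()
    keys(i) = entry.getKey
    values(i) = entry.getValue
    i += 1
  }
  (keys, values)
}
{code}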
[jira] [Updated] (SPARK-18035) Unwrapping java maps in HiveInspectors allocates unnecessary buffer
[ https://issues.apache.org/jira/browse/SPARK-18035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated SPARK-18035: Description: In HiveInspectors, I saw that converting Java map to Spark's `ArrayBasedMapData` spent quite sometime in buffer copying : https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L658 The reason being `map.toSeq` allocates a new buffer and copies the map entries to it: https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/MapLike.scala#L323 This copy is not needed as we get rid of it once we extract the key and value arrays. Here is the call trace: {noformat} org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$41.apply(HiveInspectors.scala:664) scala.collection.AbstractMap.toSeq(Map.scala:59) scala.collection.MapLike$class.toSeq(MapLike.scala:323) scala.collection.AbstractMap.toBuffer(Map.scala:59) scala.collection.MapLike$class.toBuffer(MapLike.scala:326) scala.collection.AbstractTraversable.copyToBuffer(Traversable.scala:104) scala.collection.TraversableOnce$class.copyToBuffer(TraversableOnce.scala:275) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) scala.collection.AbstractIterable.foreach(Iterable.scala:54) scala.collection.IterableLike$class.foreach(IterableLike.scala:72) scala.collection.AbstractIterator.foreach(Iterator.scala:1336) scala.collection.Iterator$class.foreach(Iterator.scala:893) scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59) scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59) {noformat} was: In HiveInspectors, I saw that converting Java map to Spark's `ArrayBasedMapData` spent quite sometime in buffer copying : https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L658 The reason being `map.toSeq` allocates a new buffer and copies the map entries to it: https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/MapLike.scala#L323 This copy is not needed as we get rid of it once we extract the key and value arrays. 
Here is the call trace: ``` org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$41.apply(HiveInspectors.scala:664) scala.collection.AbstractMap.toSeq(Map.scala:59) scala.collection.MapLike$class.toSeq(MapLike.scala:323) scala.collection.AbstractMap.toBuffer(Map.scala:59) scala.collection.MapLike$class.toBuffer(MapLike.scala:326) scala.collection.AbstractTraversable.copyToBuffer(Traversable.scala:104) scala.collection.TraversableOnce$class.copyToBuffer(TraversableOnce.scala:275) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) scala.collection.AbstractIterable.foreach(Iterable.scala:54) scala.collection.IterableLike$class.foreach(IterableLike.scala:72) scala.collection.AbstractIterator.foreach(Iterator.scala:1336) scala.collection.Iterator$class.foreach(Iterator.scala:893) scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59) scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59) ``` > Unwrapping java maps in HiveInspectors allocates unnecessary buffer > --- > > Key: SPARK-18035 > URL: https://issues.apache.org/jira/browse/SPARK-18035 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Tejas Patil >Priority: Minor > > In HiveInspectors, I saw that converting Java map to Spark's > `ArrayBasedMapData` spent quite sometime in buffer copying : > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L658 > The reason being `map.toSeq` allocates a new buffer and copies the map > entries to it: > https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/MapLike.scala#L323 > This copy is not needed as we get rid of it once we extract the key and value > arrays. > Here is the call trace: > {noformat} > org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$41.apply(HiveInspectors.scala:664) > scala.collection.AbstractMap.toSeq(Map.scala:59) > scala.collection.MapLike$class.toSeq(MapLike.scala:323) > scala.collection.AbstractMap.toBuffer(Map.scala:59) > scala.collection.MapLike$class.toBuffer(MapLike.scala:326) > scala.collection.AbstractTraversable.copyToBuffer(Traversable.scala:104) > scala.collection.TraversableOnce$class.copyToBuffer(TraversableOnc
[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is
[ https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593119#comment-15593119 ] Suresh Thalamati commented on SPARK-17916: -- Thank you for trying out the different scenarios. I think the output you are getting after setting the quote to empty is not what is expected in this case. You want "" to be recognized as an empty string, not as literal quote characters in the output. Example (before my changes, on the 2.0.1 branch):
input:
col1,col2
1,"-"
2,""
3,
4,"A,B"
val df = spark.read.format("csv").option("nullValue", "\"-\"").option("quote", "").option("header", true).load("/Users/suresht/sparktests/emptystring.csv")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala> df.selectExpr("length(col2)").show
+------------+
|length(col2)|
+------------+
|        null|
|           2|
|        null|
|           2|
+------------+
scala> df.show
+----+----+
|col1|col2|
+----+----+
|   1|null|
|   2|  ""|
|   3|null|
|   4|  "A|
+----+----+
> CSV data source treats empty string as null no matter what nullValue option is > -- > > Key: SPARK-17916 > URL: https://issues.apache.org/jira/browse/SPARK-17916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > When user configures {{nullValue}} in CSV data source, in addition to those > values, all empty string values are also converted to null. > {code} > data: > col1,col2 > 1,"-" > 2,"" > {code} > {code} > spark.read.format("csv").option("nullValue", "-") > {code} > We will find a null in both rows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18022) java.lang.NullPointerException instead of real exception when saving DF to MySQL
[ https://issues.apache.org/jira/browse/SPARK-18022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593008#comment-15593008 ] Dongjoon Hyun commented on SPARK-18022: --- Or, what you want is just general improvement for error handling in `JdbcUtils.scala:256`? > java.lang.NullPointerException instead of real exception when saving DF to > MySQL > > > Key: SPARK-18022 > URL: https://issues.apache.org/jira/browse/SPARK-18022 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Maciej Bryński >Priority: Minor > > Hi, > I have found following issue. > When there is an exception while saving dataframe to MySQL I'm unable to get > it. > Instead of I'm getting following stacktrace. > {code} > 16/10/20 06:00:35 WARN TaskSetManager: Lost task 56.0 in stage 10.0 (TID > 3753, dwh-hn28.adpilot.co): java.lang.NullPointerException: Cannot suppress a > null exception. > at java.lang.Throwable.addSuppressed(Throwable.java:1046) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:256) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:314) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:313) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > The real exception could be for example duplicate on primary key etc. > With this it's very difficult to debugging apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
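A small standalone illustration of the JVM behaviour behind the stack trace above (this is not the JdbcUtils code): calling {{addSuppressed}} with a null argument itself throws a NullPointerException with the message "Cannot suppress a null exception.", which is how the real failure, e.g. a duplicate-key error, ends up masked.
{code}
// The original exception here is hypothetical; only the addSuppressed(null) behaviour matters.
val original = new RuntimeException("duplicate entry for primary key") // assumed real cause
val toSuppress: Throwable = null

try {
  original.addSuppressed(toSuppress) // throws NPE: "Cannot suppress a null exception."
} catch {
  case npe: NullPointerException =>
    // This NPE is what surfaces to the user instead of `original`.
    println(npe.getMessage)
}
{code}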
[jira] [Commented] (SPARK-17829) Stable format for offset log
[ https://issues.apache.org/jira/browse/SPARK-17829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593000#comment-15593000 ] Cody Koeninger commented on SPARK-17829: At least with regard to kafka offsets, it might be good to keep this the same format as in SPARK-17812 > Stable format for offset log > > > Key: SPARK-17829 > URL: https://issues.apache.org/jira/browse/SPARK-17829 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Tyson Condie > > Currently we use java serialization for the WAL that stores the offsets > contained in each batch. This has two main issues: > - It can break across spark releases (though this is not the only thing > preventing us from upgrading a running query) > - It is unnecessarily opaque to the user. > I'd propose we require offsets to provide a user readable serialization and > use that instead. JSON is probably a good option. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
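A hedged sketch of the sort of human-readable serialization being proposed (the shape of the offset map is an assumption for illustration, not the format that was ultimately adopted), using json4s, which Spark already bundles:
{code}
import org.json4s.NoTypeHints
import org.json4s.jackson.Serialization

implicit val formats = Serialization.formats(NoTypeHints)

// Toy offset map: topic-partition -> offset. Names and values are made up.
val offsets = Map("clicks-0" -> 1234L, "clicks-1" -> 5678L)

val json = Serialization.write(offsets)                    // {"clicks-0":1234,"clicks-1":5678}
val restored = Serialization.read[Map[String, Long]](json) // back to a typed map
{code}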
[jira] [Commented] (SPARK-18022) java.lang.NullPointerException instead of real exception when saving DF to MySQL
[ https://issues.apache.org/jira/browse/SPARK-18022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593005#comment-15593005 ] Dongjoon Hyun commented on SPARK-18022: --- Hi, [~maver1ck]. Could you give us more information to reproduce this? Is the table created by Spark? Spark does not create INDEX or CONSTRAINT (primary key), does it? > java.lang.NullPointerException instead of real exception when saving DF to > MySQL > > > Key: SPARK-18022 > URL: https://issues.apache.org/jira/browse/SPARK-18022 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Maciej Bryński >Priority: Minor > > Hi, > I have found following issue. > When there is an exception while saving dataframe to MySQL I'm unable to get > it. > Instead of I'm getting following stacktrace. > {code} > 16/10/20 06:00:35 WARN TaskSetManager: Lost task 56.0 in stage 10.0 (TID > 3753, dwh-hn28.adpilot.co): java.lang.NullPointerException: Cannot suppress a > null exception. > at java.lang.Throwable.addSuppressed(Throwable.java:1046) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:256) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:314) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:313) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > The real exception could be for example duplicate on primary key etc. > With this it's very difficult to debugging apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17829) Stable format for offset log
[ https://issues.apache.org/jira/browse/SPARK-17829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17829: - Assignee: Tyson Condie > Stable format for offset log > > > Key: SPARK-17829 > URL: https://issues.apache.org/jira/browse/SPARK-17829 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Tyson Condie > > Currently we use java serialization for the WAL that stores the offsets > contained in each batch. This has two main issues: > - It can break across spark releases (though this is not the only thing > preventing us from upgrading a running query) > - It is unnecessarily opaque to the user. > I'd propose we require offsets to provide a user readable serialization and > use that instead. JSON is probably a good option. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592862#comment-15592862 ] Don Drake commented on SPARK-16845: --- I compiled your branch and ran my large job and it finished successfully. Sorry for the confusion, I wasn't watching the PR, just this JIRA and wasn't aware of the changes you were making. Can this get merged as well as backported to 2.0.x? Thanks so much. -Don > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > - > > Key: SPARK-16845 > URL: https://issues.apache.org/jira/browse/SPARK-16845 > Project: Spark > Issue Type: Bug > Components: Java API, ML, MLlib >Affects Versions: 2.0.0 >Reporter: hejie > > I have a wide table(400 columns), when I try fitting the traindata on all > columns, the fatal error occurs. > ... 46 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18034) Upgrade to MiMa 0.1.11
[ https://issues.apache.org/jira/browse/SPARK-18034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18034: Assignee: Josh Rosen (was: Apache Spark) > Upgrade to MiMa 0.1.11 > -- > > Key: SPARK-18034 > URL: https://issues.apache.org/jira/browse/SPARK-18034 > Project: Spark > Issue Type: Bug > Components: Project Infra >Reporter: Josh Rosen >Assignee: Josh Rosen > > We should upgrade to the latest release of MiMa (0.1.11) in order to include > my fix for a bug which led to flakiness in the MiMa checks > (https://github.com/typesafehub/migration-manager/issues/115) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18034) Upgrade to MiMa 0.1.11
[ https://issues.apache.org/jira/browse/SPARK-18034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18034: Assignee: Apache Spark (was: Josh Rosen) > Upgrade to MiMa 0.1.11 > -- > > Key: SPARK-18034 > URL: https://issues.apache.org/jira/browse/SPARK-18034 > Project: Spark > Issue Type: Bug > Components: Project Infra >Reporter: Josh Rosen >Assignee: Apache Spark > > We should upgrade to the latest release of MiMa (0.1.11) in order to include > my fix for a bug which led to flakiness in the MiMa checks > (https://github.com/typesafehub/migration-manager/issues/115) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18034) Upgrade to MiMa 0.1.11
[ https://issues.apache.org/jira/browse/SPARK-18034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592845#comment-15592845 ] Apache Spark commented on SPARK-18034: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/15571 > Upgrade to MiMa 0.1.11 > -- > > Key: SPARK-18034 > URL: https://issues.apache.org/jira/browse/SPARK-18034 > Project: Spark > Issue Type: Bug > Components: Project Infra >Reporter: Josh Rosen >Assignee: Josh Rosen > > We should upgrade to the latest release of MiMa (0.1.11) in order to include > my fix for a bug which led to flakiness in the MiMa checks > (https://github.com/typesafehub/migration-manager/issues/115) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python
[ https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592839#comment-15592839 ] Reynold Xin commented on SPARK-10915: - The current implementation of collect_list isn't going to work very well for you. I do think we should create a version of collect_list that spills. Alternatively, you can do df.repartition().sortWithinPartitions() -- which will give you the same thing. > Add support for UDAFs in Python > --- > > Key: SPARK-10915 > URL: https://issues.apache.org/jira/browse/SPARK-10915 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Justin Uang > > This should support python defined lambdas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
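A hedged Scala sketch of the alternative suggested above (the key and ordering columns {{userId}} and {{timestamp}} are assumptions, and {{df}} is an existing DataFrame): repartition by the key and sort within partitions, then consume each partition as an iterator instead of materializing a per-key list.
{code}
import org.apache.spark.sql.functions.col

// Rows for a given userId land in one partition, ordered by timestamp within it.
val ordered = df
  .repartition(col("userId"))
  .sortWithinPartitions(col("userId"), col("timestamp"))

ordered.foreachPartition { rows =>
  var seen = 0L
  rows.foreach { row =>
    seen += 1 // placeholder for real per-row processing, done one row at a time
  }
  println(s"processed $seen rows in this partition")
}
{code}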
[jira] [Created] (SPARK-18034) Upgrade to MiMa 0.1.11
Josh Rosen created SPARK-18034: -- Summary: Upgrade to MiMa 0.1.11 Key: SPARK-18034 URL: https://issues.apache.org/jira/browse/SPARK-18034 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Josh Rosen Assignee: Josh Rosen We should upgrade to the latest release of MiMa (0.1.11) in order to include my fix for a bug which led to flakiness in the MiMa checks (https://github.com/typesafehub/migration-manager/issues/115) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python
[ https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592831#comment-15592831 ] Jason White commented on SPARK-10915: - At the moment, we use .repartitionAndSortWithinPartitions to give us a strictly ordered iterable that we can process one at a time. We don't have a Python list sitting in memory, instead we rely on ExternalSort to order in a memory-safe way. I don't yet have enough experience with DataFrames to know if we will have the same or similar problems there. It's possible that collect_list will perform better - I'll give that a try when we get there and report back on this ticket if it's a suitable approach for our use case. > Add support for UDAFs in Python > --- > > Key: SPARK-10915 > URL: https://issues.apache.org/jira/browse/SPARK-10915 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Justin Uang > > This should support python defined lambdas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2629) Improved state management for Spark Streaming (mapWithState)
[ https://issues.apache.org/jira/browse/SPARK-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2629: - Summary: Improved state management for Spark Streaming (mapWithState) (was: Improved state management for Spark Streaming) > Improved state management for Spark Streaming (mapWithState) > > > Key: SPARK-2629 > URL: https://issues.apache.org/jira/browse/SPARK-2629 > Project: Spark > Issue Type: Epic > Components: Streaming >Affects Versions: 0.9.2, 1.0.2, 1.2.2, 1.3.1, 1.4.1, 1.5.1 >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 1.6.0 > > > Current updateStateByKey provides stateful processing in Spark Streaming. It > allows the user to maintain per-key state and manage that state using an > updateFunction. The updateFunction is called for each key, and it uses new > data and existing state of the key, to generate an updated state. However, > based on community feedback, we have learnt the following lessons. > - Need for more optimized state management that does not scan every key > - Need to make it easier to implement common use cases - (a) timeout of idle > data, (b) returning items other than state > The high level idea that I am proposing is > - Introduce a new API -trackStateByKey- *mapWithState* that, allows the user > to update per-key state, and emit arbitrary records. The new API is necessary > as this will have significantly different semantics than the existing > updateStateByKey API. This API will have direct support for timeouts. > - Internally, the system will keep the state data as a map/list within the > partitions of the state RDDs. The new data RDDs will be partitioned > appropriately, and for all the key-value data, it will lookup the map/list in > the state RDD partition and create a new list/map of updated state data. The > new state RDD partition will be created based on the update data and if > necessary, with old data. > Here is the detailed design doc (*outdated, to be updated*). Please take a > look and provide feedback as comments. > https://docs.google.com/document/d/1NoALLyd83zGs1hNGMm0Pc5YOVgiPpMHugGMk6COqxxE/edit#heading=h.ph3w0clkd4em -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
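For readers landing on this epic, a brief hedged usage sketch of the resulting API (the input stream {{wordCounts}} and the 30-second timeout are assumptions): the mapping function receives the key, the new value and the {{State}} handle, may emit an arbitrary record, and an idle-timeout can be attached to the {{StateSpec}}.
{code}
import org.apache.spark.streaming.{Seconds, State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

// wordCounts: DStream[(String, Int)] is assumed to exist already.
def runningCounts(wordCounts: DStream[(String, Int)]): DStream[(String, Int)] = {
  val mappingFunc = (word: String, count: Option[Int], state: State[Int]) => {
    val sum = count.getOrElse(0) + state.getOption.getOrElse(0)
    state.update(sum) // per-key state kept by Spark
    (word, sum)       // arbitrary record emitted downstream
  }

  wordCounts.mapWithState(
    StateSpec.function(mappingFunc).timeout(Seconds(30)) // drop idle keys after 30s
  )
}
{code}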
[jira] [Resolved] (SPARK-18021) Refactor file name specification for data sources
[ https://issues.apache.org/jira/browse/SPARK-18021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18021. - Resolution: Fixed Fix Version/s: 2.1.0 > Refactor file name specification for data sources > - > > Key: SPARK-18021 > URL: https://issues.apache.org/jira/browse/SPARK-18021 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.1.0 > > > Currently each data source OutputWriter is responsible for specifying the > entire file name for each file output. This, however, does not make any sense > because we rely on file name for certain behaviors in Spark SQL, e.g. bucket > id. The current approach allows individual data sources to break the > implementation of bucketing. > We don't want to move file name entirely also out of the data sources, > because different data sources do want to specify different extensions. > A good compromise is for the OutputWriter to take in the prefix for a file, > and it can add its own suffix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
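A rough model of the compromise described in the ticket (the trait and object names are illustrative, not Spark's actual OutputWriter interfaces): the framework hands each writer the file-name prefix it controls, so information such as the bucket id stays intact, and the data source only contributes its own extension.
{code}
// Illustrative sketch, not Spark's OutputWriter API.
trait FileNaming {
  def extension: String                           // supplied by the data source
  final def newFileName(prefix: String): String = // prefix supplied by the framework
    prefix + extension
}

object ParquetNaming extends FileNaming {
  override val extension: String = ".snappy.parquet"
}

object CsvNaming extends FileNaming {
  override val extension: String = ".csv.gz"
}

// e.g. ParquetNaming.newFileName("part-00000-bucket3") returns
// "part-00000-bucket3.snappy.parquet" (the prefix string here is made up).
{code}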
[jira] [Created] (SPARK-18033) Deprecate TaskContext.partitionId
Cody Koeninger created SPARK-18033: -- Summary: Deprecate TaskContext.partitionId Key: SPARK-18033 URL: https://issues.apache.org/jira/browse/SPARK-18033 Project: Spark Issue Type: Improvement Reporter: Cody Koeninger Mark TaskContext.partitionId as deprecated, because it doesn't always reflect the physical index at the time the RDD is created. Add a foreachPartitionWithIndex method to mirror the existing mapPartitionsWithIndex method. For background, see http://apache-spark-developers-list.1001551.n3.nabble.com/PSA-TaskContext-partitionId-the-actual-logical-partition-index-td19524.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
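Until such a method exists, the proposed {{foreachPartitionWithIndex}} can be approximated in user code on top of the existing {{mapPartitionsWithIndex}}; a hedged sketch, not the eventual API:
{code}
import org.apache.spark.rdd.RDD

// Run `f` once per partition with the partition's physical index, forcing
// evaluation with a cheap count over one dummy element per partition.
def foreachPartitionWithIndex[T](rdd: RDD[T])(f: (Int, Iterator[T]) => Unit): Unit = {
  rdd.mapPartitionsWithIndex { (index, iter) =>
    f(index, iter)
    Iterator.single(index)
  }.count()
}
{code}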
[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python
[ https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592534#comment-15592534 ] Jason White commented on SPARK-10915: - That's unfortunate. Materializing a list somewhere is exactly what we're trying to avoid. The lists can get unpredictably long for some small number of keys, and this approach tends to cause us to blow by our memory ceiling, at least when using RDDs. It's why we don't use .groupByKey unless absolutely necessary. > Add support for UDAFs in Python > --- > > Key: SPARK-10915 > URL: https://issues.apache.org/jira/browse/SPARK-10915 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Justin Uang > > This should support python defined lambdas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python
[ https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592544#comment-15592544 ] Reynold Xin commented on SPARK-10915: - But if you need strict ordering guarantees, materializing them would be necessary, since sorting is a blocking operator. > Add support for UDAFs in Python > --- > > Key: SPARK-10915 > URL: https://issues.apache.org/jira/browse/SPARK-10915 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Justin Uang > > This should support python defined lambdas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python
[ https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592514#comment-15592514 ] Davies Liu commented on SPARK-10915: [~jason.white] When an aggregate function is applied, the order of input rows is not defined (even if you have an ORDER BY before the aggregate). In cases where the order matters, you will have to use collect_list and a UDF. > Add support for UDAFs in Python > --- > > Key: SPARK-10915 > URL: https://issues.apache.org/jira/browse/SPARK-10915 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Justin Uang > > This should support python defined lambdas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
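A minimal Scala sketch of the collect_list-plus-UDF workaround described here, against a hypothetical DataFrame {{df}} with columns {{user}}, {{ts}} (long) and {{value}} (double); the point is that the UDF imposes the ordering itself rather than relying on input row order:
{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{collect_list, struct, udf}

// "latest value per user": pick the max-ts event inside the UDF, not via the row order
val lastByTs = udf { (events: Seq[Row]) =>
  events.maxBy(_.getLong(0)).getDouble(1)   // assumes at least one event per user
}

val result = df.groupBy("user")
  .agg(lastByTs(collect_list(struct("ts", "value"))).as("last_value"))
{code}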
[jira] [Updated] (SPARK-17999) Add getPreferredLocations for KafkaSourceRDD
[ https://issues.apache.org/jira/browse/SPARK-17999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-17999: - Assignee: Saisai Shao > Add getPreferredLocations for KafkaSourceRDD > > > Key: SPARK-17999 > URL: https://issues.apache.org/jira/browse/SPARK-17999 > Project: Spark > Issue Type: Improvement > Components: SQL, Streaming >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 2.0.2, 2.1.0 > > > The newly implemented Structured Streaming KafkaSource already calculates the > preferred locations for each topic partition, but doesn't offer this > information through RDD's {{getPreferredLocations}} method. So here we propose > to add this method to {{KafkaSourceRDD}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
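For readers who have not implemented it before, {{getPreferredLocations}} is the standard RDD hook the scheduler consults for locality. A generic sketch of overriding it in a custom RDD (illustrative only, not the actual KafkaSourceRDD change):
{code}
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// one partition per host; hosts(i) is where partition i's data (or preferred consumer) lives
class HostAwareRDD(sc: SparkContext, hosts: Seq[String]) extends RDD[Int](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](hosts.size)(i => new Partition { override def index: Int = i })

  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator(split.index)

  // the scheduler tries to place partition i's task on hosts(i) if an executor runs there
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(hosts(split.index))
}
{code}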
[jira] [Resolved] (SPARK-17999) Add getPreferredLocations for KafkaSourceRDD
[ https://issues.apache.org/jira/browse/SPARK-17999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-17999. -- Resolution: Fixed Fix Version/s: 2.1.0 2.0.2 > Add getPreferredLocations for KafkaSourceRDD > > > Key: SPARK-17999 > URL: https://issues.apache.org/jira/browse/SPARK-17999 > Project: Spark > Issue Type: Improvement > Components: SQL, Streaming >Reporter: Saisai Shao >Priority: Minor > Fix For: 2.0.2, 2.1.0 > > > The newly implemented Structured Streaming KafkaSource already calculates the > preferred locations for each topic partition, but doesn't offer this > information through RDD's {{getPreferredLocations}} method. So here we propose > to add this method to {{KafkaSourceRDD}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18032) Spark test failed as OOM in jenkins
Davies Liu created SPARK-18032: -- Summary: Spark test failed as OOM in jenkins Key: SPARK-18032 URL: https://issues.apache.org/jira/browse/SPARK-18032 Project: Spark Issue Type: Bug Components: Tests Reporter: Davies Liu Assignee: Josh Rosen I saw some tests fail with OOM recently, for example, https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/1998/console#l10n-footer Maybe we should increase the heap size, since we continue to add more stuff to Spark and its tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18031) Flaky test: org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite basic functionality
Davies Liu created SPARK-18031: -- Summary: Flaky test: org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite basic functionality Key: SPARK-18031 URL: https://issues.apache.org/jira/browse/SPARK-18031 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Davies Liu https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite&test_name=basic+functionality -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18030) Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite
Davies Liu created SPARK-18030: -- Summary: Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite Key: SPARK-18030 URL: https://issues.apache.org/jira/browse/SPARK-18030 Project: Spark Issue Type: Bug Components: Streaming Reporter: Davies Liu Assignee: Tathagata Das https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.streaming.FileStreamSourceSuite&test_name=when+schema+inference+is+turned+on%2C+should+read+partition+data -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15687) Columnar execution engine
[ https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592429#comment-15592429 ] Evan Chan commented on SPARK-15687: --- [~kiszk] thanks for the PR... would you mind pointing me to the ColumnarBatch Trait/API? I'd like to review that piece of it, but the code review is really really really long :) Thanks > Columnar execution engine > - > > Key: SPARK-15687 > URL: https://issues.apache.org/jira/browse/SPARK-15687 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > This ticket tracks progress in making the entire engine columnar, especially > in the context of nested data type support. > In Spark 2.0, we have used the internal column batch interface in Parquet > reading (via a vectorized Parquet decoder) and low cardinality aggregation. > Other parts of the engine are already using whole-stage code generation, > which is in many ways more efficient than a columnar execution engine for > flat data types. > The goal here is to figure out a story to work towards making column batch > the common data exchange format between operators outside whole-stage code > generation, as well as with external systems (e.g. Pandas). > Some of the important questions to answer are: > From the architectural perspective: > - What is the end state architecture? > - Should aggregation be columnar? > - Should sorting be columnar? > - How do we encode nested data? What are the operations on nested data, and > how do we handle these operations in a columnar format? > - What is the transition plan towards the end state? > From an external API perspective: > - Can we expose a more efficient column batch user-defined function API? > - How do we leverage this to integrate with 3rd party tools? > - Can we have a spec for a fixed version of the column batch format that can > be externalized and use that in data source API v2? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15780) Support mapValues on KeyValueGroupedDataset
[ https://issues.apache.org/jira/browse/SPARK-15780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15780. - Resolution: Fixed Assignee: Koert Kuipers Fix Version/s: 2.1.0 > Support mapValues on KeyValueGroupedDataset > --- > > Key: SPARK-15780 > URL: https://issues.apache.org/jira/browse/SPARK-15780 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: koert kuipers >Assignee: Koert Kuipers >Priority: Minor > Fix For: 2.1.0 > > > Currently when doing groupByKey on a Dataset the key ends up in the values > which can be clumsy: > {noformat} > val ds: Dataset[(K, V)] = ... > val grouped: KeyValueGroupedDataset[(K, (K, V))] = ds.groupByKey(_._1) > {noformat} > With mapValues one can create something more similar to PairRDDFunctions[K, > V]: > {noformat} > val ds: Dataset[(K, V)] = ... > val grouped: KeyValueGroupedDataset[(K, V)] = > ds.groupByKey(_._1).mapValues(_._2) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
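As a usage note, a minimal spark-shell style sketch (Spark 2.1+, where this lands) of how the new mapValues composes with a per-group aggregation; the Sale class and the numbers are made up:
{code}
import spark.implicits._                         // `spark` is the SparkSession in spark-shell

case class Sale(shop: String, amount: Double)
val ds = Seq(Sale("a", 1.0), Sale("a", 2.5), Sale("b", 3.0)).toDS()

val totals = ds.groupByKey(_.shop)               // key: the shop name
  .mapValues(_.amount)                           // values no longer carry the key along
  .reduceGroups(_ + _)                           // Dataset[(String, Double)]

totals.show()
{code}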
[jira] [Resolved] (SPARK-17698) Join predicates should not contain filter clauses
[ https://issues.apache.org/jira/browse/SPARK-17698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-17698. - Resolution: Fixed Assignee: Tejas Patil Fix Version/s: 2.1.0 > Join predicates should not contain filter clauses > - > > Key: SPARK-17698 > URL: https://issues.apache.org/jira/browse/SPARK-17698 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Fix For: 2.1.0 > > > `ExtractEquiJoinKeys` is incorrectly using filter predicates as the join > condition for joins. While this does not lead to incorrect results but in > case of bucketed + sorted tables, we might miss out on avoiding un-necessary > shuffle + sort. eg. > {code} > val df = (1 until 10).toDF("id").coalesce(1) > hc.sql("DROP TABLE IF EXISTS table1").collect > df.write.bucketBy(8, "id").sortBy("id").saveAsTable("table1") > hc.sql("DROP TABLE IF EXISTS table2").collect > df.write.bucketBy(8, "id").sortBy("id").saveAsTable("table2") > sqlContext.sql(""" > SELECT a.id, b.id > FROM table1 a > FULL OUTER JOIN table2 b > ON a.id = b.id AND a.id='1' AND b.id='1' > """).explain(true) > {code} > This is doing shuffle + sort over table scan outputs which is not needed as > both tables are bucketed and sorted on the same columns and have same number > of buckets. This should be a single stage job. > {code} > SortMergeJoin [id#38, cast(id#38 as double), 1.0], [id#39, 1.0, cast(id#39 as > double)], FullOuter > :- *Sort [id#38 ASC NULLS FIRST, cast(id#38 as double) ASC NULLS FIRST, 1.0 > ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(id#38, cast(id#38 as double), 1.0, 200) > : +- *FileScan parquet default.table1[id#38] Batched: true, Format: > ParquetFormat, InputPaths: file:spark-warehouse/table1, PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > +- *Sort [id#39 ASC NULLS FIRST, 1.0 ASC NULLS FIRST, cast(id#39 as double) > ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(id#39, 1.0, cast(id#39 as double), 200) > +- *FileScan parquet default.table2[id#39] Batched: true, Format: > ParquetFormat, InputPaths: file:spark-warehouse/table2, PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17904) Add a wrapper function to install R packages on each executors.
[ https://issues.apache.org/jira/browse/SPARK-17904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592272#comment-15592272 ] Piotr Smolinski commented on SPARK-17904: - Would it work at all? I have been looking recently at the SparkR implementation. ATM, on the executor side all dapply/gapply/spark.lapply calls are single-shot operations. The executor JVM either forks a preallocated small daemon process or launches a new R runtime (on Windows, or when the daemon is explicitly disabled) only for the duration of the call. This process is disposed of immediately once the task is done. That means there is no R runtime that can be preinitialized. Check: https://github.com/apache/spark/blob/master/R/pkg/inst/worker/worker.R > Add a wrapper function to install R packages on each executors. > --- > > Key: SPARK-17904 > URL: https://issues.apache.org/jira/browse/SPARK-17904 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Yanbo Liang > > SparkR provides {{spark.lapply}} to run local R functions in a distributed > environment, and {{dapply}} to run UDFs on a SparkDataFrame. > If users use third-party libraries inside of the function passed > into {{spark.lapply}} or {{dapply}}, they should install the required R packages > on each executor in advance. > To install dependent R packages on each executor and check that it succeeded, > we can run code similar to the following: > (Note: The code is just an example, not the prototype of this proposal. The > detailed implementation should be discussed.) > {code} > rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), > install.packages("Matrix")) > test <- function(x) { "Matrix" %in% rownames(installed.packages()) } > rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), test ) > collectRDD(rdd) > {code} > It's cumbersome to run this code snippet each time you need a third-party > library, since SparkR is an interactive analytics tool; users may call lots > of libraries during an analytics session. In native R, users can run > {{install.packages()}} and {{library()}} across the interactive session. > Should we provide one API to wrap the work mentioned above, so that users can > install dependent R packages on each executor easily? > I propose the following API: > {{spark.installPackages(pkgs, repos)}} > * pkgs: the names of the packages. If repos = NULL, this can be set to a > local/hdfs path, then SparkR can install packages from local package archives. > * repos: the base URL(s) of the repositories to use. It can be NULL to > install from local directories. > Since SparkR has its own library directories for installing packages on > each executor, I think it will not pollute the native R environment. I'd > like to know whether this makes sense, and feel free to correct me if there is a > misunderstanding. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17048) ML model read for custom transformers in a pipeline does not work
[ https://issues.apache.org/jira/browse/SPARK-17048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592141#comment-15592141 ] Nicolas Long edited comment on SPARK-17048 at 10/20/16 3:32 PM: I hit this today too. The Scala workaround is simply to create an object of the same name that extends DefaultParamsReadable. E.g. {code:java} class HtmlRemover(val uid: String) extends StringUnaryTransformer[String, HtmlRemover] with DefaultParamsWritable { def this() = this(Identifiable.randomUID("htmlremover")) def createTransformFunc: String => String = s => { Jsoup.parse(s).body().text() } } object HtmlRemover extends DefaultParamsReadable[HtmlRemover] {code} But it would be nice to be able to not have to have the singleton object and simply add the trait to the transformer itself. Note that StringUnaryTransformer is a simple custom wrapper trait here. was (Author: nicl): I hit this today too. The Scala workaround is simply to create an object of the same name that extends DefaultParamsReadable. E.g. {code:java} class HtmlRemover(val uid: String) extends StringUnaryTransformer[String, HtmlRemover] with DefaultParamsWritable { def this() = this(Identifiable.randomUID("htmlremover")) def createTransformFunc: String => String = s => { Jsoup.parse(s).body().text() } } object HtmlRemover extends DefaultParamsReadable[HtmlRemover] {code} Note that StringUnaryTransformer is a simple custom wrapper trait here. > ML model read for custom transformers in a pipeline does not work > -- > > Key: SPARK-17048 > URL: https://issues.apache.org/jira/browse/SPARK-17048 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.0 > Environment: Spark 2.0.0 > Java API >Reporter: Taras Matyashovskyy > Labels: easyfix, features > Original Estimate: 2h > Remaining Estimate: 2h > > 0. Use Java API :( > 1. Create any custom ML transformer > 2. Make it MLReadable and MLWritable > 3. Add to pipeline > 4. Evaluate model, e.g. CrossValidationModel, and save results to disk > 5. For custom transformer you can use DefaultParamsReader and > DefaultParamsWriter, for instance > 6. Load model from saved directory > 7. All out-of-the-box objects are loaded successfully, e.g. Pipeline, > Evaluator, etc. > 8. Your custom transformer will fail with NPE > Reason: > ReadWrite.scala:447 > cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path) > In Java this only works for static methods. > As we are implementing MLReadable or MLWritable, then this call should be > instance method call. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17048) ML model read for custom transformers in a pipeline does not work
[ https://issues.apache.org/jira/browse/SPARK-17048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592141#comment-15592141 ] Nicolas Long commented on SPARK-17048: -- I hit this today too. The Scala workaround is simply to create an object of the same name that extends DefaultParamsReadable. E.g. {code:java} class HtmlRemover(val uid: String) extends StringUnaryTransformer[String, HtmlRemover] with DefaultParamsWritable { def this() = this(Identifiable.randomUID("htmlremover")) def createTransformFunc: String => String = s => { Jsoup.parse(s).body().text() } } object HtmlRemover extends DefaultParamsReadable[HtmlRemover] {code} Note that StringUnaryTransformer is a simple custom wrapper trait here. > ML model read for custom transformers in a pipeline does not work > -- > > Key: SPARK-17048 > URL: https://issues.apache.org/jira/browse/SPARK-17048 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.0 > Environment: Spark 2.0.0 > Java API >Reporter: Taras Matyashovskyy > Labels: easyfix, features > Original Estimate: 2h > Remaining Estimate: 2h > > 0. Use Java API :( > 1. Create any custom ML transformer > 2. Make it MLReadable and MLWritable > 3. Add to pipeline > 4. Evaluate model, e.g. CrossValidationModel, and save results to disk > 5. For custom transformer you can use DefaultParamsReader and > DefaultParamsWriter, for instance > 6. Load model from saved directory > 7. All out-of-the-box objects are loaded successfully, e.g. Pipeline, > Evaluator, etc. > 8. Your custom transformer will fail with NPE > Reason: > ReadWrite.scala:447 > cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path) > In Java this only works for static methods. > As we are implementing MLReadable or MLWritable, then this call should be > instance method call. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
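To make the effect of the workaround concrete, a small sketch of the save/load round trip that fails without the companion object. It assumes the HtmlRemover above, a hypothetical DataFrame {{docs}}, and that StringUnaryTransformer wires up the input/output columns itself:
{code}
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}

val pipeline = new Pipeline().setStages(Array[PipelineStage](new HtmlRemover()))
val model = pipeline.fit(docs)
model.write.overwrite().save("/tmp/html-pipeline")   // fine: DefaultParamsWritable is mixed in

// without `object HtmlRemover extends DefaultParamsReadable[HtmlRemover]` this load hits the
// NPE from ReadWrite.scala (reflective static `read` lookup); with it, the stage is restored
val restored = PipelineModel.load("/tmp/html-pipeline")
{code}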
[jira] [Commented] (SPARK-15777) Catalog federation
[ https://issues.apache.org/jira/browse/SPARK-15777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592092#comment-15592092 ] Nattavut Sutyanyong commented on SPARK-15777: - How do we test that a rule added in one data source implementation will not interfere with such a SQL statement referencing objects from both data sources? > Catalog federation > -- > > Key: SPARK-15777 > URL: https://issues.apache.org/jira/browse/SPARK-15777 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Attachments: SparkFederationDesign.pdf > > > This is a ticket to track progress to support federating multiple external > catalogs. This would require establishing an API (similar to the current > ExternalCatalog API) for getting information about external catalogs, and > ability to convert a table into a data source table. > As part of this, we would also need to be able to support more than a > two-level table identifier (database.table). At the very least we would need > a three level identifier for tables (catalog.database.table). A possibly > direction is to support arbitrary level hierarchical namespaces similar to > file systems. > Once we have this implemented, we can convert the current Hive catalog > implementation into an external catalog that is "mounted" into an internal > catalog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18029) PruneFileSourcePartitions should not change the output of LogicalRelation
[ https://issues.apache.org/jira/browse/SPARK-18029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18029: Assignee: Apache Spark (was: Wenchen Fan) > PruneFileSourcePartitions should not change the output of LogicalRelation > - > > Key: SPARK-18029 > URL: https://issues.apache.org/jira/browse/SPARK-18029 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18029) PruneFileSourcePartitions should not change the output of LogicalRelation
[ https://issues.apache.org/jira/browse/SPARK-18029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592067#comment-15592067 ] Apache Spark commented on SPARK-18029: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/15569 > PruneFileSourcePartitions should not change the output of LogicalRelation > - > > Key: SPARK-18029 > URL: https://issues.apache.org/jira/browse/SPARK-18029 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18029) PruneFileSourcePartitions should not change the output of LogicalRelation
[ https://issues.apache.org/jira/browse/SPARK-18029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18029: Assignee: Wenchen Fan (was: Apache Spark) > PruneFileSourcePartitions should not change the output of LogicalRelation > - > > Key: SPARK-18029 > URL: https://issues.apache.org/jira/browse/SPARK-18029 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18029) PruneFileSourcePartitions should not change the output of LogicalRelation
Wenchen Fan created SPARK-18029: --- Summary: PruneFileSourcePartitions should not change the output of LogicalRelation Key: SPARK-18029 URL: https://issues.apache.org/jira/browse/SPARK-18029 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9219) ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD
[ https://issues.apache.org/jira/browse/SPARK-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592030#comment-15592030 ] Nick Orka commented on SPARK-9219: -- I've made a CLONE for the JIRA ticket here https://issues.apache.org/jira/browse/SPARK-18015 > ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD > --- > > Key: SPARK-9219 > URL: https://issues.apache.org/jira/browse/SPARK-9219 > Project: Spark > Issue Type: Bug >Affects Versions: 1.4.1 >Reporter: Mohsen Zainalpour > > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 > (TID 77, 192.168.1.194): java.lang.ClassCastException: cannot assign instance > of scala.collection.immutable.List$SerializationProxy to field > org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type > scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD > at > java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083) > at > java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477) > at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477) > at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > scala.collection.immutable.List$SerializationProxy.readObject(List.scal
[jira] [Commented] (SPARK-9219) ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD
[ https://issues.apache.org/jira/browse/SPARK-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592041#comment-15592041 ] Nick Orka commented on SPARK-9219: -- I'm using IntelliJ Idea. Here is whole dependency tree (IML file) {code:xml} {code} > ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD > --- > > Key: SPARK-9219 > URL: https://issues.apache.org/jira/browse/SPARK-9219 > Project: Spark > Issue Type: Bug >Affects Versions: 1.4.1 >Reporter: Mohsen Zainalpour > > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 > (TID 77, 192.168.1.194): java.lang.ClassCastException: cannot assign instance > of scala.collection.immutable.List$SerializationProxy to field > org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type > scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD > at > java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083) > at > java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477) > at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477) > at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.
[jira] [Issue Comment Deleted] (SPARK-9219) ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD
[ https://issues.apache.org/jira/browse/SPARK-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Orka updated SPARK-9219: - Comment: was deleted (was: I've made a CLONE for the JIRA ticket here https://issues.apache.org/jira/browse/SPARK-18015) > ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD > --- > > Key: SPARK-9219 > URL: https://issues.apache.org/jira/browse/SPARK-9219 > Project: Spark > Issue Type: Bug >Affects Versions: 1.4.1 >Reporter: Mohsen Zainalpour > > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 > (TID 77, 192.168.1.194): java.lang.ClassCastException: cannot assign instance > of scala.collection.immutable.List$SerializationProxy to field > org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type > scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD > at > java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083) > at > java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477) > at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477) > at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477) > at sun.ref
[jira] [Updated] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aleksander Eskilson updated SPARK-18016: Description: When attempting to encode collections of large Java objects to Datasets having very wide or deeply nested schemas, code generation can fail, yielding: {code} Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection has grown past JVM limit of 0x at org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) at org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) at org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) at org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) at org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) at org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) at org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) at org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374) at org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369) at org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) at org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396) at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311) at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:229) at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:196) at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:91) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:905) ... 35 more {code} During generation of the code for Spec
[jira] [Comment Edited] (SPARK-17131) Code generation fails when running SQL expressions against a wide dataset (thousands of columns)
[ https://issues.apache.org/jira/browse/SPARK-17131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592000#comment-15592000 ] Aleksander Eskilson edited comment on SPARK-17131 at 10/20/16 2:45 PM: --- Yeah, that makes sense. So far, what I documented and this one seem to have been the only JIRAs that exhibit specifically the Constant Pool limit error. I'm trying to dig deeper into it to see if it really marks its own class of error, but given that SPARK-17702 didn't resolve the error case I posted (even though it splits up sections of large generated code), I do suspect they are, quite related, but ultimately different issues. I think the splitExpressions technique that was used in SPARK-17702 and that also appears to be being employed in SPARK-16845 could be useful for the range of different classes that can generate too many lines of code. Seeing the issues linked together is definitely useful. To that end, I'll leave mine resolved as a duplicate of SPARK-16845 for now until I can make use of the patch it develops, so we can see more conclusively if they're related issues, or truly duplicates. And I'll link the two "0x" issues together as related. was (Author: aeskilson): Yeah, that makes sense. So far, what I documented and this one seem to have been the only JIRAs that exhibit specifically the Constant Pool limit error. I'm trying to dig deeper into it to see if it really marks its own class of error, but given that SPARK-17702 didn't resolve the error case I posted (even though it splits up sections of large generated code), I do suspect they are, quite related, but ultimately different issues. I think the spliExpressions technique that was used in SPARK-17702 and that also appears to be being employed in SPARK-16845 could be useful for the range of different classes that can generate too many lines of code. Seeing the issues linked together is definitely useful. To that end, I'll leave mine resolved as a duplicate of SPARK-16845 for now until I can make use of the patch it develops, so we can see more conclusively if they're related issues, or truly duplicates. And I'll link the two "0x" issues together as related. 
> Code generation fails when running SQL expressions against a wide dataset > (thousands of columns) > > > Key: SPARK-17131 > URL: https://issues.apache.org/jira/browse/SPARK-17131 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Iaroslav Zeigerman > Attachments: > _SPARK_17131__add_a_test_case_with_1000_column_DF_where_describe___fails.patch > > > When reading the CSV file that contains 1776 columns Spark and Janino fail to > generate the code with message: > {noformat} > Constant pool has grown past JVM limit of 0x > {noformat} > When running a common select with all columns it's fine: > {code} > val allCols = df.columns.map(c => col(c).as(c + "_alias")) > val newDf = df.select(allCols: _*) > newDf.show() > {code} > But when I invoke the describe method: > {code} > newDf.describe(allCols: _*) > {code} > it fails with the following stack trace: > {noformat} > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:889) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:941) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:938) > at > org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > ... 30 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool has > grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:402) > at > org.codehaus.janino.util.ClassFile.addConstantIntegerInfo(ClassFile.java:300) > at > org.codehaus.janino.UnitCompiler.addConstantIntegerInfo(UnitCompiler.java:10307) > at org.codehaus.janino.UnitCompiler.pushConstant(UnitCompiler.java:8868) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4346) > at org.codehaus.janino.UnitCompiler.access$7100(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$10.visitIntegerLiteral(UnitCompiler.java:3265) > at org.codehaus.janino.Java$IntegerLiteral.accept(Java.java:4321) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290) > at org.codehaus.janino.UnitCompiler.fakeCompile(Unit
[jira] [Commented] (SPARK-17131) Code generation fails when running SQL expressions against a wide dataset (thousands of columns)
[ https://issues.apache.org/jira/browse/SPARK-17131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592000#comment-15592000 ] Aleksander Eskilson commented on SPARK-17131: - Yeah, that makes sense. So far, what I documented and this one seem to have been the only JIRAs that exhibit specifically the Constant Pool limit error. I'm trying to dig deeper into it to see if it really marks its own class of error, but given that SPARK-17702 didn't resolve the error case I posted (even though it splits up sections of large generated code), I do suspect they are, quite related, but ultimately different issues. I think the spliExpressions technique that was used in SPARK-17702 and that also appears to be being employed in SPARK-16845 could be useful for the range of different classes that can generate too many lines of code. Seeing the issues linked together is definitely useful. To that end, I'll leave mine resolved as a duplicate of SPARK-16845 for now until I can make use of the patch it develops, so we can see more conclusively if they're related issues, or truly duplicates. And I'll link the two "0x" issues together as related. > Code generation fails when running SQL expressions against a wide dataset > (thousands of columns) > > > Key: SPARK-17131 > URL: https://issues.apache.org/jira/browse/SPARK-17131 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Iaroslav Zeigerman > Attachments: > _SPARK_17131__add_a_test_case_with_1000_column_DF_where_describe___fails.patch > > > When reading the CSV file that contains 1776 columns Spark and Janino fail to > generate the code with message: > {noformat} > Constant pool has grown past JVM limit of 0x > {noformat} > When running a common select with all columns it's fine: > {code} > val allCols = df.columns.map(c => col(c).as(c + "_alias")) > val newDf = df.select(allCols: _*) > newDf.show() > {code} > But when I invoke the describe method: > {code} > newDf.describe(allCols: _*) > {code} > it fails with the following stack trace: > {noformat} > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:889) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:941) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:938) > at > org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > ... 
30 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool has > grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:402) > at > org.codehaus.janino.util.ClassFile.addConstantIntegerInfo(ClassFile.java:300) > at > org.codehaus.janino.UnitCompiler.addConstantIntegerInfo(UnitCompiler.java:10307) > at org.codehaus.janino.UnitCompiler.pushConstant(UnitCompiler.java:8868) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4346) > at org.codehaus.janino.UnitCompiler.access$7100(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$10.visitIntegerLiteral(UnitCompiler.java:3265) > at org.codehaus.janino.Java$IntegerLiteral.accept(Java.java:4321) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290) > at org.codehaus.janino.UnitCompiler.fakeCompile(UnitCompiler.java:2605) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4362) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3975) > at org.codehaus.janino.UnitCompiler.access$6900(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$10.visitMethodInvocation(UnitCompiler.java:3263) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2662) > at org.codehaus.janino.UnitCompiler.access$4400(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$7.visitMethodInvocation(UnitCompiler.java:2627) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2654) > at org.codehaus.janino.UnitCompiler.compile2(Un
[jira] [Commented] (SPARK-17131) Code generation fails when running SQL expressions against a wide dataset (thousands of columns)
[ https://issues.apache.org/jira/browse/SPARK-17131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15591953#comment-15591953 ] Sean Owen commented on SPARK-17131: --- OK well I think it's fine to leave one copy of the "0x" issue open if you have any reasonable reason to suspect it's different, and just link the JIRAs. I suppose I was mostly saying this could just be reopened, and separately, there are a lot of real duplicates of similar issues out there too, making it hard to figure out what the underlying unique issues are. > Code generation fails when running SQL expressions against a wide dataset > (thousands of columns) > > > Key: SPARK-17131 > URL: https://issues.apache.org/jira/browse/SPARK-17131 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Iaroslav Zeigerman > Attachments: > _SPARK_17131__add_a_test_case_with_1000_column_DF_where_describe___fails.patch > > > When reading the CSV file that contains 1776 columns Spark and Janino fail to > generate the code with message: > {noformat} > Constant pool has grown past JVM limit of 0x > {noformat} > When running a common select with all columns it's fine: > {code} > val allCols = df.columns.map(c => col(c).as(c + "_alias")) > val newDf = df.select(allCols: _*) > newDf.show() > {code} > But when I invoke the describe method: > {code} > newDf.describe(allCols: _*) > {code} > it fails with the following stack trace: > {noformat} > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:889) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:941) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:938) > at > org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > ... 
30 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool has > grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:402) > at > org.codehaus.janino.util.ClassFile.addConstantIntegerInfo(ClassFile.java:300) > at > org.codehaus.janino.UnitCompiler.addConstantIntegerInfo(UnitCompiler.java:10307) > at org.codehaus.janino.UnitCompiler.pushConstant(UnitCompiler.java:8868) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4346) > at org.codehaus.janino.UnitCompiler.access$7100(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$10.visitIntegerLiteral(UnitCompiler.java:3265) > at org.codehaus.janino.Java$IntegerLiteral.accept(Java.java:4321) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290) > at org.codehaus.janino.UnitCompiler.fakeCompile(UnitCompiler.java:2605) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4362) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3975) > at org.codehaus.janino.UnitCompiler.access$6900(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$10.visitMethodInvocation(UnitCompiler.java:3263) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2662) > at org.codehaus.janino.UnitCompiler.access$4400(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$7.visitMethodInvocation(UnitCompiler.java:2627) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2654) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1643) > > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aleksander Eskilson updated SPARK-18016: Description: When attempting to encode collections of large Java objects to Datasets having very wide or deeply nested schemas, code generation can fail, yielding: {code} Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection has grown past JVM limit of 0x at org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) at org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) at org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) at org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) at org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) at org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) at org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) at org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374) at org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369) at org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) at org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396) at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311) at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:229) at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:196) at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:91) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:905) ... 35 more {code} During generation of the code for Spec
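For context, the kind of encoding the description refers to can be sketched as follows. The case classes and sizes below are illustrative only (they are not taken from the report, and a schema this small will not actually hit the limit); the point is that deeply nested or very wide bean/case-class types flatten into a huge schema, and the generated SpecificUnsafeProjection for that schema is where the constant pool overflows.
{code}
// Illustrative sketch only: a nested case class whose flattened schema grows
// multiplicatively with nesting width/depth. Real cases need far more fields.
import org.apache.spark.sql.SparkSession

case class Leaf(a: Int, b: Int, c: Int, d: Int, e: Int)
case class Branch(l1: Leaf, l2: Leaf, l3: Leaf, l4: Leaf)
case class Root(b1: Branch, b2: Branch, b3: Branch, b4: Branch)

object NestedEncodeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("NestedEncodeSketch").getOrCreate()
    import spark.implicits._

    val leaf = Leaf(1, 2, 3, 4, 5)
    val branch = Branch(leaf, leaf, leaf, leaf)
    val data = Seq(Root(branch, branch, branch, branch))

    // toDS() triggers codegen of an UnsafeProjection over the flattened schema;
    // with enough width/nesting this is where the JaninoRuntimeException surfaces.
    val ds = data.toDS()
    ds.count()

    spark.stop()
  }
}
{code}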
[jira] [Updated] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aleksander Eskilson updated SPARK-18016: Summary: Code Generation: Constant Pool Past Limit for Wide/Nested Dataset (was: Code Generation Fails When Encoding Large Object to Wide Dataset) > Code Generation: Constant Pool Past Limit for Wide/Nested Dataset > - > > Key: SPARK-18016 > URL: https://issues.apache.org/jira/browse/SPARK-18016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Aleksander Eskilson > > When attempting to encode collections of large Java objects to Datasets > having very wide or deeply nested schemas, code generation can fail, yielding: > {code} > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for > class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection > has grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) > at > org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) > at > org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) > at > org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > 
org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) > at > org.codehaus.janino.SimpleCompiler.compileToClassLoader(
[jira] [Resolved] (SPARK-18016) Code Generation Fails When Encoding Large Object to Wide Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aleksander Eskilson resolved SPARK-18016. - Resolution: Duplicate > Code Generation Fails When Encoding Large Object to Wide Dataset > > > Key: SPARK-18016 > URL: https://issues.apache.org/jira/browse/SPARK-18016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Aleksander Eskilson > > When attempting to encode collections of large Java objects to Datasets > having very wide or deeply nested schemas, code generation can fail, yielding: > {code} > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for > class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection > has grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) > at > org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) > at > org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) > at > org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at 
org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) > at > org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396) > at > org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311) >
[jira] [Commented] (SPARK-18016) Code Generation Fails When Encoding Large Object to Wide Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15591814#comment-15591814 ] Aleksander Eskilson commented on SPARK-18016: - As per some discussion in SPARK-17131, marking this issue as a potential duplicate of SPARK-16845 so we can see if its resolution solves the same issue and we can track in what way these bugs may be related. We can reopen if necessary. > Code Generation Fails When Encoding Large Object to Wide Dataset > > > Key: SPARK-18016 > URL: https://issues.apache.org/jira/browse/SPARK-18016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Aleksander Eskilson > > When attempting to encode collections of large Java objects to Datasets > having very wide or deeply nested schemas, code generation can fail, yielding: > {code} > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for > class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection > has grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) > at > org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) > at > org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) > at > org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) >
[jira] [Commented] (SPARK-17131) Code generation fails when running SQL expressions against a wide dataset (thousands of columns)
[ https://issues.apache.org/jira/browse/SPARK-17131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15591810#comment-15591810 ] Aleksander Eskilson commented on SPARK-17131: - Sure, I apologize for that. I'll also mark it as a duplicate of SPARK-16845 and monitor its pull-request to see if it resolves the issue I opened. > Code generation fails when running SQL expressions against a wide dataset > (thousands of columns) > > > Key: SPARK-17131 > URL: https://issues.apache.org/jira/browse/SPARK-17131 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Iaroslav Zeigerman > Attachments: > _SPARK_17131__add_a_test_case_with_1000_column_DF_where_describe___fails.patch > > > When reading the CSV file that contains 1776 columns Spark and Janino fail to > generate the code with message: > {noformat} > Constant pool has grown past JVM limit of 0x > {noformat} > When running a common select with all columns it's fine: > {code} > val allCols = df.columns.map(c => col(c).as(c + "_alias")) > val newDf = df.select(allCols: _*) > newDf.show() > {code} > But when I invoke the describe method: > {code} > newDf.describe(allCols: _*) > {code} > it fails with the following stack trace: > {noformat} > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:889) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:941) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:938) > at > org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > ... 
30 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool has > grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:402) > at > org.codehaus.janino.util.ClassFile.addConstantIntegerInfo(ClassFile.java:300) > at > org.codehaus.janino.UnitCompiler.addConstantIntegerInfo(UnitCompiler.java:10307) > at org.codehaus.janino.UnitCompiler.pushConstant(UnitCompiler.java:8868) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4346) > at org.codehaus.janino.UnitCompiler.access$7100(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$10.visitIntegerLiteral(UnitCompiler.java:3265) > at org.codehaus.janino.Java$IntegerLiteral.accept(Java.java:4321) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290) > at org.codehaus.janino.UnitCompiler.fakeCompile(UnitCompiler.java:2605) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4362) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3975) > at org.codehaus.janino.UnitCompiler.access$6900(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$10.visitMethodInvocation(UnitCompiler.java:3263) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2662) > at org.codehaus.janino.UnitCompiler.access$4400(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$7.visitMethodInvocation(UnitCompiler.java:2627) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2654) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1643) > > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
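Since describe() over the full width is what tips code generation over the limit here, one possible workaround (a sketch only, not the fix being tracked in SPARK-16845) is to summarize the columns in batches so that each generated projection stays small. The names BatchedDescribe, describeInBatches, and batchSize below are illustrative, not from the report.
{code}
// Hypothetical workaround sketch: describe a very wide DataFrame in column batches.
import org.apache.spark.sql.{DataFrame, SparkSession}

object BatchedDescribe {
  // Run describe() over groups of `batchSize` columns; returns one summary frame per batch.
  def describeInBatches(df: DataFrame, batchSize: Int = 200): Seq[DataFrame] =
    df.columns.grouped(batchSize).map(cols => df.describe(cols: _*)).toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("BatchedDescribe").getOrCreate()
    import spark.implicits._

    // Small stand-in for the 1776-column CSV from the report.
    val wide = spark.range(100).toDF("c0")
      .select((0 until 50).map(i => ($"c0" + i).as(s"c$i")): _*)

    describeInBatches(wide, batchSize = 10).foreach(_.show(5))
    spark.stop()
  }
}
{code}
Each batch produces its own summary DataFrame keyed by the "summary" column (count, mean, stddev, min, max), so the per-batch results can be joined back together if a single view is needed.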
[jira] [Assigned] (SPARK-18028) simplify TableFileCatalog
[ https://issues.apache.org/jira/browse/SPARK-18028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18028: Assignee: Apache Spark (was: Wenchen Fan) > simplify TableFileCatalog > - > > Key: SPARK-18028 > URL: https://issues.apache.org/jira/browse/SPARK-18028 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18028) simplify TableFileCatalog
[ https://issues.apache.org/jira/browse/SPARK-18028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15591794#comment-15591794 ] Apache Spark commented on SPARK-18028: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/15568 > simplify TableFileCatalog > - > > Key: SPARK-18028 > URL: https://issues.apache.org/jira/browse/SPARK-18028 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org