[jira] [Commented] (SPARK-20310) Dependency convergence error for scala-xml
[ https://issues.apache.org/jira/browse/SPARK-20310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967146#comment-15967146 ] Samik R commented on SPARK-20310: - Hi Sean, Thanks for your comments. You are probably thinking that I am using Scala v2.11.0 based on the line ["+-org.scala-lang:scalap:2.11.0"], but this is coming from the json4s-jackson dependency of spark-core. I am actually on 2.11.8 on the box and have that Scala version as a library dependency in the pom file as well. I also agree that this may not actually cause a problem (which is why I thought this was a minor bug). I was just planning to set the dependency to the latest version and hope things work fine. Thanks. -Samik > Dependency convergence error for scala-xml > -- > > Key: SPARK-20310 > URL: https://issues.apache.org/jira/browse/SPARK-20310 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.1.0 >Reporter: Samik R >Priority: Minor > > Hi, > I am trying to compile a package (Apache TinkerPop) which has spark-core as > one of its dependencies. I am compiling with v2.1.0. But when I run the Maven > build through a dependency checker, it shows a dependency convergence error > within spark-core itself for the scala-xml package, as below: > Dependency convergence error for org.scala-lang.modules:scala-xml_2.11:1.0.1 > paths to dependency are: > +-org.apache.tinkerpop:spark-gremlin:3.2.3 > +-org.apache.spark:spark-core_2.11:2.1.0 > +-org.json4s:json4s-jackson_2.11:3.2.11 > +-org.json4s:json4s-core_2.11:3.2.11 > +-org.scala-lang:scalap:2.11.0 > +-org.scala-lang:scala-compiler:2.11.0 > +-org.scala-lang.modules:scala-xml_2.11:1.0.1 > and > +-org.apache.tinkerpop:spark-gremlin:3.2.3 > +-org.apache.spark:spark-core_2.11:2.1.0 > +-org.apache.spark:spark-tags_2.11:2.1.0 > +-org.scalatest:scalatest_2.11:2.2.6 > +-org.scala-lang.modules:scala-xml_2.11:1.0.2 > Can this be fixed? > Thanks. 
> -Samik -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
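A common workaround on the consuming project's side (not a change to Spark itself) is to pin a single scala-xml version in the consumer's own pom.xml via dependencyManagement; the version below is simply the newer of the two reported versions, chosen for illustration:

```xml
<!-- Sketch: force Maven to mediate both paths to one scala-xml version. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.scala-lang.modules</groupId>
      <artifactId>scala-xml_2.11</artifactId>
      <version>1.0.2</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

With this in place, Maven's dependency mediation resolves both paths in the tree above to 1.0.2, which is what Samik describes planning to do.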
[jira] [Commented] (SPARK-20316) In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax
[ https://issues.apache.org/jira/browse/SPARK-20316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967145#comment-15967145 ] Apache Spark commented on SPARK-20316: -- User 'ouyangxiaochen' has created a pull request for this issue: https://github.com/apache/spark/pull/17628 > In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax > - > > Key: SPARK-20316 > URL: https://issues.apache.org/jira/browse/SPARK-20316 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 >Reporter: Xiaochen Ouyang > > In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax: > private var prompt = "spark-sql" > private var continuedPrompt = "".padTo(prompt.length, ' ') > If a variable is never reassigned anywhere, it should be declared with 'val'; > otherwise, with 'var'.
[jira] [Assigned] (SPARK-20316) In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax
[ https://issues.apache.org/jira/browse/SPARK-20316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20316: Assignee: Apache Spark > In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax > - > > Key: SPARK-20316 > URL: https://issues.apache.org/jira/browse/SPARK-20316 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 >Reporter: Xiaochen Ouyang >Assignee: Apache Spark > > In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax: > private var prompt = "spark-sql" > private var continuedPrompt = "".padTo(prompt.length, ' ') > If a variable is never reassigned anywhere, it should be declared with 'val'; > otherwise, with 'var'.
[jira] [Assigned] (SPARK-20316) In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax
[ https://issues.apache.org/jira/browse/SPARK-20316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20316: Assignee: (was: Apache Spark) > In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax > - > > Key: SPARK-20316 > URL: https://issues.apache.org/jira/browse/SPARK-20316 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 >Reporter: Xiaochen Ouyang > > In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax: > private var prompt = "spark-sql" > private var continuedPrompt = "".padTo(prompt.length, ' ') > If a variable is never reassigned anywhere, it should be declared with 'val'; > otherwise, with 'var'.
[jira] [Created] (SPARK-20316) In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax
Xiaochen Ouyang created SPARK-20316: --- Summary: In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax Key: SPARK-20316 URL: https://issues.apache.org/jira/browse/SPARK-20316 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Environment: Spark 2.1.0 Reporter: Xiaochen Ouyang In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax: private var prompt = "spark-sql" private var continuedPrompt = "".padTo(prompt.length, ' ') If a variable is never reassigned anywhere, it should be declared with 'val'; otherwise, with 'var'.
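The guideline the ticket asks for can be sketched in a few lines of plain Scala (the extra variable name below is made up purely for illustration):

```scala
// 'val' for bindings that are never reassigned; 'var' only where
// reassignment actually happens somewhere in the class.
val prompt = "spark-sql"                            // never reassigned -> val
val continuedPrompt = "".padTo(prompt.length, ' ')  // derived once    -> val

var currentPrompt = prompt                          // reassigned below -> var
currentPrompt = continuedPrompt
```

Since `prompt` and `continuedPrompt` in SparkSQLCLIDriver are never reassigned, the ticket proposes changing their `var` declarations to `val`.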
[jira] [Commented] (SPARK-19924) Handle InvocationTargetException for all Hive Shim
[ https://issues.apache.org/jira/browse/SPARK-19924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967142#comment-15967142 ] Apache Spark commented on SPARK-19924: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/17627 > Handle InvocationTargetException for all Hive Shim > -- > > Key: SPARK-19924 > URL: https://issues.apache.org/jira/browse/SPARK-19924 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.2.0 > > > Since we are using a shim for most Hive metastore APIs, the exceptions thrown > by the underlying methods invoked via Method.invoke() are wrapped in > `InvocationTargetException`. Instead of handling them one by one, we should > handle all of them in `withClient`. If any of them is missed, the error > message could look unfriendly.
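A hedged sketch of the idea (this is not Spark's actual `withClient`, whose real signature differs): unwrap `InvocationTargetException` once, centrally, so that every reflective shim call surfaces the underlying Hive error instead of the reflective wrapper:

```scala
import java.lang.reflect.InvocationTargetException

// Central wrapper: route every reflective Hive metastore call through here,
// so the InvocationTargetException unwrapping lives in exactly one place.
def withClient[T](body: => T): T =
  try body
  catch {
    case e: InvocationTargetException if e.getCause != null =>
      throw e.getCause // rethrow the real cause, not the reflective wrapper
  }
```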
[jira] [Commented] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition
[ https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966973#comment-15966973 ] Stephane Maarek commented on SPARK-20287: - [~c...@koeninger.org] How about using the subscribe pattern? https://kafka.apache.org/0102/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html ``` public void subscribe(Collection<String> topics) Subscribe to the given list of topics to get dynamically assigned partitions. Topic subscriptions are not incremental. This list will replace the current assignment (if there is one). It is not possible to combine topic subscription with group management with manual partition assignment through assign(Collection). If the given list of topics is empty, it is treated the same as unsubscribe(). ``` Then you let Kafka handle the partition assignments? As all the consumers share the same group.id, the data will be effectively distributed across every Spark instance? But then I guess you may have already explored that option and it goes against the Spark DirectStream API? (not a Spark expert, just trying to understand the limitations. I believe you when you say you did it the most straightforward way) > Kafka Consumer should be able to subscribe to more than one topic partition > --- > > Key: SPARK-20287 > URL: https://issues.apache.org/jira/browse/SPARK-20287 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Stephane Maarek > > As I understand and as it stands, one Kafka Consumer is created for each > topic partition in the source Kafka topics, and they're cached. 
> cf > https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48 > In my opinion, that makes the design an anti-pattern for Kafka and highly > inefficient: > - Each Kafka consumer creates a connection to Kafka > - Spark doesn't leverage the power of the Kafka consumer group protocol, which > automatically assigns and balances partitions amongst all the consumers that > share the same group.id > - You can still cache your Kafka consumer even if it has multiple partitions. > I'm not sure how that translates to Spark's underlying RDD > architecture, but from a Kafka standpoint, I believe creating one consumer > per partition is a big overhead, and a risk as the user may have to increase > the spark.streaming.kafka.consumer.cache.maxCapacity parameter. > Happy to discuss and understand the rationale
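For reference, the group-management style Stephane describes looks roughly like this against the plain Kafka 0.10 client; broker address, group id, and topic names below are placeholders:

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "my-streaming-app")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

// One consumer, many topics: the broker-side group coordinator assigns and
// rebalances partitions across all consumers sharing this group.id.
val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Seq("topicA", "topicB").asJava) // replaces any prior subscription
```

As the thread notes, Spark's direct stream instead manages partition assignment itself, which is the design trade-off under discussion.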
[jira] [Resolved] (SPARK-20131) Flaky Test: o.a.s.streaming.StreamingContextSuite.SPARK-18560 Receiver data should be deserialized properly
[ https://issues.apache.org/jira/browse/SPARK-20131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-20131. -- Resolution: Fixed Assignee: Shixiong Zhu Fix Version/s: 2.2.0 2.1.1 > Flaky Test: o.a.s.streaming.StreamingContextSuite.SPARK-18560 Receiver data > should be deserialized properly > --- > > Key: SPARK-20131 > URL: https://issues.apache.org/jira/browse/SPARK-20131 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Takuya Ueshin >Assignee: Shixiong Zhu >Priority: Minor > Labels: flaky-test > Fix For: 2.1.1, 2.2.0 > > > This test failed recently here: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/2861/testReport/junit/org.apache.spark.streaming/StreamingContextSuite/SPARK_18560_Receiver_data_should_be_deserialized_properly_/ > Dashboard > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.StreamingContextSuite_name=SPARK-18560+Receiver+data+should+be+deserialized+properly. 
> Error Message > {code} > latch.await(60L, SECONDS) was false > {code} > {code} > org.scalatest.exceptions.TestFailedException: latch.await(60L, SECONDS) was > false > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) > at > org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply$mcV$sp(StreamingContextSuite.scala:837) > at > org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply(StreamingContextSuite.scala:810) > at > org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply(StreamingContextSuite.scala:810) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at > org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingContextSuite.scala:44) > at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) > at > org.apache.spark.streaming.StreamingContextSuite.runTest(StreamingContextSuite.scala:44) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:381) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) > at org.scalatest.Suite$class.run(Suite.scala:1424) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at org.scalatest.SuperEngine.runImpl(Engine.scala:545) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256) > at >
[jira] [Commented] (SPARK-20036) impossible to read a whole kafka topic using kafka 0.10 and spark 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-20036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966939#comment-15966939 ] Daniel Nuriyev commented on SPARK-20036: Thank you, Cody. I will do as you say and report what happens. > impossible to read a whole kafka topic using kafka 0.10 and spark 2.0.0 > > > Key: SPARK-20036 > URL: https://issues.apache.org/jira/browse/SPARK-20036 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Daniel Nuriyev > Attachments: Main.java, pom.xml > > > I use Kafka 0.10.1 and Java code with the following dependencies: > org.apache.kafka:kafka_2.11:0.10.1.1 > org.apache.kafka:kafka-clients:0.10.1.1 > org.apache.spark:spark-streaming_2.11:2.0.0 > org.apache.spark:spark-streaming-kafka-0-10_2.11:2.0.0 > The code tries to read the whole topic using: > kafkaParams.put("auto.offset.reset", "earliest"); > Using 5-second batches: > jssc = new JavaStreamingContext(conf, Durations.seconds(5)); > Each batch returns empty. > While debugging the code, I noticed that KafkaUtils.fixKafkaParams is called, which > overrides "earliest" with "none". > Whether this is related or not, when I used Kafka 0.8 on the client with > Kafka 0.10.1 on the server, I could read the whole topic.
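The reporter's setup translates to roughly the following against the kafka-0-10 integration; broker address, group id, and topic name are placeholders, and this sketches the reported configuration rather than a confirmed fix:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Durations, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf().setAppName("read-whole-topic")
val ssc = new StreamingContext(conf, Durations.seconds(5)) // 5-second batches

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "group.id" -> "whole-topic-reader",
  "key.deserializer" -> classOf[org.apache.kafka.common.serialization.StringDeserializer],
  "value.deserializer" -> classOf[org.apache.kafka.common.serialization.StringDeserializer],
  // The setting at issue: start from the beginning of the topic.
  "auto.offset.reset" -> "earliest"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams)
)
```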
[jira] [Assigned] (SPARK-20315) Set ScalaUDF's deterministic to true
[ https://issues.apache.org/jira/browse/SPARK-20315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20315: Assignee: Apache Spark (was: Xiao Li) > Set ScalaUDF's deterministic to true > > > Key: SPARK-20315 > URL: https://issues.apache.org/jira/browse/SPARK-20315 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Xiao Li >Assignee: Apache Spark > > ScalaUDF is always assumed to be deterministic. However, the current master > still sets it based on the children's deterministic values.
[jira] [Assigned] (SPARK-20315) Set ScalaUDF's deterministic to true
[ https://issues.apache.org/jira/browse/SPARK-20315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20315: Assignee: Xiao Li (was: Apache Spark) > Set ScalaUDF's deterministic to true > > > Key: SPARK-20315 > URL: https://issues.apache.org/jira/browse/SPARK-20315 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > > ScalaUDF is always assumed to be deterministic. However, the current master > still sets it based on the children's deterministic values.
[jira] [Commented] (SPARK-20315) Set ScalaUDF's deterministic to true
[ https://issues.apache.org/jira/browse/SPARK-20315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966938#comment-15966938 ] Apache Spark commented on SPARK-20315: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/17626 > Set ScalaUDF's deterministic to true > > > Key: SPARK-20315 > URL: https://issues.apache.org/jira/browse/SPARK-20315 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > > ScalaUDF is always assumed to be deterministic. However, the current master > still sets it based on the children's deterministic values.
[jira] [Created] (SPARK-20315) Set ScalaUDF's deterministic to true
Xiao Li created SPARK-20315: --- Summary: Set ScalaUDF's deterministic to true Key: SPARK-20315 URL: https://issues.apache.org/jira/browse/SPARK-20315 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0, 2.0.2 Reporter: Xiao Li Assignee: Xiao Li ScalaUDF is always assumed to be deterministic. However, the current master still sets it based on the children's deterministic values.
[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966926#comment-15966926 ] Yan Facai (颜发才) commented on SPARK-20199: - It's not hard, and I can work on it. However, there are two possible solutions: 1. add a `setFeatureSubsetStrategy` method to DecisionTree, so GBT creates a DecisionTree using that method; code like `val dt = new DecisionTreeRegressor().setFeatureSubsetStrategy(xxx)`. 2. add a `featureSubsetStrategy` param to the `train` method of DecisionTree; minimal changes. Which one is better? I prefer the first. > GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter > --- > > Key: SPARK-20199 > URL: https://issues.apache.org/jira/browse/SPARK-20199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: pralabhkumar > > Spark GradientBoostedTreesModel doesn't have a column sampling rate parameter. > This parameter is available in H2O and XGBoost. > Sample from H2O.ai: > gbmParams._col_sample_rate > Please provide the parameter.
[jira] [Created] (SPARK-20314) Inconsistent error handling in JSON parsing SQL functions
Eric Wasserman created SPARK-20314: -- Summary: Inconsistent error handling in JSON parsing SQL functions Key: SPARK-20314 URL: https://issues.apache.org/jira/browse/SPARK-20314 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Eric Wasserman Most parse errors in the JSON parsing SQL functions (e.g. json_tuple, get_json_object) will return null(s) if the JSON is badly formed. However, if Jackson determines that the string includes invalid characters, it will throw an exception (java.io.CharConversionException: Invalid UTF-32 character) that Spark does not catch. This creates a robustness problem in that these functions cannot be used at all when there may be dirty data, as these exceptions will kill the jobs.
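A hedged sketch of the behavior the report asks for, written against Jackson directly (class names are from jackson-databind; this is not Spark's code): treat a character-conversion failure the same as any other malformed-JSON case and return null rather than propagating the exception:

```scala
import java.io.CharConversionException
import com.fasterxml.jackson.core.JsonProcessingException
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}

val mapper = new ObjectMapper()

// Defensive parse: any Jackson parse failure -- including the
// CharConversionException on invalid UTF-32 input -- yields null,
// mirroring how json_tuple / get_json_object handle other bad JSON.
def parseOrNull(json: String): JsonNode =
  try mapper.readTree(json)
  catch {
    case _: JsonProcessingException | _: CharConversionException => null
  }
```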
[jira] [Comment Edited] (SPARK-1809) Mesos backend doesn't respect HADOOP_CONF_DIR
[ https://issues.apache.org/jira/browse/SPARK-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966782#comment-15966782 ] Andrew Ash edited comment on SPARK-1809 at 4/12/17 11:00 PM: - I'm not using Mesos anymore, so closing was (Author: aash): Not using Mesos anymore, so closing > Mesos backend doesn't respect HADOOP_CONF_DIR > - > > Key: SPARK-1809 > URL: https://issues.apache.org/jira/browse/SPARK-1809 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.0.0 >Reporter: Andrew Ash > > In order to use HDFS paths without the server component, standalone mode > reads spark-env.sh and scans the HADOOP_CONF_DIR to open core-site.xml and > get the fs.default.name parameter. > This lets you use HDFS paths like: > - hdfs:///tmp/myfile.txt > instead of > - hdfs://myserver.mydomain.com:8020/tmp/myfile.txt > However as of recent 1.0.0 pre-release (hash 756c96) I had to specify HDFS > paths with the full server even though I have HADOOP_CONF_DIR still set in > spark-env.sh. The HDFS, Spark, and Mesos nodes are all co-located and > non-domain HDFS paths work fine when using the standalone mode. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-1809) Mesos backend doesn't respect HADOOP_CONF_DIR
[ https://issues.apache.org/jira/browse/SPARK-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash closed SPARK-1809. - Resolution: Unresolved Not using Mesos anymore, so closing > Mesos backend doesn't respect HADOOP_CONF_DIR > - > > Key: SPARK-1809 > URL: https://issues.apache.org/jira/browse/SPARK-1809 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.0.0 >Reporter: Andrew Ash > > In order to use HDFS paths without the server component, standalone mode > reads spark-env.sh and scans the HADOOP_CONF_DIR to open core-site.xml and > get the fs.default.name parameter. > This lets you use HDFS paths like: > - hdfs:///tmp/myfile.txt > instead of > - hdfs://myserver.mydomain.com:8020/tmp/myfile.txt > However as of recent 1.0.0 pre-release (hash 756c96) I had to specify HDFS > paths with the full server even though I have HADOOP_CONF_DIR still set in > spark-env.sh. The HDFS, Spark, and Mesos nodes are all co-located and > non-domain HDFS paths work fine when using the standalone mode. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9103) Tracking spark's memory usage
[ https://issues.apache.org/jira/browse/SPARK-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966736#comment-15966736 ] Apache Spark commented on SPARK-9103: - User 'jsoltren' has created a pull request for this issue: https://github.com/apache/spark/pull/17625 > Tracking spark's memory usage > - > > Key: SPARK-9103 > URL: https://issues.apache.org/jira/browse/SPARK-9103 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Reporter: Zhang, Liye > Attachments: Tracking Spark Memory Usage - Phase 1.pdf > > > Currently Spark provides only a little memory usage information (RDD cache on > the web UI) for the executors. Users have no idea of the memory consumption > when they are running Spark applications that use a lot of memory in the > executors. Especially when they encounter an OOM, it's really hard to know > the cause of the problem. So it would be helpful to give detailed memory > consumption information for each part of Spark, so that users can have a clear > picture of where the memory is actually used. > The memory usage info to expose should include, but not be limited to, shuffle, > cache, network, serializer, etc. > Users can optionally choose to enable this functionality since it is mainly > for debugging and tuning.
[jira] [Commented] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas
[ https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966385#comment-15966385 ] Jacques Nadeau commented on SPARK-13534: Great, thanks [~holdenk]! > Implement Apache Arrow serializer for Spark DataFrame for use in > DataFrame.toPandas > --- > > Key: SPARK-13534 > URL: https://issues.apache.org/jira/browse/SPARK-13534 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Wes McKinney > Attachments: benchmark.py > > > The current code path for accessing Spark DataFrame data in Python using > PySpark passes through an inefficient serialization-deserialization process > that I've examined at a high level here: > https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] > objects are being deserialized in pure Python as a list of tuples, which are > then converted to pandas.DataFrame using its {{from_records}} alternate > constructor. This also uses a large amount of memory. > For flat (no nested types) schemas, the Apache Arrow memory layout > (https://github.com/apache/arrow/tree/master/format) can be deserialized to > {{pandas.DataFrame}} objects with comparatively small overhead compared with > memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, > replacing the corresponding null values with pandas's sentinel values (None > or NaN as appropriate). > I will be contributing patches to Arrow in the coming weeks for converting > between Arrow and pandas in the general case, so if Spark can send Arrow > memory to PySpark, we will hopefully be able to increase the Python data > access throughput by an order of magnitude or more. I propose to add a new > serializer for Spark DataFrame and a new method that can be invoked from > PySpark to request an Arrow memory-layout byte stream, prefixed by a data > header indicating array buffer offsets and sizes. 
[jira] [Commented] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas
[ https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966339#comment-15966339 ] holdenk commented on SPARK-13534: - I'm following along with the progress on this; I'll try to take a more thorough look this Thursday. > Implement Apache Arrow serializer for Spark DataFrame for use in > DataFrame.toPandas > --- > > Key: SPARK-13534 > URL: https://issues.apache.org/jira/browse/SPARK-13534 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Wes McKinney > Attachments: benchmark.py > > > The current code path for accessing Spark DataFrame data in Python using > PySpark passes through an inefficient serialization-deserialization process > that I've examined at a high level here: > https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] > objects are being deserialized in pure Python as a list of tuples, which are > then converted to pandas.DataFrame using its {{from_records}} alternate > constructor. This also uses a large amount of memory. > For flat (no nested types) schemas, the Apache Arrow memory layout > (https://github.com/apache/arrow/tree/master/format) can be deserialized to > {{pandas.DataFrame}} objects with comparatively small overhead compared with > memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, > replacing the corresponding null values with pandas's sentinel values (None > or NaN as appropriate). > I will be contributing patches to Arrow in the coming weeks for converting > between Arrow and pandas in the general case, so if Spark can send Arrow > memory to PySpark, we will hopefully be able to increase the Python data > access throughput by an order of magnitude or more. I propose to add a new > serializer for Spark DataFrame and a new method that can be invoked from > PySpark to request an Arrow memory-layout byte stream, prefixed by a data > header indicating array buffer offsets and sizes. 
[jira] [Resolved] (SPARK-20301) Flakiness in StreamingAggregationSuite
[ https://issues.apache.org/jira/browse/SPARK-20301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-20301. --- Resolution: Fixed Fix Version/s: 2.2.0 > Flakiness in StreamingAggregationSuite > -- > > Key: SPARK-20301 > URL: https://issues.apache.org/jira/browse/SPARK-20301 > Project: Spark > Issue Type: Test > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz > Labels: flaky-test > Fix For: 2.2.0 > > > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.streaming.StreamingAggregationSuite -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen
[ https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966300#comment-15966300 ] Tejas Patil commented on SPARK-20184: - Out of curiosity, I tried out a query with ~20 columns (similar to your repro) and ran with and without codegen. Input table was 20 TB with 320 billion rows. Codegen was 23% faster. In my case, the query ran for more than an hour. This was with the Spark 2.0 release and not trunk. When I tried the exact same repro from the jira on my box (this time with trunk), runs with and without codegen didn't show a difference. > performance regression for complex/long sql when enable whole stage codegen > --- > > Key: SPARK-20184 > URL: https://issues.apache.org/jira/browse/SPARK-20184 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0, 2.1.0 >Reporter: Fei Wang > > The performance of the following SQL gets much worse in Spark 2.x in contrast > with codegen off. > SELECT >sum(COUNTER_57) > ,sum(COUNTER_71) > ,sum(COUNTER_3) > ,sum(COUNTER_70) > ,sum(COUNTER_66) > ,sum(COUNTER_75) > ,sum(COUNTER_69) > ,sum(COUNTER_55) > ,sum(COUNTER_63) > ,sum(COUNTER_68) > ,sum(COUNTER_56) > ,sum(COUNTER_37) > ,sum(COUNTER_51) > ,sum(COUNTER_42) > ,sum(COUNTER_43) > ,sum(COUNTER_1) > ,sum(COUNTER_76) > ,sum(COUNTER_54) > ,sum(COUNTER_44) > ,sum(COUNTER_46) > ,DIM_1 > ,DIM_2 > ,DIM_3 > FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100; > Num of rows of aggtable is about 3500. > whole stage codegen on(spark.sql.codegen.wholeStage = true):40s > whole stage codegen off(spark.sql.codegen.wholeStage = false):6s > After some analysis I think this is related to the huge Java method (a method > of thousands of lines) generated by codegen. > And if I configure -XX:-DontCompileHugeMethods the performance gets much > better (about 7s). 
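The workaround mentioned above can be tried without code changes. A hedged sketch of the two options (the configuration keys come from the report itself; `app.jar` is a placeholder for the actual application):

```shell
# Option 1: disable whole-stage codegen for the affected job.
spark-submit --conf spark.sql.codegen.wholeStage=false app.jar

# Option 2: keep codegen on but allow the JVM to JIT-compile huge
# generated methods, as the reporter did with -XX:-DontCompileHugeMethods.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-XX:-DontCompileHugeMethods" \
  --conf "spark.executor.extraJavaOptions=-XX:-DontCompileHugeMethods" \
  app.jar
```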
[jira] [Resolved] (SPARK-19570) Allow to disable hive in pyspark shell
[ https://issues.apache.org/jira/browse/SPARK-19570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-19570. - Resolution: Fixed Fix Version/s: 2.2.0 > Allow to disable hive in pyspark shell > -- > > Key: SPARK-19570 > URL: https://issues.apache.org/jira/browse/SPARK-19570 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Jeff Zhang >Assignee: Jeff Zhang >Priority: Minor > Fix For: 2.2.0 > > > SPARK-15236 did this for the Scala shell; this ticket is for the PySpark shell. This > is not only for PySpark itself, but can also benefit downstream projects like > Livy, which uses shell.py for its interactive sessions. For now, Livy has no > control over whether Hive is enabled or not.
[jira] [Assigned] (SPARK-19570) Allow to disable hive in pyspark shell
[ https://issues.apache.org/jira/browse/SPARK-19570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk reassigned SPARK-19570: --- Assignee: Jeff Zhang > Allow to disable hive in pyspark shell > -- > > Key: SPARK-19570 > URL: https://issues.apache.org/jira/browse/SPARK-19570 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Jeff Zhang >Assignee: Jeff Zhang >Priority: Minor > Fix For: 2.2.0 > > > SPARK-15236 did this for the Scala shell; this ticket is for the PySpark shell. This > is not only for PySpark itself, but can also benefit downstream projects like > Livy, which uses shell.py for its interactive sessions. For now, Livy has no > control over whether Hive is enabled or not.
[jira] [Commented] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas
[ https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966231#comment-15966231 ] Jacques Nadeau commented on SPARK-13534: Anybody know some committers we can get to look at this? > Implement Apache Arrow serializer for Spark DataFrame for use in > DataFrame.toPandas > --- > > Key: SPARK-13534 > URL: https://issues.apache.org/jira/browse/SPARK-13534 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Wes McKinney > Attachments: benchmark.py > > > The current code path for accessing Spark DataFrame data in Python using > PySpark passes through an inefficient serialization-deserialization process > that I've examined at a high level here: > https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] > objects are being deserialized in pure Python as a list of tuples, which are > then converted to pandas.DataFrame using its {{from_records}} alternate > constructor. This also uses a large amount of memory. > For flat (no nested types) schemas, the Apache Arrow memory layout > (https://github.com/apache/arrow/tree/master/format) can be deserialized to > {{pandas.DataFrame}} objects with comparatively small overhead compared with > memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, > replacing the corresponding null values with pandas's sentinel values (None > or NaN as appropriate). > I will be contributing patches to Arrow in the coming weeks for converting > between Arrow and pandas in the general case, so if Spark can send Arrow > memory to PySpark, we will hopefully be able to increase the Python data > access throughput by an order of magnitude or more. I propose to add a new > serializer for Spark DataFrame and a new method that can be invoked from > PySpark to request an Arrow memory-layout byte stream, prefixed by a data > header indicating array buffer offsets and sizes. 
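The bitmask step described in the proposal (examining Arrow's validity bits and substituting null sentinels) can be illustrated with a toy model. This is not the real Arrow/pandas code path, just a sketch of the idea; Arrow packs validity bits LSB-first, one bit per value:

```python
def apply_validity_bitmask(values, bitmask_bytes):
    """Replace entries whose validity bit is 0 with None (a null sentinel).

    Bit i of bitmask_bytes[i // 8] (LSB-first) says whether values[i]
    is valid. Toy model of the Arrow-to-pandas conversion step.
    """
    out = []
    for i, v in enumerate(values):
        bit = (bitmask_bytes[i // 8] >> (i % 8)) & 1
        out.append(v if bit else None)
    return out
```

For example, with mask byte `0b00001011` only the third value is masked out.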
[jira] [Commented] (SPARK-19976) DirectStream API throws OffsetOutOfRange Exception
[ https://issues.apache.org/jira/browse/SPARK-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966208#comment-15966208 ] Cody Koeninger commented on SPARK-19976: What would your expected behavior be when you delete data out of kafka before the stream has read it? My expectation would be that it fails. > DirectStream API throws OffsetOutOfRange Exception > -- > > Key: SPARK-19976 > URL: https://issues.apache.org/jira/browse/SPARK-19976 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: Taukir > > I am using the following code. When data on the Kafka topic gets deleted or the > retention period is over, it throws an exception and the application crashes > def functionToCreateContext(sc:SparkContext):StreamingContext = { > val kafkaParams = new mutable.HashMap[String, Object]() > kafkaParams.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaBrokers) > kafkaParams.put(ConsumerConfig.GROUP_ID_CONFIG, KafkaConsumerGrp) > kafkaParams.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, > classOf[StringDeserializer]) > kafkaParams.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, > classOf[StringDeserializer]) > kafkaParams.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG,"true") > kafkaParams.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest") >val consumerStrategy = ConsumerStrategies.Subscribe[String, > String](topic.split(",").map(_.trim).filter(!_.isEmpty).toSet, kafkaParams) > val kafkaStream = > KafkaUtils.createDirectStream(ssc,LocationStrategies.PreferConsistent,consumerStrategy) > } > Spark throws an error and crashes once OffsetOutOfRangeException is thrown > WARN scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 : > org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of > range with no configured reset policy for partitions: {test-2=127287} > at > org.apache.kafka.clients.consumer.internals.Fetcher.parseFetchedData(Fetcher.java:588) > at > 
org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:354) > at > org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1000) > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:98) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
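If, contrary to the expectation that the stream should fail, one wanted it to skip over data that retention already deleted, the requested start offsets could be clamped to the range the broker still has before the stream is created. A toy sketch of that idea; this is a hypothetical helper, not part of the Spark or Kafka APIs:

```python
def clamp_offsets(requested, earliest, latest):
    """Clamp per-partition start offsets into [earliest, latest].

    `requested`, `earliest`, `latest` map partition id -> offset. Offsets
    below what retention kept are moved up to the earliest available one;
    offsets past the end are moved back to the latest.
    """
    clamped = {}
    for partition, offset in requested.items():
        lo, hi = earliest[partition], latest[partition]
        clamped[partition] = min(max(offset, lo), hi)
    return clamped
```

Whether silently skipping deleted data is acceptable is an application-level decision; the comment above argues that failing is the safer default.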
[jira] [Commented] (SPARK-15354) Topology aware block replication strategies
[ https://issues.apache.org/jira/browse/SPARK-15354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966203#comment-15966203 ] Apache Spark commented on SPARK-15354: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/17624 > Topology aware block replication strategies > --- > > Key: SPARK-15354 > URL: https://issues.apache.org/jira/browse/SPARK-15354 > Project: Spark > Issue Type: Sub-task > Components: Mesos, Spark Core, YARN >Reporter: Shubham Chopra >Assignee: Shubham Chopra > Fix For: 2.2.0 > > > Implementations of strategies for resilient block replication for different > resource managers that replicate the 3-replica strategy used by HDFS, where > the first replica is on an executor, the second replica within the same rack > as the executor and a third replica on a different rack. > The implementation involves providing two pluggable classes, one running in > the driver that provides topology information for every host at cluster start > and the second prioritizing a list of peer BlockManagerIds. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966201#comment-15966201 ] holdenk commented on SPARK-20202: - Oh right, sorry I was misreading the intent of Affects Version/s. > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley > > Spark can't continue to depend on their fork of Hive and must move to > standard Hive versions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20037) impossible to set kafka offsets using kafka 0.10 and spark 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-20037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966199#comment-15966199 ] Cody Koeninger commented on SPARK-20037: I'd be inclined to say this is a duplicate of the issue in SPARK-20036, and would start investigating the same way, i.e. remove the explicit dependencies on org.apache.kafka artifacts. > impossible to set kafka offsets using kafka 0.10 and spark 2.0.0 > > > Key: SPARK-20037 > URL: https://issues.apache.org/jira/browse/SPARK-20037 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Daniel Nuriyev > Attachments: Main.java, offsets.png > > > I use Kafka 0.10.1 and Java code with the following dependencies: > > org.apache.kafka > kafka_2.11 > 0.10.1.1 > > > org.apache.kafka > kafka-clients > 0.10.1.1 > > > org.apache.spark > spark-streaming_2.11 > 2.0.0 > > > org.apache.spark > spark-streaming-kafka-0-10_2.11 > 2.0.0 > > The code tries to read a topic starting from given offsets. > The topic has 4 partitions that start somewhere before 585000 and end after > 674000. 
So I wanted to read all partitions starting with 585000 > fromOffsets.put(new TopicPartition(topic, 0), 585000L); > fromOffsets.put(new TopicPartition(topic, 1), 585000L); > fromOffsets.put(new TopicPartition(topic, 2), 585000L); > fromOffsets.put(new TopicPartition(topic, 3), 585000L); > Using 5 second batches: > jssc = new JavaStreamingContext(conf, Durations.seconds(5)); > The code immediately throws: > Beginning offset 585000 is after the ending offset 584464 for topic > commerce_item_expectation partition 1 > It does not make sense because this topic/partition starts at 584464, not ends > I use this as a base: > https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > But I use direct stream: > KafkaUtils.createDirectStream(jssc,LocationStrategies.PreferConsistent(), > ConsumerStrategies.Subscribe( > topics, kafkaParams, fromOffsets > ) > ) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
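The "Beginning offset 585000 is after the ending offset 584464" error means the requested start offset for partition 1 lies past that partition's latest offset. One way to catch this before starting the stream is to validate the `fromOffsets` map against the broker's per-partition offset ranges. A hypothetical diagnostic sketch, not a Spark API:

```python
def validate_from_offsets(from_offsets, earliest, latest):
    """Return the partitions whose requested start offset falls outside
    the broker's [earliest, latest] range, mirroring the check behind the
    'Beginning offset X is after the ending offset Y' error.

    All three arguments map partition id -> offset; the result maps each
    bad partition to its valid (earliest, latest) range.
    """
    problems = {}
    for partition, start in from_offsets.items():
        if not earliest[partition] <= start <= latest[partition]:
            problems[partition] = (earliest[partition], latest[partition])
    return problems
```

Running this against the numbers in the report would flag partition 1, whose latest offset (584464) is below the requested 585000.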
[jira] [Commented] (SPARK-20036) impossible to read a whole kafka topic using kafka 0.10 and spark 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-20036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966193#comment-15966193 ] Cody Koeninger commented on SPARK-20036: fixKafkaParams is related to executor consumers, not the driver consumer. In a nutshell, the executor should read the offsets the driver tells it to, not auto reset. Remove the explicit dependencies in your pom on org.apache.kafka artifacts. spark-streaming-kafka-0-10 has appropriate transitive dependencies. > impossible to read a whole kafka topic using kafka 0.10 and spark 2.0.0 > > > Key: SPARK-20036 > URL: https://issues.apache.org/jira/browse/SPARK-20036 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Daniel Nuriyev > Attachments: Main.java, pom.xml > > > I use kafka 0.10.1 and java code with the following dependencies: > > org.apache.kafka > kafka_2.11 > 0.10.1.1 > > > org.apache.kafka > kafka-clients > 0.10.1.1 > > > org.apache.spark > spark-streaming_2.11 > 2.0.0 > > > org.apache.spark > spark-streaming-kafka-0-10_2.11 > 2.0.0 > > The code tries to read the whole topic using: > kafkaParams.put("auto.offset.reset", "earliest"); > Using 5 second batches: > jssc = new JavaStreamingContext(conf, Durations.seconds(5)); > Each batch returns empty. > I debugged the code I noticed that KafkaUtils.fixKafkaParams is called that > overrides "earliest" with "none". > Whether this is related or not, when I used kafka 0.8 on the client with > kafka 0.10.1 on the server, I could read the whole topic. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition
[ https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966180#comment-15966180 ] Cody Koeninger commented on SPARK-20287: The issue here is that the underlying new Kafka consumer api doesn't have a way for a single consumer to subscribe to multiple partitions, but only read a particular range of messages from one of them. The max capacity is just a simple way of dealing with what is basically an LRU cache - if someone creates topics dynamically and then stops sending messages to them, you don't want to keep leaking resources. I'm not claiming there's anything great or elegant about those solutions, but they were pretty much the most straightforward way to make the direct stream model work with the new kafka consumer api. > Kafka Consumer should be able to subscribe to more than one topic partition > --- > > Key: SPARK-20287 > URL: https://issues.apache.org/jira/browse/SPARK-20287 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Stephane Maarek > > As I understand and as it stands, one Kafka Consumer is created for each > topic partition in the source Kafka topics, and they're cached. > cf > https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48 > In my opinion, that makes the design an anti-pattern for Kafka and highly > inefficient: > - Each Kafka consumer creates a connection to Kafka > - Spark doesn't leverage the power of the Kafka consumers, which is that it > automatically assigns and balances partitions amongst all the consumers that > share the same group.id > - You can still cache your Kafka consumer even if it has multiple partitions. 
> I'm not sure about how that translates to the spark underlying RDD > architecture, but from a Kafka standpoint, I believe creating one consumer > per partition is a big overhead, and a risk as the user may have to increase > the spark.streaming.kafka.consumer.cache.maxCapacity parameter. > Happy to discuss to understand the rationale -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
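The bounded, LRU-evicting consumer cache described in the comment can be sketched in a few lines. This is a toy model of the idea that `spark.streaming.kafka.consumer.cache.maxCapacity` bounds, not Spark's actual `CachedKafkaConsumer` code; the "consumer" is just a placeholder string where real code would hold a Kafka consumer object:

```python
from collections import OrderedDict

class ConsumerCache:
    """Per-topic-partition consumer cache with LRU eviction at a fixed
    capacity, so topics that go idle don't leak connections forever."""

    def __init__(self, max_capacity):
        self.max_capacity = max_capacity
        self._cache = OrderedDict()

    def get(self, topic_partition):
        if topic_partition in self._cache:
            # Cache hit: mark as most recently used.
            self._cache.move_to_end(topic_partition)
        else:
            # Cache miss: create a (placeholder) consumer, then evict the
            # least recently used entry if we are over capacity.
            self._cache[topic_partition] = f"consumer-for-{topic_partition}"
            if len(self._cache) > self.max_capacity:
                self._cache.popitem(last=False)
        return self._cache[topic_partition]
```

With capacity 2, touching partitions t-0, t-1, t-0, t-2 in order evicts t-1, the least recently used entry.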
[jira] [Updated] (SPARK-20291) NaNvl(FloatType, NullType) should not be cast to NaNvl(DoubleType, DoubleType)
[ https://issues.apache.org/jira/browse/SPARK-20291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20291: Fix Version/s: 2.0.3 > NaNvl(FloatType, NullType) should not be cast to NaNvl(DoubleType, > DoubleType) > --- > > Key: SPARK-20291 > URL: https://issues.apache.org/jira/browse/SPARK-20291 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: DB Tsai >Assignee: DB Tsai > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > `NaNvl(float value, null)` will be converted into `NaNvl(float value, > Cast(null, DoubleType))` and finally `NaNvl(Cast(float value, DoubleType), > Cast(null, DoubleType))`. > This will cause mismatching in the output type when the input type is float. > By adding extra rule in TypeCoercion can resolve this issue. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
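For readers unfamiliar with the expression, `NaNvl(a, b)` returns `a` unless it is NaN, in which case it returns `b`. A plain-Python sketch of that behaviour (not Spark code, and Python floats don't model the float-vs-double distinction the ticket is about):

```python
import math

def nanvl(a, b):
    """Return a unless it is NaN, else b; a null first argument
    propagates as null (None)."""
    if a is not None and math.isnan(a):
        return b
    return a
```

The bug was in type coercion around this expression: a null second argument was cast to DoubleType, dragging a FloatType first argument along with it and changing the output type.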
[jira] [Updated] (SPARK-20313) Possible lack of join optimization when partitions are in the join condition
[ https://issues.apache.org/jira/browse/SPARK-20313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Albert Meltzer updated SPARK-20313: --- Description: Given two tables T1 and T2, partitioned on column part1, the following have vastly different execution performance: // initial, slow {noformat} val df1 = // load data from T1 .filter(functions.col("part1").between("val1", "val2") val df2 = // load data from T2 .filter(functions.col("part1").between("val1", "val2") val df3 = df1.join(df2, Seq("part1", "col1")) {noformat} // manually optimized, considerably faster {noformat} val df1 = // load data from T1 val df2 = // load data from T2 val part1values = Seq(...) // a collection of values between val1 and val2 val df3 = part1values .map(part1value => { val df1filtered = df1.filter(functions.col("part1") === part1value) val df2filtered = df2.filter(functions.col("part1") === part1value) df1filtered.join(df2filtered, Seq("col1")) // part1 removed from join }) .reduce(_ union _) {noformat} was: Given two tables T1 and T2, partitioned on column part1, the following have vastly different execution performance: // initial, slow {noformat} val df1 = // load data from T1 .filter(functions.col("part1").between("val1", "val2") val df2 = // load data from T2 .filter(functions.col("part1").between("val1", "val2") val df3 = df1.join(df2, Seq("part1", "col1")) {noformat} // manually optimized, considerably faster {noformat} val df1 = // load data from T1 val df2 = // load data from T2 val part1values = Seq(...) 
// a collection of values between val1 and val2 val df3 = part1values .map(part1value => { val df3 = df1.join(df2, Seq("part1", "col1")) val df1filtered = df1.filter(functions.col("part1") === part1value) val df2filtered = df2.filter(functions.col("part1") === part1value) df1filtered.join(df2filtered, Seq("col1")) // part1 removed from join }) .reduce(_ union _) {noformat} > Possible lack of join optimization when partitions are in the join condition > > > Key: SPARK-20313 > URL: https://issues.apache.org/jira/browse/SPARK-20313 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Albert Meltzer > > Given two tables T1 and T2, partitioned on column part1, the following have > vastly different execution performance: > // initial, slow > {noformat} > val df1 = // load data from T1 > .filter(functions.col("part1").between("val1", "val2") > val df2 = // load data from T2 > .filter(functions.col("part1").between("val1", "val2") > val df3 = df1.join(df2, Seq("part1", "col1")) > {noformat} > // manually optimized, considerably faster > {noformat} > val df1 = // load data from T1 > val df2 = // load data from T2 > val part1values = Seq(...) // a collection of values between val1 and val2 > val df3 = part1values > .map(part1value => { > val df1filtered = df1.filter(functions.col("part1") === part1value) > val df2filtered = df2.filter(functions.col("part1") === part1value) > df1filtered.join(df2filtered, Seq("col1")) // part1 removed from join > }) > .reduce(_ union _) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
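The manual optimization in the description (restrict both sides to one partition value at a time, join on the remaining key, then union) can be modeled on plain Python rows to show that it produces the same pairs as the direct join on both keys. Rows are dicts here, a toy stand-in for DataFrames:

```python
def join_on(rows1, rows2, keys):
    """Tiny nested-loop equi-join on the named keys (stand-in for
    DataFrame.join)."""
    return [(r1, r2)
            for r1 in rows1
            for r2 in rows2
            if all(r1[k] == r2[k] for k in keys)]

def partitionwise_join(rows1, rows2, part_key, join_key, part_values):
    """Model of the manually optimized plan above: filter both sides to
    one partition value, join on the remaining key only, then union."""
    result = []
    for v in part_values:
        f1 = [r for r in rows1 if r[part_key] == v]
        f2 = [r for r in rows2 if r[part_key] == v]
        result.extend(join_on(f1, f2, [join_key]))
    return result
```

The ticket's point is that the optimizer could derive the second, partition-pruned plan from the first automatically when the join condition includes the partition column.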
[jira] [Created] (SPARK-20313) Possible lack of join optimization when partitions are in the join condition
Albert Meltzer created SPARK-20313: -- Summary: Possible lack of join optimization when partitions are in the join condition Key: SPARK-20313 URL: https://issues.apache.org/jira/browse/SPARK-20313 Project: Spark Issue Type: Improvement Components: Optimizer Affects Versions: 2.1.0 Reporter: Albert Meltzer Given two tables T1 and T2, partitioned on column part1, the following have vastly different execution performance: // initial {noformat} val df1 = // load data from T1 .filter(functions.col("part1").between("val1", "val2") val df2 = // load data from T2 .filter(functions.col("part1").between("val1", "val2") val df3 = df1.join(df2, Seq("part1", "col1")) {noformat} // manually optimized {noformat} val df1 = // load data from T1 val df2 = // load data from T2 val part1values = Seq(...) // a collection of values between val1 and val2 val df3 = part1values .map(part1value => { val df3 = df1.join(df2, Seq("part1", "col1")) val df1filtered = df1.filter(functions.col("part1") === part1value) val df2filtered = df2.filter(functions.col("part1") === part1value) df1filtered.join(df2filtered, Seq("col1")) // part1 removed from join }) .reduce(_ union _) {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20313) Possible lack of join optimization when partitions are in the join condition
[ https://issues.apache.org/jira/browse/SPARK-20313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Albert Meltzer updated SPARK-20313: --- Description: Given two tables T1 and T2, partitioned on column part1, the following have vastly different execution performance: // initial, slow {noformat} val df1 = // load data from T1 .filter(functions.col("part1").between("val1", "val2") val df2 = // load data from T2 .filter(functions.col("part1").between("val1", "val2") val df3 = df1.join(df2, Seq("part1", "col1")) {noformat} // manually optimized, considerably faster {noformat} val df1 = // load data from T1 val df2 = // load data from T2 val part1values = Seq(...) // a collection of values between val1 and val2 val df3 = part1values .map(part1value => { val df3 = df1.join(df2, Seq("part1", "col1")) val df1filtered = df1.filter(functions.col("part1") === part1value) val df2filtered = df2.filter(functions.col("part1") === part1value) df1filtered.join(df2filtered, Seq("col1")) // part1 removed from join }) .reduce(_ union _) {noformat} was: Given two tables T1 and T2, partitioned on column part1, the following have vastly different execution performance: // initial {noformat} val df1 = // load data from T1 .filter(functions.col("part1").between("val1", "val2") val df2 = // load data from T2 .filter(functions.col("part1").between("val1", "val2") val df3 = df1.join(df2, Seq("part1", "col1")) {noformat} // manually optimized {noformat} val df1 = // load data from T1 val df2 = // load data from T2 val part1values = Seq(...) 
// a collection of values between val1 and val2 val df3 = part1values .map(part1value => { val df3 = df1.join(df2, Seq("part1", "col1")) val df1filtered = df1.filter(functions.col("part1") === part1value) val df2filtered = df2.filter(functions.col("part1") === part1value) df1filtered.join(df2filtered, Seq("col1")) // part1 removed from join }) .reduce(_ union _) {noformat} > Possible lack of join optimization when partitions are in the join condition > > > Key: SPARK-20313 > URL: https://issues.apache.org/jira/browse/SPARK-20313 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Albert Meltzer > > Given two tables T1 and T2, partitioned on column part1, the following have > vastly different execution performance: > // initial, slow > {noformat} > val df1 = // load data from T1 > .filter(functions.col("part1").between("val1", "val2") > val df2 = // load data from T2 > .filter(functions.col("part1").between("val1", "val2") > val df3 = df1.join(df2, Seq("part1", "col1")) > {noformat} > // manually optimized, considerably faster > {noformat} > val df1 = // load data from T1 > val df2 = // load data from T2 > val part1values = Seq(...) // a collection of values between val1 and val2 > val df3 = part1values > .map(part1value => { > val df3 = df1.join(df2, Seq("part1", "col1")) > val df1filtered = df1.filter(functions.col("part1") === part1value) > val df2filtered = df2.filter(functions.col("part1") === part1value) > df1filtered.join(df2filtered, Seq("col1")) // part1 removed from join > }) > .reduce(_ union _) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20304) AssertNotNull should not include path in string representation
[ https://issues.apache.org/jira/browse/SPARK-20304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20304. - Resolution: Fixed Fix Version/s: 2.2.0 2.1.2 > AssertNotNull should not include path in string representation > -- > > Key: SPARK-20304 > URL: https://issues.apache.org/jira/browse/SPARK-20304 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.1.2, 2.2.0 > > > AssertNotNull's toString/simpleString dumps the entire walkedTypePath. > walkedTypePath is used for error message reporting and shouldn't be part of > the output. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20303) Rename createTempFunction to registerFunction
[ https://issues.apache.org/jira/browse/SPARK-20303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20303. - Resolution: Fixed Fix Version/s: 2.2.0 > Rename createTempFunction to registerFunction > - > > Key: SPARK-20303 > URL: https://issues.apache.org/jira/browse/SPARK-20303 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.2.0 > > > Session catalog API `createTempFunction` is being used by Hive build-in > functions, persistent functions, and temporary functions. Thus, the name is > confusing. This PR is to replace it by `registerFunction`. Also we can move > construction of `FunctionBuilder` and `ExpressionInfo` into the new > `registerFunction`, instead of duplicating the logics everywhere. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20312) query optimizer calls udf with null values when it doesn't expect them
[ https://issues.apache.org/jira/browse/SPARK-20312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Albert Meltzer updated SPARK-20312: --- Description: When optimizing an outer join, spark passes an empty row to both sides to see if nulls would be ignored (side comment: for half-outer joins it subsequently ignores the assessment on the dominant side). For some reason, a condition such as {{xx IS NOT NULL && udf(xx) IS NOT NULL}} might result in checking the right side first, and an exception if the udf doesn't expect a null input (given the left side first). A example is SIMILAR to the following (see actual query plans separately): {noformat} def func(value: Any): Int = ... // return AnyVal which probably causes a IS NOT NULL added filter on the result val df1 = sparkSession .table(...) .select("col1", "col2") // LongType both val df11 = df1 .filter(df1("col1").isNotNull) .withColumn("col3", functions.udf(func)(df1("col1")) .repartition(df1("col2")) .sortWithinPartitions(df1("col2")) val df2 = ... // load other data containing col2, similarly repartition and sort val df3 = df1.join(df2, Seq("col2"), "left_outer") {noformat} was: When optimizing an outer join, spark passes an empty row to both sides to see if nulls would be ignored (side comment: for half-outer joins it subsequently ignores the assessment on the dominant side). For some reason, a condition such as "x IS NOT NULL && udf(x) IS NOT NULL" might result in checking the right side first, and an exception if the udf doesn't expect a null input (given the left side first). A example is SIMILAR to the following (see actual query plans separately): def func(value: Any): Int = ... // return AnyVal which probably causes a IS NOT NULL added filter on the result val df1 = sparkSession .table(...) 
.select("col1", "col2") // LongType both val df11 = df1 .filter(df1("col1").isNotNull) .withColumn("col3", functions.udf(func)(df1("col1")) .repartition(df1("col2")) .sortWithinPartitions(df1("col2")) val df2 = ... // load other data containing col2, similarly repartition and sort val df3 = df1.join(df2, Seq("col2"), "left_outer") > query optimizer calls udf with null values when it doesn't expect them > -- > > Key: SPARK-20312 > URL: https://issues.apache.org/jira/browse/SPARK-20312 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Albert Meltzer > > When optimizing an outer join, spark passes an empty row to both sides to see > if nulls would be ignored (side comment: for half-outer joins it subsequently > ignores the assessment on the dominant side). > For some reason, a condition such as {{xx IS NOT NULL && udf(xx) IS NOT > NULL}} might result in checking the right side first, and an exception if the > udf doesn't expect a null input (given the left side first). > A example is SIMILAR to the following (see actual query plans separately): > {noformat} > def func(value: Any): Int = ... // return AnyVal which probably causes a IS > NOT NULL added filter on the result > val df1 = sparkSession > .table(...) > .select("col1", "col2") // LongType both > val df11 = df1 > .filter(df1("col1").isNotNull) > .withColumn("col3", functions.udf(func)(df1("col1")) > .repartition(df1("col2")) > .sortWithinPartitions(df1("col2")) > val df2 = ... // load other data containing col2, similarly repartition and > sort > val df3 = > df1.join(df2, Seq("col2"), "left_outer") > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
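Until the optimizer issue is addressed, a common defensive pattern is to make the UDF itself tolerate null input rather than relying on an `IS NOT NULL` filter being evaluated first. A minimal sketch of such a wrapper; this is a hypothetical helper, not a Spark API:

```python
def null_safe(udf):
    """Wrap a single-argument UDF so that a None input yields None
    instead of raising, guarding against the optimizer probing the UDF
    with a null before the IS NOT NULL filter runs."""
    def wrapped(value):
        return None if value is None else udf(value)
    return wrapped
```

The same idea applies in Scala by pattern-matching on null (or using Option) inside the function passed to `functions.udf`.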
[jira] [Commented] (SPARK-20312) query optimizer calls udf with null values when it doesn't expect them
[ https://issues.apache.org/jira/browse/SPARK-20312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966094#comment-15966094 ] Albert Meltzer commented on SPARK-20312: During query optimization, one of the subtrees becomes: {noformat} Project [col4#0L, col3#5L, col2#10L, col1#11L, 2016-11-11 AS version#59, col5#45] +- Repartition 8, true +- Project [col4#0L, col3#5L, col2#10L, col1#11L, UDF(col1#11L) AS col5#45] +- Filter isnotnull(UDF(col1#11L)) +- Join LeftOuter, (col2#10L = col2#6L) :- Sort [col2#10L ASC NULLS FIRST], false : +- RepartitionByExpression [col2#10L] : +- Filter (isnotnull(col2#10L) && isnotnull(col1#11L)) :+- LocalRelation [col2#10L, col1#11L] +- Project [col4#0L, col3#5L, col2#6L] +- Join LeftOuter, (col3#5L = col3#1L) :- Sort [col3#5L ASC NULLS FIRST], false : +- RepartitionByExpression [col3#5L] : +- Filter (isnotnull(col3#5L) && isnotnull(col2#6L)) :+- LocalRelation [col3#5L, col2#6L] +- Sort [col3#1L ASC NULLS FIRST], false +- RepartitionByExpression [col3#1L] +- Filter (isnotnull(col4#0L) && isnotnull(col3#1L)) +- LocalRelation [col4#0L, col3#1L] {noformat} And that causes evaluation of the UDF in the {{org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin}} class, before the filter of the parameter value. > query optimizer calls udf with null values when it doesn't expect them > -- > > Key: SPARK-20312 > URL: https://issues.apache.org/jira/browse/SPARK-20312 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Albert Meltzer > > When optimizing an outer join, spark passes an empty row to both sides to see > if nulls would be ignored (side comment: for half-outer joins it subsequently > ignores the assessment on the dominant side). > For some reason, a condition such as "x IS NOT NULL && udf(x) IS NOT NULL" > might result in checking the right side first, and an exception if the udf > doesn't expect a null input (given the left side first). 
> An example is SIMILAR to the following (see actual query plans separately): > def func(value: Any): Int = ... // return AnyVal which probably causes an IS > NOT NULL added filter on the result > val df1 = sparkSession > .table(...) > .select("col1", "col2") // LongType both > val df11 = df1 > .filter(df1("col1").isNotNull) > .withColumn("col3", functions.udf(func)(df1("col1"))) > .repartition(df1("col2")) > .sortWithinPartitions(df1("col2")) > val df2 = ... // load other data containing col2, similarly repartition and > sort > val df3 = > df11.join(df2, Seq("col2"), "left_outer") -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
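A common mitigation for the behavior reported above (a sketch, not a fix for the optimizer itself; `safeFunc`'s body is hypothetical, the surrounding names come from the repro) is to make the UDF null-tolerant and nullable-returning, so the empty-row probe in {{EliminateOuterJoin}} cannot throw regardless of filter ordering:

```scala
import org.apache.spark.sql.functions

// Hypothetical null-tolerant variant of func: returning Option[Int]
// instead of Int makes the result column nullable, so a null fed in by
// the optimizer's empty-row probe yields None instead of an exception.
def safeFunc(value: java.lang.Long): Option[Int] =
  Option(value).map(v => v.intValue)  // placeholder logic

val safeUdf = functions.udf(safeFunc _)

val df11 = df1
  .filter(df1("col1").isNotNull)
  .withColumn("col3", safeUdf(df1("col1")))
  .repartition(df1("col2"))
  .sortWithinPartitions(df1("col2"))
```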
[jira] [Commented] (SPARK-20312) query optimizer calls udf with null values when it doesn't expect them
[ https://issues.apache.org/jira/browse/SPARK-20312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966091#comment-15966091 ] Albert Meltzer commented on SPARK-20312: Query plans are as follows: {noformat} == Parsed Logical Plan == 'InsertIntoTable 'UnresolvedRelation `database`.`table`, OverwriteOptions(true,Map()), false +- Filter (((isnotnull(col2#10L) && isnotnull(col1#11L)) && isnotnull(version#59)) && isnotnull(col5#45)) +- Project [col4#0L, col3#5L, col2#10L, col1#11L, version#59, col5#45] +- Project [col4#0L, col3#5L, col2#10L, col1#11L, col5#45, 2016-11-11 AS version#59] +- Repartition 8, true +- Project [col4#0L, col3#5L, col2#10L, col1#11L, UDF(col1#11L) AS col5#45] +- Project [col4#0L, col3#5L, col2#10L, col1#11L] +- Project [col2#10L, col1#11L, col4#0L, col3#5L] +- Join LeftOuter, (col2#10L = col2#6L) :- Sort [col2#10L ASC NULLS FIRST], false : +- RepartitionByExpression [col2#10L] : +- Filter isnotnull(col1#11L) :+- Filter isnotnull(col2#10L) : +- LocalRelation [col2#10L, col1#11L] +- Project [col4#0L, col3#5L, col2#6L] +- Project [col3#5L, col2#6L, col4#0L] +- Join LeftOuter, (col3#5L = col3#1L) :- Sort [col3#5L ASC NULLS FIRST], false : +- RepartitionByExpression [col3#5L] : +- Filter isnotnull(col2#6L) :+- Filter isnotnull(col3#5L) : +- LocalRelation [col3#5L, col2#6L] +- Sort [col3#1L ASC NULLS FIRST], false +- RepartitionByExpression [col3#1L] +- Filter isnotnull(col3#1L) +- Filter isnotnull(col4#0L) +- LocalRelation [col4#0L, col3#1L] == Analyzed Logical Plan == InsertIntoTable MetastoreRelation database, table, Map(version -> None, col5 -> None), OverwriteOptions(true,Map()), false +- Filter (((isnotnull(col2#10L) && isnotnull(col1#11L)) && isnotnull(version#59)) && isnotnull(col5#45)) +- Project [col4#0L, col3#5L, col2#10L, col1#11L, version#59, col5#45] +- Project [col4#0L, col3#5L, col2#10L, col1#11L, col5#45, 2016-11-11 AS version#59] +- Repartition 8, true +- Project [col4#0L, col3#5L, col2#10L, col1#11L, 
UDF(col1#11L) AS col5#45] +- Project [col4#0L, col3#5L, col2#10L, col1#11L] +- Project [col2#10L, col1#11L, col4#0L, col3#5L] +- Join LeftOuter, (col2#10L = col2#6L) :- Sort [col2#10L ASC NULLS FIRST], false : +- RepartitionByExpression [col2#10L] : +- Filter isnotnull(col1#11L) :+- Filter isnotnull(col2#10L) : +- LocalRelation [col2#10L, col1#11L] +- Project [col4#0L, col3#5L, col2#6L] +- Project [col3#5L, col2#6L, col4#0L] +- Join LeftOuter, (col3#5L = col3#1L) :- Sort [col3#5L ASC NULLS FIRST], false : +- RepartitionByExpression [col3#5L] : +- Filter isnotnull(col2#6L) :+- Filter isnotnull(col3#5L) : +- LocalRelation [col3#5L, col2#6L] +- Sort [col3#1L ASC NULLS FIRST], false +- RepartitionByExpression [col3#1L] +- Filter isnotnull(col3#1L) +- Filter isnotnull(col4#0L) +- LocalRelation [col4#0L, col3#1L] == Optimized Logical Plan == InsertIntoTable MetastoreRelation database, table, Map(version -> None, col5 -> None), OverwriteOptions(true,Map()), false +- Project [col4#0L, col3#5L, col2#10L, col1#11L, 2016-11-11 AS version#59, col5#45] +- Repartition 8, true +- Project [col4#0L, col3#5L, col2#10L, col1#11L, UDF(col1#11L) AS col5#45] +- Join LeftOuter, (col2#10L = col2#6L) :- Sort [col2#10L ASC NULLS FIRST], false : +- RepartitionByExpression [col2#10L] : +- Filter ((isnotnull(col2#10L) && isnotnull(col1#11L)) && isnotnull(UDF(col1#11L))) :+- LocalRelation [col2#10L, col1#11L] +- Project [col4#0L, col3#5L, col2#6L] +- Join
[jira] [Created] (SPARK-20312) query optimizer calls udf with null values when it doesn't expect them
Albert Meltzer created SPARK-20312: -- Summary: query optimizer calls udf with null values when it doesn't expect them Key: SPARK-20312 URL: https://issues.apache.org/jira/browse/SPARK-20312 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: Albert Meltzer When optimizing an outer join, spark passes an empty row to both sides to see if nulls would be ignored (side comment: for half-outer joins it subsequently ignores the assessment on the dominant side). For some reason, a condition such as "x IS NOT NULL && udf(x) IS NOT NULL" might result in checking the right side first, and an exception if the udf doesn't expect a null input (given the left side first). An example is SIMILAR to the following (see actual query plans separately): def func(value: Any): Int = ... // return AnyVal which probably causes an IS NOT NULL added filter on the result val df1 = sparkSession .table(...) .select("col1", "col2") // LongType both val df11 = df1 .filter(df1("col1").isNotNull) .withColumn("col3", functions.udf(func)(df1("col1"))) .repartition(df1("col2")) .sortWithinPartitions(df1("col2")) val df2 = ... // load other data containing col2, similarly repartition and sort val df3 = df11.join(df2, Seq("col2"), "left_outer") -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20292) string representation of TreeNode is messy
[ https://issues.apache.org/jira/browse/SPARK-20292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966083#comment-15966083 ] Apache Spark commented on SPARK-20292: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/17623 > string representation of TreeNode is messy > -- > > Key: SPARK-20292 > URL: https://issues.apache.org/jira/browse/SPARK-20292 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan > > Currently we have a lot of string representations for QueryPlan/Expression: > toString, simpleString, verboseString, treeString, sql, etc. > The logic between them is not very clear and as a result, > {{Expression.treeString}} is mostly unreadable as it contains a lot > duplicated information. > We should clean it up -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20292) string representation of TreeNode is messy
[ https://issues.apache.org/jira/browse/SPARK-20292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20292: Assignee: Apache Spark > string representation of TreeNode is messy > -- > > Key: SPARK-20292 > URL: https://issues.apache.org/jira/browse/SPARK-20292 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > > Currently we have a lot of string representations for QueryPlan/Expression: > toString, simpleString, verboseString, treeString, sql, etc. > The logic between them is not very clear and as a result, > {{Expression.treeString}} is mostly unreadable as it contains a lot > duplicated information. > We should clean it up -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20292) string representation of TreeNode is messy
[ https://issues.apache.org/jira/browse/SPARK-20292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20292: Assignee: (was: Apache Spark) > string representation of TreeNode is messy > -- > > Key: SPARK-20292 > URL: https://issues.apache.org/jira/browse/SPARK-20292 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan > > Currently we have a lot of string representations for QueryPlan/Expression: > toString, simpleString, verboseString, treeString, sql, etc. > The logic between them is not very clear and as a result, > {{Expression.treeString}} is mostly unreadable as it contains a lot > duplicated information. > We should clean it up -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20300) Python API for ALSModel.recommendForAllUsers,Items
[ https://issues.apache.org/jira/browse/SPARK-20300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966079#comment-15966079 ] Apache Spark commented on SPARK-20300: -- User 'MLnick' has created a pull request for this issue: https://github.com/apache/spark/pull/17622 > Python API for ALSModel.recommendForAllUsers,Items > -- > > Key: SPARK-20300 > URL: https://issues.apache.org/jira/browse/SPARK-20300 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley > > Python API for ALSModel methods recommendForAllUsers, recommendForAllItems -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20300) Python API for ALSModel.recommendForAllUsers,Items
[ https://issues.apache.org/jira/browse/SPARK-20300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20300: Assignee: (was: Apache Spark) > Python API for ALSModel.recommendForAllUsers,Items > -- > > Key: SPARK-20300 > URL: https://issues.apache.org/jira/browse/SPARK-20300 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley > > Python API for ALSModel methods recommendForAllUsers, recommendForAllItems -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20300) Python API for ALSModel.recommendForAllUsers,Items
[ https://issues.apache.org/jira/browse/SPARK-20300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20300: Assignee: Apache Spark > Python API for ALSModel.recommendForAllUsers,Items > -- > > Key: SPARK-20300 > URL: https://issues.apache.org/jira/browse/SPARK-20300 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > Python API for ALSModel methods recommendForAllUsers, recommendForAllItems -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work
Juliusz Sompolski created SPARK-20311: - Summary: SQL "range(N) as alias" or "range(N) alias" doesn't work Key: SPARK-20311 URL: https://issues.apache.org/jira/browse/SPARK-20311 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Juliusz Sompolski Priority: Minor `select * from range(10) as A;` or `select * from range(10) A;` does not work. As a workaround, a subquery has to be used: `select * from (select * from range(10)) as A;` -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19547) KafkaUtil throw 'No current assignment for partition' Exception
[ https://issues.apache.org/jira/browse/SPARK-19547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965922#comment-15965922 ] Omaiyma Popat commented on SPARK-19547: --- Hi Team, Can you please advise a resolution on the above issue. I am facing the same error for: spark-streaming-kafka-0-10_2.11:2.0.1:jar val kafkaParams = Map[String, Object]( ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> prop.getProperty("kafka.broker.list"), ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer], ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer], ConsumerConfig.GROUP_ID_CONFIG -> prop.getProperty("kafka.group.id"), ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> prop.getProperty("kafka.auto.offset.reset"), ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> (true: java.lang.Boolean) ) val topics = Array(prop.getProperty("kafka.topic")) val stream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String]( ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams)) > KafkaUtil throw 'No current assignment for partition' Exception > --- > > Key: SPARK-19547 > URL: https://issues.apache.org/jira/browse/SPARK-19547 > Project: Spark > Issue Type: Question > Components: DStreams >Affects Versions: 1.6.1 >Reporter: wuchang > > Below is my scala code to create spark kafka stream: > val kafkaParams = Map[String, Object]( > "bootstrap.servers" -> "server110:2181,server110:9092", > "zookeeper" -> "server110:2181", > "key.deserializer" -> classOf[StringDeserializer], > "value.deserializer" -> classOf[StringDeserializer], > "group.id" -> "example", > "auto.offset.reset" -> "latest", > "enable.auto.commit" -> (false: java.lang.Boolean) > ) > val topics = Array("ABTest") > val stream = KafkaUtils.createDirectStream[String, String]( > ssc, > PreferConsistent, > Subscribe[String, String](topics, kafkaParams) > ) > But after run for 10 hours, it throws exceptions: > 2017-02-10 10:56:20,000 INFO 
[JobGenerator] internals.ConsumerCoordinator: > Revoking previously assigned partitions [ABTest-0, ABTest-1] for group example > 2017-02-10 10:56:20,000 INFO [JobGenerator] internals.AbstractCoordinator: > (Re-)joining group example > 2017-02-10 10:56:20,011 INFO [JobGenerator] internals.AbstractCoordinator: > (Re-)joining group example > 2017-02-10 10:56:40,057 INFO [JobGenerator] internals.AbstractCoordinator: > Successfully joined group example with generation 5 > 2017-02-10 10:56:40,058 INFO [JobGenerator] internals.ConsumerCoordinator: > Setting newly assigned partitions [ABTest-1] for group example > 2017-02-10 10:56:40,080 ERROR [JobScheduler] scheduler.JobScheduler: Error > generating jobs for time 148669538 ms > java.lang.IllegalStateException: No current assignment for partition ABTest-0 > at > org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231) > at > org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:179) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:196) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > at > 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) > at >
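For the 0-10 direct stream quoted in the comment above, two things stand out: the earlier reporter's {{bootstrap.servers}} includes a ZooKeeper port (2181), which the new consumer does not use, and offsets are auto-committed. A commonly suggested setup (a sketch, not a confirmed fix for the rebalance error in this ticket) keeps auto-commit off and commits offsets through the stream only after each batch succeeds:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "server110:9092",           // Kafka brokers only; no ZooKeeper address
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "example",
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)  // the application owns commits
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Array("ABTest"), kafkaParams))

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd ...
  // Commit only after the batch succeeds, through this same stream object.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```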
[jira] [Commented] (SPARK-19451) Long values in Window function
[ https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965915#comment-15965915 ] Julien Champ commented on SPARK-19451: -- Any news on this bug / feature request ? > Long values in Window function > -- > > Key: SPARK-19451 > URL: https://issues.apache.org/jira/browse/SPARK-19451 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.2 >Reporter: Julien Champ > > Hi there, > there seems to be a major limitation in spark window functions and > rangeBetween method. > If I have the following code : > {code:title=Example|borderStyle=solid} > val tw = Window.orderBy("date") > .partitionBy("id") > .rangeBetween( from , 0) > {code} > Everything seems ok, while the *from* value is not too large... Even if the > rangeBetween() method supports Long parameters. > But if I set *-216000L* value to *from* it does not work ! > It is probably related to this part of code in the between() method, of the > WindowSpec class, called by rangeBetween() > {code:title=between() method|borderStyle=solid} > val boundaryStart = start match { > case 0 => CurrentRow > case Long.MinValue => UnboundedPreceding > case x if x < 0 => ValuePreceding(-start.toInt) > case x if x > 0 => ValueFollowing(start.toInt) > } > {code} > ( look at this *.toInt* ) > Does anybody know if there's a way to solve / patch this behavior ? > Any help will be appreciated > Thx -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
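The narrowing that the reporter points at can be demonstrated without Spark at all (the value below is illustrative, chosen to fall outside Int range):

```scala
// WindowSpec.between() narrows the Long frame bound with .toInt.
val from: Long = -2160000000L    // legal as a Long rangeBetween bound, below Int.MinValue
val narrowed: Int = from.toInt   // wraps around to 2134967296
// A large "preceding" bound silently becomes a huge positive number, so
// the frame the window computes bears no relation to what was requested.
```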
[jira] [Commented] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965822#comment-15965822 ] Paul Zaczkieiwcz commented on SPARK-18055: -- Actually it seems it was this issue. I forgot to mention that the Aggregator returns an Array[case class] which I would then like to flatMap. I get the stack trace on the {{stitcher.toColumn}} line rather than the {{flatMap}} line. I've since patched Spark with this PR to work around this issue. {code:java} case class StitchedVisitor(cookie_id:java.math.BigDecimal, visit_num:Int, ...) case class CookieId(cookie_id:java.math.BigDecimal) val stitcher: Aggregator[StitchedVisitor, Array[Option[StitchedVisitor]], Array[StitchedVisitor]] = ... df.groupByKey(s => CookieId(s.cookie_id) ).agg(stitcher.toColumn ).flatMap(agg => agg._2) {code} > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu >Assignee: Michael Armbrust > Fix For: 2.0.3, 2.1.1, 2.2.0 > > Attachments: test-jar_2.11-1.0.jar > > > Try to apply flatMap() on a Dataset column which is of type > com.A.B > Here's a schema of the dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesn't work on the Dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at > org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) > {code} > Spoke to Michael Armbrust and he confirmed it as a Dataset bug. > There is a workaround using explode() > {code} > ds.select(explode(col("outputs"))) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20310) Dependency convergence error for scala-xml
[ https://issues.apache.org/jira/browse/SPARK-20310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20310: -- Issue Type: Improvement (was: Bug) You may have a slightly earlier convergence problem in your build ... it looks like you're on Scala 2.11.0 instead of 2.11.8 like Spark. Can you manage that up? Maybe that ends up agreeing on scala-xml 1.0.2, though, I doubt it actually causes a problem in any event. > Dependency convergence error for scala-xml > -- > > Key: SPARK-20310 > URL: https://issues.apache.org/jira/browse/SPARK-20310 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.1.0 >Reporter: Samik R >Priority: Minor > > Hi, > I am trying to compile a package (apache tinkerpop) which has spark-core as > one of the dependencies. I am trying to compile with v2.1.0. But when I run > maven build through a dependency checker, it is showing a dependency error > within the spark-core itself for scala-xml package, as below: > Dependency convergence error for org.scala-lang.modules:scala-xml_2.11:1.0.1 > paths to dependency are: > +-org.apache.tinkerpop:spark-gremlin:3.2.3 > +-org.apache.spark:spark-core_2.11:2.1.0 > +-org.json4s:json4s-jackson_2.11:3.2.11 > +-org.json4s:json4s-core_2.11:3.2.11 > +-org.scala-lang:scalap:2.11.0 > +-org.scala-lang:scala-compiler:2.11.0 > +-org.scala-lang.modules:scala-xml_2.11:1.0.1 > and > +-org.apache.tinkerpop:spark-gremlin:3.2.3 > +-org.apache.spark:spark-core_2.11:2.1.0 > +-org.apache.spark:spark-tags_2.11:2.1.0 > +-org.scalatest:scalatest_2.11:2.2.6 > +-org.scala-lang.modules:scala-xml_2.11:1.0.2 > Can this be fixed? > Thanks. > -Samik -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
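On the consuming project's side (the tinkerpop build, not Spark's), one way to satisfy an enforcer dependencyConvergence rule is to pin a single version in `dependencyManagement`. This is a sketch using the coordinates from the report above, not a change endorsed in this issue:

```xml
<dependencyManagement>
  <dependencies>
    <!-- Pin scala-xml so 1.0.1 (pulled via scala-compiler 2.11.0) and
         1.0.2 (pulled via scalatest 2.2.6) converge on one version. -->
    <dependency>
      <groupId>org.scala-lang.modules</groupId>
      <artifactId>scala-xml_2.11</artifactId>
      <version>1.0.2</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```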
[jira] [Created] (SPARK-20310) Dependency convergence error for scala-xml
Samik R created SPARK-20310: --- Summary: Dependency convergence error for scala-xml Key: SPARK-20310 URL: https://issues.apache.org/jira/browse/SPARK-20310 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.1.0 Reporter: Samik R Priority: Minor Hi, I am trying to compile a package (apache tinkerpop) which has spark-core as one of the dependencies. I am trying to compile with v2.1.0. But when I run maven build through a dependency checker, it is showing a dependency error within the spark-core itself for scala-xml package, as below: Dependency convergence error for org.scala-lang.modules:scala-xml_2.11:1.0.1 paths to dependency are: +-org.apache.tinkerpop:spark-gremlin:3.2.3 +-org.apache.spark:spark-core_2.11:2.1.0 +-org.json4s:json4s-jackson_2.11:3.2.11 +-org.json4s:json4s-core_2.11:3.2.11 +-org.scala-lang:scalap:2.11.0 +-org.scala-lang:scala-compiler:2.11.0 +-org.scala-lang.modules:scala-xml_2.11:1.0.1 and +-org.apache.tinkerpop:spark-gremlin:3.2.3 +-org.apache.spark:spark-core_2.11:2.1.0 +-org.apache.spark:spark-tags_2.11:2.1.0 +-org.scalatest:scalatest_2.11:2.2.6 +-org.scala-lang.modules:scala-xml_2.11:1.0.2 Can this be fixed? Thanks. -Samik -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-20306) beeline connect spark thrift server failure
[ https://issues.apache.org/jira/browse/SPARK-20306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-20306: --- > beeline connect spark thrift server failure > > > Key: SPARK-20306 > URL: https://issues.apache.org/jira/browse/SPARK-20306 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.2 >Reporter: sydt > > Beeline fails to connect to the spark thrift server of spark-1.6.2 with kerberos; the > error is: > Error occurred during processing of message. > java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: > Unsupported mechanism type GSSAPI > The connect command is : > !connect > jdbc:hive2://10.142.78.249:10010/default;principal=sparksql/nm-304-sa5212m4-bigdata-...@hadoop.sandbox.chinatelecom.cn > . However, it worked well before. > When my hadoop cluster is equipped with federation, it cannot work well -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20306) beeline connect spark thrift server failure
[ https://issues.apache.org/jira/browse/SPARK-20306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20306. --- Resolution: Not A Problem > beeline connect spark thrift server failure > > > Key: SPARK-20306 > URL: https://issues.apache.org/jira/browse/SPARK-20306 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.2 >Reporter: sydt > > Beeline fails to connect to the spark thrift server of spark-1.6.2 with kerberos; the > error is: > Error occurred during processing of message. > java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: > Unsupported mechanism type GSSAPI > The connect command is : > !connect > jdbc:hive2://10.142.78.249:10010/default;principal=sparksql/nm-304-sa5212m4-bigdata-...@hadoop.sandbox.chinatelecom.cn > . However, it worked well before. > When my hadoop cluster is equipped with federation, it cannot work well -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20309) Repartitioning - more than the default number of partitions
[ https://issues.apache.org/jira/browse/SPARK-20309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20309. --- Resolution: Not A Problem > Repartitioning - more than the default number of partitions > --- > > Key: SPARK-20309 > URL: https://issues.apache.org/jira/browse/SPARK-20309 > Project: Spark > Issue Type: Bug > Components: Java API, SQL >Affects Versions: 1.6.1 > Environment: Spark 1.6.1, Hadoop 2.6.2, Redhat Linux >Reporter: balaji krishnan > > We have a requirement to process roughly 45MM rows. Each row has a KEY > column. The unique number of KEYS are roughly 25000. I do the following Spark > SQL to get all 25000 partitions > hiveContext.sql("select * from BigTbl distribute by KEY") > This nicely creates all required partitions. Using MapPartitions we process > all of these partitions in parallel. I have a variable which should be > initialized once per partition. Let's call that variable designName. > This variable is initialized with the KEY value (partition value) for future > identification purposes. > FlatMapFunction<java.util.Iterator<Row>, Row> flatMapSetup = new > FlatMapFunction<java.util.Iterator<Row>, Row>() { > String designName = null; > @Override > public java.lang.Iterable<Row> call(java.util.Iterator<Row> it) > throws Exception { > while (it.hasNext()) { > Row row = it.next(); //row is of type org.apache.spark.sql.Row > if (designName == null) { > designName = row.getString(1); // row is of type org.apache.spark.sql.Row > } > . > } > List<Row> Dsg = partitionedRecs.mapPartitions(flatMapSetup).collect(); > The problem starts here for me. > Since I have more than the default number of partitions (200) I do a > re-partition to the unique number of partitions in my BigTbl. However if I > re-partition the designName gets completely confused. Instead of processing a > total of 25000 unique values, I get only 15000 values processed. For some > reason the repartition completely messes up the uniqueness of the previous > step. 
The statement that i use to 'distribute by' and subsequently repartition > long distinctValues = hiveContext.sql("select KEY from > BigTbl").distinct().count(); > JavaRDD<Row> partitionedRecs = hiveContext.sql("select * from BigTbl > DISTRIBUTE by SKU ").repartition((int)distinctValues).toJavaRDD().cache(); > I tried another solution, that is by changing the > spark.sql.shuffle.partitions during SparkContext creation. Even that did not > help. I get the same issue. > new JavaSparkContext(new > SparkConf().set("spark.driver.maxResultSize","0").set("spark.sql.shuffle.partitions","26000")) > Is there a way to solve this issue please or is this a DISTRIBUTE BY bug ? We > are using Spark 1.6.1 > Regards > Bala -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
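The behavior described in the ticket is expected: `repartition(n)` performs a fresh round-robin shuffle, discarding the per-KEY layout that DISTRIBUTE BY produced. A sketch of the usual alternative, assuming the Spark 1.6 DataFrame API where `repartition(numPartitions, partitionExprs*)` is available (written in Scala for brevity; the ticket's code is Java):

```scala
import org.apache.spark.sql.functions.col

// Hash-partition by the KEY column instead of round-robin, so every row
// of a given KEY lands in exactly one partition. Note that several KEYs
// may still share a partition, so per-partition code must tolerate
// seeing more than one designName.
val distinctValues = hiveContext.sql("select KEY from BigTbl").distinct().count()
val partitionedRecs = hiveContext.table("BigTbl")
  .repartition(distinctValues.toInt, col("KEY"))
  .toJavaRDD.cache()
```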
[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen
[ https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965735#comment-15965735 ] Fei Wang commented on SPARK-20184: -- I also used the master branch to test my test case: 1. Java version 192:spark wangfei$ java -version java version "1.8.0_65" Java(TM) SE Runtime Environment (build 1.8.0_65-b17) Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode) 2. Spark starting cmd 192:spark wangfei$ bin/spark-sql --master local[4] --driver-memory 16g 3. test result: sql: {code} select dim_1, dim_2, sum(c1), sum(c2), sum(c3), sum(c4), sum(c5), sum(c6), sum(c7), sum(c8), sum(c9), sum(c10), sum(c11), sum(c12), sum(c13), sum(c14), sum(c15), sum(c16), sum(c17), sum(c18), sum(c19), sum(c20) from sum_table_50w_3 group by dim_1, dim_2 limit 100; {code} codegen on: about 1.4s codegen off: about 0.6s > performance regression for complex/long sql when enable whole stage codegen > --- > > Key: SPARK-20184 > URL: https://issues.apache.org/jira/browse/SPARK-20184 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0, 2.1.0 >Reporter: Fei Wang > > The performance of the following SQL gets much worse in spark 2.x in contrast > with codegen off. > SELECT >sum(COUNTER_57) > ,sum(COUNTER_71) > ,sum(COUNTER_3) > ,sum(COUNTER_70) > ,sum(COUNTER_66) > ,sum(COUNTER_75) > ,sum(COUNTER_69) > ,sum(COUNTER_55) > ,sum(COUNTER_63) > ,sum(COUNTER_68) > ,sum(COUNTER_56) > ,sum(COUNTER_37) > ,sum(COUNTER_51) > ,sum(COUNTER_42) > ,sum(COUNTER_43) > ,sum(COUNTER_1) > ,sum(COUNTER_76) > ,sum(COUNTER_54) > ,sum(COUNTER_44) > ,sum(COUNTER_46) > ,DIM_1 > ,DIM_2 > ,DIM_3 > FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100; > Num of rows of aggtable is about 3500. 
> whole stage codegen on(spark.sql.codegen.wholeStage = true):40s > whole stage codegen off(spark.sql.codegen.wholeStage = false):6s > After some analysis i think this is related to the huge java method(a java > method of thousand lines) which generated by codegen. > And If i config -XX:-DontCompileHugeMethods the performance get much > better(about 7s). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
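The {{-XX:-DontCompileHugeMethods}} experiment mentioned in the description can be reproduced by passing the flag through the Spark JVM options (a sketch; in local mode only the driver JVM matters, the executor line is shown for cluster runs):

```
# Disable HotSpot's refusal to JIT-compile methods over 8000 bytecodes,
# so the huge generated whole-stage-codegen method still gets compiled.
bin/spark-sql --master local[4] --driver-memory 16g \
  --conf "spark.driver.extraJavaOptions=-XX:-DontCompileHugeMethods" \
  --conf "spark.executor.extraJavaOptions=-XX:-DontCompileHugeMethods"
```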
[jira] [Closed] (SPARK-20306) beeline connect spark thrift server failure
[ https://issues.apache.org/jira/browse/SPARK-20306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sydt closed SPARK-20306. Resolution: Fixed > beeline connect spark thrift server failure > > > Key: SPARK-20306 > URL: https://issues.apache.org/jira/browse/SPARK-20306 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.2 >Reporter: sydt > > Beeline fails to connect to the Spark Thrift Server of spark-1.6.2 with Kerberos; the error is: > Error occurred during processing of message. > java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: > Unsupported mechanism type GSSAPI > The connect command is: > !connect > jdbc:hive2://10.142.78.249:10010/default;principal=sparksql/nm-304-sa5212m4-bigdata-...@hadoop.sandbox.chinatelecom.cn > However, it worked before. > When my Hadoop cluster is configured with federation, it does not work well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20309) Repartitioning - more than the default number of partitions
[ https://issues.apache.org/jira/browse/SPARK-20309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965715#comment-15965715 ] balaji krishnan commented on SPARK-20309: - Thanks, Sean. I will send that as an email to u...@spark.apache.org. I need to repartition in order to process all partitions, since the number of partitions in my case is higher than the default (200). I will try to articulate it better when sending it to that address. Thanks for the quick response. Regards Bala > Repartitioning - more than the default number of partitions > --- > > Key: SPARK-20309 > URL: https://issues.apache.org/jira/browse/SPARK-20309 > Project: Spark > Issue Type: Bug > Components: Java API, SQL >Affects Versions: 1.6.1 > Environment: Spark 1.6.1, Hadoop 2.6.2, Redhat Linux >Reporter: balaji krishnan > > We have a requirement to process roughly 45MM rows. Each row has a KEY > column. The unique number of KEYS are roughly 25000. I do the following Spark > SQL to get all 25000 partitions > hiveContext.sql("select * from BigTbl distribute by KEY") > This nicely creates all required partitions. Using MapPartitions we process > all of these partitions in parallel. I have a variable which should be > initiated once per partition. Lets call that variable name as designName. > This variable is initialized with the KEY value (partition value) for future > identification purposes. > FlatMapFunctionflatMapSetup = new > FlatMapFunction () { > String designName= null; > @Override > public java.lang.Iterable call(java.util.Iterator it) > throws Exception { > while (it.hasNext()) { > Row row = it.next(); //row is of type org.apache.spark.sql.Row > if (designName == null) { > designName = row.getString(1); // row is of type org.apache.spark.sql.Row > } > . > } > List Dsg = partitionedRecs.mapPartitions(flatMapSetup).collect(); > The problem starts here for me.
> Since i have more than the default number of partitions (200) i do a > re-partition to the unique number of partitions in my BigTbl. However if i > re-partition the designName gets completely confused. Instead of processing a > total of 25000 unique values, i get only 15000 values processed. For some > reason the repartition completely messes up the uniqueness of the previous > step. The statement that i use to 'distribute by' and subsequently repartition > long distinctValues = hiveContext.sql("select KEY from > BigTbl").distinct().count(); > JavaRDD partitionedRecs = hiveContext.sql("select * from BigTbl > DISTRIBUTE by SKU ").repartition((int)distinctValues).toJavaRDD().cache(); > I tried another solution, that is by changing the > spark.sql.shuffle.partitions during SparkContext creation. Even that did not > help. I get the same issue. > new JavaSparkContext(new > SparkConf().set("spark.driver.maxResultSize","0").set("spark.sql.shuffle.partitions","26000")) > Is there a way to solve this issue please or is this a DISTRIBUTE BY bug ? We > are using Spark 1.6.1 > Regards > Bala -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20309) Repartitioning - more than the default number of partitions
[ https://issues.apache.org/jira/browse/SPARK-20309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965714#comment-15965714 ] Sean Owen commented on SPARK-20309: --- It's not clear what you're asking, but this should go to u...@spark.apache.org, not JIRA. Repartitioning isn't going to preserve some partition invariant from the original partitioning, in general. It doesn't sound like a problem. > Repartitioning - more than the default number of partitions > --- > > Key: SPARK-20309 > URL: https://issues.apache.org/jira/browse/SPARK-20309 > Project: Spark > Issue Type: Bug > Components: Java API, SQL >Affects Versions: 1.6.1 > Environment: Spark 1.6.1, Hadoop 2.6.2, Redhat Linux >Reporter: balaji krishnan > > We have a requirement to process roughly 45MM rows. Each row has a KEY > column. The unique number of KEYS are roughly 25000. I do the following Spark > SQL to get all 25000 partitions > hiveContext.sql("select * from BigTbl distribute by KEY") > This nicely creates all required partitions. Using MapPartitions we process > all of these partitions in parallel. I have a variable which should be > initiated once per partition. Lets call that variable name as designName. > This variable is initialized with the KEY value (partition value) for future > identification purposes. > FlatMapFunctionflatMapSetup = new > FlatMapFunction () { > String designName= null; > @Override > public java.lang.Iterable call(java.util.Iterator it) > throws Exception { > while (it.hasNext()) { > Row row = it.next(); //row is of type org.apache.spark.sql.Row > if (designName == null) { > designName = row.getString(1); // row is of type org.apache.spark.sql.Row > } > . > } > List Dsg = partitionedRecs.mapPartitions(flatMapSetup).collect(); > The problem starts here for me. > Since i have more than the default number of partitions (200) i do a > re-partition to the unique number of partitions in my BigTbl. 
However if i > re-partition the designName gets completely confused. Instead of processing a > total of 25000 unique values, i get only 15000 values processed. For some > reason the repartition completely messes up the uniqueness of the previous > step. The statement that i use to 'distribute by' and subsequently repartition > long distinctValues = hiveContext.sql("select KEY from > BigTbl").distinct().count(); > JavaRDD partitionedRecs = hiveContext.sql("select * from BigTbl > DISTRIBUTE by SKU ").repartition((int)distinctValues).toJavaRDD().cache(); > I tried another solution, that is by changing the > spark.sql.shuffle.partitions during SparkContext creation. Even that did not > help. I get the same issue. > new JavaSparkContext(new > SparkConf().set("spark.driver.maxResultSize","0").set("spark.sql.shuffle.partitions","26000")) > Is there a way to solve this issue please or is this a DISTRIBUTE BY bug ? We > are using Spark 1.6.1 > Regards > Bala -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20309) Repartitioning - more than the default number of partitions
balaji krishnan created SPARK-20309: --- Summary: Repartitioning - more than the default number of partitions Key: SPARK-20309 URL: https://issues.apache.org/jira/browse/SPARK-20309 Project: Spark Issue Type: Bug Components: Java API, SQL Affects Versions: 1.6.1 Environment: Spark 1.6.1, Hadoop 2.6.2, Redhat Linux Reporter: balaji krishnan We have a requirement to process roughly 45MM rows. Each row has a KEY column. The number of unique KEYs is roughly 25000. I run the following Spark SQL to get all 25000 partitions: hiveContext.sql("select * from BigTbl distribute by KEY") This nicely creates all required partitions. Using mapPartitions we process all of these partitions in parallel. I have a variable which should be initialized once per partition; let us call it designName. It is initialized with the KEY value (partition value) for future identification purposes. FlatMapFunction<java.util.Iterator<Row>, String> flatMapSetup = new FlatMapFunction<java.util.Iterator<Row>, String>() { String designName = null; @Override public java.lang.Iterable<String> call(java.util.Iterator<Row> it) throws Exception { while (it.hasNext()) { Row row = it.next(); // row is of type org.apache.spark.sql.Row if (designName == null) { designName = row.getString(1); } . } List<String> Dsg = partitionedRecs.mapPartitions(flatMapSetup).collect(); The problem starts here for me.
The statements I use to 'distribute by' and subsequently repartition: long distinctValues = hiveContext.sql("select KEY from BigTbl").distinct().count(); JavaRDD<Row> partitionedRecs = hiveContext.sql("select * from BigTbl DISTRIBUTE by SKU ").repartition((int) distinctValues).toJavaRDD().cache(); Since I have more than the default number of partitions (200), I repartition to the number of unique keys in BigTbl. However, after the repartition, designName gets completely confused. Instead of processing a total of 25000 unique values, only 15000 values get processed. For some reason the repartition completely destroys the uniqueness established by the previous step. I tried another solution, changing spark.sql.shuffle.partitions during SparkContext creation. Even that did not help; I get the same issue. new JavaSparkContext(new SparkConf().set("spark.driver.maxResultSize","0").set("spark.sql.shuffle.partitions","26000")) Is there a way to solve this issue, or is this a DISTRIBUTE BY bug? We are using Spark 1.6.1. Regards Bala -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
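The behavior Sean describes in his reply can be illustrated without Spark at all. The pure-Python sketch below is not Spark's shuffle code; it just contrasts hash partitioning (what "distribute by KEY" does: every row of a key lands in one partition) with a round-robin-style redistribution (what a plain `repartition(n)` does: rows are dealt out without looking at KEY), which is why per-partition state like `designName` stops being well defined after the repartition:

```python
# Pure-Python illustration (no Spark involved) of why repartition()
# breaks per-partition state that "distribute by KEY" set up.
from collections import defaultdict

rows = [(f"key{k}", v) for k in range(4) for v in range(3)]  # 4 keys, 3 rows each

def hash_partition(rows, n):
    """Analogue of 'distribute by KEY': partition = hash(key) % n."""
    parts = [[] for _ in range(n)]
    for key, val in rows:
        parts[hash(key) % n].append((key, val))
    return parts

def round_robin(rows, n):
    """Analogue of repartition(n): rows are dealt out ignoring KEY."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def partitions_per_key(parts):
    """Which partitions each key's rows ended up in."""
    seen = defaultdict(set)
    for p, part in enumerate(parts):
        for key, _ in part:
            seen[key].add(p)
    return seen

# After hash partitioning, each key lives in exactly one partition,
# so a once-per-partition variable like designName is well defined...
assert all(len(ps) == 1 for ps in partitions_per_key(hash_partition(rows, 4)).values())
# ...after a round-robin reshuffle, a key's rows span several partitions.
assert any(len(ps) > 1 for ps in partitions_per_key(round_robin(rows, 4)).values())
```

The practical takeaway matches the comment thread: any operation that must see all rows of a key in one partition has to be the last partitioning step, not followed by a generic `repartition()`.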
[jira] [Resolved] (SPARK-20308) org.apache.spark.shuffle.FetchFailedException: Too large frame
[ https://issues.apache.org/jira/browse/SPARK-20308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artur Sukhenko resolved SPARK-20308. Resolution: Duplicate > org.apache.spark.shuffle.FetchFailedException: Too large frame > -- > > Key: SPARK-20308 > URL: https://issues.apache.org/jira/browse/SPARK-20308 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: Stanislav Chernichkin > > Spark uses custom frame decoder (TransportFrameDecoder) which does not > support frames larger than 2G. This lead to fails when shuffling using large > partitions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
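As background on where the ~2 GB ceiling in this report comes from: Java arrays and Netty buffers are indexed by signed 32-bit ints, so a single buffered frame cannot exceed 2^31 - 1 bytes. A tiny sketch of that arithmetic (illustration only, not Spark code):

```python
# A signed 32-bit length/index field tops out at 2**31 - 1 (~2 GiB);
# anything larger is unrepresentable, which is the ceiling a single
# shuffle frame runs into.
import struct

MAX_FRAME = 2**31 - 1
packed = struct.pack(">i", MAX_FRAME)   # 2 GiB - 1 bytes: representable
assert len(packed) == 4

try:
    struct.pack(">i", 2**31)            # 2 GiB: overflows the signed field
    overflowed = False
except struct.error:
    overflowed = True
assert overflowed
```

Workarounds in practice are therefore about producing smaller shuffle blocks (more partitions), not about raising the frame size.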
[jira] [Assigned] (SPARK-18692) Test Java 8 unidoc build on Jenkins master builder
[ https://issues.apache.org/jira/browse/SPARK-18692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-18692: - Assignee: Hyukjin Kwon > Test Java 8 unidoc build on Jenkins master builder > -- > > Key: SPARK-18692 > URL: https://issues.apache.org/jira/browse/SPARK-18692 > Project: Spark > Issue Type: Test > Components: Build, Documentation >Reporter: Joseph K. Bradley >Assignee: Hyukjin Kwon > Labels: jenkins > Fix For: 2.2.0 > > > [SPARK-3359] fixed the unidoc build for Java 8, but it is easy to break. It > would be great to add this build to the Spark master builder on Jenkins to > make it easier to identify PRs which break doc builds. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18692) Test Java 8 unidoc build on Jenkins master builder
[ https://issues.apache.org/jira/browse/SPARK-18692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18692. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17477 [https://github.com/apache/spark/pull/17477] > Test Java 8 unidoc build on Jenkins master builder > -- > > Key: SPARK-18692 > URL: https://issues.apache.org/jira/browse/SPARK-18692 > Project: Spark > Issue Type: Test > Components: Build, Documentation >Reporter: Joseph K. Bradley > Labels: jenkins > Fix For: 2.2.0 > > > [SPARK-3359] fixed the unidoc build for Java 8, but it is easy to break. It > would be great to add this build to the Spark master builder on Jenkins to > make it easier to identify PRs which break doc builds. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen
[ https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965670#comment-15965670 ] Herman van Hovell commented on SPARK-20184: --- I just tried your example using the master branch, both take about 0.9 seconds on my local machine. So I am not sure what the problem is... Which version of Spark are you using? What JVM are you running on? How are you starting spark? > performance regression for complex/long sql when enable whole stage codegen > --- > > Key: SPARK-20184 > URL: https://issues.apache.org/jira/browse/SPARK-20184 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0, 2.1.0 >Reporter: Fei Wang > > The performance of following SQL get much worse in spark 2.x in contrast > with codegen off. > SELECT >sum(COUNTER_57) > ,sum(COUNTER_71) > ,sum(COUNTER_3) > ,sum(COUNTER_70) > ,sum(COUNTER_66) > ,sum(COUNTER_75) > ,sum(COUNTER_69) > ,sum(COUNTER_55) > ,sum(COUNTER_63) > ,sum(COUNTER_68) > ,sum(COUNTER_56) > ,sum(COUNTER_37) > ,sum(COUNTER_51) > ,sum(COUNTER_42) > ,sum(COUNTER_43) > ,sum(COUNTER_1) > ,sum(COUNTER_76) > ,sum(COUNTER_54) > ,sum(COUNTER_44) > ,sum(COUNTER_46) > ,DIM_1 > ,DIM_2 > ,DIM_3 > FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100; > Num of rows of aggtable is about 3500. > whole stage codegen on(spark.sql.codegen.wholeStage = true):40s > whole stage codegen off(spark.sql.codegen.wholeStage = false):6s > After some analysis i think this is related to the huge java method(a java > method of thousand lines) which generated by codegen. > And If i config -XX:-DontCompileHugeMethods the performance get much > better(about 7s). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18791) Stream-Stream Joins
[ https://issues.apache.org/jira/browse/SPARK-18791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965657#comment-15965657 ] xianyao jiang commented on SPARK-18791: --- When will this feature be provided? Is there any plan for it? We want to use Structured Streaming, but we need the stream-stream join functionality. > Stream-Stream Joins > --- > > Key: SPARK-18791 > URL: https://issues.apache.org/jira/browse/SPARK-18791 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Michael Armbrust > > Just a placeholder for now. Please comment with your requirements. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20306) beeline connect spark thrift server failure
[ https://issues.apache.org/jira/browse/SPARK-20306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965654#comment-15965654 ] Sean Owen commented on SPARK-20306: --- This is some env or config issue, not Spark. > beeline connect spark thrift server failure > > > Key: SPARK-20306 > URL: https://issues.apache.org/jira/browse/SPARK-20306 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.2 >Reporter: sydt > > Beeline connect spark thrift server of spark-1.6.2 with kerberos failure and > error is : > Error occurred during processing of message. > java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: > Unsupported mechanism type GSSAPI > The connect command is : > !connect > jdbc:hive2://10.142.78.249:10010/default;principal=sparksql/nm-304-sa5212m4-bigdata-...@hadoop.sandbox.chinatelecom.cn > . However, it works well before. > When my hadoop cluster equipped with federation ,it can not work well -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20296) UnsupportedOperationChecker text on distinct aggregations differs from docs
[ https://issues.apache.org/jira/browse/SPARK-20296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20296: - Assignee: Jason Tokayer > UnsupportedOperationChecker text on distinct aggregations differs from docs > --- > > Key: SPARK-20296 > URL: https://issues.apache.org/jira/browse/SPARK-20296 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Jason Tokayer >Assignee: Jason Tokayer >Priority: Trivial > Fix For: 2.1.2, 2.2.0 > > > In the unsupported operations section in the docs > https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html > it states that "Distinct operations on streaming Datasets are not > supported.". However, in > ```org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker.scala```, > the error message is ```Distinct aggregations are not supported on streaming > DataFrames/Datasets, unless it is on aggregated DataFrame/Dataset in Complete > output mode. Consider using approximate distinct aggregation```. > It seems that the error message is incorrect. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20296) UnsupportedOperationChecker text on distinct aggregations differs from docs
[ https://issues.apache.org/jira/browse/SPARK-20296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20296. --- Resolution: Fixed Fix Version/s: 2.1.2 2.2.0 Issue resolved by pull request 17609 [https://github.com/apache/spark/pull/17609] > UnsupportedOperationChecker text on distinct aggregations differs from docs > --- > > Key: SPARK-20296 > URL: https://issues.apache.org/jira/browse/SPARK-20296 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Jason Tokayer >Priority: Trivial > Fix For: 2.2.0, 2.1.2 > > > In the unsupported operations section in the docs > https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html > it states that "Distinct operations on streaming Datasets are not > supported.". However, in > ```org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker.scala```, > the error message is ```Distinct aggregations are not supported on streaming > DataFrames/Datasets, unless it is on aggregated DataFrame/Dataset in Complete > output mode. Consider using approximate distinct aggregation```. > It seems that the error message is incorrect. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20308) org.apache.spark.shuffle.FetchFailedException: Too large frame
Stanislav Chernichkin created SPARK-20308: - Summary: org.apache.spark.shuffle.FetchFailedException: Too large frame Key: SPARK-20308 URL: https://issues.apache.org/jira/browse/SPARK-20308 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 2.1.0 Reporter: Stanislav Chernichkin Spark uses a custom frame decoder (TransportFrameDecoder) which does not support frames larger than 2 GB. This leads to failures when shuffling large partitions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Paramenter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965647#comment-15965647 ] pralabhkumar commented on SPARK-20199: -- GBM uses Random Forest internally, and to add randomness to the trees there should be a featureSubsetStrategy parameter. featureSubsetStrategy should not be hardcoded for GBM. > GradientBoostedTreesModel doesn't have Column Sampling Rate Paramenter > --- > > Key: SPARK-20199 > URL: https://issues.apache.org/jira/browse/SPARK-20199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: pralabhkumar > > Spark GradientBoostedTreesModel doesn't have Column sampling rate parameter > . This parameter is available in H2O and XGBoost. > Sample from H2O.ai > gbmParams._col_sample_rate > Please provide the parameter . -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anne Rutten updated SPARK-20307: Description: when training a model in SparkR with string variables (tested with spark.randomForest, but i assume is valid for all spark.xx functions that apply a StringIndexer under the hood), testing on a new dataset with factor levels that are not in the training set will throw an "Unseen label" error. I think this can be solved if there's a method to pass setHandleInvalid on to the StringIndexers when calling spark.randomForest. code snippet: {code} # (i've run this in Zeppelin which already has SparkR and the context loaded) #library(SparkR) #sparkR.session(master = "local[*]") data = data.frame(clicked = base::sample(c(0,1),100,replace=TRUE), someString = base::sample(c("this", "that"), 100, replace=TRUE), stringsAsFactors=FALSE) trainidxs = base::sample(nrow(data), nrow(data)*0.7) traindf = as.DataFrame(data[trainidxs,]) testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other"))) rf = spark.randomForest(traindf, clicked~., type="classification", maxDepth=10, maxBins=41, numTrees = 100) predictions = predict(rf, testdf) SparkR::collect(predictions) {code} stack trace: {quote} Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most recent failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$4: (string) => double) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.spark.SparkException: Unseen label: the other. at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170) at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166) ... 
16 more Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931) at
[jira] [Updated] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anne Rutten updated SPARK-20307: Description: when training a model in SparkR with string variables (tested with spark.randomForest, but i assume is valid for all spark.xx functions that apply a StringIndexer under the hood), testing on a new dataset with factor levels that are not in the training set will throw an "Unseen label" error. I think this can be solved if there's a method to pass setHandleInvalid on to the StringIndexers when calling spark.randomForest. code snippet: {code} # (i've run this in Zeppelin which already has SparkR and the context loaded) #library(SparkR) #sparkR.session(master = "local[*]") data = data.frame(clicked = sample(c(0,1),100,replace=TRUE), someString = sample(c("this", "that"), 100, replace=TRUE), stringsAsFactors=FALSE) trainidxs = base::sample(nrow(data), nrow(data)*0.7) traindf = as.DataFrame(data[trainidxs,]) testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other"))) rf = spark.randomForest(traindf, clicked~., type="classification", maxDepth=10, maxBins=41, numTrees = 100) predictions = predict(rf, testdf, allow.new.labels=TRUE) SparkR::collect(predictions) {code} stack trace: {quote} Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most recent failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$4: (string) => double) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.spark.SparkException: Unseen label: the other. at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170) at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166) ... 
16 more Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931) at
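For readers hitting the same "Unseen label" failure: what this issue asks for is a way to reach StringIndexer's handleInvalid parameter from the SparkR wrappers. A pure-Python sketch (not the Spark API) of the semantics at stake; at the time of this issue (Spark 2.1) the valid settings are "error" (the default, which raises exactly this failure) and "skip" (which drops rows whose label was never seen during fitting):

```python
# Sketch of StringIndexer-style label indexing with a handleInvalid knob.
from collections import Counter

def fit_index(labels):
    """Map each distinct training label to an index, most frequent first
    (mirroring StringIndexer's frequency ordering)."""
    return {lab: i for i, (lab, _) in enumerate(Counter(labels).most_common())}

def transform(index, labels, handle_invalid="error"):
    """Apply a fitted index; handle_invalid decides the fate of labels
    never seen during fitting."""
    out = []
    for lab in labels:
        if lab in index:
            out.append(index[lab])
        elif handle_invalid == "skip":
            continue                                  # drop the row, like "skip"
        else:
            raise ValueError(f"Unseen label: {lab}")  # the "error" default
    return out

idx = fit_index(["this", "that", "this"])
# with "skip", the row carrying the unseen level "the other" is dropped
assert transform(idx, ["this", "the other"], handle_invalid="skip") == [idx["this"]]
```

With "skip" semantics exposed through the R wrappers, the `predict(rf, testdf)` call in the snippet above would drop the `"the other"` row instead of aborting the job.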
[jira] [Updated] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Paramenter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-20199: - Priority: Major (was: Minor) > GradientBoostedTreesModel doesn't have Column Sampling Rate Paramenter > --- > > Key: SPARK-20199 > URL: https://issues.apache.org/jira/browse/SPARK-20199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: pralabhkumar > > Spark GradientBoostedTreesModel doesn't have Column sampling rate parameter > . This parameter is available in H2O and XGBoost. > Sample from H2O.ai > gbmParams._col_sample_rate > Please provide the parameter . -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anne Rutten updated SPARK-20307: Priority: Minor (was: Major) > SparkR: pass on setHandleInvalid to spark.mllib functions that use > StringIndexer > > > Key: SPARK-20307 > URL: https://issues.apache.org/jira/browse/SPARK-20307 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Anne Rutten >Priority: Minor > > When training a model in SparkR with string variables (tested with > spark.randomForest, but I assume this is valid for all spark.xx functions that > apply a StringIndexer under the hood), testing on a new dataset > may throw an "Unseen label" error when not all factor levels are present in > the training set. I think this can be solved if there's a method to pass > setHandleInvalid on to the StringIndexers when calling spark.randomForest. > code snippet: > {code} > # (i've run this in Zeppelin which already has SparkR and the context loaded) > #library(SparkR) > #sparkR.session(master = "local[*]") > data = data.frame(clicked = sample(c(0,1),100,replace=TRUE), > someString = sample(c("this", "that"), 100, > replace=TRUE), stringsAsFactors=FALSE) > trainidxs = base::sample(nrow(data), nrow(data)*0.7) > traindf = as.DataFrame(data[trainidxs,]) > testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other"))) > rf = spark.randomForest(traindf, clicked~., type="classification", > maxDepth=10, > maxBins=41, > numTrees = 100) > predictions = predict(rf, testdf, allow.new.labels=TRUE) > SparkR::collect(predictions) > {code} > stack trace: > {quote} > Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: > Job aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor > driver): org.apache.spark.SparkException: Failed to execute user defined > function($anonfun$4: (string) => double) > at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.SparkException: Unseen label: the other. > at > org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170) > at > org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166) > ... 
16 more > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) > at scala.Option.foreach(Option.scala:257) > at >
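Spark ML's StringIndexer exposes a handleInvalid parameter ('error' and 'skip', with 'keep' added in later releases); the "Unseen label" failure above is the 'error' default. A minimal pure-Python sketch of the three behaviours (illustrative only, not Spark's implementation):

```python
class ToyStringIndexer:
    """Minimal sketch of StringIndexer's handleInvalid semantics.

    'error' raises on unseen labels (Spark's default, the failure in
    this issue); 'skip' drops such rows; 'keep' maps them to one extra
    index past the fitted labels.
    """
    def __init__(self, handle_invalid="error"):
        self.handle_invalid = handle_invalid
        self.labels = {}

    def fit(self, values):
        # Assign indices in first-seen order (Spark orders by frequency,
        # but the ordering is irrelevant to the unseen-label behaviour).
        for v in values:
            self.labels.setdefault(v, len(self.labels))
        return self

    def transform(self, values):
        out = []
        for v in values:
            if v in self.labels:
                out.append(self.labels[v])
            elif self.handle_invalid == "keep":
                out.append(len(self.labels))  # one shared "unseen" bucket
            elif self.handle_invalid == "skip":
                continue
            else:
                raise ValueError("Unseen label: %s" % v)
        return out

idx = ToyStringIndexer("keep").fit(["this", "that"])
print(idx.transform(["this", "the other"]))  # [0, 2]
```

The feature request amounts to letting SparkR's model wrappers forward a value like "skip" or "keep" to the StringIndexers they construct internally.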
[jira] [Updated] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anne Rutten updated SPARK-20307: Component/s: (was: MLlib)
[jira] [Updated] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anne Rutten updated SPARK-20307: Component/s: SparkR
[jira] [Created] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer
Anne Rutten created SPARK-20307: --- Summary: SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer Key: SPARK-20307 URL: https://issues.apache.org/jira/browse/SPARK-20307 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 2.1.0 Reporter: Anne Rutten
[jira] [Commented] (SPARK-6227) PCA and SVD for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965632#comment-15965632 ] Apache Spark commented on SPARK-6227: - User 'MLnick' has created a pull request for this issue: https://github.com/apache/spark/pull/17621 > PCA and SVD for PySpark > --- > > Key: SPARK-6227 > URL: https://issues.apache.org/jira/browse/SPARK-6227 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 1.2.1 >Reporter: Julien Amelot >Assignee: Manoj Kumar > > The Dimensionality Reduction techniques are not available via Python (Scala + > Java only). > * Principal component analysis (PCA) > * Singular value decomposition (SVD) > Doc: > http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
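Under the hood, both reductions come down to a singular/eigen decomposition of the data's Gram or covariance matrix. As a rough illustration of the math only (not MLlib's distributed RowMatrix implementation), the dominant singular value of a small dense matrix can be found by power iteration:

```python
def dominant_singular_value(mat, iters=100):
    """Estimate the largest singular value of a small dense matrix
    (list of rows) by power iteration on A^T A. Illustration of the
    linear algebra behind SVD/PCA, not a production algorithm."""
    n = len(mat[0])
    v = [1.0] * n
    for _ in range(iters):
        # w = A^T (A v); v converges to the top right singular vector.
        av = [sum(row[j] * v[j] for j in range(n)) for row in mat]
        w = [sum(mat[i][j] * av[i] for i in range(len(mat))) for j in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    av = [sum(row[j] * v[j] for j in range(n)) for row in mat]
    return sum(x * x for x in av) ** 0.5

print(dominant_singular_value([[3.0, 0.0], [0.0, 1.0]]))  # close to 3.0
```

PCA then keeps the directions with the largest such values; the PySpark work tracked here wraps the existing Scala implementations rather than re-deriving anything.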
[jira] [Commented] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD
[ https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965601#comment-15965601 ] Hyukjin Kwon commented on SPARK-20294: -- {quote} If we know that the rdd is not empty (the function tests for that before line 354), then we should at least use the first row as a fallback if the sampling fails. {quote} I think this can just be handled application-side if we both agree that it is a narrow case. {code} small_rdd = sc.parallelize([(1, '2'), (2, 'foo')]) try: df = small_rdd.toDF(sampleRatio=0.01) except ValueError: df = small_rdd.toDF() {code} > _inferSchema for RDDs fails if sample returns empty RDD > --- > > Key: SPARK-20294 > URL: https://issues.apache.org/jira/browse/SPARK-20294 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: João Pedro Jericó >Priority: Minor > > Currently the _inferSchema function on > [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) > line 354 fails if applied to an RDD for which the sample call returns an > empty RDD. This is possible for example if one has a small RDD but that needs > the schema to be inferred by more than one Row. For example: > {code} > small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) > small_rdd.toDF(samplingRatio=0.01).show() > {code} > This will fail with high probability because when sampling the small_rdd with > the .sample method it will return an empty RDD most of the time. However, > this is not the desired result because we are able to sample at least 1% of > the RDD. > This is probably a problem with the other Spark APIs however I don't have the > knowledge to look at the source code for other languages. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
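The failure mode is easy to reproduce outside Spark: Bernoulli sampling keeps each row independently, so a 2-row input sampled at 1% comes back empty roughly 98% of the time. A pure-Python sketch of the sampling and of the first-row fallback discussed above (function names are made up for illustration, not PySpark's):

```python
import random

def bernoulli_sample(rows, fraction, rng):
    """Keep each row independently with probability `fraction`,
    the way RDD.sample draws rows; a tiny input often yields nothing."""
    return [r for r in rows if rng.random() < fraction]

def rows_for_inference(rows, fraction, rng):
    """Sketch of the suggested fix: if the sample comes back empty but
    the input is known to be non-empty, fall back to the first row."""
    picked = bernoulli_sample(rows, fraction, rng)
    return picked if picked else rows[:1]

rng = random.Random(0)
rows = [(1, 2), (2, "foo")]
empty = sum(1 for _ in range(1000) if not bernoulli_sample(rows, 0.01, rng))
assert empty > 900                        # the sample is almost always empty...
assert rows_for_inference(rows, 0.01, rng)  # ...but the fallback never is
```

The try/except pattern in the comment above achieves the same effect from the application side without changing _inferSchema itself.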
[jira] [Comment Edited] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen
[ https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965597#comment-15965597 ] Fei Wang edited comment on SPARK-20184 at 4/12/17 9:21 AM: --- try this : 1. create table {code} val df = (1 to 50).map(x => (x.toString, x.toString, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x)).toDF("dim_1", "dim_2", "c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10","c11", "c12", "c13", "c14", "c15", "c16", "c17", "c18", "c19", "c20") df.write.saveAsTable("sum_table_50w_3") {code} 2. query the table {code} select dim_1, dim_2, sum(c1), sum(c2), sum(c3), sum(c4), sum(c5), sum(c6), sum(c7), sum(c8), sum(c9), sum(c10), sum(c11), sum(c12), sum(c13), sum(c14), sum(c15), sum(c16), sum(c17), sum(c18), sum(c19), sum(c20) from sum_table_50w_3 group by dim_1, dim_2 limit 100 {code} was (Author: scwf): try this : 1. create table {code} val df = (1 to 50).map(x => (x.toString, x.toString, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x)).toDF("dim_1", "dim_2", "c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10","c11", "c12", "c13", "c14", "c15", "c16", "c17", "c18", "c19", "c20") df.write.saveAsTable("sum_table_50w_3") {code} 2. query the table select dim_1, dim_2, sum(c1), sum(c2), sum(c3), sum(c4), sum(c5), sum(c6), sum(c7), sum(c8), sum(c9), sum(c10), sum(c11), sum(c12), sum(c13), sum(c14), sum(c15), sum(c16), sum(c17), sum(c18), sum(c19), sum(c20) from sum_table_50w_3 group by dim_1, dim_2 limit 100 > performance regression for complex/long sql when enable whole stage codegen > --- > > Key: SPARK-20184 > URL: https://issues.apache.org/jira/browse/SPARK-20184 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0, 2.1.0 >Reporter: Fei Wang > > The performance of following SQL get much worse in spark 2.x in contrast > with codegen off. 
> SELECT >sum(COUNTER_57) > ,sum(COUNTER_71) > ,sum(COUNTER_3) > ,sum(COUNTER_70) > ,sum(COUNTER_66) > ,sum(COUNTER_75) > ,sum(COUNTER_69) > ,sum(COUNTER_55) > ,sum(COUNTER_63) > ,sum(COUNTER_68) > ,sum(COUNTER_56) > ,sum(COUNTER_37) > ,sum(COUNTER_51) > ,sum(COUNTER_42) > ,sum(COUNTER_43) > ,sum(COUNTER_1) > ,sum(COUNTER_76) > ,sum(COUNTER_54) > ,sum(COUNTER_44) > ,sum(COUNTER_46) > ,DIM_1 > ,DIM_2 > ,DIM_3 > FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100; > Num of rows of aggtable is about 3500. > whole stage codegen on(spark.sql.codegen.wholeStage = true):40s > whole stage codegen off(spark.sql.codegen.wholeStage = false):6s > After some analysis i think this is related to the huge java method(a java > method of thousand lines) which generated by codegen. > And If i config -XX:-DontCompileHugeMethods the performance get much > better(about 7s). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen
[ https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965597#comment-15965597 ] Fei Wang edited comment on SPARK-20184 at 4/12/17 9:21 AM: --- try this : 1. create table {code} val df = (1 to 50).map(x => (x.toString, x.toString, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x)).toDF("dim_1", "dim_2", "c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10","c11", "c12", "c13", "c14", "c15", "c16", "c17", "c18", "c19", "c20") df.write.saveAsTable("sum_table_50w_3") {code} 2. query the table select dim_1, dim_2, sum(c1), sum(c2), sum(c3), sum(c4), sum(c5), sum(c6), sum(c7), sum(c8), sum(c9), sum(c10), sum(c11), sum(c12), sum(c13), sum(c14), sum(c15), sum(c16), sum(c17), sum(c18), sum(c19), sum(c20) from sum_table_50w_3 group by dim_1, dim_2 limit 100 was (Author: scwf): try this : 1. create table [code] val df = (1 to 50).map(x => (x.toString, x.toString, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x)).toDF("dim_1", "dim_2", "c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10","c11", "c12", "c13", "c14", "c15", "c16", "c17", "c18", "c19", "c20") df.write.saveAsTable("sum_table_50w_3") df.write.format("csv").saveAsTable("sum_table_50w_1") [code] 2. query the table select dim_1, dim_2, sum(c1), sum(c2), sum(c3), sum(c4), sum(c5), sum(c6), sum(c7), sum(c8), sum(c9), sum(c10), sum(c11), sum(c12), sum(c13), sum(c14), sum(c15), sum(c16), sum(c17), sum(c18), sum(c19), sum(c20) from sum_table_50w_3 group by dim_1, dim_2 limit 100 > performance regression for complex/long sql when enable whole stage codegen > --- > > Key: SPARK-20184 > URL: https://issues.apache.org/jira/browse/SPARK-20184 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0, 2.1.0 >Reporter: Fei Wang > > The performance of following SQL get much worse in spark 2.x in contrast > with codegen off. 
> SELECT >sum(COUNTER_57) > ,sum(COUNTER_71) > ,sum(COUNTER_3) > ,sum(COUNTER_70) > ,sum(COUNTER_66) > ,sum(COUNTER_75) > ,sum(COUNTER_69) > ,sum(COUNTER_55) > ,sum(COUNTER_63) > ,sum(COUNTER_68) > ,sum(COUNTER_56) > ,sum(COUNTER_37) > ,sum(COUNTER_51) > ,sum(COUNTER_42) > ,sum(COUNTER_43) > ,sum(COUNTER_1) > ,sum(COUNTER_76) > ,sum(COUNTER_54) > ,sum(COUNTER_44) > ,sum(COUNTER_46) > ,DIM_1 > ,DIM_2 > ,DIM_3 > FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100; > Num of rows of aggtable is about 3500. > whole stage codegen on(spark.sql.codegen.wholeStage = true):40s > whole stage codegen off(spark.sql.codegen.wholeStage = false):6s > After some analysis i think this is related to the huge java method(a java > method of thousand lines) which generated by codegen. > And If i config -XX:-DontCompileHugeMethods the performance get much > better(about 7s). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen
[ https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965597#comment-15965597 ] Fei Wang commented on SPARK-20184: -- try this : 1. create table [code] val df = (1 to 50).map(x => (x.toString, x.toString, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x)).toDF("dim_1", "dim_2", "c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10","c11", "c12", "c13", "c14", "c15", "c16", "c17", "c18", "c19", "c20") df.write.saveAsTable("sum_table_50w_3") df.write.format("csv").saveAsTable("sum_table_50w_1") [code] 2. query the table select dim_1, dim_2, sum(c1), sum(c2), sum(c3), sum(c4), sum(c5), sum(c6), sum(c7), sum(c8), sum(c9), sum(c10), sum(c11), sum(c12), sum(c13), sum(c14), sum(c15), sum(c16), sum(c17), sum(c18), sum(c19), sum(c20) from sum_table_50w_3 group by dim_1, dim_2 limit 100 > performance regression for complex/long sql when enable whole stage codegen > --- > > Key: SPARK-20184 > URL: https://issues.apache.org/jira/browse/SPARK-20184 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0, 2.1.0 >Reporter: Fei Wang > > The performance of following SQL get much worse in spark 2.x in contrast > with codegen off. > SELECT >sum(COUNTER_57) > ,sum(COUNTER_71) > ,sum(COUNTER_3) > ,sum(COUNTER_70) > ,sum(COUNTER_66) > ,sum(COUNTER_75) > ,sum(COUNTER_69) > ,sum(COUNTER_55) > ,sum(COUNTER_63) > ,sum(COUNTER_68) > ,sum(COUNTER_56) > ,sum(COUNTER_37) > ,sum(COUNTER_51) > ,sum(COUNTER_42) > ,sum(COUNTER_43) > ,sum(COUNTER_1) > ,sum(COUNTER_76) > ,sum(COUNTER_54) > ,sum(COUNTER_44) > ,sum(COUNTER_46) > ,DIM_1 > ,DIM_2 > ,DIM_3 > FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100; > Num of rows of aggtable is about 3500. 
> whole stage codegen on(spark.sql.codegen.wholeStage = true):40s > whole stage codegen off(spark.sql.codegen.wholeStage = false):6s > After some analysis i think this is related to the huge java method(a java > method of thousand lines) which generated by codegen. > And If i config -XX:-DontCompileHugeMethods the performance get much > better(about 7s). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
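The huge-method effect described above is real: by default HotSpot refuses to JIT-compile any method larger than 8000 bytecode bytes, so one giant generated aggregate method runs interpreted, and -XX:-DontCompileHugeMethods lifts that limit. The usual fix on the codegen side is to split the generated body into many small methods. A toy Python code generator showing the shape of that split (Spark's real codegen emits Java; names here are invented for illustration):

```python
def gen_sum_functions(columns, chunk=5):
    """Generate aggregation code split into small helper functions so
    no single generated method grows huge. Returns Python source whose
    agg_all(acc, row) adds every column of `row` into the dict `acc`."""
    src = []
    helpers = []
    for i in range(0, len(columns), chunk):
        name = "agg_%d" % (i // chunk)
        helpers.append(name)
        body = "; ".join("acc['%s'] += row['%s']" % (c, c)
                         for c in columns[i:i + chunk])
        src.append("def %s(acc, row): %s" % (name, body))
    # The top-level method only dispatches, so it stays tiny.
    src.append("def agg_all(acc, row):\n    " +
               "\n    ".join("%s(acc, row)" % h for h in helpers))
    return "\n".join(src)

cols = ["c%d" % i for i in range(1, 21)]
ns = {}
exec(gen_sum_functions(cols), ns)
acc = {c: 0 for c in cols}
ns["agg_all"](acc, {c: 1 for c in cols})
print(acc["c1"], sum(acc.values()))  # 1 20
```

With 20 sum columns and a chunk size of 5, the generator emits four small helpers plus a dispatcher, instead of one method containing all 20 updates inline.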
[jira] [Commented] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD
[ https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965590#comment-15965590 ] João Pedro Jericó commented on SPARK-20294: --- Yes, I think that this would be a nice solution, i.e., given that we *know* it's not empty, we should always try to do something. Right now spark skips directly to sampling and raises an error if it fails. > _inferSchema for RDDs fails if sample returns empty RDD > --- > > Key: SPARK-20294 > URL: https://issues.apache.org/jira/browse/SPARK-20294 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: João Pedro Jericó >Priority: Minor > > Currently the _inferSchema function on > [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) > line 354 fails if applied to an RDD for which the sample call returns an > empty RDD. This is possible for example if one has a small RDD but that needs > the schema to be inferred by more than one Row. For example: > {code} > small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) > small_rdd.toDF(samplingRatio=0.01).show() > {code} > This will fail with high probability because when sampling the small_rdd with > the .sample method it will return an empty RDD most of the time. However, > this is not the desired result because we are able to sample at least 1% of > the RDD. > This is probably a problem with the other Spark APIs however I don't have the > knowledge to look at the source code for other languages. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20306) beeline connect spark thrift server failure
[ https://issues.apache.org/jira/browse/SPARK-20306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sydt updated SPARK-20306: - Description: Beeline connect spark thrift server of spark-1.6.2 with kerberos failure and error is : Error occurred during processing of message. java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: Unsupported mechanism type GSSAPI The connect command is : !connect jdbc:hive2://10.142.78.249:10010/default;principal=sparksql/nm-304-sa5212m4-bigdata-...@hadoop.sandbox.chinatelecom.cn . However, it works well before. When my hadoop cluster equipped with federation ,it can not work well was: Beeline connect spark thrift server of spark-1.6.2 with kerberos failure and error is : ERROR server.TThreadPoolServer: Error occurred during processing of message. Peer indicated failure: Unsupported mechanism type GSSAPI. The connect command is : !connect jdbc:hive2://10.142.78.249:10010/default;principal=sparksql/nm-304-sa5212m4-bigdata-...@hadoop.sandbox.chinatelecom.cn . However, it works well before. When my hadoop cluster equipped with federation ,it can not work well > beeline connect spark thrift server failure > > > Key: SPARK-20306 > URL: https://issues.apache.org/jira/browse/SPARK-20306 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.2 >Reporter: sydt > > Beeline connect spark thrift server of spark-1.6.2 with kerberos failure and > error is : > Error occurred during processing of message. > java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: > Unsupported mechanism type GSSAPI > The connect command is : > !connect > jdbc:hive2://10.142.78.249:10010/default;principal=sparksql/nm-304-sa5212m4-bigdata-...@hadoop.sandbox.chinatelecom.cn > . However, it works well before. 
> Since my Hadoop cluster was equipped with federation, it no longer works. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20306) beeline connect spark thrift server failure
[ https://issues.apache.org/jira/browse/SPARK-20306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sydt updated SPARK-20306: - Description: Beeline fails to connect to the Spark Thrift Server of spark-1.6.2 with Kerberos; the error is: ERROR server.TThreadPoolServer: Error occurred during processing of message. Peer indicated failure: Unsupported mechanism type GSSAPI. The connect command is: !connect jdbc:hive2://10.142.78.249:10010/default;principal=sparksql/nm-304-sa5212m4-bigdata-...@hadoop.sandbox.chinatelecom.cn . However, it worked well before; since my Hadoop cluster was equipped with federation, it no longer works. was: Beeline fails to connect to the Spark Thrift Server of spark-1.6.2 with Kerberos; the error is: Peer indicated failure: Unsupported mechanism type GSSAPI. The connect command is: !connect jdbc:hive2://10.142.78.249:10010/default;principal=sparksql/nm-304-sa5212m4-bigdata-...@hadoop.sandbox.chinatelecom.cn . However, it worked well before; since my Hadoop cluster was equipped with federation, it no longer works. > beeline connect spark thrift server failure > > > Key: SPARK-20306 > URL: https://issues.apache.org/jira/browse/SPARK-20306 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.2 >Reporter: sydt > > Beeline fails to connect to the Spark Thrift Server of spark-1.6.2 with Kerberos; > the error is: > ERROR server.TThreadPoolServer: Error occurred during processing of message. > Peer indicated failure: Unsupported mechanism type GSSAPI. > The connect command is: > !connect > jdbc:hive2://10.142.78.249:10010/default;principal=sparksql/nm-304-sa5212m4-bigdata-...@hadoop.sandbox.chinatelecom.cn > . However, it worked well before. > Since my Hadoop cluster was equipped with federation, it no longer works. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD
[ https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965579#comment-15965579 ] Sean Owen commented on SPARK-20294: --- CC [~andrewor14] who might be able to comment better on this. If there's no sampling, it tries the first element, then first 100, then fails and asks you to try sampling. If there's sampling, it just tries to operate on the sample. Would it be more sensible to: - Try the first element - Try the first hundred - Try a sample > _inferSchema for RDDs fails if sample returns empty RDD > --- > > Key: SPARK-20294 > URL: https://issues.apache.org/jira/browse/SPARK-20294 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: João Pedro Jericó >Priority: Minor > > Currently the _inferSchema function on > [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) > line 354 fails if applied to an RDD for which the sample call returns an > empty RDD. This is possible for example if one has a small RDD but that needs > the schema to be inferred by more than one Row. For example: > {code} > small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) > small_rdd.toDF(samplingRatio=0.01).show() > {code} > This will fail with high probability because when sampling the small_rdd with > the .sample method it will return an empty RDD most of the time. However, > this is not the desired result because we are able to sample at least 1% of > the RDD. > This is probably a problem with the other Spark APIs however I don't have the > knowledge to look at the source code for other languages. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
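The three-step fallback Sean proposes above ("try the first element, try the first hundred, try a sample") could look roughly like the following plain-Python sketch. This is not PySpark's `_inferSchema`; the function names are hypothetical, rows are modelled as dicts, and `infer_fields` (which merely merges the field names it sees) is a stand-in for real type inference. The point is only that an empty random sample can no longer make inference fail outright, because the first rows are always consulted too.

```python
import random

def infer_fields(rows):
    # Stand-in for real type inference: merge the field names seen across rows.
    merged = set()
    for row in rows:
        merged.update(row.keys())
    return sorted(merged)

def infer_schema_with_fallback(rows, sampling_ratio=None, seed=42):
    """Consult the first row, then the first 100 rows, then (optionally) a
    random sample -- so an empty sample never makes inference fail outright."""
    if not rows:
        raise ValueError("cannot infer schema from an empty RDD")
    candidates = [rows[:1], rows[:100]]
    if sampling_ratio is not None:
        rng = random.Random(seed)
        candidates.append([r for r in rows if rng.random() < sampling_ratio])
    schema = set()
    for chunk in candidates:
        schema.update(infer_fields(chunk))
    return sorted(schema)
```

With the two-row RDD from the report, `infer_schema_with_fallback([{"a": 1}, {"a": 2, "b": "foo"}], sampling_ratio=0.01)` sees both fields even when the 1% sample comes back empty.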
[jira] [Created] (SPARK-20306) beeline connect spark thrift server failure
sydt created SPARK-20306: Summary: beeline connect spark thrift server failure Key: SPARK-20306 URL: https://issues.apache.org/jira/browse/SPARK-20306 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.6.2 Reporter: sydt Beeline fails to connect to the Spark Thrift Server of spark-1.6.2 with Kerberos; the error is: Peer indicated failure: Unsupported mechanism type GSSAPI. The connect command is: !connect jdbc:hive2://10.142.78.249:10010/default;principal=sparksql/nm-304-sa5212m4-bigdata-...@hadoop.sandbox.chinatelecom.cn . However, it worked well before; since my Hadoop cluster was equipped with federation, it no longer works. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD
[ https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965568#comment-15965568 ] João Pedro Jericó edited comment on SPARK-20294 at 4/12/17 8:56 AM: Yes, if the sampling ratio is not given it infers from the first rows... still, maybe I'm nitpicking, but my use case here is that I'm working with RDDs pulled from data that can vary from ~50 rows long to 100k+, and I need to do a flatMap operation on them and then send them back to being a DF. However, the schema for the first row is not necessarily the schema for the whole thing, because some Rows are missing some entries. The solution I used was to set the samplingRatio at 0.01, which works very well except for the RDDs below 100 entries, where the sampling ratio is so small it has a chance of failing. The solution I came up with was to set the sampling ratio as {code:python} min(100., N) / N {code}, which is either 1% of the DF or everything if the dataframe is smaller than 100, but I think this is not ideal. If we know that the RDD is not empty (the function tests for that before line 354), then we should at least use the first row as a fallback if the sampling fails. was (Author: jpjandrade): Yes, if the sampling ratio is not given it infers from the first rows... maybe I'm nitpicking, but my use case here is that I'm working with RDDs pulled from data that can vary from ~50 rows long to 100k+, and I need to do a flatMap operation on them and then send them back to being a DF. However, the schema for the first row is not necessarily the schema for the whole thing, because some Rows are missing some entries. The solution I used was to set the samplingRatio at 0.01, which works very well except for the RDDs below 100 entries, where the sampling ratio is so small it has a chance of failing. 
The solution I came up with was to set the sampling ratio as {code:python} min(100., N) / N {code}, which is either 1% of the DF or everything if the dataframe is smaller than 100, but I think this is not ideal. If we know that the RDD is not empty (the function tests for that before line 354), then we should at least use the first row as a fallback if the sampling fails. > _inferSchema for RDDs fails if sample returns empty RDD > --- > > Key: SPARK-20294 > URL: https://issues.apache.org/jira/browse/SPARK-20294 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: João Pedro Jericó >Priority: Minor > > Currently the _inferSchema function on > [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) > line 354 fails if applied to an RDD for which the sample call returns an > empty RDD. This is possible for example if one has a small RDD but that needs > the schema to be inferred by more than one Row. For example: > ```python > small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) > small_rdd.toDF(samplingRatio=0.01).show() > ``` > This will fail with high probability because when sampling the small_rdd with > the .sample method it will return an empty RDD most of the time. However, > this is not the desired result because we are able to sample at least 1% of > the RDD. > This is probably a problem with the other Spark APIs however I don't have the > knowledge to look at the source code for other languages. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD
[ https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] João Pedro Jericó updated SPARK-20294: -- Description: Currently the _inferSchema function on [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) line 354 fails if applied to an RDD for which the sample call returns an empty RDD. This is possible for example if one has a small RDD but that needs the schema to be inferred by more than one Row. For example: {code} small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) small_rdd.toDF(samplingRatio=0.01).show() {code} This will fail with high probability because when sampling the small_rdd with the .sample method it will return an empty RDD most of the time. However, this is not the desired result because we are able to sample at least 1% of the RDD. This is probably a problem with the other Spark APIs however I don't have the knowledge to look at the source code for other languages. was: Currently the _inferSchema function on [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) line 354 fails if applied to an RDD for which the sample call returns an empty RDD. This is possible for example if one has a small RDD but that needs the schema to be inferred by more than one Row. For example: ```python small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) small_rdd.toDF(samplingRatio=0.01).show() ``` This will fail with high probability because when sampling the small_rdd with the .sample method it will return an empty RDD most of the time. However, this is not the desired result because we are able to sample at least 1% of the RDD. This is probably a problem with the other Spark APIs however I don't have the knowledge to look at the source code for other languages. 
> _inferSchema for RDDs fails if sample returns empty RDD > --- > > Key: SPARK-20294 > URL: https://issues.apache.org/jira/browse/SPARK-20294 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: João Pedro Jericó >Priority: Minor > > Currently the _inferSchema function on > [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) > line 354 fails if applied to an RDD for which the sample call returns an > empty RDD. This is possible for example if one has a small RDD but that needs > the schema to be inferred by more than one Row. For example: > {code} > small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) > small_rdd.toDF(samplingRatio=0.01).show() > {code} > This will fail with high probability because when sampling the small_rdd with > the .sample method it will return an empty RDD most of the time. However, > this is not the desired result because we are able to sample at least 1% of > the RDD. > This is probably a problem with the other Spark APIs however I don't have the > knowledge to look at the source code for other languages. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD
[ https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965568#comment-15965568 ] João Pedro Jericó edited comment on SPARK-20294 at 4/12/17 8:57 AM: Yes, if the sampling ratio is not given it infers from the first rows... still, maybe I'm nitpicking, but my use case here is that I'm working with RDDs pulled from data that can vary from ~50 rows long to 100k+, and I need to do a flatMap operation on them and then send them back to being a DF. However, the schema for the first row is not necessarily the schema for the whole thing, because some Rows are missing some entries. The solution I used was to set the samplingRatio at 0.01, which works very well except for the RDDs below 100 entries, where the sampling ratio is so small it has a chance of failing. The solution I came up with was to set the sampling ratio as {code} min(100., N) / N {code}, which is either 1% of the DF or everything if the dataframe is smaller than 100, but I think this is not ideal. If we know that the RDD is not empty (the function tests for that before line 354), then we should at least use the first row as a fallback if the sampling fails. was (Author: jpjandrade): Yes, if the sampling ratio is not given it infers from the first rows... still, maybe I'm nitpicking, but my use case here is that I'm working with RDDs pulled from data that can vary from ~50 rows long to 100k+, and I need to do a flatMap operation on them and then send them back to being a DF. However, the schema for the first row is not necessarily the schema for the whole thing, because some Rows are missing some entries. The solution I used was to set the samplingRatio at 0.01, which works very well except for the RDDs below 100 entries, where the sampling ratio is so small it has a chance of failing. 
The solution I came up with was to set the sampling ratio as {code:python} min(100., N) / N {code}, which is either 1% of the DF or everything if the dataframe is smaller than 100, but I think this is not ideal. If we know that the RDD is not empty (the function tests for that before line 354), then we should at least use the first row as a fallback if the sampling fails. > _inferSchema for RDDs fails if sample returns empty RDD > --- > > Key: SPARK-20294 > URL: https://issues.apache.org/jira/browse/SPARK-20294 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: João Pedro Jericó >Priority: Minor > > Currently the _inferSchema function on > [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) > line 354 fails if applied to an RDD for which the sample call returns an > empty RDD. This is possible for example if one has a small RDD but that needs > the schema to be inferred by more than one Row. For example: > ```python > small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) > small_rdd.toDF(samplingRatio=0.01).show() > ``` > This will fail with high probability because when sampling the small_rdd with > the .sample method it will return an empty RDD most of the time. However, > this is not the desired result because we are able to sample at least 1% of > the RDD. > This is probably a problem with the other Spark APIs however I don't have the > knowledge to look at the source code for other languages. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD
[ https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965568#comment-15965568 ] João Pedro Jericó commented on SPARK-20294: --- Yes, if the sampling ratio is not given it infers from the first rows... maybe I'm nitpicking, but my use case here is that I'm working with RDDs pulled from data that can vary from ~50 rows long to 100k+, and I need to do a flatMap operation on them and then send them back to being a DF. However, the schema for the first row is not necessarily the schema for the whole thing, because some Rows are missing some entries. The solution I used was to set the samplingRatio at 0.01, which works very well except for the RDDs below 100 entries, where the sampling ratio is so small it has a chance of failing. The solution I came up with was to set the sampling ratio as {code:python} min(100., N) / N {code}, which is either 1% of the DF or everything if the dataframe is smaller than 100, but I think this is not ideal. If we know that the RDD is not empty (the function tests for that before line 354), then we should at least use the first row as a fallback if the sampling fails. > _inferSchema for RDDs fails if sample returns empty RDD > --- > > Key: SPARK-20294 > URL: https://issues.apache.org/jira/browse/SPARK-20294 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: João Pedro Jericó >Priority: Minor > > Currently the _inferSchema function on > [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) > line 354 fails if applied to an RDD for which the sample call returns an > empty RDD. This is possible for example if one has a small RDD but that needs > the schema to be inferred by more than one Row. 
For example: > ```python > small_rdd = sc.parallelize([(1, 2), (2, 'foo')]) > small_rdd.toDF(samplingRatio=0.01).show() > ``` > This will fail with high probability because when sampling the small_rdd with > the .sample method it will return an empty RDD most of the time. However, > this is not the desired result because we are able to sample at least 1% of > the RDD. > This is probably a problem with the other Spark APIs however I don't have the > knowledge to look at the source code for other languages. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
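The `min(100., N) / N` workaround discussed in the comments above can be sketched as a small helper. This is a hypothetical function (not part of PySpark); it returns a ratio that targets roughly 100 sampled rows in expectation, and degenerates to 1.0 (sample everything) for RDDs smaller than 100, which makes an empty sample essentially impossible for tiny RDDs.

```python
def safe_sampling_ratio(n_rows, target=100.0):
    """Return a sampling ratio expecting about `target` sampled rows.

    For n_rows <= target the ratio is 1.0 (take everything), so small RDDs
    can no longer produce an empty sample; for larger RDDs it shrinks
    proportionally (e.g. 0.01 for 10,000 rows).
    """
    if n_rows <= 0:
        raise ValueError("RDD must be non-empty")
    return min(target, float(n_rows)) / n_rows
```

Usage would be along the lines of `rdd.toDF(samplingRatio=safe_sampling_ratio(rdd.count()))`, at the cost of an extra `count()` pass over the data.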
[jira] [Updated] (SPARK-20237) Memory management issue in Spark 1.6 and later versions
[ https://issues.apache.org/jira/browse/SPARK-20237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20237: -- Description: In spark-1.6 and later versions, there is a problem with the UnifiedMemoryManager memory management. The spark.memory.storageFraction configuration is supposed to guarantee storage at least that share of memory. In UnifiedMemoryManager, the amount of memory that execution can borrow from storage is computed as {{val memoryReclaimableFromStorage = math.max(storageMemoryPool.memoryFree, storageMemoryPool.poolSize - storageRegionSize)}}. When {{storageMemoryPool.memoryFree > storageMemoryPool.poolSize - storageRegionSize}}, the first operand is chosen; that is, the storage pool can be shrunk by the whole of storageMemoryPool.memoryFree. Because {{storageMemoryPool.memoryFree > storageMemoryPool.poolSize - storageRegionSize}}, it follows that {{storageMemoryPool.poolSize - storageMemoryPool.memoryFree < storageRegionSize}}, so after the shrink {{storageMemoryPool.poolSize < storageRegionSize}}. But storageRegionSize is the minimum share the framework guarantees to storage, so there is a problem. To solve this problem, we compute it as {{val memoryReclaimableFromStorage = storageMemoryPool.poolSize - storageRegionSize}}. 
Experimental proof: I added some log information to the UnifiedMemoryManager file as follows: {code} logInfo("storageMemoryPool.memoryFree %f".format(storageMemoryPool.memoryFree/1024.0/1024.0)) logInfo("onHeapExecutionMemoryPool.memoryFree %f".format(onHeapExecutionMemoryPool.memoryFree/1024.0/1024.0)) logInfo("storageMemoryPool.memoryUsed %f".format(storageMemoryPool.memoryUsed/1024.0/1024.0)) logInfo("onHeapExecutionMemoryPool.memoryUsed %f".format(onHeapExecutionMemoryPool.memoryUsed/1024.0/1024.0)) logInfo("storageMemoryPool.poolSize %f".format(storageMemoryPool.poolSize/1024.0/1024.0)) logInfo("onHeapExecutionMemoryPool.poolSize %f".format(onHeapExecutionMemoryPool.poolSize/1024.0/1024.0)) {code} When I run the PageRank program, the input file for PageRank is generated by BigDataBench (Chinese Academy of Sciences), is used to evaluate big data analysis systems, and is 676 MB in size. The submit command is as follows: {code} ./bin/spark-submit --class org.apache.spark.examples.SparkPageRank \ --master yarn \ --deploy-mode cluster \ --num-executors 1 \ --driver-memory 4g \ --executor-memory 7g \ --executor-cores 6 \ --queue thequeue \ ./examples/target/scala-2.10/spark-examples-1.6.2-hadoop2.2.0.jar \ /test/Google_genGraph_23.txt {code} The configuration is as follows: {code} spark.memory.useLegacyMode=false spark.memory.fraction=0.75 spark.memory.storageFraction=0.2 {code} Log information is as follows: {code} 17/02/28 11:07:34 INFO memory.UnifiedMemoryManager: storageMemoryPool.memoryFree 0.00 17/02/28 11:07:34 INFO memory.UnifiedMemoryManager: onHeapExecutionMemoryPool.memoryFree 5663.325877 17/02/28 11:07:34 INFO memory.UnifiedMemoryManager: storageMemoryPool.memoryUsed 0.299123 M 17/02/28 11:07:34 INFO memory.UnifiedMemoryManager: onHeapExecutionMemoryPool.memoryUsed 0.00 17/02/28 11:07:34 INFO memory.UnifiedMemoryManager: storageMemoryPool.poolSize 0.299123 17/02/28 11:07:34 INFO memory.UnifiedMemoryManager: onHeapExecutionMemoryPool.poolSize 
5663.325877 {code} According to the configuration, storageMemoryPool.poolSize should be at least 1 GB, but the log shows only 0.299123 MB, so there is an error. was: In spark-1.6 and later versions, there is a problem with the UnifiedMemoryManager memory management. The spark.memory.storageFraction configuration is supposed to guarantee storage at least that share of memory. In UnifiedMemoryManager, the amount of memory that execution can borrow from storage is computed as val memoryReclaimableFromStorage = math.max(storageMemoryPool.memoryFree,storageMemoryPool.poolSize - storageRegionSize). When storageMemoryPool.memoryFree > storageMemoryPool.poolSize - storageRegionSize, the first operand is chosen; that is, the storage pool can be shrunk by the whole of storageMemoryPool.memoryFree. Because storageMemoryPool.memoryFree > storageMemoryPool.poolSize - storageRegionSize, it follows that storageMemoryPool.poolSize - storageMemoryPool.memoryFree < storageRegionSize, so after the shrink storageMemoryPool.poolSize < storageRegionSize. But storageRegionSize is the minimum share the framework guarantees to storage, so there is a problem. To solve this problem, we compute it as val memoryReclaimableFromStorage = storageMemoryPool.poolSize - storageRegionSize. Experimental proof: I added some log information to the UnifiedMemoryManager file as follows: logInfo("storageMemoryPool.memoryFree %f".format(storageMemoryPool.memoryFree/1024.0/1024.0)) logInfo("onHeapExecutionMemoryPool.memoryFree
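The arithmetic claimed in SPARK-20237 above can be checked with a tiny sketch. This is plain Python, not Spark's Scala code; the two functions mirror the current and proposed formulas from the report, and the numbers in the test are illustrative (sizes in MB), not taken from the reporter's cluster.

```python
def reclaimable_current(memory_free, pool_size, storage_region_size):
    # Formula the report attributes to Spark 1.6+:
    # max(storageMemoryPool.memoryFree, poolSize - storageRegionSize).
    # When memoryFree dominates, execution may borrow the entire free part,
    # shrinking the storage pool below storageRegionSize.
    return max(memory_free, pool_size - storage_region_size)

def reclaimable_proposed(pool_size, storage_region_size):
    # Reporter's proposal: only ever reclaim down to storageRegionSize
    # (clamped at zero here for safety, an addition of this sketch).
    return max(0, pool_size - storage_region_size)
```

With a 1024 MB storage region, a 1024 MB pool, and 900 MB free, the current formula lets execution reclaim 900 MB, leaving the storage pool at 124 MB, well below the guaranteed region; the proposed formula reclaims nothing.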
[jira] [Commented] (SPARK-20286) dynamicAllocation.executorIdleTimeout is ignored after unpersist
[ https://issues.apache.org/jira/browse/SPARK-20286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965556#comment-15965556 ] Miguel Pérez commented on SPARK-20286: -- I'm not familiar with the code at all, but I'll try to send a merge request when I have time to look into it. I'll be grateful for any kind of help. Thank you. > dynamicAllocation.executorIdleTimeout is ignored after unpersist > > > Key: SPARK-20286 > URL: https://issues.apache.org/jira/browse/SPARK-20286 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Miguel Pérez > > With dynamic allocation enabled, it seems that executors with cached data > which are unpersisted are still being killed using the > {{dynamicAllocation.cachedExecutorIdleTimeout}} configuration, instead of > {{dynamicAllocation.executorIdleTimeout}}. Assuming the default configuration > ({{dynamicAllocation.cachedExecutorIdleTimeout = Infinity}}), an executor > with unpersisted data won't be released until the job ends. > *How to reproduce* > - Set different values for {{dynamicAllocation.executorIdleTimeout}} and > {{dynamicAllocation.cachedExecutorIdleTimeout}} > - Load a file into an RDD and persist it > - Execute an action on the RDD (like a count) so some executors are activated. > - When the action has finished, unpersist the RDD > - The application UI correctly removes the persisted data from the *Storage* > tab, but if you look in the *Executors* tab, you will find that the executors > remain *active* until {{dynamicAllocation.cachedExecutorIdleTimeout}} is > reached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20302) Short circuit cast when from and to types are structurally the same
[ https://issues.apache.org/jira/browse/SPARK-20302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-20302. - Resolution: Fixed Fix Version/s: 2.2.0 > Short circuit cast when from and to types are structurally the same > --- > > Key: SPARK-20302 > URL: https://issues.apache.org/jira/browse/SPARK-20302 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.2.0 > > > When we perform a cast expression and the from and to types are structurally > the same (having the same structure but different field names), we should be > able to skip the actual cast. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
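The "structurally the same" check behind SPARK-20302 above can be illustrated with a short sketch. This is not Spark's Catalyst implementation; struct types are modelled here as insertion-ordered dicts of field name to field type, leaf types as plain strings, and both function names are hypothetical.

```python
def structurally_equal(a, b):
    """True when two schema trees have the same structure, ignoring field names."""
    if isinstance(a, dict) and isinstance(b, dict):
        if len(a) != len(b):
            return False
        # Compare field types positionally; field names are deliberately ignored.
        return all(structurally_equal(x, y)
                   for x, y in zip(a.values(), b.values()))
    return a == b  # leaf types must match exactly

def cast(value, from_type, to_type):
    # Sketch of the short circuit: when the types are structurally the same,
    # return the value untouched instead of performing a field-by-field cast.
    if structurally_equal(from_type, to_type):
        return value
    raise NotImplementedError("real conversion logic would go here")
```

For example, `cast(row, {"a": "int", "b": "string"}, {"x": "int", "y": "string"})` returns `row` unchanged, since only the field names differ.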
[jira] [Updated] (SPARK-20305) Master may get stuck in the "COMPLETING_RECOVERY" state when the leader master changes; then none of the registered applications can get resources.
[ https://issues.apache.org/jira/browse/SPARK-20305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LvDongrong updated SPARK-20305: --- Attachment: failfetchresources.PNG > Master may get stuck in the "COMPLETING_RECOVERY" state when the leader master changes; > then none of the registered applications can get resources. > --- > > Key: SPARK-20305 > URL: https://issues.apache.org/jira/browse/SPARK-20305 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: LvDongrong >Priority: Critical > Attachments: failfetchresources.PNG > > > The Master may get stuck in the "COMPLETING_RECOVERY" state when the leader master > changes; then none of the registered applications can get resources. > This happened when an exception was thrown while the Master was trying to > recover (the completeRecovery method in Master.scala). The leader then stays in > the COMPLETING_RECOVERY state forever, because the leader can only change to ALIVE > from RecoveryState.RECOVERING. > !failfetchresources.PNG! -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20305) Master may get stuck in the "COMPLETING_RECOVERY" state when the leader master changes; then none of the registered applications can get resources.
[ https://issues.apache.org/jira/browse/SPARK-20305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LvDongrong updated SPARK-20305: --- Attachment: (was: failfetchresources.PNG) > Master may get stuck in the "COMPLETING_RECOVERY" state when the leader master changes; > then none of the registered applications can get resources. > --- > > Key: SPARK-20305 > URL: https://issues.apache.org/jira/browse/SPARK-20305 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: LvDongrong >Priority: Critical > > The Master may get stuck in the "COMPLETING_RECOVERY" state when the leader master > changes; then none of the registered applications can get resources. > This happened when an exception was thrown while the Master was trying to > recover (the completeRecovery method in Master.scala). The leader then stays in > the COMPLETING_RECOVERY state forever, because the leader can only change to ALIVE > from RecoveryState.RECOVERING. > !failfetchresources.PNG! -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20298) Spelling mistake: charactor
[ https://issues.apache.org/jira/browse/SPARK-20298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20298. --- Resolution: Fixed Assignee: Brendan Dwyer Fix Version/s: 2.2.0 Resolved by https://github.com/apache/spark/pull/17611 > Spelling mistake: charactor > --- > > Key: SPARK-20298 > URL: https://issues.apache.org/jira/browse/SPARK-20298 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Brendan Dwyer >Assignee: Brendan Dwyer >Priority: Trivial > Fix For: 2.2.0 > > > "charactor" should be "character" > {code} > R/pkg/R/DataFrame.R:2821: stop("path should be charactor, NULL > or omitted.") > R/pkg/R/DataFrame.R:2828: stop("mode should be charactor or > omitted. It is 'error' by default.") > R/pkg/R/DataFrame.R:3043: stop("value should be an integer, > numeric, charactor or named list.") > R/pkg/R/DataFrame.R:3055: stop("Each item in value should be > an integer, numeric or charactor.") > R/pkg/R/DataFrame.R:3601: stop("outputMode should be charactor > or omitted.") > R/pkg/R/SQLContext.R:609:stop("path should be charactor, NULL or > omitted.") > R/pkg/inst/tests/testthat/test_sparkSQL.R:2929: "path should be > charactor, NULL or omitted.") > R/pkg/inst/tests/testthat/test_sparkSQL.R:2931: "mode should be > charactor or omitted. It is 'error' by default.") > R/pkg/inst/tests/testthat/test_sparkSQL.R:2950: "path should be > charactor, NULL or omitted.") > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20305) Master may get stuck in the "COMPLETING_RECOVERY" state when the leader master changes; then none of the registered applications can get resources.
[ https://issues.apache.org/jira/browse/SPARK-20305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LvDongrong updated SPARK-20305: --- Attachment: failfetchresources.PNG > Master may get stuck in the "COMPLETING_RECOVERY" state when the leader master changes; > then none of the registered applications can get resources. > --- > > Key: SPARK-20305 > URL: https://issues.apache.org/jira/browse/SPARK-20305 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: LvDongrong >Priority: Critical > Attachments: failfetchresources.PNG > > > The Master may get stuck in the "COMPLETING_RECOVERY" state when the leader master > changes; then none of the registered applications can get resources. > This happened when an exception was thrown while the Master was trying to > recover (the completeRecovery method in Master.scala). The leader then stays in > the COMPLETING_RECOVERY state forever, because the leader can only change to ALIVE > from RecoveryState.RECOVERING. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20305) Master may get stuck in the "COMPLETING_RECOVERY" state when the leader master changes; then none of the registered applications can get resources.
[ https://issues.apache.org/jira/browse/SPARK-20305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LvDongrong updated SPARK-20305: --- Description: The Master may get stuck in the "COMPLETING_RECOVERY" state when the leader master changes; then none of the registered applications can get resources. This happened when an exception was thrown while the Master was trying to recover (the completeRecovery method in Master.scala). The leader then stays in the COMPLETING_RECOVERY state forever, because the leader can only change to ALIVE from RecoveryState.RECOVERING. !failfetchresources.PNG! was: The Master may get stuck in the "COMPLETING_RECOVERY" state when the leader master changes; then none of the registered applications can get resources. This happened when an exception was thrown while the Master was trying to recover (the completeRecovery method in Master.scala). The leader then stays in the COMPLETING_RECOVERY state forever, because the leader can only change to ALIVE from RecoveryState.RECOVERING. > Master may get stuck in the "COMPLETING_RECOVERY" state when the leader master changes; > then none of the registered applications can get resources. > --- > > Key: SPARK-20305 > URL: https://issues.apache.org/jira/browse/SPARK-20305 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: LvDongrong >Priority: Critical > Attachments: failfetchresources.PNG > > > The Master may get stuck in the "COMPLETING_RECOVERY" state when the leader master > changes; then none of the registered applications can get resources. > This happened when an exception was thrown while the Master was trying to > recover (the completeRecovery method in Master.scala). The leader then stays in > the COMPLETING_RECOVERY state forever, because the leader can only change to ALIVE > from RecoveryState.RECOVERING. > !failfetchresources.PNG! 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
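The stuck-state behaviour reported in SPARK-20305 above can be sketched as a tiny state machine. This is plain Python, not Spark's Master code; the function and callback names are hypothetical, and it only illustrates the reported flaw: ALIVE is reachable only from RECOVERING, so an exception raised mid-recovery leaves the master in COMPLETING_RECOVERY with no way forward.

```python
from enum import Enum

class RecoveryState(Enum):
    STANDBY = "STANDBY"
    RECOVERING = "RECOVERING"
    COMPLETING_RECOVERY = "COMPLETING_RECOVERY"
    ALIVE = "ALIVE"

def complete_recovery(state, reschedule):
    """Return the state the master ends up in after attempting recovery."""
    if state != RecoveryState.RECOVERING:
        return state  # ALIVE is only reachable from RECOVERING
    state = RecoveryState.COMPLETING_RECOVERY
    try:
        reschedule()  # clean up dead workers/apps; may raise
    except Exception:
        return state  # stuck: no transition out of COMPLETING_RECOVERY
    return RecoveryState.ALIVE
```

When `reschedule` succeeds the master becomes ALIVE; when it raises, the master is stranded in COMPLETING_RECOVERY and never accepts resource requests again, matching the report.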