[jira] [Resolved] (SPARK-16888) Implements eval method for expression AssertNotNull
[ https://issues.apache.org/jira/browse/SPARK-16888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-16888.
    Resolution: Fixed
    Fix Version/s: 2.1.0

Issue resolved by pull request 14486
[https://github.com/apache/spark/pull/14486]

> Implements eval method for expression AssertNotNull
> ---
>
> Key: SPARK-16888
> URL: https://issues.apache.org/jira/browse/SPARK-16888
> Project: Spark
> Issue Type: Improvement
> Reporter: Sean Zhong
> Assignee: Sean Zhong
> Priority: Minor
> Fix For: 2.1.0
>
> We should support the eval() method for the expression AssertNotNull.
> Currently, it throws UnsupportedOperationException when used in a projection
> over a LocalRelation.
> {code}
> scala> import org.apache.spark.sql.catalyst.dsl.expressions._
> scala> import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
> scala> import org.apache.spark.sql.Column
> scala> case class A(a: Int)
> scala> Seq((A(1), 2)).toDS().select(new Column(AssertNotNull("_1".attr, Nil))).explain
> java.lang.UnsupportedOperationException: Only code-generated evaluation is supported.
>   at org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull.eval(objects.scala:850)
>   ...
> {code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
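For context, the fix amounts to giving AssertNotNull an interpreted evaluation path alongside the code-generated one. The following is a standalone sketch of that behavior only — it does not use the real Catalyst classes, and the mini `Expr` trait is invented for illustration; the actual change is in PR 14486:

```scala
// Standalone sketch (NOT the real Catalyst API): models how an interpreted
// eval for a null-assertion expression behaves.
trait Expr { def eval(input: Map[String, Any]): Any }

case class Attr(name: String) extends Expr {
  def eval(input: Map[String, Any]): Any = input.getOrElse(name, null)
}

case class AssertNotNullSketch(child: Expr, walkedTypePath: Seq[String]) extends Expr {
  def eval(input: Map[String, Any]): Any = {
    val result = child.eval(input)
    if (result == null)
      // Fail fast when a non-nullable field turns out to be null.
      throw new NullPointerException(
        s"Null value appeared in non-nullable field: ${walkedTypePath.mkString(", ")}")
    result // non-null values pass through unchanged
  }
}
```

With an eval like this, the expression can run in interpreted projections (e.g. over a LocalRelation) instead of throwing UnsupportedOperationException.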
[jira] [Updated] (SPARK-16888) Implements eval method for expression AssertNotNull
[ https://issues.apache.org/jira/browse/SPARK-16888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-16888:
    Assignee: Sean Zhong
[jira] [Commented] (SPARK-16876) Add match Column expression for regular expression matching in Scala API
[ https://issues.apache.org/jira/browse/SPARK-16876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407200#comment-15407200 ]

Kapil Singh commented on SPARK-16876:
Yes, these are similar, but SPARK-16203 is for PySpark while this ticket is for Scala.

> Add match Column expression for regular expression matching in Scala API
> ---
>
> Key: SPARK-16876
> URL: https://issues.apache.org/jira/browse/SPARK-16876
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Kapil Singh
> Priority: Minor
>
> The RegExpExtract expression gets only the i-th regular expression match. A
> match expression should exist to get all the matches as an array of strings.
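Until such a built-in exists, the requested semantics can be approximated today with a UDF over java.util.regex. This is a sketch of the behavior the ticket asks for, not the proposed API:

```scala
// Collect every regex match in a string as a sequence — the "all matches"
// behavior the ticket requests (regexp_extract returns only one match).
import scala.collection.mutable.ArrayBuffer

def allMatches(s: String, pattern: String): Seq[String] = {
  val m = java.util.regex.Pattern.compile(pattern).matcher(s)
  val out = ArrayBuffer.empty[String]
  while (m.find()) out += m.group()
  out.toSeq
}
```

For example, `allMatches("a1b22c333", "\\d+")` yields `Seq("1", "22", "333")`; registered via `spark.udf.register`, it could be used as a Column expression until a native one lands.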
[jira] [Commented] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
[ https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407104#comment-15407104 ]

Sean Owen commented on SPARK-16798:
That's odd, to be sure. It's as if the distributePartition function deserializes on the executor with numPartitions = 0 for some reason. Is there any chance you're mixing versions of Spark here? You also say you have made modifications to Spark itself, though elsewhere, and I wonder if somehow there's a subtle connection. This one is hard to act on without something like a reproduction against upstream Spark.

> java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
> ---
>
> Key: SPARK-16798
> URL: https://issues.apache.org/jira/browse/SPARK-16798
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.0.0
> Reporter: Charles Allen
>
> Code at https://github.com/metamx/druid-spark-batch, which was working under
> 1.5.2, has ceased to function under 2.0.0 with the stack trace below.
> {code}
> java.lang.IllegalArgumentException: bound must be positive
>   at java.util.Random.nextInt(Random.java:388)
>   at org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
>   at org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
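The exception itself comes from java.util.Random.nextInt, which requires a strictly positive bound. If coalesce's target partition count somehow ends up as 0 (as the comment above speculates), this is exactly the error you get, which can be reproduced in isolation:

```scala
// java.util.Random.nextInt(bound) throws for bound <= 0 — the same frame
// (Random.java:388) that appears at the top of the reported stack trace.
val rng = new java.util.Random()
try {
  rng.nextInt(0) // bound must be positive
} catch {
  case e: IllegalArgumentException =>
    println(e.getMessage) // "bound must be positive"
}
```

So the real question in the ticket is not the Random call but why RDD.coalesce saw numPartitions = 0 under 2.0.0.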
[jira] [Commented] (SPARK-13710) Spark shell shows ERROR when launching on Windows
[ https://issues.apache.org/jira/browse/SPARK-13710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407096#comment-15407096 ]

Sean Owen commented on SPARK-13710:
Possibly related to https://github.com/apache/spark/commit/4775eb414fa8285cfdc301e52dac52a2ef64c9e1 / SPARK-16770 ?

> Spark shell shows ERROR when launching on Windows
> ---
>
> Key: SPARK-13710
> URL: https://issues.apache.org/jira/browse/SPARK-13710
> Project: Spark
> Issue Type: Bug
> Components: Spark Shell, Windows
> Reporter: Masayoshi TSUZUKI
> Assignee: Michel Lemay
> Priority: Minor
> Fix For: 2.0.0
>
> On Windows, when we launch {{bin\spark-shell.cmd}}, it shows an ERROR message
> and a stack trace.
> {noformat}
> C:\Users\tsudukim\Documents\workspace\spark-dev3>bin\spark-shell
> [ERROR] Terminal initialization failed; falling back to unsupported
> java.lang.NoClassDefFoundError: Could not initialize class scala.tools.fusesource_embedded.jansi.internal.Kernel32
>   at scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50)
>   at scala.tools.jline_embedded.WindowsTerminal.getConsoleMode(WindowsTerminal.java:204)
>   at scala.tools.jline_embedded.WindowsTerminal.init(WindowsTerminal.java:82)
>   at scala.tools.jline_embedded.TerminalFactory.create(TerminalFactory.java:101)
>   at scala.tools.jline_embedded.TerminalFactory.get(TerminalFactory.java:158)
>   at scala.tools.jline_embedded.console.ConsoleReader.<init>(ConsoleReader.java:229)
>   at scala.tools.jline_embedded.console.ConsoleReader.<init>(ConsoleReader.java:221)
>   at scala.tools.jline_embedded.console.ConsoleReader.<init>(ConsoleReader.java:209)
>   at scala.tools.nsc.interpreter.jline_embedded.JLineConsoleReader.<init>(JLineReader.scala:61)
>   at scala.tools.nsc.interpreter.jline_embedded.InteractiveReader.<init>(JLineReader.scala:33)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:865)
>   at scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:862)
>   at scala.tools.nsc.interpreter.ILoop.scala$tools$nsc$interpreter$ILoop$$mkReader$1(ILoop.scala:871)
>   at scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875)
>   at scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875)
>   at scala.util.Try$.apply(Try.scala:192)
>   at scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875)
>   at scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875)
>   at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
>   at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223)
>   at scala.collection.immutable.Stream.collect(Stream.scala:435)
>   at scala.tools.nsc.interpreter.ILoop.chooseReader(ILoop.scala:877)
>   at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$2.apply(ILoop.scala:916)
>   at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:916)
>   at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
>   at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
>   at scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
>   at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:911)
>   at org.apache.spark.repl.Main$.doMain(Main.scala:64)
>   at org.apache.spark.repl.Main$.main(Main.scala:47)
>   at org.apache.spark.repl.Main.main(Main.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at
[jira] [Commented] (SPARK-16888) Implements eval method for expression AssertNotNull
[ https://issues.apache.org/jira/browse/SPARK-16888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407077#comment-15407077 ]

Apache Spark commented on SPARK-16888:
User 'clockfly' has created a pull request for this issue:
https://github.com/apache/spark/pull/14486
[jira] [Assigned] (SPARK-16888) Implements eval method for expression AssertNotNull
[ https://issues.apache.org/jira/browse/SPARK-16888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16888:
    Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-16888) Implements eval method for expression AssertNotNull
[ https://issues.apache.org/jira/browse/SPARK-16888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16888:
    Assignee: Apache Spark
[jira] [Commented] (SPARK-16885) Spark shell failed to run in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-16885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407073#comment-15407073 ]

Sean Owen commented on SPARK-16885:
I'm pretty certain it's because you specified a directory, not a file. Try a file to confirm.

> Spark shell failed to run in yarn-client mode
> ---
>
> Key: SPARK-16885
> URL: https://issues.apache.org/jira/browse/SPARK-16885
> Project: Spark
> Issue Type: Bug
> Components: Spark Shell
> Affects Versions: 2.0.0
> Environment: Ubuntu 12.04
> Hadoop 2.7.2 + Yarn
> Reporter: Yury Zhyshko
> Attachments: spark-env.sh
>
> I've installed Hadoop + Yarn in pseudo-distributed mode following these instructions:
> https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html#YARN_on_a_Single_Node
> After that I downloaded and installed a prebuilt Spark for Hadoop 2.7.
> The command that I used to run a shell:
> ./bin/spark-shell --master yarn --deploy-mode client --conf spark.yarn.archive=/home/yzhishko/work/spark/jars
> Here is the error:
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/08/03 17:13:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 16/08/03 17:13:52 ERROR spark.SparkContext: Error initializing SparkContext.
> java.lang.IllegalArgumentException: Can not create a Path from an empty string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:134)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:93)
>   at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:338)
>   at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:433)
>   at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:472)
>   at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:834)
>   at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:167)
>   at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256)
>   at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
>   at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
>   at org.apache.spark.repl.Main$.createSparkSession(Main.scala:101)
>   at $line3.$read$$iw$$iw.<init>(<console>:15)
>   at $line3.$read$$iw.<init>(<console>:31)
>   at $line3.$read.<init>(<console>:33)
>   at $line3.$read$.<init>(<console>:37)
>   at $line3.$read$.<clinit>(<console>)
>   at $line3.$eval$.$print$lzycompute(<console>:7)
>   at $line3.$eval$.$print(<console>:6)
>   at $line3.$eval.$print(<console>)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
>   at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
>   at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
>   at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
>   at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
>   at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
>   at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
>   at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
>   at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
>   at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
>   at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
>   at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
>   at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38)
>   at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
>   at
[jira] [Created] (SPARK-16888) Implements eval method for expression AssertNotNull
Sean Zhong created SPARK-16888:
    Summary: Implements eval method for expression AssertNotNull
    Key: SPARK-16888
    URL: https://issues.apache.org/jira/browse/SPARK-16888
    Project: Spark
    Issue Type: Improvement
    Reporter: Sean Zhong
    Priority: Minor
[jira] [Commented] (SPARK-16887) Add SPARK_DIST_CLASSPATH to LAUNCH_CLASSPATH
[ https://issues.apache.org/jira/browse/SPARK-16887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407053#comment-15407053 ]

Apache Spark commented on SPARK-16887:
User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/14492

> Add SPARK_DIST_CLASSPATH to LAUNCH_CLASSPATH
> ---
>
> Key: SPARK-16887
> URL: https://issues.apache.org/jira/browse/SPARK-16887
> Project: Spark
> Issue Type: Bug
> Components: Spark Submit
> Reporter: Yin Huai
> Assignee: Yin Huai
>
> When deploying Spark, it can be convenient to put all the jars we want on
> Spark's classpath (Spark jars, Hadoop jars, and other libraries' jars) in a
> single directory, which may not be Spark's assembly directory. So, I am
> proposing to also add SPARK_DIST_CLASSPATH to the LAUNCH_CLASSPATH.
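As a rough illustration of the deployment pattern the description refers to (the directory path and application class below are hypothetical, not from the ticket):

```shell
# Hypothetical deployment: all jars (Spark, Hadoop, third-party) live in one
# directory outside Spark's assembly dir. SPARK_DIST_CLASSPATH already reaches
# the driver/executors; the proposal is to include it in the launcher's own
# classpath (LAUNCH_CLASSPATH) as well.
export SPARK_DIST_CLASSPATH="/opt/deploy/jars/*"
./bin/spark-submit --class com.example.Main app.jar
```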
[jira] [Commented] (SPARK-16610) When writing ORC files, orc.compress should not be overridden if users do not set "compression" in the options
[ https://issues.apache.org/jira/browse/SPARK-16610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407043#comment-15407043 ]

Hyukjin Kwon commented on SPARK-16610:
Actually, my initial proposal above included that, but I followed the suggestion at the end. I hope I didn't misunderstand you or [~rxin]. Do you mind if I ask for feedback?

> When writing ORC files, orc.compress should not be overridden if users do not set "compression" in the options
> ---
>
> Key: SPARK-16610
> URL: https://issues.apache.org/jira/browse/SPARK-16610
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Yin Huai
>
> For the ORC source, Spark SQL has a writer option {{compression}}, which is used
> to set the codec, and its value is also set to {{orc.compress}} (the ORC conf
> used for the codec). However, if a user only sets {{orc.compress}} in the writer
> options, we should not use the default value of "compression" (snappy) as the
> codec. Instead, we should respect the value of {{orc.compress}}.
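The precedence the ticket asks for can be sketched as a small resolution function (illustrative logic only, not the actual OrcOptions implementation; the function name is invented):

```scala
// Sketch of the requested option precedence: an explicit "compression"
// writer option wins; otherwise respect "orc.compress"; only fall back to
// the default codec ("snappy") when neither is set.
def resolveOrcCodec(options: Map[String, String]): String =
  options.get("compression")
    .orElse(options.get("orc.compress"))
    .getOrElse("snappy")
```

So `resolveOrcCodec(Map("orc.compress" -> "ZLIB"))` returns `"ZLIB"` instead of silently overriding it with the snappy default, while `resolveOrcCodec(Map.empty)` still returns `"snappy"`.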
[jira] [Assigned] (SPARK-16887) Add SPARK_DIST_CLASSPATH to LAUNCH_CLASSPATH
[ https://issues.apache.org/jira/browse/SPARK-16887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai reassigned SPARK-16887:
    Assignee: Yin Huai
[jira] [Created] (SPARK-16887) Add SPARK_DIST_CLASSPATH to LAUNCH_CLASSPATH
Yin Huai created SPARK-16887:
    Summary: Add SPARK_DIST_CLASSPATH to LAUNCH_CLASSPATH
    Key: SPARK-16887
    URL: https://issues.apache.org/jira/browse/SPARK-16887
    Project: Spark
    Issue Type: Bug
    Components: Spark Submit
    Reporter: Yin Huai
[jira] [Commented] (SPARK-16886) StructuredNetworkWordCount code comment incorrectly refers to DataFrame instead of Dataset
[ https://issues.apache.org/jira/browse/SPARK-16886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407031#comment-15407031 ]

Ganesh Chand commented on SPARK-16886:
Submitted a pull request - https://github.com/apache/spark/pull/14491

> StructuredNetworkWordCount code comment incorrectly refers to DataFrame instead of Dataset
> ---
>
> Key: SPARK-16886
> URL: https://issues.apache.org/jira/browse/SPARK-16886
> Project: Spark
> Issue Type: Bug
> Components: Examples
> Affects Versions: 2.0.0
> Reporter: Ganesh Chand
> Priority: Trivial
> Labels: comment, examples
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> Both StructuredNetworkWordCount.scala and JavaStructuredNetworkWordCount.java have
> the following code comment:
> {code}
> // Create DataFrame representing the stream of input lines from connection to host:port
> val lines = spark.readStream
>   .format("socket")
>   .option("host", host)
>   .option("port", port)
>   .load().as[String]
> {code}
> The comment should say "Create Dataset representing the stream of input lines
> from connection to host:port", because .as[String] returns a Dataset.
> Affected files:
> * examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala
> * examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java
> * examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCountWindowed.scala
> * examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCountWindowed.java
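The distinction behind the comment fix: `load()` alone returns an untyped DataFrame (an alias for `Dataset[Row]`), while the trailing `.as[String]` converts it into a typed `Dataset[String]`. A sketch with the type made explicit (assumes an in-scope SparkSession named `spark` plus `host` and `port` values, as in the examples):

```scala
// The trailing .as[String] is why the comment should say "Dataset":
// spark.readStream...load() by itself would be a DataFrame (Dataset[Row]);
// .as[String] produces a typed Dataset[String].
import spark.implicits._ // needed for the String encoder used by .as[String]

val lines: org.apache.spark.sql.Dataset[String] = spark.readStream
  .format("socket")
  .option("host", host)
  .option("port", port)
  .load()      // DataFrame = Dataset[Row]
  .as[String]  // Dataset[String]
```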
[jira] [Assigned] (SPARK-16886) StructuredNetworkWordCount code comment incorrectly refers to DataFrame instead of Dataset
[ https://issues.apache.org/jira/browse/SPARK-16886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16886:
    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-16886) StructuredNetworkWordCount code comment incorrectly refers to DataFrame instead of Dataset
[ https://issues.apache.org/jira/browse/SPARK-16886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407032#comment-15407032 ]

Apache Spark commented on SPARK-16886:
User 'ganeshchand' has created a pull request for this issue:
https://github.com/apache/spark/pull/14491
[jira] [Assigned] (SPARK-16886) StructuredNetworkWordCount code comment incorrectly refers to DataFrame instead of Dataset
[ https://issues.apache.org/jira/browse/SPARK-16886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16886:
    Assignee: Apache Spark
[jira] [Assigned] (SPARK-16877) Add a rule for preventing use Java's Override annotation
[ https://issues.apache.org/jira/browse/SPARK-16877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16877: Assignee: (was: Apache Spark) > Add a rule for preventing use Java's Override annotation > > > Key: SPARK-16877 > URL: https://issues.apache.org/jira/browse/SPARK-16877 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.0.1 >Reporter: Hyukjin Kwon >Priority: Minor > > This was found in https://github.com/apache/spark/pull/14419 > It seems that using the {{@Override}} annotation in Scala code instead of the {{override}} > modifier passes Scala style checking. > It would be better to add a rule for checking this.
[jira] [Assigned] (SPARK-16877) Add a rule for preventing use Java's Override annotation
[ https://issues.apache.org/jira/browse/SPARK-16877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16877: Assignee: Apache Spark > Add a rule for preventing use Java's Override annotation > > > Key: SPARK-16877 > URL: https://issues.apache.org/jira/browse/SPARK-16877 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.0.1 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > This was found in https://github.com/apache/spark/pull/14419 > It seems that using the {{@Override}} annotation in Scala code instead of the {{override}} > modifier passes Scala style checking. > It would be better to add a rule for checking this.
[jira] [Commented] (SPARK-16877) Add a rule for preventing use Java's Override annotation
[ https://issues.apache.org/jira/browse/SPARK-16877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407025#comment-15407025 ] Apache Spark commented on SPARK-16877: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/14490 > Add a rule for preventing use Java's Override annotation > > > Key: SPARK-16877 > URL: https://issues.apache.org/jira/browse/SPARK-16877 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.0.1 >Reporter: Hyukjin Kwon >Priority: Minor > > This was found in https://github.com/apache/spark/pull/14419 > It seems that using the {{@Override}} annotation in Scala code instead of the {{override}} > modifier passes Scala style checking. > It would be better to add a rule for checking this.
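The pitfall SPARK-16877 describes can be reproduced outside Spark. The sketch below is a toy (the trait and class names are mine, not Spark code), resting on the assumption stated in the issue: scalac accepts Java's {{@Override}} annotation but performs no override check with it, whereas the {{override}} modifier is verified at compile time.

```scala
// Toy illustration of SPARK-16877's concern; not Spark code.
trait Greeter {
  def greet(name: String): String = s"Hello, $name"
}

class Checked extends Greeter {
  // Scala's `override` modifier is verified by scalac:
  // misspelling `greet` here would be a compile-time error.
  override def greet(name: String): String = s"Hi, $name"
}

class Unchecked extends Greeter {
  // Java's @Override annotation compiles in Scala but checks nothing:
  // the misspelled name silently defines a NEW method instead of
  // overriding `greet`, and no style rule flags it either.
  @Override
  def gret(name: String): String = s"Hey, $name"
}

// `Unchecked` still inherits the original greet, unmodified:
assert((new Unchecked).greet("dev") == "Hello, dev")
assert((new Checked).greet("dev") == "Hi, dev")
```

A scalastyle check banning the {{@Override}} token would catch the second class at build time; whether the linked PR implements it that way is not confirmed here.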
[jira] [Resolved] (SPARK-16873) force spill NPE
[ https://issues.apache.org/jira/browse/SPARK-16873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16873. - Resolution: Fixed Assignee: sharkd tu Fix Version/s: 2.1.0 2.0.1 1.6.3 > force spill NPE > --- > > Key: SPARK-16873 > URL: https://issues.apache.org/jira/browse/SPARK-16873 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: sharkd tu >Assignee: sharkd tu > Fix For: 1.6.3, 2.0.1, 2.1.0 > > > 16/07/31 20:52:38 INFO executor.Executor: Running task 1090.1 in stage 18.0 > (TID 23793) > 16/07/31 20:52:38 INFO storage.ShuffleBlockFetcherIterator: Getting 5000 > non-empty blocks out of 5000 blocks > 16/07/31 20:52:38 INFO collection.ExternalSorter: Thread 151 spilling > in-memory map of 410.3 MB to disk (1 time so far) > 16/07/31 20:52:38 INFO storage.ShuffleBlockFetcherIterator: Started 30 remote > fetches in 214 ms > 16/07/31 20:52:44 INFO collection.ExternalSorter: spill memory to > file:/data5/yarnenv/local/usercache/tesla/appcache/application_1465785263942_56138/blockmgr-e9cc29b9-ca1a-460a-ad76- > 32f8ee437e51/0d/temp_shuffle_dc8f4489-2289-4b99-a605-81c873dc9e17, > fileSize:46.0 MB > 16/07/31 20:52:44 WARN memory.ExecutionMemoryPool: Internal error: release > called on 430168384 bytes but task only has > 424958272 bytes of memory from the on-heap execution pool > 16/07/31 20:52:45 INFO collection.ExternalSorter: Thread 152 spilling > in-memory map of 398.7 MB to disk (1 time so far) > 16/07/31 20:52:52 INFO collection.ExternalSorter: Thread 151 spilling > in-memory map of 389.6 MB to disk (2 times so far) > 16/07/31 20:52:54 INFO collection.ExternalSorter: spill memory to > file:/data11/yarnenv/local/usercache/tesla/appcache/application_1465785263942_56138/blockmgr-56e1ec4d-9560-4019-83f7- > ec6fd3fe78f9/24/temp_shuffle_10a210a4-38df-40e6-821d-00e15da12eaa, > fileSize:45.5 MB > 16/07/31 20:52:54 WARN memory.ExecutionMemoryPool: Internal error: release > called on 
413772021 bytes but task only has > 408561909 bytes of memory from the on-heap execution pool > 16/07/31 20:52:55 INFO collection.ExternalSorter: spill memory to > file:/data2/yarnenv/local/usercache/tesla/appcache/application_1465785263942_56138/blockmgr-f4fe32a2-c930-4b49-8feb- > 722536c290d7/10/temp_shuffle_4ae9dd49-6039-4a44-901a-ee4650351ebc, > fileSize:44.7 MB > 16/07/31 20:52:56 INFO collection.ExternalSorter: Thread 152 spilling > in-memory map of 389.6 MB to disk (2 times so far) > 16/07/31 20:53:30 INFO collection.ExternalSorter: spill memory to > file:/data6/yarnenv/local/usercache/tesla/appcache/application_1465785263942_56138/blockmgr-95b49f89-4155-435c-826b- > 7a616662d47a/1b/temp_shuffle_ea43939f-04d4-42a9-805f-44e01afc9a13, > fileSize:44.7 MB > 16/07/31 20:53:30 INFO collection.ExternalSorter: Thread 151 spilling > in-memory map of 389.6 MB to disk (3 times so far) > 16/07/31 20:53:49 INFO collection.ExternalSorter: spill memory to > file:/data10/yarnenv/local/usercache/tesla/appcache/application_1465785263942_56138/blockmgr-4f736364-e192-4f31-bfe9- > 8eaa29ff2114/3f/temp_shuffle_680e07ec-404e-4710-9d14-1665b300e05c, > fileSize:44.7 MB > 16/07/31 20:54:04 INFO collection.ExternalSorter: Task 23585 force spilling > in-memory map to disk and it will release 164.3 > MB memory > 16/07/31 20:54:04 INFO collection.ExternalSorter: spill memory to > file:/data4/yarnenv/local/usercache/tesla/appcache/application_1465785263942_56138/blockmgr-db5f46c3-d7a4-4f93-8b77- > 565e469696fb/09/temp_shuffle_ec3ece08-4569-4197-893a-4a5dfcbbf9fa, > fileSize:0.0 B > 16/07/31 20:54:04 WARN memory.TaskMemoryManager: leak 164.3 MB memory from > org.apache.spark.util.collection.ExternalSorter@3db4b52d > 16/07/31 20:54:04 ERROR executor.Executor: Managed memory leak detected; size > = 190458101 bytes, TID = 23585 > 16/07/31 20:54:04 ERROR executor.Executor: Exception in task 1013.0 in stage > 18.0 (TID 23585) > java.lang.NullPointerException > at > 
org.apache.spark.util.collection.ExternalSorter$SpillReader.cleanup(ExternalSorter.scala:625) > at > org.apache.spark.util.collection.ExternalSorter$SpillReader.nextBatchStream(ExternalSorter.scala:540) > at > org.apache.spark.util.collection.ExternalSorter$SpillReader.(ExternalSorter.scala:508) > at > org.apache.spark.util.collection.ExternalSorter$SpillableIterator.spill(ExternalSorter.scala:814) > at > org.apache.spark.util.collection.ExternalSorter.forceSpill(ExternalSorter.scala:254) > at org.apache.spark.util.collection.Spillable.spill(Spillable.scala:109) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:154) > at >
[jira] [Updated] (SPARK-16886) StructuredNetworkWordCount code comment incorrectly refers to DataFrame instead of Dataset
[ https://issues.apache.org/jira/browse/SPARK-16886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ganesh Chand updated SPARK-16886: - Description: Both StructuredNetworkWordCount.scala and StructuredNetworkWordCount.java has the following code comment: {code} // Create DataFrame representing the stream of input lines from connection to host:port val lines = spark.readStream .format("socket") .option("host", host) .option("port", port) .load().as[String] {code} The above comment should be {code} // Create Dataset representing the stream of input lines from connection to host:port val lines = spark.readStream .format("socket") .option("host", host) .option("port", port) .load().as[String] {code} because .as[String] returns a Dataset. Affects: * examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala * examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java * examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCountWindowed.scala * examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCountWindowed.java was: Both StructuredNetworkWordCount.scala and StructuredNetworkWordCount.java has the following code comment: {code} // Create DataFrame representing the stream of input lines from connection to host:port val lines = spark.readStream .format("socket") .option("host", host) .option("port", port) .load().as[String] {code} The above comment should be {code} // Create Dataset representing the stream of input lines from connection to host:port val lines = spark.readStream .format("socket") .option("host", host) .option("port", port) .load().as[String] {code} because .as[String] returns a Dataset. 
Files affected are: * examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala > StructuredNetworkWordCount code comment incorrectly refers to DataFrame > instead of Dataset > -- > > Key: SPARK-16886 > URL: https://issues.apache.org/jira/browse/SPARK-16886 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.0.0 >Reporter: Ganesh Chand >Priority: Trivial > Labels: comment, examples > Original Estimate: 1h > Remaining Estimate: 1h > > Both StructuredNetworkWordCount.scala and StructuredNetworkWordCount.java has > the following code comment: > {code} > // Create DataFrame representing the stream of input lines from connection to > host:port > val lines = spark.readStream > .format("socket") >.option("host", host) >.option("port", port) >.load().as[String] > {code} > The above comment should be > {code} > // Create Dataset representing the stream of input lines from connection to > host:port > val lines = spark.readStream > .format("socket") >.option("host", host) >.option("port", port) >.load().as[String] > {code} > because .as[String] returns a Dataset. > Affects: > * > examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala > > * > examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java > > * > examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCountWindowed.scala > * > examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCountWindowed.java -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16886) StructuredNetworkWordCount code comment incorrectly refers to DataFrame instead of Dataset
[ https://issues.apache.org/jira/browse/SPARK-16886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ganesh Chand updated SPARK-16886: - Description: Both StructuredNetworkWordCount.scala and StructuredNetworkWordCount.java has the following code comment: {code} // Create DataFrame representing the stream of input lines from connection to host:port val lines = spark.readStream .format("socket") .option("host", host) .option("port", port) .load().as[String] {code} The above comment should be {code} // Create Dataset representing the stream of input lines from connection to host:port val lines = spark.readStream .format("socket") .option("host", host) .option("port", port) .load().as[String] {code} because .as[String] returns a Dataset. Files affected are: * examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala was: Both StructuredNetworkWordCount.scala and StructuredNetworkWordCount.java has the following code comment: {code} // Create DataFrame representing the stream of input lines from connection to host:port val lines = spark.readStream .format("socket") .option("host", host) .option("port", port) .load().as[String] {code} The above comment should be {code} // Create Dataset representing the stream of input lines from connection to host:port val lines = spark.readStream .format("socket") .option("host", host) .option("port", port) .load().as[String] {code} because .as[String] returns a Dataset. 
> StructuredNetworkWordCount code comment incorrectly refers to DataFrame > instead of Dataset > -- > > Key: SPARK-16886 > URL: https://issues.apache.org/jira/browse/SPARK-16886 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.0.0 >Reporter: Ganesh Chand >Priority: Trivial > Labels: comment, examples > Original Estimate: 1h > Remaining Estimate: 1h > > Both StructuredNetworkWordCount.scala and StructuredNetworkWordCount.java has > the following code comment: > {code} > // Create DataFrame representing the stream of input lines from connection to > host:port > val lines = spark.readStream > .format("socket") >.option("host", host) >.option("port", port) >.load().as[String] > {code} > The above comment should be > {code} > // Create Dataset representing the stream of input lines from connection to > host:port > val lines = spark.readStream > .format("socket") >.option("host", host) >.option("port", port) >.load().as[String] > {code} > because .as[String] returns a Dataset. > Files affected are: > * > examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
[ https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406967#comment-15406967 ] Charles Allen edited comment on SPARK-16798 at 8/4/16 1:41 AM: --- [~srowen] I manually went in with IntelliJ debugging and can confirm that the driver DOES have a valid positive integer value for numPartitions when in the DRIVER. But when running in the mesos executor or in local[4], the task has a value of 0 consistently. I have been able to reproduce this with TPCH data, so I can share it around if you can point me to someone who can help debug what might have changed. The code snippet below is from RDD.scala, with my own comments added {code:title=RDD.scala} def coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty) (implicit ord: Ordering[T] = null) : RDD[T] = withScope { require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.") // Correct on DRIVER if (shuffle) { /** Distributes elements evenly across output partitions, starting from a random partition. */ val distributePartition = (index: Int, items: Iterator[T]) => { var position = (new Random(index)).nextInt(numPartitions) // numPartitions == 0 in TASK items.map { t => // Note that the hash code of the key will just be the key itself. The HashPartitioner // will mod it with the number of total partitions. position = position + 1 (position, t) } } : Iterator[(Int, T)] // include a shuffle step so that our upstream tasks are still distributed new CoalescedRDD( new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition), new HashPartitioner(numPartitions)), numPartitions, partitionCoalescer).values } else { new CoalescedRDD(this, numPartitions, partitionCoalescer) } } {code} was (Author: drcrallen): [~srowen] I manually went in with IntelliJ debugging and can confirm that the driver DOES have a valid positive integer value for numPartitions when in the DRIVER. 
But when running in the mesos executor or in local[4], the task has a value of 0 consistently. I have been able to reproduce this with TPCH data, so I can share it around if you can point me to someone who can help debug what might have changed. The code snippet below is from RDD.scala, with my own comments added {code:title=RDD.scala} def coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty) (implicit ord: Ordering[T] = null) : RDD[T] = withScope { require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.") // Correct on DRIVER if (shuffle) { /** Distributes elements evenly across output partitions, starting from a random partition. */ val distributePartition = (index: Int, items: Iterator[T]) => { var position = (new Random(index)).nextInt(numPartitions) // numPartitions == 0 in TASK items.map { t => // Note that the hash code of the key will just be the key itself. The HashPartitioner // will mod it with the number of total partitions. position = position + 1 (position, t) } } : Iterator[(Int, T)] // include a shuffle step so that our upstream tasks are still distributed new CoalescedRDD( new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition), new HashPartitioner(numPartitions)), numPartitions, partitionCoalescer).values } else { new CoalescedRDD(this, numPartitions, partitionCoalescer) } } {code} > java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2 > > > Key: SPARK-16798 > URL: https://issues.apache.org/jira/browse/SPARK-16798 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Charles Allen > > Code at https://github.com/metamx/druid-spark-batch which was working under > 1.5.2 has ceased to function under 2.0.0 with the below stacktrace. 
> {code} > java.lang.IllegalArgumentException: bound must be positive > at java.util.Random.nextInt(Random.java:388) > at > org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445) > at > org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) > at >
[jira] [Commented] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
[ https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406967#comment-15406967 ] Charles Allen commented on SPARK-16798: --- [~srowen] I manually went in with IntelliJ debugging and can confirm that the driver DOES have a valid positive integer value for numPartitions when in the DRIVER. But when running in the mesos executor or in local[4], the task has a value of 0 consistently. I have been able to reproduce this with TPCH data, so I can share it around if you can point me to someone who can help debug what might have changed. The code snippet below is from RDD.scala, with my own comments added {code:title=RDD.scala} def coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty) (implicit ord: Ordering[T] = null) : RDD[T] = withScope { require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.") // Correct on DRIVER if (shuffle) { /** Distributes elements evenly across output partitions, starting from a random partition. */ val distributePartition = (index: Int, items: Iterator[T]) => { var position = (new Random(index)).nextInt(numPartitions) // numPartitions == 0 in TASK items.map { t => // Note that the hash code of the key will just be the key itself. The HashPartitioner // will mod it with the number of total partitions. 
position = position + 1 (position, t) } } : Iterator[(Int, T)] // include a shuffle step so that our upstream tasks are still distributed new CoalescedRDD( new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition), new HashPartitioner(numPartitions)), numPartitions, partitionCoalescer).values } else { new CoalescedRDD(this, numPartitions, partitionCoalescer) } } {code} > java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2 > > > Key: SPARK-16798 > URL: https://issues.apache.org/jira/browse/SPARK-16798 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Charles Allen > > Code at https://github.com/metamx/druid-spark-batch which was working under > 1.5.2 has ceased to function under 2.0.0 with the below stacktrace. > {code} > java.lang.IllegalArgumentException: bound must be positive > at java.util.Random.nextInt(Random.java:388) > at > org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445) > at > org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at 
java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
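The bottom of the stack trace above is {{java.util.Random.nextInt}}, which matches the comment thread: if a numPartitions of 0 reaches the distributePartition closure, {{nextInt(numPartitions)}} has no choice but to throw. A minimal Spark-free sketch of that failure mode (the helper name is mine, not Spark's):

```scala
import java.util.Random

// Mirrors the first line of Spark's distributePartition closure:
//   var position = (new Random(index)).nextInt(numPartitions)
// java.util.Random.nextInt requires a strictly positive bound.
def pickStartPartition(index: Int, numPartitions: Int): Int =
  new Random(index).nextInt(numPartitions)

// With a sane partition count the call succeeds...
val start = pickStartPartition(3, 8)
assert(start >= 0 && start < 8)

// ...but a numPartitions of 0, as observed in the tasks, throws
// java.lang.IllegalArgumentException ("bound must be positive").
val threw =
  try { pickStartPartition(3, 0); false }
  catch { case _: IllegalArgumentException => true }
assert(threw)
```

This localizes the bug to numPartitions being 0 inside the serialized closure, consistent with the report that the value is valid on the driver but 0 in the task.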
[jira] [Created] (SPARK-16886) StructuredNetworkWordCount code comment incorrectly refers to DataFrame instead of Dataset
Ganesh Chand created SPARK-16886: Summary: StructuredNetworkWordCount code comment incorrectly refers to DataFrame instead of Dataset Key: SPARK-16886 URL: https://issues.apache.org/jira/browse/SPARK-16886 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 2.0.0 Reporter: Ganesh Chand Priority: Trivial Both StructuredNetworkWordCount.scala and StructuredNetworkWordCount.java have the following code comment: {code} // Create DataFrame representing the stream of input lines from connection to host:port val lines = spark.readStream .format("socket") .option("host", host) .option("port", port) .load().as[String] {code} The above comment should be {code} // Create Dataset representing the stream of input lines from connection to host:port val lines = spark.readStream .format("socket") .option("host", host) .option("port", port) .load().as[String] {code} because .as[String] returns a Dataset.
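For readers wondering why the distinction matters: in Spark 2.x, {{DataFrame}} is only a type alias for {{Dataset[Row]}}, and {{.as[String]}} re-types the rows into a {{Dataset[String]}}. The toy sketch below mimics just that type relationship without Spark; every name in it is a stand-in, and the real {{Dataset.as}} takes an {{Encoder}} rather than a conversion function.

```scala
// Toy stand-ins for org.apache.spark.sql types; not the real API.
case class Row(fields: Any*)

class Dataset[T](val data: Seq[T]) {
  // Loosely mimics Dataset.as[U]: re-typing untyped rows.
  def as[U](convert: Row => U)(implicit ev: T =:= Row): Dataset[U] =
    new Dataset(data.map(r => convert(ev(r))))
}

object SqlTypes {
  // The actual relationship in Spark 2.x: DataFrame is just an alias,
  // so calling .as on one yields a typed Dataset, not a DataFrame.
  type DataFrame = Dataset[Row]
}
import SqlTypes.DataFrame

val df: DataFrame = new Dataset(Seq(Row("hello"), Row("stream")))
val lines: Dataset[String] = df.as(r => r.fields.head.toString)
assert(lines.data == Seq("hello", "stream"))
```

Because the value is statically a {{Dataset[String]}} after the conversion, the example's comment should say Dataset, exactly as the issue proposes.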
[jira] [Updated] (SPARK-16885) Spark shell failed to run in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-16885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yury Zhyshko updated SPARK-16885: - Attachment: spark-env.sh > Spark shell failed to run in yarn-client mode > - > > Key: SPARK-16885 > URL: https://issues.apache.org/jira/browse/SPARK-16885 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0 > Environment: Ubuntu 12.04 > Hadoop 2.7.2 + Yarn >Reporter: Yury Zhyshko > Attachments: spark-env.sh > > > I've installed Hadoop + Yarn in pseudo distributed mode following these > instructions: > https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html#YARN_on_a_Single_Node > After that I downloaded and installed a prebuild Spark for Hadoop 2.7 > The command that I used to run a shell: > ./bin/spark-shell --master yarn --deploy-mode client --conf > spark.yarn.archive=/home/yzhishko/work/spark/jars > Here is the error: > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). > 16/08/03 17:13:50 WARN util.NativeCodeLoader: Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > 16/08/03 17:13:52 ERROR spark.SparkContext: Error initializing SparkContext. 
> java.lang.IllegalArgumentException: Can not create a Path from an empty string > at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126) > at org.apache.hadoop.fs.Path.(Path.java:134) > at org.apache.hadoop.fs.Path.(Path.java:93) > at > org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:338) > at > org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:433) > at > org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:472) > at > org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:834) > at > org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:167) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149) > at org.apache.spark.SparkContext.(SparkContext.scala:500) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823) > at org.apache.spark.repl.Main$.createSparkSession(Main.scala:101) > at $line3.$read$$iw$$iw.(:15) > at $line3.$read$$iw.(:31) > at $line3.$read.(:33) > at $line3.$read$.(:37) > at $line3.$read$.() > at $line3.$eval$.$print$lzycompute(:7) > at $line3.$eval$.$print(:6) > at $line3.$eval.$print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786) > at > 
scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047) > at > scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638) > at > scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637) > at > scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31) > at > scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19) > at > scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637) > at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569) > at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565) > at > scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807) > at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681) > at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395) > at > org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38) > at > org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37) > at > org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37) > at
[jira] [Updated] (SPARK-16885) Spark shell failed to run in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-16885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yury Zhyshko updated SPARK-16885: - Description: I've installed Hadoop + Yarn in pseudo distributed mode following these instructions: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html#YARN_on_a_Single_Node After that I downloaded and installed a prebuild Spark for Hadoop 2.7 The command that I used to run a shell: ./bin/spark-shell --master yarn --deploy-mode client --conf spark.yarn.archive=/home/yzhishko/work/spark/jars Here is the error: Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). 16/08/03 17:13:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 16/08/03 17:13:52 ERROR spark.SparkContext: Error initializing SparkContext. java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126) at org.apache.hadoop.fs.Path.(Path.java:134) at org.apache.hadoop.fs.Path.(Path.java:93) at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:338) at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:433) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:472) at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:834) at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:167) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149) at org.apache.spark.SparkContext.(SparkContext.scala:500) at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256) at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831) at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823) at org.apache.spark.repl.Main$.createSparkSession(Main.scala:101) at $line3.$read$$iw$$iw.(:15) at $line3.$read$$iw.(:31) at $line3.$read.(:33) at $line3.$read$.(:37) at $line3.$read$.() at $line3.$eval$.$print$lzycompute(:7) at $line3.$eval$.$print(:6) at $line3.$eval.$print() at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786) at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047) at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638) at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637) at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31) at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19) at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637) at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569) at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565) at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807) at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681) at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395) at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38) at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37) at 
org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37) at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214) at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:37) at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:94) at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920) at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909) at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909) at
[jira] [Created] (SPARK-16885) Spark shell failed to run in yarn-client mode
Yury Zhyshko created SPARK-16885: Summary: Spark shell failed to run in yarn-client mode Key: SPARK-16885 URL: https://issues.apache.org/jira/browse/SPARK-16885 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 2.0.0 Environment: Ubuntu 12.04 Hadoop 2.7.2 + Yarn Reporter: Yury Zhyshko I've installed Hadoop + Yarn in pseudo-distributed mode following these instructions: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html#YARN_on_a_Single_Node After that I downloaded and installed a prebuilt Spark for Hadoop 2.7. The command that I used to run a shell: ./bin/spark-shell --master yarn --deploy-mode client --conf spark.yarn.archive=/home/yzhishko/work/spark/jars Here is the error: Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). 16/08/03 17:13:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 16/08/03 17:13:52 ERROR spark.SparkContext: Error initializing SparkContext. 
java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126) at org.apache.hadoop.fs.Path.(Path.java:134) at org.apache.hadoop.fs.Path.(Path.java:93) at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:338) at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:433) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:472) at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:834) at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:167) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149) at org.apache.spark.SparkContext.(SparkContext.scala:500) at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256) at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831) at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823) at org.apache.spark.repl.Main$.createSparkSession(Main.scala:101) at $line3.$read$$iw$$iw.(:15) at $line3.$read$$iw.(:31) at $line3.$read.(:33) at $line3.$read$.(:37) at $line3.$read$.() at $line3.$eval$.$print$lzycompute(:7) at $line3.$eval$.$print(:6) at $line3.$eval.$print() at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786) at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047) at 
scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638) at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637) at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31) at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19) at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637) at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569) at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565) at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807) at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681) at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395) at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38) at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37) at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37) at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214) at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:37) at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:94) at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920) at
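The "Can not create a Path from an empty string" failure above originates in Client.copyFileToRemote while distributing spark.yarn.archive, which in the reported command points at a bare directory. A hedged sketch of the usual remedy (the paths and archive name are illustrative, not taken from the report): package the jars/ directory into an archive file and pass that, or list the jars with spark.yarn.jars instead.

```shell
# Sketch only -- paths are illustrative. spark.yarn.archive expects an
# archive file containing Spark's jars, not a bare directory.

# 1) Build an archive from the local jars/ directory
#    (run from the Spark installation root):
# cd jars && zip -q -r ../spark-jars.zip . && cd ..

# 2) Point spark.yarn.archive at the archive file:
# ./bin/spark-shell --master yarn --deploy-mode client \
#   --conf spark.yarn.archive=/home/yzhishko/work/spark/spark-jars.zip

# Alternatively, list the jars individually with spark.yarn.jars:
# ./bin/spark-shell --master yarn --deploy-mode client \
#   --conf "spark.yarn.jars=/home/yzhishko/work/spark/jars/*.jar"
```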
[jira] [Updated] (SPARK-16814) Fix deprecated use of ParquetWriter in Parquet test suites
[ https://issues.apache.org/jira/browse/SPARK-16814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16814: -- Priority: Minor (was: Major) > Fix deprecated use of ParquetWriter in Parquet test suites > -- > > Key: SPARK-16814 > URL: https://issues.apache.org/jira/browse/SPARK-16814 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: holdenk >Assignee: holdenk >Priority: Minor > Fix For: 2.1.0 > > > Replace deprecated ParquetWriter with the new builders -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()
[ https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16826: Assignee: (was: Apache Spark) > java.util.Hashtable limits the throughput of PARSE_URL() > > > Key: SPARK-16826 > URL: https://issues.apache.org/jira/browse/SPARK-16826 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello! > I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of > {{parse_url(url, "host")}} in Spark SQL. > Unfortunately it seems that there is an internal thread-safe cache in there, > and the instances end up being 90% idle. > When I view the thread dump for my executors, most of the executor threads > are "BLOCKED", in that state: > {code} > java.util.Hashtable.get(Hashtable.java:362) > java.net.URL.getURLStreamHandler(URL.java:1135) > java.net.URL.(URL.java:599) > java.net.URL.(URL.java:490) > java.net.URL.(URL.java:439) > org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731) > org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772) > org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203) > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202) > scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > 
Source) > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > org.apache.spark.scheduler.Task.run(Task.scala:85) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > java.lang.Thread.run(Thread.java:745) > {code} > However, when I switch from 1 executor with 36 cores to 9 executors with 4 > cores, throughput is almost 10x higher and the CPUs are back at ~100% use. > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()
[ https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406868#comment-15406868 ] Apache Spark commented on SPARK-16826: -- User 'sylvinus' has created a pull request for this issue: https://github.com/apache/spark/pull/14488 > java.util.Hashtable limits the throughput of PARSE_URL() > > > Key: SPARK-16826 > URL: https://issues.apache.org/jira/browse/SPARK-16826 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello! > I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of > {{parse_url(url, "host")}} in Spark SQL. > Unfortunately it seems that there is an internal thread-safe cache in there, > and the instances end up being 90% idle. > When I view the thread dump for my executors, most of the executor threads > are "BLOCKED", in that state: > {code} > java.util.Hashtable.get(Hashtable.java:362) > java.net.URL.getURLStreamHandler(URL.java:1135) > java.net.URL.(URL.java:599) > java.net.URL.(URL.java:490) > java.net.URL.(URL.java:439) > org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731) > org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772) > org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203) > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202) > 
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > org.apache.spark.scheduler.Task.run(Task.scala:85) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > java.lang.Thread.run(Thread.java:745) > {code} > However, when I switch from 1 executor with 36 cores to 9 executors with 4 > cores, throughput is almost 10x higher and the CPUs are back at ~100% use. > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()
[ https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16826: Assignee: Apache Spark > java.util.Hashtable limits the throughput of PARSE_URL() > > > Key: SPARK-16826 > URL: https://issues.apache.org/jira/browse/SPARK-16826 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer >Assignee: Apache Spark > > Hello! > I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of > {{parse_url(url, "host")}} in Spark SQL. > Unfortunately it seems that there is an internal thread-safe cache in there, > and the instances end up being 90% idle. > When I view the thread dump for my executors, most of the executor threads > are "BLOCKED", in that state: > {code} > java.util.Hashtable.get(Hashtable.java:362) > java.net.URL.getURLStreamHandler(URL.java:1135) > java.net.URL.(URL.java:599) > java.net.URL.(URL.java:490) > java.net.URL.(URL.java:439) > org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731) > org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772) > org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203) > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202) > scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > org.apache.spark.scheduler.Task.run(Task.scala:85) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > java.lang.Thread.run(Thread.java:745) > {code} > However, when I switch from 1 executor with 36 cores to 9 executors with 4 > cores, throughput is almost 10x higher and the CPUs are back at ~100% use. > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
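The thread dumps above show every executor thread blocked in java.util.Hashtable.get inside java.net.URL.getURLStreamHandler: the URL constructor looks up a protocol handler in a JVM-wide Hashtable whose methods are synchronized, so URL construction serializes across all threads in the process. A hedged illustration of the contrast (this is not Spark's code, just a minimal sketch): java.net.URI does pure string parsing and touches no shared handler table, which is why URI-based parsing is the usual workaround for this kind of contention.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Sketch: extract the host with java.net.URI instead of java.net.URL.
// new URL(...) calls URL.getURLStreamHandler(), which reads a static
// java.util.Hashtable -- a lock shared by every thread in the JVM.
// new URI(...) only parses the string and touches no shared mutable
// state, so many threads can parse URLs concurrently.
public class HostParser {
    /** Returns the host of the given URL string, or null if unparseable. */
    public static String host(String url) {
        try {
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(host("https://spark.apache.org/docs/latest/"));
    }
}
```

Note that URI only parses; unlike URL it cannot open a connection, but for a `parse_url(url, "host")`-style workload parsing is all that is needed.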
[jira] [Updated] (SPARK-16814) Fix deprecated use of ParquetWriter in Parquet test suites
[ https://issues.apache.org/jira/browse/SPARK-16814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16814: -- Assignee: holdenk > Fix deprecated use of ParquetWriter in Parquet test suites > -- > > Key: SPARK-16814 > URL: https://issues.apache.org/jira/browse/SPARK-16814 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: holdenk >Assignee: holdenk > Fix For: 2.1.0 > > > Replace deprecated ParquetWriter with the new builders -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16814) Fix deprecated use of ParquetWriter in Parquet test suites
[ https://issues.apache.org/jira/browse/SPARK-16814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16814. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14419 [https://github.com/apache/spark/pull/14419] > Fix deprecated use of ParquetWriter in Parquet test suites > -- > > Key: SPARK-16814 > URL: https://issues.apache.org/jira/browse/SPARK-16814 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: holdenk > Fix For: 2.1.0 > > > Replace deprecated ParquetWriter with the new builders -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16770) Spark shell not usable with german keyboard due to JLine version
[ https://issues.apache.org/jira/browse/SPARK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16770: -- Assignee: Stefan Schulze > Spark shell not usable with german keyboard due to JLine version > > > Key: SPARK-16770 > URL: https://issues.apache.org/jira/browse/SPARK-16770 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0 >Reporter: Stefan Schulze >Assignee: Stefan Schulze >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > > It is impossible to enter a right square bracket with a single keystroke > using a german keyboard layout. The problem is known from former Scala > version, responsible is jline-2.12.jar (see > https://issues.scala-lang.org/browse/SI-8759). > Workaround: Replace jline-2.12.jar by jline.2.12.1.jar in the jars folder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16770) Spark shell not usable with german keyboard due to JLine version
[ https://issues.apache.org/jira/browse/SPARK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16770. --- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 14429 [https://github.com/apache/spark/pull/14429] > Spark shell not usable with german keyboard due to JLine version > > > Key: SPARK-16770 > URL: https://issues.apache.org/jira/browse/SPARK-16770 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0 >Reporter: Stefan Schulze >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > > It is impossible to enter a right square bracket with a single keystroke > using a german keyboard layout. The problem is known from former Scala > version, responsible is jline-2.12.jar (see > https://issues.scala-lang.org/browse/SI-8759). > Workaround: Replace jline-2.12.jar by jline.2.12.1.jar in the jars folder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
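The workaround quoted above amounts to swapping one jar in the Spark installation. A hedged sketch (the mirror URL and SPARK_HOME layout are assumptions, not from the report):

```shell
# Sketch only -- adjust SPARK_HOME and the download source to your setup.
# Replace the jline 2.12 jar, which breaks AltGr key combinations such as
# "]" on a German layout, with jline 2.12.1:
# rm "$SPARK_HOME/jars/jline-2.12.jar"
# wget -P "$SPARK_HOME/jars" \
#   https://repo1.maven.org/maven2/jline/jline/2.12.1/jline-2.12.1.jar
```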
[jira] [Updated] (SPARK-16870) add "spark.sql.broadcastTimeout" to docs/sql-programming-guide.md to help people fix this timeout error when it happens
[ https://issues.apache.org/jira/browse/SPARK-16870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang Ke updated SPARK-16870: - Description: here my workload and what I found I run a large number jobs with spark-sql at the same time. and meet the error that print timeout (some job contains the broadcast-join operator) : 16/08/03 15:43:23 ERROR SparkExecuteStatementOperation: Error executing query, currentState RUNNING, java.util.concurrent.TimeoutException: Futures timed out after [300 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.sql.execution.joins.BroadcastHashOuterJoin.doExecute(BroadcastHashOuterJoin.scala:113) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) at org.apache.spark.sql.execution.Filter.doExecute(basicOperators.scala:70) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) at org.apache.spark.sql.execution.Project.doExecute(basicOperators.scala:46) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) at org.apache.spark.sql.execution.ConvertToSafe.doExecute(rowFormatConverters.scala:56) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:201) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:276) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) at org.apache.spark.sql.DataFrame.(DataFrame.scala:145) at org.apache.spark.sql.DataFrame.(DataFrame.scala:130) at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecute StatementOperation$$execute(SparkExecuteStatementOperation.scala:211) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation. 
scala:154) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation. scala:151) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1793) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:16 4) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at
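The property the issue title refers to can be raised in the usual ways; a hedged sketch (the 1200-second value is illustrative; the default is 300 seconds, which is the timeout visible in the trace above):

```shell
# Sketch only -- 1200 is an illustrative value; the default is 300 seconds.
# At submit time:
# ./bin/spark-sql --conf spark.sql.broadcastTimeout=1200

# Or per session, from SQL:
# SET spark.sql.broadcastTimeout=1200;

# Disabling automatic broadcast joins also avoids the timeout entirely:
# SET spark.sql.autoBroadcastJoinThreshold=-1;
```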
[jira] [Commented] (SPARK-16870) add "spark.sql.broadcastTimeout" to docs/sql-programming-guide.md to help people fix this timeout error when it happens
[ https://issues.apache.org/jira/browse/SPARK-16870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406834#comment-15406834 ] Liang Ke commented on SPARK-16870: -- thx :) I have update it > add "spark.sql.broadcastTimeout" into docs/sql-programming-guide.md to help > people to how to fix this timeout error when it happenned > - > > Key: SPARK-16870 > URL: https://issues.apache.org/jira/browse/SPARK-16870 > Project: Spark > Issue Type: Improvement >Reporter: Liang Ke >Priority: Trivial > > here my workload and what I found > I run a large number jobs with spark-sql at the same time. and meet the error > that print timeout (some job contains the broadcast-join operator) : > 16/08/03 15:43:23 ERROR SparkExecuteStatementOperation: Error executing > query, currentState RUNNING, > java.util.concurrent.TimeoutException: Futures timed out after [300 seconds] > at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at > org.apache.spark.sql.execution.joins.BroadcastHashOuterJoin.doExecute(BroadcastHashOuterJoin.scala:113) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.Filter.doExecute(basicOperators.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.Project.doExecute(basicOperators.scala:46) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.ConvertToSafe.doExecute(rowFormatConverters.scala:56) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:201) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:276) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:145) > at 
org.apache.spark.sql.DataFrame.(DataFrame.scala:130) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:211) > at >
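For anyone who lands on this thread with the same stack trace: the configuration property the ticket asks to document can be applied per session in spark-sql or through the Thrift Server. The values below are a sketch with illustrative numbers, not recommended settings:

```sql
-- Illustrative sketch (the 1200 s value is an arbitrary example):
-- raise the broadcast wait above the 300 s default seen in the trace above,
SET spark.sql.broadcastTimeout = 1200;
-- or disable broadcast joins entirely so the timeout no longer applies.
SET spark.sql.autoBroadcastJoinThreshold = -1;
```

Disabling auto-broadcast trades the timeout for a potentially slower shuffle join, so raising the timeout is usually the first thing to try.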
[jira] [Assigned] (SPARK-16884) Move DataSourceScanExec out of ExistingRDD.scala file
[ https://issues.apache.org/jira/browse/SPARK-16884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16884: Assignee: (was: Apache Spark) > Move DataSourceScanExec out of ExistingRDD.scala file > - > > Key: SPARK-16884 > URL: https://issues.apache.org/jira/browse/SPARK-16884 > Project: Spark > Issue Type: Improvement >Reporter: Eric Liang >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16884) Move DataSourceScanExec out of ExistingRDD.scala file
[ https://issues.apache.org/jira/browse/SPARK-16884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406779#comment-15406779 ] Apache Spark commented on SPARK-16884: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/14487 > Move DataSourceScanExec out of ExistingRDD.scala file > - > > Key: SPARK-16884 > URL: https://issues.apache.org/jira/browse/SPARK-16884 > Project: Spark > Issue Type: Improvement >Reporter: Eric Liang >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16884) Move DataSourceScanExec out of ExistingRDD.scala file
[ https://issues.apache.org/jira/browse/SPARK-16884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16884: Assignee: Apache Spark > Move DataSourceScanExec out of ExistingRDD.scala file > - > > Key: SPARK-16884 > URL: https://issues.apache.org/jira/browse/SPARK-16884 > Project: Spark > Issue Type: Improvement >Reporter: Eric Liang >Assignee: Apache Spark >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16884) Move DataSourceScanExec out of ExistingRDD.scala file
[ https://issues.apache.org/jira/browse/SPARK-16884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-16884: --- Issue Type: Improvement (was: Bug) > Move DataSourceScanExec out of ExistingRDD.scala file > - > > Key: SPARK-16884 > URL: https://issues.apache.org/jira/browse/SPARK-16884 > Project: Spark > Issue Type: Improvement >Reporter: Eric Liang >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16884) Move DataSourceScanExec out of ExistingRDD.scala file
Eric Liang created SPARK-16884: -- Summary: Move DataSourceScanExec out of ExistingRDD.scala file Key: SPARK-16884 URL: https://issues.apache.org/jira/browse/SPARK-16884 Project: Spark Issue Type: Bug Reporter: Eric Liang Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16883) SQL decimal type is not properly cast to number when collecting SparkDataFrame
Hossein Falaki created SPARK-16883: -- Summary: SQL decimal type is not properly cast to number when collecting SparkDataFrame Key: SPARK-16883 URL: https://issues.apache.org/jira/browse/SPARK-16883 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.0.0 Reporter: Hossein Falaki To reproduce, run the following code. As you can see, "y" is collected as a list of values rather than a numeric column. {code} registerTempTable(createDataFrame(iris), "iris") str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y from iris limit 5"))) 'data.frame': 5 obs. of 2 variables: $ x: num 1 1 1 1 1 $ y:List of 5 ..$ : num 2 ..$ : num 2 ..$ : num 2 ..$ : num 2 ..$ : num 2 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
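Until the underlying conversion is fixed, one hypothetical workaround (assuming the same registered "iris" table as in the repro above) is to force the conversion on the SQL side, so SparkR receives a plain double column instead of a decimal:

```sql
-- Hypothetical workaround sketch: wrap the decimal in an explicit double
-- cast so that collecting "y" on the R side yields an atomic numeric
-- vector rather than a list column.
select cast('1' as double)                  as x,
       cast(cast('2' as decimal) as double) as y
from iris
limit 5
```

This sidesteps the bug rather than fixing it; the decimal-to-number mapping in the SparkR collect path is still the real issue.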
[jira] [Commented] (SPARK-16700) StructType doesn't accept Python dicts anymore
[ https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406754#comment-15406754 ] Sylvain Zimmer commented on SPARK-16700: The verifySchema flag works great, and the {{dict}} issue seems to be fixed for me. Thanks a lot!! > StructType doesn't accept Python dicts anymore > -- > > Key: SPARK-16700 > URL: https://issues.apache.org/jira/browse/SPARK-16700 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer >Assignee: Davies Liu > > Hello, > I found this issue while testing my codebase with 2.0.0-rc5 > StructType in Spark 1.6.2 accepts the Python {{dict}} type, which is very > handy. 2.0.0-rc5 does not and throws an error. > I don't know if this was intended but I'd advocate for this behaviour to > remain the same. MapType is probably wasteful when your key names never > change and switching to Python tuples would be cumbersome. > Here is a minimal script to reproduce the issue: > {code} > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > struct_schema = SparkTypes.StructType([ > SparkTypes.StructField("id", SparkTypes.LongType()) > ]) > rdd = sc.parallelize([{"id": 0}, {"id": 1}]) > df = sqlc.createDataFrame(rdd, struct_schema) > print df.collect() > # 1.6.2 prints [Row(id=0), Row(id=1)] > # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in > type {code} > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13710) Spark shell shows ERROR when launching on Windows
[ https://issues.apache.org/jira/browse/SPARK-13710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406744#comment-15406744 ] Arsen Vladimirskiy edited comment on SPARK-13710 at 8/3/16 10:38 PM: - I noticed that when using the binary package "spark-2.0.0-bin-without-hadoop.tgz" (i.e. with user-provided Hadoop pointed to via "export SPARK_DIST_CLASSPATH=$(hadoop classpath)") the same error still happens. java.lang.NoClassDefFoundError: Could not initialize class scala.tools.fusesource_embedded.jansi.internal.Kernel32 at scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50) I compared the jars provided with spark-2.0.0-bin-with-hadoop-2.7 to the ones provided with spark-2.0.0-bin-without-hadoop and noticed that jline-2.12.jar is present in the "with-hadoop" but is missing from the "without-hadoop" binary package. When I copy the jline-2.12.jar to the jars folder of "withou-hadoop", I can start bin\spark-shell without getting this error. Is there a reason jline-2.12.jar is not part of the "without-hadoop" package? was (Author: arsenvlad): I noticed that when using the binary package "spark-2.0.0-bin-without-hadoop.tgz" (i.e. with user-provided Hadoop pointed to via "export SPARK_DIST_CLASSPATH=$(hadoop classpath)") the same error still happens. java.lang.NoClassDefFoundError: Could not initialize class scala.tools.fusesource_embedded.jansi.internal.Kernel32 at scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50) I compared the jars provided with spark-2.0.0-bin-with-hadoop-2.7 to the ones provided with spark-2.0.0-bin-without-hadoop and noticed that jline-2.12.jar is present in the "with-hadoop" but is missing from the "without-hadoop" binary package. When I copy the jline-2.12.jar to the jars folder of "withou-hadoop", I can start bin\spark-shell starts without encountering this error. 
Is there a reason jline-2.12.jar is not part of the "without-hadoop" package? > Spark shell shows ERROR when launching on Windows > - > > Key: SPARK-13710 > URL: https://issues.apache.org/jira/browse/SPARK-13710 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Windows >Reporter: Masayoshi TSUZUKI >Assignee: Michel Lemay >Priority: Minor > Fix For: 2.0.0 > > > On Windows, when we launch {{bin\spark-shell.cmd}}, it shows ERROR message > and stacktrace. > {noformat} > C:\Users\tsudukim\Documents\workspace\spark-dev3>bin\spark-shell > [ERROR] Terminal initialization failed; falling back to unsupported > java.lang.NoClassDefFoundError: Could not initialize class > scala.tools.fusesource_embedded.jansi.internal.Kernel32 > at > scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50) > at > scala.tools.jline_embedded.WindowsTerminal.getConsoleMode(WindowsTerminal.java:204) > at > scala.tools.jline_embedded.WindowsTerminal.init(WindowsTerminal.java:82) > at > scala.tools.jline_embedded.TerminalFactory.create(TerminalFactory.java:101) > at > scala.tools.jline_embedded.TerminalFactory.get(TerminalFactory.java:158) > at > scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:229) > at > scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:221) > at > scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:209) > at > scala.tools.nsc.interpreter.jline_embedded.JLineConsoleReader.(JLineReader.scala:61) > at > scala.tools.nsc.interpreter.jline_embedded.InteractiveReader.(JLineReader.scala:33) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at > 
scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:865) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:862) > at > scala.tools.nsc.interpreter.ILoop.scala$tools$nsc$interpreter$ILoop$$mkReader$1(ILoop.scala:871) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875) > at scala.util.Try$.apply(Try.scala:192) > at >
[jira] [Commented] (SPARK-13710) Spark shell shows ERROR when launching on Windows
[ https://issues.apache.org/jira/browse/SPARK-13710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406744#comment-15406744 ] Arsen Vladimirskiy commented on SPARK-13710: I noticed that when using the binary package "spark-2.0.0-bin-without-hadoop.tgz" (i.e. with user-provided Hadoop pointed to via "export SPARK_DIST_CLASSPATH=$(hadoop classpath)") the same error still happens. java.lang.NoClassDefFoundError: Could not initialize class scala.tools.fusesource_embedded.jansi.internal.Kernel32 at scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50) I compared the jars provided with spark-2.0.0-bin-*with*-hadoop-2.7 to the ones provided with spark-2.0.0-bin-*without*-hadoop and noticed that jline-2.12.jar is present in the "with-hadoop" but is missing from the "without-hadoop" binary package. When I copy the jline-2.12.jar to the jars folder of "withou-hadoop", I can start bin\spark-shell starts without encountering this error. Is there a reason jline-2.12.jar is not part of the "without-hadoop" package? > Spark shell shows ERROR when launching on Windows > - > > Key: SPARK-13710 > URL: https://issues.apache.org/jira/browse/SPARK-13710 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Windows >Reporter: Masayoshi TSUZUKI >Assignee: Michel Lemay >Priority: Minor > Fix For: 2.0.0 > > > On Windows, when we launch {{bin\spark-shell.cmd}}, it shows ERROR message > and stacktrace. 
> {noformat} > C:\Users\tsudukim\Documents\workspace\spark-dev3>bin\spark-shell > [ERROR] Terminal initialization failed; falling back to unsupported > java.lang.NoClassDefFoundError: Could not initialize class > scala.tools.fusesource_embedded.jansi.internal.Kernel32 > at > scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50) > at > scala.tools.jline_embedded.WindowsTerminal.getConsoleMode(WindowsTerminal.java:204) > at > scala.tools.jline_embedded.WindowsTerminal.init(WindowsTerminal.java:82) > at > scala.tools.jline_embedded.TerminalFactory.create(TerminalFactory.java:101) > at > scala.tools.jline_embedded.TerminalFactory.get(TerminalFactory.java:158) > at > scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:229) > at > scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:221) > at > scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:209) > at > scala.tools.nsc.interpreter.jline_embedded.JLineConsoleReader.(JLineReader.scala:61) > at > scala.tools.nsc.interpreter.jline_embedded.InteractiveReader.(JLineReader.scala:33) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:865) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:862) > at > scala.tools.nsc.interpreter.ILoop.scala$tools$nsc$interpreter$ILoop$$mkReader$1(ILoop.scala:871) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875) > at > 
scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875) > at scala.util.Try$.apply(Try.scala:192) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875) > at > scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418) > at > scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223) > at scala.collection.immutable.Stream.collect(Stream.scala:435) > at scala.tools.nsc.interpreter.ILoop.chooseReader(ILoop.scala:877) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$2.apply(ILoop.scala:916) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:916) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911) > at >
[jira] [Comment Edited] (SPARK-13710) Spark shell shows ERROR when launching on Windows
[ https://issues.apache.org/jira/browse/SPARK-13710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406744#comment-15406744 ] Arsen Vladimirskiy edited comment on SPARK-13710 at 8/3/16 10:38 PM: - I noticed that when using the binary package "spark-2.0.0-bin-without-hadoop.tgz" (i.e. with user-provided Hadoop pointed to via "export SPARK_DIST_CLASSPATH=$(hadoop classpath)") the same error still happens. java.lang.NoClassDefFoundError: Could not initialize class scala.tools.fusesource_embedded.jansi.internal.Kernel32 at scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50) I compared the jars provided with spark-2.0.0-bin-with-hadoop-2.7 to the ones provided with spark-2.0.0-bin-without-hadoop and noticed that jline-2.12.jar is present in the "with-hadoop" but is missing from the "without-hadoop" binary package. When I copy the jline-2.12.jar to the jars folder of "withou-hadoop", I can start bin\spark-shell starts without encountering this error. Is there a reason jline-2.12.jar is not part of the "without-hadoop" package? was (Author: arsenvlad): I noticed that when using the binary package "spark-2.0.0-bin-without-hadoop.tgz" (i.e. with user-provided Hadoop pointed to via "export SPARK_DIST_CLASSPATH=$(hadoop classpath)") the same error still happens. java.lang.NoClassDefFoundError: Could not initialize class scala.tools.fusesource_embedded.jansi.internal.Kernel32 at scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50) I compared the jars provided with spark-2.0.0-bin-*with*-hadoop-2.7 to the ones provided with spark-2.0.0-bin-*without*-hadoop and noticed that jline-2.12.jar is present in the "with-hadoop" but is missing from the "without-hadoop" binary package. When I copy the jline-2.12.jar to the jars folder of "withou-hadoop", I can start bin\spark-shell starts without encountering this error. 
Is there a reason jline-2.12.jar is not part of the "without-hadoop" package? > Spark shell shows ERROR when launching on Windows > - > > Key: SPARK-13710 > URL: https://issues.apache.org/jira/browse/SPARK-13710 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Windows >Reporter: Masayoshi TSUZUKI >Assignee: Michel Lemay >Priority: Minor > Fix For: 2.0.0 > > > On Windows, when we launch {{bin\spark-shell.cmd}}, it shows ERROR message > and stacktrace. > {noformat} > C:\Users\tsudukim\Documents\workspace\spark-dev3>bin\spark-shell > [ERROR] Terminal initialization failed; falling back to unsupported > java.lang.NoClassDefFoundError: Could not initialize class > scala.tools.fusesource_embedded.jansi.internal.Kernel32 > at > scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50) > at > scala.tools.jline_embedded.WindowsTerminal.getConsoleMode(WindowsTerminal.java:204) > at > scala.tools.jline_embedded.WindowsTerminal.init(WindowsTerminal.java:82) > at > scala.tools.jline_embedded.TerminalFactory.create(TerminalFactory.java:101) > at > scala.tools.jline_embedded.TerminalFactory.get(TerminalFactory.java:158) > at > scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:229) > at > scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:221) > at > scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:209) > at > scala.tools.nsc.interpreter.jline_embedded.JLineConsoleReader.(JLineReader.scala:61) > at > scala.tools.nsc.interpreter.jline_embedded.InteractiveReader.(JLineReader.scala:33) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at > 
scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:865) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:862) > at > scala.tools.nsc.interpreter.ILoop.scala$tools$nsc$interpreter$ILoop$$mkReader$1(ILoop.scala:871) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875) > at scala.util.Try$.apply(Try.scala:192) > at >
[jira] [Resolved] (SPARK-15344) Unable to set default log level for PySpark
[ https://issues.apache.org/jira/browse/SPARK-15344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-15344. Resolution: Not A Problem See Felix's reply above. > Unable to set default log level for PySpark > --- > > Key: SPARK-15344 > URL: https://issues.apache.org/jira/browse/SPARK-15344 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Minor > > After this patch: > https://github.com/apache/spark/pull/12648 > I'm unable to set default log level for Pyspark. > It's always WARN. > Below setting doesn't work: > {code} > mbrynski@jupyter:~/spark$ cat conf/log4j.properties > # Set everything to be logged to the console > log4j.rootCategory=INFO, console > log4j.appender.console=org.apache.log4j.ConsoleAppender > log4j.appender.console.target=System.err > log4j.appender.console.layout=org.apache.log4j.PatternLayout > log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p > %c{1}: %m%n > # Set the default spark-shell log level to WARN. When running the > spark-shell, the > # log level for this class is used to overwrite the root logger's log level, > so that > # the user can have different defaults for the shell and regular Spark apps. 
> log4j.logger.org.apache.spark.repl.Main=INFO > # Settings to quiet third party logs that are too verbose > log4j.logger.org.spark_project.jetty=WARN > log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR > log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO > log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO > log4j.logger.org.apache.parquet=ERROR > log4j.logger.parquet=ERROR > # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent > UDFs in SparkSQL with Hive support > log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL > log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16455) Add a new hook in CoarseGrainedSchedulerBackend in order to stop scheduling new tasks when cluster is restarting
[ https://issues.apache.org/jira/browse/SPARK-16455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406672#comment-15406672 ] YangyangLiu commented on SPARK-16455: - Ok, I get your point. I will wait for a few days; if no one leaves a comment asking to keep this feature, I will close the PR for this issue. > Add a new hook in CoarseGrainedSchedulerBackend in order to stop scheduling > new tasks when cluster is restarting > > > Key: SPARK-16455 > URL: https://issues.apache.org/jira/browse/SPARK-16455 > Project: Spark > Issue Type: New Feature > Components: Scheduler >Reporter: YangyangLiu >Priority: Minor > > In our case, we are implementing a new mechanism which will let the driver > survive while the cluster is temporarily down and restarting. When the service > provided by the cluster is not available, the scheduler should stop scheduling new > tasks, so I added a hook inside the CoarseGrainedSchedulerBackend class to > avoid scheduling new tasks when necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16796) Visible passwords on Spark environment page
[ https://issues.apache.org/jira/browse/SPARK-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406571#comment-15406571 ] Apache Spark commented on SPARK-16796: -- User 'Devian-ua' has created a pull request for this issue: https://github.com/apache/spark/pull/14484 > Visible passwords on Spark environment page > --- > > Key: SPARK-16796 > URL: https://issues.apache.org/jira/browse/SPARK-16796 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Artur >Assignee: Artur >Priority: Trivial > Fix For: 1.6.3, 2.0.1, 2.1.0 > > Attachments: > Mask_spark_ssl_keyPassword_spark_ssl_keyStorePassword_spark_ssl_trustStorePassword_from_We1.patch > > > Spark properties (passwords): > spark.ssl.keyPassword,spark.ssl.keyStorePassword,spark.ssl.trustStorePassword > are visible in Web UI in environment page. > Can we mask them from Web UI? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14204) [SQL] Failure to register URL-derived JDBC driver on executors in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-14204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-14204: -- Fix Version/s: 2.1.0 2.0.1 > [SQL] Failure to register URL-derived JDBC driver on executors in cluster mode > -- > > Key: SPARK-14204 > URL: https://issues.apache.org/jira/browse/SPARK-14204 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Kevin McHale >Assignee: Kevin McHale > Labels: JDBC, SQL > Fix For: 1.6.2, 2.0.1, 2.1.0 > > > DataFrameReader JDBC methods throw an IllegalStateException when: > 1. the JDBC driver is contained in a user-provided jar, and > 2. the user does not specify which driver to use, but rather allows spark > to determine the driver from the JDBC URL. > This broke some of our database ETL jobs at @premisedata when we upgraded > from 1.6.0 to 1.6.1. > I have tracked the problem down to a regression introduced in the fix for > SPARK-12579: > https://github.com/apache/spark/commit/7f37c1e45d52b7823d566349e2be21366d73651f#diff-391379a5ec51082e2ae1209db15c02b3R53 > The issue is that DriverRegistry.register is not called on the executors for > a JDBC driver that is derived from the JDBC path. > The problem can be demonstrated within spark-shell, provided you're in > cluster mode and you've deployed a JDBC driver (e.g. postgresql.Driver) via > the --jars argument: > {code} > import > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.createConnectionFactory > val factory = > createConnectionFactory("jdbc:postgresql://whatever.you.want/database?user=user=password", > new java.util.Properties) > sc.parallelize(1 to 100).foreach { _ => factory() } // throws exception > {code} > A sufficient fix is to apply DriverRegistry.register to the `driverClass` > variable, rather than to `userSpecifiedDriverClass`, at the code link > provided above. I will submit a PR for this shortly. 
> In the meantime, a temporary workaround is to manually specify the JDBC > driver class in the Properties object passed to DataFrameReader.jdbc, or in > the options used in other entry points, which will force the executors to > register the class properly.
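The workaround described above can be sketched as follows. This is an untested illustration, not the reporter's exact code; the URL, table name, and credentials are placeholders, and a Spark 1.6-era `sqlContext` is assumed to be in scope:

```scala
import java.util.Properties

// Hypothetical connection details; substitute your own.
val url = "jdbc:postgresql://db.example.com/mydb"
val props = new Properties()
props.setProperty("user", "user")
props.setProperty("password", "password")
// The workaround: name the driver class explicitly instead of letting Spark
// infer it from the JDBC URL, so executors register the driver themselves.
props.setProperty("driver", "org.postgresql.Driver")

val df = sqlContext.read.jdbc(url, "some_table", props)
```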
[jira] [Closed] (SPARK-16857) CrossValidator and KMeans throws IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-16857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Claussen closed SPARK-16857. - Usage error. > CrossValidator and KMeans throws IllegalArgumentException > - > > Key: SPARK-16857 > URL: https://issues.apache.org/jira/browse/SPARK-16857 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.1 > Environment: spark-jobserver docker image. Spark 1.6.1 on ubuntu, > Hadoop 2.4 >Reporter: Ryan Claussen > > I am attempting to use CrossValidation to train KMeans model. When I attempt > to fit the data spark throws an IllegalArgumentException as below since the > KMeans algorithm outputs an Integer into the prediction column instead of a > Double. Before I go too far: is using CrossValidation with Kmeans > supported? > Here's the exception: > {quote} > java.lang.IllegalArgumentException: requirement failed: Column prediction > must be of type DoubleType but was actually IntegerType. > at scala.Predef$.require(Predef.scala:233) > at > org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42) > at > org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:74) > at > org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:109) > at > org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:99) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:99) > at > com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.generateKMeans(SparkModelJob.scala:202) > at > com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:62) > at > com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:39) > at > 
spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {quote} > Here is the code I'm using to set up my cross validator. As the stack trace > above indicates it is failing at the fit step when > {quote} > ... > val mpc = new KMeans().setK(2).setFeaturesCol("indexedFeatures") > val labelConverter = new > IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels) > val pipeline = new Pipeline().setStages(Array(labelIndexer, > featureIndexer, mpc, labelConverter)) > val evaluator = new > MulticlassClassificationEvaluator().setLabelCol("approvedIndex").setPredictionCol("prediction") > val paramGrid = new ParamGridBuilder().addGrid(mpc.maxIter, Array(100, > 200, 500)).build() > val cv = new > CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3) > val cvModel = cv.fit(trainingData) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
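Since the issue was closed as a usage error, one possible (untested) way to make the reporter's pipeline evaluable is to add a stage that copies the integer KMeans prediction into a double-typed column and point the evaluator at that column instead. The sketch below assumes the same column names as the snippet above; `predictionD` is an invented name:

```scala
import org.apache.spark.ml.feature.SQLTransformer
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Hypothetical extra pipeline stage (place it after the KMeans stage):
// cast the IntegerType prediction to the DoubleType the evaluator requires.
val toDouble = new SQLTransformer()
  .setStatement("SELECT *, CAST(prediction AS DOUBLE) AS predictionD FROM __THIS__")

// Then evaluate against the double-typed copy.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("approvedIndex")
  .setPredictionCol("predictionD")
```

Whether cross-validating KMeans against labels is meaningful at all is a separate question, as the discussion above notes.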
[jira] [Updated] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types
[ https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16646: - Assignee: Wenchen Fan (was: Hyukjin Kwon) > LEAST doesn't accept numeric arguments with different data types > > > Key: SPARK-16646 > URL: https://issues.apache.org/jira/browse/SPARK-16646 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan > Fix For: 2.0.1, 2.1.0 > > > {code:sql} > SELECT LEAST(1, 1.5); > {code} > {noformat} > Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, > CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should > all have the same type, got LEAST (ArrayBuffer(IntegerType, > DecimalType(2,1))).; line 1 pos 7 (state=,code=0) > {noformat} > This query works for 1.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
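Until the fix is available, the mismatch reported above can be worked around by casting the arguments to a common type explicitly. A hedged sketch, assuming a `sqlContext` in scope:

```scala
// Explicit cast so both LEAST arguments share one decimal type;
// this should sidestep the "expressions should all have the same type" error.
sqlContext.sql("SELECT LEAST(CAST(1 AS DECIMAL(2,1)), 1.5)").show()
```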
[jira] [Resolved] (SPARK-16735) Fail to create a map containing decimal type with literals having different inferred precisions and scales
[ https://issues.apache.org/jira/browse/SPARK-16735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-16735. -- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 14439 [https://github.com/apache/spark/pull/14439] > Fail to create a map contains decimal type with literals having different > inferred precessions and scales > - > > Key: SPARK-16735 > URL: https://issues.apache.org/jira/browse/SPARK-16735 > Project: Spark > Issue Type: Sub-task >Affects Versions: 2.0.0, 2.0.1 >Reporter: Liang Ke > Fix For: 2.0.1, 2.1.0 > > > In Spark 2.0, we will parse float literals as decimals. However, it > introduces a side-effect, which is described below. > spark-sql> select map(0.1,0.01, 0.2,0.033); > Error in query: cannot resolve 'map(CAST(0.1 AS DECIMAL(1,1)), CAST(0.01 AS > DECIMAL(2,2)), CAST(0.2 AS DECIMAL(1,1)), CAST(0.033 AS DECIMAL(3,3)))' due > to data type mismatch: The given values of function map should all be the > same type, but they are [decimal(2,2), decimal(3,3)]; line 1 pos 7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
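As with the LEAST issue, an explicit cast of the map values to one common decimal type is a plausible interim workaround for the error above. An untested sketch, assuming a `sqlContext` in scope:

```scala
// Cast both map values to DECIMAL(3,3) so they agree;
// the keys (0.1 and 0.2) already infer to the same DECIMAL(1,1) type.
sqlContext.sql(
  "SELECT map(0.1, CAST(0.01 AS DECIMAL(3,3)), 0.2, CAST(0.033 AS DECIMAL(3,3)))"
).show()
```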
[jira] [Updated] (SPARK-16735) Fail to create a map containing decimal type with literals having different inferred precisions and scales
[ https://issues.apache.org/jira/browse/SPARK-16735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16735: - Assignee: Wenchen Fan > Fail to create a map contains decimal type with literals having different > inferred precessions and scales > - > > Key: SPARK-16735 > URL: https://issues.apache.org/jira/browse/SPARK-16735 > Project: Spark > Issue Type: Sub-task >Affects Versions: 2.0.0, 2.0.1 >Reporter: Liang Ke >Assignee: Wenchen Fan > Fix For: 2.0.1, 2.1.0 > > > In Spark 2.0, we will parse float literals as decimals. However, it > introduces a side-effect, which is described below. > spark-sql> select map(0.1,0.01, 0.2,0.033); > Error in query: cannot resolve 'map(CAST(0.1 AS DECIMAL(1,1)), CAST(0.01 AS > DECIMAL(2,2)), CAST(0.2 AS DECIMAL(1,1)), CAST(0.033 AS DECIMAL(3,3)))' due > to data type mismatch: The given values of function map should all be the > same type, but they are [decimal(2,2), decimal(3,3)]; line 1 pos 7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types
[ https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-16646. -- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 14439 [https://github.com/apache/spark/pull/14439] > LEAST doesn't accept numeric arguments with different data types > > > Key: SPARK-16646 > URL: https://issues.apache.org/jira/browse/SPARK-16646 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Hyukjin Kwon > Fix For: 2.0.1, 2.1.0 > > > {code:sql} > SELECT LEAST(1, 1.5); > {code} > {noformat} > Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, > CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should > all have the same type, got LEAST (ArrayBuffer(IntegerType, > DecimalType(2,1))).; line 1 pos 7 (state=,code=0) > {noformat} > This query works for 1.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406430#comment-15406430 ] Michael Allman edited comment on SPARK-16320 at 8/3/16 7:02 PM: Thank you for the new benchmarks, [~maver1ck]. The article you mention refers specifically to tuning for large executor heaps. In the end they recommend the G1GC. At VideoAmp, we rarely use executor heaps larger than 16 GB. Instead we rely on off-heap memory for large cached RDDs, etc. Since 16 GB is a relatively "small" heap size, I wonder how much impact the difference between G1GC and ParallelGC would make in this kind of scenario. We'll try running some of our jobs/queries with both GCs to see what we get. We'll also try PR 14465. Thanks again for your diligence! was (Author: michael): Thank you for the new benchmarks, [~maver1ck]. The article you mention refers specifically to tuning for large executor heaps. In the end they recommend the G1GC. At VideoAmp, we rarely use executor heaps larger than 16 GB. Instead we rely on off-heap memory for large cached RDDs, etc. Since 16 GB is a relatively "small" heap size, I wonder how much impact the difference between G1GC and ParallelGC would make in this kind of scenario. We'll try running some of our jobs/queries with both GCs to see what we get. Thanks again for your diligence! > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: spark1.6-ui.png, spark2-ui.png > > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. 
(about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem. > {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. 
> Most interesting are from queries. > https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps
[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406430#comment-15406430 ] Michael Allman commented on SPARK-16320: Thank you for the new benchmarks, [~maver1ck]. The article you mention refers specifically to tuning for large executor heaps. In the end they recommend the G1GC. At VideoAmp, we rarely use executor heaps larger than 16 GB. Instead we rely on off-heap memory for large cached RDDs, etc. Since 16 GB is a relatively "small" heap size, I wonder how much impact the difference between G1GC and ParallelGC would make in this kind of scenario. We'll try running some of our jobs/queries with both GCs to see what we get. Thanks again for your diligence! > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: spark1.6-ui.png, spark2-ui.png > > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem. 
> {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. > https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405576#comment-15405576 ] Maciej Bryński edited comment on SPARK-16320 at 8/3/16 7:01 PM: You're right. I think I found solution for SPARK-16320 in meantime. It was *changing GC from G1GC to ParallelGC.* ||Query||Spark 1.6 G1GC||Spark 1.6 ParallelGC||Spark 2.0 G1GC||Spark 2.0 ParallelGC||Spark 2.0 G1GC with PR14465||Spark 2.0 ParallelGC with PR14465|| |id > some_id|28s|30s|1:31|1:10|30s|29s| |nested_column.id > some_id|1:50|2:03|2:18|1:51|2:15|1:53| |id > some_id + python flatMap|1:57|1:59|5:09|2:53|3:47|1:51| It looks like Spark 2.0 has problems with G1GC. And there is no such a problem with 1.6. _Every measurement was minimum from 4 tries. I was using 15 cores and 30G RAM per executor on servers with 20 (40 with HT) cores and 128GB RAM_ was (Author: maver1ck): You're right. I think I found solution for SPARK-16320 in meantime. It was *changing GC from G1GC to ParallelGC.* ||Query||Spark 1.6 G1GC||Spark 1.6 ParallelGC||Spark 2.0 G1GC||Spark 2.0 ParallelGC||Spark 2.0 G1GC with PR14465||Spark 2.0 ParallelGC with PR14465|| |id > some_id|28s|30s|1:31|1:10|30s|29s| |nested_column.id > some_id|1:50|2:03|2:18|1:51|2:15|1:53| |id > some_id + python flatMap|1:57|1:59|5:09|2:53|3:47|1:51| It looks like Spark 2.0 has problems with G1GC. And there is no such a problem with 1.6. _Every measurement was minimum from 4 tries. I was using 15 cores and 30G RAM per executor on servers with 20 (with HT) cores and 128GB RAM_ > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: spark1.6-ui.png, spark2-ui.png > > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. 
> I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem. > {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in 
https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. > https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps
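The GC switch behind the benchmark table above is a one-line configuration change. An illustrative invocation (memory size and application name are placeholders, not the reporter's exact setup):

```shell
# Switch executors (and optionally the driver) from G1GC to ParallelGC.
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC" \
  --conf "spark.driver.extraJavaOptions=-XX:+UseParallelGC" \
  --executor-memory 30g \
  your_job.py
```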
[jira] [Created] (SPARK-16882) Failures in JobGenerator Thread are Swallowed, Job Does Not Fail
Brian Schrameck created SPARK-16882: --- Summary: Failures in JobGenerator Thread are Swallowed, Job Does Not Fail Key: SPARK-16882 URL: https://issues.apache.org/jira/browse/SPARK-16882 Project: Spark Issue Type: Bug Components: Scheduler, Streaming Affects Versions: 1.5.0 Environment: CDH 5.6.1, CentOS 6.7 Reporter: Brian Schrameck Using the fileStream functionality and reading a directory with a large number of files over a long period of time, JVM garbage collection limits can be reached. In this case, the JobGenerator thread threw the exception, but it was completely swallowed and did not cause the job to fail. There were no errors in the ApplicationMaster, and the job just silently sat there not processing any further batches. It would be expected that any fatal exception, not necessarily specific to this OutOfMemoryError, be handled appropriately and the job should be killed with the correct failure code. We are running in YARN cluster mode on a CDH 5.6.1 cluster. {noformat}Exception in thread "JobGenerator" java.lang.OutOfMemoryError: GC overhead limit exceeded at java.lang.AbstractStringBuilder.(AbstractStringBuilder.java:68) at java.lang.StringBuilder.(StringBuilder.java:89) at org.apache.hadoop.fs.Path.(Path.java:109) at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:430) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1494) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1534) at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:569) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1494) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1534) at org.apache.spark.streaming.dstream.FileInputDStream.findNewFiles(FileInputDStream.scala:195) at org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:146) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342) at scala.Option.orElse(Option.scala:257) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339) at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:120) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:120) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:120) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:247){noformat}
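One possible driver-side mitigation for the swallowed failure described above is a JVM-wide uncaught-exception handler that forces a non-zero exit. This is a speculative sketch using only standard JDK APIs; it will not take effect for threads on which Spark installs its own handler, so it may or may not cover the JobGenerator thread in a given Spark version:

```scala
// Install early in the driver program, before starting the StreamingContext.
Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
  override def uncaughtException(t: Thread, e: Throwable): Unit = {
    System.err.println(s"Uncaught exception on thread ${t.getName}: $e")
    // Exit non-zero so YARN records the application attempt as failed
    // instead of leaving it silently hung.
    System.exit(1)
  }
})
```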
[jira] [Comment Edited] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405576#comment-15405576 ] Maciej Bryński edited comment on SPARK-16320 at 8/3/16 6:50 PM: You're right. I think I found solution for SPARK-16320 in meantime. It was *changing GC from G1GC to ParallelGC.* ||Query||Spark 1.6 G1GC||Spark 1.6 ParallelGC||Spark 2.0 G1GC||Spark 2.0 ParallelGC||Spark 2.0 G1GC with PR14465||Spark 2.0 ParallelGC with PR14465|| |id > some_id|28s|30s|1:31|1:10|30s|29s| |nested_column.id > some_id|1:50|2:03|2:18|1:51|2:15|1:53| |id > some_id + python flatMap|1:57|1:59|5:09|2:53|3:47|1:51| It looks like Spark 2.0 has problems with G1GC. And there is no such a problem with 1.6. _Every measurement was minimum from 4 tries. I was using 15 cores and 30G RAM per executor on servers with 20 (with HT) cores and 128GB RAM_ was (Author: maver1ck): You're right. I think I found solution for SPARK-16320 in meantime. It was *changing GC from G1GC to ParallelGC.* ||Query||Spark 1.6 ParallelGC||Spark 2.0 G1GC||Spark 2.0 ParallelGC||Spark 2.0 ParallelGC with PR14465|| |id > some_id|30s| |1:10|29s| |nested_column.id > some_id|2:03|2:18|1:51|1:53| |id > some_id + python flatMap|1:59| |2:53|1:51| > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: spark1.6-ui.png, spark2-ui.png > > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? 
> I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem. > {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. 
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16455) Add a new hook in CoarseGrainedSchedulerBackend in order to stop scheduling new tasks when cluster is restarting
[ https://issues.apache.org/jira/browse/SPARK-16455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406375#comment-15406375 ] Kay Ousterhout commented on SPARK-16455: Given that this feature is only needed for an internal feature that's not part of the main Spark code, I don't think it makes sense to add additional complexity to Spark to support it. Also, isClusterAvailableForNewOffers is in a class that's private to Spark, so it doesn't make sense to add this Spark-private method that's not used anywhere in Spark. I'd be in favor of closing this as "will not fix", unless others have reasons they think this makes sense to add? > Add a new hook in CoarseGrainedSchedulerBackend in order to stop scheduling > new tasks when cluster is restarting > > > Key: SPARK-16455 > URL: https://issues.apache.org/jira/browse/SPARK-16455 > Project: Spark > Issue Type: New Feature > Components: Scheduler >Reporter: YangyangLiu >Priority: Minor > > In our case, we are implementing a new mechanism which will let driver > survive when cluster is temporarily down and restarting. So when the service > provided by cluster is not available, scheduler should stop scheduling new > tasks. I added a hook inside CoarseGrainedSchedulerBackend class, in order to > avoid new task scheduling when it's necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406372#comment-15406372 ] Maciej Bryński commented on SPARK-16320: Yes. I also added Spark 1.6 with G1GC. Bear in mind that every GC is running with its default configuration. There is a nice article by Databricks, but I didn't do the GC tuning they did. https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: spark1.6-ui.png, spark2-ui.png > > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem. 
> {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. > https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16578) Configurable hostname for RBackend
[ https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406361#comment-15406361 ] Joseph K. Bradley commented on SPARK-16578: --- [~junyangq] has done some work on this. [~wm624] are you also working on it? Pinging both to resolve duplicate work. Thanks! > Configurable hostname for RBackend > -- > > Key: SPARK-16578 > URL: https://issues.apache.org/jira/browse/SPARK-16578 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > One of the requirements that comes up with SparkR being a standalone package > is that users can now install just the R package on the client side and > connect to a remote machine which runs the RBackend class. > We should check if we can support this mode of execution and what are the > pros / cons of it -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16714) Fail to create decimal arrays with literals having different inferred precisions and scales
[ https://issues.apache.org/jira/browse/SPARK-16714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406351#comment-15406351 ] Yin Huai commented on SPARK-16714: -- [~petermaxlee] Can you create a new jira for your pr? I just merged [~cloud_fan]'s quick fix. > Fail to create a decimal arrays with literals having different inferred > precessions and scales > -- > > Key: SPARK-16714 > URL: https://issues.apache.org/jira/browse/SPARK-16714 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Yin Huai >Assignee: Wenchen Fan > Fix For: 2.0.1, 2.1.0 > > > In Spark 2.0, we will parse float literals as decimals. However, it > introduces a side-effect, which is described below. > > {code} > select array(0.001, 0.02) > {code} > causes > {code} > org.apache.spark.sql.AnalysisException: cannot resolve 'array(CAST(0.001 AS > DECIMAL(3,3)), CAST(0.02 AS DECIMAL(2,2)))' due to data type mismatch: input > to function array should all be the same type, but it's [decimal(3,3), > decimal(2,2)]; line 1 pos 7 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
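The mismatch in the quoted error comes from per-literal type inference: each float literal gets its own decimal precision and scale, and `array(...)` then refuses the differing types. A rough pure-Python sketch of that inference and of one common widening rule; this is a simplification inferred from the error message (DECIMAL(3,3) vs DECIMAL(2,2)), not Spark's actual implementation:

```python
def infer_decimal_type(literal: str):
    """Infer (precision, scale) for a decimal literal: scale is the number of
    fractional digits; precision counts significant digits but is at least the scale."""
    int_part, _, frac_part = literal.partition(".")
    scale = len(frac_part)
    int_digits = len(int_part.lstrip("-").lstrip("0"))
    return (max(int_digits + scale, scale, 1), scale)

def widen(t1, t2):
    """A common widening rule: keep the max integer-digit count and the max scale."""
    (p1, s1), (p2, s2) = t1, t2
    scale = max(s1, s2)
    return (max(p1 - s1, p2 - s2) + scale, scale)

print(infer_decimal_type("0.001"))  # (3, 3), matching DECIMAL(3,3) in the error
print(infer_decimal_type("0.02"))   # (2, 2), matching DECIMAL(2,2)
print(widen((3, 3), (2, 2)))        # (3, 3): a type both literals can be cast to
```

A fix along the lines merged here inserts such a widening cast on every array element before type-checking, so `array(0.001, 0.02)` resolves instead of failing.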
[jira] [Resolved] (SPARK-16596) Refactor DataSourceScanExec to do partition discovery at execution instead of planning time
[ https://issues.apache.org/jira/browse/SPARK-16596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-16596. Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14241 [https://github.com/apache/spark/pull/14241] > Refactor DataSourceScanExec to do partition discovery at execution instead of > planning time > --- > > Key: SPARK-16596 > URL: https://issues.apache.org/jira/browse/SPARK-16596 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Eric Liang >Priority: Minor > Fix For: 2.1.0 > > > Partition discovery is rather expensive, so we should do it at execution time > instead of during physical planning. Right now there is not much benefit > since ListingFileCatalog will read scan for all partitions at planning time > anyways, but this can be optimized in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16753) Spark SQL doesn't handle skewed dataset joins properly
[ https://issues.apache.org/jira/browse/SPARK-16753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406348#comment-15406348 ] Thomas Sebastian commented on SPARK-16753: -- [~jurriaanpruis] try a repartition before your join, so that the partitions being joined are evenly sized, which should improve performance. But please make sure you have enough cores to process these partitions. > Spark SQL doesn't handle skewed dataset joins properly > -- > > Key: SPARK-16753 > URL: https://issues.apache.org/jira/browse/SPARK-16753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.0.1 >Reporter: Jurriaan Pruis > Attachments: screenshot-1.png > > > I'm having issues with joining a 1 billion row dataframe with skewed data > with multiple dataframes with sizes ranging from 100,000 to 10 million rows. > This means some of the joins (about half of them) can be done using > broadcast, but not all. > Because the data in the large dataframe is skewed we get out of memory errors > in the executors or errors like: > `org.apache.spark.shuffle.FetchFailedException: Too large frame`. > We tried a lot of things, like broadcast joining the skewed rows separately > and unioning them with the dataset containing the sort merge joined data. > Which works perfectly when doing one or two joins, but when doing 10 joins > like this the query planner gets confused (see [SPARK-15326]). > As most of the rows are skewed on the NULL value we use a hack where we put > unique values in those NULL columns so the data is properly distributed over > all partitions. This works fine for NULL values, but since this table is > growing rapidly and we have skewed data for non-NULL values as well this > isn't a full solution to the problem. > Right now this specific spark task runs well 30% of the time and it's getting > worse and worse because of the increasing amount of data. > How to approach these kinds of joins using Spark? 
It seems weird that I can't > find proper solutions for this problem/other people having the same kind of > issues when Spark profiles itself as a large-scale data processing engine. > Doing joins on big datasets should be a thing Spark should have no problem > with out of the box. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
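The "put unique values in those NULL columns" hack from the issue description is key salting: appending a row-dependent suffix to the skewed key spreads rows that would all hash to one shuffle partition across many. A minimal plain-Python illustration of the effect (a CRC32-based partitioner stands in for Spark's hash partitioner; the salt format is illustrative):

```python
import zlib
from collections import Counter

def partition_of(key: str, num_partitions: int) -> int:
    # deterministic stand-in for a shuffle hash partitioner
    return zlib.crc32(key.encode()) % num_partitions

# 1000 rows whose join key is NULL would all land in one partition...
before = Counter(partition_of("NULL", 8) for _ in range(1000))

# ...but salting the key with a row-dependent suffix spreads them out
# (the matching side of the join must then be duplicated once per salt value).
salted = [f"NULL#{i % 64}" for i in range(1000)]
after = Counter(partition_of(k, 8) for k in salted)

print(len(before))  # 1: every skewed row in the same partition
print(len(after))   # several partitions now share the skewed rows
```

The cost, as the reporter notes, is that salting must be applied per skewed value and the other join side replicated per salt, which is why it stops scaling once non-NULL values become skewed too.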
[jira] [Resolved] (SPARK-16714) Fail to create decimal arrays with literals having different inferred precisions and scales
[ https://issues.apache.org/jira/browse/SPARK-16714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-16714. -- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 14439 [https://github.com/apache/spark/pull/14439] > Fail to create a decimal arrays with literals having different inferred > precessions and scales > -- > > Key: SPARK-16714 > URL: https://issues.apache.org/jira/browse/SPARK-16714 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Yin Huai > Fix For: 2.0.1, 2.1.0 > > > In Spark 2.0, we will parse float literals as decimals. However, it > introduces a side-effect, which is described below. > > {code} > select array(0.001, 0.02) > {code} > causes > {code} > org.apache.spark.sql.AnalysisException: cannot resolve 'array(CAST(0.001 AS > DECIMAL(3,3)), CAST(0.02 AS DECIMAL(2,2)))' due to data type mismatch: input > to function array should all be the same type, but it's [decimal(3,3), > decimal(2,2)]; line 1 pos 7 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16714) Fail to create decimal arrays with literals having different inferred precisions and scales
[ https://issues.apache.org/jira/browse/SPARK-16714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16714: - Assignee: Wenchen Fan > Fail to create a decimal arrays with literals having different inferred > precessions and scales > -- > > Key: SPARK-16714 > URL: https://issues.apache.org/jira/browse/SPARK-16714 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Yin Huai >Assignee: Wenchen Fan > Fix For: 2.0.1, 2.1.0 > > > In Spark 2.0, we will parse float literals as decimals. However, it > introduces a side-effect, which is described below. > > {code} > select array(0.001, 0.02) > {code} > causes > {code} > org.apache.spark.sql.AnalysisException: cannot resolve 'array(CAST(0.001 AS > DECIMAL(3,3)), CAST(0.02 AS DECIMAL(2,2)))' due to data type mismatch: input > to function array should all be the same type, but it's [decimal(3,3), > decimal(2,2)]; line 1 pos 7 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16827) Query with Join produces excessive amount of shuffle data
[ https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406328#comment-15406328 ] Thomas Sebastian commented on SPARK-16827: -- I understand that what you have mentioned is w.r.t. 1.6 vs 2.0. Having said that, when I faced a performance issue because of skewed partitions, I had to do a repartition before the join, which improved performance considerably. One thing I am trying to understand is how the real execution stages in the web UI map to the explain plans. Please shed some light on this if possible; it may be a little out of context here. > Query with Join produces excessive amount of shuffle data > - > > Key: SPARK-16827 > URL: https://issues.apache.org/jira/browse/SPARK-16827 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 2.0.0 >Reporter: Sital Kedia > Labels: performance > > One of our hive job which looks like this - > {code} > SELECT userid > FROM table1 a > JOIN table2 b > ON a.ds = '2016-07-15' > AND b.ds = '2016-07-15' > AND a.source_id = b.id > {code} > After upgrade to Spark 2.0 the job is significantly slow. Digging a little > into it, we found out that one of the stages produces excessive amount of > shuffle data. Please note that this is a regression from Spark 1.6. Stage 2 > of the job which used to produce 32KB shuffle data with 1.6, now produces > more than 400GB with Spark 2.0. We also tried turning off whole stage code > generation but that did not help. > PS - Even if the intermediate shuffle data size is huge, the job still > produces accurate output. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16610) When writing ORC files, orc.compress should not be overridden if users do not set "compression" in the options
[ https://issues.apache.org/jira/browse/SPARK-16610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16610: - Target Version/s: 2.0.1, 2.1.0 > When writing ORC files, orc.compress should not be overridden if users do not > set "compression" in the options > -- > > Key: SPARK-16610 > URL: https://issues.apache.org/jira/browse/SPARK-16610 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yin Huai > > For ORC source, Spark SQL has a writer option {{compression}}, which is used > to set the codec and its value will be also set to orc.compress (the orc conf > used for codec). However, if a user only set {{orc.compress}} in the writer > option, we should not use the default value of "compression" (snappy) as the > codec. Instead, we should respect the value of {{orc.compress}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16610) When writing ORC files, orc.compress should not be overridden if users do not set "compression" in the options
[ https://issues.apache.org/jira/browse/SPARK-16610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406329#comment-15406329 ] Yin Huai commented on SPARK-16610: -- It is a ORC setting. It should be accepted by the dataframe option (when you use df.write.option). I do not think it makes sense to drop this conf key provided by ORC and just use Spark's key. > When writing ORC files, orc.compress should not be overridden if users do not > set "compression" in the options > -- > > Key: SPARK-16610 > URL: https://issues.apache.org/jira/browse/SPARK-16610 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yin Huai > > For ORC source, Spark SQL has a writer option {{compression}}, which is used > to set the codec and its value will be also set to orc.compress (the orc conf > used for codec). However, if a user only set {{orc.compress}} in the writer > option, we should not use the default value of "compression" (snappy) as the > codec. Instead, we should respect the value of {{orc.compress}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
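The precedence Yin describes is easy to state as a small resolution function: an explicit `compression` writer option wins, otherwise an explicit `orc.compress` is respected, and only then does the default apply. A Python sketch of that logic (the option names come from the discussion above; the resolution order is the behavior this issue proposes, not Spark's code at the time):

```python
def resolve_orc_codec(options: dict) -> str:
    """Pick the ORC codec: 'compression' (Spark's writer option) wins,
    then 'orc.compress' (the native ORC conf), then the default."""
    if "compression" in options:
        return options["compression"]
    if "orc.compress" in options:
        return options["orc.compress"]
    return "snappy"  # Spark's default codec, per the issue description

print(resolve_orc_codec({"orc.compress": "ZLIB"}))                         # ZLIB: respected, not overridden
print(resolve_orc_codec({"compression": "none", "orc.compress": "ZLIB"}))  # none: explicit option wins
print(resolve_orc_codec({}))                                               # snappy
```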
[jira] [Updated] (SPARK-16881) Migrate Mesos configs to use ConfigEntry
[ https://issues.apache.org/jira/browse/SPARK-16881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated SPARK-16881: Description: https://github.com/apache/spark/pull/14414#discussion_r73032190 We'd like to migrate Mesos' use of config vars to the new ConfigEntry class so we can a) define all our configs in one place like YARN does, and b) take use of features like default handling and generics was:https://github.com/apache/spark/pull/14414#discussion_r73032190 > Migrate Mesos configs to use ConfigEntry > > > Key: SPARK-16881 > URL: https://issues.apache.org/jira/browse/SPARK-16881 > Project: Spark > Issue Type: Task > Components: Mesos >Affects Versions: 2.0.0 >Reporter: Michael Gummelt >Priority: Minor > > https://github.com/apache/spark/pull/14414#discussion_r73032190 > We'd like to migrate Mesos' use of config vars to the new ConfigEntry class > so we can a) define all our configs in one place like YARN does, and b) take > use of features like default handling and generics -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
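A ConfigEntry-style registry centralizes key names, defaults, and typed access, which is the "one place like YARN does" benefit the description mentions. The real ConfigEntry is a Scala builder API; the following is only a hypothetical Python sketch of the idea, and the two config keys are used for illustration:

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")

@dataclass(frozen=True)
class ConfigEntry(Generic[T]):
    """One declared config: its key, its default, and how to convert the raw string."""
    key: str
    default: T
    convert: Callable[[str], T]

    def read(self, conf: dict) -> T:
        raw = conf.get(self.key)
        return self.default if raw is None else self.convert(raw)

# All configs declared in one place, with defaults and types:
EXECUTOR_CORES = ConfigEntry("spark.mesos.executor.cores", 1.0, float)
COARSE_MODE = ConfigEntry("spark.mesos.coarse", True, lambda s: s.lower() == "true")

conf = {"spark.mesos.executor.cores": "2.5"}
print(EXECUTOR_CORES.read(conf))  # 2.5: raw string converted to float
print(COARSE_MODE.read(conf))     # True: default applies when the key is absent
```

Callers then read through the typed entry instead of scattering `conf.get("spark.mesos.…")` string lookups and ad-hoc parsing across the scheduler code.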
[jira] [Resolved] (SPARK-16857) CrossValidator and KMeans throws IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-16857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16857. --- Resolution: Not A Problem The right way to do this is to evaluate based on the intra-cluster distance (or silhouette coefficient or whatever), not as a multi-class evaluation problem. I don't actually know that this is implemented, but that's the way forward. > CrossValidator and KMeans throws IllegalArgumentException > - > > Key: SPARK-16857 > URL: https://issues.apache.org/jira/browse/SPARK-16857 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.1 > Environment: spark-jobserver docker image. Spark 1.6.1 on ubuntu, > Hadoop 2.4 >Reporter: Ryan Claussen > > I am attempting to use CrossValidation to train KMeans model. When I attempt > to fit the data spark throws an IllegalArgumentException as below since the > KMeans algorithm outputs an Integer into the prediction column instead of a > Double. Before I go too far: is using CrossValidation with Kmeans > supported? > Here's the exception: > {quote} > java.lang.IllegalArgumentException: requirement failed: Column prediction > must be of type DoubleType but was actually IntegerType. 
> at scala.Predef$.require(Predef.scala:233) > at > org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42) > at > org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:74) > at > org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:109) > at > org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:99) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:99) > at > com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.generateKMeans(SparkModelJob.scala:202) > at > com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:62) > at > com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:39) > at > spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {quote} > Here is the code I'm using to set up my cross validator. As the stack trace > above indicates it is failing at the fit step when > {quote} > ... 
> val mpc = new KMeans().setK(2).setFeaturesCol("indexedFeatures") > val labelConverter = new > IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels) > val pipeline = new Pipeline().setStages(Array(labelIndexer, > featureIndexer, mpc, labelConverter)) > val evaluator = new > MulticlassClassificationEvaluator().setLabelCol("approvedIndex").setPredictionCol("prediction") > val paramGrid = new ParamGridBuilder().addGrid(mpc.maxIter, Array(100, > 200, 500)).build() > val cv = new > CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3) > val cvModel = cv.fit(trainingData) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
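Sean's resolution points at evaluating clustering by intra-cluster distance rather than multi-class metrics. The standard quantity is the within-cluster sum of squared errors (WSSSE), which MLlib's KMeansModel.computeCost reports; a self-contained pure-Python sketch (not Spark's implementation) shows what such an evaluator would compute:

```python
def wssse(points, centers, assignment):
    """Within-cluster sum of squared errors: lower means tighter clusters."""
    total = 0.0
    for p, c in zip(points, assignment):
        center = centers[c]
        total += sum((pi - ci) ** 2 for pi, ci in zip(p, center))
    return total

# Two tight clusters: each point sits 0.5 from its own center.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = [(0.0, 0.5), (10.0, 10.5)]
assignment = [0, 0, 1, 1]
print(wssse(points, centers, assignment))  # 1.0 (four points, 0.25 each)
```

Note that WSSSE decreases monotonically as k grows, so for tuning k itself one typically looks for an "elbow" (or uses a silhouette-style score) rather than minimizing it directly.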
[jira] [Commented] (SPARK-16881) Migrate Mesos configs to use ConfigEntry
[ https://issues.apache.org/jira/browse/SPARK-16881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406312#comment-15406312 ] Sean Owen commented on SPARK-16881: --- (Worth summarizing the change or just at least copying the two relevant comments here.) > Migrate Mesos configs to use ConfigEntry > > > Key: SPARK-16881 > URL: https://issues.apache.org/jira/browse/SPARK-16881 > Project: Spark > Issue Type: Task > Components: Mesos >Affects Versions: 2.0.0 >Reporter: Michael Gummelt >Priority: Minor > > https://github.com/apache/spark/pull/14414#discussion_r73032190 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16857) CrossValidator and KMeans throws IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-16857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406311#comment-15406311 ] Ryan Claussen commented on SPARK-16857: --- Basically I want to use the {code}Validator{code} as a way to tune the k-means hyperparameters. I understand that I'm clustering and not classifying but I thought I might be able to emulate the same pipeline in k-means by using a subclass of {code}Evaluator{code}. Does this mean Spark might be missing an evaluator for clustering or is hyperparameter tuning just something that isn't done with k-means? > CrossValidator and KMeans throws IllegalArgumentException > - > > Key: SPARK-16857 > URL: https://issues.apache.org/jira/browse/SPARK-16857 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.1 > Environment: spark-jobserver docker image. Spark 1.6.1 on ubuntu, > Hadoop 2.4 >Reporter: Ryan Claussen > > I am attempting to use CrossValidation to train KMeans model. When I attempt > to fit the data spark throws an IllegalArgumentException as below since the > KMeans algorithm outputs an Integer into the prediction column instead of a > Double. Before I go too far: is using CrossValidation with Kmeans > supported? > Here's the exception: > {quote} > java.lang.IllegalArgumentException: requirement failed: Column prediction > must be of type DoubleType but was actually IntegerType. 
> at scala.Predef$.require(Predef.scala:233) > at > org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42) > at > org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:74) > at > org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:109) > at > org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:99) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:99) > at > com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.generateKMeans(SparkModelJob.scala:202) > at > com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:62) > at > com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:39) > at > spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {quote} > Here is the code I'm using to set up my cross validator. As the stack trace > above indicates it is failing at the fit step when > {quote} > ... 
> val mpc = new KMeans().setK(2).setFeaturesCol("indexedFeatures") > val labelConverter = new > IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels) > val pipeline = new Pipeline().setStages(Array(labelIndexer, > featureIndexer, mpc, labelConverter)) > val evaluator = new > MulticlassClassificationEvaluator().setLabelCol("approvedIndex").setPredictionCol("prediction") > val paramGrid = new ParamGridBuilder().addGrid(mpc.maxIter, Array(100, > 200, 500)).build() > val cv = new > CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3) > val cvModel = cv.fit(trainingData) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-16841) Improves the row level metrics performance when reading Parquet table
[ https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Zhong closed SPARK-16841. -- Resolution: Not A Problem > Improves the row level metrics performance when reading Parquet table > - > > Key: SPARK-16841 > URL: https://issues.apache.org/jira/browse/SPARK-16841 > Project: Spark > Issue Type: Improvement >Reporter: Sean Zhong > > When reading Parquet table, Spark adds row level metrics like recordsRead, > bytesRead > (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93). > The implementation is not very efficient. When parquet vectorized reader is > not used, it may take 20% of read time to update these metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16841) Improves the row level metrics performance when reading Parquet table
[ https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406306#comment-15406306 ] Sean Zhong edited comment on SPARK-16841 at 8/3/16 5:54 PM: This jira was created after analyzing the performance impact of https://github.com/apache/spark/pull/12352, which added the row level metrics and caused a 15% performance regression. And I can verify the performance regression consistently by comparing code before and after https://github.com/apache/spark/pull/12352. But the problem is that I cannot reproduce the same performance regression consistently on Spark trunk code; the performance improvement after the fix on trunk varies a lot (sometimes 5%, sometimes 20%, sometimes not obvious). The phenomenon I observed is that when running the same benchmark code repeatedly in the same spark shell for 100 times, the time each run takes doesn't converge, and I cannot get an exact performance number. For example, if we run the below code 100 times, {code} spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect() {code} I observed: 1. For the first run, it may take > 9000 ms 2. Then for the next few runs, it is much faster, around 4700ms 3. After that, the performance suddenly becomes worse. It may take around 8500 ms for each run. I guess the phenomenon has something to do with Java JIT and our codegen logic (because of codegen, we create a new class type for each run in spark-shell, which may impact the code cache). Since I cannot verify this improvement consistently on trunk, I am going to close this jira. was (Author: clockfly): This jira is created after analyzing the performance impact of https://github.com/apache/spark/pull/12352, which added the row level metrics and caused 15% performance regression. And I can verify the performance regression consistently by comparing performance code before and after https://github.com/apache/spark/pull/12352. 
But the problem is that I cannot reproduce the same performance regression consistently on Spark trunk code, the performance improvement after the fix on trunk varies a lot (sometimes 5%, sometimes 20%, sometimes not obvious). The phenomenon I observed is that when running the same benchmark code repeatedly in same spark shell for 100 times, the time it takes for each run doesn't converge, and I cannot get an exact performance number. For example, if we run the below code for 100 times, {code} spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect())) {code} I observed: 1. For the first run, it may take > 9000 ms 2. Then for the next few runs, it is much faster, around 4700ms 3. After that, the performance suddenly becomes worse. It may take around 8500 ms for each run. I guess the phenomenon has something to do with Java JIT and our codegen logic (Because of codegen, we are creating new class type for each run in spark-shell, which may impact code cache). Since I cannot verify this improvement consistently on trunk, I am going to close this jira. > Improves the row level metrics performance when reading Parquet table > - > > Key: SPARK-16841 > URL: https://issues.apache.org/jira/browse/SPARK-16841 > Project: Spark > Issue Type: Improvement >Reporter: Sean Zhong > > When reading Parquet table, Spark adds row level metrics like recordsRead, > bytesRead > (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93). > The implementation is not very efficient. When parquet vectorized reader is > not used, it may take 20% of read time to update these metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
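The run-to-run drift Sean describes (slow first run, fast middle runs, then degradation, plausibly JIT and code-cache effects) is exactly why timing harnesses run a warm-up phase and report the minimum of several repeats rather than a single measurement. Python's timeit module follows the same convention; a minimal sketch with a stand-in workload:

```python
import timeit

def workload():
    # stand-in for the Parquet query being benchmarked
    return sum(i * i for i in range(10_000))

# repeat() returns one total time per run; the minimum is the least-noisy
# estimate, since everything above it is interference (GC, JIT, cache misses).
runs = timeit.repeat(workload, number=20, repeat=5)
best = min(runs)
print(len(runs))                       # 5 timed runs
print(best <= sum(runs) / len(runs))   # the min never exceeds the mean: True
```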
[jira] [Commented] (SPARK-16841) Improves the row level metrics performance when reading Parquet table
[ https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406306#comment-15406306 ] Sean Zhong commented on SPARK-16841: This PR was created after analyzing the performance impact of https://github.com/apache/spark/pull/12352, which added the row-level metrics and caused a 15% performance regression. I can reproduce that regression consistently by comparing the code before and after https://github.com/apache/spark/pull/12352. But I cannot reproduce it consistently on Spark trunk: the performance improvement after the fix varies a lot (sometimes 5%, sometimes 20%, sometimes not noticeable). When I run the same benchmark repeatedly in the same spark-shell 100 times, the per-run time does not converge, so I cannot get an exact performance number. For example, running the code below 100 times, {code} spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect() {code} I observed: 1. The first run may take more than 9000 ms. 2. The next few runs are much faster, around 4700 ms. 3. After that, performance suddenly degrades again, to around 8500 ms per run. I suspect this is related to the Java JIT and our codegen logic (because of codegen we generate a new class for each run in spark-shell, which may pollute the code cache). Since I cannot verify the improvement consistently on trunk, I am going to close this JIRA.
> Improves the row level metrics performance when reading Parquet table > - > > Key: SPARK-16841 > URL: https://issues.apache.org/jira/browse/SPARK-16841 > Project: Spark > Issue Type: Improvement >Reporter: Sean Zhong > > When reading a Parquet table, Spark updates row-level metrics such as recordsRead and > bytesRead > (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93). > The implementation is not very efficient: when the Parquet vectorized reader is > not used, updating these metrics may take 20% of the read time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16881) Migrate Mesos configs to use ConfigEntry
Michael Gummelt created SPARK-16881: --- Summary: Migrate Mesos configs to use ConfigEntry Key: SPARK-16881 URL: https://issues.apache.org/jira/browse/SPARK-16881 Project: Spark Issue Type: Task Components: Mesos Affects Versions: 2.0.0 Reporter: Michael Gummelt Priority: Minor https://github.com/apache/spark/pull/14414#discussion_r73032190 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406245#comment-15406245 ] Sean Owen commented on SPARK-6305: -- Mostly to silence noisy libraries manually (i.e. still taking effect even if the user provides custom config) and to set up friendlier logging in standalone examples. You can probably search for usages in an IDE to get a flavor of it. Hey, maybe you'll see usages that are pointless or for which there's a better approach. > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406207#comment-15406207 ] Mikael Ståldal commented on SPARK-6305: --- Yes, if you need to set the logging level, then you need a concrete logging implementation, not (only) slf4j-api. Why does Spark need to set the logging level? > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406191#comment-15406191 ] Sean Owen commented on SPARK-6305: -- How would you set logging levels in slf4j though? I thought it didn't expose that. I don't think log4j 2 lacks functionality, it's just that it was exposed in different ways, and took some work to translate, not entirely directly. That wasn't a big deal. > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
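For readers outside the JVM logging ecosystem, the capability under discussion is a programmatic level-setting call: slf4j-api is a pure facade and does not expose one, while concrete backends such as log4j 1.x do. Python's stdlib logging, which bundles facade and implementation in one package, illustrates the kind of call in question (an analogy only, not Spark code; the logger name is made up):

```python
import logging

# Silence a chatty dependency programmatically, the way Spark quiets
# noisy libraries: raise that logger's threshold to WARNING.
noisy = logging.getLogger("some.noisy.library")
noisy.setLevel(logging.WARNING)

# INFO/DEBUG records from this logger are now filtered out,
# regardless of what configuration file the application ships.
info_enabled = noisy.isEnabledFor(logging.INFO)      # False
warn_enabled = noisy.isEnabledFor(logging.WARNING)   # True
```

Because slf4j's `Logger` interface has no equivalent of `setLevel`, code that needs this must reach past the facade to the backend, which is why Spark's logging utilities are coupled to a concrete implementation.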
[jira] [Assigned] (SPARK-16880) Improve ANN training, add training data persist if needed
[ https://issues.apache.org/jira/browse/SPARK-16880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16880: Assignee: Apache Spark > Improve ANN training, add training data persist if needed > - > > Key: SPARK-16880 > URL: https://issues.apache.org/jira/browse/SPARK-16880 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Weichen Xu >Assignee: Apache Spark > Original Estimate: 24h > Remaining Estimate: 24h > > The ANN layer training does not persist the input data RDD, > which may cause overhead if the RDD needs to be recomputed from its lineage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16880) Improve ANN training, add training data persist if needed
[ https://issues.apache.org/jira/browse/SPARK-16880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16880: Assignee: (was: Apache Spark) > Improve ANN training, add training data persist if needed > - > > Key: SPARK-16880 > URL: https://issues.apache.org/jira/browse/SPARK-16880 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > The ANN layer training does not persist the input data RDD, > which may cause overhead if the RDD needs to be recomputed from its lineage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16880) Improve ANN training, add training data persist if needed
[ https://issues.apache.org/jira/browse/SPARK-16880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406156#comment-15406156 ] Apache Spark commented on SPARK-16880: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/14483 > Improve ANN training, add training data persist if needed > - > > Key: SPARK-16880 > URL: https://issues.apache.org/jira/browse/SPARK-16880 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > The ANN layer training does not persist the input data RDD, > which may cause overhead if the RDD needs to be recomputed from its lineage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16880) Improve ANN training, add training data persist if needed
[ https://issues.apache.org/jira/browse/SPARK-16880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-16880: --- Component/s: MLlib ML > Improve ANN training, add training data persist if needed > - > > Key: SPARK-16880 > URL: https://issues.apache.org/jira/browse/SPARK-16880 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > The ANN layer training does not persist the input data RDD, > which may cause overhead if the RDD needs to be recomputed from its lineage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16880) Improve ANN training, add training data persist if needed
Weichen Xu created SPARK-16880: -- Summary: Improve ANN training, add training data persist if needed Key: SPARK-16880 URL: https://issues.apache.org/jira/browse/SPARK-16880 Project: Spark Issue Type: Bug Reporter: Weichen Xu The ANN layer training does not persist the input data RDD, which may cause overhead if the RDD needs to be recomputed from its lineage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
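The cost this ticket describes comes from Spark recomputing an unpersisted RDD's entire lineage on every pass of an iterative optimizer; the remedy is essentially a `persist()` on the training data before the loop and an `unpersist()` afterwards. A toy stand-in in plain Python (not the Spark API) makes the recompute-vs-cache effect concrete:

```python
import functools

calls = {"n": 0}

def expensive_lineage(x):
    # Stands in for recomputing an RDD from its lineage
    # (re-reading the input and re-running every transformation).
    calls["n"] += 1
    return sum(i * x for i in range(1000))

# "Persisting": materialize the result once and reuse it on later accesses.
cached = functools.lru_cache(maxsize=None)(expensive_lineage)

# Ten "training iterations" touch the same input data.
results = [cached(3) for _ in range(10)]
```

Without the cache, `expensive_lineage` would run ten times; with it, `calls["n"]` stays at 1, which is the same saving that persisting the training RDD gives the ANN trainer when the optimizer makes many passes over the data.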
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406148#comment-15406148 ] Mikael Ståldal commented on SPARK-6305: --- Did you know that Log4j 2.x has its own bridge from Log4j 1.x (not using slf4j)? See: http://logging.apache.org/log4j/2.x/manual/migration.html > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406142#comment-15406142 ] Mikael Ståldal commented on SPARK-6305: --- Which Log4j 1.x methods are you missing counterparts for in Log4j 2.x? > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406140#comment-15406140 ] Mikael Ståldal commented on SPARK-6305: --- Would it be possible for Spark to depend on slf4j-api only (and leave it up to any application using it to choose the concrete logging implementation)? Or does Spark itself need a concrete logging implementation? Why? > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406125#comment-15406125 ] Michael Allman commented on SPARK-16320: Hi [~maver1ck]. These are excellent findings! I'm curious to see a few more benchmarks. Could you add the two missing timings for Spark 2.0 G1GC, and another column for Spark 2.0 G1GC with the PR? > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: spark1.6-ui.png, spark2-ui.png > > > I did some tests on a Parquet file with many nested columns (about 30G in > 400 partitions), and Spark 2.0 is sometimes slower. > I tested the following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar (about 1 sec). > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance? > I don't know how to prepare sample data to show the problem. > Any ideas? Or public data with many nested columns? > *UPDATE* > I created a script to generate data and to confirm this problem.
> {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. > https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16879) unify logical plans for CREATE TABLE and CTAS
[ https://issues.apache.org/jira/browse/SPARK-16879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16879: Assignee: Apache Spark (was: Wenchen Fan) > unify logical plans for CREATE TABLE and CTAS > - > > Key: SPARK-16879 > URL: https://issues.apache.org/jira/browse/SPARK-16879 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org