[jira] [Resolved] (SPARK-16888) Implements eval method for expression AssertNotNull

2016-08-03 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-16888.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14486
[https://github.com/apache/spark/pull/14486]

> Implements eval method for expression AssertNotNull
> ---
>
> Key: SPARK-16888
> URL: https://issues.apache.org/jira/browse/SPARK-16888
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sean Zhong
>Assignee: Sean Zhong
>Priority: Minor
> Fix For: 2.1.0
>
>
> We should support the eval() method for the expression AssertNotNull.
> Currently, it throws UnsupportedOperationException when used in a projection 
> of a LocalRelation.
> {code}
> scala> import org.apache.spark.sql.catalyst.dsl.expressions._
> scala> import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
> scala> import org.apache.spark.sql.Column
> scala> case class A(a: Int)
> scala> Seq((A(1),2)).toDS().select(new Column(AssertNotNull("_1".attr, 
> Nil))).explain
> java.lang.UnsupportedOperationException: Only code-generated evaluation is 
> supported.
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull.eval(objects.scala:850)
>   ...
> {code}






[jira] [Updated] (SPARK-16888) Implements eval method for expression AssertNotNull

2016-08-03 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-16888:

Assignee: Sean Zhong

> Implements eval method for expression AssertNotNull
> ---
>
> Key: SPARK-16888
> URL: https://issues.apache.org/jira/browse/SPARK-16888
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sean Zhong
>Assignee: Sean Zhong
>Priority: Minor
> Fix For: 2.1.0
>
>
> We should support the eval() method for the expression AssertNotNull.
> Currently, it throws UnsupportedOperationException when used in a projection 
> of a LocalRelation.
> {code}
> scala> import org.apache.spark.sql.catalyst.dsl.expressions._
> scala> import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
> scala> import org.apache.spark.sql.Column
> scala> case class A(a: Int)
> scala> Seq((A(1),2)).toDS().select(new Column(AssertNotNull("_1".attr, 
> Nil))).explain
> java.lang.UnsupportedOperationException: Only code-generated evaluation is 
> supported.
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull.eval(objects.scala:850)
>   ...
> {code}






[jira] [Commented] (SPARK-16876) Add match Column expression for regular expression matching in Scala API

2016-08-03 Thread Kapil Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407200#comment-15407200
 ] 

Kapil Singh commented on SPARK-16876:
-

Yes, these are similar, but SPARK-16203 is for PySpark while this ticket is for 
the Scala API.

> Add match Column expression for regular expression matching in Scala API 
> -
>
> Key: SPARK-16876
> URL: https://issues.apache.org/jira/browse/SPARK-16876
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kapil Singh
>Priority: Minor
>
> The RegExpExtract expression gets only the i-th regular expression match. 
> There should be a match expression that returns all the matches as an array 
> of strings.
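
In the meantime, something along these lines can collect all matches today; a 
rough sketch assuming Spark 2.x and a SparkSession named {{spark}} 
({{regexpMatchAll}} is a hypothetical UDF name, not an existing Spark function):

{code}
import org.apache.spark.sql.functions.{lit, udf}

// Hypothetical UDF that returns every match of the pattern as an array of strings.
val regexpMatchAll = udf { (s: String, pattern: String) =>
  if (s == null) null else pattern.r.findAllIn(s).toSeq
}

import spark.implicits._
Seq("a1b22c333").toDF("value")
  .select(regexpMatchAll($"value", lit("[0-9]+")).as("matches"))
  .show(false)   // every digit group: 1, 22, 333
{code}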






[jira] [Commented] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2

2016-08-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407104#comment-15407104
 ] 

Sean Owen commented on SPARK-16798:
---

That's odd, to be sure. It's as if the distributePartition function deserializes 
on the executor with numPartitions = 0 for some reason. Is there any chance 
you're mixing versions of Spark here? You also say you have made modifications 
to Spark itself, though elsewhere, and I wonder if somehow there's a subtle 
connection. This one is hard to act on without something like a reproduction 
against upstream Spark.
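
For what it's worth, the exception itself is just what java.util.Random produces 
for a non-positive bound, so anything that makes the captured numPartitions come 
out as 0 in the task would explain the trace; a minimal sketch (values 
hypothetical):

{code}
import java.util.Random

// java.util.Random.nextInt requires a strictly positive bound, so a closure that
// sees numPartitions == 0 on the executor fails exactly like the stack trace below.
val numPartitions = 0                    // hypothetical value observed in the task
new Random(42).nextInt(numPartitions)    // java.lang.IllegalArgumentException: bound must be positive
{code}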

> java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
> 
>
> Key: SPARK-16798
> URL: https://issues.apache.org/jira/browse/SPARK-16798
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> Code at https://github.com/metamx/druid-spark-batch which was working under 
> 1.5.2 has ceased to function under 2.0.0 with the below stacktrace.
> {code}
> java.lang.IllegalArgumentException: bound must be positive
>   at java.util.Random.nextInt(Random.java:388)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Commented] (SPARK-13710) Spark shell shows ERROR when launching on Windows

2016-08-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407096#comment-15407096
 ] 

Sean Owen commented on SPARK-13710:
---

Possibly related to 
https://github.com/apache/spark/commit/4775eb414fa8285cfdc301e52dac52a2ef64c9e1 
/ SPARK-16770 ?

> Spark shell shows ERROR when launching on Windows
> -
>
> Key: SPARK-13710
> URL: https://issues.apache.org/jira/browse/SPARK-13710
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Windows
>Reporter: Masayoshi TSUZUKI
>Assignee: Michel Lemay
>Priority: Minor
> Fix For: 2.0.0
>
>
> On Windows, when we launch {{bin\spark-shell.cmd}}, it shows an ERROR message 
> and a stack trace.
> {noformat}
> C:\Users\tsudukim\Documents\workspace\spark-dev3>bin\spark-shell
> [ERROR] Terminal initialization failed; falling back to unsupported
> java.lang.NoClassDefFoundError: Could not initialize class 
> scala.tools.fusesource_embedded.jansi.internal.Kernel32
> at 
> scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50)
> at 
> scala.tools.jline_embedded.WindowsTerminal.getConsoleMode(WindowsTerminal.java:204)
> at 
> scala.tools.jline_embedded.WindowsTerminal.init(WindowsTerminal.java:82)
> at 
> scala.tools.jline_embedded.TerminalFactory.create(TerminalFactory.java:101)
> at 
> scala.tools.jline_embedded.TerminalFactory.get(TerminalFactory.java:158)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.<init>(ConsoleReader.java:229)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.<init>(ConsoleReader.java:221)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.<init>(ConsoleReader.java:209)
> at 
> scala.tools.nsc.interpreter.jline_embedded.JLineConsoleReader.<init>(JLineReader.scala:61)
> at 
> scala.tools.nsc.interpreter.jline_embedded.InteractiveReader.<init>(JLineReader.scala:33)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:865)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:862)
> at 
> scala.tools.nsc.interpreter.ILoop.scala$tools$nsc$interpreter$ILoop$$mkReader$1(ILoop.scala:871)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875)
> at scala.util.Try$.apply(Try.scala:192)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875)
> at 
> scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
> at 
> scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233)
> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223)
> at scala.collection.immutable.Stream.collect(Stream.scala:435)
> at scala.tools.nsc.interpreter.ILoop.chooseReader(ILoop.scala:877)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$2.apply(ILoop.scala:916)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:916)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
> at 
> scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
> at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:911)
> at org.apache.spark.repl.Main$.doMain(Main.scala:64)
> at org.apache.spark.repl.Main$.main(Main.scala:47)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> 

[jira] [Commented] (SPARK-16888) Implements eval method for expression AssertNotNull

2016-08-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407077#comment-15407077
 ] 

Apache Spark commented on SPARK-16888:
--

User 'clockfly' has created a pull request for this issue:
https://github.com/apache/spark/pull/14486

> Implements eval method for expression AssertNotNull
> ---
>
> Key: SPARK-16888
> URL: https://issues.apache.org/jira/browse/SPARK-16888
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sean Zhong
>Priority: Minor
>
> We should support the eval() method for the expression AssertNotNull.
> Currently, it throws UnsupportedOperationException when used in a projection 
> of a LocalRelation.
> {code}
> scala> import org.apache.spark.sql.catalyst.dsl.expressions._
> scala> import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
> scala> import org.apache.spark.sql.Column
> scala> case class A(a: Int)
> scala> Seq((A(1),2)).toDS().select(new Column(AssertNotNull("_1".attr, 
> Nil))).explain
> java.lang.UnsupportedOperationException: Only code-generated evaluation is 
> supported.
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull.eval(objects.scala:850)
>   ...
> {code}






[jira] [Assigned] (SPARK-16888) Implements eval method for expression AssertNotNull

2016-08-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16888:


Assignee: (was: Apache Spark)

> Implements eval method for expression AssertNotNull
> ---
>
> Key: SPARK-16888
> URL: https://issues.apache.org/jira/browse/SPARK-16888
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sean Zhong
>Priority: Minor
>
> We should support the eval() method for the expression AssertNotNull.
> Currently, it throws UnsupportedOperationException when used in a projection 
> of a LocalRelation.
> {code}
> scala> import org.apache.spark.sql.catalyst.dsl.expressions._
> scala> import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
> scala> import org.apache.spark.sql.Column
> scala> case class A(a: Int)
> scala> Seq((A(1),2)).toDS().select(new Column(AssertNotNull("_1".attr, 
> Nil))).explain
> java.lang.UnsupportedOperationException: Only code-generated evaluation is 
> supported.
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull.eval(objects.scala:850)
>   ...
> {code}






[jira] [Assigned] (SPARK-16888) Implements eval method for expression AssertNotNull

2016-08-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16888:


Assignee: Apache Spark

> Implements eval method for expression AssertNotNull
> ---
>
> Key: SPARK-16888
> URL: https://issues.apache.org/jira/browse/SPARK-16888
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sean Zhong
>Assignee: Apache Spark
>Priority: Minor
>
> We should support the eval() method for the expression AssertNotNull.
> Currently, it throws UnsupportedOperationException when used in a projection 
> of a LocalRelation.
> {code}
> scala> import org.apache.spark.sql.catalyst.dsl.expressions._
> scala> import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
> scala> import org.apache.spark.sql.Column
> scala> case class A(a: Int)
> scala> Seq((A(1),2)).toDS().select(new Column(AssertNotNull("_1".attr, 
> Nil))).explain
> java.lang.UnsupportedOperationException: Only code-generated evaluation is 
> supported.
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull.eval(objects.scala:850)
>   ...
> {code}






[jira] [Commented] (SPARK-16885) Spark shell failed to run in yarn-client mode

2016-08-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407073#comment-15407073
 ] 

Sean Owen commented on SPARK-16885:
---

I'm pretty certain it's because you specified a directory, not a file. Try a 
file to confirm.
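
If that turns out to be the cause, a sketch of the fix is to zip up the jars 
directory and point spark.yarn.archive at the resulting file (the archive name 
below is hypothetical):

{code}
# Build an archive from the jars directory, then pass the *file* (not the dir)
# to spark.yarn.archive:
cd /home/yzhishko/work/spark/jars && zip -q -r -0 ../spark-jars.zip .

./bin/spark-shell --master yarn --deploy-mode client \
  --conf spark.yarn.archive=/home/yzhishko/work/spark/spark-jars.zip
{code}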

> Spark shell failed to run in yarn-client mode
> -
>
> Key: SPARK-16885
> URL: https://issues.apache.org/jira/browse/SPARK-16885
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
> Environment: Ubuntu 12.04
> Hadoop 2.7.2 + Yarn
>Reporter: Yury Zhyshko
> Attachments: spark-env.sh
>
>
> I've installed Hadoop + Yarn in pseudo-distributed mode following these 
> instructions: 
> https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html#YARN_on_a_Single_Node
> After that I downloaded and installed a prebuilt Spark for Hadoop 2.7.
> The command that I used to run a shell: 
> ./bin/spark-shell --master yarn --deploy-mode client --conf 
> spark.yarn.archive=/home/yzhishko/work/spark/jars
> Here is the error:
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/08/03 17:13:50 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 16/08/03 17:13:52 ERROR spark.SparkContext: Error initializing SparkContext.
> java.lang.IllegalArgumentException: Can not create a Path from an empty string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:134)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:93)
>   at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:338)
>   at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:433)
>   at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:472)
>   at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:834)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:167)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
>   at org.apache.spark.repl.Main$.createSparkSession(Main.scala:101)
>   at $line3.$read$$iw$$iw.<init>(<console>:15)
>   at $line3.$read$$iw.<init>(<console>:31)
>   at $line3.$read.<init>(<console>:33)
>   at $line3.$read$.<init>(<console>:37)
>   at $line3.$read$.<clinit>(<console>)
>   at $line3.$eval$.$print$lzycompute(<console>:7)
>   at $line3.$eval$.$print(<console>:6)
>   at $line3.$eval.$print(<console>)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
>   at 
> scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
>   at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
>   at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
>   at 
> scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
>   at 
> scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
>   at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
>   at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
>   at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
>   at 
> scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
>   at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
>   at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
>   at 
> 

[jira] [Created] (SPARK-16888) Implements eval method for expression AssertNotNull

2016-08-03 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-16888:
--

 Summary: Implements eval method for expression AssertNotNull
 Key: SPARK-16888
 URL: https://issues.apache.org/jira/browse/SPARK-16888
 Project: Spark
  Issue Type: Improvement
Reporter: Sean Zhong
Priority: Minor


We should support the eval() method for the expression AssertNotNull.

Currently, it throws UnsupportedOperationException when used in a projection of 
a LocalRelation.

{code}
scala> import org.apache.spark.sql.catalyst.dsl.expressions._
scala> import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
scala> import org.apache.spark.sql.Column
scala> case class A(a: Int)
scala> Seq((A(1),2)).toDS().select(new Column(AssertNotNull("_1".attr, 
Nil))).explain

java.lang.UnsupportedOperationException: Only code-generated evaluation is 
supported.
  at 
org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull.eval(objects.scala:850)
  ...

{code}
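
A minimal sketch of what an interpreted eval could look like (a sketch only, not 
the actual patch): evaluate the child and fail the same way the generated code 
does when the value is null.

{code}
// Sketch only -- this method would live on AssertNotNull in
// sql/catalyst/.../expressions/objects/objects.scala.
override def eval(input: InternalRow): Any = {
  val result = child.eval(input)
  if (result == null) {
    throw new NullPointerException(
      "Null value appeared in non-nullable field: " + walkedTypePath.mkString("\n"))
  }
  result
}
{code}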






[jira] [Commented] (SPARK-16887) Add SPARK_DIST_CLASSPATH to LAUNCH_CLASSPATH

2016-08-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407053#comment-15407053
 ] 

Apache Spark commented on SPARK-16887:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/14492

> Add SPARK_DIST_CLASSPATH to LAUNCH_CLASSPATH
> 
>
> Key: SPARK-16887
> URL: https://issues.apache.org/jira/browse/SPARK-16887
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> When deploying Spark, it can be pretty convenient to put all the jars we want 
> on Spark's classpath (Spark jars, Hadoop jars, and other libraries' jars) in 
> the same dir, which may not be Spark's assembly dir. So I am proposing to also 
> add SPARK_DIST_CLASSPATH to the LAUNCH_CLASSPATH.






[jira] [Commented] (SPARK-16610) When writing ORC files, orc.compress should not be overridden if users do not set "compression" in the options

2016-08-03 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407043#comment-15407043
 ] 

Hyukjin Kwon commented on SPARK-16610:
--

Actually, my initial proposal above included that, but I followed the 
suggestion in the end.

I hope I didn't misunderstand you or [~rxin].

Do you mind if I ask for feedback, please?

 





> When writing ORC files, orc.compress should not be overridden if users do not 
> set "compression" in the options
> --
>
> Key: SPARK-16610
> URL: https://issues.apache.org/jira/browse/SPARK-16610
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>
> For the ORC source, Spark SQL has a writer option {{compression}}, which is 
> used to set the codec, and its value is also set to orc.compress (the ORC conf 
> used for the codec). However, if a user only sets {{orc.compress}} in the 
> writer options, we should not use the default value of "compression" (snappy) 
> as the codec. Instead, we should respect the value of {{orc.compress}}.
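
In other words, the intended precedence is roughly the following; a sketch with 
a hypothetical options map, not the actual OrcOptions code:

{code}
// Hypothetical writer options as the user passed them: only orc.compress is set.
val parameters: Map[String, String] = Map("orc.compress" -> "ZLIB")

val codecName: String = parameters
  .get("compression")                      // the explicit Spark option always wins
  .orElse(parameters.get("orc.compress"))  // otherwise respect the ORC conf the user set
  .getOrElse("SNAPPY")                     // only default to snappy when neither is given

assert(codecName == "ZLIB")
{code}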






[jira] [Assigned] (SPARK-16887) Add SPARK_DIST_CLASSPATH to LAUNCH_CLASSPATH

2016-08-03 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reassigned SPARK-16887:


Assignee: Yin Huai

> Add SPARK_DIST_CLASSPATH to LAUNCH_CLASSPATH
> 
>
> Key: SPARK-16887
> URL: https://issues.apache.org/jira/browse/SPARK-16887
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> When deploying Spark, it can be pretty convenient to put all the jars we want 
> on Spark's classpath (Spark jars, Hadoop jars, and other libraries' jars) in 
> the same dir, which may not be Spark's assembly dir. So I am proposing to also 
> add SPARK_DIST_CLASSPATH to the LAUNCH_CLASSPATH.






[jira] [Created] (SPARK-16887) Add SPARK_DIST_CLASSPATH to LAUNCH_CLASSPATH

2016-08-03 Thread Yin Huai (JIRA)
Yin Huai created SPARK-16887:


 Summary: Add SPARK_DIST_CLASSPATH to LAUNCH_CLASSPATH
 Key: SPARK-16887
 URL: https://issues.apache.org/jira/browse/SPARK-16887
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Reporter: Yin Huai


When deploying Spark, it can be pretty convenient to put all the jars we want on 
Spark's classpath (Spark jars, Hadoop jars, and other libraries' jars) in the 
same dir, which may not be Spark's assembly dir. So I am proposing to also add 
SPARK_DIST_CLASSPATH to the LAUNCH_CLASSPATH.
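
A rough sketch of the idea in bin/spark-class (not the actual patch; the real 
script's variable handling may differ):

{code}
# Today the launcher classpath is built only from the jars dir; the proposal is
# to also append SPARK_DIST_CLASSPATH when it is set.
LAUNCH_CLASSPATH="$SPARK_JARS_DIR/*"
if [ -n "$SPARK_DIST_CLASSPATH" ]; then
  LAUNCH_CLASSPATH="$LAUNCH_CLASSPATH:$SPARK_DIST_CLASSPATH"
fi
{code}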






[jira] [Commented] (SPARK-16886) StructuredNetworkWordCount code comment incorrectly refers to DataFrame instead of Dataset

2016-08-03 Thread Ganesh Chand (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407031#comment-15407031
 ] 

Ganesh Chand commented on SPARK-16886:
--

Submitted a pull request - https://github.com/apache/spark/pull/14491

> StructuredNetworkWordCount code comment incorrectly refers to DataFrame 
> instead of Dataset
> --
>
> Key: SPARK-16886
> URL: https://issues.apache.org/jira/browse/SPARK-16886
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 2.0.0
>Reporter: Ganesh Chand
>Priority: Trivial
>  Labels: comment, examples
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Both StructuredNetworkWordCount.scala and StructuredNetworkWordCount.java have 
> the following code comment:
> {code}
> // Create DataFrame representing the stream of input lines from connection to 
> host:port
> val lines = spark.readStream
>   .format("socket")
>.option("host", host)
>.option("port", port)
>.load().as[String]
> {code}
> The above comment should be
> {code}
> // Create Dataset representing the stream of input lines from connection to 
> host:port
> val lines = spark.readStream
>   .format("socket")
>.option("host", host)
>.option("port", port)
>.load().as[String]
> {code}
> because .as[String] returns a Dataset.
> Affects:
> * 
> examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala
>  
> * 
> examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java
>  
> * 
> examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCountWindowed.scala
> * 
> examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCountWindowed.java






[jira] [Assigned] (SPARK-16886) StructuredNetworkWordCount code comment incorrectly refers to DataFrame instead of Dataset

2016-08-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16886:


Assignee: (was: Apache Spark)

> StructuredNetworkWordCount code comment incorrectly refers to DataFrame 
> instead of Dataset
> --
>
> Key: SPARK-16886
> URL: https://issues.apache.org/jira/browse/SPARK-16886
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 2.0.0
>Reporter: Ganesh Chand
>Priority: Trivial
>  Labels: comment, examples
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Both StructuredNetworkWordCount.scala and StructuredNetworkWordCount.java have 
> the following code comment:
> {code}
> // Create DataFrame representing the stream of input lines from connection to 
> host:port
> val lines = spark.readStream
>   .format("socket")
>.option("host", host)
>.option("port", port)
>.load().as[String]
> {code}
> The above comment should be
> {code}
> // Create Dataset representing the stream of input lines from connection to 
> host:port
> val lines = spark.readStream
>   .format("socket")
>.option("host", host)
>.option("port", port)
>.load().as[String]
> {code}
> because .as[String] returns a Dataset.
> Affects:
> * 
> examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala
>  
> * 
> examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java
>  
> * 
> examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCountWindowed.scala
> * 
> examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCountWindowed.java






[jira] [Commented] (SPARK-16886) StructuredNetworkWordCount code comment incorrectly refers to DataFrame instead of Dataset

2016-08-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407032#comment-15407032
 ] 

Apache Spark commented on SPARK-16886:
--

User 'ganeshchand' has created a pull request for this issue:
https://github.com/apache/spark/pull/14491

> StructuredNetworkWordCount code comment incorrectly refers to DataFrame 
> instead of Dataset
> --
>
> Key: SPARK-16886
> URL: https://issues.apache.org/jira/browse/SPARK-16886
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 2.0.0
>Reporter: Ganesh Chand
>Priority: Trivial
>  Labels: comment, examples
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Both StructuredNetworkWordCount.scala and StructuredNetworkWordCount.java have 
> the following code comment:
> {code}
> // Create DataFrame representing the stream of input lines from connection to 
> host:port
> val lines = spark.readStream
>   .format("socket")
>.option("host", host)
>.option("port", port)
>.load().as[String]
> {code}
> The above comment should be
> {code}
> // Create Dataset representing the stream of input lines from connection to 
> host:port
> val lines = spark.readStream
>   .format("socket")
>.option("host", host)
>.option("port", port)
>.load().as[String]
> {code}
> because .as[String] returns a Dataset.
> Affects:
> * 
> examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala
>  
> * 
> examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java
>  
> * 
> examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCountWindowed.scala
> * 
> examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCountWindowed.java






[jira] [Assigned] (SPARK-16886) StructuredNetworkWordCount code comment incorrectly refers to DataFrame instead of Dataset

2016-08-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16886:


Assignee: Apache Spark

> StructuredNetworkWordCount code comment incorrectly refers to DataFrame 
> instead of Dataset
> --
>
> Key: SPARK-16886
> URL: https://issues.apache.org/jira/browse/SPARK-16886
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 2.0.0
>Reporter: Ganesh Chand
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: comment, examples
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Both StructuredNetworkWordCount.scala and StructuredNetworkWordCount.java have 
> the following code comment:
> {code}
> // Create DataFrame representing the stream of input lines from connection to 
> host:port
> val lines = spark.readStream
>   .format("socket")
>.option("host", host)
>.option("port", port)
>.load().as[String]
> {code}
> The above comment should be
> {code}
> // Create Dataset representing the stream of input lines from connection to 
> host:port
> val lines = spark.readStream
>   .format("socket")
>.option("host", host)
>.option("port", port)
>.load().as[String]
> {code}
> because .as[String] returns a Dataset.
> Affects:
> * 
> examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala
>  
> * 
> examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java
>  
> * 
> examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCountWindowed.scala
> * 
> examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCountWindowed.java






[jira] [Assigned] (SPARK-16877) Add a rule for preventing use Java's Override annotation

2016-08-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16877:


Assignee: (was: Apache Spark)

> Add a rule for preventing use Java's Override annotation
> 
>
> Key: SPARK-16877
> URL: https://issues.apache.org/jira/browse/SPARK-16877
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> This was found in https://github.com/apache/spark/pull/14419
> It seems that using the {{@Override}} annotation in Scala code instead of the 
> {{override}} modifier just passes the Scala style checks.
> It might be better to add a rule to check for this.






[jira] [Assigned] (SPARK-16877) Add a rule for preventing use Java's Override annotation

2016-08-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16877:


Assignee: Apache Spark

> Add a rule for preventing use Java's Override annotation
> 
>
> Key: SPARK-16877
> URL: https://issues.apache.org/jira/browse/SPARK-16877
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> This was found in https://github.com/apache/spark/pull/14419
> It seems that using the {{@Override}} annotation in Scala code instead of the 
> {{override}} modifier just passes the Scala style checks.
> It might be better to add a rule to check for this.






[jira] [Commented] (SPARK-16877) Add a rule for preventing use Java's Override annotation

2016-08-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407025#comment-15407025
 ] 

Apache Spark commented on SPARK-16877:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/14490

> Add a rule for preventing use Java's Override annotation
> 
>
> Key: SPARK-16877
> URL: https://issues.apache.org/jira/browse/SPARK-16877
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> This was found in https://github.com/apache/spark/pull/14419
> It seems that using the {{@Override}} annotation in Scala code instead of the 
> {{override}} modifier just passes the Scala style checks.
> It might be better to add a rule to check for this.
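
For context, the pattern looks like the snippet below (illustrative only). Both 
variants compile when the method implements an abstract member, which is why 
only a style rule can catch the first form.

{code}
trait Greeter {
  def greet(): String
}

// Slips through today: scalac accepts the Java annotation and the current
// style checks do not flag it.
class JavaStyle extends Greeter {
  @Override def greet(): String = "hi"
}

// The idiomatic Scala form that the proposed rule would enforce.
class ScalaStyle extends Greeter {
  override def greet(): String = "hi"
}
{code}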






[jira] [Resolved] (SPARK-16873) force spill NPE

2016-08-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16873.
-
   Resolution: Fixed
 Assignee: sharkd tu
Fix Version/s: 2.1.0
   2.0.1
   1.6.3

> force spill NPE
> ---
>
> Key: SPARK-16873
> URL: https://issues.apache.org/jira/browse/SPARK-16873
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: sharkd tu
>Assignee: sharkd tu
> Fix For: 1.6.3, 2.0.1, 2.1.0
>
>
> 16/07/31 20:52:38 INFO executor.Executor: Running task 1090.1 in stage 18.0 
> (TID 23793)
> 16/07/31 20:52:38 INFO storage.ShuffleBlockFetcherIterator: Getting 5000 
> non-empty blocks out of 5000 blocks
> 16/07/31 20:52:38 INFO collection.ExternalSorter: Thread 151 spilling 
> in-memory map of 410.3 MB to disk (1 time so far)
> 16/07/31 20:52:38 INFO storage.ShuffleBlockFetcherIterator: Started 30 remote 
> fetches in 214 ms
> 16/07/31 20:52:44 INFO collection.ExternalSorter: spill memory to 
> file:/data5/yarnenv/local/usercache/tesla/appcache/application_1465785263942_56138/blockmgr-e9cc29b9-ca1a-460a-ad76-
> 32f8ee437e51/0d/temp_shuffle_dc8f4489-2289-4b99-a605-81c873dc9e17, 
> fileSize:46.0 MB
> 16/07/31 20:52:44 WARN memory.ExecutionMemoryPool: Internal error: release 
> called on 430168384 bytes but task only has 
> 424958272 bytes of memory from the on-heap execution pool
> 16/07/31 20:52:45 INFO collection.ExternalSorter: Thread 152 spilling 
> in-memory map of 398.7 MB to disk (1 time so far)
> 16/07/31 20:52:52 INFO collection.ExternalSorter: Thread 151 spilling 
> in-memory map of 389.6 MB to disk (2 times so far)
> 16/07/31 20:52:54 INFO collection.ExternalSorter: spill memory to 
> file:/data11/yarnenv/local/usercache/tesla/appcache/application_1465785263942_56138/blockmgr-56e1ec4d-9560-4019-83f7-
> ec6fd3fe78f9/24/temp_shuffle_10a210a4-38df-40e6-821d-00e15da12eaa, 
> fileSize:45.5 MB
> 16/07/31 20:52:54 WARN memory.ExecutionMemoryPool: Internal error: release 
> called on 413772021 bytes but task only has 
> 408561909 bytes of memory from the on-heap execution pool
> 16/07/31 20:52:55 INFO collection.ExternalSorter: spill memory to 
> file:/data2/yarnenv/local/usercache/tesla/appcache/application_1465785263942_56138/blockmgr-f4fe32a2-c930-4b49-8feb-
> 722536c290d7/10/temp_shuffle_4ae9dd49-6039-4a44-901a-ee4650351ebc, 
> fileSize:44.7 MB
> 16/07/31 20:52:56 INFO collection.ExternalSorter: Thread 152 spilling 
> in-memory map of 389.6 MB to disk (2 times so far)
> 16/07/31 20:53:30 INFO collection.ExternalSorter: spill memory to 
> file:/data6/yarnenv/local/usercache/tesla/appcache/application_1465785263942_56138/blockmgr-95b49f89-4155-435c-826b-
> 7a616662d47a/1b/temp_shuffle_ea43939f-04d4-42a9-805f-44e01afc9a13, 
> fileSize:44.7 MB
> 16/07/31 20:53:30 INFO collection.ExternalSorter: Thread 151 spilling 
> in-memory map of 389.6 MB to disk (3 times so far)
> 16/07/31 20:53:49 INFO collection.ExternalSorter: spill memory to 
> file:/data10/yarnenv/local/usercache/tesla/appcache/application_1465785263942_56138/blockmgr-4f736364-e192-4f31-bfe9-
> 8eaa29ff2114/3f/temp_shuffle_680e07ec-404e-4710-9d14-1665b300e05c, 
> fileSize:44.7 MB
> 16/07/31 20:54:04 INFO collection.ExternalSorter: Task 23585 force spilling 
> in-memory map to disk and  it will release 164.3 
> MB memory
> 16/07/31 20:54:04 INFO collection.ExternalSorter: spill memory to 
> file:/data4/yarnenv/local/usercache/tesla/appcache/application_1465785263942_56138/blockmgr-db5f46c3-d7a4-4f93-8b77-
> 565e469696fb/09/temp_shuffle_ec3ece08-4569-4197-893a-4a5dfcbbf9fa, 
> fileSize:0.0 B
> 16/07/31 20:54:04 WARN memory.TaskMemoryManager: leak 164.3 MB memory from 
> org.apache.spark.util.collection.ExternalSorter@3db4b52d
> 16/07/31 20:54:04 ERROR executor.Executor: Managed memory leak detected; size 
> = 190458101 bytes, TID = 23585
> 16/07/31 20:54:04 ERROR executor.Executor: Exception in task 1013.0 in stage 
> 18.0 (TID 23585)
> java.lang.NullPointerException
>   at 
> org.apache.spark.util.collection.ExternalSorter$SpillReader.cleanup(ExternalSorter.scala:625)
>   at 
> org.apache.spark.util.collection.ExternalSorter$SpillReader.nextBatchStream(ExternalSorter.scala:540)
>   at 
> org.apache.spark.util.collection.ExternalSorter$SpillReader.<init>(ExternalSorter.scala:508)
>   at 
> org.apache.spark.util.collection.ExternalSorter$SpillableIterator.spill(ExternalSorter.scala:814)
>   at 
> org.apache.spark.util.collection.ExternalSorter.forceSpill(ExternalSorter.scala:254)
>   at org.apache.spark.util.collection.Spillable.spill(Spillable.scala:109)
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:154)
>   at 
> 

[jira] [Updated] (SPARK-16886) StructuredNetworkWordCount code comment incorrectly refers to DataFrame instead of Dataset

2016-08-03 Thread Ganesh Chand (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ganesh Chand updated SPARK-16886:
-
Description: 
Both StructuredNetworkWordCount.scala and StructuredNetworkWordCount.java have 
the following code comment:
{code}
// Create DataFrame representing the stream of input lines from connection to 
host:port
val lines = spark.readStream
  .format("socket")
   .option("host", host)
   .option("port", port)
   .load().as[String]
{code}

The above comment should be
{code}
// Create Dataset representing the stream of input lines from connection to 
host:port
val lines = spark.readStream
  .format("socket")
   .option("host", host)
   .option("port", port)
   .load().as[String]
{code}

because .as[String] returns a Dataset.

Affects:
* 
examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala
 
* 
examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java
 
* 
examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCountWindowed.scala
* 
examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCountWindowed.java






  was:
Both StructuredNetworkWordCount.scala and StructuredNetworkWordCount.java have 
the following code comment:
{code}
// Create DataFrame representing the stream of input lines from connection to 
host:port
val lines = spark.readStream
  .format("socket")
   .option("host", host)
   .option("port", port)
   .load().as[String]
{code}

The above comment should be
{code}
// Create Dataset representing the stream of input lines from connection to 
host:port
val lines = spark.readStream
  .format("socket")
   .option("host", host)
   .option("port", port)
   .load().as[String]
{code}

because .as[String] returns a Dataset.

Files affected are:

* 
examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala
 




> StructuredNetworkWordCount code comment incorrectly refers to DataFrame 
> instead of Dataset
> --
>
> Key: SPARK-16886
> URL: https://issues.apache.org/jira/browse/SPARK-16886
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 2.0.0
>Reporter: Ganesh Chand
>Priority: Trivial
>  Labels: comment, examples
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Both StructuredNetworkWordCount.scala and StructuredNetworkWordCount.java have 
> the following code comment:
> {code}
> // Create DataFrame representing the stream of input lines from connection to 
> host:port
> val lines = spark.readStream
>   .format("socket")
>.option("host", host)
>.option("port", port)
>.load().as[String]
> {code}
> The above comment should be
> {code}
> // Create Dataset representing the stream of input lines from connection to 
> host:port
> val lines = spark.readStream
>   .format("socket")
>.option("host", host)
>.option("port", port)
>.load().as[String]
> {code}
> because .as[String] returns a Dataset.
> Affects:
> * 
> examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala
>  
> * 
> examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java
>  
> * 
> examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCountWindowed.scala
> * 
> examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCountWindowed.java






[jira] [Updated] (SPARK-16886) StructuredNetworkWordCount code comment incorrectly refers to DataFrame instead of Dataset

2016-08-03 Thread Ganesh Chand (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ganesh Chand updated SPARK-16886:
-
Description: 
Both StructuredNetworkWordCount.scala and StructuredNetworkWordCount.java have 
the following code comment:
{code}
// Create DataFrame representing the stream of input lines from connection to 
host:port
val lines = spark.readStream
  .format("socket")
   .option("host", host)
   .option("port", port)
   .load().as[String]
{code}

The above comment should be
{code}
// Create Dataset representing the stream of input lines from connection to 
host:port
val lines = spark.readStream
  .format("socket")
   .option("host", host)
   .option("port", port)
   .load().as[String]
{code}

because .as[String] returns a Dataset.

Files affected are:

* 
examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala
 



  was:
Both StructuredNetworkWordCount.scala and StructuredNetworkWordCount.java have 
the following code comment:
{code}
// Create DataFrame representing the stream of input lines from connection to 
host:port
val lines = spark.readStream
  .format("socket")
   .option("host", host)
   .option("port", port)
   .load().as[String]
{code}

The above comment should be
{code}
// Create Dataset representing the stream of input lines from connection to 
host:port
val lines = spark.readStream
  .format("socket")
   .option("host", host)
   .option("port", port)
   .load().as[String]
{code}

because .as[String] returns a Dataset.



> StructuredNetworkWordCount code comment incorrectly refers to DataFrame 
> instead of Dataset
> --
>
> Key: SPARK-16886
> URL: https://issues.apache.org/jira/browse/SPARK-16886
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 2.0.0
>Reporter: Ganesh Chand
>Priority: Trivial
>  Labels: comment, examples
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Both StructuredNetworkWordCount.scala and StructuredNetworkWordCount.java have 
> the following code comment:
> {code}
> // Create DataFrame representing the stream of input lines from connection to 
> host:port
> val lines = spark.readStream
>   .format("socket")
>.option("host", host)
>.option("port", port)
>.load().as[String]
> {code}
> The above comment should be
> {code}
> // Create Dataset representing the stream of input lines from connection to 
> host:port
> val lines = spark.readStream
>   .format("socket")
>.option("host", host)
>.option("port", port)
>.load().as[String]
> {code}
> because .as[String] returns a Dataset.
> Files affected are:
> * 
> examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala
>  






[jira] [Comment Edited] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2

2016-08-03 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406967#comment-15406967
 ] 

Charles Allen edited comment on SPARK-16798 at 8/4/16 1:41 AM:
---

[~srowen] I manually went in with the IntelliJ debugger and can confirm that, in 
the DRIVER, numPartitions DOES have a valid positive integer value. But when 
running in the Mesos executor or in local[4], the task consistently sees a 
value of 0.

I have been able to reproduce this with TPCH data, so I can share it around if 
you can point me to someone who can help debug what might have changed.

The code snippet below is from RDD.scala, with my own comments added.

{code:title=RDD.scala}
  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.") // Correct on DRIVER
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = (new Random(index)).nextInt(numPartitions) // numPartitions == 0 in TASK
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }
{code}


was (Author: drcrallen):
[~srowen] I manually went in with IntelliJ debugging and can confirm that the 
driver DOES have a valid positive integer value for numPartitions when in the 
DRIVER. But when running in the mesos executor or in local[4], the task has a 
value of 0 consistently.

I have been able to reproduce this with TPCH data, so I can share it around if 
you can point me to someone who can help debug what might have changed.

The code snippet below is from RDD.scala, with my own comments addedl

{code:title=RDD.scala}
  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.") // Correct on DRIVER
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = (new Random(index)).nextInt(numPartitions) // numPartitions == 0 in TASK
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }
{code}

> java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
> 
>
> Key: SPARK-16798
> URL: https://issues.apache.org/jira/browse/SPARK-16798
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> Code at https://github.com/metamx/druid-spark-batch which was working under 
> 1.5.2 has ceased to function under 2.0.0 with the below stacktrace.
> {code}
> java.lang.IllegalArgumentException: bound must be positive
>   at java.util.Random.nextInt(Random.java:388)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> 

[jira] [Commented] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2

2016-08-03 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406967#comment-15406967
 ] 

Charles Allen commented on SPARK-16798:
---

[~srowen] I manually went in with the IntelliJ debugger and can confirm that, in 
the DRIVER, numPartitions DOES have a valid positive integer value. But when 
running in the Mesos executor or in local[4], the task consistently sees a 
value of 0.

I have been able to reproduce this with TPCH data, so I can share it around if 
you can point me to someone who can help debug what might have changed.

The code snippet below is from RDD.scala, with my own comments added.

{code:title=RDD.scala}
  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.") // Correct on DRIVER
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = (new Random(index)).nextInt(numPartitions) // numPartitions == 0 in TASK
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }
{code}

> java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
> 
>
> Key: SPARK-16798
> URL: https://issues.apache.org/jira/browse/SPARK-16798
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> Code at https://github.com/metamx/druid-spark-batch which was working under 
> 1.5.2 has ceased to function under 2.0.0 with the below stacktrace.
> {code}
> java.lang.IllegalArgumentException: bound must be positive
>   at java.util.Random.nextInt(Random.java:388)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16886) StructuredNetworkWordCount code comment incorrectly refers to DataFrame instead of Dataset

2016-08-03 Thread Ganesh Chand (JIRA)
Ganesh Chand created SPARK-16886:


 Summary: StructuredNetworkWordCount code comment incorrectly 
refers to DataFrame instead of Dataset
 Key: SPARK-16886
 URL: https://issues.apache.org/jira/browse/SPARK-16886
 Project: Spark
  Issue Type: Bug
  Components: Examples
Affects Versions: 2.0.0
Reporter: Ganesh Chand
Priority: Trivial


Both StructuredNetworkWordCount.scala and StructuredNetworkWordCount.java have 
the following code comment:
{code}
// Create DataFrame representing the stream of input lines from connection to host:port
val lines = spark.readStream
  .format("socket")
  .option("host", host)
  .option("port", port)
  .load().as[String]
{code}

The above comment should be
{code}
// Create Dataset representing the stream of input lines from connection to host:port
val lines = spark.readStream
  .format("socket")
  .option("host", host)
  .option("port", port)
  .load().as[String]
{code}

because .as[String] returns a Dataset[String], not a DataFrame.
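
For what it's worth, a small sketch of the distinction (not part of the example 
itself; the local SparkSession setup is assumed):

{code}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// DataFrame is just an alias for Dataset[Row]; .as[String] turns it into a typed Dataset.
val spark = SparkSession.builder().appName("dataset-vs-dataframe").master("local[*]").getOrCreate()
import spark.implicits._

val df: DataFrame = Seq("apache", "spark").toDF("value") // untyped: Dataset[Row]
val ds: Dataset[String] = df.as[String]                  // typed: Dataset[String]
ds.map(_.toUpperCase).show()
spark.stop()
{code}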




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16885) Spark shell failed to run in yarn-client mode

2016-08-03 Thread Yury Zhyshko (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yury Zhyshko updated SPARK-16885:
-
Attachment: spark-env.sh

> Spark shell failed to run in yarn-client mode
> -
>
> Key: SPARK-16885
> URL: https://issues.apache.org/jira/browse/SPARK-16885
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
> Environment: Ubuntu 12.04
> Hadoop 2.7.2 + Yarn
>Reporter: Yury Zhyshko
> Attachments: spark-env.sh
>
>
> I've installed Hadoop + Yarn in pseudo distributed mode following these 
> instructions: 
> https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html#YARN_on_a_Single_Node
> After that I downloaded and installed a prebuilt Spark for Hadoop 2.7
> The command that I used to run a shell: 
> ./bin/spark-shell --master yarn --deploy-mode client --conf 
> spark.yarn.archive=/home/yzhishko/work/spark/jars
> Here is the error:
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/08/03 17:13:50 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 16/08/03 17:13:52 ERROR spark.SparkContext: Error initializing SparkContext.
> java.lang.IllegalArgumentException: Can not create a Path from an empty string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126)
>   at org.apache.hadoop.fs.Path.(Path.java:134)
>   at org.apache.hadoop.fs.Path.(Path.java:93)
>   at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:338)
>   at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:433)
>   at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:472)
>   at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:834)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:167)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
>   at org.apache.spark.SparkContext.(SparkContext.scala:500)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
>   at org.apache.spark.repl.Main$.createSparkSession(Main.scala:101)
>   at $line3.$read$$iw$$iw.(:15)
>   at $line3.$read$$iw.(:31)
>   at $line3.$read.(:33)
>   at $line3.$read$.(:37)
>   at $line3.$read$.()
>   at $line3.$eval$.$print$lzycompute(:7)
>   at $line3.$eval$.$print(:6)
>   at $line3.$eval.$print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
>   at 
> scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
>   at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
>   at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
>   at 
> scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
>   at 
> scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
>   at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
>   at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
>   at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
>   at 
> scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
>   at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
>   at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
>   at 

[jira] [Updated] (SPARK-16885) Spark shell failed to run in yarn-client mode

2016-08-03 Thread Yury Zhyshko (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yury Zhyshko updated SPARK-16885:
-
Description: 
I've installed Hadoop + Yarn in pseudo distributed mode following these 
instructions: 
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html#YARN_on_a_Single_Node

After that I downloaded and installed a prebuilt Spark for Hadoop 2.7

The command that I used to run a shell: 

./bin/spark-shell --master yarn --deploy-mode client --conf 
spark.yarn.archive=/home/yzhishko/work/spark/jars

Here is the error:

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/08/03 17:13:50 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
16/08/03 17:13:52 ERROR spark.SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Can not create a Path from an empty string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126)
at org.apache.hadoop.fs.Path.(Path.java:134)
at org.apache.hadoop.fs.Path.(Path.java:93)
at 
org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:338)
at 
org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:433)
at 
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:472)
at 
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:834)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:167)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
at org.apache.spark.SparkContext.(SparkContext.scala:500)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:101)
at $line3.$read$$iw$$iw.(:15)
at $line3.$read$$iw.(:31)
at $line3.$read.(:33)
at $line3.$read$.(:37)
at $line3.$read$.()
at $line3.$eval$.$print$lzycompute(:7)
at $line3.$eval$.$print(:6)
at $line3.$eval.$print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
at 
scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
at 
scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
at 
scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
at 
scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
at 
scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
at 
scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
at 
scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
at 
org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38)
at 
org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
at 
org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214)
at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:37)
at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:94)
at 
scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920)
at 
scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
at 
scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
at 

[jira] [Created] (SPARK-16885) Spark shell failed to run in yarn-client mode

2016-08-03 Thread Yury Zhyshko (JIRA)
Yury Zhyshko created SPARK-16885:


 Summary: Spark shell failed to run in yarn-client mode
 Key: SPARK-16885
 URL: https://issues.apache.org/jira/browse/SPARK-16885
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 2.0.0
 Environment: Ubuntu 12.04
Hadoop 2.7.2 + Yarn
Reporter: Yury Zhyshko


I've installed Hadoop + Yarn in pseudo distributed mode following these 
instructions: 
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html#YARN_on_a_Single_Node

After that I downloaded and installed a prebuilt Spark for Hadoop 2.7

The command that I used to run a shell: 

./bin/spark-shell --master yarn --deploy-mode client --conf 
spark.yarn.archive=/home/yzhishko/work/spark/jars

Here is the error:

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/08/03 17:13:50 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
16/08/03 17:13:52 ERROR spark.SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Can not create a Path from an empty string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126)
at org.apache.hadoop.fs.Path.(Path.java:134)
at org.apache.hadoop.fs.Path.(Path.java:93)
at 
org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:338)
at 
org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:433)
at 
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:472)
at 
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:834)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:167)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
at org.apache.spark.SparkContext.(SparkContext.scala:500)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:101)
at $line3.$read$$iw$$iw.(:15)
at $line3.$read$$iw.(:31)
at $line3.$read.(:33)
at $line3.$read$.(:37)
at $line3.$read$.()
at $line3.$eval$.$print$lzycompute(:7)
at $line3.$eval$.$print(:6)
at $line3.$eval.$print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
at 
scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
at 
scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
at 
scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
at 
scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
at 
scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
at 
scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
at 
scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
at 
org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38)
at 
org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
at 
org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214)
at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:37)
at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:94)
at 
scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920)
at 

[jira] [Updated] (SPARK-16814) Fix deprecated use of ParquetWriter in Parquet test suites

2016-08-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16814:
--
Priority: Minor  (was: Major)

> Fix deprecated use of ParquetWriter in Parquet test suites
> --
>
> Key: SPARK-16814
> URL: https://issues.apache.org/jira/browse/SPARK-16814
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: holdenk
>Assignee: holdenk
>Priority: Minor
> Fix For: 2.1.0
>
>
> Replace deprecated ParquetWriter with the new builders
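
For reference, a rough sketch of the builder-style construction this refers to, 
using the parquet-avro bindings as a stand-in (the path and schema here are made 
up, and this is not necessarily the exact code used in the Spark test suites):

{code}
import org.apache.avro.SchemaBuilder
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

// Builder-based construction instead of the deprecated ParquetWriter constructors.
val schema = SchemaBuilder.record("User").fields()
  .requiredString("name")
  .requiredInt("age")
  .endRecord()

val writer = AvroParquetWriter.builder[GenericRecord](new Path("/tmp/users.parquet"))
  .withSchema(schema)
  .build()

val record = new GenericData.Record(schema)
record.put("name", "holden")
record.put("age", 30)
writer.write(record)
writer.close()
{code}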



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()

2016-08-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16826:


Assignee: (was: Apache Spark)

> java.util.Hashtable limits the throughput of PARSE_URL()
> 
>
> Key: SPARK-16826
> URL: https://issues.apache.org/jira/browse/SPARK-16826
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>
> Hello!
> I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of 
> {{parse_url(url, "host")}} in Spark SQL.
> Unfortunately it seems that there is an internal thread-safe cache in there, 
> and the instances end up being 90% idle.
> When I view the thread dump for my executors, most of the executor threads 
> are "BLOCKED", in that state:
> {code}
> java.util.Hashtable.get(Hashtable.java:362)
> java.net.URL.getURLStreamHandler(URL.java:1135)
> java.net.URL.(URL.java:599)
> java.net.URL.(URL.java:490)
> java.net.URL.(URL.java:439)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> org.apache.spark.scheduler.Task.run(Task.scala:85)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}
> However, when I switch from 1 executor with 36 cores to 9 executors with 4 
> cores, throughput is almost 10x higher and the CPUs are back at ~100% use.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()

2016-08-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406868#comment-15406868
 ] 

Apache Spark commented on SPARK-16826:
--

User 'sylvinus' has created a pull request for this issue:
https://github.com/apache/spark/pull/14488
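
For context, a standalone sketch of a way around the lookup (hypothetical helper, 
not necessarily what the PR does): the blocked threads in the thread dump are 
waiting on java.net.URL's stream-handler lookup, which synchronizes on a global 
java.util.Hashtable, while java.net.URI does plain string parsing and never 
touches that table:

{code}
import java.net.URI

// Hypothetical helper (not Spark's parse_url implementation): extract the host
// with java.net.URI instead of java.net.URL, avoiding the synchronized
// stream-handler table the executor threads block on.
def hostOf(url: String): Option[String] =
  try Option(new URI(url).getHost)
  catch { case _: java.net.URISyntaxException => None }

println(hostOf("https://spark.apache.org/docs/latest/")) // Some(spark.apache.org)
{code}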

> java.util.Hashtable limits the throughput of PARSE_URL()
> 
>
> Key: SPARK-16826
> URL: https://issues.apache.org/jira/browse/SPARK-16826
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>
> Hello!
> I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of 
> {{parse_url(url, "host")}} in Spark SQL.
> Unfortunately it seems that there is an internal thread-safe cache in there, 
> and the instances end up being 90% idle.
> When I view the thread dump for my executors, most of the executor threads 
> are "BLOCKED", in that state:
> {code}
> java.util.Hashtable.get(Hashtable.java:362)
> java.net.URL.getURLStreamHandler(URL.java:1135)
> java.net.URL.(URL.java:599)
> java.net.URL.(URL.java:490)
> java.net.URL.(URL.java:439)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> org.apache.spark.scheduler.Task.run(Task.scala:85)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}
> However, when I switch from 1 executor with 36 cores to 9 executors with 4 
> cores, throughput is almost 10x higher and the CPUs are back at ~100% use.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()

2016-08-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16826:


Assignee: Apache Spark

> java.util.Hashtable limits the throughput of PARSE_URL()
> 
>
> Key: SPARK-16826
> URL: https://issues.apache.org/jira/browse/SPARK-16826
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>Assignee: Apache Spark
>
> Hello!
> I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of 
> {{parse_url(url, "host")}} in Spark SQL.
> Unfortunately it seems that there is an internal thread-safe cache in there, 
> and the instances end up being 90% idle.
> When I view the thread dump for my executors, most of the executor threads 
> are "BLOCKED", in that state:
> {code}
> java.util.Hashtable.get(Hashtable.java:362)
> java.net.URL.getURLStreamHandler(URL.java:1135)
> java.net.URL.(URL.java:599)
> java.net.URL.(URL.java:490)
> java.net.URL.(URL.java:439)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> org.apache.spark.scheduler.Task.run(Task.scala:85)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}
> However, when I switch from 1 executor with 36 cores to 9 executors with 4 
> cores, throughput is almost 10x higher and the CPUs are back at ~100% use.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16814) Fix deprecated use of ParquetWriter in Parquet test suites

2016-08-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16814:
--
Assignee: holdenk

> Fix deprecated use of ParquetWriter in Parquet test suites
> --
>
> Key: SPARK-16814
> URL: https://issues.apache.org/jira/browse/SPARK-16814
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: holdenk
>Assignee: holdenk
> Fix For: 2.1.0
>
>
> Replace deprecated ParquetWriter with the new builders



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16814) Fix deprecated use of ParquetWriter in Parquet test suites

2016-08-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16814.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14419
[https://github.com/apache/spark/pull/14419]

> Fix deprecated use of ParquetWriter in Parquet test suites
> --
>
> Key: SPARK-16814
> URL: https://issues.apache.org/jira/browse/SPARK-16814
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: holdenk
> Fix For: 2.1.0
>
>
> Replace deprecated ParquetWriter with the new builders



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16770) Spark shell not usable with german keyboard due to JLine version

2016-08-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16770:
--
Assignee: Stefan Schulze

> Spark shell not usable with german keyboard due to JLine version
> 
>
> Key: SPARK-16770
> URL: https://issues.apache.org/jira/browse/SPARK-16770
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
>Reporter: Stefan Schulze
>Assignee: Stefan Schulze
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> It is impossible to enter a right square bracket with a single keystroke 
> using a German keyboard layout. The problem is known from earlier Scala 
> versions; the culprit is jline-2.12.jar (see 
> https://issues.scala-lang.org/browse/SI-8759).
> Workaround: Replace jline-2.12.jar with jline-2.12.1.jar in the jars folder.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16770) Spark shell not usable with german keyboard due to JLine version

2016-08-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16770.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14429
[https://github.com/apache/spark/pull/14429]

> Spark shell not usable with german keyboard due to JLine version
> 
>
> Key: SPARK-16770
> URL: https://issues.apache.org/jira/browse/SPARK-16770
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
>Reporter: Stefan Schulze
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> It is impossible to enter a right square bracket with a single keystroke 
> using a German keyboard layout. The problem is known from earlier Scala 
> versions; the culprit is jline-2.12.jar (see 
> https://issues.scala-lang.org/browse/SI-8759).
> Workaround: Replace jline-2.12.jar with jline-2.12.1.jar in the jars folder.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16870) add "spark.sql.broadcastTimeout" into docs/sql-programming-guide.md to help people fix this timeout error when it happens

2016-08-03 Thread Liang Ke (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Ke updated SPARK-16870:
-
Description: 
Here is my workload and what I found:
I run a large number of spark-sql jobs at the same time and hit a timeout error 
(some of the jobs contain a broadcast-join operator):
16/08/03 15:43:23 ERROR SparkExecuteStatementOperation: Error executing query, 
currentState RUNNING,
java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at 
org.apache.spark.sql.execution.joins.BroadcastHashOuterJoin.doExecute(BroadcastHashOuterJoin.scala:113)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.Filter.doExecute(basicOperators.scala:70)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.Project.doExecute(basicOperators.scala:46)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.ConvertToSafe.doExecute(rowFormatConverters.scala:56)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:201)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:276)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecute
StatementOperation$$execute(SparkExecuteStatementOperation.scala:211)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.
scala:154)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.
scala:151)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1793)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:16
4)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at 

[jira] [Updated] (SPARK-16870) add "spark.sql.broadcastTimeout" into docs/sql-programming-guide.md to help people fix this timeout error when it happens

2016-08-03 Thread Liang Ke (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Ke updated SPARK-16870:
-
Description: 
Here is my workload and what I found:
I run a large number of spark-sql jobs at the same time and hit a timeout error 
(some of the jobs contain a broadcast-join operator):
16/08/03 15:43:23 ERROR SparkExecuteStatementOperation: Error executing query, 
currentState RUNNING,
java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at 
org.apache.spark.sql.execution.joins.BroadcastHashOuterJoin.doExecute(BroadcastHashOuterJoin.scala:113)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.Filter.doExecute(basicOperators.scala:70)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.Project.doExecute(basicOperators.scala:46)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.ConvertToSafe.doExecute(rowFormatConverters.scala:56)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:201)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:276)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecute
StatementOperation$$execute(SparkExecuteStatementOperation.scala:211)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.
scala:154)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.
scala:151)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1793)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:16
4)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at 

[jira] [Updated] (SPARK-16870) add "spark.sql.broadcastTimeout" into docs/sql-programming-guide.md to help people fix this timeout error when it happens

2016-08-03 Thread Liang Ke (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Ke updated SPARK-16870:
-
Description: 
Here is my workload and what I found:
I run a large number of spark-sql jobs at the same time and hit a timeout error 
(some of the jobs contain a broadcast-join operator):
16/08/03 15:43:23 ERROR SparkExecuteStatementOperation: Error executing query, 
currentState RUNNING,
java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at 
org.apache.spark.sql.execution.joins.BroadcastHashOuterJoin.doExecute(BroadcastHashOuterJoin.scala:113)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.Filter.doExecute(basicOperators.scala:70)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.Project.doExecute(basicOperators.scala:46)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.ConvertToSafe.doExecute(rowFormatConverters.scala:56)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:201)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:276)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecute
StatementOperation$$execute(SparkExecuteStatementOperation.scala:211)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.
scala:154)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.
scala:151)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1793)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:16
4)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at 

[jira] [Commented] (SPARK-16870) add "spark.sql.broadcastTimeout" into docs/sql-programming-guide.md to help people fix this timeout error when it happens

2016-08-03 Thread Liang Ke (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406834#comment-15406834
 ] 

Liang Ke commented on SPARK-16870:
--

Thanks :) I have updated it.
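
For reference, a minimal sketch of the setting the doc change would cover (the 
value is just an example, and sqlContext is assumed to be the Spark 1.x shell 
binding matching the SQLContext in the trace below):

{code}
// Raise the broadcast-join timeout in seconds; the default of 300 is what
// produces the "Futures timed out after [300 seconds]" error in the description.
sqlContext.setConf("spark.sql.broadcastTimeout", "1200")

// Equivalent SQL form, e.g. from the thrift server or the spark-sql shell:
sqlContext.sql("SET spark.sql.broadcastTimeout=1200")
{code}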

> add "spark.sql.broadcastTimeout" into docs/sql-programming-guide.md to help 
> people to how to fix this timeout error when it happenned
> -
>
> Key: SPARK-16870
> URL: https://issues.apache.org/jira/browse/SPARK-16870
> Project: Spark
>  Issue Type: Improvement
>Reporter: Liang Ke
>Priority: Trivial
>
> Here is my workload and what I found:
> I run a large number of spark-sql jobs at the same time and hit a timeout error 
> (some of the jobs contain a broadcast-join operator):
> 16/08/03 15:43:23 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING,
> java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
> at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at 
> org.apache.spark.sql.execution.joins.BroadcastHashOuterJoin.doExecute(BroadcastHashOuterJoin.scala:113)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at 
> org.apache.spark.sql.execution.Filter.doExecute(basicOperators.scala:70)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at 
> org.apache.spark.sql.execution.Project.doExecute(basicOperators.scala:46)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at 
> org.apache.spark.sql.execution.ConvertToSafe.doExecute(rowFormatConverters.scala:56)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:201)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:276)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecute
> StatementOperation$$execute(SparkExecuteStatementOperation.scala:211)
> at 
> 

[jira] [Updated] (SPARK-16870) add "spark.sql.broadcastTimeout" into docs/sql-programming-guide.md to help people fix this timeout error when it happens

2016-08-03 Thread Liang Ke (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Ke updated SPARK-16870:
-
Description: 
Here is my workload and what I found:
I run a large number of spark-sql jobs at the same time and hit a timeout error 
(some of the jobs contain a broadcast-join operator):
16/08/03 15:43:23 ERROR SparkExecuteStatementOperation: Error executing query, 
currentState RUNNING,
java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at 
org.apache.spark.sql.execution.joins.BroadcastHashOuterJoin.doExecute(BroadcastHashOuterJoin.scala:113)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.Filter.doExecute(basicOperators.scala:70)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.Project.doExecute(basicOperators.scala:46)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.ConvertToSafe.doExecute(rowFormatConverters.scala:56)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:201)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:276)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecute
StatementOperation$$execute(SparkExecuteStatementOperation.scala:211)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.
scala:154)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.
scala:151)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1793)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:16
4)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at 

[jira] [Assigned] (SPARK-16884) Move DataSourceScanExec out of ExistingRDD.scala file

2016-08-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16884:


Assignee: (was: Apache Spark)

> Move DataSourceScanExec out of ExistingRDD.scala file
> -
>
> Key: SPARK-16884
> URL: https://issues.apache.org/jira/browse/SPARK-16884
> Project: Spark
>  Issue Type: Improvement
>Reporter: Eric Liang
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16884) Move DataSourceScanExec out of ExistingRDD.scala file

2016-08-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406779#comment-15406779
 ] 

Apache Spark commented on SPARK-16884:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/14487

> Move DataSourceScanExec out of ExistingRDD.scala file
> -
>
> Key: SPARK-16884
> URL: https://issues.apache.org/jira/browse/SPARK-16884
> Project: Spark
>  Issue Type: Improvement
>Reporter: Eric Liang
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16884) Move DataSourceScanExec out of ExistingRDD.scala file

2016-08-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16884:


Assignee: Apache Spark

> Move DataSourceScanExec out of ExistingRDD.scala file
> -
>
> Key: SPARK-16884
> URL: https://issues.apache.org/jira/browse/SPARK-16884
> Project: Spark
>  Issue Type: Improvement
>Reporter: Eric Liang
>Assignee: Apache Spark
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16884) Move DataSourceScanExec out of ExistingRDD.scala file

2016-08-03 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-16884:
---
Issue Type: Improvement  (was: Bug)

> Move DataSourceScanExec out of ExistingRDD.scala file
> -
>
> Key: SPARK-16884
> URL: https://issues.apache.org/jira/browse/SPARK-16884
> Project: Spark
>  Issue Type: Improvement
>Reporter: Eric Liang
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16884) Move DataSourceScanExec out of ExistingRDD.scala file

2016-08-03 Thread Eric Liang (JIRA)
Eric Liang created SPARK-16884:
--

 Summary: Move DataSourceScanExec out of ExistingRDD.scala file
 Key: SPARK-16884
 URL: https://issues.apache.org/jira/browse/SPARK-16884
 Project: Spark
  Issue Type: Bug
Reporter: Eric Liang
Priority: Trivial






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16883) SQL decimal type is not properly cast to number when collecting SparkDataFrame

2016-08-03 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-16883:
--

 Summary: SQL decimal type is not properly cast to number when 
collecting SparkDataFrame
 Key: SPARK-16883
 URL: https://issues.apache.org/jira/browse/SPARK-16883
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.0.0
Reporter: Hossein Falaki


To reproduce, run the following code. As you can see, "y" is collected as a list of values.
{code}
registerTempTable(createDataFrame(iris), "iris")
str(collect(sql("select cast('1' as double) as x, cast('2' as decimal) as y  
from iris limit 5")))

'data.frame':   5 obs. of  2 variables:
 $ x: num  1 1 1 1 1
 $ y:List of 5
  ..$ : num 2
  ..$ : num 2
  ..$ : num 2
  ..$ : num 2
  ..$ : num 2
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16700) StructType doesn't accept Python dicts anymore

2016-08-03 Thread Sylvain Zimmer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406754#comment-15406754
 ] 

Sylvain Zimmer commented on SPARK-16700:


The verifySchema flag works great, and the {{dict}} issue seems to be fixed for 
me. Thanks a lot!!
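
For reference, a minimal sketch of using the flag mentioned above. This assumes the 
{{verifySchema}} parameter is exposed on {{SparkSession.createDataFrame}} in the build 
that carries the fix; everything else follows the reproduction script below.

{code}
from pyspark.sql import SparkSession
from pyspark.sql import types as SparkTypes

spark = SparkSession.builder.getOrCreate()

struct_schema = SparkTypes.StructType([
    SparkTypes.StructField("id", SparkTypes.LongType())
])
rdd = spark.sparkContext.parallelize([{"id": 0}, {"id": 1}])

# Skipping per-row verification lets plain Python dicts through for struct fields.
df = spark.createDataFrame(rdd, struct_schema, verifySchema=False)
print(df.collect())
{code}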

> StructType doesn't accept Python dicts anymore
> --
>
> Key: SPARK-16700
> URL: https://issues.apache.org/jira/browse/SPARK-16700
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>Assignee: Davies Liu
>
> Hello,
> I found this issue while testing my codebase with 2.0.0-rc5
> StructType in Spark 1.6.2 accepts the Python {{dict}} type, which is very 
> handy. 2.0.0-rc5 does not and throws an error.
> I don't know if this was intended but I'd advocate for this behaviour to 
> remain the same. MapType is probably wasteful when your key names never 
> change and switching to Python tuples would be cumbersome.
> Here is a minimal script to reproduce the issue: 
> {code}
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> struct_schema = SparkTypes.StructType([
> SparkTypes.StructField("id", SparkTypes.LongType())
> ])
> rdd = sc.parallelize([{"id": 0}, {"id": 1}])
> df = sqlc.createDataFrame(rdd, struct_schema)
> print df.collect()
> # 1.6.2 prints [Row(id=0), Row(id=1)]
> # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in 
> type <type 'dict'>
> {code}
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13710) Spark shell shows ERROR when launching on Windows

2016-08-03 Thread Arsen Vladimirskiy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406744#comment-15406744
 ] 

Arsen Vladimirskiy edited comment on SPARK-13710 at 8/3/16 10:38 PM:
-

I noticed that when using the binary package  
"spark-2.0.0-bin-without-hadoop.tgz" (i.e. with user-provided Hadoop pointed to 
via "export SPARK_DIST_CLASSPATH=$(hadoop classpath)") the same error still 
happens.

java.lang.NoClassDefFoundError: Could not initialize class 
scala.tools.fusesource_embedded.jansi.internal.Kernel32
at 
scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50)

I compared the jars provided with spark-2.0.0-bin-with-hadoop-2.7 to the ones 
provided with spark-2.0.0-bin-without-hadoop and noticed that jline-2.12.jar is 
present in the "with-hadoop" but is missing from the "without-hadoop" binary 
package.

When I copy the jline-2.12.jar to the jars folder of "without-hadoop", I can 
start bin\spark-shell without getting this error.

Is there a reason jline-2.12.jar is not part of the "without-hadoop" package?


was (Author: arsenvlad):
I noticed that when using the binary package  
"spark-2.0.0-bin-without-hadoop.tgz" (i.e. with user-provided Hadoop pointed to 
via "export SPARK_DIST_CLASSPATH=$(hadoop classpath)") the same error still 
happens.

java.lang.NoClassDefFoundError: Could not initialize class 
scala.tools.fusesource_embedded.jansi.internal.Kernel32
at 
scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50)

I compared the jars provided with spark-2.0.0-bin-with-hadoop-2.7 to the ones 
provided with spark-2.0.0-bin-without-hadoop and noticed that jline-2.12.jar is 
present in the "with-hadoop" but is missing from the "without-hadoop" binary 
package.

When I copy the jline-2.12.jar to the jars folder of "without-hadoop", I can 
start bin\spark-shell without encountering this error.

Is there a reason jline-2.12.jar is not part of the "without-hadoop" package?

> Spark shell shows ERROR when launching on Windows
> -
>
> Key: SPARK-13710
> URL: https://issues.apache.org/jira/browse/SPARK-13710
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Windows
>Reporter: Masayoshi TSUZUKI
>Assignee: Michel Lemay
>Priority: Minor
> Fix For: 2.0.0
>
>
> On Windows, when we launch {{bin\spark-shell.cmd}}, it shows ERROR message 
> and stacktrace.
> {noformat}
> C:\Users\tsudukim\Documents\workspace\spark-dev3>bin\spark-shell
> [ERROR] Terminal initialization failed; falling back to unsupported
> java.lang.NoClassDefFoundError: Could not initialize class 
> scala.tools.fusesource_embedded.jansi.internal.Kernel32
> at 
> scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50)
> at 
> scala.tools.jline_embedded.WindowsTerminal.getConsoleMode(WindowsTerminal.java:204)
> at 
> scala.tools.jline_embedded.WindowsTerminal.init(WindowsTerminal.java:82)
> at 
> scala.tools.jline_embedded.TerminalFactory.create(TerminalFactory.java:101)
> at 
> scala.tools.jline_embedded.TerminalFactory.get(TerminalFactory.java:158)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:229)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:221)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:209)
> at 
> scala.tools.nsc.interpreter.jline_embedded.JLineConsoleReader.(JLineReader.scala:61)
> at 
> scala.tools.nsc.interpreter.jline_embedded.InteractiveReader.(JLineReader.scala:33)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:865)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:862)
> at 
> scala.tools.nsc.interpreter.ILoop.scala$tools$nsc$interpreter$ILoop$$mkReader$1(ILoop.scala:871)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875)
> at scala.util.Try$.apply(Try.scala:192)
> at 
> 

[jira] [Commented] (SPARK-13710) Spark shell shows ERROR when launching on Windows

2016-08-03 Thread Arsen Vladimirskiy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406744#comment-15406744
 ] 

Arsen Vladimirskiy commented on SPARK-13710:


I noticed that when using the binary package  
"spark-2.0.0-bin-without-hadoop.tgz" (i.e. with user-provided Hadoop pointed to 
via "export SPARK_DIST_CLASSPATH=$(hadoop classpath)") the same error still 
happens.

java.lang.NoClassDefFoundError: Could not initialize class 
scala.tools.fusesource_embedded.jansi.internal.Kernel32
at 
scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50)

I compared the jars provided with spark-2.0.0-bin-*with*-hadoop-2.7 to the ones 
provided with spark-2.0.0-bin-*without*-hadoop and noticed that jline-2.12.jar 
is present in the "with-hadoop" but is missing from the "without-hadoop" binary 
package.

When I copy the jline-2.12.jar to the jars folder of "without-hadoop", I can 
start bin\spark-shell without encountering this error.

Is there a reason jline-2.12.jar is not part of the "without-hadoop" package?

> Spark shell shows ERROR when launching on Windows
> -
>
> Key: SPARK-13710
> URL: https://issues.apache.org/jira/browse/SPARK-13710
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Windows
>Reporter: Masayoshi TSUZUKI
>Assignee: Michel Lemay
>Priority: Minor
> Fix For: 2.0.0
>
>
> On Windows, when we launch {{bin\spark-shell.cmd}}, it shows ERROR message 
> and stacktrace.
> {noformat}
> C:\Users\tsudukim\Documents\workspace\spark-dev3>bin\spark-shell
> [ERROR] Terminal initialization failed; falling back to unsupported
> java.lang.NoClassDefFoundError: Could not initialize class 
> scala.tools.fusesource_embedded.jansi.internal.Kernel32
> at 
> scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50)
> at 
> scala.tools.jline_embedded.WindowsTerminal.getConsoleMode(WindowsTerminal.java:204)
> at 
> scala.tools.jline_embedded.WindowsTerminal.init(WindowsTerminal.java:82)
> at 
> scala.tools.jline_embedded.TerminalFactory.create(TerminalFactory.java:101)
> at 
> scala.tools.jline_embedded.TerminalFactory.get(TerminalFactory.java:158)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:229)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:221)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:209)
> at 
> scala.tools.nsc.interpreter.jline_embedded.JLineConsoleReader.(JLineReader.scala:61)
> at 
> scala.tools.nsc.interpreter.jline_embedded.InteractiveReader.(JLineReader.scala:33)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:865)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:862)
> at 
> scala.tools.nsc.interpreter.ILoop.scala$tools$nsc$interpreter$ILoop$$mkReader$1(ILoop.scala:871)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875)
> at scala.util.Try$.apply(Try.scala:192)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875)
> at 
> scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
> at 
> scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233)
> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223)
> at scala.collection.immutable.Stream.collect(Stream.scala:435)
> at scala.tools.nsc.interpreter.ILoop.chooseReader(ILoop.scala:877)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$2.apply(ILoop.scala:916)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:916)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
> at 
> 

[jira] [Comment Edited] (SPARK-13710) Spark shell shows ERROR when launching on Windows

2016-08-03 Thread Arsen Vladimirskiy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406744#comment-15406744
 ] 

Arsen Vladimirskiy edited comment on SPARK-13710 at 8/3/16 10:38 PM:
-

I noticed that when using the binary package  
"spark-2.0.0-bin-without-hadoop.tgz" (i.e. with user-provided Hadoop pointed to 
via "export SPARK_DIST_CLASSPATH=$(hadoop classpath)") the same error still 
happens.

java.lang.NoClassDefFoundError: Could not initialize class 
scala.tools.fusesource_embedded.jansi.internal.Kernel32
at 
scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50)

I compared the jars provided with spark-2.0.0-bin-with-hadoop-2.7 to the ones 
provided with spark-2.0.0-bin-without-hadoop and noticed that jline-2.12.jar is 
present in the "with-hadoop" but is missing from the "without-hadoop" binary 
package.

When I copy the jline-2.12.jar to the jars folder of "without-hadoop", I can 
start bin\spark-shell without encountering this error.

Is there a reason jline-2.12.jar is not part of the "without-hadoop" package?


was (Author: arsenvlad):
I noticed that when using the binary package  
"spark-2.0.0-bin-without-hadoop.tgz" (i.e. with user-provided Hadoop pointed to 
via "export SPARK_DIST_CLASSPATH=$(hadoop classpath)") the same error still 
happens.

java.lang.NoClassDefFoundError: Could not initialize class 
scala.tools.fusesource_embedded.jansi.internal.Kernel32
at 
scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50)

I compared the jars provided with spark-2.0.0-bin-*with*-hadoop-2.7 to the ones 
provided with spark-2.0.0-bin-*without*-hadoop and noticed that jline-2.12.jar 
is present in the "with-hadoop" but is missing from the "without-hadoop" binary 
package.

When I copy the jline-2.12.jar to the jars folder of "without-hadoop", I can 
start bin\spark-shell without encountering this error.

Is there a reason jline-2.12.jar is not part of the "without-hadoop" package?

> Spark shell shows ERROR when launching on Windows
> -
>
> Key: SPARK-13710
> URL: https://issues.apache.org/jira/browse/SPARK-13710
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Windows
>Reporter: Masayoshi TSUZUKI
>Assignee: Michel Lemay
>Priority: Minor
> Fix For: 2.0.0
>
>
> On Windows, when we launch {{bin\spark-shell.cmd}}, it shows ERROR message 
> and stacktrace.
> {noformat}
> C:\Users\tsudukim\Documents\workspace\spark-dev3>bin\spark-shell
> [ERROR] Terminal initialization failed; falling back to unsupported
> java.lang.NoClassDefFoundError: Could not initialize class 
> scala.tools.fusesource_embedded.jansi.internal.Kernel32
> at 
> scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50)
> at 
> scala.tools.jline_embedded.WindowsTerminal.getConsoleMode(WindowsTerminal.java:204)
> at 
> scala.tools.jline_embedded.WindowsTerminal.init(WindowsTerminal.java:82)
> at 
> scala.tools.jline_embedded.TerminalFactory.create(TerminalFactory.java:101)
> at 
> scala.tools.jline_embedded.TerminalFactory.get(TerminalFactory.java:158)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:229)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:221)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.(ConsoleReader.java:209)
> at 
> scala.tools.nsc.interpreter.jline_embedded.JLineConsoleReader.(JLineReader.scala:61)
> at 
> scala.tools.nsc.interpreter.jline_embedded.InteractiveReader.(JLineReader.scala:33)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:865)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:862)
> at 
> scala.tools.nsc.interpreter.ILoop.scala$tools$nsc$interpreter$ILoop$$mkReader$1(ILoop.scala:871)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875)
> at scala.util.Try$.apply(Try.scala:192)
> at 
> 

[jira] [Resolved] (SPARK-15344) Unable to set default log level for PySpark

2016-08-03 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-15344.

Resolution: Not A Problem

See Felix's reply above.
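
For reference, the WARN default can also be overridden at runtime without editing 
log4j.properties. A minimal sketch, assuming the PySpark shell where {{sc}} is 
already created:

{code}
# In the PySpark shell, `sc` is the SparkContext created at startup.
sc.setLogLevel("INFO")
{code}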

> Unable to set default log level for PySpark
> ---
>
> Key: SPARK-15344
> URL: https://issues.apache.org/jira/browse/SPARK-15344
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Minor
>
> After this patch:
> https://github.com/apache/spark/pull/12648
> I'm unable to set the default log level for PySpark.
> It's always WARN.
> The setting below doesn't work: 
> {code}
> mbrynski@jupyter:~/spark$ cat conf/log4j.properties
> # Set everything to be logged to the console
> log4j.rootCategory=INFO, console
> log4j.appender.console=org.apache.log4j.ConsoleAppender
> log4j.appender.console.target=System.err
> log4j.appender.console.layout=org.apache.log4j.PatternLayout
> log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p 
> %c{1}: %m%n
> # Set the default spark-shell log level to WARN. When running the 
> spark-shell, the
> # log level for this class is used to overwrite the root logger's log level, 
> so that
> # the user can have different defaults for the shell and regular Spark apps.
> log4j.logger.org.apache.spark.repl.Main=INFO
> # Settings to quiet third party logs that are too verbose
> log4j.logger.org.spark_project.jetty=WARN
> log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
> log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
> log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
> log4j.logger.org.apache.parquet=ERROR
> log4j.logger.parquet=ERROR
> # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent 
> UDFs in SparkSQL with Hive support
> log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
> log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16455) Add a new hook in CoarseGrainedSchedulerBackend in order to stop scheduling new tasks when cluster is restarting

2016-08-03 Thread YangyangLiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406672#comment-15406672
 ] 

YangyangLiu commented on SPARK-16455:
-

OK, I get your point. I will wait for a few days. If no one comments in favor of 
keeping this feature, I will close the PR for this issue.

> Add a new hook in CoarseGrainedSchedulerBackend in order to stop scheduling 
> new tasks when cluster is restarting
> 
>
> Key: SPARK-16455
> URL: https://issues.apache.org/jira/browse/SPARK-16455
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler
>Reporter: YangyangLiu
>Priority: Minor
>
> In our case, we are implementing a new mechanism which will let the driver 
> survive when the cluster is temporarily down and restarting. So when the service 
> provided by the cluster is not available, the scheduler should stop scheduling 
> new tasks. I added a hook inside the CoarseGrainedSchedulerBackend class in 
> order to avoid scheduling new tasks when necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16796) Visible passwords on Spark environment page

2016-08-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406571#comment-15406571
 ] 

Apache Spark commented on SPARK-16796:
--

User 'Devian-ua' has created a pull request for this issue:
https://github.com/apache/spark/pull/14484

> Visible passwords on Spark environment page
> ---
>
> Key: SPARK-16796
> URL: https://issues.apache.org/jira/browse/SPARK-16796
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Artur
>Assignee: Artur
>Priority: Trivial
> Fix For: 1.6.3, 2.0.1, 2.1.0
>
> Attachments: 
> Mask_spark_ssl_keyPassword_spark_ssl_keyStorePassword_spark_ssl_trustStorePassword_from_We1.patch
>
>
> Spark properties (passwords): 
> spark.ssl.keyPassword,spark.ssl.keyStorePassword,spark.ssl.trustStorePassword
> are visible in the Web UI on the Environment page.
> Can we mask them in the Web UI?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14204) [SQL] Failure to register URL-derived JDBC driver on executors in cluster mode

2016-08-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14204:
--
Fix Version/s: 2.1.0
   2.0.1

> [SQL] Failure to register URL-derived JDBC driver on executors in cluster mode
> --
>
> Key: SPARK-14204
> URL: https://issues.apache.org/jira/browse/SPARK-14204
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Kevin McHale
>Assignee: Kevin McHale
>  Labels: JDBC, SQL
> Fix For: 1.6.2, 2.0.1, 2.1.0
>
>
> DataFrameReader JDBC methods throw an IllegalStateException when:
>   1. the JDBC driver is contained in a user-provided jar, and
>   2. the user does not specify which driver to use, but rather allows spark 
> to determine the driver from the JDBC URL.
> This broke some of our database ETL jobs at @premisedata when we upgraded 
> from 1.6.0 to 1.6.1.
> I have tracked the problem down to a regression introduced in the fix for 
> SPARK-12579: 
> https://github.com/apache/spark/commit/7f37c1e45d52b7823d566349e2be21366d73651f#diff-391379a5ec51082e2ae1209db15c02b3R53
> The issue is that DriverRegistry.register is not called on the executors for 
> a JDBC driver that is derived from the JDBC path.
> The problem can be demonstrated within spark-shell, provided you're in 
> cluster mode and you've deployed a JDBC driver (e.g. postgresql.Driver) via 
> the --jars argument:
> {code}
> import 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.createConnectionFactory
> val factory = 
> createConnectionFactory("jdbc:postgresql://whatever.you.want/database?user=user=password",
>  new java.util.Properties)
> sc.parallelize(1 to 100).foreach { _ => factory() } // throws exception
> {code}
> A sufficient fix is to apply DriverRegistry.register to the `driverClass` 
> variable, rather than to `userSpecifiedDriverClass`, at the code link 
> provided above.  I will submit a PR for this shortly.
> In the meantime, a temporary workaround is to manually specify the JDBC 
> driver class in the Properties object passed to DataFrameReader.jdbc, or in 
> the options used in other entry points, which will force the executors to 
> register the class properly.
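
To illustrate the workaround mentioned in the last paragraph, a minimal PySpark 
sketch; the URL, table name, and driver class are placeholders, and the key point 
is the explicit "driver" entry in the properties:

{code}
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # assumes an existing SparkContext `sc`

df = sqlContext.read.jdbc(
    url="jdbc:postgresql://whatever.you.want/database",
    table="some_table",
    properties={
        "user": "user",
        "password": "password",
        # Naming the driver explicitly makes the executors register it up front.
        "driver": "org.postgresql.Driver",
    })
{code}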



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-16857) CrossValidator and KMeans throws IllegalArgumentException

2016-08-03 Thread Ryan Claussen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Claussen closed SPARK-16857.
-

Usage error.

> CrossValidator and KMeans throws IllegalArgumentException
> -
>
> Key: SPARK-16857
> URL: https://issues.apache.org/jira/browse/SPARK-16857
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.1
> Environment: spark-jobserver docker image.  Spark 1.6.1 on ubuntu, 
> Hadoop 2.4
>Reporter: Ryan Claussen
>
> I am attempting to use CrossValidator to train a KMeans model. When I attempt 
> to fit the data, Spark throws an IllegalArgumentException as below, since the 
> KMeans algorithm outputs an Integer in the prediction column instead of a 
> Double. Before I go too far: is using CrossValidator with KMeans supported?
> Here's the exception:
> {quote}
> java.lang.IllegalArgumentException: requirement failed: Column prediction 
> must be of type DoubleType but was actually IntegerType.
>  at scala.Predef$.require(Predef.scala:233)
>  at 
> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
>  at 
> org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:74)
>  at 
> org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:109)
>  at 
> org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:99)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>  at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:99)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.generateKMeans(SparkModelJob.scala:202)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:62)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:39)
>  at 
> spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> Here is the code I'm using to set up my cross validator. As the stack trace 
> above indicates, it is failing at the fit step:
> {quote}
> ...
> val mpc = new KMeans().setK(2).setFeaturesCol("indexedFeatures")
> val labelConverter = new 
> IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
> val pipeline = new Pipeline().setStages(Array(labelIndexer, 
> featureIndexer, mpc, labelConverter))
> val evaluator = new 
> MulticlassClassificationEvaluator().setLabelCol("approvedIndex").setPredictionCol("prediction")
> val paramGrid = new ParamGridBuilder().addGrid(mpc.maxIter, Array(100, 
> 200, 500)).build()
> val cv = new 
> CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3)
> val cvModel = cv.fit(trainingData)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types

2016-08-03 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16646:
-
Assignee: Wenchen Fan  (was: Hyukjin Kwon)

> LEAST doesn't accept numeric arguments with different data types
> 
>
> Key: SPARK-16646
> URL: https://issues.apache.org/jira/browse/SPARK-16646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
> Fix For: 2.0.1, 2.1.0
>
>
> {code:sql}
> SELECT LEAST(1, 1.5);
> {code}
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, 
> CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should 
> all have the same type, got LEAST (ArrayBuffer(IntegerType, 
> DecimalType(2,1))).; line 1 pos 7 (state=,code=0)
> {noformat}
> This query works for 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16735) Fail to create a map contains decimal type with literals having different inferred precessions and scales

2016-08-03 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-16735.
--
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14439
[https://github.com/apache/spark/pull/14439]

> Fail to create a map contains decimal type with literals having different 
> inferred precessions and scales
> -
>
> Key: SPARK-16735
> URL: https://issues.apache.org/jira/browse/SPARK-16735
> Project: Spark
>  Issue Type: Sub-task
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Liang Ke
> Fix For: 2.0.1, 2.1.0
>
>
> In Spark 2.0, we will parse float literals as decimals. However, it 
> introduces a side-effect, which is described below.
> spark-sql> select map(0.1,0.01, 0.2,0.033);
> Error in query: cannot resolve 'map(CAST(0.1 AS DECIMAL(1,1)), CAST(0.01 AS 
> DECIMAL(2,2)), CAST(0.2 AS DECIMAL(1,1)), CAST(0.033 AS DECIMAL(3,3)))' due 
> to data type mismatch: The given values of function map should all be the 
> same type, but they are [decimal(2,2), decimal(3,3)]; line 1 pos 7



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16735) Fail to create a map contains decimal type with literals having different inferred precessions and scales

2016-08-03 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16735:
-
Assignee: Wenchen Fan

> Fail to create a map contains decimal type with literals having different 
> inferred precessions and scales
> -
>
> Key: SPARK-16735
> URL: https://issues.apache.org/jira/browse/SPARK-16735
> Project: Spark
>  Issue Type: Sub-task
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Liang Ke
>Assignee: Wenchen Fan
> Fix For: 2.0.1, 2.1.0
>
>
> In Spark 2.0, we will parse float literals as decimals. However, it 
> introduces a side-effect, which is described below.
> spark-sql> select map(0.1,0.01, 0.2,0.033);
> Error in query: cannot resolve 'map(CAST(0.1 AS DECIMAL(1,1)), CAST(0.01 AS 
> DECIMAL(2,2)), CAST(0.2 AS DECIMAL(1,1)), CAST(0.033 AS DECIMAL(3,3)))' due 
> to data type mismatch: The given values of function map should all be the 
> same type, but they are [decimal(2,2), decimal(3,3)]; line 1 pos 7



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16646) LEAST doesn't accept numeric arguments with different data types

2016-08-03 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-16646.
--
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14439
[https://github.com/apache/spark/pull/14439]

> LEAST doesn't accept numeric arguments with different data types
> 
>
> Key: SPARK-16646
> URL: https://issues.apache.org/jira/browse/SPARK-16646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Hyukjin Kwon
> Fix For: 2.0.1, 2.1.0
>
>
> {code:sql}
> SELECT LEAST(1, 1.5);
> {code}
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'least(1, 
> CAST(2.1 AS DECIMAL(2,1)))' due to data type mismatch: The expressions should 
> all have the same type, got LEAST (ArrayBuffer(IntegerType, 
> DecimalType(2,1))).; line 1 pos 7 (state=,code=0)
> {noformat}
> This query works for 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-03 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406430#comment-15406430
 ] 

Michael Allman edited comment on SPARK-16320 at 8/3/16 7:02 PM:


Thank you for the new benchmarks, [~maver1ck].

The article you mention refers specifically to tuning for large executor heaps. 
In the end they recommend the G1GC. At VideoAmp, we rarely use executor heaps 
larger than 16 GB. Instead we rely on off-heap memory for large cached RDDs, 
etc. Since 16 GB is a relatively "small" heap size, I wonder how much impact 
the difference between G1GC and ParallelGC would make in this kind of scenario. 
We'll try running some of our jobs/queries with both GCs to see what we get. 
We'll also try PR 14465.

Thanks again for your diligence!


was (Author: michael):
Thank you for the new benchmarks, [~maver1ck].

The article you mention refers specifically to tuning for large executor heaps. 
In the end they recommend the G1GC. At VideoAmp, we rarely use executor heaps 
larger than 16 GB. Instead we rely on off-heap memory for large cached RDDs, 
etc. Since 16 GB is a relatively "small" heap size, I wonder how much impact 
the difference between G1GC and ParallelGC would make in this kind of scenario. 
We'll try running some of our jobs/queries with both GCs to see what we get.

Thanks again for your diligence!

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some tests on a parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested the following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query, performance is similar (about 1 sec).
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance?
> I don't know how to prepare sample data to show the problem.
> Any ideas? Or public data with many nested columns?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same 
> source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-03 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406430#comment-15406430
 ] 

Michael Allman commented on SPARK-16320:


Thank you for the new benchmarks, [~maver1ck].

The article you mention refers specifically to tuning for large executor heaps. 
In the end they recommend the G1GC. At VideoAmp, we rarely use executor heaps 
larger than 16 GB. Instead we rely on off-heap memory for large cached RDDs, 
etc. Since 16 GB is a relatively "small" heap size, I wonder how much impact 
the difference between G1GC and ParallelGC would make in this kind of scenario. 
We'll try running some of our jobs/queries with both GCs to see what we get.

Thanks again for your diligence!

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some tests on a parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested the following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query, performance is similar (about 1 sec).
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance?
> I don't know how to prepare sample data to show the problem.
> Any ideas? Or public data with many nested columns?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same 
> source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405576#comment-15405576
 ] 

Maciej Bryński edited comment on SPARK-16320 at 8/3/16 7:01 PM:


You're right.
I think I found a solution for SPARK-16320 in the meantime.
It was *changing the GC from G1GC to ParallelGC.*

||Query||Spark 1.6 G1GC||Spark 1.6 ParallelGC||Spark 2.0 G1GC||Spark 2.0 
ParallelGC||Spark 2.0 G1GC with PR14465||Spark 2.0 ParallelGC with PR14465||
|id > some_id|28s|30s|1:31|1:10|30s|29s|
|nested_column.id > some_id|1:50|2:03|2:18|1:51|2:15|1:53|
|id > some_id + python flatMap|1:57|1:59|5:09|2:53|3:47|1:51|

It looks like Spark 2.0 has problems with G1GC, and there is no such problem 
with 1.6.


_Every measurement is the minimum of 4 tries. I was using 15 cores and 30G RAM 
per executor on servers with 20 (40 with HT) cores and 128GB RAM_
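
A minimal sketch of applying the GC switch above to the PySpark script from the 
issue description; the conf keys are standard Spark settings, and everything else 
is assumed unchanged:

{code}
from pyspark import SparkContext, SparkConf

conf = SparkConf()
# Run executors with the Parallel collector instead of G1.
conf.set('spark.executor.extraJavaOptions', '-XX:+UseParallelGC')
# The driver JVM is already running at this point, so its collector has to be
# chosen at launch, e.g. spark-submit --driver-java-options '-XX:+UseParallelGC'.
sc = SparkContext(conf=conf)
{code}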


was (Author: maver1ck):
You're right.
I think I found a solution for SPARK-16320 in the meantime.
It was *changing the GC from G1GC to ParallelGC.*

||Query||Spark 1.6 G1GC||Spark 1.6 ParallelGC||Spark 2.0 G1GC||Spark 2.0 
ParallelGC||Spark 2.0 G1GC with PR14465||Spark 2.0 ParallelGC with PR14465||
|id > some_id|28s|30s|1:31|1:10|30s|29s|
|nested_column.id > some_id|1:50|2:03|2:18|1:51|2:15|1:53|
|id > some_id + python flatMap|1:57|1:59|5:09|2:53|3:47|1:51|

It looks like Spark 2.0 has problems with G1GC, and there is no such problem 
with 1.6.


_Every measurement is the minimum of 4 tries. I was using 15 cores and 30G RAM 
per executor on servers with 20 (with HT) cores and 128GB RAM_

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some tests on a parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested the following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query, performance is similar (about 1 sec).
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance?
> I don't know how to prepare sample data to show the problem.
> Any ideas? Or public data with many nested columns?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same 
> source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16882) Failures in JobGenerator Thread are Swallowed, Job Does Not Fail

2016-08-03 Thread Brian Schrameck (JIRA)
Brian Schrameck created SPARK-16882:
---

 Summary: Failures in JobGenerator Thread are Swallowed, Job Does 
Not Fail
 Key: SPARK-16882
 URL: https://issues.apache.org/jira/browse/SPARK-16882
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, Streaming
Affects Versions: 1.5.0
 Environment: CDH 5.6.1, CentOS 6.7
Reporter: Brian Schrameck


When using the fileStream functionality to read a directory with a large number 
of files over a long period of time, JVM garbage collection limits can be 
reached. In this case, the JobGenerator thread threw the exception, but it was 
completely swallowed and did not cause the job to fail. There were no errors in 
the ApplicationMaster, and the job just silently sat there not processing any 
further batches.

It would be expected that any fatal exception, not necessarily specific to this 
OutOfMemoryError, be handled appropriately and the job should be killed with 
the correct failure code.

We are running in YARN cluster mode on a CDH 5.6.1 cluster.

{noformat}Exception in thread "JobGenerator" java.lang.OutOfMemoryError: GC 
overhead limit exceeded
at java.lang.AbstractStringBuilder.(AbstractStringBuilder.java:68)
at java.lang.StringBuilder.(StringBuilder.java:89)
at org.apache.hadoop.fs.Path.(Path.java:109)
at 
org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:430)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1494)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1534)
at 
org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:569)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1494)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1534)
at 
org.apache.spark.streaming.dstream.FileInputDStream.findNewFiles(FileInputDStream.scala:195)
at 
org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:146)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)
at scala.Option.orElse(Option.scala:257)
at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339)
at 
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:120)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:120)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at 
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:120)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:247){noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405576#comment-15405576
 ] 

Maciej Bryński edited comment on SPARK-16320 at 8/3/16 6:50 PM:


You're right.
I think I found a solution for SPARK-16320 in the meantime.
It was *changing the GC from G1GC to ParallelGC.*

||Query||Spark 1.6 G1GC||Spark 1.6 ParallelGC||Spark 2.0 G1GC||Spark 2.0 
ParallelGC||Spark 2.0 G1GC with PR14465||Spark 2.0 ParallelGC with PR14465||
|id > some_id|28s|30s|1:31|1:10|30s|29s|
|nested_column.id > some_id|1:50|2:03|2:18|1:51|2:15|1:53|
|id > some_id + python flatMap|1:57|1:59|5:09|2:53|3:47|1:51|

It looks like Spark 2.0 has problems with G1GC, and there is no such problem 
with 1.6.


_Every measurement is the minimum of 4 tries. I was using 15 cores and 30G RAM 
per executor on servers with 20 (with HT) cores and 128GB RAM_


was (Author: maver1ck):
You're right.
I think I found a solution for SPARK-16320 in the meantime.
It was *changing the GC from G1GC to ParallelGC.*

||Query||Spark 1.6 ParallelGC||Spark 2.0 G1GC||Spark 2.0 ParallelGC||Spark 2.0 
ParallelGC with PR14465||
|id > some_id|30s| |1:10|29s|
|nested_column.id > some_id|2:03|2:18|1:51|1:53|
|id > some_id + python flatMap|1:59| |2:53|1:51|

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some tests on a parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested the following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query, performance is similar (about 1 sec).
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance?
> I don't know how to prepare sample data to show the problem.
> Any ideas? Or public data with many nested columns?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same 
> source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16455) Add a new hook in CoarseGrainedSchedulerBackend in order to stop scheduling new tasks when cluster is restarting

2016-08-03 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406375#comment-15406375
 ] 

Kay Ousterhout commented on SPARK-16455:


Given that this feature is only needed for an internal feature that's not part 
of the main Spark code, I don't think it makes sense to add additional 
complexity to Spark to support it.

Also, isClusterAvailableForNewOffers is in a class that's private to Spark, so 
it doesn't make sense to add this Spark-private method that's not used anywhere 
in Spark.

I'd be in favor of closing this as "will not fix", unless others have reasons 
they think this makes sense to add?

> Add a new hook in CoarseGrainedSchedulerBackend in order to stop scheduling 
> new tasks when cluster is restarting
> 
>
> Key: SPARK-16455
> URL: https://issues.apache.org/jira/browse/SPARK-16455
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler
>Reporter: YangyangLiu
>Priority: Minor
>
> In our case, we are implementing a new mechanism which will let the driver 
> survive while the cluster is temporarily down and restarting. So when the 
> service provided by the cluster is not available, the scheduler should stop 
> scheduling new tasks. I added a hook inside the CoarseGrainedSchedulerBackend 
> class to avoid scheduling new tasks when necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406372#comment-15406372
 ] 

Maciej Bryński commented on SPARK-16320:


Yes.
I also added Spark 1.6 with G1GC.
Bear in mind that every GC here runs with its default configuration.
There is a nice article by Databricks, but I didn't do the kind of GC tuning 
they describe.
https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
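
For reference, a minimal sketch of switching the executors to G1GC with otherwise default settings (the conf snippet is illustrative, not the exact setup used for the numbers above):
{code}
// Enable G1GC on executors, leaving all other GC settings at their defaults.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
{code}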

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> 
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> 
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> 
> def create_sample_data(levels, rows, path):
> 
>     def _create_column_data(cols):
>         import random
>         random.seed()
>         return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in range(cols)}
> 
>     def _create_sample_df(cols, rows):
>         rdd = sc.parallelize(range(rows))
>         data = rdd.map(lambda r: _create_column_data(cols))
>         df = sqlctx.createDataFrame(data)
>         return df
> 
>     def _create_nested_data(levels, rows):
>         if len(levels) == 1:
>             return _create_sample_df(levels[0], rows).cache()
>         else:
>             df = _create_nested_data(levels[1:], rows)
>             return df.select([struct(df.columns).alias("column{}".format(i)) for i in range(levels[0])])
> 
>     df = _create_nested_data(levels, rows)
>     df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> 
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same 
> source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16578) Configurable hostname for RBackend

2016-08-03 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406361#comment-15406361
 ] 

Joseph K. Bradley commented on SPARK-16578:
---

[~junyangq] has done some work on this.  [~wm624] are you also working on it?  
Pinging both to resolve duplicate work.  Thanks!

> Configurable hostname for RBackend
> --
>
> Key: SPARK-16578
> URL: https://issues.apache.org/jira/browse/SPARK-16578
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> One of the requirements that comes up with SparkR being a standalone package 
> is that users can now install just the R package on the client side and 
> connect to a remote machine which runs the RBackend class.
> We should check if we can support this mode of execution and what are the 
> pros / cons of it



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16714) Fail to create decimal arrays with literals having different inferred precisions and scales

2016-08-03 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406351#comment-15406351
 ] 

Yin Huai commented on SPARK-16714:
--

[~petermaxlee] Can you create a new jira for your pr? I just merged 
[~cloud_fan]'s quick fix.

> Fail to create decimal arrays with literals having different inferred 
> precisions and scales
> --
>
> Key: SPARK-16714
> URL: https://issues.apache.org/jira/browse/SPARK-16714
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>Assignee: Wenchen Fan
> Fix For: 2.0.1, 2.1.0
>
>
> In Spark 2.0, we will parse float literals as decimals. However, it 
> introduces a side-effect, which is described below. 
>  
> {code}
> select array(0.001, 0.02)
> {code}
> causes
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve 'array(CAST(0.001 AS 
> DECIMAL(3,3)), CAST(0.02 AS DECIMAL(2,2)))' due to data type mismatch: input 
> to function array should all be the same type, but it's [decimal(3,3), 
> decimal(2,2)]; line 1 pos 7
> {code}
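
A possible workaround sketch on an affected version (the common decimal type below is an assumption chosen so both literals fit):
{code}
// Cast both literals to one decimal type so array() sees matching element types.
spark.sql("select array(cast(0.001 as decimal(3,3)), cast(0.02 as decimal(3,3)))").show()
{code}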



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16596) Refactor DataSourceScanExec to do partition discovery at execution instead of planning time

2016-08-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-16596.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14241
[https://github.com/apache/spark/pull/14241]

> Refactor DataSourceScanExec to do partition discovery at execution instead of 
> planning time
> ---
>
> Key: SPARK-16596
> URL: https://issues.apache.org/jira/browse/SPARK-16596
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Eric Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> Partition discovery is rather expensive, so we should do it at execution time 
> instead of during physical planning. Right now there is not much benefit 
> since ListingFileCatalog will read scan for all partitions at planning time 
> anyways, but this can be optimized in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16753) Spark SQL doesn't handle skewed dataset joins properly

2016-08-03 Thread Thomas Sebastian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406348#comment-15406348
 ] 

Thomas Sebastian commented on SPARK-16753:
--

[~jurriaanpruis] Try a repartition before your join, so that the partitions 
being joined become evenly sized and performance improves. But please make 
sure you have enough cores to process these partitions.
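
A minimal sketch of that suggestion (the DataFrame names, join columns and partition count are illustrative assumptions):
{code}
// Even out the input partition sizes before the join; 400 is only an
// illustrative partition count.
val repartitioned = skewedDf.repartition(400)
val joined = repartitioned.join(smallDf, repartitioned("source_id") === smallDf("id"))
{code}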

> Spark SQL doesn't handle skewed dataset joins properly
> --
>
> Key: SPARK-16753
> URL: https://issues.apache.org/jira/browse/SPARK-16753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Jurriaan Pruis
> Attachments: screenshot-1.png
>
>
> I'm having issues with joining a 1 billion row dataframe with skewed data 
> with multiple dataframes with sizes ranging from 100,000 to 10 million rows. 
> This means some of the joins (about half of them) can be done using 
> broadcast, but not all.
> Because the data in the large dataframe is skewed we get out of memory errors 
> in the executors or errors like: 
> `org.apache.spark.shuffle.FetchFailedException: Too large frame`.
> We tried a lot of things, like broadcast joining the skewed rows separately 
> and unioning them with the dataset containing the sort merge joined data. 
> Which works perfectly when doing one or two joins, but when doing 10 joins 
> like this the query planner gets confused (see [SPARK-15326]).
> As most of the rows are skewed on the NULL value we use a hack where we put 
> unique values in those NULL columns so the data is properly distributed over 
> all partitions. This works fine for NULL values, but since this table is 
> growing rapidly and we have skewed data for non-NULL values as well this 
> isn't a full solution to the problem.
> Right now this specific spark task runs well 30% of the time and it's getting 
> worse and worse because of the increasing amount of data.
> How to approach these kinds of joins using Spark? It seems weird that I can't 
> find proper solutions for this problem/other people having the same kind of 
> issues when Spark profiles itself as a large-scale data processing engine. 
> Doing joins on big datasets should be a thing Spark should have no problem 
> with out of the box.
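
A rough sketch of the NULL-salting hack described above (Spark 2.0 function names; the DataFrame and column names are illustrative assumptions, not the reporter's schema):
{code}
// Replace NULL join keys with unique surrogate strings so the formerly-NULL
// rows spread across partitions instead of piling up in a few. NULL keys
// never match in an equi-join anyway, so the join result is unchanged.
import org.apache.spark.sql.functions._

val salted = skewedDf.withColumn("join_key",
  when(col("join_key").isNull,
    concat(lit("salt_"), monotonically_increasing_id().cast("string")))
    .otherwise(col("join_key").cast("string")))
{code}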



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16714) Fail to create decimal arrays with literals having different inferred precisions and scales

2016-08-03 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-16714.
--
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14439
[https://github.com/apache/spark/pull/14439]

> Fail to create decimal arrays with literals having different inferred 
> precisions and scales
> --
>
> Key: SPARK-16714
> URL: https://issues.apache.org/jira/browse/SPARK-16714
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Yin Huai
> Fix For: 2.0.1, 2.1.0
>
>
> In Spark 2.0, we will parse float literals as decimals. However, it 
> introduces a side-effect, which is described below. 
>  
> {code}
> select array(0.001, 0.02)
> {code}
> causes
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve 'array(CAST(0.001 AS 
> DECIMAL(3,3)), CAST(0.02 AS DECIMAL(2,2)))' due to data type mismatch: input 
> to function array should all be the same type, but it's [decimal(3,3), 
> decimal(2,2)]; line 1 pos 7
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16714) Fail to create decimal arrays with literals having different inferred precisions and scales

2016-08-03 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16714:
-
Assignee: Wenchen Fan

> Fail to create decimal arrays with literals having different inferred 
> precisions and scales
> --
>
> Key: SPARK-16714
> URL: https://issues.apache.org/jira/browse/SPARK-16714
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>Assignee: Wenchen Fan
> Fix For: 2.0.1, 2.1.0
>
>
> In Spark 2.0, we will parse float literals as decimals. However, it 
> introduces a side-effect, which is described below. 
>  
> {code}
> select array(0.001, 0.02)
> {code}
> causes
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve 'array(CAST(0.001 AS 
> DECIMAL(3,3)), CAST(0.02 AS DECIMAL(2,2)))' due to data type mismatch: input 
> to function array should all be the same type, but it's [decimal(3,3), 
> decimal(2,2)]; line 1 pos 7
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16827) Query with Join produces excessive amount of shuffle data

2016-08-03 Thread Thomas Sebastian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406328#comment-15406328
 ] 

Thomas Sebastian commented on SPARK-16827:
--

I understand what you have mentioned is about 1.6 vs 2.0. Having said that, 
when I faced a performance issue because of skewed partitions, I had to do a 
repartition before the join, which improved the performance considerably.

One thing I am trying to understand is how the actual execution stages in the 
web UI can be mapped to the explain plans. Please shed some light on this if 
possible; it may be a little out of context here.
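
On the second question, a small sketch (assuming a DataFrame named {{joinedDf}}): printing the extended plan shows the Exchange operators, and each Exchange corresponds to a shuffle boundary, i.e. a stage break in the web UI.
{code}
// Prints the parsed, analyzed, optimized and physical plans; Exchange nodes
// in the physical plan mark the shuffle boundaries between stages.
joinedDf.explain(true)
{code}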

> Query with Join produces excessive amount of shuffle data
> -
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>  Labels: performance
>
> One of our hive job which looks like this -
> {code}
>  SELECT  userid
>  FROM  table1 a
>  JOIN table2 b
>   ONa.ds = '2016-07-15'
>   AND  b.ds = '2016-07-15'
>   AND  a.source_id = b.id
> {code}
> After upgrade to Spark 2.0 the job is significantly slow.  Digging a little 
> into it, we found out that one of the stages produces excessive amount of 
> shuffle data.  Please note that this is a regression from Spark 1.6. Stage 2 
> of the job which used to produce 32KB shuffle data with 1.6, now produces 
> more than 400GB with Spark 2.0. We also tried turning off whole stage code 
> generation but that did not help. 
> PS - Even if the intermediate shuffle data size is huge, the job still 
> produces accurate output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16610) When writing ORC files, orc.compress should not be overridden if users do not set "compression" in the options

2016-08-03 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16610:
-
Target Version/s: 2.0.1, 2.1.0

> When writing ORC files, orc.compress should not be overridden if users do not 
> set "compression" in the options
> --
>
> Key: SPARK-16610
> URL: https://issues.apache.org/jira/browse/SPARK-16610
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>
> For the ORC source, Spark SQL has a writer option {{compression}}, which is 
> used to set the codec; its value is also set to orc.compress (the ORC conf 
> used for the codec). However, if a user only sets {{orc.compress}} in the 
> writer options, we should not use the default value of "compression" (snappy) 
> as the codec. Instead, we should respect the value of {{orc.compress}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16610) When writing ORC files, orc.compress should not be overridden if users do not set "compression" in the options

2016-08-03 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406329#comment-15406329
 ] 

Yin Huai commented on SPARK-16610:
--

It is an ORC setting. It should be accepted as a DataFrame writer option (when 
you use df.write.option). I do not think it makes sense to drop this conf key 
provided by ORC and just use Spark's key.
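
For illustration, a sketch of the user-facing call in question (the codec value and output path are assumptions):
{code}
// Pass the native ORC conf directly; per this issue its value should be
// respected rather than overridden by the default of the "compression" option.
df.write
  .option("orc.compress", "ZLIB")
  .orc("/tmp/orc_output")
{code}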

> When writing ORC files, orc.compress should not be overridden if users do not 
> set "compression" in the options
> --
>
> Key: SPARK-16610
> URL: https://issues.apache.org/jira/browse/SPARK-16610
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>
> For the ORC source, Spark SQL has a writer option {{compression}}, which is 
> used to set the codec; its value is also set to orc.compress (the ORC conf 
> used for the codec). However, if a user only sets {{orc.compress}} in the 
> writer options, we should not use the default value of "compression" (snappy) 
> as the codec. Instead, we should respect the value of {{orc.compress}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16881) Migrate Mesos configs to use ConfigEntry

2016-08-03 Thread Michael Gummelt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gummelt updated SPARK-16881:

Description: 
https://github.com/apache/spark/pull/14414#discussion_r73032190

We'd like to migrate Mesos' use of config vars to the new ConfigEntry class so 
we can a) define all our configs in one place like YARN does, and b) take 
advantage of features like default handling and generics.

  was:https://github.com/apache/spark/pull/14414#discussion_r73032190
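
For context, a minimal sketch of the ConfigEntry pattern being referred to (the key, doc string and default below are illustrative assumptions, not the actual Mesos entries):
{code}
// Declarative config definition via ConfigBuilder, yielding a typed
// ConfigEntry with a doc string and a default value, defined in one place.
import org.apache.spark.internal.config.ConfigBuilder

private[spark] object MesosConfig {
  val COARSE_MODE = ConfigBuilder("spark.mesos.coarse")
    .doc("Whether to run Spark on Mesos in coarse-grained mode.")
    .booleanConf
    .createWithDefault(true)
}
{code}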


> Migrate Mesos configs to use ConfigEntry
> 
>
> Key: SPARK-16881
> URL: https://issues.apache.org/jira/browse/SPARK-16881
> Project: Spark
>  Issue Type: Task
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>Priority: Minor
>
> https://github.com/apache/spark/pull/14414#discussion_r73032190
> We'd like to migrate Mesos' use of config vars to the new ConfigEntry class 
> so we can a) define all our configs in one place like YARN does, and b) take 
> advantage of features like default handling and generics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16857) CrossValidator and KMeans throws IllegalArgumentException

2016-08-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16857.
---
Resolution: Not A Problem

The right way to do this is to evaluate based on the intra-cluster distance (or 
silhouette coefficient or whatever), not as a multi-class evaluation problem. I 
don't actually know that this is implemented, but that's the way forward.
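
To make that concrete, a hedged sketch of scoring k-means by within-cluster sum of squared errors instead of a classification metric (assumes a Spark version whose ml {{KMeansModel}} exposes {{computeCost}}; column and variable names are illustrative):
{code}
// Fit k-means and evaluate it with WSSSE (lower means tighter clusters),
// rather than feeding integer cluster ids into a multiclass evaluator.
import org.apache.spark.ml.clustering.KMeans

val kmeans = new KMeans().setK(2).setFeaturesCol("indexedFeatures")
val model = kmeans.fit(trainingData)
val wssse = model.computeCost(trainingData)
println(s"Within-cluster sum of squared errors = $wssse")
{code}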

> CrossValidator and KMeans throws IllegalArgumentException
> -
>
> Key: SPARK-16857
> URL: https://issues.apache.org/jira/browse/SPARK-16857
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.1
> Environment: spark-jobserver docker image.  Spark 1.6.1 on ubuntu, 
> Hadoop 2.4
>Reporter: Ryan Claussen
>
> I am attempting to use CrossValidation to train KMeans model. When I attempt 
> to fit the data spark throws an IllegalArgumentException as below since the 
> KMeans algorithm outputs an Integer into the prediction column instead of a 
> Double.   Before I go too far:  is using CrossValidation with Kmeans 
> supported?
> Here's the exception:
> {quote}
> java.lang.IllegalArgumentException: requirement failed: Column prediction 
> must be of type DoubleType but was actually IntegerType.
>  at scala.Predef$.require(Predef.scala:233)
>  at 
> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
>  at 
> org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:74)
>  at 
> org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:109)
>  at 
> org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:99)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>  at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:99)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.generateKMeans(SparkModelJob.scala:202)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:62)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:39)
>  at 
> spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> Here is the code I'm using to set up my cross validator.  As the stack trace 
> above indicates it is failing at the fit step when 
> {quote}
> ...
> val mpc = new KMeans().setK(2).setFeaturesCol("indexedFeatures")
> val labelConverter = new 
> IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
> val pipeline = new Pipeline().setStages(Array(labelIndexer, 
> featureIndexer, mpc, labelConverter))
> val evaluator = new 
> MulticlassClassificationEvaluator().setLabelCol("approvedIndex").setPredictionCol("prediction")
> val paramGrid = new ParamGridBuilder().addGrid(mpc.maxIter, Array(100, 
> 200, 500)).build()
> val cv = new 
> CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3)
> val cvModel = cv.fit(trainingData)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16881) Migrate Mesos configs to use ConfigEntry

2016-08-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406312#comment-15406312
 ] 

Sean Owen commented on SPARK-16881:
---

(Worth summarizing the change or just at least copying the two relevant 
comments here.)

> Migrate Mesos configs to use ConfigEntry
> 
>
> Key: SPARK-16881
> URL: https://issues.apache.org/jira/browse/SPARK-16881
> Project: Spark
>  Issue Type: Task
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>Priority: Minor
>
> https://github.com/apache/spark/pull/14414#discussion_r73032190



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16857) CrossValidator and KMeans throws IllegalArgumentException

2016-08-03 Thread Ryan Claussen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406311#comment-15406311
 ] 

Ryan Claussen commented on SPARK-16857:
---

Basically I want to use the {{CrossValidator}} as a way to tune the k-means 
hyperparameters. I understand that I'm clustering and not classifying, but I 
thought I might be able to emulate the same pipeline for k-means by using a 
subclass of {{Evaluator}}. Does this mean Spark might be missing an evaluator 
for clustering, or is hyperparameter tuning just something that isn't done with 
k-means?

> CrossValidator and KMeans throws IllegalArgumentException
> -
>
> Key: SPARK-16857
> URL: https://issues.apache.org/jira/browse/SPARK-16857
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.1
> Environment: spark-jobserver docker image.  Spark 1.6.1 on ubuntu, 
> Hadoop 2.4
>Reporter: Ryan Claussen
>
> I am attempting to use CrossValidation to train KMeans model. When I attempt 
> to fit the data spark throws an IllegalArgumentException as below since the 
> KMeans algorithm outputs an Integer into the prediction column instead of a 
> Double.   Before I go too far:  is using CrossValidation with Kmeans 
> supported?
> Here's the exception:
> {quote}
> java.lang.IllegalArgumentException: requirement failed: Column prediction 
> must be of type DoubleType but was actually IntegerType.
>  at scala.Predef$.require(Predef.scala:233)
>  at 
> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
>  at 
> org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:74)
>  at 
> org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:109)
>  at 
> org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:99)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>  at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:99)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.generateKMeans(SparkModelJob.scala:202)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:62)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:39)
>  at 
> spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> Here is the code I'm using to set up my cross validator.  As the stack trace 
> above indicates it is failing at the fit step when 
> {quote}
> ...
> val mpc = new KMeans().setK(2).setFeaturesCol("indexedFeatures")
> val labelConverter = new 
> IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
> val pipeline = new Pipeline().setStages(Array(labelIndexer, 
> featureIndexer, mpc, labelConverter))
> val evaluator = new 
> MulticlassClassificationEvaluator().setLabelCol("approvedIndex").setPredictionCol("prediction")
> val paramGrid = new ParamGridBuilder().addGrid(mpc.maxIter, Array(100, 
> 200, 500)).build()
> val cv = new 
> CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3)
> val cvModel = cv.fit(trainingData)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-16841) Improves the row level metrics performance when reading Parquet table

2016-08-03 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong closed SPARK-16841.
--
Resolution: Not A Problem

> Improves the row level metrics performance when reading Parquet table
> -
>
> Key: SPARK-16841
> URL: https://issues.apache.org/jira/browse/SPARK-16841
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sean Zhong
>
> When reading Parquet table, Spark adds row level metrics like recordsRead, 
> bytesRead 
> (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93).
> The implementation is not very efficient. When parquet vectorized reader is 
> not used, it may take 20% of read time to update these metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16841) Improves the row level metrics performance when reading Parquet table

2016-08-03 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406306#comment-15406306
 ] 

Sean Zhong edited comment on SPARK-16841 at 8/3/16 5:54 PM:


This jira is created after analyzing the performance impact of 
https://github.com/apache/spark/pull/12352, which added the row level metrics 
and caused 15% performance regression. And I can verify the performance 
regression consistently by comparing code before and after 
https://github.com/apache/spark/pull/12352.

But the problem is that I cannot reproduce the same performance regression 
consistently on Spark trunk code, the performance improvement after the fix on 
trunk varies a lot (sometimes 5%, sometimes 20%, sometimes not obvious). The 
phenomenon I observed is that when running the same benchmark code repeatedly 
in same spark shell for 100 times, the time it takes for each run doesn't 
converge, and I cannot get an  exact performance number.

For example, if we run the below code for 100 times,
{code}
spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()
{code}

I observed:
1. For the first run, it may take > 9000 ms
2. Then for the next few runs, it is much faster, around 4700ms
3. After that, the performance suddenly becomes worse. It may take around 8500 
ms for each run.

I guess the phenomenon has something to do with Java JIT and our codegen logic 
(Because of codegen, we are creating new class type for each run in 
spark-shell, which may impact code cache).

Since I cannot verify this improvement consistently on trunk, I am going to 
close this jira.


was (Author: clockfly):
This jira is created after analyzing the performance impact of 
https://github.com/apache/spark/pull/12352, which added the row level metrics 
and caused 15% performance regression. And I can verify the performance 
regression consistently by comparing performance code before and after 
https://github.com/apache/spark/pull/12352.

But the problem is that I cannot reproduce the same performance regression 
consistently on Spark trunk code, the performance improvement after the fix on 
trunk varies a lot (sometimes 5%, sometimes 20%, sometimes not obvious). The 
phenomenon I observed is that when running the same benchmark code repeatedly 
in same spark shell for 100 times, the time it takes for each run doesn't 
converge, and I cannot get an  exact performance number.

For example, if we run the below code for 100 times,
{code}
spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()
{code}

I observed:
1. For the first run, it may take > 9000 ms
2. Then for the next few runs, it is much faster, around 4700ms
3. After that, the performance suddenly becomes worse. It may take around 8500 
ms for each run.

I guess the phenomenon has something to do with Java JIT and our codegen logic 
(Because of codegen, we are creating new class type for each run in 
spark-shell, which may impact code cache).

Since I cannot verify this improvement consistently on trunk, I am going to 
close this jira.

> Improves the row level metrics performance when reading Parquet table
> -
>
> Key: SPARK-16841
> URL: https://issues.apache.org/jira/browse/SPARK-16841
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sean Zhong
>
> When reading Parquet table, Spark adds row level metrics like recordsRead, 
> bytesRead 
> (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93).
> The implementation is not very efficient. When parquet vectorized reader is 
> not used, it may take 20% of read time to update these metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16841) Improves the row level metrics performance when reading Parquet table

2016-08-03 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406306#comment-15406306
 ] 

Sean Zhong commented on SPARK-16841:


This PR is created after analyzing the performance impact of 
https://github.com/apache/spark/pull/12352, which added the row level metrics 
and caused 15% performance regression. And I can verify the performance 
regression consistently by comparing performance code before and after 
https://github.com/apache/spark/pull/12352.

But the problem is that I cannot reproduce the same performance regression 
consistently on Spark trunk code, the performance improvement after the fix on 
trunk varies a lot (sometimes 5%, sometimes 20%, sometimes not obvious). The 
phenomenon I observed is that when running the same benchmark code repeatedly 
in same spark shell for 100 times, the time it takes for each run doesn't 
converge, and I cannot get an  exact performance number.

For example, if we run the below code for 100 times,
{code}
spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()
{code}

I observed:
1. For the first run, it may take > 9000 ms
2. Then for the next few runs, it is much faster, around 4700ms
3. After that, the performance suddenly becomes worse. It may take around 8500 
ms for each run.

I guess the phenomenon has something to do with Java JIT and our codegen logic 
(Because of codegen, we are creating new class type for each run in 
spark-shell, which may impact code cache).

Since I cannot verify this improvement consistently on trunk, I am going to 
close this jira.
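
For reference, the shape of the repeated measurement described above, as it might be run in spark-shell (the timing harness is an assumption, not the exact benchmark used here):
{code}
// Run the same query 100 times and record wall-clock time per run, to see
// whether the per-run timings converge.
val timesMs = (1 to 100).map { _ =>
  val start = System.nanoTime()
  spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()
  (System.nanoTime() - start) / 1e6
}
println(s"min = ${timesMs.min} ms, max = ${timesMs.max} ms")
{code}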

> Improves the row level metrics performance when reading Parquet table
> -
>
> Key: SPARK-16841
> URL: https://issues.apache.org/jira/browse/SPARK-16841
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sean Zhong
>
> When reading Parquet table, Spark adds row level metrics like recordsRead, 
> bytesRead 
> (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93).
> The implementation is not very efficient. When parquet vectorized reader is 
> not used, it may take 20% of read time to update these metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16841) Improves the row level metrics performance when reading Parquet table

2016-08-03 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406306#comment-15406306
 ] 

Sean Zhong edited comment on SPARK-16841 at 8/3/16 5:54 PM:


This jira is created after analyzing the performance impact of 
https://github.com/apache/spark/pull/12352, which added the row level metrics 
and caused 15% performance regression. And I can verify the performance 
regression consistently by comparing performance code before and after 
https://github.com/apache/spark/pull/12352.

But the problem is that I cannot reproduce the same performance regression 
consistently on Spark trunk code, the performance improvement after the fix on 
trunk varies a lot (sometimes 5%, sometimes 20%, sometimes not obvious). The 
phenomenon I observed is that when running the same benchmark code repeatedly 
in same spark shell for 100 times, the time it takes for each run doesn't 
converge, and I cannot get an  exact performance number.

For example, if we run the below code for 100 times,
{code}
spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()
{code}

I observed:
1. For the first run, it may take > 9000 ms
2. Then for the next few runs, it is much faster, around 4700ms
3. After that, the performance suddenly becomes worse. It may take around 8500 
ms for each run.

I guess the phenomenon has something to do with Java JIT and our codegen logic 
(Because of codegen, we are creating new class type for each run in 
spark-shell, which may impact code cache).

Since I cannot verify this improvement consistently on trunk, I am going to 
close this jira.


was (Author: clockfly):
This PR is created after analyzing the performance impact of 
https://github.com/apache/spark/pull/12352, which added the row level metrics 
and caused 15% performance regression. And I can verify the performance 
regression consistently by comparing performance code before and after 
https://github.com/apache/spark/pull/12352.

But the problem is that I cannot reproduce the same performance regression 
consistently on Spark trunk code, the performance improvement after the fix on 
trunk varies a lot (sometimes 5%, sometimes 20%, sometimes not obvious). The 
phenomenon I observed is that when running the same benchmark code repeatedly 
in same spark shell for 100 times, the time it takes for each run doesn't 
converge, and I cannot get an  exact performance number.

For example, if we run the below code for 100 times,
{code}
spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()
{code}

I observed:
1. For the first run, it may take > 9000 ms
2. Then for the next few runs, it is much faster, around 4700ms
3. After that, the performance suddenly becomes worse. It may take around 8500 
ms for each run.

I guess the phenomenon has something to do with Java JIT and our codegen logic 
(Because of codegen, we are creating new class type for each run in 
spark-shell, which may impact code cache).

Since I cannot verify this improvement consistently on trunk, I am going to 
close this jira.

> Improves the row level metrics performance when reading Parquet table
> -
>
> Key: SPARK-16841
> URL: https://issues.apache.org/jira/browse/SPARK-16841
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sean Zhong
>
> When reading Parquet table, Spark adds row level metrics like recordsRead, 
> bytesRead 
> (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93).
> The implementation is not very efficient. When parquet vectorized reader is 
> not used, it may take 20% of read time to update these metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16881) Migrate Mesos configs to use ConfigEntry

2016-08-03 Thread Michael Gummelt (JIRA)
Michael Gummelt created SPARK-16881:
---

 Summary: Migrate Mesos configs to use ConfigEntry
 Key: SPARK-16881
 URL: https://issues.apache.org/jira/browse/SPARK-16881
 Project: Spark
  Issue Type: Task
  Components: Mesos
Affects Versions: 2.0.0
Reporter: Michael Gummelt
Priority: Minor


https://github.com/apache/spark/pull/14414#discussion_r73032190



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2016-08-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406245#comment-15406245
 ] 

Sean Owen commented on SPARK-6305:
--

Mostly to silence noisy libraries manually (i.e. still taking effect even if 
the user provides custom config) and to set up friendlier logging in standalone 
examples. You can probably search for usages in an IDE to get a flavor of it. 
Hey, maybe you'll see usages that are pointless or for which there's a better 
approach.
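
For example, the kind of call involved looks roughly like this (a sketch of the log4j 1.x API Spark currently programs against; the logger name is illustrative):
{code}
// Programmatically quiet a chatty third-party logger, regardless of the
// user's own log4j configuration.
import org.apache.log4j.{Level, Logger}

Logger.getLogger("org.eclipse.jetty").setLevel(Level.WARN)
{code}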

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2016-08-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406207#comment-15406207
 ] 

Mikael Ståldal commented on SPARK-6305:
---

Yes, if you need to set logging level, then you need a concrete logging 
implementation, and not (only) slf4j-api.

Why does Spark need to set logging level?

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2016-08-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406191#comment-15406191
 ] 

Sean Owen commented on SPARK-6305:
--

How would you set logging levels in slf4j though? I thought it didn't expose 
that.
I don't think log4j 2 lacks functionality, it's just that it was exposed in 
different ways, and took some work to translate, not entirely directly. That 
wasn't a big deal.

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16880) Improve ANN training, add training data persist if needed

2016-08-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16880:


Assignee: Apache Spark

> Improve ANN training, add training data persist if needed
> -
>
> Key: SPARK-16880
> URL: https://issues.apache.org/jira/browse/SPARK-16880
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Weichen Xu
>Assignee: Apache Spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The ANN layer training does not persist the input data RDD,
> which can cause overhead if the RDD needs to be recomputed from its lineage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16880) Improve ANN training, add training data persist if needed

2016-08-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16880:


Assignee: (was: Apache Spark)

> Improve ANN training, add training data persist if needed
> -
>
> Key: SPARK-16880
> URL: https://issues.apache.org/jira/browse/SPARK-16880
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The ANN layer training does not persist the input data RDD,
> which can cause overhead if the RDD needs to be recomputed from its lineage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16880) Improve ANN training, add training data persist if needed

2016-08-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406156#comment-15406156
 ] 

Apache Spark commented on SPARK-16880:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/14483

> Improve ANN training, add training data persist if needed
> -
>
> Key: SPARK-16880
> URL: https://issues.apache.org/jira/browse/SPARK-16880
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The ANN layer training does not persist the input data RDD,
> which can cause overhead if the RDD needs to be recomputed from its lineage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16880) Improve ANN training, add training data persist if needed

2016-08-03 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-16880:
---
Component/s: MLlib
 ML

> Improve ANN training, add training data persist if needed
> -
>
> Key: SPARK-16880
> URL: https://issues.apache.org/jira/browse/SPARK-16880
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The ANN layer training does not persist the input data RDD,
> which can cause overhead if the RDD needs to be recomputed from its lineage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16880) Improve ANN training, add training data persist if needed

2016-08-03 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-16880:
--

 Summary: Improve ANN training, add training data persist if needed
 Key: SPARK-16880
 URL: https://issues.apache.org/jira/browse/SPARK-16880
 Project: Spark
  Issue Type: Bug
Reporter: Weichen Xu


The ANN layer training does not persist the input data RDD,
which can cause overhead if the RDD needs to be recomputed from its lineage.
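
A minimal sketch of the general pattern the fix implies: cache the training data before the iterative optimization and release it afterwards (names are illustrative, not the actual trainer code):
{code}
// Persist the input only if the caller has not already done so, run the
// iterative training, then unpersist to free the cached blocks.
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def withPersisted[T, R](data: RDD[T])(train: RDD[T] => R): R = {
  val needsCaching = data.getStorageLevel == StorageLevel.NONE
  if (needsCaching) data.persist(StorageLevel.MEMORY_AND_DISK)
  try train(data) finally { if (needsCaching) data.unpersist() }
}
{code}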



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2016-08-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406148#comment-15406148
 ] 

Mikael Ståldal commented on SPARK-6305:
---

Did you know that Log4j 2.x has its own bridge from Log4j 1.x (not using 
slf4j)?

See: http://logging.apache.org/log4j/2.x/manual/migration.html

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2016-08-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406142#comment-15406142
 ] 

Mikael Ståldal commented on SPARK-6305:
---

Which methods in Log4j 1.x are you missing counterparts for in Log4j 2.x?

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2016-08-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406140#comment-15406140
 ] 

Mikael Ståldal commented on SPARK-6305:
---

Would it be possible for Spark to depend on slf4j-api only (and leave it up to 
any application using it to choose the concrete logging implementation)?

Or does Spark itself need a concrete logging implementation? Why?

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-08-03 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406125#comment-15406125
 ] 

Michael Allman commented on SPARK-16320:


Hi [~maver1ck]. These are excellent findings! I'm curious to see a few more 
benchmarks. Could you add the two missing timings for Spark 2.0 G1GC, and 
another column for Spark 2.0 G1GC with the PR?

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> 
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> 
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> 
> def create_sample_data(levels, rows, path):
> 
>     def _create_column_data(cols):
>         import random
>         random.seed()
>         return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in range(cols)}
> 
>     def _create_sample_df(cols, rows):
>         rdd = sc.parallelize(range(rows))
>         data = rdd.map(lambda r: _create_column_data(cols))
>         df = sqlctx.createDataFrame(data)
>         return df
> 
>     def _create_nested_data(levels, rows):
>         if len(levels) == 1:
>             return _create_sample_df(levels[0], rows).cache()
>         else:
>             df = _create_nested_data(levels[1:], rows)
>             return df.select([struct(df.columns).alias("column{}".format(i)) for i in range(levels[0])])
> 
>     df = _create_nested_data(levels, rows)
>     df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> 
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same 
> source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16879) unify logical plans for CREATE TABLE and CTAS

2016-08-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16879:


Assignee: Apache Spark  (was: Wenchen Fan)

> unify logical plans for CREATE TABLE and CTAS
> -
>
> Key: SPARK-16879
> URL: https://issues.apache.org/jira/browse/SPARK-16879
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


