[jira] [Commented] (SPARK-12705) Sorting column can't be resolved if it's not in projection

2016-01-09 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090493#comment-15090493
 ] 

Xiao Li commented on SPARK-12705:
-

Will try to fix it this weekend. Thanks!

> Sorting column can't be resolved if it's not in projection
> --
>
> Key: SPARK-12705
> URL: https://issues.apache.org/jira/browse/SPARK-12705
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>
> The following query can't be resolved:
> ```
> scala> sqlContext.sql("select sum(a) over ()  from (select 1 as a, 2 as b) t 
> order by b").explain()
> org.apache.spark.sql.AnalysisException: cannot resolve 'b' given input 
> columns: [_c0]; line 1 pos 63
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:282)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:109)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:119)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:123)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> ``` 
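A possible interim workaround (a hedged sketch only, not the analyzer fix; the alias {{s}} is illustrative) is to keep the sort column in the projection and drop it afterwards:

{code}
// Hedged workaround sketch: carry the sort column through the projection,
// sort, then project it away again.
val df = sqlContext.sql(
  "select sum(a) over () as s, b from (select 1 as a, 2 as b) t order by b")
df.select("s").show()
{code}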



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12731) PySpark docstring cleanup

2016-01-09 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090507#comment-15090507
 ] 

holdenk commented on SPARK-12731:
-

cc [~josephkb] based on our chat on my other PR

> PySpark docstring cleanup
> -
>
> Key: SPARK-12731
> URL: https://issues.apache.org/jira/browse/SPARK-12731
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Reporter: holdenk
>Priority: Trivial
>
> We don't currently have any automated checks that our PySpark docstring lines 
> are within the pep8/275/276 length limits (since the pep8 checker doesn't 
> handle this). As a result there are ~400 non-conformant docstring lines. This 
> JIRA is to fix those docstring lines and to add a check to the Python lint 
> step that fails on long lines.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12426) Docker JDBC integration tests are failing again

2016-01-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090534#comment-15090534
 ] 

Sean Owen commented on SPARK-12426:
---

I'm not sure how to grant edit access (I don't think I'm an admin), but I do 
have edit access myself. If you send me your edits, I can apply them.

> Docker JDBC integration tests are failing again
> ---
>
> Key: SPARK-12426
> URL: https://issues.apache.org/jira/browse/SPARK-12426
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.6.0
>Reporter: Mark Grover
>
> The Docker JDBC integration tests were fixed in SPARK-11796 but they seem to 
> be failing again on my machine (Ubuntu Precise). This is the same box I 
> tested my previous commit on. Also, I am not confident this failure has much 
> to do with Spark, since a commit where the tests are known to have passed now 
> fails in the same environment.
> [~sowen] mentioned on the Spark 1.6 voting thread that the tests were failing 
> on his Ubuntu 15 box as well.
> Here's the error, FYI:
> {code}
> 15/12/18 10:12:50 INFO SparkContext: Successfully stopped SparkContext
> 15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Shutting 
> down remote daemon.
> 15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Remote 
> daemon shut down; proceeding with flushing remote transports.
> *** RUN ABORTED ***
>   com.spotify.docker.client.DockerException: 
> java.util.concurrent.ExecutionException: 
> com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: 
> java.io.IOException: No such file or directory
>   at 
> com.spotify.docker.client.DefaultDockerClient.propagate(DefaultDockerClient.java:1141)
>   at 
> com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1082)
>   at 
> com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)
>   at 
> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)
>   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)
>   at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492)
>   at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528)
>   ...
>   Cause: java.util.concurrent.ExecutionException: 
> com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: 
> java.io.IOException: No such file or directory
>   at 
> jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
>   at 
> jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
>   at 
> jersey.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>   at 
> com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1080)
>   at 
> com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)
>   at 
> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)
>   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)
>   ...
>   Cause: com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: 
> java.io.IOException: No such file or directory
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnector.apply(ApacheConnector.java:481)
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnector$1.run(ApacheConnector.java:491)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> jersey.repackaged.com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:299)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
>   at 
> jersey.repackaged.com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:50)
>   at 
> jersey.repackaged.com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:37)
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnector.apply(ApacheConnector.java:487)
> 15/1

[jira] [Commented] (SPARK-12711) ML StopWordsRemover does not protect itself from column name duplication

2016-01-09 Thread Wojciech Jurczyk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090535#comment-15090535
 ] 

Wojciech Jurczyk commented on SPARK-12711:
--

[~josephkb] Is there any particular reason why StopWordsRemover is not a 
UnaryTransformer? As the docs say, a UnaryTransformer is an "Abstract class 
for transformers that take one input column, apply transformation, and output 
the result as a new column", which is exactly this case. Moreover, the 
UnaryTransformer implementation already checks whether the output column 
exists, so making StopWordsRemover a UnaryTransformer would solve this issue. 
Speaking of UnaryTransformer candidates, I think StringIndexer is a similar 
case (and there are probably other Transformers that could be 
UnaryTransformers): it doesn't check whether the output column already exists 
in the input DataFrame, so it has the same flaw, and making it a 
UnaryTransformer would fix that as well. What do you think?
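For illustration, a minimal sketch of what a UnaryTransformer-based remover 
could look like against the 1.6 ml API (the class name is hypothetical, and 
the caseSensitive param, default stop word list and persistence support are 
omitted); the duplicate-output-column check would then be inherited from 
UnaryTransformer.transformSchema:

{code}
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.{ParamMap, StringArrayParam}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

// Hypothetical sketch, not the existing StopWordsRemover.
class UnaryStopWordsRemover(override val uid: String)
  extends UnaryTransformer[Seq[String], Seq[String], UnaryStopWordsRemover] {

  def this() = this(Identifiable.randomUID("stopWords"))

  // The words to filter out.
  val stopWords = new StringArrayParam(this, "stopWords", "the words to filter out")
  def setStopWords(value: Array[String]): this.type = set(stopWords, value)

  override protected def createTransformFunc: Seq[String] => Seq[String] = {
    val toRemove = $(stopWords).toSet
    terms => terms.filterNot(toRemove.contains)
  }

  override protected def validateInputType(inputType: DataType): Unit = inputType match {
    case ArrayType(StringType, _) => // ok
    case other => throw new IllegalArgumentException(
      s"Input type must be ArrayType(StringType) but got $other.")
  }

  override protected def outputDataType: DataType = ArrayType(StringType)

  override def copy(extra: ParamMap): UnaryStopWordsRemover = defaultCopy(extra)
}
{code}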

> ML StopWordsRemover does not protect itself from column name duplication
> 
>
> Key: SPARK-12711
> URL: https://issues.apache.org/jira/browse/SPARK-12711
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Grzegorz Chilkiewicz
>Priority: Trivial
>  Labels: ml, mllib, newbie, suggestion
>
> At work we were taking a closer look at ML transformers & estimators and I 
> spotted this anomaly.
> At first glance, the resolution looks simple:
> Add the following line to StopWordsRemover.transformSchema (as is done in 
> e.g. PCA.transformSchema, StandardScaler.transformSchema and 
> OneHotEncoder.transformSchema):
> {code}
> require(!schema.fieldNames.contains($(outputCol)), s"Output column 
> ${$(outputCol)} already exists.")
> {code}
> Am I correct? Is this a bug? If yes, I am willing to prepare an appropriate 
> pull request.
> Maybe a better idea is to make use of super.transformSchema in 
> StopWordsRemover (and possibly in all the other places)?
> Links to the files on GitHub mentioned above:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala#L147
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala#L109-L111
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala#L101-L102
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L138-L139
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala#L75-L76



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12736) Standalone Master cannot be started due to NoClassDefFoundError: org/spark-project/guava/collect/Maps

2016-01-09 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-12736:
---

 Summary: Standalone Master cannot be started due to 
NoClassDefFoundError: org/spark-project/guava/collect/Maps
 Key: SPARK-12736
 URL: https://issues.apache.org/jira/browse/SPARK-12736
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Spark Core
Affects Versions: 2.0.0
Reporter: Jacek Laskowski


After 
https://github.com/apache/spark/commit/659fd9d04b988d48960eac4f352ca37066f43f5c 
starting standalone Master (using {{./sbin/start-master.sh}}) fails with the 
following exception:

{code}
Spark Command: /Library/Java/JavaVirtualMachines/Current/Contents/Home/bin/java
-cp 
/Users/jacek/dev/oss/spark/conf/:/Users/jacek/dev/oss/spark/assembly/target/scala-2.11/spark-assembly-2.0.0-SNAPSHOT-hadoop2.7.1.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
-Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip japila.local
--port 7077 --webui-port 8080

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
Exception in thread "main" java.lang.NoClassDefFoundError:
org/spark-project/guava/collect/Maps
at 
org.apache.hadoop.metrics2.lib.MetricsRegistry.(MetricsRegistry.java:42)
at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.(MetricsSystemImpl.java:94)
at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.(MetricsSystemImpl.java:141)
at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.(DefaultMetricsSystem.java:38)
at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.(DefaultMetricsSystem.java:36)
at 
org.apache.hadoop.security.UserGroupInformation$UgiMetrics.create(UserGroupInformation.java:120)
at 
org.apache.hadoop.security.UserGroupInformation.(UserGroupInformation.java:236)
at 
org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2156)
at 
org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2156)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2156)
at org.apache.spark.SecurityManager.(SecurityManager.scala:214)
at 
org.apache.spark.deploy.master.Master$.startRpcEnvAndEndpoint(Master.scala:1108)
at org.apache.spark.deploy.master.Master$.main(Master.scala:1093)
at org.apache.spark.deploy.master.Master.main(Master.scala)
Caused by: java.lang.ClassNotFoundException:
org.spark-project.guava.collect.Maps
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 15 more
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12736) Standalone Master cannot be started due to NoClassDefFoundError: org/spark-project/guava/collect/Maps

2016-01-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12736:


Assignee: Apache Spark

> Standalone Master cannot be started due to NoClassDefFoundError: 
> org/spark-project/guava/collect/Maps
> -
>
> Key: SPARK-12736
> URL: https://issues.apache.org/jira/browse/SPARK-12736
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Assignee: Apache Spark
>
> After 
> https://github.com/apache/spark/commit/659fd9d04b988d48960eac4f352ca37066f43f5c
>  starting standalone Master (using {{./sbin/start-master.sh}}) fails with the 
> following exception:
> {code}
> Spark Command: 
> /Library/Java/JavaVirtualMachines/Current/Contents/Home/bin/java
> -cp 
> /Users/jacek/dev/oss/spark/conf/:/Users/jacek/dev/oss/spark/assembly/target/scala-2.11/spark-assembly-2.0.0-SNAPSHOT-hadoop2.7.1.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
> -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip japila.local
> --port 7077 --webui-port 8080
> 
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/spark-project/guava/collect/Maps
> at 
> org.apache.hadoop.metrics2.lib.MetricsRegistry.(MetricsRegistry.java:42)
> at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.(MetricsSystemImpl.java:94)
> at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.(MetricsSystemImpl.java:141)
> at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.(DefaultMetricsSystem.java:38)
> at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.(DefaultMetricsSystem.java:36)
> at 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.create(UserGroupInformation.java:120)
> at 
> org.apache.hadoop.security.UserGroupInformation.(UserGroupInformation.java:236)
> at 
> org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2156)
> at 
> org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2156)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2156)
> at org.apache.spark.SecurityManager.(SecurityManager.scala:214)
> at 
> org.apache.spark.deploy.master.Master$.startRpcEnvAndEndpoint(Master.scala:1108)
> at org.apache.spark.deploy.master.Master$.main(Master.scala:1093)
> at org.apache.spark.deploy.master.Master.main(Master.scala)
> Caused by: java.lang.ClassNotFoundException:
> org.spark-project.guava.collect.Maps
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 15 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12736) Standalone Master cannot be started due to NoClassDefFoundError: org/spark-project/guava/collect/Maps

2016-01-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12736:


Assignee: (was: Apache Spark)

> Standalone Master cannot be started due to NoClassDefFoundError: 
> org/spark-project/guava/collect/Maps
> -
>
> Key: SPARK-12736
> URL: https://issues.apache.org/jira/browse/SPARK-12736
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>
> After 
> https://github.com/apache/spark/commit/659fd9d04b988d48960eac4f352ca37066f43f5c
>  starting standalone Master (using {{./sbin/start-master.sh}}) fails with the 
> following exception:
> {code}
> Spark Command: 
> /Library/Java/JavaVirtualMachines/Current/Contents/Home/bin/java
> -cp 
> /Users/jacek/dev/oss/spark/conf/:/Users/jacek/dev/oss/spark/assembly/target/scala-2.11/spark-assembly-2.0.0-SNAPSHOT-hadoop2.7.1.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
> -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip japila.local
> --port 7077 --webui-port 8080
> 
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/spark-project/guava/collect/Maps
> at 
> org.apache.hadoop.metrics2.lib.MetricsRegistry.(MetricsRegistry.java:42)
> at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.(MetricsSystemImpl.java:94)
> at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.(MetricsSystemImpl.java:141)
> at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.(DefaultMetricsSystem.java:38)
> at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.(DefaultMetricsSystem.java:36)
> at 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.create(UserGroupInformation.java:120)
> at 
> org.apache.hadoop.security.UserGroupInformation.(UserGroupInformation.java:236)
> at 
> org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2156)
> at 
> org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2156)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2156)
> at org.apache.spark.SecurityManager.(SecurityManager.scala:214)
> at 
> org.apache.spark.deploy.master.Master$.startRpcEnvAndEndpoint(Master.scala:1108)
> at org.apache.spark.deploy.master.Master$.main(Master.scala:1093)
> at org.apache.spark.deploy.master.Master.main(Master.scala)
> Caused by: java.lang.ClassNotFoundException:
> org.spark-project.guava.collect.Maps
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 15 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12736) Standalone Master cannot be started due to NoClassDefFoundError: org/spark-project/guava/collect/Maps

2016-01-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090570#comment-15090570
 ] 

Apache Spark commented on SPARK-12736:
--

User 'jaceklaskowski' has created a pull request for this issue:
https://github.com/apache/spark/pull/10674

> Standalone Master cannot be started due to NoClassDefFoundError: 
> org/spark-project/guava/collect/Maps
> -
>
> Key: SPARK-12736
> URL: https://issues.apache.org/jira/browse/SPARK-12736
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>
> After 
> https://github.com/apache/spark/commit/659fd9d04b988d48960eac4f352ca37066f43f5c
>  starting standalone Master (using {{./sbin/start-master.sh}}) fails with the 
> following exception:
> {code}
> Spark Command: 
> /Library/Java/JavaVirtualMachines/Current/Contents/Home/bin/java
> -cp 
> /Users/jacek/dev/oss/spark/conf/:/Users/jacek/dev/oss/spark/assembly/target/scala-2.11/spark-assembly-2.0.0-SNAPSHOT-hadoop2.7.1.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
> -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip japila.local
> --port 7077 --webui-port 8080
> 
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/spark-project/guava/collect/Maps
> at 
> org.apache.hadoop.metrics2.lib.MetricsRegistry.(MetricsRegistry.java:42)
> at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.(MetricsSystemImpl.java:94)
> at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.(MetricsSystemImpl.java:141)
> at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.(DefaultMetricsSystem.java:38)
> at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.(DefaultMetricsSystem.java:36)
> at 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.create(UserGroupInformation.java:120)
> at 
> org.apache.hadoop.security.UserGroupInformation.(UserGroupInformation.java:236)
> at 
> org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2156)
> at 
> org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2156)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2156)
> at org.apache.spark.SecurityManager.(SecurityManager.scala:214)
> at 
> org.apache.spark.deploy.master.Master$.startRpcEnvAndEndpoint(Master.scala:1108)
> at org.apache.spark.deploy.master.Master$.main(Master.scala:1093)
> at org.apache.spark.deploy.master.Master.main(Master.scala)
> Caused by: java.lang.ClassNotFoundException:
> org.spark-project.guava.collect.Maps
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 15 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12729) phantom references to replace the finalize call in python broadcast

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12729:
--
Component/s: p

> phantom references to replace the finalize call in python broadcast
> ---
>
> Key: SPARK-12729
> URL: https://issues.apache.org/jira/browse/SPARK-12729
> Project: Spark
>  Issue Type: Improvement
>  Components: p
>Reporter: Davies Liu
>
> It is doing IO operations and blocking the GC thread; see 
> http://resources.ej-technologies.com/jprofiler/help/doc/index.html#jprofiler.helptopics.cpu.finalizers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12726) ParquetConversions doesn't always propagate metastore table identifier to ParquetRelation

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12726:
--
Component/s: SQL

> ParquetConversions doesn't always propagate metastore table identifier to 
> ParquetRelation
> -
>
> Key: SPARK-12726
> URL: https://issues.apache.org/jira/browse/SPARK-12726
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Lian
>
> (I hit this issue while working on SPARK-12593, but haven't got time to 
> investigate it. Will fill more details when I get some clue.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12729) phantom references to replace the finalize call in python broadcast

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12729:
--
Component/s: (was: p)
 PySpark

[~davies] please just tag your issues with the component "PySpark".

> phantom references to replace the finalize call in python broadcast
> ---
>
> Key: SPARK-12729
> URL: https://issues.apache.org/jira/browse/SPARK-12729
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Davies Liu
>
> It is doing IO operations and blocking the GC thread; see 
> http://resources.ej-technologies.com/jprofiler/help/doc/index.html#jprofiler.helptopics.cpu.finalizers
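For reference, a minimal sketch of the phantom-reference technique 
(illustrative only, not Spark's implementation; the object and thread names 
are made up): register each broadcast wrapper with a ReferenceQueue and let a 
dedicated daemon thread do the file deletion, so no IO happens on the 
GC/finalizer thread.

{code}
import java.lang.ref.{PhantomReference, ReferenceQueue}
import java.nio.file.{Files, Paths}
import scala.collection.mutable

// Hypothetical sketch of phantom-reference based cleanup.
object BroadcastFileCleaner {
  private val queue = new ReferenceQueue[AnyRef]()
  // Keep strong references to the PhantomReferences themselves, otherwise they
  // are collected before they are ever enqueued.
  private val pending = mutable.Map.empty[PhantomReference[AnyRef], String]

  def register(broadcast: AnyRef, path: String): Unit = synchronized {
    pending(new PhantomReference[AnyRef](broadcast, queue)) = path
  }

  private val cleaner = new Thread("python-broadcast-cleaner") {
    setDaemon(true)
    override def run(): Unit = {
      while (true) {
        // Blocks until some registered broadcast object has been collected.
        val ref = queue.remove().asInstanceOf[PhantomReference[AnyRef]]
        val path = BroadcastFileCleaner.synchronized { pending.remove(ref) }
        // The IO now happens on this thread, not on the GC/finalizer thread.
        path.foreach(p => Files.deleteIfExists(Paths.get(p)))
      }
    }
  }
  cleaner.start()
}
{code}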



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12732) Fix LinearRegression.train for the case when label is constant and fitIntercept=false

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12732:
--
Component/s: MLlib

Sounds good [~iyounus], feel free to submit a PR.

> Fix LinearRegression.train for the case when label is constant and 
> fitIntercept=false
> -
>
> Key: SPARK-12732
> URL: https://issues.apache.org/jira/browse/SPARK-12732
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Imran Younus
>Priority: Minor
>
> If the target variable is constant, then linear regression must check whether 
> fitIntercept is true or false and handle the two cases separately.
> If fitIntercept is true, no training is needed and we set the intercept equal 
> to the mean of y.
> But if fitIntercept is false, the model should still be trained.
> Currently, LinearRegression handles both cases in the same way: it doesn't 
> train the model and sets the intercept equal to the mean of y, which means it 
> returns a non-zero intercept even when the user forces the regression through 
> the origin.
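A minimal sketch of the intended branching (a hedged illustration; names like 
{{optimize}} are placeholders, not Spark internals):

{code}
// Hedged sketch: how training should branch when the label is constant.
// `optimize` stands in for the actual solver and is purely illustrative.
def trainConstantLabel(
    yMean: Double,
    fitIntercept: Boolean,
    numFeatures: Int,
    optimize: () => Array[Double]): (Array[Double], Double) = {
  if (fitIntercept) {
    // No training needed: zero coefficients, intercept = mean of the label.
    (Array.fill(numFeatures)(0.0), yMean)
  } else {
    // Regression forced through the origin: the model must still be trained,
    // and the intercept stays at zero.
    (optimize(), 0.0)
  }
}
{code}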



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12712) test-dependencies.sh fails with difference in manifests

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12712.
---
Resolution: Not A Problem

If it's not a problem with the Spark build per se, I'm not sure this is the 
right place. Yes, it seems like you don't have something set up to follow this 
mechanism. Maybe you need to update or rebase your branch?

On the other hand, if you think you see a particular problem with the script, 
or can reproduce in the Spark build, reopen with that.

> test-dependencies.sh fails with difference in manifests
> ---
>
> Key: SPARK-12712
> URL: https://issues.apache.org/jira/browse/SPARK-12712
> Project: Spark
>  Issue Type: Bug
>Reporter: Stavros Kontopoulos
>
> The test-dependencies.sh script fails.
> This relates to https://github.com/apache/spark/pull/10461
> See the failure here:
> https://ci.typesafe.com/job/ghprb-spark-multi-conf/label=Spark-Ora-JDK7-PV,scala_version=2.10/84/console
> My PR does not change dependencies, so shouldn't the PR manifest be generated 
> with the full dependency list? It seems to be empty. Should I use 
> --replace-manifest?
> Reproducing it locally on that Jenkins instance, I get this:
> Spark's published dependencies DO NOT MATCH the manifest file 
> (dev/spark-deps).
> To update the manifest file, run './dev/test-dependencies.sh 
> --replace-manifest'.
> diff --git a/dev/deps/spark-deps-hadoop-2.6 
> b/dev/pr-deps/spark-deps-hadoop-2.6
> index e703c7a..3aa2c38 100644
> --- a/dev/deps/spark-deps-hadoop-2.6
> +++ b/dev/pr-deps/spark-deps-hadoop-2.6
> @@ -1,190 +1,2 @@
> -JavaEWAH-0.3.2.jar
> -RoaringBitmap-0.5.11.jar
> -ST4-4.0.4.jar
> -activation-1.1.1.jar
> -akka-actor_2.10-2.3.11.jar
> -akka-remote_2.10-2.3.11.jar
> -akka-slf4j_2.10-2.3.11.jar
> -antlr-runtime-3.5.2.jar
> -aopalliance-1.0.jar
> -apache-log4j-extras-1.2.17.jar
> -apacheds-i18n-2.0.0-M15.jar
> -apacheds-kerberos-codec-2.0.0-M15.jar
> -api-asn1-api-1.0.0-M20.jar
> -api-util-1.0.0-M20.jar
> -arpack_combined_all-0.1.jar
> -asm-3.1.jar
> -asm-commons-3.1.jar
> -asm-tree-3.1.jar
> -avro-1.7.7.jar
> -avro-ipc-1.7.7-tests.jar
> -avro-ipc-1.7.7.jar
> -avro-mapred-1.7.7-hadoop2.jar
> -base64-2.3.8.jar
> -bcprov-jdk15on-1.51.jar
> -bonecp-0.8.0.RELEASE.jar
> -breeze-macros_2.10-0.11.2.jar
> -breeze_2.10-0.11.2.jar
> -calcite-avatica-1.2.0-incubating.jar
> -calcite-core-1.2.0-incubating.jar
> -calcite-linq4j-1.2.0-incubating.jar
> -chill-java-0.5.0.jar
> -chill_2.10-0.5.0.jar
> -commons-beanutils-1.7.0.jar
> -commons-beanutils-core-1.8.0.jar
> -commons-cli-1.2.jar
> -commons-codec-1.10.jar
> -commons-collections-3.2.2.jar
> -commons-compiler-2.7.6.jar
> -commons-compress-1.4.1.jar
> -commons-configuration-1.6.jar
> -commons-dbcp-1.4.jar
> -commons-digester-1.8.jar
> -commons-httpclient-3.1.jar
> -commons-io-2.4.jar
> -commons-lang-2.6.jar
> -commons-lang3-3.3.2.jar
> -commons-logging-1.1.3.jar
> -commons-math3-3.4.1.jar
> -commons-net-2.2.jar
> -commons-pool-1.5.4.jar
> -compress-lzf-1.0.3.jar
> -config-1.2.1.jar
> -core-1.1.2.jar
> -curator-client-2.6.0.jar
> -curator-framework-2.6.0.jar
> -curator-recipes-2.6.0.jar
> -datanucleus-api-jdo-3.2.6.jar
> -datanucleus-core-3.2.10.jar
> -datanucleus-rdbms-3.2.9.jar
> -derby-10.10.1.1.jar
> -eigenbase-properties-1.1.5.jar
> -geronimo-annotation_1.0_spec-1.1.1.jar
> -geronimo-jaspic_1.0_spec-1.0.jar
> -geronimo-jta_1.1_spec-1.1.1.jar
> -groovy-all-2.1.6.jar
> -gson-2.2.4.jar
> -guice-3.0.jar
> -guice-servlet-3.0.jar
> -hadoop-annotations-2.6.0.jar
> -hadoop-auth-2.6.0.jar
> -hadoop-client-2.6.0.jar
> -hadoop-common-2.6.0.jar
> -hadoop-hdfs-2.6.0.jar
> -hadoop-mapreduce-client-app-2.6.0.jar
> -hadoop-mapreduce-client-common-2.6.0.jar
> -hadoop-mapreduce-client-core-2.6.0.jar
> -hadoop-mapreduce-client-jobclient-2.6.0.jar
> -hadoop-mapreduce-client-shuffle-2.6.0.jar
> -hadoop-yarn-api-2.6.0.jar
> -hadoop-yarn-client-2.6.0.jar
> -hadoop-yarn-common-2.6.0.jar
> -hadoop-yarn-server-common-2.6.0.jar
> -hadoop-yarn-server-web-proxy-2.6.0.jar
> -htrace-core-3.0.4.jar
> -httpclient-4.3.2.jar
> -httpcore-4.3.2.jar
> -ivy-2.4.0.jar
> -jackson-annotations-2.4.4.jar
> -jackson-core-2.4.4.jar
> -jackson-core-asl-1.9.13.jar
> -jackson-databind-2.4.4.jar
> -jackson-jaxrs-1.9.13.jar
> -jackson-mapper-asl-1.9.13.jar
> -jackson-module-scala_2.10-2.4.4.jar
> -jackson-xc-1.9.13.jar
> -janino-2.7.8.jar
> -jansi-1.4.jar
> -java-xmlbuilder-1.0.jar
> -javax.inject-1.jar
> -javax.servlet-3.0.0.v201112011016.jar
> -javolution-5.5.1.jar
> -jaxb-api-2.2.2.jar
> -jaxb-impl-2.2.3-1.jar
> -jcl-over-slf4j-1.7.10.jar
> -jdo-api-3.0.1.jar
> -jersey-client-1.9.jar
> -jersey-core-1.9.jar
> -jersey-guice-1.9.jar
> -jersey-json-1.9.jar
> -jersey-server-1.9.jar
> -jets3t-0.9.3.jar
> -jettison-1.1.jar
> -jetty-6.1.26.jar
> -jetty-all-7.6.0.v20120127.jar
> -jetty-util-6.1.26.jar
> -jline-2.10.5.jar
> -jline-2

[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-01-09 Thread Nikita Tarasenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090588#comment-15090588
 ] 

Nikita Tarasenko commented on SPARK-12177:
--

Hi, Mark!

Of course it is possible to collaborate =)

1. I think I will create a new branch based on master, copy the existing code 
with some changes (as you said) and create a new pull request. Then you can 
issue pull requests against that new branch. Is that ok?

2. Yes, this is a problem. I did not notice it. How can we handle it? I don't 
know how to properly separate those dependencies. Maybe we can put the 
examples for Kafka 0.9 into the kafka-v09 package for as long as the Kafka 0.8 
support exists?

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has been released and it introduces a new consumer API that is not 
> compatible with the old one, so I added support for the new consumer API. I 
> made separate classes in the package org.apache.spark.streaming.kafka.v09 
> with the changed API, and I did not remove the old classes, for backward 
> compatibility: users will not need to change their old Spark applications 
> when they upgrade to the new Spark version.
> Please review my changes.
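For context, a minimal sketch of the plain Kafka 0.9 consumer API the new 
classes target (illustrative standalone usage, not the Spark DStream wrapper; 
the broker address, group id and topic name are made up):

{code}
import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

// Hypothetical standalone example of the new (0.9) consumer API.
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "example-group")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("events"))
// poll() replaces the fetch loop of the old high-level/simple consumers.
val records = consumer.poll(1000)
records.iterator().asScala.foreach(r => println(s"${r.offset}: ${r.value}"))
consumer.close()
{code}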



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5766) Slow RowMatrix multiplication

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5766.
--
Resolution: Not A Problem

Or to some degree this is already done via netlib

> Slow RowMatrix multiplication
> -
>
> Key: SPARK-5766
> URL: https://issues.apache.org/jira/browse/SPARK-5766
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Amaru Cuba Gyllensten
>Priority: Minor
>  Labels: matrix
>
> Looking at the source code for RowMatrix multiplication by a local matrix, it 
> seems like it goes through all column vectors of the local matrix, doing a 
> pairwise dot product with each column.
> It seems like this could be sped up by using gemm, performing a full 
> matrix-matrix multiplication on the local data (or gemv, for matrix-vector 
> multiplication), as is done in BlockMatrix or Matrix.
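A hedged sketch of the idea (illustrative only, not RowMatrix's actual code): 
gather a partition's rows into one local block and make a single gemm-backed 
multiply call, instead of one dot product per column of the local matrix.

{code}
import org.apache.spark.mllib.linalg.DenseMatrix

// Hypothetical helper: multiply all rows of one partition against the local
// matrix in a single call (DenseMatrix.multiply uses BLAS gemm underneath).
def multiplyPartition(rows: Array[Array[Double]], local: DenseMatrix): DenseMatrix = {
  val m = rows.length
  val n = local.numRows // must equal the length of each row
  // isTransposed = true lets us pass the row-major data directly.
  val block = new DenseMatrix(m, n, rows.flatten, isTransposed = true)
  block.multiply(local)
}
{code}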



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5793) Add explode to Column

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5793.
--
Resolution: Won't Fix

> Add explode to Column
> -
>
> Key: SPARK-5793
> URL: https://issues.apache.org/jira/browse/SPARK-5793
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5792) hive udfs like "get_json_object and json_tuple" doesnot work in spark 1.2.0

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5792.
--
Resolution: Won't Fix

I think this is obsolete, given it was traced to a Jackson version conflict.

> hive udfs like "get_json_object and json_tuple" doesnot work in spark 1.2.0
> ---
>
> Key: SPARK-5792
> URL: https://issues.apache.org/jira/browse/SPARK-5792
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0, 1.2.1
>Reporter: pengxu
>
> I'm using Spark 1.2.1 now. After several tests, I found that the Hive UDFs 
> get_json_object and json_tuple do not take effect.
> The test environment is as below:
> beeline ==> Thrift server ==> Spark cluster
> For example, the output of the following query is null instead of the 
> expected value:
> 'select get_json_object('{"hello":"world"}','hello') from demo_tbl'
> I issued the same query in Hive as well; the return value was "world", which 
> is what I expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5828) Dynamic partition pattern support

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5828.
--
Resolution: Won't Fix

> Dynamic partition pattern support
> -
>
> Key: SPARK-5828
> URL: https://issues.apache.org/jira/browse/SPARK-5828
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Jianshi Huang
>
> Hi,
> HCatalog allows you to specify the pattern of partition paths, which is then 
> used by dynamic partition loading:
> https://cwiki.apache.org/confluence/display/Hive/HCatalog+DynamicPartitions#HCatalogDynamicPartitions-ExternalTables
> Can we have a similar feature in Spark SQL?
> Thanks,



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5745) Allow to use custom TaskMetrics implementation

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5745.
--
Resolution: Won't Fix

> Allow to use custom TaskMetrics implementation
> --
>
> Key: SPARK-5745
> URL: https://issues.apache.org/jira/browse/SPARK-5745
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Reporter: Jacek Lewandowski
>
> There can be various RDD implementations, and {{TaskMetrics}} provides a 
> great API for collecting metrics and aggregating them. However, some RDDs may 
> want to register custom metrics (for example the number of rows read), and 
> the current implementation doesn't allow for this.
> I suppose this can be changed without modifying the whole interface: a 
> factory could be used to create the initial {{TaskMetrics}} object, and the 
> default factory could be overridden by the user.
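A minimal sketch of the wish (all types here are illustrative stand-ins, not 
Spark's actual TaskMetrics API): the object that builds the initial metrics 
becomes a pluggable factory, so an RDD can install one that returns a metrics 
object carrying extra counters.

{code}
// Hypothetical sketch; CustomTaskMetrics stands in for a TaskMetrics subclass.
class CustomTaskMetrics extends Serializable {
  var recordsRead: Long = 0L // an RDD-specific counter
}

trait TaskMetricsFactory[M] extends Serializable {
  def newTaskMetrics(): M
}

// The default factory could be overridden by the user, e.g.:
object RowCountingMetricsFactory extends TaskMetricsFactory[CustomTaskMetrics] {
  override def newTaskMetrics(): CustomTaskMetrics = new CustomTaskMetrics
}
{code}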



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5627) Enhance spark-ec2 to return machine-readable output

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5627.
--
Resolution: Won't Fix

I'm guessing this is WontFix given the lack of activity, and that EC2 support 
is finally moving out of Spark

> Enhance spark-ec2 to return machine-readable output
> ---
>
> Key: SPARK-5627
> URL: https://issues.apache.org/jira/browse/SPARK-5627
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Priority: Minor
>
> There are some cases where users may want to programmatically invoke 
> {{spark-ec2}} to manage clusters. For example, we might want to 
> programmatically launch clusters as part of a {{spark-perf}} run and then 
> destroy them once the performance testing is done.
> We should support some of these use cases, perhaps with the explicit caveat 
> that we are not offering API stability for this access.
> A good way to offer this might be to follow [the Packer 
> model|https://www.packer.io/docs/command-line/machine-readable.html] and add 
> a {{--machine-readable}} option to spark-ec2.
> It would be a lot of work to support such an option for everything that 
> spark-ec2 does, and it probably isn't relevant for most things anyway.
> Still, we can phase it in for select things like [returning the 
> version|SPARK-5628] and [describing clusters|SPARK-5629].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3522) Make spark-ec2 verbosity configurable

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3522.
--
Resolution: Won't Fix

I'm guessing this is WontFix given the lack of activity, and that EC2 support 
is finally moving out of Spark

> Make spark-ec2 verbosity configurable
> -
>
> Key: SPARK-3522
> URL: https://issues.apache.org/jira/browse/SPARK-3522
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: starter
>
> When launching a cluster, {{spark-ec2}} spits out a lot of stuff that feels 
> like debug output. It would be better for the user if {{spark-ec2}} did the 
> following:
> * default to info output level
> * allow option to increase verbosity and include debug output
> This will require converting most of the {{print}} statements in the script 
> to use Python's {{logging}} module and setting output levels ({{INFO}}, 
> {{WARN}}, {{DEBUG}}) for each statement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4241) spark_ec2.py support China AWS region: cn-north-1

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4241.
--
Resolution: Won't Fix

I'm guessing this is WontFix given the lack of activity, and that EC2 support 
is finally moving out of Spark

> spark_ec2.py support China AWS region: cn-north-1
> -
>
> Key: SPARK-4241
> URL: https://issues.apache.org/jira/browse/SPARK-4241
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Haitao Yao
>
> Amazon started a new region in China: cn-north-1. But in 
> https://github.com/mesos/spark-ec2/tree/v4/ami-list
> there is no AMI id for the region cn-north-1, so ec2/spark_ec2.py fails at 
> this step.
> We need to add an AMI id for region cn-north-1 in 
> https://github.com/mesos/spark-ec2/tree/v4/ami-list



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1555) enable ec2/spark_ec2.py to stop/delete cluster non-interactively

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1555.
--
Resolution: Won't Fix

I'm guessing this is WontFix given the lack of activity, and that EC2 support 
is finally moving out of Spark

> enable ec2/spark_ec2.py to stop/delete cluster non-interactively
> 
>
> Key: SPARK-1555
> URL: https://issues.apache.org/jira/browse/SPARK-1555
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Affects Versions: 0.9.0
>Reporter: Art Peel
>Priority: Minor
>
> Currently ec2/spark_ec2.py asks for user input to confirm a request to 
> stop/delete the cluster.
> This prevents non-interactive use of the script.
> Please add --assume-yes option with this behavior:
> a. It defaults to false so the current behavior is maintained by default.
> b. If set to true, the script does not ask for user input and instead assumes 
> the answer is 'y'.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5865) Add doc warnings for methods that return local data structures

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5865:
-
Labels: starter  (was: )

> Add doc warnings for methods that return local data structures
> --
>
> Key: SPARK-5865
> URL: https://issues.apache.org/jira/browse/SPARK-5865
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: starter
>
> We should include a note in the doc string for any method that collects an 
> RDD to the driver so that users have some hint of why their call might be 
> OOMing.
> {{RDD.take()}}
> {{RDD.collect()}}
> * 
> [Scala|https://github.com/apache/spark/blob/d8adefefcc2a4af32295440ed1d4917a6968f017/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L803-L806]
> * 
> [Python|https://github.com/apache/spark/blob/d8adefefcc2a4af32295440ed1d4917a6968f017/python/pyspark/rdd.py#L680-L683]
> {{DataFrame.head()}}
> {{DataFrame.toPandas()}}
> * 
> [Python|https://github.com/apache/spark/blob/c76da36c2163276b5c34e59fbb139eeb34ed0faa/python/pyspark/sql/dataframe.py#L637-L645]
> {{Column.toPandas()}}
> * 
> [Python|https://github.com/apache/spark/blob/c76da36c2163276b5c34e59fbb139eeb34ed0faa/python/pyspark/sql/dataframe.py#L965-L973]
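For illustration, a hedged sketch of the kind of note proposed (the wording 
and the surrounding stub are illustrative, not the actual RDD source):

{code}
// Hypothetical stub just to show where the doc note would go.
abstract class ExampleRDD[T] {
  /**
   * Return an array that contains all of the elements in this RDD.
   *
   * @note Only use this when the result is expected to be small: all of the
   *       data is collected into the driver's memory, so calling this on a
   *       large RDD can OOM the driver.
   */
  def collect(): Array[T]
}
{code}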



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3113) Using local spark-submit with an EC2 cluster fails to execute job

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3113.
--
Resolution: Cannot Reproduce

I'm guessing this is CannotReproduce given the lack of activity, and that EC2 
support is finally moving out of Spark

> Using local spark-submit with an EC2 cluster fails to execute job
> -
>
> Key: SPARK-3113
> URL: https://issues.apache.org/jira/browse/SPARK-3113
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.0.2
>Reporter: Dan Osipov
>
> Steps taken:
> * Start a new cluster using EC2 script. Command:
> {code}
> ./spark-ec2 -k keypairname -i ~/path/tokeypair.pem -s 2 launch sp-test
> {code}
> * SSH into the master, execute a job -> Success. Command:
> {code}
> /root/spark/bin/spark-submit --verbose --executor-memory 6G --master 
> spark://ec2-174-129-92-3.compute-1.amazonaws.com:7077 --class JobApp --name 
> Job /root/Job-1.0.0.jar s3n://input-bucket/logs/year=2014/month=6/day=21/* 
> s3n://input-bucket/output/
> {code}
> * On a local build of spark, execute the same command -> Failure:
> {code}
> ./spark-submit --verbose --executor-memory 6G --master 
> spark://ec2-174-129-92-3.compute-1.amazonaws.com:7077 --class JobApp --name 
> Job ~/local/path/to/Job-1.0.0.jar 
> s3n://input-bucket/logs/year=2014/month=6/day=21/* s3n://input-bucket/output/
> {code}
> Local output:
> {code}
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 14/08/18 16:04:22 INFO SecurityManager: Changing view acls to: daniil.osipov,
> 14/08/18 16:04:22 INFO SecurityManager: Changing modify acls to: 
> daniil.osipov,
> 14/08/18 16:04:22 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(daniil.osipov, 
> ); users with modify permissions: Set(daniil.osipov, )
> 14/08/18 16:04:22 INFO Slf4jLogger: Slf4jLogger started
> 14/08/18 16:04:22 INFO Remoting: Starting remoting
> 14/08/18 16:04:22 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://spark@192.168.115.108:57076]
> 14/08/18 16:04:22 INFO Remoting: Remoting now listens on addresses: 
> [akka.tcp://spark@192.168.115.108:57076]
> 14/08/18 16:04:22 INFO Utils: Successfully started service 'spark' on port 
> 57076.
> 14/08/18 16:04:22 INFO SparkEnv: Registering MapOutputTracker
> 14/08/18 16:04:22 INFO SparkEnv: Registering BlockManagerMaster
> 14/08/18 16:04:22 INFO DiskBlockManager: Created local directory at 
> /var/folders/cs/651p8b5x0pb4ytl7zsv2fb7rgp/T/spark-local-20140818160422-5a2d
> 14/08/18 16:04:22 INFO Utils: Successfully started service 'Connection 
> manager for block manager' on port 57077.
> 14/08/18 16:04:22 INFO ConnectionManager: Bound socket to port 57077 with id 
> = ConnectionManagerId(192.168.115.108,57077)
> 14/08/18 16:04:22 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
> 14/08/18 16:04:22 INFO BlockManagerMaster: Trying to register BlockManager
> 14/08/18 16:04:22 INFO BlockManagerMasterActor: Registering block manager 
> 192.168.115.108:57077 with 265.1 MB RAM
> 14/08/18 16:04:22 INFO BlockManagerMaster: Registered BlockManager
> 14/08/18 16:04:22 INFO HttpFileServer: HTTP File server directory is 
> /var/folders/cs/651p8b5x0pb4ytl7zsv2fb7rgp/T/spark-1c5079ae-eb09-457d-bfc3-a7724fb15768
> 14/08/18 16:04:22 INFO HttpServer: Starting HTTP Server
> 14/08/18 16:04:23 INFO Utils: Successfully started service 'HTTP file server' 
> on port 57078.
> 14/08/18 16:04:23 INFO Utils: Successfully started service 'SparkUI' on port 
> 4040.
> 14/08/18 16:04:23 INFO SparkUI: Started SparkUI at http://192.168.115.108:4040
> 14/08/18 16:04:23 INFO SparkContext: Added JAR 
> file:/.../path/to/target/scala-2.10/Job-1.0.0.jar at 
> http://192.168.115.108:57078/jars/Job-1.0.0.jar with timestamp 1408403063777
> 14/08/18 16:04:23 INFO AppClient$ClientActor: Connecting to master 
> spark://ec2-174-129-92-3.compute-1.amazonaws.com:7077...
> 14/08/18 16:04:23 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready 
> for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
> processing s3n://input-bucket/logs/year=2014/month=6/day=21/*
> 14/08/18 16:04:24 INFO MemoryStore: ensureFreeSpace(34006) called with 
> curMem=0, maxMem=278019440
> 14/08/18 16:04:24 INFO MemoryStore: Block broadcast_0 stored as values in 
> memory (estimated size 33.2 KB, free 265.1 MB)
> 14/08/18 16:04:24 INFO MemoryStore: ensureFreeSpace(56) called with 
> curMem=34006, maxMem=278019440
> 14/08/18 16:04:24 INFO MemoryStore: Block broadcast_0_meta stored as values 
> in memory (estimated size 56.0 B, free 265.1 MB)
> 14/08/18 16:04:24 INFO BlockManagerInfo: Added broadcast_0_meta in memory on 
> 192.168.115.108:57077 (size: 56.0 B, free: 265.1 MB)
> 14/08/18 16:0

[jira] [Commented] (SPARK-5865) Add doc warnings for methods that return local data structures

2016-01-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090599#comment-15090599
 ] 

Sean Owen commented on SPARK-5865:
--

OK by me; seems like an old but valid starter JIRA

> Add doc warnings for methods that return local data structures
> --
>
> Key: SPARK-5865
> URL: https://issues.apache.org/jira/browse/SPARK-5865
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: starter
>
> We should include a note in the doc string for any method that collects an 
> RDD to the driver so that users have some hint of why their call might be 
> OOMing.
> {{RDD.take()}}
> {{RDD.collect()}}
> * 
> [Scala|https://github.com/apache/spark/blob/d8adefefcc2a4af32295440ed1d4917a6968f017/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L803-L806]
> * 
> [Python|https://github.com/apache/spark/blob/d8adefefcc2a4af32295440ed1d4917a6968f017/python/pyspark/rdd.py#L680-L683]
> {{DataFrame.head()}}
> {{DataFrame.toPandas()}}
> * 
> [Python|https://github.com/apache/spark/blob/c76da36c2163276b5c34e59fbb139eeb34ed0faa/python/pyspark/sql/dataframe.py#L637-L645]
> {{Column.toPandas()}}
> * 
> [Python|https://github.com/apache/spark/blob/c76da36c2163276b5c34e59fbb139eeb34ed0faa/python/pyspark/sql/dataframe.py#L965-L973]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5455) Add MultipleTransformer abstract class

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5455.
--
Resolution: Won't Fix

> Add MultipleTransformer abstract class
> --
>
> Key: SPARK-5455
> URL: https://issues.apache.org/jira/browse/SPARK-5455
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.2.0
>Reporter: Peter Rudenko
>Priority: Minor
>
> There's an example of a UnaryTransformer abstract class. We need to make a 
> public MultipleTransformer class that would accept multiple columns as input 
> and produce a single output column (e.g. [col1,col2,col3,...] => 
> Vector(col1,col2,col3,...) or mean([col1,col2,col3,...])).
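
For the [col1,col2,col3,...] => Vector(col1,col2,col3,...) case, the later
VectorAssembler transformer covers much of this ground. A minimal sketch,
assuming an existing DataFrame {{df}} with numeric columns col1, col2, col3
(names are illustrative):

{code}
import org.apache.spark.ml.feature.VectorAssembler

// Combine several input columns into a single Vector output column.
val assembler = new VectorAssembler()
  .setInputCols(Array("col1", "col2", "col3"))
  .setOutputCol("features")

val assembled = assembler.transform(df)  // df is an assumed, pre-existing DataFrame
{code}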



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4863) Suspicious exception handlers

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4863.
--
Resolution: Not A Problem

No follow up

> Suspicious exception handlers
> -
>
> Key: SPARK-4863
> URL: https://issues.apache.org/jira/browse/SPARK-4863
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.1
>Reporter: Ding Yuan
>Priority: Minor
>
> Following up on the discussion in 
> https://issues.apache.org/jira/browse/SPARK-1148, I am creating a new JIRA to 
> report the suspicious exception handlers detected by our tool Aspirator on 
> spark-1.1.1. 
> {noformat}
> ==
> WARNING: TODO;  in handler.
>   Line: 129, File: "org/apache/thrift/transport/TNonblockingServerSocket.java"
> 122:  public void registerSelector(Selector selector) {
> 123:try {
> 124:  // Register the server socket channel, indicating an interest in
> 125:  // accepting new connections
> 126:  serverSocketChannel.register(selector, SelectionKey.OP_ACCEPT);
> 127:} catch (ClosedChannelException e) {
> 128:  // this shouldn't happen, ideally...
> 129:  // TODO: decide what to do with this.
> 130:}
> 131:  }
> ==
> ==
> WARNING: TODO;  in handler.
>   Line: 1583, File: "org/apache/spark/SparkContext.scala"
> 1578: val scheduler = try {
> 1579:   val clazz = 
> Class.forName("org.apache.spark.scheduler.cluster.YarnClusterScheduler")
> 1580:   val cons = clazz.getConstructor(classOf[SparkContext])
> 1581:   cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl]
> 1582: } catch {
> 1583:   // TODO: Enumerate the exact reasons why it can fail
> 1584:   // But irrespective of it, it means we cannot proceed !
> 1585:   case e: Exception => {
> 1586: throw new SparkException("YARN mode not available ?", e)
> 1587:   }
> ==
> ==
> WARNING 1: empty handler for exception: java.lang.Exception
> THERE IS NO LOG MESSAGE!!!
>   Line: 75, File: "org/apache/spark/repl/ExecutorClassLoader.scala"
> try {
>   val pathInDirectory = name.replace('.', '/') + ".class"
>   val inputStream = {
> if (fileSystem != null) {
>   fileSystem.open(new Path(directory, pathInDirectory))
> } else {
>   if (SparkEnv.get.securityManager.isAuthenticationEnabled()) {
> val uri = new URI(classUri + "/" + urlEncode(pathInDirectory))
> val newuri = Utils.constructURIForAuthentication(uri, 
> SparkEnv.get.securityManager)
> newuri.toURL().openStream()
>   } else {
> new URL(classUri + "/" + urlEncode(pathInDirectory)).openStream()
>   }
> }
>   }
>   val bytes = readAndTransformClass(name, inputStream)
>   inputStream.close()
>   Some(defineClass(name, bytes, 0, bytes.length))
> } catch {
>   case e: Exception => None
> }
> ==
> ==
> WARNING 1: empty handler for exception: java.io.IOException
> THERE IS NO LOG MESSAGE!!!
>   Line: 275, File: "org/apache/spark/util/Utils.scala"
>   try {
> dir = new File(root, "spark-" + UUID.randomUUID.toString)
> if (dir.exists() || !dir.mkdirs()) {
>   dir = null
> }
>   } catch { case e: IOException => ; }
> ==
> ==
> WARNING 1: empty handler for exception: java.lang.InterruptedException
> THERE IS NO LOG MESSAGE!!!
>   Line: 172, File: "parquet/org/apache/thrift/server/TNonblockingServer.java"
>   protected void joinSelector() {
> // wait until the selector thread exits
> try {
>   selectThread_.join();
> } catch (InterruptedException e) {
>   // for now, just silently ignore. technically this means we'll have 
> less of
>   // a graceful shutdown as a result.
> }
>   }
> ==
> ==
> WARNING 2: empty handler for exception: java.net.SocketException
> There are log messages..
>   Line: 111, File: 
> "parquet/org/apache/thrift/transport/TNonblockingSocket.java"
>   public void setTimeout(int timeout) {
> try {
>   socketChannel_.socket().setSoTimeout(timeout);
> } catch (SocketException sx) {
>   LOGGER.warn("Could not set socket timeout.", sx);
> }
>   }
> ==
> ==
> WARNING 3: empty handler for exception: java.net.SocketException
> There are log messa

[jira] [Resolved] (SPARK-5005) Failed to start spark-shell when using yarn-client mode with the Spark1.2.0

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5005.
--
Resolution: Cannot Reproduce

> Failed to start spark-shell when using  yarn-client mode with the Spark1.2.0
> 
>
> Key: SPARK-5005
> URL: https://issues.apache.org/jira/browse/SPARK-5005
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, YARN
>Affects Versions: 1.2.0
> Environment: Spark 1.2.0
> Hadoop 2.2.0
>Reporter: yangping wu
>Priority: Minor
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> I am using Spark 1.2.0, but when I start spark-shell in yarn-client 
> mode ({code}MASTER=yarn-client bin/spark-shell{code}), it fails and the error 
> message is
> {code}
> Unknown/unsupported param List(--executor-memory, 1024m, --executor-cores, 8, 
> --num-executors, 2)
> Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options] 
> Options:
>   --jar JAR_PATH   Path to your application's JAR file (required)
>   --class CLASS_NAME   Name of your application's main class (required)
>   --args ARGS  Arguments to be passed to your application's main 
> class.
>Mutliple invocations are possible, each will be passed 
> in order.
>   --num-executors NUMNumber of executors to start (Default: 2)
>   --executor-cores NUM   Number of cores for the executors (Default: 1)
>   --executor-memory MEM  Memory per executor (e.g. 1000M, 2G) (Default: 1G)
> {code}
> But when I use Spark 1.1.0 and start spark-shell the same way 
> ({code}MASTER=yarn-client bin/spark-shell{code}), it works.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5280) Import RDF graphs into GraphX

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5280.
--
Resolution: Won't Fix

> Import RDF graphs into GraphX
> -
>
> Key: SPARK-5280
> URL: https://issues.apache.org/jira/browse/SPARK-5280
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: lukovnikov
>
> RDF (Resource Description Framework) models knowledge in a graph and is 
> heavily used on the Semantic Web and beyond.
> GraphX should include a way to import RDF data easily.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4510) Add k-medoids Partitioning Around Medoids (PAM) algorithm

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4510.
--
Resolution: Won't Fix

> Add k-medoids Partitioning Around Medoids (PAM) algorithm
> -
>
> Key: SPARK-4510
> URL: https://issues.apache.org/jira/browse/SPARK-4510
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Fan Jiang
>Assignee: Fan Jiang
>  Labels: clustering, features
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> PAM (k-medoids) is more robust to noise and outliers than k-means because it 
> minimizes a sum of pairwise dissimilarities instead of a sum of squared 
> Euclidean distances. A medoid can be defined as the object of a cluster whose 
> average dissimilarity to all the objects in the cluster is minimal, i.e. it is 
> the most centrally located point in the cluster.
> The most common realisation of k-medoid clustering is the Partitioning Around 
> Medoids (PAM) algorithm, which is as follows:
> 1. Initialize: randomly select (without replacement) k of the n data points as 
> the medoids.
> 2. Associate each data point with the closest medoid ("closest" here is defined 
> using any valid distance metric, most commonly Euclidean, Manhattan or 
> Minkowski distance).
> 3. For each medoid m and each non-medoid data point o, swap m and o and compute 
> the total cost of the configuration.
> 4. Select the configuration with the lowest cost.
> 5. Repeat steps 2 to 4 until there is no change in the medoids.
> The new feature for MLlib will contain 5 new files
> /main/scala/org/apache/spark/mllib/clustering/PAM.scala
> /main/scala/org/apache/spark/mllib/clustering/PAMModel.scala
> /main/scala/org/apache/spark/mllib/clustering/LocalPAM.scala
> /test/scala/org/apache/spark/mllib/clustering/PAMSuite.scala
> /main/scala/org/apache/spark/examples/mllib/KMedoids.scala
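
A minimal, local (non-distributed) sketch of the PAM loop described above, using
Euclidean distance. It is only meant to make the swap/cost steps concrete and
does not reflect the proposed distributed PAM.scala; {{points}} and {{k}} are
assumed inputs:

{code}
import scala.util.Random

def dist(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

// Cost of a configuration: each point contributes its distance to the nearest medoid.
def totalCost(points: Seq[Array[Double]], medoids: Seq[Array[Double]]): Double =
  points.map(p => medoids.map(m => dist(p, m)).min).sum

def pam(points: Seq[Array[Double]], k: Int, maxIter: Int = 20): Seq[Array[Double]] = {
  // Step 1: randomly select k of the n points as the initial medoids.
  var medoids: Seq[Array[Double]] = Random.shuffle(points).take(k)
  var improved = true
  var iter = 0
  while (improved && iter < maxIter) {
    improved = false
    iter += 1
    val current = totalCost(points, medoids)
    // Steps 3-4: try swapping each medoid with each non-medoid, keep the cheapest configuration.
    val candidates = for {
      (_, i) <- medoids.zipWithIndex
      o <- points if !medoids.contains(o)
    } yield medoids.updated(i, o)
    if (candidates.nonEmpty) {
      val best = candidates.minBy(totalCost(points, _))
      if (totalCost(points, best) < current) {
        // Step 5: repeat while a swap still lowers the total cost.
        medoids = best
        improved = true
      }
    }
  }
  medoids
}
{code}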



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5940) Graph Loader: refactor + add more formats

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5940.
--
Resolution: Won't Fix

> Graph Loader: refactor + add more formats
> -
>
> Key: SPARK-5940
> URL: https://issues.apache.org/jira/browse/SPARK-5940
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: lukovnikov
>Priority: Minor
>
> Currently, the only graph loader is GraphLoader.edgeListFile. [SPARK-5280] 
> adds a RDF graph loader.
> However, as Takeshi Yamamuro suggested on github [SPARK-5280], 
> https://github.com/apache/spark/pull/4650, it might be interesting to make 
> GraphLoader an interface with several implementations for different formats. 
> And maybe it's good to make a façade graph loader that provides a unified 
> interface to all loaders.
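
A hypothetical sketch of the interface idea (the trait and object names below
are illustrative, not an existing GraphX API); each format would get its own
implementation behind one façade:

{code}
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Graph, GraphLoader}

// Hypothetical façade: one trait, one implementation per on-disk format.
trait GraphFormatLoader {
  def load(sc: SparkContext, path: String): Graph[Int, Int]
}

object EdgeListFormatLoader extends GraphFormatLoader {
  // Delegates to the only loader that exists today.
  def load(sc: SparkContext, path: String): Graph[Int, Int] =
    GraphLoader.edgeListFile(sc, path)
}
{code}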



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3368) Spark cannot be used with Avro and Parquet

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3368.
--
Resolution: Cannot Reproduce

I think this is long since obsolete

> Spark cannot be used with Avro and Parquet
> --
>
> Key: SPARK-3368
> URL: https://issues.apache.org/jira/browse/SPARK-3368
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: Graham Dennis
>
> Spark cannot currently (as of 1.0.2) use any Parquet write support classes 
> that are not part of the spark assembly jar (at least when launched using 
> `spark-submit`).  This prevents using Avro with Parquet.
> See https://github.com/GrahamDennis/spark-avro-parquet for a test case to 
> reproduce this issue.
> The problem appears in the master logs as:
> {noformat}
> 14/09/03 17:31:10 ERROR Executor: Exception in task ID 0
> parquet.hadoop.BadConfigurationException: could not instanciate class 
> parquet.avro.AvroWriteSupport set in job conf at parquet.write.support.class
>   at 
> parquet.hadoop.ParquetOutputFormat.getWriteSupportClass(ParquetOutputFormat.java:121)
>   at 
> parquet.hadoop.ParquetOutputFormat.getWriteSupport(ParquetOutputFormat.java:302)
>   at 
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
>   at 
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:714)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:699)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>   at org.apache.spark.scheduler.Task.run(Task.scala:51)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException: parquet.avro.AvroWriteSupport
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:190)
>   at 
> parquet.hadoop.ParquetOutputFormat.getWriteSupportClass(ParquetOutputFormat.java:115)
>   ... 11 more
> {noformat}
> The root cause of the problem is that the class loader that's used to find 
> the Parquet write support class only searches the spark assembly jar and 
> doesn't also search the application jar.  A solution would be to ensure that 
> the application jar is always available on the executor classpath.  This is 
> the same underlying issue as SPARK-2878 and SPARK-3166.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5978) Spark, Examples have Hadoop1/2 compat issues with Hadoop 2.0.x (e.g. CDH4)

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5978.
--
Resolution: Not A Problem

No longer a problem at this stage given all these older Hadoop versions are 
unsupported

> Spark, Examples have Hadoop1/2 compat issues with Hadoop 2.0.x (e.g. CDH4)
> --
>
> Key: SPARK-5978
> URL: https://issues.apache.org/jira/browse/SPARK-5978
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Examples
>Affects Versions: 1.2.0, 1.2.1
>Reporter: Michael Nazario
>Priority: Critical
>
> This is a regression from Spark 1.1.1.
> The Spark examples include an Avro converter example for PySpark. While trying 
> to debug a problem, I discovered that even though you can build Spark 1.2.0 
> with Hadoop 2, an HBase dependency elsewhere in the examples code depends on 
> Hadoop 1.
> An easy fix would be to separate the examples into Hadoop-specific versions. 
> Another way would be to fix the HBase dependencies so that they don't rely on 
> Hadoop 1 specific code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4565) Add docs about advanced spark application development

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4565.
--
Resolution: Won't Fix

No follow up

> Add docs about advanced spark application development
> -
>
> Key: SPARK-4565
> URL: https://issues.apache.org/jira/browse/SPARK-4565
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Evan Sparks
>Priority: Minor
>
> [~shivaram], [~jegonzal] and I have been working on a brief document based on 
> our experiences writing high-performance Spark applications - MLlib, GraphX, 
> pipelines, ml-matrix, etc.
> It currently exists here - 
> https://docs.google.com/document/d/1gEIawzRsOwksV_bq4je3ofnd-7Xu-u409mdW-RXTDnQ/edit?usp=sharing
> Would it make sense to add these tips and tricks to the Spark Wiki?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3901) Add SocketSink capability for Spark metrics

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3901.
--
Resolution: Won't Fix

> Add SocketSink capability for Spark metrics
> ---
>
> Key: SPARK-3901
> URL: https://issues.apache.org/jira/browse/SPARK-3901
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Sreepathi Prasanna
>Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Spark depends on the Coda Hale metrics library to collect metrics. Today we 
> can send metrics to console, CSV and JMX sinks. We use Chukwa as the monitoring 
> framework for the Hadoop services. To extend that framework to collect Spark 
> metrics, we need an additional SocketSink capability, which Spark does not 
> provide at the moment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5710) Combines two adjacent `Cast` expressions into one

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5710.
--
Resolution: Won't Fix

> Combines two adjacent `Cast` expressions into one
> -
>
> Key: SPARK-5710
> URL: https://issues.apache.org/jira/browse/SPARK-5710
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: guowei
>Priority: Minor
>
> A plan produced by the `analyzer` with `typeCoercionRules` may contain many 
> `cast` expressions. We can combine the adjacent ones.
> For example:
> create table test(a decimal(3,1));
> explain select * from test where a*2-1>1;
> == Physical Plan ==
> Filter (CAST(CAST((CAST(CAST((CAST(a#5, DecimalType()) * 2), 
> DecimalType(21,1)), DecimalType()) - 1), DecimalType(22,1)), DecimalType()) > 
> 1)
>  HiveTableScan [a#5], (MetastoreRelation default, test, None), None
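
A hypothetical sketch of what such a simplification could look like as a
Catalyst rule (not actual Spark code): collapsing two casts is only safe when
the intermediate cast cannot change the value, so a real rule would need to
restrict the type pairs it fires on.

{code}
import org.apache.spark.sql.catalyst.expressions.Cast
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object CombineAdjacentCasts extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    // cast(cast(e, t1), t2) => cast(e, t2); shown unconditionally here for
    // brevity, which is NOT semantics-preserving in general.
    case Cast(Cast(child, _), dataType) => Cast(child, dataType)
  }
}
{code}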



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2930) clarify docs on using webhdfs with spark.yarn.access.namenodes

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2930.
--
Resolution: Won't Fix

I haven't seen any follow up or demand for this

> clarify docs on using webhdfs with spark.yarn.access.namenodes
> --
>
> Key: SPARK-2930
> URL: https://issues.apache.org/jira/browse/SPARK-2930
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, YARN
>Affects Versions: 1.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Minor
>
> The documentation of spark.yarn.access.namenodes talks about putting 
> namenodes in it and gives an example with hdfs://.
> It can also be used with webhdfs, so we should clarify how to use it.
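
For example, the documentation entry could show both schemes side by side (host
names and ports below are placeholders):

{code}
spark.yarn.access.namenodes  hdfs://nn1.example.com:8020,webhdfs://nn2.example.com:50070
{code}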



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5140.
--
Resolution: Won't Fix

> Two RDDs which are scheduled concurrently should be able to wait on parent in 
> all cases
> ---
>
> Key: SPARK-5140
> URL: https://issues.apache.org/jira/browse/SPARK-5140
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Corey J. Nolet
>  Labels: features
>
> Not sure if this would change too much of the internals to be included in the 
> 1.2.1 but it would be very helpful if it could be.
> This ticket is from a discussion between myself and [~ilikerps]. Here's the 
> result of some testing that [~ilikerps] did:
> bq. I did some testing as well, and it turns out the "wait for other guy to 
> finish caching" logic is on a per-task basis, and it only works on tasks that 
> happen to be executing on the same machine. 
> bq. Once a partition is cached, we will schedule tasks that touch that 
> partition on that executor. The problem here, though, is that the cache is in 
> progress, and so the tasks are still scheduled randomly (or with whatever 
> locality the data source has), so tasks which end up on different machines 
> will not see that the cache is already in progress.
> {code}
> Here was my test, by the way:
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent._
> import scala.concurrent.duration._
> val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(1); i 
> }).cache()
> val futures = (0 until 4).map { _ => Future { rdd.count } }
> Await.result(Future.sequence(futures), 120.second)
> {code}
> bq. Note that I run the future 4 times in parallel. I found that the first 
> run has all tasks take 10 seconds. The second has about 50% of its tasks take 
> 10 seconds, and the rest just wait for the first stage to finish. The last 
> two runs have no tasks that take 10 seconds; all wait for the first two 
> stages to finish.
> What we want is the ability to fire off a job and have the DAG figure out 
> that two RDDs depend on the same parent so that when the children are 
> scheduled concurrently, the first one to start will activate the parent and 
> both will wait on the parent. When the parent is done, they will both be able 
> to finish their work concurrently. We are trying to use this pattern by 
> having the parent cache results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1593) Add status command to Spark Daemons(master/worker)

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1593.
--
Resolution: Won't Fix

> Add status command to Spark Daemons(master/worker)
> --
>
> Key: SPARK-1593
> URL: https://issues.apache.org/jira/browse/SPARK-1593
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Affects Versions: 0.9.1
>Reporter: Pradeep Chanumolu
>  Labels: patch
> Attachments: 
> 0001-Adding-Spark-Daemon-master-worker-status-command.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Currently we have only start and stop commands for the Spark daemons 
> (master/worker). A status command could be added to spark-daemon.sh and 
> spark-daemons.sh to report whether the master/worker is alive or not.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6162) Handle missing values in GBM

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6162.
--
Resolution: Won't Fix

> Handle missing values in GBM
> 
>
> Key: SPARK-6162
> URL: https://issues.apache.org/jira/browse/SPARK-6162
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.1
>Reporter: Devesh Parekh
>
> We build a lot of predictive models over data combined from multiple sources, 
> where some entries may not have all sources of data and so some values are 
> missing in each feature vector. Another place this might come up is if you 
> have features from slightly heterogeneous items (or items composed of 
> heterogeneous subcomponents) that share many features in common but may have 
> extra features for different types, and you don't want to manually train 
> models for every different type.
> R's GBM library, which is what we are currently using, deals with this type 
> of data nicely by making "missing" nodes in the decision tree (a surrogate 
> split) for features that can have missing values. We'd like to do the same 
> with MLlib, but LabeledPoint would need to support missing values, and 
> GradientBoostedTrees would need to be modified to deal with them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1881) Executor caching

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1881.
--
Resolution: Not A Problem

> Executor caching
> 
>
> Key: SPARK-1881
> URL: https://issues.apache.org/jira/browse/SPARK-1881
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 1.0.0
> Environment: centos 6.5, mesos 0.18.1
>Reporter: nigel
>Priority: Minor
>
> The problem is that the executor is copied for each run. We have a cluster 
> where the disks are of moderate size and each executor is nearly 170MB. This 
> executor is slow to copy and multiple runs take up a significant amount of 
> space.
> The improvement would be to make it smaller.
> Currently the examples are included in there, which are not needed for 
> execution. It is easy to take them out, but it might be better to not include 
> them in the default build.
> Another improvement might be to cache the executor jar. The script below will 
> make a 'sparklite' executor which only downloads the jar file once (until the 
> tmp dir is wiped). The scripts (small) are downloaded each time as before.
> This example would need more work, the source and dest are currently 
> hard-coded and it might be a good idea to check file dates and or checksums 
> in case someone was uploading jars with the same version.
> This might be a bit redundant, depending on what happens with other work on 
> executor caching.
> Comments welcome.
> --
> mkdir sparklite
> echo '58c58
> <   if [ -f "$FWDIR/RELEASE" ]; then
> ---
> >   if [ -f "$FWDIR/RELEASE" ] && [ -f 
> > "$FWDIR"/lib/spark-assembly*hadoop*.jar ]; then
> 60c60
> <   else
> ---
> >   elif [ -f "$ASSEMBLY_DIR"/spark-assembly*hadoop*.jar ]; then
> 61a62,68
> >   else
> > #Try the local one. If not there, download from hdfs
> > if [ ! -f /tmp/sparklite/spark-assembly*hadoop*.jar ]; then
> > mkdir /tmp/sparklite 2>/dev/null
> > hdfs dfs -get /spark/spark-assembly*-hadoop*.jar /tmp/sparklite/
> > fi
> > ASSEMBLY_JAR=$(ls /tmp/sparklite/spark-assembly*hadoop*.jar 2>/dev/null)
> 64a72
> > ' > cc.patch
> tar -C sparklite -xf spark-1.0.0.tgz 
> cd sparklite
> hdfs dfs -put ./spark-1.0.0/lib/spark-assembly-1.0.0-SNAPSHOT-hadoop2.4.0.jar 
> /spark/
> rm -f spark-1.0.0/lib/*assembly*
> rm -f spark-1.0.0/lib/*example*
> rm -f spark-1.0.0/bin/*.cmd
> rm -rf spark-1.0.0/ec2
> rm -rf spark-1.0.0/lib
> rm -rf spark-1.0.0/conf
> rm -rf spark-1.0.0/examples
> patch spark-1.0.0/bin/compute-classpath.sh < cc.patch
> rm -f spark-1.0.0.tgz
> tar zcf spark-1.0.0.tgz spark-1.0.0
> hdfs dfs -rm /spark/spark-1.0.0.tgz
> hdfs dfs -put ./spark-1.0.0.tgz /spark/
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5916) $SPARK_HOME/bin/beeline conflicts with $HIVE_HOME/bin/beeline

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5916.
--
Resolution: Won't Fix

> $SPARK_HOME/bin/beeline conflicts with $HIVE_HOME/bin/beeline
> -
>
> Key: SPARK-5916
> URL: https://issues.apache.org/jira/browse/SPARK-5916
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Carl Steinbach
>Priority: Minor
>
> Hive provides a JDBC CLI named "beeline". Spark currently depends on beeline, 
> but provides its own "beeline" wrapper script for launching it. This results 
> in a conflict when both $HIVE_HOME/bin and $SPARK_HOME/bin appear on a user's 
> PATH.
> In order to eliminate the potential for conflict I propose changing the name 
> of Spark's beeline wrapper script to sparkline.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6215) Shorten apply and update funcs in GenerateProjection

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6215.
--
Resolution: Won't Fix

> Shorten apply and update funcs in GenerateProjection
> 
>
> Key: SPARK-6215
> URL: https://issues.apache.org/jira/browse/SPARK-6215
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> Some code in GenerateProjection looks redundant and can be shortened.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4496) smallint (16 bit value) is being sent as a 32 bit value in the thrift interface.

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4496.
--
Resolution: Invalid

> smallint (16 bit value) is being sent as a 32 bit value in the thrift 
> interface.
> ---
>
> Key: SPARK-4496
> URL: https://issues.apache.org/jira/browse/SPARK-4496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Chip Sands
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6221) SparkSQL should support auto merging output files

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6221.
--
Resolution: Won't Fix

> SparkSQL should support auto merging output files
> -
>
> Key: SPARK-6221
> URL: https://issues.apache.org/jira/browse/SPARK-6221
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Yi Tian
>
> Hive has a feature that can automatically merge small files in an HQL query's 
> output path.
> This feature is quite useful in cases where people use {{insert into}} to load 
> minute-level data from the input path into a daily table.
> In that case, if the SQL includes a {{group by}} or {{join}} operation, we 
> always set the {{reduce number}} to at least 200 to avoid a possible OOM on 
> the reduce side.
> That makes the SQL output at least 200 files at the end of each execution, so 
> the daily table will finally contain more than 5 files.
> If we could provide the same feature in Spark SQL, it would greatly reduce 
> HDFS operations and Spark tasks when we run other SQL on this table.
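
Absent such a feature, a common workaround is to shrink the number of output
partitions just before writing. A minimal sketch, assuming a HiveContext-backed
{{sqlContext}} and illustrative table and column names:

{code}
// Run the aggregation with as many reducers as needed, then coalesce the result
// down to a handful of files before inserting into the daily table.
val daily = sqlContext.sql(
  "SELECT key, SUM(value) AS total FROM minute_table GROUP BY key")

daily.coalesce(8)                // 8 output files instead of 200+
  .write
  .mode("append")
  .insertInto("daily_table")
{code}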



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6277) Allow Hadoop configurations and env variables to be referenced in spark-defaults.conf

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6277.
--
Resolution: Won't Fix

> Allow Hadoop configurations and env variables to be referenced in 
> spark-defaults.conf
> -
>
> Key: SPARK-6277
> URL: https://issues.apache.org/jira/browse/SPARK-6277
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 1.2.1, 1.3.0
>Reporter: Jianshi Huang
>
> I need to set spark.local.dir to use the user's home directory instead of 
> /tmp, but currently spark-defaults.conf only allows constant values.
> What I want to do is to write:
> bq. spark.local.dir /home/${user.name}/spark/tmp
> or
> bq. spark.local.dir /home/${USER}/spark/tmp
> Otherwise I would have to hack bin/spark-class and pass the option through 
> -Dspark.local.dir
> Jianshi
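
Since this was resolved Won't Fix, one workaround is to do the expansion in
application code (or in the shell at submit time) instead of in
spark-defaults.conf. A minimal sketch of the programmatic variant:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Expand the user name ourselves rather than relying on variable
// substitution inside spark-defaults.conf.
val conf = new SparkConf()
  .setAppName("local-dir-example")
  .set("spark.local.dir", s"/home/${sys.env("USER")}/spark/tmp")

val sc = new SparkContext(conf)
{code}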



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5260.
--
Resolution: Won't Fix

> Expose JsonRDD.allKeysWithValueTypes() in a utility class 
> --
>
> Key: SPARK-5260
> URL: https://issues.apache.org/jira/browse/SPARK-5260
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Corey J. Nolet
>Assignee: Corey J. Nolet
>
> I have found this method extremely useful when implementing my own strategy 
> for inferring a schema from parsed JSON. For now, I've actually copied the 
> method right out of the JsonRDD class into my own project, but I think it 
> would be immensely useful to keep the code in Spark and expose it publicly 
> somewhere else, like an object called JsonSchema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4566) Multiple --py-files command line options to spark-submit replace instead of adding to previous options

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4566.
--
Resolution: Won't Fix

No follow up

> Multiple --py-files command line options to spark-submit replace instead of 
> adding to previous options
> --
>
> Key: SPARK-4566
> URL: https://issues.apache.org/jira/browse/SPARK-4566
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Reporter: Phil Roth
>Priority: Minor
>
> If multiple --py-files are specified to spark-submit, previous lists of files 
> are replaced instead of added to. This is certainly a minor issue, but it 
> cost me a lot of debugging time.
> If people want the current behavior to stay the same, I would suggest 
> updating the help messages to highlight that the suggested usage is one 
> option with a comma separated list of files.
> If people want this behavior updated, I'd love to submit a pull request in 
> the next day or two. I think it would be a perfect small task to get me 
> started as a contributor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5273) Improve documentation examples for LinearRegression

2016-01-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5273:
---

Assignee: Apache Spark

> Improve documentation examples for LinearRegression 
> 
>
> Key: SPARK-5273
> URL: https://issues.apache.org/jira/browse/SPARK-5273
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Dev Lakhani
>Assignee: Apache Spark
>Priority: Minor
>
> In the document
> https://spark.apache.org/docs/1.1.1/mllib-linear-methods.html
> under "Linear least squares, Lasso, and ridge regression",
> the suggested way to use LinearRegressionWithSGD.train(),
> // Building the model
> val numIterations = 100
> val model = LinearRegressionWithSGD.train(parsedData, numIterations)
> is not ideal even for simple examples such as y=x. It should be replaced 
> with more real-world parameters, including a step size:
> val lr = new LinearRegressionWithSGD()
> lr.optimizer.setStepSize(0.0001)
> lr.optimizer.setNumIterations(100)
> or
> LinearRegressionWithSGD.train(input, 100, 0.0001)
> to produce a reasonable MSE. It took me a while on the dev forum to learn 
> that the step size should be really small. This might save someone the same 
> effort when learning MLlib.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5273) Improve documentation examples for LinearRegression

2016-01-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090612#comment-15090612
 ] 

Apache Spark commented on SPARK-5273:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10675

> Improve documentation examples for LinearRegression 
> 
>
> Key: SPARK-5273
> URL: https://issues.apache.org/jira/browse/SPARK-5273
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Dev Lakhani
>Priority: Minor
>
> In the document
> https://spark.apache.org/docs/1.1.1/mllib-linear-methods.html
> under "Linear least squares, Lasso, and ridge regression",
> the suggested way to use LinearRegressionWithSGD.train(),
> // Building the model
> val numIterations = 100
> val model = LinearRegressionWithSGD.train(parsedData, numIterations)
> is not ideal even for simple examples such as y=x. It should be replaced 
> with more real-world parameters, including a step size:
> val lr = new LinearRegressionWithSGD()
> lr.optimizer.setStepSize(0.0001)
> lr.optimizer.setNumIterations(100)
> or
> LinearRegressionWithSGD.train(input, 100, 0.0001)
> to produce a reasonable MSE. It took me a while on the dev forum to learn 
> that the step size should be really small. This might save someone the same 
> effort when learning MLlib.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5273) Improve documentation examples for LinearRegression

2016-01-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5273:
---

Assignee: (was: Apache Spark)

> Improve documentation examples for LinearRegression 
> 
>
> Key: SPARK-5273
> URL: https://issues.apache.org/jira/browse/SPARK-5273
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Dev Lakhani
>Priority: Minor
>
> In the document
> https://spark.apache.org/docs/1.1.1/mllib-linear-methods.html
> under "Linear least squares, Lasso, and ridge regression",
> the suggested way to use LinearRegressionWithSGD.train(),
> // Building the model
> val numIterations = 100
> val model = LinearRegressionWithSGD.train(parsedData, numIterations)
> is not ideal even for simple examples such as y=x. It should be replaced 
> with more real-world parameters, including a step size:
> val lr = new LinearRegressionWithSGD()
> lr.optimizer.setStepSize(0.0001)
> lr.optimizer.setNumIterations(100)
> or
> LinearRegressionWithSGD.train(input, 100, 0.0001)
> to produce a reasonable MSE. It took me a while on the dev forum to learn 
> that the step size should be really small. This might save someone the same 
> effort when learning MLlib.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2385) Missing guide for running JDBC server on YARN

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2385.
--
Resolution: Won't Fix

> Missing guide for running JDBC server on YARN
> -
>
> Key: SPARK-2385
> URL: https://issues.apache.org/jira/browse/SPARK-2385
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.0.0
>Reporter: Yi Tian
>Priority: Minor
>
> There is no documentation for running the JDBC server on YARN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3920) Add option to support aggregation using treeAggregate in decision tree

2016-01-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3920.
--
Resolution: Won't Fix

> Add option to support aggregation using treeAggregate in decision tree
> --
>
> Key: SPARK-3920
> URL: https://issues.apache.org/jira/browse/SPARK-3920
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Qiping Li
>
> In [SPARK-3366|https://issues.apache.org/jira/browse/SPARK-3366], we used 
> distributed aggregation to aggregate node stats, which can save computation 
> and communication time when the shuffle size is very large. But experiments 
> have shown that if the shuffle size is not large enough (e.g. shallow trees), 
> this causes some performance loss (greater than 20% in some cases). We should 
> support both options for aggregation so that users can choose the proper one 
> based on their needs. 
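
For reference, the two aggregation strategies differ only in how partial
results are combined. A toy illustration on a plain RDD, assuming an active
SparkContext {{sc}} (the decision-tree internals are not shown):

{code}
val data = sc.parallelize(1 to 1000000)

// Flat aggregation: every partition's partial result is combined at the driver.
val flat = data.aggregate(0L)((acc, x) => acc + x, _ + _)

// Tree aggregation: partial results are merged through intermediate tasks
// (here a depth-2 tree) before the final combine at the driver.
val tree = data.treeAggregate(0L)((acc, x) => acc + x, _ + _, depth = 2)
{code}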



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12737) Decrease the redundant activeIds sent to remote mirrors in "aggregateMessagesWithActiveSet"

2016-01-09 Thread qbwu (JIRA)
qbwu created SPARK-12737:


 Summary: Decrease the redundant activeIds sent to remote mirrors 
in "aggregateMessagesWithActiveSet"
 Key: SPARK-12737
 URL: https://issues.apache.org/jira/browse/SPARK-12737
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.5.2
Reporter: qbwu


Hi, I found that it is not necessary to send the activeIds to all the mirrors 
of a master in the activeSetOpt that is passed to 
aggregateMessagesWithActiveSet. Because the passed EdgeDirection tells us which 
kind of mirrors (classified by their position) will later be checked with 
isActive, we can send the activeIds only to mirrors at those positions; in some 
cases no activeIds need to be sent at all.
I have implemented this and ran some tests with PageRank and 
ConnectedComponents: the shuffle size and the running time decrease, but the 
number of iterations is unchanged.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12737) Decrease the redundant activeIds sent to remote mirrors in "aggregateMessagesWithActiveSet"

2016-01-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12737:


Assignee: (was: Apache Spark)

> Decrease the redundant activeIds sent to remote mirrors in 
> "aggregateMessagesWithActiveSet"
> ---
>
> Key: SPARK-12737
> URL: https://issues.apache.org/jira/browse/SPARK-12737
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.5.2
>Reporter: qbwu
>  Labels: newbie
>
> Hi, I found that it is not necessary to send the activeIds to all the mirrors 
> of a master in the activeSetOpt that is passed to 
> aggregateMessagesWithActiveSet. Because the passed EdgeDirection tells us 
> which kind of mirrors (classified by their position) will later be checked 
> with isActive, we can send the activeIds only to mirrors at those positions; 
> in some cases no activeIds need to be sent at all.
> I have implemented this and ran some tests with PageRank and 
> ConnectedComponents: the shuffle size and the running time decrease, but the 
> number of iterations is unchanged.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12737) Decrease the redundant activeIds sent to remote mirrors in "aggregateMessagesWithActiveSet"

2016-01-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090627#comment-15090627
 ] 

Apache Spark commented on SPARK-12737:
--

User 'qbwu' has created a pull request for this issue:
https://github.com/apache/spark/pull/10676

> Decrease the redundant activeIds sent to remote mirrors in 
> "aggregateMessagesWithActiveSet"
> ---
>
> Key: SPARK-12737
> URL: https://issues.apache.org/jira/browse/SPARK-12737
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.5.2
>Reporter: qbwu
>  Labels: newbie
>
> Hi, I found that it is not necessary to send the activeIds to all the mirrors 
> of a master in the activeSetOpt that is passed to 
> aggregateMessagesWithActiveSet. Because the passed EdgeDirection tells us 
> which kind of mirrors (classified by their position) will later be checked 
> with isActive, we can send the activeIds only to mirrors at those positions; 
> in some cases no activeIds need to be sent at all.
> I have implemented this and ran some tests with PageRank and 
> ConnectedComponents: the shuffle size and the running time decrease, but the 
> number of iterations is unchanged.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12737) Decrease the redundant activeIds sent to remote mirrors in "aggregateMessagesWithActiveSet"

2016-01-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12737:


Assignee: Apache Spark

> Decrease the redundant activeIds sent to remote mirrors in 
> "aggregateMessagesWithActiveSet"
> ---
>
> Key: SPARK-12737
> URL: https://issues.apache.org/jira/browse/SPARK-12737
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.5.2
>Reporter: qbwu
>Assignee: Apache Spark
>  Labels: newbie
>
> Hi, I found that it is not necessary to send the activeIds to all the mirrors 
> of a master in the activeSetOpt that is passed to 
> aggregateMessagesWithActiveSet. Because the passed EdgeDirection tells us 
> which kind of mirrors (classified by their position) will later be checked 
> with isActive, we can send the activeIds only to mirrors at those positions; 
> in some cases no activeIds need to be sent at all.
> I have implemented this and ran some tests with PageRank and 
> ConnectedComponents: the shuffle size and the running time decrease, but the 
> number of iterations is unchanged.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-01-09 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090717#comment-15090717
 ] 

Mark Grover commented on SPARK-12177:
-

#1 Sounds great, thanks!
#2 Yeah, that's the only way I can think of for now but let me ponder a bit 
more. 

Thanks! Looking forward to it.

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that 
> is not compatible with the old one, so I added the new consumer API. I made 
> separate classes in the package org.apache.spark.streaming.kafka.v09 with the 
> changed API. I did not remove the old classes, for better backward 
> compatibility: users will not need to change their old Spark applications when 
> they upgrade to the new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12738) GROUPING__ID is wrong

2016-01-09 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12738:
--

 Summary: GROUPING__ID is wrong
 Key: SPARK-12738
 URL: https://issues.apache.org/jira/browse/SPARK-12738
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0, 1.5.2, 1.4.1, 1.3.1
Reporter: Davies Liu
Priority: Critical


For a grouping set, GROUPING__ID should be 1 if the column is aggregated and 0 
otherwise.

The current implementation is the opposite: 0 for aggregated, 1 for 
not-aggregated.

The implementation in Hive is also wrong; it does not match the docs: 
https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup
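
A query of the kind affected (illustrative, assuming a HiveContext-backed
{{sqlContext}} and an existing table t with columns a and b); the question is
which bit value GROUPING__ID reports for a column that is aggregated away in a
given grouping set:

{code}
sqlContext.sql("""
  SELECT a, b, count(*) AS cnt, GROUPING__ID
  FROM t
  GROUP BY a, b GROUPING SETS ((a, b), (a))
""").show()
{code}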



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12738) GROUPING__ID is wrong

2016-01-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12738:
---
Description: 
For a grouping set, GROUPING__ID should be 1 if the column is aggregated and 0 
otherwise.

The current implementation is the opposite: 0 for aggregated, 1 for 
not-aggregated.

The implementation in Hive is also wrong; it does not match the docs: 
https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup

https://issues.apache.org/jira/browse/HIVE-12833

  was:
For a grouping set, GROUPING__ID should be 1 if the column is aggregated and 0 
otherwise.

The current implementation is the opposite: 0 for aggregated, 1 for 
not-aggregated.

The implementation in Hive is also wrong; it does not match the docs: 
https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup


> GROUPING__ID is wrong
> -
>
> Key: SPARK-12738
> URL: https://issues.apache.org/jira/browse/SPARK-12738
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.0
>Reporter: Davies Liu
>Priority: Critical
>
> For a grouping set, GROUPING__ID should be 1 if the column is aggregated and 
> 0 otherwise.
> The current implementation is the opposite: 0 for aggregated, 1 for 
> not-aggregated.
> The implementation in Hive is also wrong; it does not match the docs: 
> https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup
> https://issues.apache.org/jira/browse/HIVE-12833



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12739) Details of batch in Streaming tab uses two Duration columns

2016-01-09 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-12739:
---

 Summary: Details of batch in Streaming tab uses two Duration 
columns
 Key: SPARK-12739
 URL: https://issues.apache.org/jira/browse/SPARK-12739
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Web UI
Affects Versions: 2.0.0
Reporter: Jacek Laskowski
Priority: Minor


"Details of batch" screen in Streaming tab in web UI uses two Duration columns. 
I think one should be "Processing Time" while the other "Job Duration".

See the attachment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12739) Details of batch in Streaming tab uses two Duration columns

2016-01-09 Thread Jacek Laskowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski updated SPARK-12739:

Attachment: SPARK-12739.png

> Details of batch in Streaming tab uses two Duration columns
> ---
>
> Key: SPARK-12739
> URL: https://issues.apache.org/jira/browse/SPARK-12739
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Web UI
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Priority: Minor
> Attachments: SPARK-12739.png
>
>
> "Details of batch" screen in Streaming tab in web UI uses two Duration 
> columns. I think one should be "Processing Time" while the other "Job 
> Duration".
> See the attachment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12706) support grouping/grouping_id function together group set

2016-01-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12706:


Assignee: Apache Spark  (was: Davies Liu)

> support grouping/grouping_id function together group set
> 
>
> Key: SPARK-12706
> URL: https://issues.apache.org/jira/browse/SPARK-12706
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup#EnhancedAggregation,Cube,GroupingandRollup-Grouping__IDfunction
> http://etutorials.org/SQL/Mastering+Oracle+SQL/Chapter+13.+Advanced+Group+Operations/13.3+The+GROUPING_ID+and+GROUP_ID+Functions/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12706) support grouping/grouping_id function together group set

2016-01-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090804#comment-15090804
 ] 

Apache Spark commented on SPARK-12706:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/10677

> support grouping/grouping_id function together group set
> 
>
> Key: SPARK-12706
> URL: https://issues.apache.org/jira/browse/SPARK-12706
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup#EnhancedAggregation,Cube,GroupingandRollup-Grouping__IDfunction
> http://etutorials.org/SQL/Mastering+Oracle+SQL/Chapter+13.+Advanced+Group+Operations/13.3+The+GROUPING_ID+and+GROUP_ID+Functions/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12706) support grouping/grouping_id function together group set

2016-01-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12706:


Assignee: Davies Liu  (was: Apache Spark)

> support grouping/grouping_id function together group set
> 
>
> Key: SPARK-12706
> URL: https://issues.apache.org/jira/browse/SPARK-12706
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup#EnhancedAggregation,Cube,GroupingandRollup-Grouping__IDfunction
> http://etutorials.org/SQL/Mastering+Oracle+SQL/Chapter+13.+Advanced+Group+Operations/13.3+The+GROUPING_ID+and+GROUP_ID+Functions/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12740) grouping()/grouping_id() should work with having and order by

2016-01-09 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12740:
--

 Summary: grouping()/grouping_id() should work with having and 
order by
 Key: SPARK-12740
 URL: https://issues.apache.org/jira/browse/SPARK-12740
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu


The following query should work:


{code}
select a, b, sum(c) from t group by cube(a, b) having grouping(a) = 0 order by 
grouping_id(a, b)
{code}
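
A spark-shell sketch for exercising the query, assuming the 1.6-era sqlContext 
API; the table, columns, and data are invented for illustration, and per this 
ticket the HAVING/ORDER BY clauses are only expected to resolve once the 
improvement is in place:

{code}
// Illustrative setup only -- the table and data are made up for this sketch.
import sqlContext.implicits._
val t = sc.parallelize(Seq(("x", "p", 1), ("x", "q", 2), ("y", "p", 3)))
  .toDF("a", "b", "c")
t.registerTempTable("t")

// The query from this ticket, with grouping()/grouping_id() in HAVING and ORDER BY.
sqlContext.sql(
  "select a, b, sum(c) from t group by cube(a, b) " +
  "having grouping(a) = 0 order by grouping_id(a, b)").show()
{code}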





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12705) Sorting column can't be resolved if it's not in projection

2016-01-09 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090860#comment-15090860
 ] 

Xiao Li commented on SPARK-12705:
-

Will submit a PR soon. Thanks!

> Sorting column can't be resolved if it's not in projection
> --
>
> Key: SPARK-12705
> URL: https://issues.apache.org/jira/browse/SPARK-12705
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>
> The following query can't be resolved:
> ```
> scala> sqlContext.sql("select sum(a) over ()  from (select 1 as a, 2 as b) t 
> order by b").explain()
> org.apache.spark.sql.AnalysisException: cannot resolve 'b' given input 
> columns: [_c0]; line 1 pos 63
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:282)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:109)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:119)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:123)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> ``` 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12705) Sorting column can't be resolved if it's not in projection

2016-01-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12705:


Assignee: (was: Apache Spark)

> Sorting column can't be resolved if it's not in projection
> --
>
> Key: SPARK-12705
> URL: https://issues.apache.org/jira/browse/SPARK-12705
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>
> The following query can't be resolved:
> ```
> scala> sqlContext.sql("select sum(a) over ()  from (select 1 as a, 2 as b) t 
> order by b").explain()
> org.apache.spark.sql.AnalysisException: cannot resolve 'b' given input 
> columns: [_c0]; line 1 pos 63
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:282)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:109)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:119)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:123)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> ``` 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12705) Sorting column can't be resolved if it's not in projection

2016-01-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090877#comment-15090877
 ] 

Apache Spark commented on SPARK-12705:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/10678

> Sorting column can't be resolved if it's not in projection
> --
>
> Key: SPARK-12705
> URL: https://issues.apache.org/jira/browse/SPARK-12705
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>
> The following query can't be resolved:
> ```
> scala> sqlContext.sql("select sum(a) over ()  from (select 1 as a, 2 as b) t 
> order by b").explain()
> org.apache.spark.sql.AnalysisException: cannot resolve 'b' given input 
> columns: [_c0]; line 1 pos 63
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:282)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:109)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:119)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:123)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> ``` 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12705) Sorting column can't be resolved if it's not in projection

2016-01-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12705:


Assignee: Apache Spark

> Sorting column can't be resolved if it's not in projection
> --
>
> Key: SPARK-12705
> URL: https://issues.apache.org/jira/browse/SPARK-12705
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> The following query can't be resolved:
> ```
> scala> sqlContext.sql("select sum(a) over ()  from (select 1 as a, 2 as b) t 
> order by b").explain()
> org.apache.spark.sql.AnalysisException: cannot resolve 'b' given input 
> columns: [_c0]; line 1 pos 63
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:282)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:109)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:119)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:123)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> ``` 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12705) Sorting column can't be resolved if it's not in projection

2016-01-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12705:

Description: 
The following query can't be resolved:

{code}
scala> sqlContext.sql("select sum(a) over ()  from (select 1 as a, 2 as b) t 
order by b").explain()
org.apache.spark.sql.AnalysisException: cannot resolve 'b' given input columns: 
[_c0]; line 1 pos 63
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:335)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:282)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:322)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:109)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:119)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:123)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
{code}
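
A minimal workaround sketch, not taken from this ticket and assuming the 
1.6-era sqlContext API: keep the sort column in the projection so the analyzer 
can resolve it, then drop it afterwards.

{code}
// Workaround sketch: project b alongside the window aggregate so that
// "order by b" can be resolved, then select only the aggregate column.
val df = sqlContext.sql(
  "select sum(a) over () as s, b from (select 1 as a, 2 as b) t order by b")
df.select("s").explain()
{code}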

  was:
The following query can't be resolved:

```
scala> sqlContext.sql("select sum(a) over ()  from (select 1 as a, 2 as b) t 
order by b").explain()
org.apache.spark.sql.AnalysisException: cannot resolve 'b' given input columns: 
[_c0]; line 1 pos 63
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:335)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:282)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.Trav

[jira] [Resolved] (SPARK-12735) Move spark-ec2 scripts to AMPLab

2016-01-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12735.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Move spark-ec2 scripts to AMPLab
> 
>
> Key: SPARK-12735
> URL: https://issues.apache.org/jira/browse/SPARK-12735
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> It would be easier to fix bugs and maintain the ec2 script separately from 
> Spark releases.
> For more information, see https://issues.apache.org/jira/browse/SPARK-9562



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12691) Multiple unionAll on Dataframe seems to cause repeated calculations in a "Fibonacci" manner

2016-01-09 Thread Allen Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Liang updated SPARK-12691:

Description: 
Multiple unionAll on Dataframe seems to cause repeated calculations. Here is 
the sample code to reproduce this issue.

val dfs = for (i<-0 to 100) yield {
  val df = sc.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
  df
}

var i = 1
val s1 = System.currentTimeMillis()
dfs.reduce{(a,b)=>{
  val t1 = System.currentTimeMillis()

  val dd = a unionAll b

  val t2 = System.currentTimeMillis()
  println("Round " + i + " unionAll took " + (t2 - t1) + " ms")
  i = i + 1
  dd
  }
}
val s2 = System.currentTimeMillis()
println((i - 1) + " unionAll took totally " + (s2 - s1) + " ms")

The output is shown below. As you can see, each unionAll appears to redo all of 
the previous unionAlls, so each round takes its own time plus all of the 
previous time, which (loosely speaking) makes the cost of each unionAll grow in 
a "Fibonacci"-like manner.

BTW, this behaviour doesn't happen if I directly union the RDDs underlying the 
Dataframes.

- output start 
Round 1 unionAll took 1 ms
Round 2 unionAll took 1 ms
Round 3 unionAll took 1 ms
Round 4 unionAll took 1 ms
Round 5 unionAll took 1 ms
Round 6 unionAll took 1 ms
Round 7 unionAll took 1 ms
Round 8 unionAll took 2 ms
Round 9 unionAll took 2 ms
Round 10 unionAll took 2 ms
Round 11 unionAll took 3 ms
Round 12 unionAll took 3 ms
Round 13 unionAll took 3 ms
Round 14 unionAll took 3 ms
Round 15 unionAll took 3 ms
Round 16 unionAll took 4 ms
Round 17 unionAll took 4 ms
Round 18 unionAll took 4 ms
Round 19 unionAll took 4 ms
Round 20 unionAll took 4 ms
Round 21 unionAll took 5 ms
Round 22 unionAll took 5 ms
Round 23 unionAll took 5 ms
Round 24 unionAll took 5 ms
Round 25 unionAll took 5 ms
Round 26 unionAll took 6 ms
Round 27 unionAll took 6 ms
Round 28 unionAll took 6 ms
Round 29 unionAll took 6 ms
Round 30 unionAll took 6 ms
Round 31 unionAll took 6 ms
Round 32 unionAll took 7 ms
Round 33 unionAll took 7 ms
Round 34 unionAll took 7 ms
Round 35 unionAll took 7 ms
Round 36 unionAll took 7 ms
Round 37 unionAll took 8 ms
Round 38 unionAll took 8 ms
Round 39 unionAll took 8 ms
Round 40 unionAll took 8 ms
Round 41 unionAll took 9 ms
Round 42 unionAll took 9 ms
Round 43 unionAll took 9 ms
Round 44 unionAll took 9 ms
Round 45 unionAll took 9 ms
Round 46 unionAll took 9 ms
Round 47 unionAll took 9 ms
Round 48 unionAll took 9 ms
Round 49 unionAll took 10 ms
Round 50 unionAll took 10 ms
Round 51 unionAll took 10 ms
Round 52 unionAll took 10 ms
Round 53 unionAll took 11 ms
Round 54 unionAll took 11 ms
Round 55 unionAll took 11 ms
Round 56 unionAll took 12 ms
Round 57 unionAll took 12 ms
Round 58 unionAll took 12 ms
Round 59 unionAll took 12 ms
Round 60 unionAll took 12 ms
Round 61 unionAll took 12 ms
Round 62 unionAll took 13 ms
Round 63 unionAll took 13 ms
Round 64 unionAll took 13 ms
Round 65 unionAll took 13 ms
Round 66 unionAll took 14 ms
Round 67 unionAll took 14 ms
Round 68 unionAll took 14 ms
Round 69 unionAll took 14 ms
Round 70 unionAll took 14 ms
Round 71 unionAll took 14 ms
Round 72 unionAll took 14 ms
Round 73 unionAll took 14 ms
Round 74 unionAll took 15 ms
Round 75 unionAll took 15 ms
Round 76 unionAll took 15 ms
Round 77 unionAll took 15 ms
Round 78 unionAll took 16 ms
Round 79 unionAll took 16 ms
Round 80 unionAll took 16 ms
Round 81 unionAll took 16 ms
Round 82 unionAll took 17 ms
Round 83 unionAll took 17 ms
Round 84 unionAll took 17 ms
Round 85 unionAll took 17 ms
Round 86 unionAll took 17 ms
Round 87 unionAll took 18 ms
Round 88 unionAll took 17 ms
Round 89 unionAll took 18 ms
Round 90 unionAll took 18 ms
Round 91 unionAll took 18 ms
Round 92 unionAll took 18 ms
Round 93 unionAll took 18 ms
Round 94 unionAll took 19 ms
Round 95 unionAll took 19 ms
Round 96 unionAll took 20 ms
Round 97 unionAll took 20 ms
Round 98 unionAll took 20 ms
Round 99 unionAll took 20 ms
Round 100 unionAll took 20 ms
100 unionAll took totally 1337 ms

- output end 




  was:
Multiple unionAll on Dataframe seems to cause repeated calculations. Here is 
the sample code to reproduce this issue.

val dfs = for (i<-0 to 100) yield {
  val df = sc.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
  df
}

var i = 1
val s1 = System.currentTimeMillis()
dfs.reduce{(a,b)=>{
  val t1 = System.currentTimeMillis()
  val dd = a unionAll b
  val t2 = System.currentTimeMillis()
  println("Round " + i + " unionAll took " + (t2 - t1) + " ms")
  i = i + 1
  dd
  }
}
val s2 = System.currentTimeMillis()
println((i - 1) + " unionAll took totally " + (s2 - s1) + " ms")

And it printed as follows. And as you can see, it looks like each unionAll 
seems to redo all the previous unionAll and therefore took self time plus all 
previous time, which, not precisely speaking, makes each unionAll look like a 
"Fibonacci" action.

BTW, this behaviour doesn't happen if I directly union all the RDDs in 
Dataframes.


[jira] [Comment Edited] (SPARK-12691) Multiple unionAll on Dataframe seems to cause repeated calculations in a "Fibonacci" manner

2016-01-09 Thread Allen Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090896#comment-15090896
 ] 

Allen Liang edited comment on SPARK-12691 at 1/10/16 5:22 AM:
--

Hi Bo Meng,

I understand you point, but why size of dataframe matters here. Actually the 
same thing doesn't happen when we do union on RDDs. If you union all the RDDs 
in dataframes in above sample code, you'll find each round of RDD union will 
take a relatively consistent time (NOT growing at all), which is expected.

And I believe it has nothing to do with the size of dataframe. The code 
attached is a simple sample to reproduce this issue. The time costed may not 
seem to be terrible here.  In our real case, where we we have 202 dataframes 
(which all happen to be empty dataframe) to unionAll, and it took around over 
20 minutes to complete, which is not obviously acceptable. 

To workaround this issue we actually directly unioned all the RDDs in those 202 
dataframes and convert back the final RDD to dataframe in the end. And that 
whole workaround took like around only 20+ seconds to complete, compared to 20+ 
minutes when unionAll 202 empty dataframes, this already is a huge improvement. 

I think there has to be something wrong in the multiple dataframe unionAll or 
let's say there has to be something we can improve here.


was (Author: lliang):
Hi Bo Meng,

I understand you point, but why size of dataframe matters here. Actually the 
same things doesn't happen when we do same thing on RDDs. If you union all the 
RDDs in dataframes in above sample code, you'll find each round of RDD union 
will take a relatively consistent time (NOT growing at all), which is expected.

And I believe it has nothing to do with the size of dataframe. The code 
attached is a simple sample to reproduce this issue. The time costed may not 
seem to be terrible here.  In our real case, where we we have 202 dataframes 
(which all happen to be empty dataframe) to unionAll, and it took around over 
20 minutes to complete, which is not obviously acceptable. 

To workaround this issue we actually directly unioned all the RDDs in those 202 
dataframes and convert back the final RDD to dataframe in the end. And that 
whole workaround took like around only 20+ seconds to complete, compared to 20+ 
minutes when unionAll 202 empty dataframes, this already is a huge improvement. 

I think there has to be something wrong in the multiple dataframe unionAll or 
let's say there has to be something we can improve here.

> Multiple unionAll on Dataframe seems to cause repeated calculations in a 
> "Fibonacci" manner
> ---
>
> Key: SPARK-12691
> URL: https://issues.apache.org/jira/browse/SPARK-12691
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1
> Environment: Tested in Spark 1.3 and 1.4.
>Reporter: Allen Liang
>
> Multiple unionAll on Dataframe seems to cause repeated calculations. Here is 
> the sample code to reproduce this issue.
> val dfs = for (i<-0 to 100) yield {
>   val df = sc.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
>   df
> }
> var i = 1
> val s1 = System.currentTimeMillis()
> dfs.reduce{(a,b)=>{
>   val t1 = System.currentTimeMillis()
>   val dd = a unionAll b
>   val t2 = System.currentTimeMillis()
>   println("Round " + i + " unionAll took " + (t2 - t1) + " ms")
>   i = i + 1
>   dd
>   }
> }
> val s2 = System.currentTimeMillis()
> println((i - 1) + " unionAll took totally " + (s2 - s1) + " ms")
> And it printed as follows. And as you can see, it looks like each unionAll 
> seems to redo all the previous unionAll and therefore took self time plus all 
> previous time, which, not precisely speaking, makes each unionAll look like a 
> "Fibonacci" action.
> BTW, this behaviour doesn't happen if I directly union all the RDDs in 
> Dataframes.
> - output start 
> Round 1 unionAll took 1 ms
> Round 2 unionAll took 1 ms
> Round 3 unionAll took 1 ms
> Round 4 unionAll took 1 ms
> Round 5 unionAll took 1 ms
> Round 6 unionAll took 1 ms
> Round 7 unionAll took 1 ms
> Round 8 unionAll took 2 ms
> Round 9 unionAll took 2 ms
> Round 10 unionAll took 2 ms
> Round 11 unionAll took 3 ms
> Round 12 unionAll took 3 ms
> Round 13 unionAll took 3 ms
> Round 14 unionAll took 3 ms
> Round 15 unionAll took 3 ms
> Round 16 unionAll took 4 ms
> Round 17 unionAll took 4 ms
> Round 18 unionAll took 4 ms
> Round 19 unionAll took 4 ms
> Round 20 unionAll took 4 ms
> Round 21 unionAll took 5 ms
> Round 22 unionAll took 5 ms
> Round 23 unionAll took 5 ms
> Round 24 unionAll took 5 ms
> Round 25 unionAll took 5 ms
> Round 26 unionAll took 6 ms
> Round 27 unionAll took 6 ms
> Round 28 unionAll took 6 ms
> Round 29

[jira] [Commented] (SPARK-12691) Multiple unionAll on Dataframe seems to cause repeated calculations in a "Fibonacci" manner

2016-01-09 Thread Allen Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090896#comment-15090896
 ] 

Allen Liang commented on SPARK-12691:
-

Hi Bo Meng,

I understand your point, but why would the size of the dataframes matter here? 
The same thing doesn't happen when we do the same sequence of unions on RDDs: 
if you union all the RDDs underlying the dataframes in the sample code above, 
each round of RDD union takes a relatively consistent time (it does not grow at 
all), which is what we expect.

And I believe it has nothing to do with the size of the dataframes. The 
attached code is just a simple sample to reproduce the issue, so the time cost 
may not look terrible here. In our real case we have 202 dataframes (which all 
happen to be empty) to unionAll, and it took over 20 minutes to complete, which 
is obviously not acceptable.

To work around this issue we directly unioned the RDDs in those 202 dataframes 
and converted the final RDD back to a dataframe at the end. That whole 
workaround took only 20-odd seconds to complete; compared to 20+ minutes when 
unionAll-ing 202 empty dataframes, that is already a huge improvement.

I think there is something wrong with multiple dataframe unionAll calls, or at 
least something we can improve here.
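
A sketch of the RDD-based workaround described above (illustrative only; it 
reuses the dfs and sqlContext names from the sample code and assumes all 
dataframes share the same schema):

{code}
// Workaround sketch: union the underlying RDDs instead of the DataFrames,
// then rebuild a single DataFrame at the end. Assumes identical schemas.
val schema = dfs.head.schema
val unionedRdd = dfs.map(_.rdd).reduce(_ union _)
val result = sqlContext.createDataFrame(unionedRdd, schema)
result.count()
{code}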

> Multiple unionAll on Dataframe seems to cause repeated calculations in a 
> "Fibonacci" manner
> ---
>
> Key: SPARK-12691
> URL: https://issues.apache.org/jira/browse/SPARK-12691
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1
> Environment: Tested in Spark 1.3 and 1.4.
>Reporter: Allen Liang
>
> Multiple unionAll on Dataframe seems to cause repeated calculations. Here is 
> the sample code to reproduce this issue.
> val dfs = for (i<-0 to 100) yield {
>   val df = sc.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
>   df
> }
> var i = 1
> val s1 = System.currentTimeMillis()
> dfs.reduce{(a,b)=>{
>   val t1 = System.currentTimeMillis()
>   val dd = a unionAll b
>   val t2 = System.currentTimeMillis()
>   println("Round " + i + " unionAll took " + (t2 - t1) + " ms")
>   i = i + 1
>   dd
>   }
> }
> val s2 = System.currentTimeMillis()
> println((i - 1) + " unionAll took totally " + (s2 - s1) + " ms")
> And it printed as follows. And as you can see, it looks like each unionAll 
> seems to redo all the previous unionAll and therefore took self time plus all 
> previous time, which, not precisely speaking, makes each unionAll look like a 
> "Fibonacci" action.
> BTW, this behaviour doesn't happen if I directly union all the RDDs in 
> Dataframes.
> - output start 
> Round 1 unionAll took 1 ms
> Round 2 unionAll took 1 ms
> Round 3 unionAll took 1 ms
> Round 4 unionAll took 1 ms
> Round 5 unionAll took 1 ms
> Round 6 unionAll took 1 ms
> Round 7 unionAll took 1 ms
> Round 8 unionAll took 2 ms
> Round 9 unionAll took 2 ms
> Round 10 unionAll took 2 ms
> Round 11 unionAll took 3 ms
> Round 12 unionAll took 3 ms
> Round 13 unionAll took 3 ms
> Round 14 unionAll took 3 ms
> Round 15 unionAll took 3 ms
> Round 16 unionAll took 4 ms
> Round 17 unionAll took 4 ms
> Round 18 unionAll took 4 ms
> Round 19 unionAll took 4 ms
> Round 20 unionAll took 4 ms
> Round 21 unionAll took 5 ms
> Round 22 unionAll took 5 ms
> Round 23 unionAll took 5 ms
> Round 24 unionAll took 5 ms
> Round 25 unionAll took 5 ms
> Round 26 unionAll took 6 ms
> Round 27 unionAll took 6 ms
> Round 28 unionAll took 6 ms
> Round 29 unionAll took 6 ms
> Round 30 unionAll took 6 ms
> Round 31 unionAll took 6 ms
> Round 32 unionAll took 7 ms
> Round 33 unionAll took 7 ms
> Round 34 unionAll took 7 ms
> Round 35 unionAll took 7 ms
> Round 36 unionAll took 7 ms
> Round 37 unionAll took 8 ms
> Round 38 unionAll took 8 ms
> Round 39 unionAll took 8 ms
> Round 40 unionAll took 8 ms
> Round 41 unionAll took 9 ms
> Round 42 unionAll took 9 ms
> Round 43 unionAll took 9 ms
> Round 44 unionAll took 9 ms
> Round 45 unionAll took 9 ms
> Round 46 unionAll took 9 ms
> Round 47 unionAll took 9 ms
> Round 48 unionAll took 9 ms
> Round 49 unionAll took 10 ms
> Round 50 unionAll took 10 ms
> Round 51 unionAll took 10 ms
> Round 52 unionAll took 10 ms
> Round 53 unionAll took 11 ms
> Round 54 unionAll took 11 ms
> Round 55 unionAll took 11 ms
> Round 56 unionAll took 12 ms
> Round 57 unionAll took 12 ms
> Round 58 unionAll took 12 ms
> Round 59 unionAll took 12 ms
> Round 60 unionAll took 12 ms
> Round 61 unionAll took 12 ms
> Round 62 unionAll took 13 ms
> Round 63 unionAll took 13 ms
> Round 64 unionAll took 13 ms
> Round 65 unionAll took 13 ms
> Round 66 unionAll took 14 ms
> Round 67 unionAll took 14 ms
> Round 68 unionAll took 14 ms
> Round 69 un

[jira] [Comment Edited] (SPARK-12691) Multiple unionAll on Dataframe seems to cause repeated calculations in a "Fibonacci" manner

2016-01-09 Thread Allen Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090896#comment-15090896
 ] 

Allen Liang edited comment on SPARK-12691 at 1/10/16 5:23 AM:
--

Hi Bo Meng,

I understand you point, but why size of dataframe matters here. Actually the 
same thing doesn't happen when we do union on RDDs. If you union all the RDDs 
in dataframes in above sample code, you'll find each round of RDD union will 
take a relatively consistent time (NOT growing at all), which is expected.

And I don't think it has something to do with the size of dataframe. The code 
attached is a simple sample to reproduce this issue. The time costed may not 
seem to be terrible here.  In our real case, where we we have 202 dataframes 
(which all happen to be empty dataframe) to unionAll, and it took around over 
20 minutes to complete, which is not obviously acceptable. 

To workaround this issue we actually directly unioned all the RDDs in those 202 
dataframes and convert back the final RDD to dataframe in the end. And that 
whole workaround took like around only 20+ seconds to complete, compared to 20+ 
minutes when unionAll 202 empty dataframes, this already is a huge improvement. 

I think there has to be something wrong in the multiple dataframe unionAll or 
let's say there has to be something we can improve here.


was (Author: lliang):
Hi Bo Meng,

I understand you point, but why size of dataframe matters here. Actually the 
same thing doesn't happen when we do union on RDDs. If you union all the RDDs 
in dataframes in above sample code, you'll find each round of RDD union will 
take a relatively consistent time (NOT growing at all), which is expected.

And I believe it has nothing to do with the size of dataframe. The code 
attached is a simple sample to reproduce this issue. The time costed may not 
seem to be terrible here.  In our real case, where we we have 202 dataframes 
(which all happen to be empty dataframe) to unionAll, and it took around over 
20 minutes to complete, which is not obviously acceptable. 

To workaround this issue we actually directly unioned all the RDDs in those 202 
dataframes and convert back the final RDD to dataframe in the end. And that 
whole workaround took like around only 20+ seconds to complete, compared to 20+ 
minutes when unionAll 202 empty dataframes, this already is a huge improvement. 

I think there has to be something wrong in the multiple dataframe unionAll or 
let's say there has to be something we can improve here.

> Multiple unionAll on Dataframe seems to cause repeated calculations in a 
> "Fibonacci" manner
> ---
>
> Key: SPARK-12691
> URL: https://issues.apache.org/jira/browse/SPARK-12691
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1
> Environment: Tested in Spark 1.3 and 1.4.
>Reporter: Allen Liang
>
> Multiple unionAll on Dataframe seems to cause repeated calculations. Here is 
> the sample code to reproduce this issue.
> val dfs = for (i<-0 to 100) yield {
>   val df = sc.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
>   df
> }
> var i = 1
> val s1 = System.currentTimeMillis()
> dfs.reduce{(a,b)=>{
>   val t1 = System.currentTimeMillis()
>   val dd = a unionAll b
>   val t2 = System.currentTimeMillis()
>   println("Round " + i + " unionAll took " + (t2 - t1) + " ms")
>   i = i + 1
>   dd
>   }
> }
> val s2 = System.currentTimeMillis()
> println((i - 1) + " unionAll took totally " + (s2 - s1) + " ms")
> And it printed as follows. And as you can see, it looks like each unionAll 
> seems to redo all the previous unionAll and therefore took self time plus all 
> previous time, which, not precisely speaking, makes each unionAll look like a 
> "Fibonacci" action.
> BTW, this behaviour doesn't happen if I directly union all the RDDs in 
> Dataframes.
> - output start 
> Round 1 unionAll took 1 ms
> Round 2 unionAll took 1 ms
> Round 3 unionAll took 1 ms
> Round 4 unionAll took 1 ms
> Round 5 unionAll took 1 ms
> Round 6 unionAll took 1 ms
> Round 7 unionAll took 1 ms
> Round 8 unionAll took 2 ms
> Round 9 unionAll took 2 ms
> Round 10 unionAll took 2 ms
> Round 11 unionAll took 3 ms
> Round 12 unionAll took 3 ms
> Round 13 unionAll took 3 ms
> Round 14 unionAll took 3 ms
> Round 15 unionAll took 3 ms
> Round 16 unionAll took 4 ms
> Round 17 unionAll took 4 ms
> Round 18 unionAll took 4 ms
> Round 19 unionAll took 4 ms
> Round 20 unionAll took 4 ms
> Round 21 unionAll took 5 ms
> Round 22 unionAll took 5 ms
> Round 23 unionAll took 5 ms
> Round 24 unionAll took 5 ms
> Round 25 unionAll took 5 ms
> Round 26 unionAll took 6 ms
> Round 27 unionAll took 6 ms
> Round 28 unionAll took 6 ms
> Round 29

[jira] [Comment Edited] (SPARK-12691) Multiple unionAll on Dataframe seems to cause repeated calculations in a "Fibonacci" manner

2016-01-09 Thread Allen Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090896#comment-15090896
 ] 

Allen Liang edited comment on SPARK-12691 at 1/10/16 5:24 AM:
--

Hi Bo Meng,

I understand you point, but why size of dataframe matters here. Actually the 
same thing doesn't happen when we do multiple unions on RDDs. If you union all 
the RDDs in dataframes in above sample code, you'll find each round of RDD 
union will take a relatively consistent time (NOT growing at all), which is 
expected.

And I don't think it has something to do with the size of dataframe. The code 
attached is a simple sample to reproduce this issue. The time costed may not 
seem to be terrible here.  In our real case, where we we have 202 dataframes 
(which all happen to be empty dataframe) to unionAll, and it took around over 
20 minutes to complete, which is not obviously acceptable. 

To workaround this issue we actually directly unioned all the RDDs in those 202 
dataframes and convert back the final RDD to dataframe in the end. And that 
whole workaround took like around only 20+ seconds to complete, compared to 20+ 
minutes when unionAll 202 empty dataframes, this already is a huge improvement. 

I think there has to be something wrong in the multiple dataframe unionAll or 
let's say there has to be something we can improve here.


was (Author: lliang):
Hi Bo Meng,

I understand you point, but why size of dataframe matters here. Actually the 
same thing doesn't happen when we do union on RDDs. If you union all the RDDs 
in dataframes in above sample code, you'll find each round of RDD union will 
take a relatively consistent time (NOT growing at all), which is expected.

And I don't think it has something to do with the size of dataframe. The code 
attached is a simple sample to reproduce this issue. The time costed may not 
seem to be terrible here.  In our real case, where we we have 202 dataframes 
(which all happen to be empty dataframe) to unionAll, and it took around over 
20 minutes to complete, which is not obviously acceptable. 

To workaround this issue we actually directly unioned all the RDDs in those 202 
dataframes and convert back the final RDD to dataframe in the end. And that 
whole workaround took like around only 20+ seconds to complete, compared to 20+ 
minutes when unionAll 202 empty dataframes, this already is a huge improvement. 

I think there has to be something wrong in the multiple dataframe unionAll or 
let's say there has to be something we can improve here.

> Multiple unionAll on Dataframe seems to cause repeated calculations in a 
> "Fibonacci" manner
> ---
>
> Key: SPARK-12691
> URL: https://issues.apache.org/jira/browse/SPARK-12691
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1
> Environment: Tested in Spark 1.3 and 1.4.
>Reporter: Allen Liang
>
> Multiple unionAll on Dataframe seems to cause repeated calculations. Here is 
> the sample code to reproduce this issue.
> val dfs = for (i<-0 to 100) yield {
>   val df = sc.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
>   df
> }
> var i = 1
> val s1 = System.currentTimeMillis()
> dfs.reduce{(a,b)=>{
>   val t1 = System.currentTimeMillis()
>   val dd = a unionAll b
>   val t2 = System.currentTimeMillis()
>   println("Round " + i + " unionAll took " + (t2 - t1) + " ms")
>   i = i + 1
>   dd
>   }
> }
> val s2 = System.currentTimeMillis()
> println((i - 1) + " unionAll took totally " + (s2 - s1) + " ms")
> And it printed as follows. And as you can see, it looks like each unionAll 
> seems to redo all the previous unionAll and therefore took self time plus all 
> previous time, which, not precisely speaking, makes each unionAll look like a 
> "Fibonacci" action.
> BTW, this behaviour doesn't happen if I directly union all the RDDs in 
> Dataframes.
> - output start 
> Round 1 unionAll took 1 ms
> Round 2 unionAll took 1 ms
> Round 3 unionAll took 1 ms
> Round 4 unionAll took 1 ms
> Round 5 unionAll took 1 ms
> Round 6 unionAll took 1 ms
> Round 7 unionAll took 1 ms
> Round 8 unionAll took 2 ms
> Round 9 unionAll took 2 ms
> Round 10 unionAll took 2 ms
> Round 11 unionAll took 3 ms
> Round 12 unionAll took 3 ms
> Round 13 unionAll took 3 ms
> Round 14 unionAll took 3 ms
> Round 15 unionAll took 3 ms
> Round 16 unionAll took 4 ms
> Round 17 unionAll took 4 ms
> Round 18 unionAll took 4 ms
> Round 19 unionAll took 4 ms
> Round 20 unionAll took 4 ms
> Round 21 unionAll took 5 ms
> Round 22 unionAll took 5 ms
> Round 23 unionAll took 5 ms
> Round 24 unionAll took 5 ms
> Round 25 unionAll took 5 ms
> Round 26 unionAll took 6 ms
> Round 27 unionAll took 6 ms
> Round 28 unionAll too

[jira] [Comment Edited] (SPARK-12691) Multiple unionAll on Dataframe seems to cause repeated calculations in a "Fibonacci" manner

2016-01-09 Thread Allen Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090896#comment-15090896
 ] 

Allen Liang edited comment on SPARK-12691 at 1/10/16 5:25 AM:
--

Hi Bo Meng,

I understand you point, but why size of dataframe matters here. Actually the 
same thing doesn't happen when we do multiple unions on RDDs. If you union all 
the RDDs in dataframes in above sample code, you'll find each round of RDD 
union will take a relatively consistent time (NOT growing at all), which is 
expected.

And I don't think it has something to do with the size of dataframe. The code 
attached is a simple sample to reproduce this issue. The time costed may not 
seem to be terrible here.  In our real case, where we we have 202 dataframes 
(which all happen to be empty dataframe) to unionAll, and it took around over 
20 minutes to complete, which is not obviously acceptable. 

To workaround this issue we actually directly unioned all the RDDs in those 202 
dataframes and convert back the final RDD to dataframe in the end. And that 
whole workaround took around only 20+ seconds to complete. Compared to 20+ 
minutes when unionAll 202 empty dataframes, this already is a huge improvement. 

I think there has to be something wrong in the multiple dataframe unionAll or 
let's say there has to be something we can improve here.


was (Author: lliang):
Hi Bo Meng,

I understand you point, but why size of dataframe matters here. Actually the 
same thing doesn't happen when we do multiple unions on RDDs. If you union all 
the RDDs in dataframes in above sample code, you'll find each round of RDD 
union will take a relatively consistent time (NOT growing at all), which is 
expected.

And I don't think it has something to do with the size of dataframe. The code 
attached is a simple sample to reproduce this issue. The time costed may not 
seem to be terrible here.  In our real case, where we we have 202 dataframes 
(which all happen to be empty dataframe) to unionAll, and it took around over 
20 minutes to complete, which is not obviously acceptable. 

To workaround this issue we actually directly unioned all the RDDs in those 202 
dataframes and convert back the final RDD to dataframe in the end. And that 
whole workaround took like around only 20+ seconds to complete, compared to 20+ 
minutes when unionAll 202 empty dataframes, this already is a huge improvement. 

I think there has to be something wrong in the multiple dataframe unionAll or 
let's say there has to be something we can improve here.

> Multiple unionAll on Dataframe seems to cause repeated calculations in a 
> "Fibonacci" manner
> ---
>
> Key: SPARK-12691
> URL: https://issues.apache.org/jira/browse/SPARK-12691
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1
> Environment: Tested in Spark 1.3 and 1.4.
>Reporter: Allen Liang
>
> Multiple unionAll on Dataframe seems to cause repeated calculations. Here is 
> the sample code to reproduce this issue.
> val dfs = for (i<-0 to 100) yield {
>   val df = sc.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
>   df
> }
> var i = 1
> val s1 = System.currentTimeMillis()
> dfs.reduce{(a,b)=>{
>   val t1 = System.currentTimeMillis()
>   val dd = a unionAll b
>   val t2 = System.currentTimeMillis()
>   println("Round " + i + " unionAll took " + (t2 - t1) + " ms")
>   i = i + 1
>   dd
>   }
> }
> val s2 = System.currentTimeMillis()
> println((i - 1) + " unionAll took totally " + (s2 - s1) + " ms")
> And it printed as follows. And as you can see, it looks like each unionAll 
> seems to redo all the previous unionAll and therefore took self time plus all 
> previous time, which, not precisely speaking, makes each unionAll look like a 
> "Fibonacci" action.
> BTW, this behaviour doesn't happen if I directly union all the RDDs in 
> Dataframes.
> - output start 
> Round 1 unionAll took 1 ms
> Round 2 unionAll took 1 ms
> Round 3 unionAll took 1 ms
> Round 4 unionAll took 1 ms
> Round 5 unionAll took 1 ms
> Round 6 unionAll took 1 ms
> Round 7 unionAll took 1 ms
> Round 8 unionAll took 2 ms
> Round 9 unionAll took 2 ms
> Round 10 unionAll took 2 ms
> Round 11 unionAll took 3 ms
> Round 12 unionAll took 3 ms
> Round 13 unionAll took 3 ms
> Round 14 unionAll took 3 ms
> Round 15 unionAll took 3 ms
> Round 16 unionAll took 4 ms
> Round 17 unionAll took 4 ms
> Round 18 unionAll took 4 ms
> Round 19 unionAll took 4 ms
> Round 20 unionAll took 4 ms
> Round 21 unionAll took 5 ms
> Round 22 unionAll took 5 ms
> Round 23 unionAll took 5 ms
> Round 24 unionAll took 5 ms
> Round 25 unionAll took 5 ms
> Round 26 unionAll took 6 ms
> Round 27 unionAll took 6 ms
> Round 28 unionA

[jira] [Comment Edited] (SPARK-12691) Multiple unionAll on Dataframe seems to cause repeated calculations in a "Fibonacci" manner

2016-01-09 Thread Allen Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090896#comment-15090896
 ] 

Allen Liang edited comment on SPARK-12691 at 1/10/16 5:27 AM:
--

Hi Bo Meng,

I understand you point, but why size of dataframe matters here. Actually the 
same thing doesn't happen when we do multiple unions on RDDs. If you union all 
the RDDs in dataframes in above sample code, you'll find each round of RDD 
union will take a relatively consistent time (NOT growing at all), which is 
expected.

And I don't think it has something to do with the size of dataframe. The code 
attached is a simple sample to reproduce this issue. The time costed may not 
seem to be terrible here.  In our real case, where we we have 202 dataframes 
(which all happen to be empty dataframe) to unionAll, and it took around over 
20 minutes to complete, which obviously is not acceptable. 

To workaround this issue we actually directly unioned all the RDDs in those 202 
dataframes and convert back the final RDD to dataframe in the end. And that 
whole workaround took around only 20+ seconds to complete. Compared to 20+ 
minutes when unionAll 202 empty dataframes, this already is a huge improvement. 

I think there has to be something wrong in the multiple dataframe unionAll or 
let's say there has to be something we can improve here.


was (Author: lliang):
Hi Bo Meng,

I understand you point, but why size of dataframe matters here. Actually the 
same thing doesn't happen when we do multiple unions on RDDs. If you union all 
the RDDs in dataframes in above sample code, you'll find each round of RDD 
union will take a relatively consistent time (NOT growing at all), which is 
expected.

And I don't think it has something to do with the size of dataframe. The code 
attached is a simple sample to reproduce this issue. The time costed may not 
seem to be terrible here.  In our real case, where we we have 202 dataframes 
(which all happen to be empty dataframe) to unionAll, and it took around over 
20 minutes to complete, which is not obviously acceptable. 

To workaround this issue we actually directly unioned all the RDDs in those 202 
dataframes and convert back the final RDD to dataframe in the end. And that 
whole workaround took around only 20+ seconds to complete. Compared to 20+ 
minutes when unionAll 202 empty dataframes, this already is a huge improvement. 

I think there has to be something wrong in the multiple dataframe unionAll or 
let's say there has to be something we can improve here.

> Multiple unionAll on Dataframe seems to cause repeated calculations in a 
> "Fibonacci" manner
> ---
>
> Key: SPARK-12691
> URL: https://issues.apache.org/jira/browse/SPARK-12691
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1
> Environment: Tested in Spark 1.3 and 1.4.
>Reporter: Allen Liang
>
> Multiple unionAll on Dataframe seems to cause repeated calculations. Here is 
> the sample code to reproduce this issue.
> val dfs = for (i<-0 to 100) yield {
>   val df = sc.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
>   df
> }
> var i = 1
> val s1 = System.currentTimeMillis()
> dfs.reduce{(a,b)=>{
>   val t1 = System.currentTimeMillis()
>   val dd = a unionAll b
>   val t2 = System.currentTimeMillis()
>   println("Round " + i + " unionAll took " + (t2 - t1) + " ms")
>   i = i + 1
>   dd
>   }
> }
> val s2 = System.currentTimeMillis()
> println((i - 1) + " unionAll took totally " + (s2 - s1) + " ms")
> And it printed as follows. And as you can see, it looks like each unionAll 
> seems to redo all the previous unionAll and therefore took self time plus all 
> previous time, which, not precisely speaking, makes each unionAll look like a 
> "Fibonacci" action.
> BTW, this behaviour doesn't happen if I directly union all the RDDs in 
> Dataframes.
> - output start 
> Round 1 unionAll took 1 ms
> Round 2 unionAll took 1 ms
> Round 3 unionAll took 1 ms
> Round 4 unionAll took 1 ms
> Round 5 unionAll took 1 ms
> Round 6 unionAll took 1 ms
> Round 7 unionAll took 1 ms
> Round 8 unionAll took 2 ms
> Round 9 unionAll took 2 ms
> Round 10 unionAll took 2 ms
> Round 11 unionAll took 3 ms
> Round 12 unionAll took 3 ms
> Round 13 unionAll took 3 ms
> Round 14 unionAll took 3 ms
> Round 15 unionAll took 3 ms
> Round 16 unionAll took 4 ms
> Round 17 unionAll took 4 ms
> Round 18 unionAll took 4 ms
> Round 19 unionAll took 4 ms
> Round 20 unionAll took 4 ms
> Round 21 unionAll took 5 ms
> Round 22 unionAll took 5 ms
> Round 23 unionAll took 5 ms
> Round 24 unionAll took 5 ms
> Round 25 unionAll took 5 ms
> Round 26 unionAll took 6 ms
> Round 27 unionAll took 6 ms
> Round 28 unionAll to

[jira] [Comment Edited] (SPARK-12691) Multiple unionAll on Dataframe seems to cause repeated calculations in a "Fibonacci" manner

2016-01-09 Thread Allen Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090896#comment-15090896
 ] 

Allen Liang edited comment on SPARK-12691 at 1/10/16 5:28 AM:
--

Hi Bo Meng,

I understand your point, but why would the size of the dataframe matter here?
The same thing doesn't happen when we do multiple unions on RDDs. If you union
all the RDDs underlying the dataframes in the sample code above, you'll find
that each round of RDD union takes a relatively consistent time (NOT growing at
all), which is expected.

And I don't think it has anything to do with the size of the dataframe. The
attached code is just a simple sample to reproduce the issue, so the time cost
may not look terrible here. In our real case we have 202 dataframes (which all
happen to be empty) to unionAll, and it took over 20 minutes to complete, which
is obviously not acceptable.

To work around this issue we directly unioned the RDDs of those 202 dataframes
and converted the final RDD back to a dataframe at the end. That whole
workaround took only about 20 seconds to complete. Compared to the 20+ minutes
it takes to unionAll 202 empty dataframes, this is already a huge improvement.

I think there has to be something wrong with multiple dataframe unionAll calls,
or at least something we can improve here.


was (Author: lliang):
Hi Bo Meng,

I understand your point, but why would the size of the dataframe matter here?
The same thing doesn't happen when we do multiple unions on RDDs. If you union
all the RDDs underlying the dataframes in the sample code above, you'll find
that each round of RDD union takes a relatively consistent time (NOT growing at
all), which is expected.

And I don't think it has anything to do with the size of the dataframe. The
attached code is just a simple sample to reproduce the issue, so the time cost
may not look terrible here. In our real case we have 202 dataframes (which all
happen to be empty) to unionAll, and it took over 20 minutes to complete, which
is obviously not acceptable.

To work around this issue we directly unioned the RDDs of those 202 dataframes
and converted the final RDD back to a dataframe at the end. That whole
workaround took only about 20 seconds to complete. Compared to the 20+ minutes
it takes to unionAll 202 empty dataframes, this is already a huge improvement.

I think there has to be something wrong with multiple dataframe unionAll calls,
or at least something we can improve here.


[jira] [Comment Edited] (SPARK-12691) Multiple unionAll on Dataframe seems to cause repeated calculations in a "Fibonacci" manner

2016-01-09 Thread Allen Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090896#comment-15090896
 ] 

Allen Liang edited comment on SPARK-12691 at 1/10/16 5:29 AM:
--

Hi Bo Meng,

I understand your point, but why would the size of the dataframe matter that
much here? The same thing doesn't happen when we do multiple unions on RDDs. If
you union all the RDDs underlying the dataframes in the sample code above,
you'll find that each round of RDD union takes a relatively consistent time
(NOT growing at all), which is expected.

And I don't think it has anything to do with the size of the dataframe. The
attached code is just a simple sample to reproduce the issue, so the time cost
may not look terrible here. In our real case we have 202 dataframes (which all
happen to be empty) to unionAll, and it took over 20 minutes to complete, which
is obviously not acceptable.

To work around this issue we directly unioned the RDDs of those 202 dataframes
and converted the final RDD back to a dataframe at the end. That whole
workaround took only about 20 seconds to complete. Compared to the 20+ minutes
it takes to unionAll 202 empty dataframes, this is already a huge improvement.

I think there has to be something wrong with multiple dataframe unionAll calls,
or at least something we can improve here.


was (Author: lliang):
Hi Bo Meng,

I understand your point, but why would the size of the dataframe matter here?
The same thing doesn't happen when we do multiple unions on RDDs. If you union
all the RDDs underlying the dataframes in the sample code above, you'll find
that each round of RDD union takes a relatively consistent time (NOT growing at
all), which is expected.

And I don't think it has anything to do with the size of the dataframe. The
attached code is just a simple sample to reproduce the issue, so the time cost
may not look terrible here. In our real case we have 202 dataframes (which all
happen to be empty) to unionAll, and it took over 20 minutes to complete, which
is obviously not acceptable.

To work around this issue we directly unioned the RDDs of those 202 dataframes
and converted the final RDD back to a dataframe at the end. That whole
workaround took only about 20 seconds to complete. Compared to the 20+ minutes
it takes to unionAll 202 empty dataframes, this is already a huge improvement.

I think there has to be something wrong with multiple dataframe unionAll calls,
or at least something we can improve here.


[jira] [Comment Edited] (SPARK-12691) Multiple unionAll on Dataframe seems to cause repeated calculations in a "Fibonacci" manner

2016-01-09 Thread Allen Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090896#comment-15090896
 ] 

Allen Liang edited comment on SPARK-12691 at 1/10/16 5:30 AM:
--

Hi Bo Meng,

I understand your point, but why would the size of the dataframe matter that
much here? The same thing doesn't happen when we do multiple unions on RDDs. If
you union all the RDDs underlying the dataframes in the sample code above,
you'll find that each round of RDD union takes a relatively consistent time
(NOT growing at all), which is expected.

The attached code is just a simple sample to reproduce the issue, so the time
cost may not look terrible here. However, in our real case we have 202
dataframes (which all happen to be empty) to unionAll, and it took over 20
minutes to complete, which is obviously not acceptable.

To work around this issue we directly unioned the RDDs of those 202 dataframes
and converted the final RDD back to a dataframe at the end. That whole
workaround took only about 20 seconds to complete. Compared to the 20+ minutes
it takes to unionAll 202 empty dataframes, this is already a huge improvement.

I think there has to be something wrong with multiple dataframe unionAll calls,
or at least something we can improve here.


was (Author: lliang):
Hi Bo Meng,

I understand your point, but why would the size of the dataframe matter that
much here? The same thing doesn't happen when we do multiple unions on RDDs. If
you union all the RDDs underlying the dataframes in the sample code above,
you'll find that each round of RDD union takes a relatively consistent time
(NOT growing at all), which is expected.

And I don't think it has anything to do with the size of the dataframe. The
attached code is just a simple sample to reproduce the issue, so the time cost
may not look terrible here. In our real case we have 202 dataframes (which all
happen to be empty) to unionAll, and it took over 20 minutes to complete, which
is obviously not acceptable.

To work around this issue we directly unioned the RDDs of those 202 dataframes
and converted the final RDD back to a dataframe at the end. That whole
workaround took only about 20 seconds to complete. Compared to the 20+ minutes
it takes to unionAll 202 empty dataframes, this is already a huge improvement.

I think there has to be something wrong with multiple dataframe unionAll calls,
or at least something we can improve here.


[jira] [Comment Edited] (SPARK-12691) Multiple unionAll on Dataframe seems to cause repeated calculations in a "Fibonacci" manner

2016-01-09 Thread Allen Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090896#comment-15090896
 ] 

Allen Liang edited comment on SPARK-12691 at 1/10/16 5:34 AM:
--

Hi Bo Meng,

I understand your point, but why would the size of the dataframe matter that
much here? The same thing doesn't happen when we do multiple unions on RDDs. If
you union all the RDDs underlying the dataframes in the sample code above,
you'll find that each round of RDD union takes a relatively constant time (NOT
growing at all), which is expected.

The attached code is just a simple sample to reproduce the issue, so the time
cost may not look terrible here. However, in our real case we have 202
dataframes (which all happen to be empty) to unionAll, and it took over 20
minutes to complete, which is obviously not acceptable.

To work around this issue we directly unioned the RDDs of those 202 dataframes
and converted the final RDD back to a dataframe at the end. That whole
workaround took only about 20 seconds to complete. Compared to the 20+ minutes
it takes to unionAll 202 empty dataframes, this is already a huge improvement.

I think there has to be something wrong with multiple dataframe unionAll calls,
or at least something we can improve here.


was (Author: lliang):
Hi Bo Meng,

I understand your point, but why would the size of the dataframe matter that
much here? The same thing doesn't happen when we do multiple unions on RDDs. If
you union all the RDDs underlying the dataframes in the sample code above,
you'll find that each round of RDD union takes a relatively consistent time
(NOT growing at all), which is expected.

The attached code is just a simple sample to reproduce the issue, so the time
cost may not look terrible here. However, in our real case we have 202
dataframes (which all happen to be empty) to unionAll, and it took over 20
minutes to complete, which is obviously not acceptable.

To work around this issue we directly unioned the RDDs of those 202 dataframes
and converted the final RDD back to a dataframe at the end. That whole
workaround took only about 20 seconds to complete. Compared to the 20+ minutes
it takes to unionAll 202 empty dataframes, this is already a huge improvement.

I think there has to be something wrong with multiple dataframe unionAll calls,
or at least something we can improve here.


[jira] [Comment Edited] (SPARK-12691) Multiple unionAll on Dataframe seems to cause repeated calculations in a "Fibonacci" manner

2016-01-09 Thread Allen Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090896#comment-15090896
 ] 

Allen Liang edited comment on SPARK-12691 at 1/10/16 5:37 AM:
--

Hi Bo Meng,

I understand your point, but why would the size of the dataframe matter that
much here? This behavior doesn't happen when we do the same thing with RDDs. If
you union all the RDDs underlying the dataframes in the sample code above,
you'll find that each round of RDD union takes a relatively constant time (NOT
growing at all), which is expected.

The attached code is just a simple sample to reproduce the issue, so the time
cost may not look terrible here. However, in our real case we have 202
dataframes (which all happen to be empty) to unionAll, and it took over 20
minutes to complete, which is obviously not acceptable.

To work around this issue we directly unioned the RDDs of those 202 dataframes
and converted the final RDD back to a dataframe at the end. That whole
workaround took only about 20 seconds to complete. Compared to the 20+ minutes
it takes to unionAll 202 empty dataframes, this is already a huge improvement.

I think there has to be something wrong with multiple dataframe unionAll calls,
or at least something we can improve here.


was (Author: lliang):
Hi Bo Meng,

I understand your point, but why would the size of the dataframe matter that
much here? This behavior doesn't happen when we do the same thing with RDDs. If
you union all the RDDs underlying the dataframes in the sample code above,
you'll find that each round of RDD union takes a relatively constant time (NOT
growing at all), which is expected.

The attached code is just a simple sample to reproduce the issue, so the time
cost may not look terrible here. However, in our real case we have 202
dataframes (which all happen to be empty) to unionAll, and it took over 20
minutes to complete, which is obviously not acceptable.

To work around this issue we directly unioned the RDDs of those 202 dataframes
and converted the final RDD back to a dataframe at the end. That whole
workaround took only about 20 seconds to complete. Compared to the 20+ minutes
it takes to unionAll 202 empty dataframes, this is already a huge improvement.

I think there has to be something wrong with multiple dataframe unionAll calls,
or at least something we can improve here.


[jira] [Comment Edited] (SPARK-12691) Multiple unionAll on Dataframe seems to cause repeated calculations in a "Fibonacci" manner

2016-01-09 Thread Allen Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090896#comment-15090896
 ] 

Allen Liang edited comment on SPARK-12691 at 1/10/16 5:37 AM:
--

Hi Bo Meng,

I understand your point, but why would the size of the dataframe matter that
much here? This behavior doesn't happen when we do the same thing with RDDs. If
you union all the RDDs underlying the dataframes in the sample code above,
you'll find that each round of RDD union takes a relatively constant time (NOT
growing at all), which is expected.

The attached code is just a simple sample to reproduce the issue, so the time
cost may not look terrible here. However, in our real case we have 202
dataframes (which all happen to be empty) to unionAll, and it took over 20
minutes to complete, which is obviously not acceptable.

To work around this issue we directly unioned the RDDs of those 202 dataframes
and converted the final RDD back to a dataframe at the end. That whole
workaround took only about 20 seconds to complete. Compared to the 20+ minutes
it takes to unionAll 202 empty dataframes, this is already a huge improvement.

I think there has to be something wrong with multiple dataframe unionAll calls,
or at least something we can improve here.


was (Author: lliang):
Hi Bo Meng,

I understand your point, but why would the size of the dataframe matter that
much here? The same thing doesn't happen when we do multiple unions on RDDs. If
you union all the RDDs underlying the dataframes in the sample code above,
you'll find that each round of RDD union takes a relatively constant time (NOT
growing at all), which is expected.

The attached code is just a simple sample to reproduce the issue, so the time
cost may not look terrible here. However, in our real case we have 202
dataframes (which all happen to be empty) to unionAll, and it took over 20
minutes to complete, which is obviously not acceptable.

To work around this issue we directly unioned the RDDs of those 202 dataframes
and converted the final RDD back to a dataframe at the end. That whole
workaround took only about 20 seconds to complete. Compared to the 20+ minutes
it takes to unionAll 202 empty dataframes, this is already a huge improvement.

I think there has to be something wrong with multiple dataframe unionAll calls,
or at least something we can improve here.


[jira] [Comment Edited] (SPARK-12691) Multiple unionAll on Dataframe seems to cause repeated calculations in a "Fibonacci" manner

2016-01-09 Thread Allen Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090896#comment-15090896
 ] 

Allen Liang edited comment on SPARK-12691 at 1/10/16 5:46 AM:
--

Hi Bo Meng,

I understand your point, but I don't think this has anything to do with the
size of the dataframe. Otherwise, how do you explain that this behavior doesn't
happen when we do the same thing with RDDs? If you union all the RDDs
underlying the dataframes in the sample code above, you'll find that each round
of RDD union takes a relatively constant time (NOT growing at all), which is
expected.

The attached code is just a simple sample to reproduce the issue, so the time
cost may not look terrible here. However, in our real case we have 202
dataframes (which all happen to be empty, i.e. their size is zero) to unionAll,
and it took over 20 minutes to complete, which is obviously not acceptable.

To work around this issue we directly unioned the RDDs of those 202 dataframes
and converted the final RDD back to a dataframe at the end. That whole
workaround took only about 20 seconds to complete. Compared to the 20+ minutes
it takes to unionAll 202 empty dataframes, this is already a huge improvement.

I think there has to be something wrong with multiple dataframe unionAll calls,
or at least something we can improve here.


was (Author: lliang):
Hi Bo Meng,

I understand your point, but why would the size of the dataframe matter that
much here? This behavior doesn't happen when we do the same thing with RDDs. If
you union all the RDDs underlying the dataframes in the sample code above,
you'll find that each round of RDD union takes a relatively constant time (NOT
growing at all), which is expected.

The attached code is just a simple sample to reproduce the issue, so the time
cost may not look terrible here. However, in our real case we have 202
dataframes (which all happen to be empty) to unionAll, and it took over 20
minutes to complete, which is obviously not acceptable.

To work around this issue we directly unioned the RDDs of those 202 dataframes
and converted the final RDD back to a dataframe at the end. That whole
workaround took only about 20 seconds to complete. Compared to the 20+ minutes
it takes to unionAll 202 empty dataframes, this is already a huge improvement.

I think there has to be something wrong with multiple dataframe unionAll calls,
or at least something we can improve here.


[jira] [Comment Edited] (SPARK-12691) Multiple unionAll on Dataframe seems to cause repeated calculations in a "Fibonacci" manner

2016-01-09 Thread Allen Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090896#comment-15090896
 ] 

Allen Liang edited comment on SPARK-12691 at 1/10/16 5:47 AM:
--

Hi Bo Meng,

I understand your point, but I don't think this has anything to do with the
size of the dataframe. Otherwise, how do you explain that this behavior doesn't
happen when we do the same thing with RDDs? If you union all the RDDs
underlying the dataframes in the sample code above, you'll find that each round
of RDD union takes a relatively constant time (NOT growing at all), which is
expected.

The attached code is just a simple sample to reproduce the issue, so the time
cost may not look terrible here. However, in our real case we have 202
dataframes (which all happen to be empty, i.e. their size is zero) to unionAll,
and it took over 20 minutes to complete, which is obviously not acceptable.

To work around this issue we directly unioned the RDDs of those 202 dataframes
and converted the final RDD back to a dataframe at the end. That whole
workaround took only about 20 seconds to complete. Compared to the 20+ minutes
it takes to unionAll 202 empty dataframes, this is already a huge improvement.

I think there has to be something wrong with multiple dataframe unionAll calls,
or at least something we can improve here.


was (Author: lliang):
Hi Bo Meng,

I understand your point, but I don't think this has anything to do with the
size of the dataframe. Otherwise, how do you explain that this behavior doesn't
happen when we do the same thing with RDDs? If you union all the RDDs
underlying the dataframes in the sample code above, you'll find that each round
of RDD union takes a relatively constant time (NOT growing at all), which is
expected.

The attached code is just a simple sample to reproduce the issue, so the time
cost may not look terrible here. However, in our real case we have 202
dataframes (which all happen to be empty, i.e. their size is zero) to unionAll,
and it took over 20 minutes to complete, which is obviously not acceptable.

To work around this issue we directly unioned the RDDs of those 202 dataframes
and converted the final RDD back to a dataframe at the end. That whole
workaround took only about 20 seconds to complete. Compared to the 20+ minutes
it takes to unionAll 202 empty dataframes, this is already a huge improvement.

I think there has to be something wrong with multiple dataframe unionAll calls,
or at least something we can improve here.


[jira] [Issue Comment Deleted] (SPARK-12691) Multiple unionAll on Dataframe seems to cause repeated calculations in a "Fibonacci" manner

2016-01-09 Thread Allen Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Liang updated SPARK-12691:

Comment: was deleted

(was: Hi Bo Meng,

I understand your point, but I don't think this has anything to do with the
size of the dataframe. Otherwise, how do you explain that this behavior doesn't
happen when we do the same thing with RDDs? If you union all the RDDs
underlying the dataframes in the sample code above, you'll find that each round
of RDD union takes a relatively constant time (NOT growing at all), which is
expected.

The attached code is just a simple sample to reproduce the issue, so the time
cost may not look terrible here. However, in our real case we have 202
dataframes (which all happen to be empty, i.e. their size is zero) to unionAll,
and it took over 20 minutes to complete, which is obviously not acceptable.

To work around this issue we directly unioned the RDDs of those 202 dataframes
and converted the final RDD back to a dataframe at the end. That whole
workaround took only about 20 seconds to complete. Compared to the 20+ minutes
it takes to unionAll 202 empty dataframes, this is already a huge improvement.

I think there has to be something wrong with multiple dataframe unionAll calls,
or at least something we can improve here.)

> Multiple unionAll on Dataframe seems to cause repeated calculations in a 
> "Fibonacci" manner
> ---
>
> Key: SPARK-12691
> URL: https://issues.apache.org/jira/browse/SPARK-12691
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1
> Environment: Tested in Spark 1.3 and 1.4.
>Reporter: Allen Liang
>
> Multiple unionAll on Dataframe seems to cause repeated calculations. Here is 
> the sample code to reproduce this issue.
> val dfs = for (i<-0 to 100) yield {
>   val df = sc.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
>   df
> }
> var i = 1
> val s1 = System.currentTimeMillis()
> dfs.reduce{(a,b)=>{
>   val t1 = System.currentTimeMillis()
>   val dd = a unionAll b
>   val t2 = System.currentTimeMillis()
>   println("Round " + i + " unionAll took " + (t2 - t1) + " ms")
>   i = i + 1
>   dd
>   }
> }
> val s2 = System.currentTimeMillis()
> println((i - 1) + " unionAll took totally " + (s2 - s1) + " ms")
> The output is shown below. As you can see, each unionAll appears to redo all 
> of the previous unionAll calls and therefore takes its own time plus all of 
> the previous time, which, loosely speaking, makes each unionAll look like a 
> "Fibonacci" action.
> BTW, this behaviour doesn't happen if I directly union all the RDDs in 
> Dataframes.
> - output start 
> Round 1 unionAll took 1 ms
> Round 2 unionAll took 1 ms
> Round 3 unionAll took 1 ms
> Round 4 unionAll took 1 ms
> Round 5 unionAll took 1 ms
> Round 6 unionAll took 1 ms
> Round 7 unionAll took 1 ms
> Round 8 unionAll took 2 ms
> Round 9 unionAll took 2 ms
> Round 10 unionAll took 2 ms
> Round 11 unionAll took 3 ms
> Round 12 unionAll took 3 ms
> Round 13 unionAll took 3 ms
> Round 14 unionAll took 3 ms
> Round 15 unionAll took 3 ms
> Round 16 unionAll took 4 ms
> Round 17 unionAll took 4 ms
> Round 18 unionAll took 4 ms
> Round 19 unionAll took 4 ms
> Round 20 unionAll took 4 ms
> Round 21 unionAll took 5 ms
> Round 22 unionAll took 5 ms
> Round 23 unionAll took 5 ms
> Round 24 unionAll took 5 ms
> Round 25 unionAll took 5 ms
> Round 26 unionAll took 6 ms
> Round 27 unionAll took 6 ms
> Round 28 unionAll took 6 ms
> Round 29 unionAll took 6 ms
> Round 30 unionAll took 6 ms
> Round 31 unionAll took 6 ms
> Round 32 unionAll took 7 ms
> Round 33 unionAll took 7 ms
> Round 34 unionAll took 7 ms
> Round 35 unionAll took 7 ms
> Round 36 unionAll took 7 ms
> Round 37 unionAll took 8 ms
> Round 38 unionAll took 8 ms
> Round 39 unionAll took 8 ms
> Round 40 unionAll took 8 ms
> Round 41 unionAll took 9 ms
> Round 42 unionAll took 9 ms
> Round 43 unionAll took 9 ms
> Round 44 unionAll took 9 ms
> Round 45 unionAll took 9 ms
> Round 46 unionAll took 9 ms
> Round 47 unionAll took 9 ms
> Round 48 unionAll took 9 ms
> Round 49 unionAll took 10 ms
> Round 50 unionAll took 10 ms
> Round 51 unionAll took 10 ms
> Round 52 unionAll took 10 ms
> Round 53 unionAll took 11 ms
> Round 54 unionAll took 11 ms
> Round 55 unionAll took 11 ms
> Round 56 unionAll took 12 ms
> Round 57 unionAll took 12 ms
> Round 58 unionAll took 12 ms
> Round 59 unionAll took 12 ms
> Round 60 unionAll took 12 ms
> Round 61 unionAll took 12 ms
> Round 62 unionAll took 13 ms
> Round 63 unionAll took 13 ms
> Round 64 unionAll took 13 ms
> Round 65 unionAll took 13 ms
> Round 66 unionAll took 14 ms
> Round 67 unionAll took 14 ms
> Round 68 unionAll took 14 ms
> Round 69 unionAll took 14 ms
> 

[jira] [Commented] (SPARK-12691) Multiple unionAll on Dataframe seems to cause repeated calculations in a "Fibonacci" manner

2016-01-09 Thread Allen Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090906#comment-15090906
 ] 

Allen Liang commented on SPARK-12691:
-

Hi Bo Meng,

I understand your point, but I don't think this has anything to do with the
size of the dataframe. Otherwise, how do you explain that this behavior doesn't
happen when we do the same thing with RDDs? If you union all the RDDs
underlying the dataframes in the sample code above, you'll find that each round
of RDD union takes a relatively constant time (NOT growing at all), which is
expected.

The attached code is just a simple sample to reproduce the issue, so the time
cost may not look terrible here. However, in our real case we have 202
dataframes (which all happen to be empty, i.e. their size is zero) to unionAll,
and it took over 20 minutes to complete, which is obviously not acceptable.

To work around this issue we directly unioned the RDDs of those 202 dataframes
and converted the final RDD back to a dataframe at the end. That whole
workaround took only about 20 seconds to complete. Compared to the 20+ minutes
it takes to unionAll 202 empty dataframes, this is already a huge improvement.

I think there has to be something wrong with multiple dataframe unionAll calls,
or at least something we can improve here.
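
One way to check whether the driver time tracks the size of the accumulated
logical plan rather than the size of the data is to log the plan size after
each unionAll. A rough sketch, assuming Spark 1.x where queryExecution is
exposed on DataFrame as a developer API and dfs is the collection from the
sample code:

```
// Rough instrumentation sketch: time each unionAll and count the nodes in the
// analyzed logical plan as it accumulates (queryExecution is a developer API).
var acc = dfs.head
dfs.tail.zipWithIndex.foreach { case (df, idx) =>
  val t0 = System.currentTimeMillis()
  acc = acc.unionAll(df)
  val elapsed = System.currentTimeMillis() - t0
  val planNodes = acc.queryExecution.analyzed.collect { case n => n }.size
  println(s"step ${idx + 1}: unionAll took $elapsed ms, " +
    s"analyzed plan has $planNodes nodes")
}
```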


[jira] [Issue Comment Deleted] (SPARK-12691) Multiple unionAll on Dataframe seems to cause repeated calculations in a "Fibonacci" manner

2016-01-09 Thread Allen Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Liang updated SPARK-12691:

Comment: was deleted

(was: Hi Bo Meng,

I understand your point, but I don't think this has anything to do with the
size of the dataframe. Otherwise, how do you explain that this behavior doesn't
happen when we do the same thing with RDDs? If you union all the RDDs
underlying the dataframes in the sample code above, you'll find that each round
of RDD union takes a relatively constant time (NOT growing at all), which is
expected.

The attached code is just a simple sample to reproduce the issue, so the time
cost may not look terrible here. However, in our real case we have 202
dataframes (which all happen to be empty, i.e. their size is zero) to unionAll,
and it took over 20 minutes to complete, which is obviously not acceptable.

To work around this issue we directly unioned the RDDs of those 202 dataframes
and converted the final RDD back to a dataframe at the end. That whole
workaround took only about 20 seconds to complete. Compared to the 20+ minutes
it takes to unionAll 202 empty dataframes, this is already a huge improvement.

I think there has to be something wrong with multiple dataframe unionAll calls,
or at least something we can improve here.)


[jira] [Commented] (SPARK-12691) Multiple unionAll on Dataframe seems to cause repeated calculations in a "Fibonacci" manner

2016-01-09 Thread Allen Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090908#comment-15090908
 ] 

Allen Liang commented on SPARK-12691:
-

Hi Bo Meng,

I understand your point, but I don't think this has anything to do with the
size of the dataframe. Otherwise, how do you explain that this behavior doesn't
happen when we do the same thing with RDDs? If you union all the RDDs
underlying the dataframes in the sample code above, you'll find that each round
of RDD union takes a relatively constant time (NOT growing at all), which is
expected.

The attached code is just a simple sample to reproduce the issue, so the time
cost may not look terrible here. However, in our real case we have 202
dataframes (which all happen to be empty, i.e. their size is zero) to unionAll,
and it took over 20 minutes to complete, which is obviously not acceptable.

To work around this issue we directly unioned the RDDs of those 202 dataframes
and converted the final RDD back to a dataframe at the end. That whole
workaround took only about 20 seconds to complete. Compared to the 20+ minutes
it takes to unionAll 202 empty dataframes, this is already a huge improvement.

I think there has to be something wrong with multiple dataframe unionAll calls,
or at least something we can improve here.
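
To make the RDD-versus-DataFrame comparison in this comment concrete, here is a
small timing sketch (Spark 1.x shell; timeFold is an illustrative helper, and
dfs, sc and sqlContext are assumed to be in scope as in the sample code). Per
the reporter's observation, the per-step time of the first fold grows while the
second stays roughly flat:

```
// Illustrative timing sketch: fold the same collection once with
// DataFrame.unionAll and once with RDD.union, printing per-step driver time.
def timeFold[T](items: Seq[T], label: String)(combine: (T, T) => T): T = {
  var acc = items.head
  items.tail.zipWithIndex.foreach { case (x, i) =>
    val t0 = System.currentTimeMillis()
    acc = combine(acc, x)
    println(s"$label step ${i + 1}: ${System.currentTimeMillis() - t0} ms")
  }
  acc
}

// DataFrame path: per-step time grows as the logical plan accumulates.
val unionedDf = timeFold(dfs, "unionAll")((a, b) => a.unionAll(b))

// RDD path: per-step time stays roughly constant; convert back at the end.
val unionedRdd = timeFold(dfs.map(_.rdd), "rdd union")((a, b) => a.union(b))
val backToDf = sqlContext.createDataFrame(unionedRdd, dfs.head.schema)
```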


[jira] [Commented] (SPARK-12612) Add missing Hadoop profiles to dev/run-tests-*.py scripts

2016-01-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090913#comment-15090913
 ] 

Apache Spark commented on SPARK-12612:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10679

> Add missing Hadoop profiles to dev/run-tests-*.py scripts
> -
>
> Key: SPARK-12612
> URL: https://issues.apache.org/jira/browse/SPARK-12612
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> There are a couple of places in the dev/run-tests-*.py scripts which deal 
> with Hadoop profiles, but the set of profiles that they handle does not 
> include all Hadoop profiles defined in our POM.
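
A purely illustrative sketch (in Scala, to match the other code in this 
thread) of the kind of gap described: the scripts recognise a fixed set of 
Hadoop profile names, and any profile defined in the POM but missing from that 
set is unsupported. The profile names and flags below are assumptions, not the 
actual contents of dev/run-tests-*.py:

object HadoopProfiles {
  // Profiles the test scripts know how to translate into build flags.
  val buildFlags: Map[String, Seq[String]] = Map(
    "hadoop2.2" -> Seq("-Phadoop-2.2"),
    "hadoop2.3" -> Seq("-Phadoop-2.3"),
    "hadoop2.6" -> Seq("-Phadoop-2.6")
    // Profiles present in the POM but absent from this map are exactly the
    // gap the issue describes.
  )

  def flagsFor(profile: String): Seq[String] =
    buildFlags.getOrElse(profile,
      sys.error(s"Unknown Hadoop profile: $profile; add it to buildFlags"))
}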



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10359) Enumerate Spark's dependencies in a file and diff against it for new pull requests

2016-01-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090917#comment-15090917
 ] 

Apache Spark commented on SPARK-10359:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10679

> Enumerate Spark's dependencies in a file and diff against it for new pull 
> requests 
> ---
>
> Key: SPARK-10359
> URL: https://issues.apache.org/jira/browse/SPARK-10359
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, Project Infra
>Reporter: Patrick Wendell
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> Sometimes when we have dependency changes it can be pretty unclear what 
> transitive set of things are changing. If we enumerate all of the 
> dependencies and put them in a source file in the repo, we can make it so 
> that it is very explicit what is changing.
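
For illustration, here is a hypothetical sketch (in Scala, to match the other 
code in this thread) of the idea: keep the resolved dependency list in a file 
in the repo and fail the PR build when a freshly generated list differs. The 
file name and the way the generated list is produced are assumptions, not the 
actual Spark tooling:

import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

object DependencyManifestCheck {
  def main(args: Array[String]): Unit = {
    // Checked-in manifest of resolved dependencies (hypothetical path).
    val expected =
      Files.readAllLines(Paths.get("dev/deps/spark-deps")).asScala.toSet
    // Freshly generated list for this PR, passed on the command line; in a
    // real build it would come from the dependency resolver.
    val generated = Files.readAllLines(Paths.get(args(0))).asScala.toSet

    val added = generated -- expected
    val removed = expected -- generated
    if (added.nonEmpty || removed.nonEmpty) {
      added.foreach(d => println(s"+ $d"))
      removed.foreach(d => println(s"- $d"))
      sys.exit(1)  // make the dependency change explicit in the PR build
    }
  }
}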



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10359) Enumerate Spark's dependencies in a file and diff against it for new pull requests

2016-01-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090922#comment-15090922
 ] 

Apache Spark commented on SPARK-10359:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10680

> Enumerate Spark's dependencies in a file and diff against it for new pull 
> requests 
> ---
>
> Key: SPARK-10359
> URL: https://issues.apache.org/jira/browse/SPARK-10359
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, Project Infra
>Reporter: Patrick Wendell
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> Sometimes when we have dependency changes it can be pretty unclear what 
> transitive set of things are changing. If we enumerate all of the 
> dependencies and put them in a source file in the repo, we can make it so 
> that it is very explicit what is changing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12543) Support subquery in select/where/having

2016-01-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-12543:
--

Assignee: Davies Liu

> Support subquery in select/where/having
> ---
>
> Key: SPARK-12543
> URL: https://issues.apache.org/jira/browse/SPARK-12543
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org