[jira] [Created] (SPARK-18171) Show correct framework address in mesos master web ui when the advertised address is used

2016-10-30 Thread Shuai Lin (JIRA)
Shuai Lin created SPARK-18171:
-

 Summary: Show correct framework address in mesos master web ui 
when the advertised address is used
 Key: SPARK-18171
 URL: https://issues.apache.org/jira/browse/SPARK-18171
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Shuai Lin
Priority: Minor


In INF-4563 we added support for the driver to advertise to the executors a 
hostname/IP ({{spark.driver.host}}) different from the hostname/IP the driver 
actually binds to ({{spark.driver.bindAddress}}). But the Mesos web UI's 
frameworks page still shows the driver's bind hostname/IP (though the web UI 
link itself is correct). We should fix it to make them consistent.
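
For context, a minimal sketch of how a driver might set the two properties involved (hypothetical addresses, assuming the usual {{SparkSession}} builder API):
{code}
// Sketch only: hypothetical addresses, to illustrate the two settings discussed above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("advertise-address-demo")
  .config("spark.driver.bindAddress", "0.0.0.0")      // address the driver actually binds to
  .config("spark.driver.host", "driver.example.com")  // address advertised to executors
  .getOrCreate()
{code}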






[jira] [Updated] (SPARK-18171) Show correct framework address in mesos master web ui when the advertised address is used

2016-10-30 Thread Shuai Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuai Lin updated SPARK-18171:
--
Description: In [[SPARK-4563]] we added support for the driver to advertise to 
the executors a hostname/IP ({{spark.driver.host}}) different from the hostname/IP 
the driver actually binds to ({{spark.driver.bindAddress}}). But the Mesos web UI's 
frameworks page still shows the driver's bind hostname/IP (though the web UI link 
itself is correct). We should fix it to make them consistent.  (was: In INF-4563 we 
added support for the driver to advertise to the executors a hostname/IP 
({{spark.driver.host}}) different from the hostname/IP the driver actually binds to 
({{spark.driver.bindAddress}}). But the Mesos web UI's frameworks page still shows 
the driver's bind hostname/IP (though the web UI link itself is correct). We should 
fix it to make them consistent.)

> Show correct framework address in mesos master web ui when the advertised 
> address is used
> -
>
> Key: SPARK-18171
> URL: https://issues.apache.org/jira/browse/SPARK-18171
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Shuai Lin
>Priority: Minor
>
> In [[SPARK-4563]] we added support for the driver to advertise to the executors 
> a hostname/IP ({{spark.driver.host}}) different from the hostname/IP the driver 
> actually binds to ({{spark.driver.bindAddress}}). But the Mesos web UI's 
> frameworks page still shows the driver's bind hostname/IP (though the web UI 
> link itself is correct). We should fix it to make them consistent.






[jira] [Commented] (SPARK-18171) Show correct framework address in mesos master web ui when the advertised address is used

2016-10-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15619520#comment-15619520
 ] 

Apache Spark commented on SPARK-18171:
--

User 'lins05' has created a pull request for this issue:
https://github.com/apache/spark/pull/15684

> Show correct framework address in mesos master web ui when the advertised 
> address is used
> -
>
> Key: SPARK-18171
> URL: https://issues.apache.org/jira/browse/SPARK-18171
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Shuai Lin
>Priority: Minor
>
> In [[SPARK-4563]] we added support for the driver to advertise to the executors 
> a hostname/IP ({{spark.driver.host}}) different from the hostname/IP the driver 
> actually binds to ({{spark.driver.bindAddress}}). But the Mesos web UI's 
> frameworks page still shows the driver's bind hostname/IP (though the web UI 
> link itself is correct). We should fix it to make them consistent.






[jira] [Assigned] (SPARK-18171) Show correct framework address in mesos master web ui when the advertised address is used

2016-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18171:


Assignee: (was: Apache Spark)

> Show correct framework address in mesos master web ui when the advertised 
> address is used
> -
>
> Key: SPARK-18171
> URL: https://issues.apache.org/jira/browse/SPARK-18171
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Shuai Lin
>Priority: Minor
>
> In [[SPARK-4563]] we added support for the driver to advertise to the executors 
> a hostname/IP ({{spark.driver.host}}) different from the hostname/IP the driver 
> actually binds to ({{spark.driver.bindAddress}}). But the Mesos web UI's 
> frameworks page still shows the driver's bind hostname/IP (though the web UI 
> link itself is correct). We should fix it to make them consistent.






[jira] [Assigned] (SPARK-18171) Show correct framework address in mesos master web ui when the advertised address is used

2016-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18171:


Assignee: Apache Spark

> Show correct framework address in mesos master web ui when the advertised 
> address is used
> -
>
> Key: SPARK-18171
> URL: https://issues.apache.org/jira/browse/SPARK-18171
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Shuai Lin
>Assignee: Apache Spark
>Priority: Minor
>
> In [[SPARK-4563]] we added support for the driver to advertise to the executors 
> a hostname/IP ({{spark.driver.host}}) different from the hostname/IP the driver 
> actually binds to ({{spark.driver.bindAddress}}). But the Mesos web UI's 
> frameworks page still shows the driver's bind hostname/IP (though the web UI 
> link itself is correct). We should fix it to make them consistent.






[jira] [Resolved] (SPARK-18162) SparkEnv.get.metricsSystem in spark-shell results in error: missing or invalid dependency detected while loading class file 'MetricsSystem.class'

2016-10-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18162.
---
Resolution: Not A Problem

> SparkEnv.get.metricsSystem in spark-shell results in error: missing or 
> invalid dependency detected while loading class file 'MetricsSystem.class'
> -
>
> Key: SPARK-18162
> URL: https://issues.apache.org/jira/browse/SPARK-18162
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> This is with the build today from master.
> {code}
> $ ./bin/spark-shell --version
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.1.0-SNAPSHOT
>       /_/
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_102
> Branch master
> Compiled by user jacek on 2016-10-28T04:05:11Z
> Revision ab5f938bc7c3c9b137d63e479fced2b7e9c9d75b
> Url https://github.com/apache/spark.git
> Type --help for more information.
> $ ./bin/spark-shell
> scala> SparkEnv.get.metricsSystem
> error: missing or invalid dependency detected while loading class file 
> 'MetricsSystem.class'.
> Could not access term eclipse in package org,
> because it (or its dependencies) are missing. Check your build definition for
> missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see 
> the problematic classpath.)
> A full rebuild may help if 'MetricsSystem.class' was compiled against an 
> incompatible version of org.
> error: missing or invalid dependency detected while loading class file 
> 'MetricsSystem.class'.
> Could not access term jetty in value org.eclipse,
> because it (or its dependencies) are missing. Check your build definition for
> missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see 
> the problematic classpath.)
> A full rebuild may help if 'MetricsSystem.class' was compiled against an 
> incompatible version of org.eclipse.
> scala> spark.version
> res3: String = 2.1.0-SNAPSHOT
> {code}
> I could not find any information about how to set it up in the [official 
> documentation|http://spark.apache.org/docs/latest/monitoring.html#metrics].






[jira] [Updated] (SPARK-18170) Confusing error message when using rangeBetween without specifying an "orderBy"

2016-10-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18170:
--
Issue Type: Improvement  (was: Bug)

> Confusing error message when using rangeBetween without specifying an 
> "orderBy"
> ---
>
> Key: SPARK-18170
> URL: https://issues.apache.org/jira/browse/SPARK-18170
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Weiluo Ren
>Priority: Minor
>
> {code}
> spark.range(1,3).select(sum('id) over Window.rangeBetween(0,1)).show
> {code}
> throws runtime exception:
> {code}
> Non-Zero range offsets are not supported for windows with multiple order 
> expressions.
> {code}
> which is confusing in this case because we don't have any order expression 
> here.
> How about adding a check on
> {code}
> orderSpec.isEmpty
> {code}
> at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala#L141
> and throwing an exception saying "no order expression is specified"?
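
For comparison, a minimal sketch (assuming the spark-shell {{spark}} session) of the same query with an explicit order expression, which avoids the misleading message:
{code}
// Sketch: adding an orderBy to the window specification.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

spark.range(1, 3)
  .select(sum("id").over(Window.orderBy("id").rangeBetween(0, 1)))
  .show()
{code}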






[jira] [Resolved] (SPARK-3261) KMeans clusterer can return duplicate cluster centers

2016-10-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3261.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15450
[https://github.com/apache/spark/pull/15450]

> KMeans clusterer can return duplicate cluster centers
> -
>
> Key: SPARK-3261
> URL: https://issues.apache.org/jira/browse/SPARK-3261
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Derrick Burns
>Priority: Minor
>  Labels: clustering
> Fix For: 2.1.0
>
>
> This is a bad design choice.  I think that it is preferable to produce no 
> duplicate cluster centers. So instead of forcing the number of clusters to be 
> K, return at most K clusters.
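
A quick way to observe the behavior described above (a sketch with toy data, assuming the spark-shell {{sc}} context; not taken from the ticket):
{code}
// Sketch: with k larger than the number of distinct points, duplicate centers could appear
// before this fix; after it, fewer than k (distinct) centers may be returned.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0),
  Vectors.dense(0.0, 0.0),
  Vectors.dense(1.0, 1.0)
))
val model = KMeans.train(data, 3, 10)
println(model.clusterCenters.distinct.length)
{code}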






[jira] [Created] (SPARK-18172) AnalysisException in first/last during aggregation

2016-10-30 Thread Emlyn Corrin (JIRA)
Emlyn Corrin created SPARK-18172:


 Summary: AnalysisException in first/last during aggregation
 Key: SPARK-18172
 URL: https://issues.apache.org/jira/browse/SPARK-18172
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.1
Reporter: Emlyn Corrin


Since Spark 2.0.1, the following PySpark snippet fails with 
{{AnalysisException: The second argument of First should be a boolean literal}} 
(it is not restricted to Python; similar code in Java fails in the same way).
It worked in Spark 2.0.0, so I believe it may be related to the fix for 
SPARK-16648.
{code}
from pyspark.sql import functions as F
ds = spark.createDataFrame(sc.parallelize([[1, 1, 2], [1, 2, 3], [1, 3, 4]]))
ds.groupBy(ds._1).agg(F.first(ds._2), F.countDistinct(ds._2), 
F.countDistinct(ds._2, ds._3)).show()
{code}
It works if any of the three arguments to {{.agg}} is removed.

The stack trace is:
{code}
Py4JJavaError Traceback (most recent call last)
 in ()
> 1 
ds.groupBy(ds._1).agg(F.first(ds._2),F.countDistinct(ds._2),F.countDistinct(ds._2,
 ds._3)).show()

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/dataframe.py in 
show(self, n, truncate)
285 +---+-+
286 """
--> 287 print(self._jdf.showString(n, truncate))
288
289 def __repr__(self):

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py
 in __call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134
   1135 for temp_arg in temp_args:

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py in 
deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py
 in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(

Py4JJavaError: An error occurred while calling o76.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
tree: first(_2#1L)()
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:256)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$patchAggregateFunctionChildren$1(RewriteDistinctAggregates.scala:140)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:182)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:104)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfu

[jira] [Comment Edited] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException

2016-10-30 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15613656#comment-15613656
 ] 

Emlyn Corrin edited comment on SPARK-16648 at 10/30/16 10:07 AM:
-

Edit: I've opened a new issue for this at SPARK-18172.

Since Spark 2.0.1, the following PySpark snippet fails (I believe it worked 
under 2.0.0, so this issue seems like the most likely cause of the change in 
behaviour):
{code}
from pyspark.sql import functions as F
ds = spark.createDataFrame(sc.parallelize([[1, 1, 2], [1, 2, 3], [1, 3, 4]]))
ds.groupBy(ds._1).agg(F.first(ds._2), F.countDistinct(ds._2), 
F.countDistinct(ds._2, ds._3)).show()
{code}
It works if any of the three arguments to {{.agg}} is removed.

The stack trace is:
{code}
Py4JJavaError Traceback (most recent call last)
 in ()
> 1 
ds.groupBy(ds._1).agg(F.first(ds._2),F.countDistinct(ds._2),F.countDistinct(ds._2,
 ds._3)).show()

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/dataframe.py in 
show(self, n, truncate)
285 +---+-+
286 """
--> 287 print(self._jdf.showString(n, truncate))
288
289 def __repr__(self):

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py
 in __call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134
   1135 for temp_arg in temp_args:

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py in 
deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py
 in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(

Py4JJavaError: An error occurred while calling o76.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
tree: first(_2#1L)()
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:256)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$patchAggregateFunctionChildren$1(RewriteDistinctAggregates.scala:140)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:182)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:104)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql

[jira] [Updated] (SPARK-18146) Avoid using Union to chain together create table and repair partition commands

2016-10-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-18146:

Assignee: Eric Liang

> Avoid using Union to chain together create table and repair partition commands
> --
>
> Key: SPARK-18146
> URL: https://issues.apache.org/jira/browse/SPARK-18146
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Minor
>
> The behavior of union is not well defined here. We should add an internal 
> command to execute these commands sequentially.
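
For context, the two steps being chained (a sketch with hypothetical table and path names; exact DDL varies by Spark version), which the change runs one after the other instead of wrapping in a Union:
{code}
// Sketch only: create a partitioned data source table, then discover existing partitions.
spark.sql(
  "CREATE TABLE logs (event STRING, ds STRING) USING parquet " +
  "OPTIONS (path '/tmp/logs') PARTITIONED BY (ds)")
spark.sql("MSCK REPAIR TABLE logs")  // pick up partition directories already at the path
{code}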






[jira] [Resolved] (SPARK-18146) Avoid using Union to chain together create table and repair partition commands

2016-10-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18146.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15665
[https://github.com/apache/spark/pull/15665]

> Avoid using Union to chain together create table and repair partition commands
> --
>
> Key: SPARK-18146
> URL: https://issues.apache.org/jira/browse/SPARK-18146
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> The behavior of union is not well defined here. We should add an internal 
> command to execute these commands sequentially.






[jira] [Commented] (SPARK-16522) [MESOS] Spark application throws exception on exit

2016-10-30 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15620225#comment-15620225
 ] 

Harish commented on SPARK-16522:


I am getting the same error in the Spark 2.0.2 snapshot (standalone submission).
 py4j.protocol.Py4JJavaError: An error occurred while calling o37785.count.
: org.apache.spark.SparkException: Exception thrown in awaitResult: 
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
at 
org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at 
org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at 
org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:30)
at 
org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:62)
at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.consume(BroadcastHashJoinExec.scala:38)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:232)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at 
org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:30)
at 
org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:62)
at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at 
org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:79)
at 
org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:194)
at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at 
org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218)
at 
org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at 
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:218)
at 
org.apache.spark.sql.execution.FilterExec.doProduce(basicPhysicalOperators.scala:113)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at 
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.FilterExec.produce(basicPhysicalOperators.scala:79)
at 
org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40)
at 
org.apache

[jira] [Resolved] (SPARK-18043) Java example for Broadcasting

2016-10-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18043.
---
Resolution: Not A Problem

> Java example for Broadcasting
> -
>
> Key: SPARK-18043
> URL: https://issues.apache.org/jira/browse/SPARK-18043
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Reporter: Akash Sethi
>Priority: Minor
> Attachments: JavaBroadcastTest.java
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> I have created a Java example for broadcasting, similar to the existing Scala 
> one, and I would like to contribute the code for it.
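
For reference, the Scala broadcast usage such an example would mirror (a minimal sketch, assuming the spark-shell {{sc}} context):
{code}
// Sketch: basic broadcast-variable usage in Scala.
val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))
val mapped = sc.parallelize(Seq(1, 2, 1)).map(id => lookup.value.getOrElse(id, "?"))
mapped.collect()  // Array(a, b, a)
{code}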






[jira] [Created] (SPARK-18173) data source tables should support truncating partition

2016-10-30 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-18173:
---

 Summary: data source tables should support truncating partition
 Key: SPARK-18173
 URL: https://issues.apache.org/jira/browse/SPARK-18173
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan
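
For illustration only, the kind of statement this feature would enable for data source tables (hypothetical table and partition names; exact syntax is up to the eventual change):
{code}
// Sketch only: truncating a single partition rather than the whole table.
spark.sql("TRUNCATE TABLE logs PARTITION (ds = '2016-10-30')")
{code}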









[jira] [Assigned] (SPARK-18173) data source tables should support truncating partition

2016-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18173:


Assignee: Apache Spark  (was: Wenchen Fan)

> data source tables should support truncating partition
> --
>
> Key: SPARK-18173
> URL: https://issues.apache.org/jira/browse/SPARK-18173
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Assigned] (SPARK-18173) data source tables should support truncating partition

2016-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18173:


Assignee: Wenchen Fan  (was: Apache Spark)

> data source tables should support truncating partition
> --
>
> Key: SPARK-18173
> URL: https://issues.apache.org/jira/browse/SPARK-18173
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Commented] (SPARK-18173) data source tables should support truncating partition

2016-10-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15620290#comment-15620290
 ] 

Apache Spark commented on SPARK-18173:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/15688

> data source tables should support truncating partition
> --
>
> Key: SPARK-18173
> URL: https://issues.apache.org/jira/browse/SPARK-18173
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-30 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15620392#comment-15620392
 ] 

Saikat Kanjilal commented on SPARK-9487:


PR attached here: https://github.com/apache/spark/pull/15689
I only changed everything to local[4] in core and ran the unit tests; all unit 
tests ran successfully.

This is a WIP, so once folks have reviewed this initial request and signed off, 
I will start changing the Python pieces.

[~holdenk] [~sowen] let me know the next steps.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLlib, and other 
> components. If the operation depends on partition IDs, e.g., a random number 
> generator, this will lead to different results in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.
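
For concreteness, the kind of one-line change this involves (a sketch of a typical test-suite SparkContext setup, not a specific file from the PR):
{code}
// Sketch: a Scala test suite switching its local master from 2 to 4 worker threads.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[4]")   // previously "local[2]" (or "local") in some suites
  .setAppName("test")
val sc = new SparkContext(conf)
{code}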






[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15620402#comment-15620402
 ] 

Sean Owen commented on SPARK-9487:
--

Again have a look at 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark -- you 
need to update your PR title to link it.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLlib, and other 
> components. If the operation depends on partition IDs, e.g., a random number 
> generator, this will lead to different results in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.






[jira] [Updated] (SPARK-18170) Confusing error message when using rangeBetween without specifying an "orderBy"

2016-10-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18170:
--
Target Version/s:   (was: 2.0.1)

> Confusing error message when using rangeBetween without specifying an 
> "orderBy"
> ---
>
> Key: SPARK-18170
> URL: https://issues.apache.org/jira/browse/SPARK-18170
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Weiluo Ren
>Priority: Minor
>
> {code}
> spark.range(1,3).select(sum('id) over Window.rangeBetween(0,1)).show
> {code}
> throws runtime exception:
> {code}
> Non-Zero range offsets are not supported for windows with multiple order 
> expressions.
> {code}
> which is confusing in this case because we don't have any order expression 
> here.
> How about adding a check on
> {code}
> orderSpec.isEmpty
> {code}
> at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala#L141
> and throwing an exception saying "no order expression is specified"?






[jira] [Assigned] (SPARK-3261) KMeans clusterer can return duplicate cluster centers

2016-10-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-3261:


Assignee: Sean Owen

> KMeans clusterer can return duplicate cluster centers
> -
>
> Key: SPARK-3261
> URL: https://issues.apache.org/jira/browse/SPARK-3261
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Derrick Burns
>Assignee: Sean Owen
>Priority: Minor
>  Labels: clustering
> Fix For: 2.1.0
>
>
> This is a bad design choice.  I think that it is preferable to produce no 
> duplicate cluster centers. So instead of forcing the number of clusters to be 
> K, return at most K clusters.






[jira] [Created] (SPARK-18174) Avoid Implicit Type Cast in Arguments of Expressions Extending String2StringExpression

2016-10-30 Thread Xiao Li (JIRA)
Xiao Li created SPARK-18174:
---

 Summary: Avoid Implicit Type Cast in Arguments of Expressions 
Extending String2StringExpression
 Key: SPARK-18174
 URL: https://issues.apache.org/jira/browse/SPARK-18174
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
Reporter: Xiao Li


For the expressions that extend String2StringExpression (lower, upper, ltrim, 
rtrim, trim and reverse), the Analyzer should not implicitly cast the arguments 
to string. If users pass other data types instead of string, we should issue 
an exception for this misuse.
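
For illustration, the implicit cast in question (a sketch; today such a call is reportedly accepted by casting the argument, which this ticket proposes to reject instead):
{code}
// Sketch: an integer argument is implicitly cast to string today.
spark.sql("SELECT upper(123)").show()   // currently yields "123" via an implicit cast
{code}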






[jira] [Assigned] (SPARK-18174) Avoid Implicit Type Cast in Arguments of Expressions Extending String2StringExpression

2016-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18174:


Assignee: Apache Spark

> Avoid Implicit Type Cast in Arguments of Expressions Extending 
> String2StringExpression
> --
>
> Key: SPARK-18174
> URL: https://issues.apache.org/jira/browse/SPARK-18174
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> For the expressions that extend String2StringExpression (lower, upper, ltrim, 
> rtrim, trim and reverse), the Analyzer should not implicitly cast the arguments 
> to string. If users pass other data types instead of string, we should issue 
> an exception for this misuse.






[jira] [Assigned] (SPARK-18174) Avoid Implicit Type Cast in Arguments of Expressions Extending String2StringExpression

2016-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18174:


Assignee: (was: Apache Spark)

> Avoid Implicit Type Cast in Arguments of Expressions Extending 
> String2StringExpression
> --
>
> Key: SPARK-18174
> URL: https://issues.apache.org/jira/browse/SPARK-18174
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiao Li
>
> For the expressions that extend String2StringExpression (lower, upper, ltrim, 
> rtrim, trim and reverse), the Analyzer should not implicitly cast the arguments 
> to string. If users pass other data types instead of string, we should issue 
> an exception for this misuse.






[jira] [Commented] (SPARK-18174) Avoid Implicit Type Cast in Arguments of Expressions Extending String2StringExpression

2016-10-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15620431#comment-15620431
 ] 

Apache Spark commented on SPARK-18174:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/15690

> Avoid Implicit Type Cast in Arguments of Expressions Extending 
> String2StringExpression
> --
>
> Key: SPARK-18174
> URL: https://issues.apache.org/jira/browse/SPARK-18174
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiao Li
>
> For the expressions that extend String2StringExpression (lower, upper, ltrim, 
> rtrim, trim and reverse), the Analyzer should not implicitly cast the arguments 
> to string. If users pass other data types instead of string, we should issue 
> an exception for this misuse.






[jira] [Resolved] (SPARK-18103) Rename *FileCatalog to *FileProvider

2016-10-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18103.
-
   Resolution: Fixed
 Assignee: Eric Liang
Fix Version/s: 2.1.0

> Rename *FileCatalog to *FileProvider
> 
>
> Key: SPARK-18103
> URL: https://issues.apache.org/jira/browse/SPARK-18103
> Project: Spark
>  Issue Type: Improvement
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> In the SQL component there are too many different components called some 
> variant of *Catalog, which is quite confusing. We should rename the 
> subclasses of FileCatalog to avoid this.






[jira] [Commented] (SPARK-17791) Join reordering using star schema detection

2016-10-30 Thread Ioana Delaney (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15620517#comment-15620517
 ] 

Ioana Delaney commented on SPARK-17791:
---

[~ron8hu] I appreciate your comment. Thank you. I agree that the algorithm will 
have to evolve as CBO introduces new features such as cardinality, predicate 
selectivity, and ultimately the cost-based planning itself. The current 
proposal is conservative in choosing a star plan and can be made even more 
conservative. I can look at what CBO implements today for the number of 
distinct values and base table cardinality as suggested by [~wangzhenhua]. A 
check for pseudo RI using these two estimates can be easily incorporated into 
our current star-schema detection. 

The algorithm is also disabled by default. We can keep it disabled until we 
have a tighter integration with CBO. But there are advantages in letting the 
code in before CBO is completely implemented. From an implementation point of 
view, this will allow us to incrementally deliver our work. Then, given its 
good performance results, the feature can be enabled on demand for warehouse 
workloads that can take advantage of star join planning.

Thank you.


> Join reordering using star schema detection
> ---
>
> Key: SPARK-17791
> URL: https://issues.apache.org/jira/browse/SPARK-17791
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ioana Delaney
>Assignee: Ioana Delaney
>Priority: Critical
> Attachments: StarJoinReordering1005.doc
>
>
> This JIRA is a sub-task of SPARK-17626.
> The objective is to provide a consistent performance improvement for star 
> schema queries. Star schema consists of one or more fact tables referencing a 
> number of dimension tables. In general, queries against star schema are 
> expected to run fast  because of the established RI constraints among the 
> tables. This design proposes a join reordering based on natural, generally 
> accepted heuristics for star schema queries:
> * Finds the star join with the largest fact table and places it on the 
> driving arm of the left-deep join. This plan avoids large tables on the 
> inner, and thus favors hash joins. 
> * Applies the most selective dimensions early in the plan to reduce the 
> amount of data flow.
> The design description is included in the below attached document.
> \\
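
For illustration only, the shape of query these heuristics target (hypothetical fact and dimension tables, not taken from the design document):
{code}
// Sketch: a star-schema join, one large fact table joined to small dimension tables.
// The tables are hypothetical; only the query shape matters here.
val q = spark.sql("""
  SELECT d.d_year, SUM(f.net_paid)
  FROM sales f
  JOIN date_dim d ON f.date_sk = d.d_date_sk
  JOIN store s    ON f.store_sk = s.s_store_sk
  WHERE d.d_year = 2016
  GROUP BY d.d_year
""")
{code}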






[jira] [Comment Edited] (SPARK-17791) Join reordering using star schema detection

2016-10-30 Thread Ioana Delaney (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15620517#comment-15620517
 ] 

Ioana Delaney edited comment on SPARK-17791 at 10/30/16 8:22 PM:
-

[~ron8hu] I appreciate your comment. Thank you. I agree that the algorithm will 
have to evolve as CBO introduces new features such as cardinality, predicate 
selectivity, and ultimately the cost-based planning itself. The current 
proposal is conservative in choosing a star plan and can be made even more 
conservative. I can look at what CBO implements today for the number of 
distinct values and base table cardinality as suggested by [~mikewzh]. A check 
for pseudo RI using these two estimates can be easily incorporated into our 
current star-schema detection. 

The algorithm is also disabled by default. We can keep it disabled until we 
have a tighter integration with CBO. But there are advantages in letting the 
code in before CBO is completely implemented. From an implementation point of 
view, this will allow us to incrementally deliver our work. Then, given its 
good performance results, the feature can be enabled on demand for warehouse 
workloads that can take advantage of star join planning.

Thank you.



was (Author: ioana-delaney):
[~ron8hu] I appreciate your comment. Thank you. I agree that the algorithm will 
have to evolve as CBO introduces new features such as cardinality, predicate 
selectivity, and ultimately the cost-based planning itself. The current 
proposal is conservative in choosing a star plan and can be made even more 
conservative. I can look at what CBO implements today for the number of 
distinct values and base table cardinality as suggested by [~wangzhenhua]. A 
check for pseudo RI using these two estimates can be easily incorporated into 
our current star-schema detection. 

The algorithm is also disabled by default. We can keep it disabled until we 
have a tighter integration with CBO. But there are advantages in letting the 
code in before CBO is completely implemented. From an implementation point of 
view, this will allow us to incrementally deliver our work. Then, given its 
good performance results, the feature can be enabled on demand for warehouse 
workloads that can take advantage of star join planning.

Thank you.


> Join reordering using star schema detection
> ---
>
> Key: SPARK-17791
> URL: https://issues.apache.org/jira/browse/SPARK-17791
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ioana Delaney
>Assignee: Ioana Delaney
>Priority: Critical
> Attachments: StarJoinReordering1005.doc
>
>
> This JIRA is a sub-task of SPARK-17626.
> The objective is to provide a consistent performance improvement for star 
> schema queries. Star schema consists of one or more fact tables referencing a 
> number of dimension tables. In general, queries against star schema are 
> expected to run fast  because of the established RI constraints among the 
> tables. This design proposes a join reordering based on natural, generally 
> accepted heuristics for star schema queries:
> * Finds the star join with the largest fact table and places it on the 
> driving arm of the left-deep join. This plan avoids large tables on the 
> inner, and thus favors hash joins. 
> * Applies the most selective dimensions early in the plan to reduce the 
> amount of data flow.
> The design description is included in the below attached document.
> \\






[jira] [Assigned] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9487:
---

Assignee: (was: Apache Spark)

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLlib, and other 
> components. If the operation depends on partition IDs, e.g., a random number 
> generator, this will lead to different results in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.






[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15620610#comment-15620610
 ] 

Apache Spark commented on SPARK-9487:
-

User 'skanjila' has created a pull request for this issue:
https://github.com/apache/spark/pull/15689

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLlib, and other 
> components. If the operation depends on partition IDs, e.g., a random number 
> generator, this will lead to different results in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.






[jira] [Assigned] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9487:
---

Assignee: Apache Spark

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLlib, and other 
> components. If the operation depends on partition IDs, e.g., a random number 
> generator, this will lead to different results in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.






[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-30 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15620613#comment-15620613
 ] 

Saikat Kanjilal commented on SPARK-9487:


[~srowen] Yes I read through that and adjusted the PR title, I will Jenkins 
test this next, however please do let me know if I can proceed adding more to 
this PR including python and other parts of the codebase.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLlib, and other 
> components. If the operation depends on partition IDs, e.g., a random number 
> generator, this will lead to different results in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.






[jira] [Comment Edited] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-30 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15620613#comment-15620613
 ] 

Saikat Kanjilal edited comment on SPARK-9487 at 10/30/16 9:40 PM:
--

[~srowen] Yes I read through that link and adjusted the PR title, I will 
Jenkins test this next, however please do let me know if I can proceed adding 
more to this PR including python and other parts of the codebase.


was (Author: kanjilal):
[~srowen] Yes I read through that and adjusted the PR title, I will Jenkins 
test this next, however please do let me know if I can proceed adding more to 
this PR including python and other parts of the codebase.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLlib, and other 
> components. If the operation depends on partition IDs, e.g., a random number 
> generator, this will lead to different results in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.






[jira] [Comment Edited] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-30 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15620613#comment-15620613
 ] 

Saikat Kanjilal edited comment on SPARK-9487 at 10/30/16 9:46 PM:
--

[~srowen] Yes I read through that link and adjusted the PR title, however 
please do let me know if I can proceed adding more to this PR including python 
and other parts of the codebase.


was (Author: kanjilal):
[~srowen] Yes, I read through that link and adjusted the PR title. I will 
Jenkins test this next; however, please do let me know if I can proceed with 
adding more to this PR, including Python and other parts of the codebase.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different results in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-30 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15620657#comment-15620657
 ] 

Saikat Kanjilal commented on SPARK-9487:


Added org.apache.spark.mllib unit test changes to the pull request.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different results in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18106) Analyze Table accepts a garbage identifier at the end

2016-10-30 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-18106.
---
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.1.0

> Analyze Table accepts a garbage identifier at the end
> -
>
> Key: SPARK-18106
> URL: https://issues.apache.org/jira/browse/SPARK-18106
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Srinath
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.1.0
>
>
> {noformat}
> scala> sql("create table test(a int)")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("analyze table test compute statistics blah")
> res3: org.apache.spark.sql.DataFrame = []
> {noformat}
> An identifier that is not "noscan" produces an AnalyzeTableCommand with 
> noscan=false
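The fix amounts to accepting only the {{NOSCAN}} keyword after {{COMPUTE 
STATISTICS}} and rejecting anything else. A simplified sketch of that check 
(an illustration only, not the actual parser code):

{code}
// Simplified illustration of the intended behavior: only "noscan"
// (case-insensitive) is a valid trailing identifier; any other token should
// be reported as an error instead of silently meaning noscan = false.
object AnalyzeOptionCheck {
  def parseNoScan(trailing: Option[String]): Boolean = trailing match {
    case None                                      => false   // full scan
    case Some(id) if id.equalsIgnoreCase("noscan") => true    // NOSCAN
    case Some(other) =>
      throw new IllegalArgumentException(s"Expected `NOSCAN` instead of `$other`")
  }
}
{code}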



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18174) Avoid Implicit Type Cast in Arguments of Expressions Extending String2StringExpression

2016-10-30 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15620685#comment-15620685
 ] 

Herman van Hovell commented on SPARK-18174:
---

[~smilegator] Won't this create a regression for users who are relying on this?

> Avoid Implicit Type Cast in Arguments of Expressions Extending 
> String2StringExpression
> --
>
> Key: SPARK-18174
> URL: https://issues.apache.org/jira/browse/SPARK-18174
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiao Li
>
> For the expressions that extend String2StringExpression (lower, upper, ltrim, 
> rtrim, trim and reverse), Analyzer should not implicitly cast the arguments 
> to string. If users input other data types instead of string, we should 
> issue an exception for this misuse.
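For illustration, the behavior this issue wants to tighten looks like the 
following: with implicit casting in place, a non-string argument is silently 
cast rather than rejected. A small spark-shell style sketch (assuming {{spark}} 
is the session):

{code}
// Today the Analyzer implicitly casts the integer literal to string, so a
// String2StringExpression such as upper() accepts it and returns "123".
// Under this proposal, the same query would raise an analysis error instead.
spark.sql("SELECT upper(123)").show()
{code}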



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException

2016-10-30 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15620711#comment-15620711
 ] 

Harish commented on SPARK-16740:


Is this fix available in the 2.0.2 snapshot? Please confirm.

> joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
> 
>
> Key: SPARK-16740
> URL: https://issues.apache.org/jira/browse/SPARK-16740
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>Assignee: Sylvain Zimmer
> Fix For: 2.0.1, 2.1.0
>
>
> Hello,
> Here is a crash in Spark SQL joins, with a minimal reproducible test case. 
> Interestingly, it only seems to happen when reading Parquet data (I added a 
> {{crash = True}} variable to show it)
> This is a {{left_outer}} example, but it also crashes with a regular 
> {{inner}} join.
> {code}
> import os
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> schema1 = SparkTypes.StructType([
> SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True)
> ])
> schema2 = SparkTypes.StructType([
> SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True)
> ])
> # Valid Long values (-9223372036854775808 < -5543241376386463808 , 
> 4661454128115150227 < 9223372036854775807)
> data1 = [(4661454128115150227,), (-5543241376386463808,)]
> data2 = [(650460285, )]
> df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1)
> df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2)
> crash = True
> if crash:
> os.system("rm -rf /tmp/sparkbug")
> df1.write.parquet("/tmp/sparkbug/vertex")
> df2.write.parquet("/tmp/sparkbug/edge")
> df1 = sqlc.read.load("/tmp/sparkbug/vertex")
> df2 = sqlc.read.load("/tmp/sparkbug/edge")
> result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer")
> # Should print [Row(id2=650460285, id1=None)]
> print result_df.collect()
> {code}
> When run with {{spark-submit}}, the final {{collect()}} call crashes with 
> this:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o61.collectToPython.
> : org.apache.spark.SparkException: Exception thrown in awaitResult:
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.doProduce(ExistingRDD.scala:328)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.produce(ExistingRDD.scala:225)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:77)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anon

[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-10-30 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15620750#comment-15620750
 ] 

Franck Tago commented on SPARK-15616:
-

So I was not able to use the changes, for the following reasons:
1. I forgot to mention that I am working off the spark 2.0.1 branch.
2. I get the following error:
[info] Compiling 30 Scala sources and 2 Java sources to 
/export/home/devbld/spark_world/Mercury/pvt/ftago/spark-2.0.1/sql/hive/target/scala-2.11/classes...
[error] 
/export/home/devbld/spark_world/Mercury/pvt/ftago/spark-2.0.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala:295:
 type mismatch;
[error]  found   : Seq[org.apache.spark.sql.catalyst.expressions.Expression]
[error]  required: Option[String]
[error] MetastoreRelation(databaseName, tableName, 
partitionPruningPred)(catalogTable, client, sparkSession)
[error]^
[error] one error found
[error] Compile failed 

Can you please build a version of this fix off spark 2.0.1? I tried 
incorporating your changes, but as shown by the error message above, I was not 
able to.

> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently, if some partitions of a partitioned table are used in a join 
> operation, we rely on the table size returned by the Metastore to decide 
> whether we can convert the operation to a broadcast join. 
> If a Filter can prune some partitions, Hive prunes them before deciding to 
> use broadcast joins, based on the HDFS size of the partitions that are 
> involved in the query. Spark SQL needs the same behavior, which would improve 
> join performance for partitioned tables.
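For reference, the decision the description refers to is essentially a 
size-versus-threshold comparison. A simplified sketch of the idea (not the 
actual planner code); falling back to the HDFS size of only the pruned 
partitions would produce a much smaller estimate than the Metastore size of 
the whole table:

{code}
import org.apache.spark.sql.SparkSession

// Simplified sketch of the broadcast decision: a relation is broadcast when
// its estimated size is at or below spark.sql.autoBroadcastJoinThreshold.
// Using the Metastore size of the whole table inflates the estimate; using
// the HDFS size of only the partitions that survive pruning would not.
object BroadcastDecisionSketch {
  def canBroadcast(estimatedSizeInBytes: Long, spark: SparkSession): Boolean = {
    val threshold = spark.conf.get("spark.sql.autoBroadcastJoinThreshold").toLong
    estimatedSizeInBytes >= 0 && estimatedSizeInBytes <= threshold
  }
}
{code}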



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17919) Make timeout to RBackend configurable in SparkR

2016-10-30 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-17919.
--
Resolution: Fixed
  Assignee: Hossein Falaki

> Make timeout to RBackend configurable in SparkR
> ---
>
> Key: SPARK-17919
> URL: https://issues.apache.org/jira/browse/SPARK-17919
> Project: Spark
>  Issue Type: Story
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>Assignee: Hossein Falaki
>
> I am working on a project where {{gapply()}} is being used with a large 
> dataset that happens to be extremely skewed. On that skewed partition, the 
> user function takes more than 2 hours to return and that turns out to be 
> larger than the timeout that we hardcode in SparkR for backend connection.
> {code}
> connectBackend <- function(hostname, port, timeout = 6000) 
> {code}
> Ideally the user should be able to reconfigure Spark and increase the timeout. It 
> should be a small fix.
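The general shape of the requested change is to read the timeout from 
configuration, keeping the old value as the default, instead of hard-coding it. 
A rough Scala sketch of that pattern; the config key name below is an 
assumption, and the real change would live on the R side:

{code}
// Sketch only: replace a hard-coded timeout with a configurable one.
// The key "spark.r.backendConnectionTimeout" is an assumed name; the actual
// setting chosen by the fix may differ.
def backendConnectionTimeout(conf: Map[String, String]): Int =
  conf.get("spark.r.backendConnectionTimeout").map(_.toInt).getOrElse(6000)
{code}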



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17919) Make timeout to RBackend configurable in SparkR

2016-10-30 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-17919:
-
Fix Version/s: 2.1.0

> Make timeout to RBackend configurable in SparkR
> ---
>
> Key: SPARK-17919
> URL: https://issues.apache.org/jira/browse/SPARK-17919
> Project: Spark
>  Issue Type: Story
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>Assignee: Hossein Falaki
> Fix For: 2.1.0
>
>
> I am working on a project where {{gapply()}} is being used with a large 
> dataset that happens to be extremely skewed. On that skewed partition, the 
> user function takes more than 2 hours to return and that turns out to be 
> larger than the timeout that we hardcode in SparkR for backend connection.
> {code}
> connectBackend <- function(hostname, port, timeout = 6000) 
> {code}
> Ideally the user should be able to reconfigure Spark and increase the timeout. It 
> should be a small fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16137) Random Forest wrapper in SparkR

2016-10-30 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-16137.
--
  Resolution: Fixed
Assignee: Felix Cheung
Target Version/s: 2.1.0

> Random Forest wrapper in SparkR
> ---
>
> Key: SPARK-16137
> URL: https://issues.apache.org/jira/browse/SPARK-16137
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Kai Jiang
>Assignee: Felix Cheung
>
> Implement a wrapper in SparkR to support Random Forest.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16137) Random Forest wrapper in SparkR

2016-10-30 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-16137:
-
Fix Version/s: 2.1.0

> Random Forest wrapper in SparkR
> ---
>
> Key: SPARK-16137
> URL: https://issues.apache.org/jira/browse/SPARK-16137
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Kai Jiang
>Assignee: Felix Cheung
> Fix For: 2.1.0
>
>
> Implement a wrapper in SparkR to support Random Forest.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18110) Missing parameter in Python for RandomForest regression and classification

2016-10-30 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-18110.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

> Missing parameter in Python for RandomForest regression and classification
> --
>
> Key: SPARK-18110
> URL: https://issues.apache.org/jira/browse/SPARK-18110
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18174) Avoid Implicit Type Cast in Arguments of Expressions Extending String2StringExpression

2016-10-30 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15620785#comment-15620785
 ] 

Xiao Li commented on SPARK-18174:
-

Yeah. Your concern is right. The impact of implicit type casting is not small. 
I posted some thoughts in the PR. We should not merge this PR anyway.

> Avoid Implicit Type Cast in Arguments of Expressions Extending 
> String2StringExpression
> --
>
> Key: SPARK-18174
> URL: https://issues.apache.org/jira/browse/SPARK-18174
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiao Li
>
> For the expressions that extend String2StringExpression (lower, upper, ltrim, 
> rtrim, trim and reverse), Analyzer should not implicitly cast the arguments 
> to string. If users input other data types instead of string, we should 
> issue an exception for this misuse.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException

2016-10-30 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15620813#comment-15620813
 ] 

Dongjoon Hyun commented on SPARK-16740:
---

Hi, [~harishk15]

Yep. The patch is still there in branch-2.0. I guess you can test that with 
Spark 2.0.2-rc1, too.

If you run into a related issue in 2.0.2-rc1, please file a JIRA issue.

> joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
> 
>
> Key: SPARK-16740
> URL: https://issues.apache.org/jira/browse/SPARK-16740
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>Assignee: Sylvain Zimmer
> Fix For: 2.0.1, 2.1.0
>
>
> Hello,
> Here is a crash in Spark SQL joins, with a minimal reproducible test case. 
> Interestingly, it only seems to happen when reading Parquet data (I added a 
> {{crash = True}} variable to show it)
> This is a {{left_outer}} example, but it also crashes with a regular 
> {{inner}} join.
> {code}
> import os
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> schema1 = SparkTypes.StructType([
> SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True)
> ])
> schema2 = SparkTypes.StructType([
> SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True)
> ])
> # Valid Long values (-9223372036854775808 < -5543241376386463808 , 
> 4661454128115150227 < 9223372036854775807)
> data1 = [(4661454128115150227,), (-5543241376386463808,)]
> data2 = [(650460285, )]
> df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1)
> df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2)
> crash = True
> if crash:
> os.system("rm -rf /tmp/sparkbug")
> df1.write.parquet("/tmp/sparkbug/vertex")
> df2.write.parquet("/tmp/sparkbug/edge")
> df1 = sqlc.read.load("/tmp/sparkbug/vertex")
> df2 = sqlc.read.load("/tmp/sparkbug/edge")
> result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer")
> # Should print [Row(id2=650460285, id1=None)]
> print result_df.collect()
> {code}
> When run with {{spark-submit}}, the final {{collect()}} call crashes with 
> this:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o61.collectToPython.
> : org.apache.spark.SparkException: Exception thrown in awaitResult:
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.doProduce(ExistingRDD.scala:328)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.produce(ExistingRDD.scala:225)
>   at 
> org.apache.sp

[jira] [Commented] (SPARK-12648) UDF with Option[Double] throws ClassCastException

2016-10-30 Thread Grant Neale (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15620882#comment-15620882
 ] 

Grant Neale commented on SPARK-12648:
-

This works for single-argument UDFs. However, one may want to define a 
multi-argument UDF that allows some arguments to be null.
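For reference, one commonly used workaround for nullable arguments that 
sidesteps {{Option}} entirely is to declare the UDF over boxed 
{{java.lang.Double}}, so nulls arrive as {{null}} and can be handled 
explicitly. A minimal Scala sketch against the 2.x SparkSession API (column 
names are illustrative):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object NullableUdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("nullable-udf").getOrCreate()
    import spark.implicits._

    // Boxed java.lang.Double lets null flow into the function unchanged.
    val addBoth = udf { (a: java.lang.Double, b: java.lang.Double) =>
      if (a == null || b == null) null.asInstanceOf[java.lang.Double]
      else java.lang.Double.valueOf(a + b)
    }

    val df = Seq[(String, Option[Double], Option[Double])](
      ("a", Some(4.0), None),
      ("b", Some(1.0), Some(2.0))
    ).toDF("name", "w1", "w2")

    // Row "a" yields null, row "b" yields 3.0.
    df.withColumn("sum", addBoth($"w1", $"w2")).show()
    spark.stop()
  }
}
{code}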

> UDF with Option[Double] throws ClassCastException
> -
>
> Key: SPARK-12648
> URL: https://issues.apache.org/jira/browse/SPARK-12648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Mikael Valot
>
> I can write an UDF that returns an Option[Double], and the DataFrame's  
> schema is correctly inferred to be a nullable double. 
> However I cannot seem to be able to write a UDF that takes an Option as an 
> argument:
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkContext, SparkConf}
> val conf = new SparkConf().setMaster("local[4]").setAppName("test")
> val sc = new SparkContext(conf)
> val sqlc = new SQLContext(sc)
> import sqlc.implicits._
> val df = sc.parallelize(List(("a", Some(4D)), ("b", None))).toDF("name", 
> "weight")
> import org.apache.spark.sql.functions._
> val addTwo = udf((d: Option[Double]) => d.map(_+2)) 
> df.withColumn("plusTwo", addTwo(df("weight"))).show()
> =>
> 2016-01-05T14:41:52 Executor task launch worker-0 ERROR 
> org.apache.spark.executor.Executor Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.ClassCastException: java.lang.Double cannot be cast to scala.Option
>   at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:18) 
> ~[na:na]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) ~[na:na]
>   at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
>  ~[spark-sql_2.10-1.6.0.jar:1.6.0]
>   at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
>  ~[spark-sql_2.10-1.6.0.jar:1.6.0]
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
> ~[scala-library-2.10.5.jar:na]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException

2016-10-30 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15620885#comment-15620885
 ] 

Harish commented on SPARK-16740:


Thank you. I downloaded the 2.0.2 snapshot with Hadoop 2.7 (I think it was 
built on 10/13). I can still reproduce this issue. If "2.0.2-rc1" was updated 
after 10/13, then I will take the updates and try. Can you please help me find 
the latest download path?

> joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
> 
>
> Key: SPARK-16740
> URL: https://issues.apache.org/jira/browse/SPARK-16740
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>Assignee: Sylvain Zimmer
> Fix For: 2.0.1, 2.1.0
>
>
> Hello,
> Here is a crash in Spark SQL joins, with a minimal reproducible test case. 
> Interestingly, it only seems to happen when reading Parquet data (I added a 
> {{crash = True}} variable to show it)
> This is a {{left_outer}} example, but it also crashes with a regular 
> {{inner}} join.
> {code}
> import os
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> schema1 = SparkTypes.StructType([
> SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True)
> ])
> schema2 = SparkTypes.StructType([
> SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True)
> ])
> # Valid Long values (-9223372036854775808 < -5543241376386463808 , 
> 4661454128115150227 < 9223372036854775807)
> data1 = [(4661454128115150227,), (-5543241376386463808,)]
> data2 = [(650460285, )]
> df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1)
> df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2)
> crash = True
> if crash:
> os.system("rm -rf /tmp/sparkbug")
> df1.write.parquet("/tmp/sparkbug/vertex")
> df2.write.parquet("/tmp/sparkbug/edge")
> df1 = sqlc.read.load("/tmp/sparkbug/vertex")
> df2 = sqlc.read.load("/tmp/sparkbug/edge")
> result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer")
> # Should print [Row(id2=650460285, id1=None)]
> print result_df.collect()
> {code}
> When run with {{spark-submit}}, the final {{collect()}} call crashes with 
> this:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o61.collectToPython.
> : org.apache.spark.SparkException: Exception thrown in awaitResult:
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.doProduce(ExistingRDD.scala:328)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.produce(Exis

[jira] [Created] (SPARK-18175) Improve the test case coverage of implicit type casting

2016-10-30 Thread Xiao Li (JIRA)
Xiao Li created SPARK-18175:
---

 Summary: Improve the test case coverage of implicit type casting
 Key: SPARK-18175
 URL: https://issues.apache.org/jira/browse/SPARK-18175
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.0.1
Reporter: Xiao Li


So far, we have limited test case coverage of implicit type casting. We need 
to draw a matrix to find all the possible casting pairs. 
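One way to read "draw a matrix" is to enumerate every ordered pair of atomic 
types and assert the expected implicit-cast result for each pair, rather than 
testing a hand-picked subset. A small, illustrative sketch of that enumeration:

{code}
import org.apache.spark.sql.types._

// Enumerate all ordered (from, to) pairs of atomic types; a test suite can
// then assert, for each pair, whether an implicit cast is expected.
val atomicTypes: Seq[DataType] = Seq(
  BooleanType, ByteType, ShortType, IntegerType, LongType, FloatType,
  DoubleType, DecimalType.SYSTEM_DEFAULT, StringType, BinaryType,
  DateType, TimestampType)

val allCastPairs: Seq[(DataType, DataType)] =
  for (from <- atomicTypes; to <- atomicTypes) yield (from, to)
{code}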



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18175) Improve the test case coverage of implicit type casting

2016-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18175:


Assignee: (was: Apache Spark)

> Improve the test case coverage of implicit type casting
> ---
>
> Key: SPARK-18175
> URL: https://issues.apache.org/jira/browse/SPARK-18175
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiao Li
>
> So far, we have limited test case coverage of implicit type casting. We 
> need to draw a matrix to find all the possible casting pairs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18175) Improve the test case coverage of implicit type casting

2016-10-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15620930#comment-15620930
 ] 

Apache Spark commented on SPARK-18175:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/15691

> Improve the test case coverage of implicit type casting
> ---
>
> Key: SPARK-18175
> URL: https://issues.apache.org/jira/browse/SPARK-18175
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiao Li
>
> So far, we have limited test case coverage of implicit type casting. We 
> need to draw a matrix to find all the possible casting pairs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18175) Improve the test case coverage of implicit type casting

2016-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18175:


Assignee: Apache Spark

> Improve the test case coverage of implicit type casting
> ---
>
> Key: SPARK-18175
> URL: https://issues.apache.org/jira/browse/SPARK-18175
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> So far, we have limited test case coverage of implicit type casting. We 
> need to draw a matrix to find all the possible casting pairs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17952) SparkSession createDataFrame method throws exception for nested JavaBeans

2016-10-30 Thread Amit Baghel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Baghel updated SPARK-17952:

Summary: SparkSession createDataFrame method throws exception for nested 
JavaBeans  (was: Java SparkSession createDataFrame method throws exception for 
nested JavaBeans)

> SparkSession createDataFrame method throws exception for nested JavaBeans
> -
>
> Key: SPARK-17952
> URL: https://issues.apache.org/jira/browse/SPARK-17952
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Amit Baghel
>
> As per latest spark documentation for Java at 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection,
>  
> {quote}
> Nested JavaBeans and List or Array fields are supported though.
> {quote}
> However, nested JavaBeans are not working. Please see the code below.
> SubCategory class
> {code}
> public class SubCategory implements Serializable{
>   private String id;
>   private String name;
>   
>   public String getId() {
>   return id;
>   }
>   public void setId(String id) {
>   this.id = id;
>   }
>   public String getName() {
>   return name;
>   }
>   public void setName(String name) {
>   this.name = name;
>   }   
> }
> {code}
> Category class
> {code}
> public class Category implements Serializable{
>   private String id;
>   private SubCategory subCategory;
>   
>   public String getId() {
>   return id;
>   }
>   public void setId(String id) {
>   this.id = id;
>   }
>   public SubCategory getSubCategory() {
>   return subCategory;
>   }
>   public void setSubCategory(SubCategory subCategory) {
>   this.subCategory = subCategory;
>   }
> }
> {code}
> SparkSample class
> {code}
> public class SparkSample {
>   public static void main(String[] args) throws IOException { 
> 
>   SparkSession spark = SparkSession
>   .builder()
>   .appName("SparkSample")
>   .master("local")
>   .getOrCreate();
>   //SubCategory
>   SubCategory sub = new SubCategory();
>   sub.setId("sc-111");
>   sub.setName("Sub-1");
>   //Category
>   Category category = new Category();
>   category.setId("s-111");
>   category.setSubCategory(sub);
>   //categoryList
>   List<Category> categoryList = new ArrayList<>();
>   categoryList.add(category);
>//DF
>   Dataset<Row> dframe = spark.createDataFrame(categoryList, 
> Category.class);  
>   dframe.show();  
>   }
> }
> {code}
> Above code throws below error.
> {code}
> Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d 
> (of class com.sample.SubCategory)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$class.toStream(Iterator.scala:1322)
>   at scala.collection.AbstractIterator.toStream(Iterator.scala:1336)
>   at 
> scala.colle

[jira] [Commented] (SPARK-17791) Join reordering using star schema detection

2016-10-30 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15621070#comment-15621070
 ] 

Zhenhua Wang commented on SPARK-17791:
--

Hi Ioana,
The current implementation is NOT ready, because we need to refactor the 
statistics structure to make it easier to use during cost estimation; it won't 
be stable until we finish the estimation part. I think it is necessary and 
important to use CBO-based RI. You can start to incorporate it in the algorithm 
and rebase after the related code refactoring is finished. Thanks.

> Join reordering using star schema detection
> ---
>
> Key: SPARK-17791
> URL: https://issues.apache.org/jira/browse/SPARK-17791
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ioana Delaney
>Assignee: Ioana Delaney
>Priority: Critical
> Attachments: StarJoinReordering1005.doc
>
>
> This JIRA is a sub-task of SPARK-17626.
> The objective is to provide a consistent performance improvement for star 
> schema queries. Star schema consists of one or more fact tables referencing a 
> number of dimension tables. In general, queries against star schema are 
> expected to run fast  because of the established RI constraints among the 
> tables. This design proposes a join reordering based on natural, generally 
> accepted heuristics for star schema queries:
> * Finds the star join with the largest fact table and places it on the 
> driving arm of the left-deep join. This plan avoids large tables on the 
> inner, and thus favors hash joins. 
> * Applies the most selective dimensions early in the plan to reduce the 
> amount of data flow.
> The design description is included in the below attached document.
> \\
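To make the heuristics above concrete, a star-schema query typically joins one 
large fact table against several small, filtered dimension tables. The sketch 
below only shows the query shape involved; the table and column names are made 
up for illustration:

{code}
// Illustrative star-schema query shape (hypothetical schema): 'sales' is the
// fact table; 'date_dim', 'store' and 'item' are dimensions. The proposed
// reordering keeps the large fact table on the driving arm of the left-deep
// join and applies the most selective dimension filters early.
val starQuery =
  """
    |SELECT s.store_name, SUM(f.amount) AS total_amount
    |FROM sales f
    |JOIN date_dim d ON f.date_id  = d.date_id  AND d.year = 2016
    |JOIN store s    ON f.store_id = s.store_id
    |JOIN item i     ON f.item_id  = i.item_id  AND i.category = 'Books'
    |GROUP BY s.store_name
  """.stripMargin
{code}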



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18176) Kafka010 .createRDD() scala API should expect scala Map

2016-10-30 Thread Liwei Lin (JIRA)
Liwei Lin created SPARK-18176:
-

 Summary: Kafka010 .createRDD() scala API should expect scala Map
 Key: SPARK-18176
 URL: https://issues.apache.org/jira/browse/SPARK-18176
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 2.0.1, 2.0.0
Reporter: Liwei Lin


Throughout {{external/kafka-010}}, Java APIs are expecting {{java.lang.Maps}} 
and Scala APIs are expecting {{scala.collection.Maps}}, with the exception of 
{{KafkaUtils.createRDD()}} Scala API expecting a {{java.lang.Map}}.
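As an illustration of the inconsistency, a Scala caller of the current 
{{createRDD()}} has to convert its parameter map by hand before calling the 
API. A minimal sketch; the consumer properties are placeholders:

{code}
import scala.collection.JavaConverters._

// What the current Scala API signature forces on callers: build a Scala Map
// of consumer properties, then convert it to java.util.Map before passing it
// to KafkaUtils.createRDD. Property values here are placeholders.
object CreateRddParamsSketch {
  val kafkaParams: Map[String, Object] = Map(
    "bootstrap.servers" -> "localhost:9092",
    "group.id"          -> "example-group"
  )
  val javaParams: java.util.Map[String, Object] = kafkaParams.asJava
}
{code}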



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18176) Kafka010 .createRDD() scala API should expect scala Map

2016-10-30 Thread Liwei Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liwei Lin updated SPARK-18176:
--
Description: 
Throughout {{external/kafka-010}}, Java APIs are expecting {{java.lang.Maps}} 
and Scala APIs are expecting {{scala.collection.Maps}}, with the exception of 
{{KafkaUtils.createRDD()}} Scala API expecting a {{java.lang.Map}}.

But please note, this is a public API change.

  was:Throughout {{external/kafka-010}}, Java APIs are expecting 
{{java.lang.Maps}} and Scala APIs are expecting {{scala.collection.Maps}}, with 
the exception of {{KafkaUtils.createRDD()}} Scala API expecting a 
{{java.lang.Map}}.


> Kafka010 .createRDD() scala API should expect scala Map
> ---
>
> Key: SPARK-18176
> URL: https://issues.apache.org/jira/browse/SPARK-18176
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Liwei Lin
>
> Throughout {{external/kafka-010}}, Java APIs are expecting {{java.lang.Maps}} 
> and Scala APIs are expecting {{scala.collection.Maps}}, with the exception of 
> {{KafkaUtils.createRDD()}} Scala API expecting a {{java.lang.Map}}.
> But please note, this is a public API change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18176) Kafka010 .createRDD() scala API should expect scala Map

2016-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18176:


Assignee: (was: Apache Spark)

> Kafka010 .createRDD() scala API should expect scala Map
> ---
>
> Key: SPARK-18176
> URL: https://issues.apache.org/jira/browse/SPARK-18176
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Liwei Lin
>
> Throughout {{external/kafka-010}}, Java APIs are expecting {{java.lang.Maps}} 
> and Scala APIs are expecting {{scala.collection.Maps}}, with the exception of 
> {{KafkaUtils.createRDD()}} Scala API expecting a {{java.lang.Map}}.
> But please note, this is a public API change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18176) Kafka010 .createRDD() scala API should expect scala Map

2016-10-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15621075#comment-15621075
 ] 

Apache Spark commented on SPARK-18176:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15681

> Kafka010 .createRDD() scala API should expect scala Map
> ---
>
> Key: SPARK-18176
> URL: https://issues.apache.org/jira/browse/SPARK-18176
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Liwei Lin
>
> Throughout {{external/kafka-010}}, Java APIs are expecting {{java.lang.Maps}} 
> and Scala APIs are expecting {{scala.collection.Maps}}, with the exception of 
> {{KafkaUtils.createRDD()}} Scala API expecting a {{java.lang.Map}}.
> But please note, this is a public API change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18176) Kafka010 .createRDD() scala API should expect scala Map

2016-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18176:


Assignee: Apache Spark

> Kafka010 .createRDD() scala API should expect scala Map
> ---
>
> Key: SPARK-18176
> URL: https://issues.apache.org/jira/browse/SPARK-18176
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Liwei Lin
>Assignee: Apache Spark
>
> Throughout {{external/kafka-010}}, Java APIs are expecting {{java.lang.Maps}} 
> and Scala APIs are expecting {{scala.collection.Maps}}, with the exception of 
> {{KafkaUtils.createRDD()}} Scala API expecting a {{java.lang.Map}}.
> But please note, this is a public API change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org