[jira] [Commented] (SPARK-12059) Standalone Master assertion error

2015-11-30 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033268#comment-15033268
 ] 

Saisai Shao commented on SPARK-12059:
-

Hi [~andrewor14], when would this happen? I suppose a state transition from {{RUNNING}} 
to {{RUNNING}} should not normally occur.

> Standalone Master assertion error
> -
>
> Key: SPARK-12059
> URL: https://issues.apache.org/jira/browse/SPARK-12059
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Saisai Shao
>Priority: Critical
>
> {code}
> 15/11/30 09:55:04 ERROR Inbox: Ignoring error
> java.lang.AssertionError: assertion failed: executor 4 state transfer from 
> RUNNING to RUNNING is illegal
> at scala.Predef$.assert(Predef.scala:179)
> at 
> org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:260)
> at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
> {code}
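One possible direction, sketched below as a hedged illustration (the ExecutorState values and the update handling are simplified assumptions, not the actual Master code): treat a duplicate RUNNING -> RUNNING update as a benign no-op rather than an assertion failure, so a re-delivered status message cannot crash the Master's message loop.

{code}
// Hypothetical sketch only; a simplified stand-in for the Master's state handling.
object ExecutorState extends Enumeration {
  val LAUNCHING, RUNNING, KILLED, FAILED, LOST, EXITED = Value
}

def applyStateUpdate(current: ExecutorState.Value,
                     incoming: ExecutorState.Value): ExecutorState.Value = {
  if (current == incoming) {
    // A duplicate update (e.g. RUNNING -> RUNNING) may arrive if a status
    // message is re-sent; warn and keep the current state rather than assert.
    Console.err.println(s"Ignoring duplicate executor state update: $current -> $incoming")
    current
  } else {
    incoming
  }
}

// Example: a repeated RUNNING update no longer trips an assertion.
val afterDuplicate = applyStateUpdate(ExecutorState.RUNNING, ExecutorState.RUNNING)
{code}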






[jira] [Commented] (SPARK-12046) Visibility and format issues in ScalaDoc/JavaDoc for branch-1.6

2015-11-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033260#comment-15033260
 ] 

Apache Spark commented on SPARK-12046:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/10063

> Visibility and format issues in ScalaDoc/JavaDoc for branch-1.6
> ---
>
> Key: SPARK-12046
> URL: https://issues.apache.org/jira/browse/SPARK-12046
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>







[jira] [Commented] (SPARK-11605) ML 1.6 QA: API: Java compatibility, docs

2015-11-30 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033250#comment-15033250
 ] 

Joseph K. Bradley commented on SPARK-11605:
---

Those are private APIs.

> ML 1.6 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-11605
> URL: https://issues.apache.org/jira/browse/SPARK-11605
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>
> Check Java compatibility for MLlib for this release.
> Checking compatibility means:
> * comparing with the Scala doc
> * verifying that Java docs are not messed up by Scala type incompatibilities. 
>  Some items to look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (The correctness can be checked in Scala.)
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here.
> Note that we should not break APIs from previous releases.  So if you find a 
> problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).






[jira] [Commented] (SPARK-11605) ML 1.6 QA: API: Java compatibility, docs

2015-11-30 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033242#comment-15033242
 ] 

Joseph K. Bradley commented on SPARK-11605:
---

There is already a Java-friendly version.

> ML 1.6 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-11605
> URL: https://issues.apache.org/jira/browse/SPARK-11605
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>
> Check Java compatibility for MLlib for this release.
> Checking compatibility means:
> * comparing with the Scala doc
> * verifying that Java docs are not messed up by Scala type incompatibilities. 
>  Some items to look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (The correctness can be checked in Scala.)
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here.
> Note that we should not break APIs from previous releases.  So if you find a 
> problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).






[jira] [Commented] (SPARK-11605) ML 1.6 QA: API: Java compatibility, docs

2015-11-30 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033241#comment-15033241
 ] 

Joseph K. Bradley commented on SPARK-11605:
---

There are already Java-friendly versions.

> ML 1.6 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-11605
> URL: https://issues.apache.org/jira/browse/SPARK-11605
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>
> Check Java compatibility for MLlib for this release.
> Checking compatibility means:
> * comparing with the Scala doc
> * verifying that Java docs are not messed up by Scala type incompatibilities. 
>  Some items to look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (The correctness can be checked in Scala.)
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here.
> Note that we should not break APIs from previous releases.  So if you find a 
> problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).






[jira] [Commented] (SPARK-11605) ML 1.6 QA: API: Java compatibility, docs

2015-11-30 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033239#comment-15033239
 ] 

Joseph K. Bradley commented on SPARK-11605:
---

We could add Java-friendly versions.

> ML 1.6 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-11605
> URL: https://issues.apache.org/jira/browse/SPARK-11605
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>
> Check Java compatibility for MLlib for this release.
> Checking compatibility means:
> * comparing with the Scala doc
> * verifying that Java docs are not messed up by Scala type incompatibilities. 
>  Some items to look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (The correctness can be checked in Scala.)
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here.
> Note that we should not break APIs from previous releases.  So if you find a 
> problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).






[jira] [Assigned] (SPARK-12070) PySpark implementation of Slicing operator incorrect

2015-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12070:


Assignee: (was: Apache Spark)

> PySpark implementation of Slicing operator incorrect
> 
>
> Key: SPARK-12070
> URL: https://issues.apache.org/jira/browse/SPARK-12070
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>Reporter: Jeff Zhang
>
> {code}
> aa=('Ofer', 1), ('Wei', 2)
> a = sqlContext.createDataFrame(aa)
> a.select(a._1[2:]).show()
> {code}
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/jzhang/github/spark/python/pyspark/sql/column.py", line 286, 
> in substr
> jc = self._jc.substr(startPos, length)
>   File 
> "/Users/jzhang/github/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
>  line 813, in __call__
>   File "/Users/jzhang/github/spark/python/pyspark/sql/utils.py", line 45, in 
> deco
> return f(*a, **kw)
>   File 
> "/Users/jzhang/github/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", 
> line 312, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling o37.substr. Trace:
> py4j.Py4JException: Method substr([class java.lang.Integer, class 
> java.lang.Long]) does not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
>   at py4j.Gateway.invoke(Gateway.java:252)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:209)
>   at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Assigned] (SPARK-12070) PySpark implementation of Slicing operator incorrect

2015-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12070:


Assignee: Apache Spark

> PySpark implementation of Slicing operator incorrect
> 
>
> Key: SPARK-12070
> URL: https://issues.apache.org/jira/browse/SPARK-12070
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>
> {code}
> aa=('Ofer', 1), ('Wei', 2)
> a = sqlContext.createDataFrame(aa)
> a.select(a._1[2:]).show()
> {code}
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/jzhang/github/spark/python/pyspark/sql/column.py", line 286, 
> in substr
> jc = self._jc.substr(startPos, length)
>   File 
> "/Users/jzhang/github/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
>  line 813, in __call__
>   File "/Users/jzhang/github/spark/python/pyspark/sql/utils.py", line 45, in 
> deco
> return f(*a, **kw)
>   File 
> "/Users/jzhang/github/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", 
> line 312, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling o37.substr. Trace:
> py4j.Py4JException: Method substr([class java.lang.Integer, class 
> java.lang.Long]) does not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
>   at py4j.Gateway.invoke(Gateway.java:252)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:209)
>   at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Commented] (SPARK-12070) PySpark implementation of Slicing operator incorrect

2015-11-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033233#comment-15033233
 ] 

Apache Spark commented on SPARK-12070:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/10062

> PySpark implementation of Slicing operator incorrect
> 
>
> Key: SPARK-12070
> URL: https://issues.apache.org/jira/browse/SPARK-12070
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>Reporter: Jeff Zhang
>
> {code}
> aa=('Ofer', 1), ('Wei', 2)
> a = sqlContext.createDataFrame(aa)
> a.select(a._1[2:]).show()
> {code}
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/jzhang/github/spark/python/pyspark/sql/column.py", line 286, 
> in substr
> jc = self._jc.substr(startPos, length)
>   File 
> "/Users/jzhang/github/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
>  line 813, in __call__
>   File "/Users/jzhang/github/spark/python/pyspark/sql/utils.py", line 45, in 
> deco
> return f(*a, **kw)
>   File 
> "/Users/jzhang/github/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", 
> line 312, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling o37.substr. Trace:
> py4j.Py4JException: Method substr([class java.lang.Integer, class 
> java.lang.Long]) does not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
>   at py4j.Gateway.invoke(Gateway.java:252)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:209)
>   at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Commented] (SPARK-11605) ML 1.6 QA: API: Java compatibility, docs

2015-11-30 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033230#comment-15033230
 ] 

Joseph K. Bradley commented on SPARK-11605:
---

We don't need to worry about Attribute; it's an old API and is a DeveloperApi 
we expect to change.

If you see Option issues with other public APIs though, they would be good to 
investigate.
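As an illustration of the kind of Option issue meant here, the hedged sketch below uses a made-up class rather than an actual MLlib API: a Scala member returning Option[Double] surfaces in Java as scala.Option<Object>, so a Java-friendly accessor is typically added alongside it.

{code}
// Hypothetical example class; not part of Spark ML.
class ExampleSummary(threshold: Option[Double]) {

  // Scala-friendly accessor: appears in Java as scala.Option<Object>.
  def getThreshold: Option[Double] = threshold

  // Java-friendly variant: returns a plain double, throwing if unset.
  def getThresholdOrThrow: Double =
    threshold.getOrElse(throw new NoSuchElementException("threshold is not set"))
}

val summary = new ExampleSummary(Some(0.5))
println(summary.getThresholdOrThrow)   // 0.5
{code}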

> ML 1.6 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-11605
> URL: https://issues.apache.org/jira/browse/SPARK-11605
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>
> Check Java compatibility for MLlib for this release.
> Checking compatibility means:
> * comparing with the Scala doc
> * verifying that Java docs are not messed up by Scala type incompatibilities. 
>  Some items to look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (The correctness can be checked in Scala.)
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here.
> Note that we should not break APIs from previous releases.  So if you find a 
> problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).






[jira] [Commented] (SPARK-11605) ML 1.6 QA: API: Java compatibility, docs

2015-11-30 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033228#comment-15033228
 ] 

Joseph K. Bradley commented on SPARK-11605:
---

This is a problem we should fix for this release since it's a new API.

> ML 1.6 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-11605
> URL: https://issues.apache.org/jira/browse/SPARK-11605
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>
> Check Java compatibility for MLlib for this release.
> Checking compatibility means:
> * comparing with the Scala doc
> * verifying that Java docs are not messed up by Scala type incompatibilities. 
>  Some items to look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (The correctness can be checked in Scala.)
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here.
> Note that we should not break APIs from previous releases.  So if you find a 
> problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).






[jira] [Commented] (SPARK-10798) JsonMappingException with Spark Context Parallelize

2015-11-30 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033229#comment-15033229
 ] 

Miao Wang commented on SPARK-10798:
---

These two lines:

byte[] data= Kryo.serialize(List)
List fromKryoRows=Kryo.unserialize(data)

do not compile in my Java application. I searched for documented Kryo serializer 
usage and found nothing matching these two lines.

Miao
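For reference, a hedged sketch of how Kryo is usually driven directly (generic Kryo usage, not Spark code; the Kryo.serialize(...) lines in the description appear to be pseudocode): serialization goes through a Kryo instance with Output/Input buffers.

{code}
// Generic Kryo round trip; assumes the com.esotericsoftware.kryo dependency.
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.{Input, Output}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}

val kryo = new Kryo()
val buffer = new ByteArrayOutputStream()
val output = new Output(buffer)
kryo.writeClassAndObject(output,
  new java.util.ArrayList[String](java.util.Arrays.asList("test")))
output.close()

val input = new Input(new ByteArrayInputStream(buffer.toByteArray))
val restored = kryo.readClassAndObject(input)
input.close()
println(restored)   // [test]
{code}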

> JsonMappingException with Spark Context Parallelize
> ---
>
> Key: SPARK-10798
> URL: https://issues.apache.org/jira/browse/SPARK-10798
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
> Environment: Linux, Java 1.8.45
>Reporter: Dev Lakhani
>
> When trying to create an RDD of Rows using a Java Spark Context, if I 
> serialize the rows with Kryo first, the sparkContext call fails.
> byte[] data= Kryo.serialize(List)
> List fromKryoRows=Kryo.unserialize(data)
> List rows= new Vector(); //using a new set of data.
> rows.add(RowFactory.create("test"));
> javaSparkContext.parallelize(rows);
> OR
> javaSparkContext.parallelize(fromKryoRows); //using deserialized rows
> I get :
> com.fasterxml.jackson.databind.JsonMappingException: (None,None) (of class 
> scala.Tuple2) (through reference chain: 
> org.apache.spark.rdd.RDDOperationScope["parent"])
>at 
> com.fasterxml.jackson.databind.JsonMappingException.wrapWithPath(JsonMappingException.java:210)
>at 
> com.fasterxml.jackson.databind.JsonMappingException.wrapWithPath(JsonMappingException.java:177)
>at 
> com.fasterxml.jackson.databind.ser.std.StdSerializer.wrapAndThrow(StdSerializer.java:187)
>at 
> com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:647)
>at 
> com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:152)
>at 
> com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
>at 
> com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881)
>at 
> com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338)
>at 
> org.apache.spark.rdd.RDDOperationScope.toJson(RDDOperationScope.scala:50)
>at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:141)
>at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>at 
> org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>at 
> org.apache.spark.SparkContext.parallelize(SparkContext.scala:714)
>at 
> org.apache.spark.api.java.JavaSparkContext.parallelize(JavaSparkContext.scala:145)
>at 
> org.apache.spark.api.java.JavaSparkContext.parallelize(JavaSparkContext.scala:157)
>...
> Caused by: scala.MatchError: (None,None) (of class scala.Tuple2)
>at 
> com.fasterxml.jackson.module.scala.ser.OptionSerializer$$anonfun$serialize$1.apply$mcV$sp(OptionSerializerModule.scala:32)
>at 
> com.fasterxml.jackson.module.scala.ser.OptionSerializer$$anonfun$serialize$1.apply(OptionSerializerModule.scala:32)
>at 
> com.fasterxml.jackson.module.scala.ser.OptionSerializer$$anonfun$serialize$1.apply(OptionSerializerModule.scala:32)
>at scala.Option.getOrElse(Option.scala:120)
>at 
> com.fasterxml.jackson.module.scala.ser.OptionSerializer.serialize(OptionSerializerModule.scala:31)
>at 
> com.fasterxml.jackson.module.scala.ser.OptionSerializer.serialize(OptionSerializerModule.scala:22)
>at 
> com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:505)
>at 
> com.fasterxml.jackson.module.scala.ser.OptionPropertyWriter.serializeAsField(OptionSerializerModule.scala:128)
>at 
> com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:639)
>... 19 more
> I've tried updating jackson-module-scala to 2.6.1, but the same issue occurs. 
> This happens in local mode with Java 1.8_45. I searched the web and this JIRA 
> for similar issues but found nothing of interest.
>  






[jira] [Comment Edited] (SPARK-12071) Programming guide should explain NULL in JVM translate to NA in R

2015-11-30 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033227#comment-15033227
 ] 

Felix Cheung edited comment on SPARK-12071 at 12/1/15 7:07 AM:
---

See commit 
https://github.com/apache/spark/commit/71a138cd0e0a14e8426f97877e3b52a562bbd02c


was (Author: felixcheung):
See PR 
https://github.com/apache/spark/commit/71a138cd0e0a14e8426f97877e3b52a562bbd02c

> Programming guide should explain NULL in JVM translate to NA in R
> -
>
> Key: SPARK-12071
> URL: https://issues.apache.org/jira/browse/SPARK-12071
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Felix Cheung
>Priority: Minor
>
> This behavior seems to be new for Spark 1.6.0






[jira] [Commented] (SPARK-12071) Programming guide should explain NULL in JVM translate to NA in R

2015-11-30 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033227#comment-15033227
 ] 

Felix Cheung commented on SPARK-12071:
--

See PR 
https://github.com/apache/spark/commit/71a138cd0e0a14e8426f97877e3b52a562bbd02c

> Programming guide should explain NULL in JVM translate to NA in R
> -
>
> Key: SPARK-12071
> URL: https://issues.apache.org/jira/browse/SPARK-12071
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Felix Cheung
>Priority: Minor
>
> This behavior seems to be new for Spark 1.6.0






[jira] [Created] (SPARK-12071) Programming guide should explain NULL in JVM translate to NA in R

2015-11-30 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-12071:


 Summary: Programming guide should explain NULL in JVM translate to 
NA in R
 Key: SPARK-12071
 URL: https://issues.apache.org/jira/browse/SPARK-12071
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.6.0
Reporter: Felix Cheung
Priority: Minor


This behavior seems to be new for Spark 1.6.0






[jira] [Commented] (SPARK-12070) PySpark implementation of Slicing operator incorrect

2015-11-30 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033218#comment-15033218
 ] 

Jeff Zhang commented on SPARK-12070:


The root cause is that when slice syntax such as str[1:] is used, the length 
is set to Python's max int, which is a long for Java, because the range of a 
Python int is larger than that of a Java int.
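A hedged Scala sketch of the mismatch (the substr method below is a local stand-in inferred from the error message, not code copied from Spark): the JVM side expects Int arguments, while an open-ended Python slice sends sys.maxsize, which only fits in a Java long, so py4j finds no substr(Integer, Long) overload. Clamping the length to Int.MaxValue before crossing the py4j boundary is one way out.

{code}
// Illustration only; substr here is a stand-in, not org.apache.spark.sql.Column.
object SliceLengthDemo {
  val pythonMaxSize: Long = 9223372036854775807L   // sys.maxsize on 64-bit CPython

  def substr(startPos: Int, len: Int): String = s"substring($startPos, $len)"

  def main(args: Array[String]): Unit = {
    // substr(2, pythonMaxSize) would not even compile: a Long does not fit in an Int.
    val clamped = math.min(pythonMaxSize, Int.MaxValue.toLong).toInt
    println(substr(2, clamped))
  }
}
{code}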



> PySpark implementation of Slicing operator incorrect
> 
>
> Key: SPARK-12070
> URL: https://issues.apache.org/jira/browse/SPARK-12070
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>Reporter: Jeff Zhang
>
> {code}
> aa=('Ofer', 1), ('Wei', 2)
> a = sqlContext.createDataFrame(aa)
> a.select(a._1[2:]).show()
> {code}
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/jzhang/github/spark/python/pyspark/sql/column.py", line 286, 
> in substr
> jc = self._jc.substr(startPos, length)
>   File 
> "/Users/jzhang/github/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
>  line 813, in __call__
>   File "/Users/jzhang/github/spark/python/pyspark/sql/utils.py", line 45, in 
> deco
> return f(*a, **kw)
>   File 
> "/Users/jzhang/github/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", 
> line 312, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling o37.substr. Trace:
> py4j.Py4JException: Method substr([class java.lang.Integer, class 
> java.lang.Long]) does not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
>   at py4j.Gateway.invoke(Gateway.java:252)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:209)
>   at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Comment Edited] (SPARK-12070) PySpark implementation of Slicing operator incorrect

2015-11-30 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033218#comment-15033218
 ] 

Jeff Zhang edited comment on SPARK-12070 at 12/1/15 6:59 AM:
-

The root cause is that when slice syntax such as str[1:] is used, the length 
is set to Python's max int, which is a long for Java, because the range of a 
Python int is larger than that of a Java int.

Will create a PR.


was (Author: zjffdu):
The root cause is that when slice syntax such as str[1:] is used, the length 
is set to Python's max int, which is a long for Java, because the range of a 
Python int is larger than that of a Java int.



> PySpark implementation of Slicing operator incorrect
> 
>
> Key: SPARK-12070
> URL: https://issues.apache.org/jira/browse/SPARK-12070
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>Reporter: Jeff Zhang
>
> {code}
> aa=('Ofer', 1), ('Wei', 2)
> a = sqlContext.createDataFrame(aa)
> a.select(a._1[2:]).show()
> {code}
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/jzhang/github/spark/python/pyspark/sql/column.py", line 286, 
> in substr
> jc = self._jc.substr(startPos, length)
>   File 
> "/Users/jzhang/github/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
>  line 813, in __call__
>   File "/Users/jzhang/github/spark/python/pyspark/sql/utils.py", line 45, in 
> deco
> return f(*a, **kw)
>   File 
> "/Users/jzhang/github/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", 
> line 312, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling o37.substr. Trace:
> py4j.Py4JException: Method substr([class java.lang.Integer, class 
> java.lang.Long]) does not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
>   at py4j.Gateway.invoke(Gateway.java:252)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:209)
>   at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Created] (SPARK-12070) PySpark implementation of Slicing operator incorrect

2015-11-30 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-12070:
--

 Summary: PySpark implementation of Slicing operator incorrect
 Key: SPARK-12070
 URL: https://issues.apache.org/jira/browse/SPARK-12070
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.5.2
Reporter: Jeff Zhang


{code}
aa=('Ofer', 1), ('Wei', 2)
a = sqlContext.createDataFrame(aa)
a.select(a._1[2:]).show()

{code}
{code}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jzhang/github/spark/python/pyspark/sql/column.py", line 286, in 
substr
jc = self._jc.substr(startPos, length)
  File 
"/Users/jzhang/github/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", 
line 813, in __call__
  File "/Users/jzhang/github/spark/python/pyspark/sql/utils.py", line 45, in 
deco
return f(*a, **kw)
  File 
"/Users/jzhang/github/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 
312, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o37.substr. Trace:
py4j.Py4JException: Method substr([class java.lang.Integer, class 
java.lang.Long]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
{code}






[jira] [Commented] (SPARK-11605) ML 1.6 QA: API: Java compatibility, docs

2015-11-30 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033213#comment-15033213
 ] 

Joseph K. Bradley commented on SPARK-11605:
---

private[ml] functions can be ignored.  It's a shame that they are public in 
Java, but at least they do not show up in the Java doc.
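For context, the hedged sketch below (a made-up class, not an actual ML component) shows why this happens: Scala package-qualified private members compile to public JVM methods, so Java callers and javap still see them even though Scala code outside the package cannot.

{code}
// Hypothetical example; package and class names are illustrative only.
package org.apache.spark.ml.example

class Widget {
  // Visible only inside org.apache.spark.ml from Scala, but public in the
  // generated bytecode, so it is callable from Java.
  private[ml] def internalHelper(): Int = 42

  def publicValue(): Int = internalHelper() + 1
}
{code}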

> ML 1.6 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-11605
> URL: https://issues.apache.org/jira/browse/SPARK-11605
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>
> Check Java compatibility for MLlib for this release.
> Checking compatibility means:
> * comparing with the Scala doc
> * verifying that Java docs are not messed up by Scala type incompatibilities. 
>  Some items to look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (The correctness can be checked in Scala.)
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here.
> Note that we should not break APIs from previous releases.  So if you find a 
> problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).






[jira] [Commented] (SPARK-11206) Support SQL UI on the history server

2015-11-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033209#comment-15033209
 ] 

Apache Spark commented on SPARK-11206:
--

User 'carsonwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10061

> Support SQL UI on the history server
> 
>
> Key: SPARK-11206
> URL: https://issues.apache.org/jira/browse/SPARK-11206
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Web UI
>Reporter: Carson Wang
>Assignee: Carson Wang
>
> On the live web UI, there is a SQL tab which provides valuable information 
> about SQL queries. But once the workload has finished, the SQL tab is not 
> available on the history server. It would be helpful to support the SQL UI on 
> the history server so queries can be analyzed even after execution.






[jira] [Commented] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join

2015-11-30 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033187#comment-15033187
 ] 

Reynold Xin commented on SPARK-12032:
-

[~marmbrus] do you mean the Selinger algorithm?


> Filter can't be pushed down to correct Join because of bad order of Join
> 
>
> Key: SPARK-12032
> URL: https://issues.apache.org/jira/browse/SPARK-12032
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> For this query:
> {code}
>   select d.d_year, count(*) cnt
>FROM store_sales, date_dim d, customer c
>WHERE ss_customer_sk = c.c_customer_sk AND c.c_first_shipto_date_sk = 
> d.d_date_sk
>group by d.d_year
> {code}
> Current optimized plan is
> {code}
> == Optimized Logical Plan ==
> Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) 
> AS cnt#425L]
>  Project [d_year#147]
>   Join Inner, Some(((ss_customer_sk#283 = c_customer_sk#101) && 
> (c_first_shipto_date_sk#106 = d_date_sk#141)))
>Project [d_date_sk#141,d_year#147,ss_customer_sk#283]
> Join Inner, None
>  Project [ss_customer_sk#283]
>   Relation[] ParquetRelation[store_sales]
>  Project [d_date_sk#141,d_year#147]
>   Relation[] ParquetRelation[date_dim]
>Project [c_customer_sk#101,c_first_shipto_date_sk#106]
> Relation[] ParquetRelation[customer]
> {code}
> It joins store_sales and date_dim together without any condition; the 
> condition c.c_first_shipto_date_sk = d.d_date_sk is not pushed into that join 
> because of the bad join order.
> The optimizer should re-order the joins so that date_dim is joined after 
> customer; then the condition can be pushed down correctly.
> The plan should be 
> {code}
> Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) 
> AS cnt#425L]
>  Project [d_year#147]
>   Join Inner, Some((c_first_shipto_date_sk#106 = d_date_sk#141))
>Project [c_first_shipto_date_sk#106]
> Join Inner, Some((ss_customer_sk#283 = c_customer_sk#101))
>  Project [ss_customer_sk#283]
>   Relation[store_sales]
>  Project [c_first_shipto_date_sk#106,c_customer_sk#101]
>   Relation[customer]
>Project [d_year#147,d_date_sk#141]
> Relation[date_dim]
> {code}
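A hedged sketch of the reordering idea (a toy helper over relation names, not a Catalyst rule): greedily pick the next relation to join, preferring one that some predicate connects to the relations already joined, so the predicate can be attached to that join instead of forcing a cross join.

{code}
// Toy model: relations are names; predicates list the relations they reference.
case class Predicate(refs: Set[String])

def orderJoins(relations: Seq[String], predicates: Seq[Predicate]): Seq[String] = {
  val joined = scala.collection.mutable.LinkedHashSet(relations.head)
  val remaining = scala.collection.mutable.Queue(relations.tail: _*)
  while (remaining.nonEmpty) {
    // Prefer a relation that a predicate links to the already-joined set.
    val next = remaining.dequeueFirst { r =>
      predicates.exists(p => p.refs.contains(r) && (p.refs - r).subsetOf(joined.toSet))
    }.getOrElse(remaining.dequeue())   // otherwise accept a cross join
    joined += next
  }
  joined.toSeq
}

// For this query the sketch yields store_sales, customer, date_dim, so
// c_first_shipto_date_sk = d_date_sk can be attached to the second join.
println(orderJoins(
  Seq("store_sales", "date_dim", "customer"),
  Seq(Predicate(Set("store_sales", "customer")), Predicate(Set("customer", "date_dim")))))
{code}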






[jira] [Updated] (SPARK-12031) Integer overflow when do sampling.

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12031:
-
Priority: Critical  (was: Major)

> Integer overflow when do sampling.
> --
>
> Key: SPARK-12031
> URL: https://issues.apache.org/jira/browse/SPARK-12031
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1, 1.5.2
>Reporter: uncleGen
>Priority: Critical
>
> In my case, some partitions contain too many items. When doing range 
> partitioning, an exception is thrown:
> {code}
> java.lang.IllegalArgumentException: n must be positive
> at java.util.Random.nextInt(Random.java:300)
> at 
> org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:58)
> at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:259)
> at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:257)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
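A hedged sketch of the suspected failure mode (simplified; not the actual SamplingUtils code): once the number of items seen exceeds Int.MaxValue, an Int counter wraps to a negative value and Random.nextInt rejects it with "n must be positive"; a Long counter avoids the wrap.

{code}
// Simplified illustration of the overflow; not the real reservoir-sampling code.
import java.util.Random

def pickIndexWithIntCounter(itemsSeen: Int, rand: Random): Int =
  rand.nextInt(itemsSeen)                 // throws once itemsSeen wraps negative

def pickIndexWithLongCounter(itemsSeen: Long, rand: Random): Long =
  (rand.nextDouble() * itemsSeen).toLong  // a Long counter cannot wrap here

val rand = new Random(42L)
val wrapped = Int.MaxValue + 1            // Int arithmetic wraps to -2147483648
// pickIndexWithIntCounter(wrapped, rand) // would throw IllegalArgumentException
println(pickIndexWithLongCounter(Int.MaxValue.toLong + 1, rand))
{code}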






[jira] [Updated] (SPARK-6280) Remove Akka systemName from Spark

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6280:

Target Version/s:   (was: 1.6.0)

> Remove Akka systemName from Spark
> -
>
> Key: SPARK-6280
> URL: https://issues.apache.org/jira/browse/SPARK-6280
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Shixiong Zhu
>
> `systemName` is an Akka concept. An RPC implementation does not need to 
> support it.
> We can hard-code the system name in Spark and hide it in the internal Akka 
> RPC implementation.






[jira] [Created] (SPARK-12069) Documentation update for Datasets

2015-11-30 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-12069:


 Summary: Documentation update for Datasets
 Key: SPARK-12069
 URL: https://issues.apache.org/jira/browse/SPARK-12069
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust









[jira] [Assigned] (SPARK-12069) Documentation update for Datasets

2015-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12069:


Assignee: Apache Spark  (was: Michael Armbrust)

> Documentation update for Datasets
> -
>
> Key: SPARK-12069
> URL: https://issues.apache.org/jira/browse/SPARK-12069
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>







[jira] [Assigned] (SPARK-12069) Documentation update for Datasets

2015-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12069:


Assignee: Michael Armbrust  (was: Apache Spark)

> Documentation update for Datasets
> -
>
> Key: SPARK-12069
> URL: https://issues.apache.org/jira/browse/SPARK-12069
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>







[jira] [Commented] (SPARK-12069) Documentation update for Datasets

2015-11-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033183#comment-15033183
 ] 

Apache Spark commented on SPARK-12069:
--

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/10060

> Documentation update for Datasets
> -
>
> Key: SPARK-12069
> URL: https://issues.apache.org/jira/browse/SPARK-12069
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>







[jira] [Updated] (SPARK-11954) Encoder for JavaBeans / POJOs

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11954:
-
Assignee: Wenchen Fan

> Encoder for JavaBeans / POJOs
> -
>
> Key: SPARK-11954
> URL: https://issues.apache.org/jira/browse/SPARK-11954
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Assigned] (SPARK-12068) use a single column in Dataset.groupBy and count will fail

2015-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12068:


Assignee: (was: Apache Spark)

> use a single column in Dataset.groupBy and count will fail
> --
>
> Key: SPARK-12068
> URL: https://issues.apache.org/jira/browse/SPARK-12068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>
> {code}
> val ds = Seq("a" -> 1, "b" -> 1, "a" -> 2).toDS()
> val count = ds.groupBy($"_1").count()
> count.collect() // will fail
> {code}






[jira] [Commented] (SPARK-12068) use a single column in Dataset.groupBy and count will fail

2015-11-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033175#comment-15033175
 ] 

Apache Spark commented on SPARK-12068:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/10059

> use a single column in Dataset.groupBy and count will fail
> --
>
> Key: SPARK-12068
> URL: https://issues.apache.org/jira/browse/SPARK-12068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>
> {code}
> val ds = Seq("a" -> 1, "b" -> 1, "a" -> 2).toDS()
> val count = ds.groupBy($"_1").count()
> count.collect() // will fail
> {code}






[jira] [Assigned] (SPARK-12068) use a single column in Dataset.groupBy and count will fail

2015-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12068:


Assignee: Apache Spark

> use a single column in Dataset.groupBy and count will fail
> --
>
> Key: SPARK-12068
> URL: https://issues.apache.org/jira/browse/SPARK-12068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>
> {code}
> val ds = Seq("a" -> 1, "b" -> 1, "a" -> 2).toDS()
> val count = ds.groupBy($"_1").count()
> count.collect() // will fail
> {code}






[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-11-30 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033172#comment-15033172
 ] 

Jean-Baptiste Onofré commented on SPARK-11193:
--

Hi Phil, it's on my plate. I should submit the PR today.

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
> Attachments: screen.png
>
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com
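A hedged sketch of the failure mode and one possible direction (simplified; not the actual KinesisReceiver code): casting a plain mutable.HashMap to the deprecated SynchronizedMap mixin fails at runtime, whereas a java.util.concurrent map gives thread safety without the cast.

{code}
// Illustration only; the shard ID and map contents are made up.
import scala.collection.mutable
import java.util.concurrent.ConcurrentHashMap

val plain = new mutable.HashMap[String, Long]()
// plain.asInstanceOf[mutable.SynchronizedMap[String, Long]]  // ClassCastException at runtime

// Thread-safe alternative that avoids the deprecated SynchronizedMap mixin:
val checkpoints = new ConcurrentHashMap[String, Long]()
checkpoints.put("shardId-000000000000", 12345L)
println(checkpoints.get("shardId-000000000000"))
{code}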






[jira] [Created] (SPARK-12068) use a single column in Dataset.groupBy and count will fail

2015-11-30 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-12068:
---

 Summary: use a single column in Dataset.groupBy and count will fail
 Key: SPARK-12068
 URL: https://issues.apache.org/jira/browse/SPARK-12068
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan


{code}
val ds = Seq("a" -> 1, "b" -> 1, "a" -> 2).toDS()
val count = ds.groupBy($"_1").count()
count.collect() // will fail
{code}






[jira] [Commented] (SPARK-12010) Spark JDBC requires support for column-name-free INSERT syntax

2015-11-30 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033124#comment-15033124
 ] 

Michael Armbrust commented on SPARK-12010:
--

Thanks for working on this, but we've already hit code freeze for 1.6.0, so I'm 
going to retarget. Typically, please [let project 
committers|https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-JIRA]
 set the "target version".

> Spark JDBC requires support for column-name-free INSERT syntax
> --
>
> Key: SPARK-12010
> URL: https://issues.apache.org/jira/browse/SPARK-12010
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Christian Kurz
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Spark JDBC write only works with technologies which support the following 
> INSERT statement syntax (JdbcUtils.scala: insertStatement()):
> INSERT INTO $table VALUES ( ?, ?, ..., ? )
> Some technologies require a list of column names:
> INSERT INTO $table ( $colNameList ) VALUES ( ?, ?, ..., ? )
> Therefore technologies like Progress JDBC Driver for Cassandra do not work 
> with Spark JDBC write.
> Idea for fix:
> Move JdbcUtils.scala:insertStatement() into SqlDialect and add a SqlDialect 
> for Progress JDBC Driver for Cassandra
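A hedged sketch of the proposed direction (names are illustrative; this is not the actual JdbcUtils or JdbcDialect code): build the INSERT statement with an explicit column list derived from the DataFrame schema, so drivers that require column names keep working.

{code}
// Illustration only; a dialect hook could override this per driver.
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

def insertStatement(table: String, schema: StructType): String = {
  val columns = schema.fields.map(_.name).mkString(", ")
  val placeholders = schema.fields.map(_ => "?").mkString(", ")
  s"INSERT INTO $table ($columns) VALUES ($placeholders)"
}

val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType)))
println(insertStatement("people", schema))
// INSERT INTO people (id, name) VALUES (?, ?)
{code}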






[jira] [Updated] (SPARK-12010) Spark JDBC requires support for column-name-free INSERT syntax

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12010:
-
Target Version/s:   (was: 1.6.0)

> Spark JDBC requires support for column-name-free INSERT syntax
> --
>
> Key: SPARK-12010
> URL: https://issues.apache.org/jira/browse/SPARK-12010
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Christian Kurz
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Spark JDBC write only works with technologies which support the following 
> INSERT statement syntax (JdbcUtils.scala: insertStatement()):
> INSERT INTO $table VALUES ( ?, ?, ..., ? )
> Some technologies require a list of column names:
> INSERT INTO $table ( $colNameList ) VALUES ( ?, ?, ..., ? )
> Therefore technologies like Progress JDBC Driver for Cassandra do not work 
> with Spark JDBC write.
> Idea for fix:
> Move JdbcUtils.scala:insertStatement() into SqlDialect and add a SqlDialect 
> for Progress JDBC Driver for Cassandra






[jira] [Resolved] (SPARK-12017) Java Doc Publishing Broken

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12017.
--
   Resolution: Fixed
 Assignee: Josh Rosen
Fix Version/s: 1.6.0

> Java Doc Publishing Broken
> --
>
> Key: SPARK-12017
> URL: https://issues.apache.org/jira/browse/SPARK-12017
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Michael Armbrust
>Assignee: Josh Rosen
>Priority: Blocker
> Fix For: 1.6.0
>
>
> The java docs are missing from the 1.6 preview.  I think that 
> [this|https://github.com/apache/spark/commit/529a1d3380c4c23fed068ad05a6376162c4b76d6#commitcomment-14392230]
>  is the problem.






[jira] [Commented] (SPARK-12017) Java Doc Publishing Broken

2015-11-30 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033119#comment-15033119
 ] 

Michael Armbrust commented on SPARK-12017:
--

Fixed in https://github.com/apache/spark/pull/10049

> Java Doc Publishing Broken
> --
>
> Key: SPARK-12017
> URL: https://issues.apache.org/jira/browse/SPARK-12017
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Michael Armbrust
>Priority: Blocker
>
> The java docs are missing from the 1.6 preview.  I think that 
> [this|https://github.com/apache/spark/commit/529a1d3380c4c23fed068ad05a6376162c4b76d6#commitcomment-14392230]
>  is the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11796) Docker JDBC integration tests fail in Maven build due to dependency issue

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11796:
-
Component/s: Tests

> Docker JDBC integration tests fail in Maven build due to dependency issue
> -
>
> Key: SPARK-11796
> URL: https://issues.apache.org/jira/browse/SPARK-11796
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>
> Our new Docker integration tests for JDBC dialects are failing in the Maven 
> builds. For now, I've disabled this for Maven by adding the 
> {{-Dtest.exclude.tags=org.apache.spark.tags.DockerTest}} flag to our Jenkins 
> builds, but we should fix this soon. The test failures seem to be related to 
> dependency or classpath issues:
> {code}
> *** RUN ABORTED ***
>   java.lang.NoSuchMethodError: 
> org.apache.http.impl.client.HttpClientBuilder.setConnectionManagerShared(Z)Lorg/apache/http/impl/client/HttpClientBuilder;
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnector.(ApacheConnector.java:240)
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnectorProvider.getConnector(ApacheConnectorProvider.java:115)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:418)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88)
>   at 
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120)
>   at 
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117)
>   at 
> org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340)
>   at 
> org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726)
>   at 
> org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285)
>   at 
> org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126)
>   ...
> {code}
> To reproduce locally: {{build/mvn -pl docker-integration-tests package}}. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11601) ML 1.6 QA: API: Binary incompatible changes

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11601:
-
Component/s: Documentation

> ML 1.6 QA: API: Binary incompatible changes
> ---
>
> Key: SPARK-11601
> URL: https://issues.apache.org/jira/browse/SPARK-11601
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Timothy Hunter
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, ping [~mengxr] for advice since he did it for 
> 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11607) Update MLlib website for 1.6

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11607:
-
Component/s: Documentation

> Update MLlib website for 1.6
> 
>
> Key: SPARK-11607
> URL: https://issues.apache.org/jira/browse/SPARK-11607
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
>
> Update MLlib's website to include features in 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11603) ML 1.6 QA: API: Experimental, DeveloperApi, final, sealed audit

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11603:
-
Component/s: Documentation

> ML 1.6 QA: API: Experimental, DeveloperApi, final, sealed audit
> ---
>
> Key: SPARK-11603
> URL: https://issues.apache.org/jira/browse/SPARK-11603
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: DB Tsai
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.  This will 
> probably not include the Pipeline APIs yet since some parts (e.g., feature 
> attributes) are still under flux.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11315) Add YARN extension service to publish Spark events to YARN timeline service

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11315:
-
Target Version/s:   (was: 1.6.0)

> Add YARN extension service to publish Spark events to YARN timeline service
> ---
>
> Key: SPARK-11315
> URL: https://issues.apache.org/jira/browse/SPARK-11315
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Hadoop 2.6+
>Reporter: Steve Loughran
>
> Add an extension service (using SPARK-11314) to subscribe to Spark lifecycle 
> events, batch them and forward them to the YARN Application Timeline Service. 
> This data can then be retrieved by a new back end for the Spark History 
> Service, and by other analytics tools.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11600) Spark MLlib 1.6 QA umbrella

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11600:
-
Component/s: Documentation

> Spark MLlib 1.6 QA umbrella
> ---
>
> Key: SPARK-11600
> URL: https://issues.apache.org/jira/browse/SPARK-11600
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This JIRA lists tasks for the next MLlib release's QA period.
> h2. API
> * Check binary API compatibility (SPARK-11601)
> * Audit new public APIs (from the generated html doc)
> ** Scala (SPARK-11602)
> ** Java compatibility (SPARK-11605)
> ** Python coverage (SPARK-11604)
> * Check Experimental, DeveloperApi tags (SPARK-11603)
> h2. Algorithms and performance
> *Performance*
> * _List any other missing performance tests from spark-perf here_
> * ALS.recommendAll (SPARK-7457)
> * perf-tests in Python (SPARK-7539)
> * perf-tests for transformers (SPARK-2838)
> * MultilayerPerceptron (SPARK-11911)
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide (SPARK-11606)
> * For major components, create JIRAs for example code (SPARK-9670)
> * Update Programming Guide for 1.6 (towards end of QA) (SPARK-11608)
> * Update website (SPARK-11607)
> * Merge duplicate content under examples/ (SPARK-11685)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8414) Ensure ContextCleaner actually triggers clean ups

2015-11-30 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033103#comment-15033103
 ] 

Michael Armbrust commented on SPARK-8414:
-

Still planning to do this for 1.6?

> Ensure ContextCleaner actually triggers clean ups
> -
>
> Key: SPARK-8414
> URL: https://issues.apache.org/jira/browse/SPARK-8414
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>
> Right now it cleans up old references only through natural GCs, which may not 
> occur if the driver has infinite RAM. We should do a periodic GC to make sure 
> that we actually do clean things up. Something like once per 30 minutes seems 
> relatively inexpensive.
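
A minimal sketch of the periodic-GC idea, assuming a simple scheduled executor on the 
driver (illustrative only, not the actual ContextCleaner change):

{code}
import java.util.concurrent.{Executors, TimeUnit}

object PeriodicGcSketch {
  // Illustrative sketch (assumed): trigger a JVM GC every 30 minutes so that
  // weak-reference-based cleanup fires even when the driver never feels memory pressure.
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  def start(): Unit = {
    scheduler.scheduleWithFixedDelay(new Runnable {
      override def run(): Unit = System.gc()
    }, 30, 30, TimeUnit.MINUTES)
  }
}
{code}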



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7348) DAG visualization: add links to RDD page

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-7348:

Target Version/s:   (was: 1.6.0)

> DAG visualization: add links to RDD page
> 
>
> Key: SPARK-7348
> URL: https://issues.apache.org/jira/browse/SPARK-7348
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Minor
>
> It currently has links from the job page to the stage page. It would be nice 
> if it had links to the corresponding RDD page as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11985) Update Spark Streaming - Kinesis Library Documentation regarding data de-aggregation and message handler

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11985:
-
Component/s: Documentation

> Update Spark Streaming - Kinesis Library Documentation regarding data 
> de-aggregation and message handler
> 
>
> Key: SPARK-11985
> URL: https://issues.apache.org/jira/browse/SPARK-11985
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Streaming
>Reporter: Burak Yavuz
>
> Update documentation and provide how-to example in guide.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6518) Add example code and user guide for bisecting k-means

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6518:

Component/s: Documentation

> Add example code and user guide for bisecting k-means
> -
>
> Key: SPARK-6518
> URL: https://issues.apache.org/jira/browse/SPARK-6518
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Yu Ishikawa
>Assignee: Yu Ishikawa
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12060) Avoid memory copy in JavaSerializerInstance.serialize

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12060:
-
Priority: Critical  (was: Major)

> Avoid memory copy in JavaSerializerInstance.serialize
> -
>
> Key: SPARK-12060
> URL: https://issues.apache.org/jira/browse/SPARK-12060
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Critical
>
> JavaSerializerInstance.serialize uses ByteArrayOutputStream.toByteArray to 
> get the serialized data. ByteArrayOutputStream.toByteArray needs to copy the 
> content in the internal array to a new array. However, since the array will 
> be converted to ByteBuffer at once, we can avoid the memory copy.
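
One way the copy could be avoided, as a hedged sketch (the class name is an assumption): 
subclass ByteArrayOutputStream and wrap its internal buffer directly.

{code}
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer

// Sketch only: `buf` and `count` are protected fields of ByteArrayOutputStream,
// so a subclass can hand the internal array to ByteBuffer.wrap without copying it.
// Callers must not write to the stream after taking the ByteBuffer, since it shares the array.
class ByteBufferOutputStream extends ByteArrayOutputStream {
  def toByteBuffer: ByteBuffer = ByteBuffer.wrap(buf, 0, count)
}
{code}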



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8966) Design a mechanism to ensure that temporary files created in tasks are cleaned up after failures

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-8966:

Target Version/s:   (was: 1.6.0)

> Design a mechanism to ensure that temporary files created in tasks are 
> cleaned up after failures
> 
>
> Key: SPARK-8966
> URL: https://issues.apache.org/jira/browse/SPARK-8966
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>
> It's important to avoid leaking temporary files, such as spill files created 
> by the external sorter.  Individual operators should still make an effort to 
> clean up their own files / perform their own error handling, but I think that 
> we should add a safety-net mechanism to track file creation on a per-task 
> basis and automatically clean up leaked files.
> During tests, this mechanism should throw an exception when a leak is 
> detected. In production deployments, it should log a warning and clean up the 
> leak itself.  This is similar to the TaskMemoryManager's leak detection and 
> cleanup code.
> We may be able to implement this via a convenience method that registers task 
> completion handlers with TaskContext.
> We might also explore techniques that will cause files to be cleaned up 
> automatically when their file descriptors are closed (e.g. by calling unlink 
> on an open file). These techniques should not be our last line of defense 
> against file resource leaks, though, since they might be platform-specific 
> and may clean up resources later than we'd like.
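
A rough sketch of such a convenience method, assuming the existing 
TaskContext.addTaskCompletionListener hook; the helper name is hypothetical:

{code}
import java.io.File
import org.apache.spark.TaskContext

object TempFileTrackingSketch {
  // Hypothetical helper: create a temp file and make sure it is deleted when the
  // task completes, whether the task succeeded or failed.
  def createTrackedTempFile(dir: File): File = {
    val file = File.createTempFile("spill", ".tmp", dir)
    Option(TaskContext.get()).foreach { ctx =>
      ctx.addTaskCompletionListener { _ =>
        if (file.exists()) file.delete()
      }
    }
    file
  }
}
{code}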



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12031) Integer overflow when do sampling.

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12031:
-
Description: 
In my case, some partitions contain too many items. When doing range partitioning, 
an exception is thrown:

{code}
java.lang.IllegalArgumentException: n must be positive
at java.util.Random.nextInt(Random.java:300)
at 
org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:58)
at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:259)
at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:257)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

  was:
In my case, some partitions contain too much items. When do range partition, 
exception thrown as:


java.lang.IllegalArgumentException: n must be positive
at java.util.Random.nextInt(Random.java:300)
at 
org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:58)
at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:259)
at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:257)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


> Integer overflow when do sampling.
> --
>
> Key: SPARK-12031
> URL: https://issues.apache.org/jira/browse/SPARK-12031
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1, 1.5.2
>Reporter: uncleGen
>
> In my case, some partitions contain too many items. When doing range partitioning, 
> an exception is thrown:
> {code}
> java.lang.IllegalArgumentException: n must be positive
> at java.util.Random.nextInt(Random.java:300)
> at 
> org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:58)
> at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:259)
> at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:257)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
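
For illustration, a small sketch of the suspected failure mode; the exact overflow site 
inside SamplingUtils is assumed here:

{code}
// Illustrative only: truncating a count larger than Int.MaxValue to Int
// produces a negative number, and java.util.Random.nextInt(n) requires n > 0.
val itemCount: Long = Int.MaxValue.toLong + 1L
val truncated: Int = itemCount.toInt  // -2147483648 due to overflow
// new java.util.Random().nextInt(truncated)  // would throw "n must be positive"
{code}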



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7729) Executor which has been killed should also be displayed on Executors Tab.

2015-11-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033095#comment-15033095
 ] 

Apache Spark commented on SPARK-7729:
-

User 'lianhuiwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10058

> Executor which has been killed should also be displayed on Executors Tab.
> -
>
> Key: SPARK-7729
> URL: https://issues.apache.org/jira/browse/SPARK-7729
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.3.1
>Reporter: Archit Thakur
>Priority: Minor
> Attachments: WebUI.png
>
>
> On the ExecutorsTab there is no information about the executors which have 
> been killed. It only shows the running executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11966) Spark API for UDTFs

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11966:
-
Target Version/s: 1.7.0

> Spark API for UDTFs
> ---
>
> Key: SPARK-11966
> URL: https://issues.apache.org/jira/browse/SPARK-11966
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Jaka Jancar
>Priority: Minor
>
> Defining UDFs is easy using sqlContext.udf.register, but not table-generating 
> functions. For those you still have to use these horrendous Hive interfaces:
> https://github.com/prongs/apache-hive/blob/master/contrib/src/java/org/apache/hadoop/hive/contrib/udtf/example/GenericUDTFCount2.java



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12018) Refactor common subexpression elimination code

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12018.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 10009
[https://github.com/apache/spark/pull/10009]

> Refactor common subexpression elimination code
> --
>
> Key: SPARK-12018
> URL: https://issues.apache.org/jira/browse/SPARK-12018
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
> Fix For: 1.6.0
>
>
> The code of common subexpression elimination can be factored and simplified. 
> Some unnecessary variables can be removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join

2015-11-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-12032:
--

Assignee: Davies Liu

> Filter can't be pushed down to correct Join because of bad order of Join
> 
>
> Key: SPARK-12032
> URL: https://issues.apache.org/jira/browse/SPARK-12032
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> For this query:
> {code}
>   select d.d_year, count(*) cnt
>FROM store_sales, date_dim d, customer c
>WHERE ss_customer_sk = c.c_customer_sk AND c.c_first_shipto_date_sk = 
> d.d_date_sk
>group by d.d_year
> {code}
> Current optimized plan is
> {code}
> == Optimized Logical Plan ==
> Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) 
> AS cnt#425L]
>  Project [d_year#147]
>   Join Inner, Some(((ss_customer_sk#283 = c_customer_sk#101) && 
> (c_first_shipto_date_sk#106 = d_date_sk#141)))
>Project [d_date_sk#141,d_year#147,ss_customer_sk#283]
> Join Inner, None
>  Project [ss_customer_sk#283]
>   Relation[] ParquetRelation[store_sales]
>  Project [d_date_sk#141,d_year#147]
>   Relation[] ParquetRelation[date_dim]
>Project [c_customer_sk#101,c_first_shipto_date_sk#106]
> Relation[] ParquetRelation[customer]
> {code}
> It will join store_sales and date_dim together without any condition; the 
> condition c.c_first_shipto_date_sk = d.d_date_sk is not pushed down to it because 
> of the bad order of the joins.
> The optimizer should re-order the joins, joining date_dim after customer; then 
> it can push down the condition correctly.
> The plan should be 
> {code}
> Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) 
> AS cnt#425L]
>  Project [d_year#147]
>   Join Inner, Some((c_first_shipto_date_sk#106 = d_date_sk#141))
>Project [c_first_shipto_date_sk#106]
> Join Inner, Some((ss_customer_sk#283 = c_customer_sk#101))
>  Project [ss_customer_sk#283]
>   Relation[store_sales]
>  Project [c_first_shipto_date_sk#106,c_customer_sk#101]
>   Relation[customer]
>Project [d_year#147,d_date_sk#141]
> Relation[date_dim]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10647) Mesos HA mode misuses spark.deploy.zookeeper.dir property; configs should be documented

2015-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10647:


Assignee: Apache Spark  (was: Timothy Chen)

> Mesos HA mode misuses spark.deploy.zookeeper.dir property; configs should be 
> documented
> ---
>
> Key: SPARK-10647
> URL: https://issues.apache.org/jira/browse/SPARK-10647
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Alan Braithwaite
>Assignee: Apache Spark
>Priority: Minor
>
> The property `spark.deploy.zookeeper.dir` doesn't match up with the other 
> properties surrounding it, namely:
> spark.mesos.deploy.zookeeper.url
> and
> spark.mesos.deploy.recoveryMode
> Since it's also a property specific to Mesos, it makes sense for it to be under that 
> hierarchy as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10647) Mesos HA mode misuses spark.deploy.zookeeper.dir property; configs should be documented

2015-11-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033076#comment-15033076
 ] 

Apache Spark commented on SPARK-10647:
--

User 'tnachen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10057

> Mesos HA mode misuses spark.deploy.zookeeper.dir property; configs should be 
> documented
> ---
>
> Key: SPARK-10647
> URL: https://issues.apache.org/jira/browse/SPARK-10647
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Alan Braithwaite
>Assignee: Timothy Chen
>Priority: Minor
>
> The property `spark.deploy.zookeeper.dir` doesn't match up with the other 
> properties surrounding it, namely:
> spark.mesos.deploy.zookeeper.url
> and
> spark.mesos.deploy.recoveryMode
> Since it's also a property specific to Mesos, it makes sense for it to be under that 
> hierarchy as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10647) Mesos HA mode misuses spark.deploy.zookeeper.dir property; configs should be documented

2015-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10647:


Assignee: Timothy Chen  (was: Apache Spark)

> Mesos HA mode misuses spark.deploy.zookeeper.dir property; configs should be 
> documented
> ---
>
> Key: SPARK-10647
> URL: https://issues.apache.org/jira/browse/SPARK-10647
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Alan Braithwaite
>Assignee: Timothy Chen
>Priority: Minor
>
> The property `spark.deploy.zookeeper.dir` doesn't match up with the other 
> properties surrounding it, namely:
> spark.mesos.deploy.zookeeper.url
> and
> spark.mesos.deploy.recoveryMode
> Since it's also a property specific to Mesos, it makes sense for it to be under that 
> hierarchy as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12064) Make the SqlParser as trait for better integrated with extensions

2015-11-30 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao resolved SPARK-12064.
---
Resolution: Won't Fix

DBX plans to remove the SqlParser in 2.0.

> Make the SqlParser as trait for better integrated with extensions
> -
>
> Key: SPARK-12064
> URL: https://issues.apache.org/jira/browse/SPARK-12064
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>
> `SqlParser` is now an object, which is hard to reuse in extensions. A better 
> implementation would make `SqlParser` a trait, keep all of its 
> implementation unchanged, and then add another object called `SqlParser` 
> that inherits from the trait.
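
A minimal sketch of the proposed trait/object split (names and placement are assumptions; 
the issue was closed as Won't Fix):

{code}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Sketch: move the existing implementation into a reusable trait...
trait SqlParserBase {
  // the current SqlParser implementation would live here, unchanged
  def parse(sqlText: String): LogicalPlan = ???
}

// ...and keep a default object with the original name so existing callers still work,
// while extensions can mix the trait into their own parsers.
object SqlParser extends SqlParserBase
{code}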



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6521) Bypass network shuffle read if both endpoints are local

2015-11-30 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033016#comment-15033016
 ] 

Takeshi Yamamuro commented on SPARK-6521:
-

Performance of the current Spark heavily depends on CPU, so this shuffle 
optimization has little effect on that (benchmark results can be found in 
pull request #9478). For now, this ticket does not need to be considered.

> Bypass network shuffle read if both endpoints are local
> ---
>
> Key: SPARK-6521
> URL: https://issues.apache.org/jira/browse/SPARK-6521
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0
>Reporter: xukun
>
> In the past, an executor read another executor's shuffle file on the same node over 
> the network. This PR makes executors on the same node read shuffle files locally in 
> sort-based shuffle. It will reduce network transport.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12067) Fix usage of isnan, isnull, isnotnull of Column and DataFrame

2015-11-30 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-12067:

Description: 
* SPARK-11947 deprecated DataFrame.isNaN and DataFrame.isNull, replacing them with 
DataFrame.isnan and DataFrame.isnull; this PR changes Column.isNaN to 
Column.isnan, Column.isNull to Column.isnull, and Column.isNotNull to 
Column.isnotnull.
* Add Column.notnull as an alias of Column.isnotnull, following the pandas naming 
convention.
* Add DataFrame.isnotnull and DataFrame.notnull.

  was:Fix usage of isnan, isnull, isnotnull of Column and DataFrame.


> Fix usage of isnan, isnull, isnotnull of Column and DataFrame
> -
>
> Key: SPARK-12067
> URL: https://issues.apache.org/jira/browse/SPARK-12067
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yanbo Liang
>
> * SPARK-11947 deprecated DataFrame.isNaN and DataFrame.isNull, replacing them with 
> DataFrame.isnan and DataFrame.isnull; this PR changes Column.isNaN to 
> Column.isnan, Column.isNull to Column.isnull, and Column.isNotNull to 
> Column.isnotnull.
> * Add Column.notnull as an alias of Column.isnotnull, following the pandas naming 
> convention.
> * Add DataFrame.isnotnull and DataFrame.notnull.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12067) Fix usage of isnan, isnull, isnotnull of Column and DataFrame

2015-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12067:


Assignee: (was: Apache Spark)

> Fix usage of isnan, isnull, isnotnull of Column and DataFrame
> -
>
> Key: SPARK-12067
> URL: https://issues.apache.org/jira/browse/SPARK-12067
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yanbo Liang
>
> Fix usage of isnan, isnull, isnotnull of Column and DataFrame.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12067) Fix usage of isnan, isnull, isnotnull of Column and DataFrame

2015-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12067:


Assignee: Apache Spark

> Fix usage of isnan, isnull, isnotnull of Column and DataFrame
> -
>
> Key: SPARK-12067
> URL: https://issues.apache.org/jira/browse/SPARK-12067
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>
> Fix usage of isnan, isnull, isnotnull of Column and DataFrame.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12067) Fix usage of isnan, isnull, isnotnull of Column and DataFrame

2015-11-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033013#comment-15033013
 ] 

Apache Spark commented on SPARK-12067:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10056

> Fix usage of isnan, isnull, isnotnull of Column and DataFrame
> -
>
> Key: SPARK-12067
> URL: https://issues.apache.org/jira/browse/SPARK-12067
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yanbo Liang
>
> Fix usage of isnan, isnull, isnotnull of Column and DataFrame.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12067) Fix usage of isnan, isnull, isnotnull of Column and DataFrame

2015-11-30 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-12067:
---

 Summary: Fix usage of isnan, isnull, isnotnull of Column and 
DataFrame
 Key: SPARK-12067
 URL: https://issues.apache.org/jira/browse/SPARK-12067
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yanbo Liang


Fix usage of isnan, isnull, isnotnull of Column and DataFrame.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12066) spark sql throw java.lang.ArrayIndexOutOfBoundsException when use table.* with join

2015-11-30 Thread Ricky Yang (JIRA)
Ricky Yang created SPARK-12066:
--

 Summary: spark sql  throw java.lang.ArrayIndexOutOfBoundsException 
when use table.* with join 
 Key: SPARK-12066
 URL: https://issues.apache.org/jira/browse/SPARK-12066
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2, 1.4.0
 Environment: linux 
Reporter: Ricky Yang
Priority: Blocker


A java.lang.ArrayIndexOutOfBoundsException is thrown when I use the following Spark SQL 
on Spark standalone or YARN.
   The SQL:
select ta.* 
from bi_td.dm_price_seg_td tb 
join bi_sor.sor_ord_detail_tf ta 
on 1 = 1 
where ta.sale_dt = '20140514' 
and ta.sale_price >= tb.pri_from 
and ta.sale_price < tb.pri_to limit 10 ; 

But the result is correct when not using *, as follows:
select ta.sale_dt 
from bi_td.dm_price_seg_td tb 
join bi_sor.sor_ord_detail_tf ta 
on 1 = 1 
where ta.sale_dt = '20140514' 
and ta.sale_price >= tb.pri_from 
and ta.sale_price < tb.pri_to limit 10 ; 

The standalone version is 1.4.0 and the Spark-on-YARN version is 1.5.2.
Error log:
  
15/11/30 14:19:59 ERROR SparkSQLDriver: Failed in [select ta.* 
from bi_td.dm_price_seg_td tb 
join bi_sor.sor_ord_detail_tf ta 
on 1 = 1 
where ta.sale_dt = '20140514' 
and ta.sale_price >= tb.pri_from 
and ta.sale_price < tb.pri_to limit 10 ] 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 
3, namenode2-sit.cnsuning.com): java.lang.ArrayIndexOutOfBoundsException 

Driver stacktrace: 
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
 
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
 
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
 
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270) 
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
 
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
 
at scala.Option.foreach(Option.scala:236) 
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
 
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
 
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
 
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
 
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567) 
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824) 
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837) 
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850) 
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:215) 
at 
org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:207) 
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:587)
 
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
 
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:308)
 
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) 
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:311) 
at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:409) 
at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:425) 
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:166)
 
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 
at java.lang.reflect.Method.invoke(Method.java:606) 
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
 
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) 
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) 
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) 
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 
Caused by: java.lang.ArrayIndexOutOfBoundsException 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 4 times, most recent failure:

[jira] [Assigned] (SPARK-12065) Upgrade Tachyon dependency to 0.8.2

2015-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12065:


Assignee: Josh Rosen  (was: Apache Spark)

> Upgrade Tachyon dependency to 0.8.2
> ---
>
> Key: SPARK-12065
> URL: https://issues.apache.org/jira/browse/SPARK-12065
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> I think that we should upgrade from Tachyon 0.8.1 to 0.8.2 in order to get 
> the fix for https://tachyon.atlassian.net/browse/TACHYON-1254.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12065) Upgrade Tachyon dependency to 0.8.2

2015-11-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032993#comment-15032993
 ] 

Apache Spark commented on SPARK-12065:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10054

> Upgrade Tachyon dependency to 0.8.2
> ---
>
> Key: SPARK-12065
> URL: https://issues.apache.org/jira/browse/SPARK-12065
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> I think that we should upgrade from Tachyon 0.8.1 to 0.8.2 in order to get 
> the fix for https://tachyon.atlassian.net/browse/TACHYON-1254.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12065) Upgrade Tachyon dependency to 0.8.2

2015-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12065:


Assignee: Apache Spark  (was: Josh Rosen)

> Upgrade Tachyon dependency to 0.8.2
> ---
>
> Key: SPARK-12065
> URL: https://issues.apache.org/jira/browse/SPARK-12065
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> I think that we should upgrade from Tachyon 0.8.1 to 0.8.2 in order to get 
> the fix for https://tachyon.atlassian.net/browse/TACHYON-1254.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12065) Upgrade Tachyon dependency to 0.8.2

2015-11-30 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-12065:
---
Issue Type: Improvement  (was: Bug)

> Upgrade Tachyon dependency to 0.8.2
> ---
>
> Key: SPARK-12065
> URL: https://issues.apache.org/jira/browse/SPARK-12065
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> I think that we should upgrade from Tachyon 0.8.1 to 0.8.2 in order to get 
> the fix for https://tachyon.atlassian.net/browse/TACHYON-1254.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12065) Upgrade Tachyon dependency to 0.8.2

2015-11-30 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-12065:
--

 Summary: Upgrade Tachyon dependency to 0.8.2
 Key: SPARK-12065
 URL: https://issues.apache.org/jira/browse/SPARK-12065
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Josh Rosen
Assignee: Josh Rosen


I think that we should upgrade from Tachyon 0.8.1 to 0.8.2 in order to get the 
fix for https://tachyon.atlassian.net/browse/TACHYON-1254.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11941) JSON representation of nested StructTypes could be more uniform

2015-11-30 Thread Henri DF (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032960#comment-15032960
 ] 

Henri DF commented on SPARK-11941:
--

I wasn't trying to serialize to/from using the Spark APIs - I was just getting 
the json representation out in order to build a programmatic representation of 
the structtype in another (non-Spark) environment. Recursing down the tree 
would be trivial if it was regular, but is painful with its current layout. 

Anyway, with your question I think I better understand the intended use for 
this, and it does indeed appear to work fine for ser/deser within Spark. So I 
get the rationale for making it an "Improvement". Thanks!

> JSON representation of nested StructTypes could be more uniform
> ---
>
> Key: SPARK-11941
> URL: https://issues.apache.org/jira/browse/SPARK-11941
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Henri DF
>
> I have a json file with a single row {code}{"a":1, "b": 1.0, "c": "asdfasd", 
> "d":[1, 2, 4]}{code} After reading that file in, the schema is correctly 
> inferred:
> {code}
> scala> df.printSchema
> root
>  |-- a: long (nullable = true)
>  |-- b: double (nullable = true)
>  |-- c: string (nullable = true)
>  |-- d: array (nullable = true)
>  ||-- element: long (containsNull = true)
> {code}
> However, the json representation has a strange nesting under "type" for 
> column "d":
> {code}
> scala> df.collect()(0).schema.prettyJson
> res60: String = 
> {
>   "type" : "struct",
>   "fields" : [ {
> "name" : "a",
> "type" : "long",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "b",
> "type" : "double",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "c",
> "type" : "string",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "d",
> "type" : {
>   "type" : "array",
>   "elementType" : "long",
>   "containsNull" : true
> },
> "nullable" : true,
> "metadata" : { }
>   }]
> }
> {code}
> Specifically, in the last element, "type" is an object instead of being a 
> string. I would expect the last element to be:
> {code}
>   {
>  "name":"d",
>  "type":"array",
>  "elementType":"long",
>  "containsNull":true,
>  "nullable":true,
>  "metadata":{}
>   }
> {code}
> There's a similar issue for nested structs.
> (I ran into this while writing node.js bindings, wanted to recurse down this 
> representation, which would be nicer if it was uniform...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11940) Python API for ml.clustering.LDA

2015-11-30 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032952#comment-15032952
 ] 

Jeff Zhang commented on SPARK-11940:


Thanks [~yanboliang], I will work on it. 

> Python API for ml.clustering.LDA
> 
>
> Key: SPARK-11940
> URL: https://issues.apache.org/jira/browse/SPARK-11940
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>
> Add Python API for ml.clustering.LDA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11940) Python API for ml.clustering.LDA

2015-11-30 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032948#comment-15032948
 ] 

Yanbo Liang commented on SPARK-11940:
-

[~zjffdu] I'm not working on this; you can take it.

> Python API for ml.clustering.LDA
> 
>
> Key: SPARK-11940
> URL: https://issues.apache.org/jira/browse/SPARK-11940
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>
> Add Python API for ml.clustering.LDA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11941) JSON representation of nested StructTypes could be more uniform

2015-11-30 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032946#comment-15032946
 ] 

Michael Armbrust commented on SPARK-11941:
--

Sorry, maybe I'm misunderstanding.  Can you construct a case where we
serialize the case class representation to and from json and we lose
information?

If you can, then I agree this is a bug and we should fix it.  Otherwise, it
seems like an inconvenience.



> JSON representation of nested StructTypes could be more uniform
> ---
>
> Key: SPARK-11941
> URL: https://issues.apache.org/jira/browse/SPARK-11941
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Henri DF
>
> I have a json file with a single row {code}{"a":1, "b": 1.0, "c": "asdfasd", 
> "d":[1, 2, 4]}{code} After reading that file in, the schema is correctly 
> inferred:
> {code}
> scala> df.printSchema
> root
>  |-- a: long (nullable = true)
>  |-- b: double (nullable = true)
>  |-- c: string (nullable = true)
>  |-- d: array (nullable = true)
>  ||-- element: long (containsNull = true)
> {code}
> However, the json representation has a strange nesting under "type" for 
> column "d":
> {code}
> scala> df.collect()(0).schema.prettyJson
> res60: String = 
> {
>   "type" : "struct",
>   "fields" : [ {
> "name" : "a",
> "type" : "long",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "b",
> "type" : "double",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "c",
> "type" : "string",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "d",
> "type" : {
>   "type" : "array",
>   "elementType" : "long",
>   "containsNull" : true
> },
> "nullable" : true,
> "metadata" : { }
>   }]
> }
> {code}
> Specifically, in the last element, "type" is an object instead of being a 
> string. I would expect the last element to be:
> {code}
>   {
>  "name":"d",
>  "type":"array",
>  "elementType":"long",
>  "containsNull":true,
>  "nullable":true,
>  "metadata":{}
>   }
> {code}
> There's a similar issue for nested structs.
> (I ran into this while writing node.js bindings, wanted to recurse down this 
> representation, which would be nicer if it was uniform...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12030) Incorrect results when aggregate joined data

2015-11-30 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032939#comment-15032939
 ] 

Xiao Li edited comment on SPARK-12030 at 12/1/15 2:16 AM:
--

Let me post a simple case that can trigger the data corruption. The data set t1 
is downloaded from this JIRA. 

{code}
test("sort result") {
  withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",
  SQLConf.SHUFFLE_PARTITIONS.key -> "1") {
  val t1test = 
sqlContext.read.parquet("/Users/xiaoli/Downloads/t1").dropDuplicates().where("fk1=39
 or (fk1=525 and id1 < 664618 and id1 >= 470050)").repartition(1).cache()

  //t1test.orderBy("fk1").explain(true)
  val t1 = t1test.orderBy("fk1").cache()

  checkAnswer( t1test, t1.collect() )
}
{code}

I am not sure if you can see the mismatch. I am unable to reproduce it on a 
Thinkpad, but I can easily reproduce it on my MacBook. 

My case did not hit any exception, but I saw data corruption. After sorting, 
one row [664615,525] is replaced by another row [664611,525]. Thus one row 
disappeared after sorting, but you can see a duplicate of another row. The 
total number of rows was not changed after the sort. 


was (Author: smilegator):
Let me post a simple case that can trigger the data corruption. The data set t1 
is downloaded from this JIRA. 

{code}
test("sort result") {
  withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",
  SQLConf.SHUFFLE_PARTITIONS.key -> "1") {
  val t1test = 
sqlContext.read.parquet("/Users/xiaoli/Downloads/t1").dropDuplicates().where("fk1=39
 or (fk1=525 and id1 < 664618 and id1 >= 470050)").repartition(1).cache()

  //t1test.orderBy("fk1").explain(true)
  val t1 = t1test.orderBy("fk1").cache()

  checkAnswer( t1test, t1.collect() )
}
{code}

I am not sure if you can see the un-match. I am unable to reproduce it in the 
Thinkpad, but I can easily reproduce it in my macbook. 

My case did not hit any exception, but I saw a data corruption. After sorting, 
one row [664615,525] is replaced by another row [664611,525]. Thus one row 
disappears after sorting, but you can see a duplicate in another row. The 
number of total rows is not changed after the sort. 

> Incorrect results when aggregate joined data
> 
>
> Key: SPARK-12030
> URL: https://issues.apache.org/jira/browse/SPARK-12030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>Priority: Blocker
> Attachments: spark.jpg, t1.tar.gz, t2.tar.gz
>
>
> I have the following issue.
> I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2)
> {code}
> t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache()
> t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache()
> joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer")
> {code}
> Important: both tables are cached, so results should be the same on every 
> query.
> Then I did some counts:
> {code}
> t1.count() -> 5900729
> t1.registerTempTable("t1")
> sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729
> t2.count() -> 54298
> joined.count() -> 5900729
> {code}
> And here the magic begins - I counted distinct id1 from the joined table
> {code}
> joined.registerTempTable("joined")
> sqlCtx.sql("select distinct(id1) from joined").count()
> {code}
> Results vary *(are different on every run)* between 5899000 and 
> 590 but are never equal to 5900729.
> In addition, I did more queries:
> {code}
> sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > 
> 1").collect() 
> {code}
> This gives some results, but this query returns *1*
> {code}
> len(sqlCtx.sql("select * from joined where id1 = result").collect())
> {code}
> What's wrong?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12030) Incorrect results when aggregate joined data

2015-11-30 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032939#comment-15032939
 ] 

Xiao Li commented on SPARK-12030:
-

Let me post a simple case that can trigger the data corruption. The data set t1 
was downloaded from this JIRA. 

{code}
test("sort result") {
  withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",
      SQLConf.SHUFFLE_PARTITIONS.key -> "1") {
    val t1test = sqlContext.read.parquet("/Users/xiaoli/Downloads/t1")
      .dropDuplicates()
      .where("fk1=39 or (fk1=525 and id1 < 664618 and id1 >= 470050)")
      .repartition(1)
      .cache()

    // t1test.orderBy("fk1").explain(true)
    val t1 = t1test.orderBy("fk1").cache()

    checkAnswer(t1test, t1.collect())
  }
}
{code}

I am not sure whether you can see the mismatch. I am unable to reproduce it on a 
ThinkPad, but I can easily reproduce it on my MacBook. 

My case did not hit any exception, but I saw data corruption. After sorting, the 
row [664615,525] is replaced by another row, [664611,525]. Thus one row 
disappears after sorting and another row is duplicated, while the total number 
of rows is unchanged after the sort. 

> Incorrect results when aggregate joined data
> 
>
> Key: SPARK-12030
> URL: https://issues.apache.org/jira/browse/SPARK-12030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>Priority: Blocker
> Attachments: spark.jpg, t1.tar.gz, t2.tar.gz
>
>
> I have the following issue.
> I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2)
> {code}
> t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache()
> t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache()
> joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer")
> {code}
> Important: both tables are cached, so the results should be the same on every 
> query.
> Then I did some counts:
> {code}
> t1.count() -> 5900729
> t1.registerTempTable("t1")
> sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729
> t2.count() -> 54298
> joined.count() -> 5900729
> {code}
> And here the magic begins - I counted distinct id1 from the joined table
> {code}
> joined.registerTempTable("joined")
> sqlCtx.sql("select distinct(id1) from joined").count()
> {code}
> Results vary *(they are different on every run)* between 5899000 and 
> 590 but are never equal to 5900729.
> In addition, I did more queries:
> {code}
> sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > 
> 1").collect() 
> {code}
> This gives some results, but the following query returns *1*
> {code}
> len(sqlCtx.sql("select * from joined where id1 = result").collect())
> {code}
> What's wrong?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11941) JSON representation of nested StructTypes could be more uniform

2015-11-30 Thread Henri DF (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032931#comment-15032931
 ] 

Henri DF edited comment on SPARK-11941 at 12/1/15 2:14 AM:
---

I think "might be nicer if it was flat' is a bit of an understatement  

The current representation isn't of much use with nested structs. If it's hard 
to fix, wouldn't it be better to make this private rather than leave exposed it 
in its current state? 


was (Author: henridf):
I think "might be nicer if it was flat' is a bit of an understatement  

The current representation isn't of much use with nested structs. If it's hard 
to fix, would it be better to remove this than leave it in its current state? 

> JSON representation of nested StructTypes could be more uniform
> ---
>
> Key: SPARK-11941
> URL: https://issues.apache.org/jira/browse/SPARK-11941
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Henri DF
>
> I have a json file with a single row {code}{"a":1, "b": 1.0, "c": "asdfasd", 
> "d":[1, 2, 4]}{code} After reading that file in, the schema is correctly 
> inferred:
> {code}
> scala> df.printSchema
> root
>  |-- a: long (nullable = true)
>  |-- b: double (nullable = true)
>  |-- c: string (nullable = true)
>  |-- d: array (nullable = true)
>  ||-- element: long (containsNull = true)
> {code}
> However, the json representation has a strange nesting under "type" for 
> column "d":
> {code}
> scala> df.collect()(0).schema.prettyJson
> res60: String = 
> {
>   "type" : "struct",
>   "fields" : [ {
> "name" : "a",
> "type" : "long",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "b",
> "type" : "double",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "c",
> "type" : "string",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "d",
> "type" : {
>   "type" : "array",
>   "elementType" : "long",
>   "containsNull" : true
> },
> "nullable" : true,
> "metadata" : { }
>   }]
> }
> {code}
> Specifically, in the last element, "type" is an object instead of being a 
> string. I would expect the last element to be:
> {code}
>   {
>  "name":"d",
>  "type":"array",
>  "elementType":"long",
>  "containsNull":true,
>  "nullable":true,
>  "metadata":{}
>   }
> {code}
> There's a similar issue for nested structs.
> (I ran into this while writing node.js bindings and wanted to recurse down this 
> representation, which would be nicer if it were uniform...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11941) JSON representation of nested StructTypes could be more uniform

2015-11-30 Thread Henri DF (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032931#comment-15032931
 ] 

Henri DF commented on SPARK-11941:
--

I think "might be nicer if it was flat' is a bit of an understatement  

The current representation isn't of much use with nested structs. If it's hard 
to fix, would it be better to remove this than leave it in its current state? 
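
Purely as an illustration of the client-side pain (this is not from the JIRA, 
and the helper below is hypothetical), the non-uniform representation forces 
every consumer of {{prettyJson}} to special-case whether a field's "type" is a 
plain string or a nested object, e.g. with json4s on the JVM side:

{code}
import org.json4s._
import org.json4s.jackson.JsonMethods._

// Hypothetical helper: resolve a field's type name whether "type" is a plain
// string ("long", "string", ...) or a nested object ({"type": "array", ...}).
def typeName(t: JValue): String = t match {
  case JString(simple) => simple
  case complex: JObject =>
    (complex \ "type") match {
      case JString(kind) => kind
      case _             => "unknown"
    }
  case _ => "unknown"
}

val field = parse(
  """{"name":"d","type":{"type":"array","elementType":"long","containsNull":true}}""")
println(typeName(field \ "type"))   // prints "array"
{code}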

> JSON representation of nested StructTypes could be more uniform
> ---
>
> Key: SPARK-11941
> URL: https://issues.apache.org/jira/browse/SPARK-11941
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Henri DF
>
> I have a json file with a single row {code}{"a":1, "b": 1.0, "c": "asdfasd", 
> "d":[1, 2, 4]}{code} After reading that file in, the schema is correctly 
> inferred:
> {code}
> scala> df.printSchema
> root
>  |-- a: long (nullable = true)
>  |-- b: double (nullable = true)
>  |-- c: string (nullable = true)
>  |-- d: array (nullable = true)
>  ||-- element: long (containsNull = true)
> {code}
> However, the json representation has a strange nesting under "type" for 
> column "d":
> {code}
> scala> df.collect()(0).schema.prettyJson
> res60: String = 
> {
>   "type" : "struct",
>   "fields" : [ {
> "name" : "a",
> "type" : "long",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "b",
> "type" : "double",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "c",
> "type" : "string",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "d",
> "type" : {
>   "type" : "array",
>   "elementType" : "long",
>   "containsNull" : true
> },
> "nullable" : true,
> "metadata" : { }
>   }]
> }
> {code}
> Specifically, in the last element, "type" is an object instead of being a 
> string. I would expect the last element to be:
> {code}
>   {
>  "name":"d",
>  "type":"array",
>  "elementType":"long",
>  "containsNull":true,
>  "nullable":true,
>  "metadata":{}
>   }
> {code}
> There's a similar issue for nested structs.
> (I ran into this while writing node.js bindings and wanted to recurse down this 
> representation, which would be nicer if it were uniform...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12030) Incorrect results when aggregate joined data

2015-11-30 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032927#comment-15032927
 ] 

Yin Huai commented on SPARK-12030:
--

[~smilegator] Can you post the case that triggers the problem? Also, is 
https://issues.apache.org/jira/browse/SPARK-12055 a related issue?

> Incorrect results when aggregate joined data
> 
>
> Key: SPARK-12030
> URL: https://issues.apache.org/jira/browse/SPARK-12030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>Priority: Blocker
> Attachments: spark.jpg, t1.tar.gz, t2.tar.gz
>
>
> I have the following issue.
> I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2)
> {code}
> t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache()
> t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache()
> joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer")
> {code}
> Important: both tables are cached, so the results should be the same on every 
> query.
> Then I did some counts:
> {code}
> t1.count() -> 5900729
> t1.registerTempTable("t1")
> sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729
> t2.count() -> 54298
> joined.count() -> 5900729
> {code}
> And here the magic begins - I counted distinct id1 from the joined table
> {code}
> joined.registerTempTable("joined")
> sqlCtx.sql("select distinct(id1) from joined").count()
> {code}
> Results vary *(they are different on every run)* between 5899000 and 
> 590 but are never equal to 5900729.
> In addition, I did more queries:
> {code}
> sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > 
> 1").collect() 
> {code}
> This gives some results, but the following query returns *1*
> {code}
> len(sqlCtx.sql("select * from joined where id1 = result").collect())
> {code}
> What's wrong?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11966) Spark API for UDTFs

2015-11-30 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032925#comment-15032925
 ] 

Michael Armbrust commented on SPARK-11966:
--

Ah, I was proposing the DataFrame function explode, as it gives you something 
very close to UDTFs.  However, if you want to be able to use the functions in 
pure SQL, then that's not going to be sufficient.
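
For reference, a minimal sketch of the DataFrame-side alternative (the data and 
column names are made up, and it assumes an existing {{sqlContext}}): each 
"text" value yields one output row per word in a new "word" column, which is 
the UDTF-like behaviour being discussed.

{code}
val df = sqlContext.createDataFrame(Seq(
  (1, "hello world"),
  (2, "spark sql"))).toDF("id", "text")

// One output row per whitespace-separated token, kept alongside id and text.
val withWords = df.explode("text", "word") { text: String => text.split(" ").toSeq }
withWords.show()   // columns: id, text, word
{code}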

> Spark API for UDTFs
> ---
>
> Key: SPARK-11966
> URL: https://issues.apache.org/jira/browse/SPARK-11966
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Jaka Jancar
>Priority: Minor
>
> Defining UDFs is easy using sqlContext.udf.register, but not table-generating 
> functions. For those you still have to use these horrendous Hive interfaces:
> https://github.com/prongs/apache-hive/blob/master/contrib/src/java/org/apache/hadoop/hive/contrib/udtf/example/GenericUDTFCount2.java



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12032:
-
Issue Type: Improvement  (was: Bug)

> Filter can't be pushed down to correct Join because of bad order of Join
> 
>
> Key: SPARK-12032
> URL: https://issues.apache.org/jira/browse/SPARK-12032
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Priority: Critical
>
> For this query:
> {code}
>   select d.d_year, count(*) cnt
>FROM store_sales, date_dim d, customer c
>WHERE ss_customer_sk = c.c_customer_sk AND c.c_first_shipto_date_sk = 
> d.d_date_sk
>group by d.d_year
> {code}
> Current optimized plan is
> {code}
> == Optimized Logical Plan ==
> Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) 
> AS cnt#425L]
>  Project [d_year#147]
>   Join Inner, Some(((ss_customer_sk#283 = c_customer_sk#101) && 
> (c_first_shipto_date_sk#106 = d_date_sk#141)))
>Project [d_date_sk#141,d_year#147,ss_customer_sk#283]
> Join Inner, None
>  Project [ss_customer_sk#283]
>   Relation[] ParquetRelation[store_sales]
>  Project [d_date_sk#141,d_year#147]
>   Relation[] ParquetRelation[date_dim]
>Project [c_customer_sk#101,c_first_shipto_date_sk#106]
> Relation[] ParquetRelation[customer]
> {code}
> It will join store_sales and date_dim together without any condition; the 
> condition c.c_first_shipto_date_sk = d.d_date_sk is not pushed down to that 
> join because of the bad order of the joins.
> The optimizer should re-order the joins, joining date_dim after customer; then 
> it can push down the condition correctly.
> The plan should be 
> {code}
> Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) 
> AS cnt#425L]
>  Project [d_year#147]
>   Join Inner, Some((c_first_shipto_date_sk#106 = d_date_sk#141))
>Project [c_first_shipto_date_sk#106]
> Join Inner, Some((ss_customer_sk#283 = c_customer_sk#101))
>  Project [ss_customer_sk#283]
>   Relation[store_sales]
>  Project [c_first_shipto_date_sk#106,c_customer_sk#101]
>   Relation[customer]
>Project [d_year#147,d_date_sk#141]
> Relation[date_dim]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11873) Regression for TPC-DS query 63 when used with decimal datatype and windows function

2015-11-30 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032921#comment-15032921
 ] 

Michael Armbrust commented on SPARK-11873:
--

What about with Spark 1.6?

> Regression for TPC-DS query 63 when used with decimal datatype and windows 
> function
> ---
>
> Key: SPARK-11873
> URL: https://issues.apache.org/jira/browse/SPARK-11873
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Dileep Kumar
>  Labels: perfomance
> Attachments: 63.1.1, 63.1.5, 63.decimal_schema, 
> 63.decimal_schema_windows_function, 63.double_schema, 98.1.1, 98.1.5, 
> decimal_schema.sql, double_schema.sql
>
>
> When running the TPC-DS based queries for benchmarking Spark, I found that query 
> 63 (after making it similar to the original query) shows different behavior 
> compared to other queries, e.g. q98, which has a similar function.
> Here are the performance numbers (execution time in seconds):
>        1.1 Baseline   1.5   1.5 + Decimal
> q63    27             26    38
> q98    18             26    24
> As you can see, q63 shows a regression compared to a similar query. I am 
> attaching both versions of the queries and the affected schemas. When the 
> window function is added back, this is the only query that seems to be slower 
> in 1.5 than in 1.1.
> I have attached both versions of the schemas and queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11966) Spark API for UDTFs

2015-11-30 Thread Jaka Jancar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032906#comment-15032906
 ] 

Jaka Jancar edited comment on SPARK-11966 at 12/1/15 1:59 AM:
--

Not sure I understand. I would like to do {{SELECT * FROM 
my_create_table(...)}}.

Right now, all I can do is {{SELECT * FROM explode(my_create_array(...))}}.

//edit: In reality, this would be part of a JOIN or a lateral view. I would 
like it to be doable with SQL only.
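
For reference, a hedged sketch of the SQL-level workaround being discussed 
(the table, column, and function names are made up, and it assumes the built-in 
{{explode}} generator is available in the SQL dialect in use):

{code}
// Register an ordinary UDF that builds an array...
sqlContext.udf.register("my_create_array", (s: String) => s.split(",").toSeq)

// ...then rely on the built-in explode generator to turn the array into rows.
sqlContext.sql("SELECT explode(my_create_array(csv_column)) AS item FROM my_table")
{code}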


was (Author: jakajancar):
Not sure I understand. I would like to do {{SELECT * FROM 
my_create_table(...)}}.

Right now, all I can do is {{SELECT * FROM explode(my_create_array(...))}}.


> Spark API for UDTFs
> ---
>
> Key: SPARK-11966
> URL: https://issues.apache.org/jira/browse/SPARK-11966
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Jaka Jancar
>Priority: Minor
>
> Defining UDFs is easy using sqlContext.udf.register, but not table-generating 
> functions. For those you still have to use these horrendous Hive interfaces:
> https://github.com/prongs/apache-hive/blob/master/contrib/src/java/org/apache/hadoop/hive/contrib/udtf/example/GenericUDTFCount2.java



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11966) Spark API for UDTFs

2015-11-30 Thread Jaka Jancar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032906#comment-15032906
 ] 

Jaka Jancar commented on SPARK-11966:
-

Not sure I understand. I would like to do {{SELECT * FROM 
my_create_table(...)}}.

Right now, all I can do is {{SELECT * FROM explode(my_create_array(...))}}.


> Spark API for UDTFs
> ---
>
> Key: SPARK-11966
> URL: https://issues.apache.org/jira/browse/SPARK-11966
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Jaka Jancar
>Priority: Minor
>
> Defining UDFs is easy using sqlContext.udf.register, but not table-generating 
> functions. For those you still have to use these horrendous Hive interfaces:
> https://github.com/prongs/apache-hive/blob/master/contrib/src/java/org/apache/hadoop/hive/contrib/udtf/example/GenericUDTFCount2.java



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11941) JSON representation of nested StructTypes could be more uniform

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11941:
-
Issue Type: Improvement  (was: Bug)

> JSON representation of nested StructTypes could be more uniform
> ---
>
> Key: SPARK-11941
> URL: https://issues.apache.org/jira/browse/SPARK-11941
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Henri DF
>
> I have a json file with a single row {code}{"a":1, "b": 1.0, "c": "asdfasd", 
> "d":[1, 2, 4]}{code} After reading that file in, the schema is correctly 
> inferred:
> {code}
> scala> df.printSchema
> root
>  |-- a: long (nullable = true)
>  |-- b: double (nullable = true)
>  |-- c: string (nullable = true)
>  |-- d: array (nullable = true)
>  ||-- element: long (containsNull = true)
> {code}
> However, the json representation has a strange nesting under "type" for 
> column "d":
> {code}
> scala> df.collect()(0).schema.prettyJson
> res60: String = 
> {
>   "type" : "struct",
>   "fields" : [ {
> "name" : "a",
> "type" : "long",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "b",
> "type" : "double",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "c",
> "type" : "string",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "d",
> "type" : {
>   "type" : "array",
>   "elementType" : "long",
>   "containsNull" : true
> },
> "nullable" : true,
> "metadata" : { }
>   }]
> }
> {code}
> Specifically, in the last element, "type" is an object instead of being a 
> string. I would expect the last element to be:
> {code}
>   {
>  "name":"d",
>  "type":"array",
>  "elementType":"long",
>  "containsNull":true,
>  "nullable":true,
>  "metadata":{}
>   }
> {code}
> There's a similar issue for nested structs.
> (I ran into this while writing node.js bindings and wanted to recurse down this 
> representation, which would be nicer if it were uniform...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11941) JSON representation of nested StructTypes could be more uniform

2015-11-30 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032899#comment-15032899
 ] 

Michael Armbrust commented on SPARK-11941:
--

/cc [~lian cheng]


> JSON representation of nested StructTypes could be more uniform
> ---
>
> Key: SPARK-11941
> URL: https://issues.apache.org/jira/browse/SPARK-11941
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Henri DF
>
> I have a json file with a single row {code}{"a":1, "b": 1.0, "c": "asdfasd", 
> "d":[1, 2, 4]}{code} After reading that file in, the schema is correctly 
> inferred:
> {code}
> scala> df.printSchema
> root
>  |-- a: long (nullable = true)
>  |-- b: double (nullable = true)
>  |-- c: string (nullable = true)
>  |-- d: array (nullable = true)
>  ||-- element: long (containsNull = true)
> {code}
> However, the json representation has a strange nesting under "type" for 
> column "d":
> {code}
> scala> df.collect()(0).schema.prettyJson
> res60: String = 
> {
>   "type" : "struct",
>   "fields" : [ {
> "name" : "a",
> "type" : "long",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "b",
> "type" : "double",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "c",
> "type" : "string",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "d",
> "type" : {
>   "type" : "array",
>   "elementType" : "long",
>   "containsNull" : true
> },
> "nullable" : true,
> "metadata" : { }
>   }]
> }
> {code}
> Specifically, in the last element, "type" is an object instead of being a 
> string. I would expect the last element to be:
> {code}
>   {
>  "name":"d",
>  "type":"array",
>  "elementType":"long",
>  "containsNull":true,
>  "nullable":true,
>  "metadata":{}
>   }
> {code}
> There's a similar issue for nested structs.
> (I ran into this while writing node.js bindings and wanted to recurse down this 
> representation, which would be nicer if it were uniform...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12030) Incorrect results when aggregate joined data

2015-11-30 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032902#comment-15032902
 ] 

Xiao Li commented on SPARK-12030:
-

I have already excluded Exchange and Partitioning; it should be caused by Sort. I 
will continue the investigation tonight and will keep you posted if I can locate 
the exact changes. Thanks!

> Incorrect results when aggregate joined data
> 
>
> Key: SPARK-12030
> URL: https://issues.apache.org/jira/browse/SPARK-12030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>Priority: Blocker
> Attachments: spark.jpg, t1.tar.gz, t2.tar.gz
>
>
> I have the following issue.
> I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2)
> {code}
> t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache()
> t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache()
> joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer")
> {code}
> Important: both tables are cached, so the results should be the same on every 
> query.
> Then I did some counts:
> {code}
> t1.count() -> 5900729
> t1.registerTempTable("t1")
> sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729
> t2.count() -> 54298
> joined.count() -> 5900729
> {code}
> And here the magic begins - I counted distinct id1 from the joined table
> {code}
> joined.registerTempTable("joined")
> sqlCtx.sql("select distinct(id1) from joined").count()
> {code}
> Results vary *(they are different on every run)* between 5899000 and 
> 590 but are never equal to 5900729.
> In addition, I did more queries:
> {code}
> sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > 
> 1").collect() 
> {code}
> This gives some results, but the following query returns *1*
> {code}
> len(sqlCtx.sql("select * from joined where id1 = result").collect())
> {code}
> What's wrong?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11941) JSON representation of nested StructTypes could be more uniform

2015-11-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11941:
-
Summary: JSON representation of nested StructTypes could be more uniform  
(was: JSON representation of nested StructTypes is incorrect)

> JSON representation of nested StructTypes could be more uniform
> ---
>
> Key: SPARK-11941
> URL: https://issues.apache.org/jira/browse/SPARK-11941
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Henri DF
>
> I have a json file with a single row {code}{"a":1, "b": 1.0, "c": "asdfasd", 
> "d":[1, 2, 4]}{code} After reading that file in, the schema is correctly 
> inferred:
> {code}
> scala> df.printSchema
> root
>  |-- a: long (nullable = true)
>  |-- b: double (nullable = true)
>  |-- c: string (nullable = true)
>  |-- d: array (nullable = true)
>  ||-- element: long (containsNull = true)
> {code}
> However, the json representation has a strange nesting under "type" for 
> column "d":
> {code}
> scala> df.collect()(0).schema.prettyJson
> res60: String = 
> {
>   "type" : "struct",
>   "fields" : [ {
> "name" : "a",
> "type" : "long",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "b",
> "type" : "double",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "c",
> "type" : "string",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "d",
> "type" : {
>   "type" : "array",
>   "elementType" : "long",
>   "containsNull" : true
> },
> "nullable" : true,
> "metadata" : { }
>   }]
> }
> {code}
> Specifically, in the last element, "type" is an object instead of being a 
> string. I would expect the last element to be:
> {code}
>   {
>  "name":"d",
>  "type":"array",
>  "elementType":"long",
>  "containsNull":true,
>  "nullable":true,
>  "metadata":{}
>   }
> {code}
> There's a similar issue for nested structs.
> (I ran into this while writing node.js bindings and wanted to recurse down this 
> representation, which would be nicer if it were uniform...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11941) JSON representation of nested StructTypes is incorrect

2015-11-30 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032889#comment-15032889
 ] 

Michael Armbrust commented on SPARK-11941:
--

While I can appreciate that this might be nicer if it was flat, I don't think 
that changing it at this point is worth the cost.  This is a stable 
representation that we persist with data.  As such, if we change it we are 
going to have to support parsing both representations forever.

> JSON representation of nested StructTypes is incorrect
> --
>
> Key: SPARK-11941
> URL: https://issues.apache.org/jira/browse/SPARK-11941
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Henri DF
>
> I have a json file with a single row {code}{"a":1, "b": 1.0, "c": "asdfasd", 
> "d":[1, 2, 4]}{code} After reading that file in, the schema is correctly 
> inferred:
> {code}
> scala> df.printSchema
> root
>  |-- a: long (nullable = true)
>  |-- b: double (nullable = true)
>  |-- c: string (nullable = true)
>  |-- d: array (nullable = true)
>  ||-- element: long (containsNull = true)
> {code}
> However, the json representation has a strange nesting under "type" for 
> column "d":
> {code}
> scala> df.collect()(0).schema.prettyJson
> res60: String = 
> {
>   "type" : "struct",
>   "fields" : [ {
> "name" : "a",
> "type" : "long",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "b",
> "type" : "double",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "c",
> "type" : "string",
> "nullable" : true,
> "metadata" : { }
>   }, {
> "name" : "d",
> "type" : {
>   "type" : "array",
>   "elementType" : "long",
>   "containsNull" : true
> },
> "nullable" : true,
> "metadata" : { }
>   }]
> }
> {code}
> Specifically, in the last element, "type" is an object instead of being a 
> string. I would expect the last element to be:
> {code}
>   {
>  "name":"d",
>  "type":"array",
>  "elementType":"long",
>  "containsNull":true,
>  "nullable":true,
>  "metadata":{}
>   }
> {code}
> There's a similar issue for nested structs.
> (I ran into this while writing node.js bindings and wanted to recurse down this 
> representation, which would be nicer if it were uniform...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12030) Incorrect results when aggregate joined data

2015-11-30 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032887#comment-15032887
 ] 

Xiao Li commented on SPARK-12030:
-

[SPARK-7542][SQL] Support off-heap index/sort buffer
https://github.com/apache/spark/pull/9477

and

[SPARK-11389][CORE] Add support for off-heap memory to MemoryManager
https://github.com/apache/spark/pull/9344

The problem does not exist if I take out the code changes from these two JIRAs. 
The code changes of these two JIRAs are intertwined; thus, I assume the problem 
is caused by #9477.


> Incorrect results when aggregate joined data
> 
>
> Key: SPARK-12030
> URL: https://issues.apache.org/jira/browse/SPARK-12030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>Priority: Blocker
> Attachments: spark.jpg, t1.tar.gz, t2.tar.gz
>
>
> I have the following issue.
> I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2)
> {code}
> t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache()
> t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache()
> joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer")
> {code}
> Important: both tables are cached, so the results should be the same on every 
> query.
> Then I did some counts:
> {code}
> t1.count() -> 5900729
> t1.registerTempTable("t1")
> sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729
> t2.count() -> 54298
> joined.count() -> 5900729
> {code}
> And here the magic begins - I counted distinct id1 from the joined table
> {code}
> joined.registerTempTable("joined")
> sqlCtx.sql("select distinct(id1) from joined").count()
> {code}
> Results vary *(they are different on every run)* between 5899000 and 
> 590 but are never equal to 5900729.
> In addition, I did more queries:
> {code}
> sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > 
> 1").collect() 
> {code}
> This gives some results, but the following query returns *1*
> {code}
> len(sqlCtx.sql("select * from joined where id1 = result").collect())
> {code}
> What's wrong?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11966) Spark API for UDTFs

2015-11-30 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032877#comment-15032877
 ] 

Michael Armbrust commented on SPARK-11966:
--

Have you seen 
[explode|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L1146]? 
Does this do what you want, or is something missing?

> Spark API for UDTFs
> ---
>
> Key: SPARK-11966
> URL: https://issues.apache.org/jira/browse/SPARK-11966
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Jaka Jancar
>Priority: Minor
>
> Defining UDFs is easy using sqlContext.udf.register, but not table-generating 
> functions. For those you still have to use these horrendous Hive interfaces:
> https://github.com/prongs/apache-hive/blob/master/contrib/src/java/org/apache/hadoop/hive/contrib/udtf/example/GenericUDTFCount2.java



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12049) User JVM shutdown hook can cause deadlock at shutdown

2015-11-30 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-12049.

   Resolution: Fixed
Fix Version/s: 1.6.0
   1.5.3

> User JVM shutdown hook can cause deadlock at shutdown
> -
>
> Key: SPARK-12049
> URL: https://issues.apache.org/jira/browse/SPARK-12049
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Sean Owen
>Assignee: Sean Owen
> Fix For: 1.5.3, 1.6.0
>
>
> Here's a simplification of a deadlock that can occur at shutdown if the user 
> app has also installed a shutdown hook to clean up:
> - Spark Shutdown Hook thread runs
> - {{SparkShutdownHookManager.runAll()}} is invoked, locking 
> {{SparkShutdownHookManager}} as it is {{synchronized}}
> - A user shutdown hook thread runs
> - User hook tries to call, for example {{StreamingContext.stop()}}, which is 
> {{synchronized}} and locks it
> - User hook blocks when the {{StreamingContext}} tries to {{remove()}} the 
> Spark Streaming shutdown task, since it's {{synchronized}} per above
> - Spark Shutdown Hook tries to execute the Spark Streaming shutdown task, but 
> blocks on {{StreamingContext.stop()}}
> I think this is actually not that critical, since it requires a pretty 
> specific setup, and I think it can be worked around in many cases by 
> integrating with Hadoop's shutdown hook mechanism like Spark does so that 
> these happen serially.
> I also think it's solvable in the code by not locking 
> {{SparkShutdownHookManager}} in the 3 methods that are {{synchronized}} since 
> these are really only protecting {{hooks}}. {{runAll()}} shouldn't hold the 
> lock while executing hooks.
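
A minimal sketch (purely illustrative, not Spark's actual code) of that last 
idea: hold the lock only long enough to snapshot the registered hooks, and run 
them with the lock released, so a hook that re-enters a {{synchronized}} method 
such as {{StreamingContext.stop()}} cannot deadlock against {{runAll()}}.

{code}
import scala.collection.mutable.ArrayBuffer
import scala.util.control.NonFatal

object ShutdownHookManagerSketch {
  private val hooks = new ArrayBuffer[Runnable]()

  def add(hook: Runnable): Unit = synchronized { hooks += hook }
  def remove(hook: Runnable): Unit = synchronized { hooks -= hook }

  def runAll(): Unit = {
    // Copy under the lock, run outside it: a remove() called from a running
    // hook can still acquire the lock instead of blocking forever.
    val snapshot = synchronized { hooks.toList }
    snapshot.foreach { hook =>
      try hook.run()
      catch { case NonFatal(e) => System.err.println(s"Shutdown hook failed: $e") }
    }
  }
}
{code}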



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12000) `sbt publishLocal` hits a Scala compiler bug caused by `Since` annotation

2015-11-30 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032863#comment-15032863
 ] 

Josh Rosen commented on SPARK-12000:


Here's the full stacktrace of the compiler crash:

{code}
  last tree to typer: Literal(Constant(1.5.0))
  symbol: null
   symbol definition: null
 tpe: String("1.5.0")
   symbol owners:
  context owners: value  -> package clustering

== Enclosing template or block ==

Apply(
  new Since.""
  "1.5.0"
)

== Expanded type of tree ==

ConstantType(value = Constant(1.5.0))

no-symbol does not have an owner
at scala.reflect.internal.SymbolTable.abort(SymbolTable.scala:49)
at scala.tools.nsc.Global.abort(Global.scala:254)
at scala.reflect.internal.Symbols$NoSymbol.owner(Symbols.scala:3257)
at 
scala.tools.nsc.symtab.classfile.ClassfileParser.addEnclosingTParams(ClassfileParser.scala:585)
at 
scala.tools.nsc.symtab.classfile.ClassfileParser.parseClass(ClassfileParser.scala:530)
at 
scala.tools.nsc.symtab.classfile.ClassfileParser.parse(ClassfileParser.scala:88)
at 
scala.tools.nsc.symtab.SymbolLoaders$ClassfileLoader.doComplete(SymbolLoaders.scala:261)
at 
scala.tools.nsc.symtab.SymbolLoaders$SymbolLoader.complete(SymbolLoaders.scala:194)
at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1231)
at 
scala.tools.nsc.doc.base.MemberLookupBase$$anonfun$cleanupBogusClasses$1$1.apply(MemberLookupBase.scala:153)
at 
scala.tools.nsc.doc.base.MemberLookupBase$$anonfun$cleanupBogusClasses$1$1.apply(MemberLookupBase.scala:153)
at 
scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.filter(TraversableLike.scala:263)
at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
at 
scala.tools.nsc.doc.base.MemberLookupBase$class.cleanupBogusClasses$1(MemberLookupBase.scala:153)
at 
scala.tools.nsc.doc.base.MemberLookupBase$class.lookupInTemplate(MemberLookupBase.scala:164)
at 
scala.tools.nsc.doc.base.MemberLookupBase$class.scala$tools$nsc$doc$base$MemberLookupBase$$lookupInTemplate(MemberLookupBase.scala:128)
at 
scala.tools.nsc.doc.base.MemberLookupBase$class.lookupInRootPackage(MemberLookupBase.scala:115)
at 
scala.tools.nsc.doc.base.MemberLookupBase$class.memberLookup(MemberLookupBase.scala:52)
at 
scala.tools.nsc.doc.DocFactory$$anon$1.memberLookup(DocFactory.scala:78)
at 
scala.tools.nsc.doc.base.MemberLookupBase$$anon$1.link$lzycompute(MemberLookupBase.scala:27)
at 
scala.tools.nsc.doc.base.MemberLookupBase$$anon$1.link(MemberLookupBase.scala:27)
at scala.tools.nsc.doc.base.comment.EntityLink$.unapply(Body.scala:75)
at scala.tools.nsc.doc.html.HtmlPage.inlineToHtml(HtmlPage.scala:126)
at 
scala.tools.nsc.doc.html.HtmlPage$$anonfun$inlineToHtml$1.apply(HtmlPage.scala:115)
at 
scala.tools.nsc.doc.html.HtmlPage$$anonfun$inlineToHtml$1.apply(HtmlPage.scala:115)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at scala.tools.nsc.doc.html.HtmlPage.inlineToHtml(HtmlPage.scala:115)
at 
scala.tools.nsc.doc.html.HtmlPage$$anonfun$inlineToHtml$1.apply(HtmlPage.scala:115)
at 
scala.tools.nsc.doc.html.HtmlPage$$anonfun$inlineToHtml$1.apply(HtmlPage.scala:115)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at scala.tools.nsc.doc.html.HtmlPage.inlineToHtml(HtmlPage.scala:115)
at scala.tools.nsc.doc.html.HtmlPage.inlineToHtml(HtmlPage.scala:124)
at 
scala.tools.nsc.doc.html.HtmlPage$$anonfun$inlineToHtml$1.apply(HtmlPage.scala:115)
at 
scala.tools.nsc.doc.html.HtmlPage$$anonfun$inlineToHtml$1.apply(HtmlPage.scala:115)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.colle

[jira] [Commented] (SPARK-12030) Incorrect results when aggregate joined data

2015-11-30 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032857#comment-15032857
 ] 

Davies Liu commented on SPARK-12030:


[~smilegator] Could you post the related PRs here, so we can also look into 
it? Thanks!

> Incorrect results when aggregate joined data
> 
>
> Key: SPARK-12030
> URL: https://issues.apache.org/jira/browse/SPARK-12030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>Priority: Blocker
> Attachments: spark.jpg, t1.tar.gz, t2.tar.gz
>
>
> I have the following issue.
> I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2)
> {code}
> t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache()
> t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache()
> joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer")
> {code}
> Important: both tables are cached, so the results should be the same on every 
> query.
> Then I did some counts:
> {code}
> t1.count() -> 5900729
> t1.registerTempTable("t1")
> sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729
> t2.count() -> 54298
> joined.count() -> 5900729
> {code}
> And here the magic begins - I counted distinct id1 from the joined table
> {code}
> joined.registerTempTable("joined")
> sqlCtx.sql("select distinct(id1) from joined").count()
> {code}
> Results vary *(they are different on every run)* between 5899000 and 
> 590 but are never equal to 5900729.
> In addition, I did more queries:
> {code}
> sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > 
> 1").collect() 
> {code}
> This gives some results, but the following query returns *1*
> {code}
> len(sqlCtx.sql("select * from joined where id1 = result").collect())
> {code}
> What's wrong?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12007) Network library's RPC layer requires a lot of copying

2015-11-30 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-12007.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

> Network library's RPC layer requires a lot of copying
> -
>
> Key: SPARK-12007
> URL: https://issues.apache.org/jira/browse/SPARK-12007
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.6.0
>
>
> The network library's RPC layer has an external API based on byte arrays, 
> instead of ByteBuffer; that requires a lot of copying since the internals of 
> the library use ByteBuffers (or rather Netty's ByteBuf), and lots of external 
> clients also use ByteBuffer.
> The extra copies could be avoided if the API used ByteBuffer instead.
> To show an extreme case, look at an RPC send via NettyRpcEnv:
> - message is encoded using JavaSerializer, resulting in a ByteBuffer
> - the ByteBuffer is copied into a byte array of the right size, since its 
> internal array may be larger than the actual data it holds
> - the network library's encoder copies the byte array into a ByteBuf
> - finally the data is written to the socket
> The intermediate 2 copies could be avoided if the API allowed the original 
> ByteBuffer to be sent instead.
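
To make the copies above concrete, a small illustration (not the library's 
actual code) of the difference between handing out a right-sized byte array and 
handing out the buffer itself:

{code}
import java.nio.ByteBuffer

// Copying path: forcing a byte[] of exactly the right size copies all the
// data, because the serializer's internal array may be larger than what it holds.
def toExactByteArray(buf: ByteBuffer): Array[Byte] = {
  val out = new Array[Byte](buf.remaining())
  buf.duplicate().get(out)    // copies buf.remaining() bytes
  out
}

// Copy-free path: a slice is just a view over the same underlying memory.
def asView(buf: ByteBuffer): ByteBuffer = buf.slice()
{code}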



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12037) Executors use heartbeatReceiverRef to report heartbeats and task metrics that might not be initialized and leads to NullPointerException

2015-11-30 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-12037.
---
  Resolution: Fixed
Assignee: Nan Zhu
   Fix Version/s: 1.6.0
Target Version/s: 1.6.0

> Executors use heartbeatReceiverRef to report heartbeats and task metrics that 
> might not be initialized and leads to NullPointerException
> 
>
> Key: SPARK-12037
> URL: https://issues.apache.org/jira/browse/SPARK-12037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: The latest sources at revision {{c793d2d}}
>Reporter: Jacek Laskowski
>Assignee: Nan Zhu
> Fix For: 1.6.0
>
>
> When {{Executor}} starts, it starts the driver heartbeater (using 
> {{startDriverHeartbeater()}}), which uses {{heartbeatReceiverRef}}. That 
> reference is initialized later, so there is a possibility of a 
> NullPointerException (after {{spark.executor.heartbeatInterval}}, or {{10s}}).
> {code}
> WARN Executor: Issue communicating with driver in heartbeater
> java.lang.NullPointerException
>   at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:447)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:467)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:467)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:467)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1717)
>   at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:467)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
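
A rough illustrative sketch (simplified names, not the actual fix) of the kind 
of guard that avoids the race: the heartbeater should not assume the receiver 
reference has been initialized, since the heartbeat thread is started first.

{code}
class HeartbeaterSketch {
  @volatile private var heartbeatReceiverRef: AnyRef = _   // set later during executor setup

  def setReceiver(ref: AnyRef): Unit = { heartbeatReceiverRef = ref }

  def reportHeartBeat(): Unit = {
    val ref = heartbeatReceiverRef
    if (ref == null) {
      // Driver endpoint not known yet: skip this round instead of hitting an NPE.
      return
    }
    // ... send the Heartbeat message via `ref` as usual ...
  }
}
{code}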



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12035) Add more debug information in include_example tag of Jekyll

2015-11-30 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-12035.
---
  Resolution: Fixed
Assignee: Xusen Yin
   Fix Version/s: 1.6.0
Target Version/s: 1.6.0

> Add more debug information in include_example tag of Jekyll
> ---
>
> Key: SPARK-12035
> URL: https://issues.apache.org/jira/browse/SPARK-12035
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation
>Reporter: Xusen Yin
>Assignee: Xusen Yin
>Priority: Minor
>  Labels: documentation
> Fix For: 1.6.0
>
>
> Add more debug information to the include_example tag of Jekyll, so that we 
> can learn more when facing errors from `jekyll build`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12064) Make the SqlParser as trait for better integrated with extensions

2015-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12064:


Assignee: Apache Spark

> Make the SqlParser as trait for better integrated with extensions
> -
>
> Key: SPARK-12064
> URL: https://issues.apache.org/jira/browse/SPARK-12064
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Apache Spark
>
> `SqlParser` is now an object, which is hard to reuse in extensions. A better 
> implementation would make `SqlParser` a trait, keep all of its implementation 
> unchanged, and then add another object called `SqlParser` that inherits from 
> the trait.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12064) Make the SqlParser as trait for better integrated with extensions

2015-11-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12064:


Assignee: (was: Apache Spark)

> Make the SqlParser as trait for better integrated with extensions
> -
>
> Key: SPARK-12064
> URL: https://issues.apache.org/jira/browse/SPARK-12064
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>
> `SqlParser` is now an object, which is hard to reuse in extensions. A better 
> implementation would make `SqlParser` a trait, keep all of its implementation 
> unchanged, and then add another object called `SqlParser` that inherits from 
> the trait.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12064) Make the SqlParser as trait for better integrated with extensions

2015-11-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032842#comment-15032842
 ] 

Apache Spark commented on SPARK-12064:
--

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/10053

> Make the SqlParser as trait for better integrated with extensions
> -
>
> Key: SPARK-12064
> URL: https://issues.apache.org/jira/browse/SPARK-12064
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>
> `SqlParser` is now an object, which is hard to reuse in extensions. A better 
> implementation would make `SqlParser` a trait, keep all of its implementation 
> unchanged, and then add another object called `SqlParser` that inherits from 
> the trait.
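
A minimal sketch of the refactoring pattern being proposed (the names and the 
one-line parse body are placeholders, not the real parser):

{code}
// Move the implementation into a trait so extensions can reuse or override it...
trait SqlParserBase {
  def parse(sql: String): String = s"parsed: ${sql.trim}"   // stand-in for the real logic
}

// ...and keep an object with the old name so existing callers are unaffected.
object SqlParser extends SqlParserBase

// An extension can now build on the default behaviour:
object MyDialectParser extends SqlParserBase {
  override def parse(sql: String): String = super.parse(sql.toLowerCase)
}
{code}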



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


