[jira] [Resolved] (SPARK-13408) Exception in resultHandler will shutdown SparkContext
[ https://issues.apache.org/jira/browse/SPARK-13408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13408. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11280 [https://github.com/apache/spark/pull/11280] > Exception in resultHandler will shutdown SparkContext > - > > Key: SPARK-13408 > URL: https://issues.apache.org/jira/browse/SPARK-13408 > Project: Spark > Issue Type: Bug >Reporter: Davies Liu >Assignee: Shixiong Zhu > Fix For: 2.0.0 > > > {code} > davies@localhost:~/work/spark$ bin/spark-submit > python/pyspark/sql/dataframe.py > NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes > ahead of assembly. > 16/02/19 12:46:24 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/02/19 12:46:24 WARN Utils: Your hostname, localhost resolves to a loopback > address: 127.0.0.1; using 192.168.0.143 instead (on interface en0) > 16/02/19 12:46:24 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > ** > File > "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", > line 554, in pyspark.sql.dataframe.DataFrame.alias > Failed example: > joined_df.select(col("df_as1.name"), col("df_as2.name"), > col("df_as2.age")).collect() > Differences (ndiff with -expected +actual): > - [Row(name=u'Bob', name=u'Bob', age=5), Row(name=u'Alice', > name=u'Alice', age=2)] > + [Row(name=u'Alice', name=u'Alice', age=2), Row(name=u'Bob', > name=u'Bob', age=5)] > org.apache.spark.SparkDriverExecutionException: Execution error > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1189) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1658) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157) > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148) > at scala.math.Ordering$$anon$4.compare(Ordering.scala:111) > at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649) > at java.util.PriorityQueue.siftUp(PriorityQueue.java:627) > at java.util.PriorityQueue.offer(PriorityQueue.java:329) > at > org.apache.spark.util.BoundedPriorityQueue.$plus$eq(BoundedPriorityQueue.scala:47) > at > org.apache.spark.util.BoundedPriorityQueue$$anonfun$$plus$plus$eq$1.apply(BoundedPriorityQueue.scala:41) > at > org.apache.spark.util.BoundedPriorityQueue$$anonfun$$plus$plus$eq$1.apply(BoundedPriorityQueue.scala:41) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at > org.apache.spark.util.BoundedPriorityQueue.foreach(BoundedPriorityQueue.scala:31) > at > org.apache.spark.util.BoundedPriorityQueue.$plus$plus$eq(BoundedPriorityQueue.scala:41) > at > org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$apply$46.apply(RDD.scala:1319) > at > org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$apply$46.apply(RDD.scala:1318) > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:932) > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:929) > at > org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:57) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1185) > ... 
4 more > org.apache.spark.SparkDriverExecutionException: Execution error > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1189) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1658) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > Caused by: java.lang.NullPointerException > at >
[jira] [Comment Edited] (SPARK-13409) Log the stacktrace when stopping a SparkContext
[ https://issues.apache.org/jira/browse/SPARK-13409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155441#comment-15155441 ] Reynold Xin edited comment on SPARK-13409 at 2/20/16 6:58 AM: -- OK that makes sense. It wasn't clear to me what you meant by "remembering". You are proposing adding a field to SparkContext and a function that can be used to retrieve the stacktrace when SparkContext stops? was (Author: rxin): OK that makes sense. It wasn't clear to me what you meant by "remembering". You are proposing adding a field to SparkContext that can be used to retrieve the stacktrace when SparkContext stops? > Log the stacktrace when stopping a SparkContext > --- > > Key: SPARK-13409 > URL: https://issues.apache.org/jira/browse/SPARK-13409 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Davies Liu > > Sometimes we see a stopped SparkContext and have no idea what stopped it; we > should log that for troubleshooting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13304) Broadcast join with two ints could be very slow
[ https://issues.apache.org/jira/browse/SPARK-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13304. Resolution: Fixed Assignee: Davies Liu Fix Version/s: 2.0.0 Fixed by https://github.com/apache/spark/pull/11130 > Broadcast join with two ints could be very slow > --- > > Key: SPARK-13304 > URL: https://issues.apache.org/jira/browse/SPARK-13304 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > If the two join columns have the same value, the hash code of them will be (a > ^ b), which is 0, then the HashMap will be very very slow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
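The slowdown described above can be sketched in a few lines of Python (hypothetical helper names; this is not Spark's actual hashing code, just an illustration of why XOR-combining two equal join columns is pathological):

```python
def xor_key_hash(a: int, b: int) -> int:
    # Simplified stand-in for hashing a two-int join key as (a ^ b).
    return a ^ b

def bucket_sizes(keys, num_buckets=16):
    # Count how many keys land in each bucket of a hash table.
    sizes = [0] * num_buckets
    for a, b in keys:
        sizes[xor_key_hash(a, b) % num_buckets] += 1
    return sizes

# When both join columns hold the same value, a ^ b is always 0, so every
# key lands in bucket 0 and hash lookups degrade to a linear scan.
degenerate = [(i, i) for i in range(1000)]
assert bucket_sizes(degenerate)[0] == 1000
```

The usual fix for this class of problem is to mix the two fields asymmetrically (e.g. `31 * a + b` or a proper hash combiner) so equal columns no longer cancel out.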
[jira] [Commented] (SPARK-13409) Log the stacktrace when stopping a SparkContext
[ https://issues.apache.org/jira/browse/SPARK-13409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155441#comment-15155441 ] Reynold Xin commented on SPARK-13409: - OK that makes sense. It wasn't clear to me what you meant by "remembering". You are proposing adding a field to SparkContext that can be used to retrieve the stacktrace when SparkContext stops? > Log the stacktrace when stopping a SparkContext > --- > > Key: SPARK-13409 > URL: https://issues.apache.org/jira/browse/SPARK-13409 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Davies Liu > > Sometimes we see a stopped SparkContext and have no idea what stopped it; we > should log that for troubleshooting.
[jira] [Commented] (SPARK-13409) Log the stacktrace when stopping a SparkContext
[ https://issues.apache.org/jira/browse/SPARK-13409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155439#comment-15155439 ] Davies Liu commented on SPARK-13409: [~rxin] I think we should remember the stacktrace and include it in the error message raised when someone tries to access the stopped SparkContext; that would be much more useful than just logging it (the log is hard to find). We already do something similar when creating a SparkContext. > Log the stacktrace when stopping a SparkContext > --- > > Key: SPARK-13409 > URL: https://issues.apache.org/jira/browse/SPARK-13409 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Davies Liu > > Sometimes we see a stopped SparkContext and have no idea what stopped it; we > should log that for troubleshooting.
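The proposal in this thread can be sketched roughly as follows (a Python illustration with hypothetical class and method names; the real change would live in SparkContext's Scala code): capture the stacktrace at stop() time, then surface it in the error raised on later access.

```python
import traceback

class StoppableContext:
    """Hypothetical stand-in for SparkContext, to illustrate the idea."""

    def __init__(self):
        self._stop_site = None  # stacktrace captured when stop() is called

    def stop(self):
        # Remember where stop() was called from, not just that it happened.
        self._stop_site = "".join(traceback.format_stack())

    def assert_not_stopped(self):
        if self._stop_site is not None:
            # Put the remembered stacktrace in the error itself, so the user
            # does not have to dig through logs to find who stopped it.
            raise RuntimeError(
                "Context was already stopped; stop() was called at:\n"
                + self._stop_site)

ctx = StoppableContext()
ctx.stop()
try:
    ctx.assert_not_stopped()
except RuntimeError as e:
    assert "stop() was called at" in str(e)
```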
[jira] [Commented] (SPARK-13213) BroadcastNestedLoopJoin is very slow
[ https://issues.apache.org/jira/browse/SPARK-13213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155438#comment-15155438 ] Davies Liu commented on SPARK-13213: It depends; I'm open to any reasonable solution. > BroadcastNestedLoopJoin is very slow > > > Key: SPARK-13213 > URL: https://issues.apache.org/jira/browse/SPARK-13213 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > > Since we have improved the performance of CartesianProduct, which should be > faster and more robust than BroadcastNestedLoopJoin, we should use > CartesianProduct instead of BroadcastNestedLoopJoin, especially when the > broadcasted table is not that small. > Today we hit a query that ran for a very long time without finishing; once we > decreased the threshold for broadcast (disabling BroadcastNestedLoopJoin), it > finished in seconds.
[jira] [Assigned] (SPARK-12567) Add aes_encrypt and aes_decrypt UDFs
[ https://issues.apache.org/jira/browse/SPARK-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12567: Assignee: Kai Jiang (was: Apache Spark) > Add aes_encrypt and aes_decrypt UDFs > > > Key: SPARK-12567 > URL: https://issues.apache.org/jira/browse/SPARK-12567 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Kai Jiang >Assignee: Kai Jiang > Fix For: 2.0.0 > > > AES (Advanced Encryption Standard) algorithm. > Add aes_encrypt and aes_decrypt UDFs. > Ref: > [Hive|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Misc.Functions] > [MySQL|https://dev.mysql.com/doc/refman/5.5/en/encryption-functions.html#function_aes-decrypt] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-12567) Add aes_encrypt and aes_decrypt UDFs
[ https://issues.apache.org/jira/browse/SPARK-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reopened SPARK-12567: - > Add aes_encrypt and aes_decrypt UDFs > > > Key: SPARK-12567 > URL: https://issues.apache.org/jira/browse/SPARK-12567 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Kai Jiang >Assignee: Kai Jiang > Fix For: 2.0.0 > > > AES (Advanced Encryption Standard) algorithm. > Add aes_encrypt and aes_decrypt UDFs. > Ref: > [Hive|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Misc.Functions] > [MySQL|https://dev.mysql.com/doc/refman/5.5/en/encryption-functions.html#function_aes-decrypt] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12567) Add aes_encrypt and aes_decrypt UDFs
[ https://issues.apache.org/jira/browse/SPARK-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12567: Assignee: Apache Spark (was: Kai Jiang) > Add aes_encrypt and aes_decrypt UDFs > > > Key: SPARK-12567 > URL: https://issues.apache.org/jira/browse/SPARK-12567 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Kai Jiang >Assignee: Apache Spark > Fix For: 2.0.0 > > > AES (Advanced Encryption Standard) algorithm. > Add aes_encrypt and aes_decrypt UDFs. > Ref: > [Hive|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Misc.Functions] > [MySQL|https://dev.mysql.com/doc/refman/5.5/en/encryption-functions.html#function_aes-decrypt] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12567) Add aes_encrypt and aes_decrypt UDFs
[ https://issues.apache.org/jira/browse/SPARK-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-12567: Summary: Add aes_encrypt and aes_decrypt UDFs (was: Add aes_{encrypt,decrypt} UDFs) > Add aes_encrypt and aes_decrypt UDFs > > > Key: SPARK-12567 > URL: https://issues.apache.org/jira/browse/SPARK-12567 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Kai Jiang >Assignee: Kai Jiang > Fix For: 2.0.0 > > > AES (Advanced Encryption Standard) algorithm. > Add aes_encrypt and aes_decrypt UDFs. > Ref: > [Hive|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Misc.Functions] > [MySQL|https://dev.mysql.com/doc/refman/5.5/en/encryption-functions.html#function_aes-decrypt] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12567) Add aes_{encrypt,decrypt} UDFs
[ https://issues.apache.org/jira/browse/SPARK-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12567. Resolution: Fixed Assignee: Kai Jiang Fix Version/s: 2.0.0 > Add aes_{encrypt,decrypt} UDFs > -- > > Key: SPARK-12567 > URL: https://issues.apache.org/jira/browse/SPARK-12567 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Kai Jiang >Assignee: Kai Jiang > Fix For: 2.0.0 > > > AES (Advanced Encryption Standard) algorithm. > Add aes_encrypt and aes_decrypt UDFs. > Ref: > [Hive|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Misc.Functions] > [MySQL|https://dev.mysql.com/doc/refman/5.5/en/encryption-functions.html#function_aes-decrypt] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12594) Outer Join Elimination by Filter Condition
[ https://issues.apache.org/jira/browse/SPARK-12594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12594. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10567 [https://github.com/apache/spark/pull/10567] > Outer Join Elimination by Filter Condition > -- > > Key: SPARK-12594 > URL: https://issues.apache.org/jira/browse/SPARK-12594 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 1.6.0 >Reporter: Xiao Li >Priority: Critical > Fix For: 2.0.0 > > > Eliminate outer joins when the predicates in the filter condition restrict > the result set so that all null-supplying rows are eliminated: > - full outer -> inner if both sides have such predicates > - left outer -> inner if the right side has such predicates > - right outer -> inner if the left side has such predicates > - full outer -> left outer if only the left side has such predicates > - full outer -> right outer if only the right side has such predicates > Where applicable, this can greatly improve performance.
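The rewrite table in the issue description can be sketched as a small decision function (a hedged Python illustration of the rule, not the optimizer's actual code; predicate and type names are made up for clarity):

```python
def eliminate_outer_join(join_type, left_rejects_nulls, right_rejects_nulls):
    # join_type: "full_outer", "left_outer", or "right_outer".
    # *_rejects_nulls: whether the filter condition discards rows whose
    # columns from that side were null-supplied by the outer join.
    if join_type == "full_outer":
        if left_rejects_nulls and right_rejects_nulls:
            return "inner"
        if left_rejects_nulls:
            return "left_outer"
        if right_rejects_nulls:
            return "right_outer"
    if join_type == "left_outer" and right_rejects_nulls:
        return "inner"
    if join_type == "right_outer" and left_rejects_nulls:
        return "inner"
    return join_type  # no helpful predicate: keep the original join

# e.g. a filter like "right.col > 0" rejects null-extended right rows,
# so a left outer join can safely become an inner join.
assert eliminate_outer_join("left_outer", False, True) == "inner"
assert eliminate_outer_join("full_outer", True, False) == "left_outer"
```

The intuition: an outer join only differs from an inner join in the null-extended rows it adds, so a later filter that is guaranteed to drop those rows makes the cheaper join type equivalent.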
[jira] [Commented] (SPARK-13213) BroadcastNestedLoopJoin is very slow
[ https://issues.apache.org/jira/browse/SPARK-13213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155418#comment-15155418 ] Reynold Xin commented on SPARK-13213: - [~davies] what is this ticket about? Is it about making BroadcastNestedLoopJoin faster, or using CartesianProduct as much as possible? > BroadcastNestedLoopJoin is very slow > > > Key: SPARK-13213 > URL: https://issues.apache.org/jira/browse/SPARK-13213 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > > Since we have improved the performance of CartesianProduct, which should be > faster and more robust than BroadcastNestedLoopJoin, we should use > CartesianProduct instead of BroadcastNestedLoopJoin, especially when the > broadcasted table is not that small. > Today we hit a query that ran for a very long time without finishing; once we > decreased the threshold for broadcast (disabling BroadcastNestedLoopJoin), it > finished in seconds.
[jira] [Commented] (SPARK-13382) Update PySpark testing notes
[ https://issues.apache.org/jira/browse/SPARK-13382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155415#comment-15155415 ] holdenk commented on SPARK-13382: - Note: while I've got a PR we should also update the wiki and I don't have permission to edit the wiki. > Update PySpark testing notes > > > Key: SPARK-13382 > URL: https://issues.apache.org/jira/browse/SPARK-13382 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Reporter: holdenk >Priority: Trivial > > As discussed on the mailing list, running the full python tests requires that > Spark is built with the hive assembly. We should update both the wiki and the > build instructions for Python to mention this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12720) SQL generation support for cube, rollup, and grouping set
[ https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155390#comment-15155390 ] Xiao Li commented on SPARK-12720: - Expression {{grouping_id()}} is needed for resolving this JIRA. Thus, it is blocked by SPARK-12799, which implements that expression. > SQL generation support for cube, rollup, and grouping set > - > > Key: SPARK-12720 > URL: https://issues.apache.org/jira/browse/SPARK-12720 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Xiao Li > > {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. > Please refer to SPARK-11012 for more details.
[jira] [Commented] (SPARK-13391) Use Apache Arrow as In-memory columnar store implementation
[ https://issues.apache.org/jira/browse/SPARK-13391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155268#comment-15155268 ] Wes McKinney commented on SPARK-13391: -- Indeed, one of the major motivations of Arrow (for Python and R) is higher data throughput between native pandas / R data-frame memory representation and Spark. I will be looking to add C-level data marshaling algorithms between Arrow and pandas (via NumPy arrays) to the Arrow codebase within the next couple of months. Will cross-post JIRAs as they develop > Use Apache Arrow as In-memory columnar store implementation > --- > > Key: SPARK-13391 > URL: https://issues.apache.org/jira/browse/SPARK-13391 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Maciej BryĆski > > Idea. > Apache Arrow (http://arrow.apache.org/) is Open Source implementation of > inmemory columnar store. It has APIs in many programming languages. > We can think about using it in Apache Spark to avoid data (de-)serialization > when running PySpark (and R) UDFs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155263#comment-15155263 ] Varadharajan commented on SPARK-13393: -- Hi, I just tried the same scenario with prebuilt 1.6.0 and it still has this issue. I will give master branch a try today evening. > Column mismatch issue in left_outer join using Spark DataFrame > -- > > Key: SPARK-13393 > URL: https://issues.apache.org/jira/browse/SPARK-13393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Varadharajan > > Consider the below snippet: > {code:title=test.scala|borderStyle=solid} > case class Person(id: Int, name: String) > val df = sc.parallelize(List( > Person(1, "varadha"), > Person(2, "nagaraj") > )).toDF > val varadha = df.filter("id = 1") > val errorDF = df.join(varadha, df("id") === varadha("id"), > "left_outer").select(df("id"), varadha("id") as "varadha_id") > val nagaraj = df.filter("id = 2").select(df("id") as "n_id") > val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), > "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id") > {code} > The `errorDF` dataframe, after the left join is messed up and shows as below: > | id|varadha_id| > | 1| 1| > | 2| 2 (*This should've been null*)| > whereas correctDF has the correct output after the left join: > | id|nagaraj_id| > | 1| null| > | 2| 2| -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varadharajan updated SPARK-13393: - Description: Consider the below snippet: {code:title=test.scala|borderStyle=solid} case class Person(id: Int, name: String) val df = sc.parallelize(List( Person(1, "varadha"), Person(2, "nagaraj") )).toDF val varadha = df.filter("id = 1") val errorDF = df.join(varadha, df("id") === varadha("id"), "left_outer").select(df("id"), varadha("id") as "varadha_id") val nagaraj = df.filter("id = 2").select(df("id") as "n_id") val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id") {code} The `errorDF` dataframe, after the left join is messed up and shows as below: | id|varadha_id| | 1| 1| | 2| 2 (*This should've been null*)| whereas correctDF has the correct output after the left join: | id|nagaraj_id| | 1| null| | 2| 2| was: Consider the below snippet: {code:title=test.scala|borderStyle=solid} class Person(id: Int, name: String) val df = sc.parallelize(List( Person(1, "varadha"), Person(2, "nagaraj") )).toDF val varadha = df.filter("id = 1") val errorDF = df.join(varadha, df("id") === varadha("id"), "left_outer").select(df("id"), varadha("id") as "varadha_id") val nagaraj = df.filter("id = 2").select(df("id") as "n_id") val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id") {code} The `errorDF` dataframe, after the left join is messed up and shows as below: | id|varadha_id| | 1| 1| | 2| 2 (*This should've been null*)| whereas correctDF has the correct output after the left join: | id|nagaraj_id| | 1| null| | 2| 2| > Column mismatch issue in left_outer join using Spark DataFrame > -- > > Key: SPARK-13393 > URL: https://issues.apache.org/jira/browse/SPARK-13393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Varadharajan > > Consider the 
below snippet: > {code:title=test.scala|borderStyle=solid} > case class Person(id: Int, name: String) > val df = sc.parallelize(List( > Person(1, "varadha"), > Person(2, "nagaraj") > )).toDF > val varadha = df.filter("id = 1") > val errorDF = df.join(varadha, df("id") === varadha("id"), > "left_outer").select(df("id"), varadha("id") as "varadha_id") > val nagaraj = df.filter("id = 2").select(df("id") as "n_id") > val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), > "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id") > {code} > The `errorDF` dataframe, after the left join is messed up and shows as below: > | id|varadha_id| > | 1| 1| > | 2| 2 (*This should've been null*)| > whereas correctDF has the correct output after the left join: > | id|nagaraj_id| > | 1| null| > | 2| 2| -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10001) Allow Ctrl-C in spark-shell to kill running job
[ https://issues.apache.org/jira/browse/SPARK-10001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155232#comment-15155232 ] Jon Maurer commented on SPARK-10001: Thank you for your feedback and consideration. I opened a new enhancement for a confirmation prompt prior to exiting the shell. https://issues.apache.org/jira/browse/SPARK-13412 > Allow Ctrl-C in spark-shell to kill running job > --- > > Key: SPARK-10001 > URL: https://issues.apache.org/jira/browse/SPARK-10001 > Project: Spark > Issue Type: Improvement > Components: Spark Shell >Affects Versions: 1.4.1 >Reporter: Cheolsoo Park >Priority: Minor > > Hitting Ctrl-C in spark-sql (and other tools like presto) cancels any running > job and starts a new input line on the prompt. It would be nice if > spark-shell also can do that. Otherwise, in case a user submits a job, say he > made a mistake, and wants to cancel it, he needs to exit the shell and > re-login to continue his work. Re-login can be a pain especially in Spark on > yarn, since it takes a while to allocate AM container and initial executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13412) Spark Shell Ctrl-C behaviour suggestion
Jon Maurer created SPARK-13412: -- Summary: Spark Shell Ctrl-C behaviour suggestion Key: SPARK-13412 URL: https://issues.apache.org/jira/browse/SPARK-13412 Project: Spark Issue Type: Improvement Components: Spark Shell Affects Versions: 1.6.0 Reporter: Jon Maurer Priority: Minor It would be useful to catch the interrupt from a ctrl-c and prompt for confirmation prior to closing spark shell. This is currently an issue when sitting at an idle prompt: if a user accidentally enters ctrl-c, all previous progress is lost and must be run again. Instead, the desired behavior would be to prompt the user to enter 'yes' or another ctrl-c to exit the shell, thus preventing rework. There is related discussion about this sort of feature on the Scala issue tracker: https://issues.scala-lang.org/browse/SI-6302
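The suggested behavior can be sketched in Python (a hypothetical illustration only; spark-shell is a Scala/JLine REPL, so the real fix would hook the REPL's interrupt handling rather than use these names):

```python
import signal

def should_exit(answer: str) -> bool:
    # Only an explicit "yes" confirms the exit.
    return answer.strip().lower() == "yes"

def on_interrupt(signum, frame):
    # Intercept Ctrl-C at an idle prompt instead of killing the shell.
    try:
        answer = input("\nReally exit the shell? (yes/no) ")
    except EOFError:
        answer = "yes"  # input stream closed: give up and exit
    if should_exit(answer):
        raise SystemExit(0)
    print("Resuming; press Ctrl-C again and answer 'yes' to exit.")

def install_handler():
    # Call once at shell startup; a stray Ctrl-C then asks for confirmation
    # before discarding the session's accumulated state.
    signal.signal(signal.SIGINT, on_interrupt)
```

Note that Python's `signal.signal` may only be called from the main thread, which is why the installation is wrapped in a function the shell would invoke during startup.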
[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction
[ https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155229#comment-15155229 ] Xiao Li commented on SPARK-1: - Thanks! > DataFrame filter + randn + unionAll has bad interaction > --- > > Key: SPARK-1 > URL: https://issues.apache.org/jira/browse/SPARK-1 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.2, 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley > > Buggy workflow > * Create a DataFrame df0 > * Filter df0 > * Add a randn column > * Create a copy of the DataFrame > * unionAll the two DataFrames > This fails, where randn produces the same results on the original DataFrame > and the copy before unionAll but fails to do so after unionAll. Removing the > filter fixes the problem. > The bug can be reproduced on master: > {code} > import org.apache.spark.sql.functions.randn > val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id") > // Removing the following filter() call makes this give the expected result. > val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345)) > println("DF1") > df1.show() > val df2 = df1.select("id", "b") > println("DF2") > df2.show() // same as df1.show(), as expected > val df3 = df1.unionAll(df2) > println("DF3") > df3.show() // NOT two copies of df1, which is unexpected > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13048) EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel
[ https://issues.apache.org/jira/browse/SPARK-13048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155212#comment-15155212 ] Jeff Stein commented on SPARK-13048: As an aside, the code in the clustering namespace violates the [open/closed principle](https://en.wikipedia.org/wiki/Open/closed_principle). - LDAOptimizer is unnecessarily a sealed trait (I understand it's a developer api, but I'm a developer...) - EMLDAOptimizer is final - Lots of private[clustering] All of this meant that writing a decent workaround for the bug took a lot more code than I would have hoped. > EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel > -- > > Key: SPARK-13048 > URL: https://issues.apache.org/jira/browse/SPARK-13048 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2 > Environment: Standalone Spark cluster >Reporter: Jeff Stein > > In EMLDAOptimizer, all checkpoints are deleted before returning the > DistributedLDAModel. > The most recent checkpoint is still necessary for operations on the > DistributedLDAModel under a couple scenarios: > - The graph doesn't fit in memory on the worker nodes (e.g. very large data > set). > - Late worker failures that require reading the now-dependent checkpoint. > I ran into this problem running a 10M record LDA model in a memory starved > environment. The model consistently failed in either the {{collect at > LDAModel.scala:528}} stage (when converting to a LocalLDAModel) or in the > {{reduce at LDAModel.scala:563}} stage (when calling "describeTopics" on the > model). In both cases, a FileNotFoundException is thrown attempting to access > a checkpoint file. > I'm not sure what the correct fix is here; it might involve a class signature > change. An alternative simple fix is to leave the last checkpoint around and > expect the user to clean the checkpoint directory themselves. 
> {noformat} > java.io.FileNotFoundException: File does not exist: > /hdfs/path/to/checkpoints/c8bd2b4e-27dd-47b3-84ec-3ff0bac04587/rdd-635/part-00071 > {noformat} > Relevant code is included below. > LDAOptimizer.scala: > {noformat} > override private[clustering] def getLDAModel(iterationTimes: > Array[Double]): LDAModel = { > require(graph != null, "graph is null, EMLDAOptimizer not initialized.") > this.graphCheckpointer.deleteAllCheckpoints() > // The constructor's default arguments assume gammaShape = 100 to ensure > equivalence in > // LDAModel.toLocal conversion > new DistributedLDAModel(this.graph, this.globalTopicTotals, this.k, > this.vocabSize, > Vectors.dense(Array.fill(this.k)(this.docConcentration)), > this.topicConcentration, > iterationTimes) > } > {noformat} > PeriodicCheckpointer.scala > {noformat} > /** >* Call this at the end to delete any remaining checkpoint files. >*/ > def deleteAllCheckpoints(): Unit = { > while (checkpointQueue.nonEmpty) { > removeCheckpointFile() > } > } > /** >* Dequeue the oldest checkpointed Dataset, and remove its checkpoint files. >* This prints a warning but does not fail if the files cannot be removed. >*/ > private def removeCheckpointFile(): Unit = { > val old = checkpointQueue.dequeue() > // Since the old checkpoint is not deleted by Spark, we manually delete > it. > val fs = FileSystem.get(sc.hadoopConfiguration) > getCheckpointFiles(old).foreach { checkpointFile => > try { > fs.delete(new Path(checkpointFile), true) > } catch { > case e: Exception => > logWarning("PeriodicCheckpointer could not remove old checkpoint > file: " + > checkpointFile) > } > } > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13407) TaskMetrics.fromAccumulatorUpdates can crash when trying to access garbage-collected accumulators
[ https://issues.apache.org/jira/browse/SPARK-13407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-13407. Resolution: Fixed Fix Version/s: 2.0.0 Fixed by my patch. > TaskMetrics.fromAccumulatorUpdates can crash when trying to access > garbage-collected accumulators > - > > Key: SPARK-13407 > URL: https://issues.apache.org/jira/browse/SPARK-13407 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.0 > > > TaskMetrics.fromAccumulatorUpdates can fail if accumulators have been > garbage-collected: > {code} > java.lang.IllegalAccessError: Attempted to access garbage collected > accumulator 481596 > at > org.apache.spark.Accumulators$$anonfun$get$1$$anonfun$apply$1.apply(Accumulator.scala:133) > at > org.apache.spark.Accumulators$$anonfun$get$1$$anonfun$apply$1.apply(Accumulator.scala:133) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.Accumulators$$anonfun$get$1.apply(Accumulator.scala:132) > at > org.apache.spark.Accumulators$$anonfun$get$1.apply(Accumulator.scala:130) > at scala.Option.map(Option.scala:145) > at org.apache.spark.Accumulators$.get(Accumulator.scala:130) > at > org.apache.spark.executor.TaskMetrics$$anonfun$9.apply(TaskMetrics.scala:414) > at > org.apache.spark.executor.TaskMetrics$$anonfun$9.apply(TaskMetrics.scala:412) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) > at > 
org.apache.spark.executor.TaskMetrics$.fromAccumulatorUpdates(TaskMetrics.scala:412) > at > org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onExecutorMetricsUpdate$2.apply(JobProgressListener.scala:499) > at > org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onExecutorMetricsUpdate$2.apply(JobProgressListener.scala:493) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > at > org.apache.spark.ui.jobs.JobProgressListener.onExecutorMetricsUpdate(JobProgressListener.scala:493) > at > org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:56) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:35) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:81) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:65) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1178) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:64) > {code} > In order to guard 
against this, we can eliminate the need to access > driver-side accumulators when constructing TaskMetrics.
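The failure mode above — a weakly-referenced accumulator collected before a listener reads it — and the guard direction can be modeled without Spark. A hedged Python sketch (registry and function names are invented for illustration; Spark's registry lives in `org.apache.spark.Accumulators`):

```python
import gc
import weakref

# Minimal model of a driver-side registry holding weak references.
registry = {}

class Accum:
    def __init__(self, acc_id, value=0):
        self.id = acc_id
        self.value = value
        registry[acc_id] = weakref.ref(self)

def get_strict(acc_id):
    # Pre-fix behavior: assume the accumulator is still alive. If it was
    # garbage-collected the weak ref yields None and we raise, mirroring
    # the IllegalAccessError in the stack trace above.
    acc = registry[acc_id]()
    if acc is None:
        raise RuntimeError(
            f"Attempted to access garbage collected accumulator {acc_id}")
    return acc.value

def get_guarded(acc_id, default=None):
    # Post-fix idea: never depend on the driver-side object being alive;
    # fall back instead of crashing the listener thread.
    ref = registry.get(acc_id)
    acc = ref() if ref is not None else None
    return acc.value if acc is not None else default

a = Accum(481596, value=7)
del a
gc.collect()

try:
    get_strict(481596)
except RuntimeError as e:
    print(e)

print(get_guarded(481596, default=0))  # falls back instead of crashing
```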
[jira] [Updated] (SPARK-13329) Considering output for statistics of logical plan
[ https://issues.apache.org/jira/browse/SPARK-13329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-13329: -- Summary: Considering output for statistics of logical plan (was: Considering output for statistics of logicol plan) > Considering output for statistics of logical plan > - > > Key: SPARK-13329 > URL: https://issues.apache.org/jira/browse/SPARK-13329 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > The current implementation of statistics of UnaryNode does not consider > output (for example, Project); we should consider it to make a better > guess.
[jira] [Updated] (SPARK-13409) Log the stacktrace when stopping a SparkContext
[ https://issues.apache.org/jira/browse/SPARK-13409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-13409: Description: Sometimes we see a stopped SparkContext but have no idea what stopped it; we should log that for troubleshooting. (was: Somethings we saw a stopped SparkContext, then have no idea it's stopped by what, we should remember that for troubleshooting.) > Log the stacktrace when stopping a SparkContext > --- > > Key: SPARK-13409 > URL: https://issues.apache.org/jira/browse/SPARK-13409 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Davies Liu > > Sometimes we see a stopped SparkContext but have no idea what stopped it; > we should log that for troubleshooting.
[jira] [Updated] (SPARK-13409) Log the stacktrace when stopping a SparkContext
[ https://issues.apache.org/jira/browse/SPARK-13409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-13409: Summary: Log the stacktrace when stopping a SparkContext (was: Remember the stacktrace when stop a SparkContext) > Log the stacktrace when stopping a SparkContext > --- > > Key: SPARK-13409 > URL: https://issues.apache.org/jira/browse/SPARK-13409 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Davies Liu > > Sometimes we see a stopped SparkContext but have no idea what stopped it; > we should remember that for troubleshooting.
[jira] [Resolved] (SPARK-13091) Rewrite/Propagate constraints for Aliases
[ https://issues.apache.org/jira/browse/SPARK-13091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13091. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11144 [https://github.com/apache/spark/pull/11144] > Rewrite/Propagate constraints for Aliases > - > > Key: SPARK-13091 > URL: https://issues.apache.org/jira/browse/SPARK-13091 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sameer Agarwal >Assignee: Sameer Agarwal > Fix For: 2.0.0 > > > We'd want to duplicate constraints when there is an alias (i.e. for "SELECT > a, a AS b", any constraints on a now apply to b) > This is a follow up task based on [~marmbrus]'s suggestion in > https://docs.google.com/document/d/1WQRgDurUBV9Y6CWOBS75PQIqJwT-6WftVa18xzm7nCo/edit#heading=h.6hjcndo36qze
[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction
[ https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155047#comment-15155047 ] Reynold Xin commented on SPARK-13333: - OK I think this is going to be really difficult to fix right now. However, once we refactor the sql internals and introduce the concept of a local plan tree, then we might be able to just hash the local plan tree and use that as the seed. > DataFrame filter + randn + unionAll has bad interaction > --- > > Key: SPARK-13333 > URL: https://issues.apache.org/jira/browse/SPARK-13333 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.2, 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley > > Buggy workflow > * Create a DataFrame df0 > * Filter df0 > * Add a randn column > * Create a copy of the DataFrame > * unionAll the two DataFrames > This fails: randn produces the same results on the original DataFrame > and the copy before unionAll, but produces different results after unionAll. Removing the > filter fixes the problem. > The bug can be reproduced on master: > {code} > import org.apache.spark.sql.functions.randn > val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id") > // Removing the following filter() call makes this give the expected result. > val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345)) > println("DF1") > df1.show() > val df2 = df1.select("id", "b") > println("DF2") > df2.show() // same as df1.show(), as expected > val df3 = df1.unionAll(df2) > println("DF3") > df3.show() // NOT two copies of df1, which is unexpected > {code}
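The plan-hash idea from the comment can be illustrated outside Spark. A hedged Python sketch — the plan string and function names are invented stand-ins for a canonicalized local plan tree, not Spark internals — showing how deriving the RNG seed from the plan makes identical plan subtrees evaluate randn identically regardless of where they land after a unionAll:

```python
import hashlib
import random

def plan_seed(user_seed, plan):
    # Mix the user-provided seed with a stable hash of the (hypothetical)
    # canonicalized plan description, so every occurrence of the same
    # subtree gets the same effective seed.
    digest = hashlib.sha256(plan.encode("utf-8")).digest()
    return user_seed ^ int.from_bytes(digest[:8], "big")

def randn_column(seed, n):
    # A seeded Gaussian column, standing in for randn(seed).
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

plan = "Project [id, randn(12345) AS b] <- Filter (id = 0) <- LocalRelation"

# Two copies of the same plan subtree (df1 and df2 in the reproduction)
# derive the same seed, hence identical columns before and after a union:
left = randn_column(plan_seed(12345, plan), 2)
right = randn_column(plan_seed(12345, plan), 2)
print(left == right)  # True
```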
[jira] [Resolved] (SPARK-13261) Expose maxCharactersPerColumn as a user configurable option
[ https://issues.apache.org/jira/browse/SPARK-13261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13261. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11147 [https://github.com/apache/spark/pull/11147] > Expose maxCharactersPerColumn as a user configurable option > --- > > Key: SPARK-13261 > URL: https://issues.apache.org/jira/browse/SPARK-13261 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Hossein Falaki > Fix For: 2.0.0 > > > We are using the Univocity parser in the CSV data source in Spark. The parser has > a fairly small limit for the maximum number of characters per column. Spark's CSV > data source updates it but it is not exposed to the user. There are still use > cases where the limit is too small. I think we should just expose it as an > option. I suggest "maxCharsPerColumn" for the option.
[jira] [Updated] (SPARK-13411) change in null aggregation behavior from spark 1.5.2 and 1.6.0
[ https://issues.apache.org/jira/browse/SPARK-13411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barry Becker updated SPARK-13411: - Description: I don't know if the behavior in 1.5.3 or 1.6.0 is correct, but it's definitely different. Suppose I have a dataframe with a double column, "foo", that is all null valued. If I do val ext: DataFrame = df.agg(min("foo"), max("foo"), count(col("foo")).alias("nonNullCount")) In 1.5.2 I could do ext.getDouble(0) and get Double.NaN. In 1.6.0, when I try this I get "value in null at index 0". Maybe the new behavior is correct, but I think there is a typo in the message. It should say "value is null at index 0". Which behavior is correct? If 1.6.0 is correct, then it looks like I will need to add isNull checks everywhere when retrieving values. was: I don't know if the behavior in 1.5.3 or 1.6.0 is correct, but its definitely different. Suppose I have a dataframe with a double column, "foo", that is all null valued. If I do val ext: DataFrame = df.agg(min("foo"), max("foo"), count(col("foo")).alias("nonNullCount")) In 1.5.2 I could do ext.getDouble(0) and get Double.NaN. In 1.6.0, when I try this I get Which is correct. I think the 1.5.2 behavior is better otherwise I need to add special case handling for when a column is all null. > change in null aggregation behavior from spark 1.5.2 and 1.6.0 > --- > > Key: SPARK-13411 > URL: https://issues.apache.org/jira/browse/SPARK-13411 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Barry Becker > > I don't know if the behavior in 1.5.3 or 1.6.0 is correct, but it's definitely > different. > Suppose I have a dataframe with a double column, "foo", that is all null > valued. > If I do > val ext: DataFrame = df.agg(min("foo"), max("foo"), > count(col("foo")).alias("nonNullCount")) > In 1.5.2 I could do ext.getDouble(0) and get Double.NaN. > In 1.6.0, when I try this I get "value in null at index 0". 
Maybe the new > behavior is correct, but I think there is a typo in the message. It should > say "value is null at index 0". > Which behavior is correct? If 1.6.0 is correct, then it looks like I will > need to add isNull checks everywhere when retrieving values.
[jira] [Resolved] (SPARK-12966) Postgres JDBC ArrayType(DecimalType) 'Unable to find server array type'
[ https://issues.apache.org/jira/browse/SPARK-12966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12966. -- Resolution: Fixed Fix Version/s: 2.0.0 > Postgres JDBC ArrayType(DecimalType) 'Unable to find server array type' > --- > > Key: SPARK-12966 > URL: https://issues.apache.org/jira/browse/SPARK-12966 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Brandon Bradley > Fix For: 2.0.0 > > > Similar to SPARK-12747 but for DecimalType. > Do we need to handle precision and scale? > I've already started trying to work on this. I cannot see if the Postgres JDBC > driver handles precision and scale or just converts to the default BigDecimal > precision and scale.
[jira] [Updated] (SPARK-13411) change in null aggregation behavior from spark 1.5.2 and 1.6.0
[ https://issues.apache.org/jira/browse/SPARK-13411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barry Becker updated SPARK-13411: - Description: I don't know if the behavior in 1.5.3 or 1.6.0 is correct, but it's definitely different. Suppose I have a dataframe with a double column, "foo", that is all null valued. If I do val ext: DataFrame = df.agg(min("foo"), max("foo"), count(col("foo")).alias("nonNullCount")) In 1.5.2 I could do ext.getDouble(0) and get Double.NaN. In 1.6.0, when I try this I get Which is correct. I think the 1.5.2 behavior is better otherwise I need to add special case handling for when a column is all null. was: I don't know if the behavior in 1.5.3 or 1.6.0 is correct, but its definitely different. I suspect 1.6.0 is wrong. Suppose I have a dataframe with a double column, "foo", that is all null valued. If I do val ext: DataFrame = df.agg(min("foo"), max("foo"), count(col("foo")).alias("nonNullCount")) then in 1.6.0 I get a completely empty dataframe as the result. In 1.5.2, I got a single row with the aggregate min and max values being Double.NaN. Which is correct. I think the 1.5.2 behavior is better otherwise I need to add special case handling for when a column is all null. > change in null aggregation behavior from spark 1.5.2 and 1.6.0 > --- > > Key: SPARK-13411 > URL: https://issues.apache.org/jira/browse/SPARK-13411 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Barry Becker > > I don't know if the behavior in 1.5.3 or 1.6.0 is correct, but it's definitely > different. > Suppose I have a dataframe with a double column, "foo", that is all null > valued. > If I do > val ext: DataFrame = df.agg(min("foo"), max("foo"), > count(col("foo")).alias("nonNullCount")) > In 1.5.2 I could do ext.getDouble(0) and get Double.NaN. > In 1.6.0, when I try this I get > Which is correct. 
> I think the 1.5.2 behavior is better otherwise I need to add special case > handling for when a column is all null.
[jira] [Created] (SPARK-13411) change in null aggregation behavior from spark 1.5.2 and 1.6.0
Barry Becker created SPARK-13411: Summary: change in null aggregation behavior from spark 1.5.2 and 1.6.0 Key: SPARK-13411 URL: https://issues.apache.org/jira/browse/SPARK-13411 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.0 Reporter: Barry Becker I don't know if the behavior in 1.5.3 or 1.6.0 is correct, but it's definitely different. I suspect 1.6.0 is wrong. Suppose I have a dataframe with a double column, "foo", that is all null valued. If I do val ext: DataFrame = df.agg(min("foo"), max("foo"), count(col("foo")).alias("nonNullCount")) then in 1.6.0 I get a completely empty dataframe as the result. In 1.5.2, I got a single row with the aggregate min and max values being Double.NaN. Which is correct? I think the 1.5.2 behavior is better; otherwise I need to add special case handling for when a column is all null.
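If the 1.6.0 behavior stands, the isNull checks the reporter mentions can be centralized in a small helper. A hedged Python sketch — the helper name is invented, and a plain tuple stands in for a pyspark `Row` (which also supports positional indexing):

```python
import math

def get_double_or_nan(row, index):
    # Workaround sketch for the behavior change described above: wrap
    # every retrieval in an explicit null check instead of relying on
    # getDouble returning NaN for an all-null column.
    value = row[index]
    return float(value) if value is not None else float("nan")

# The aggregation result the reporter describes: min/max over an all-null
# column are null, and the non-null count is 0.
agg_row = (None, None, 0)  # min(foo), max(foo), nonNullCount

print(math.isnan(get_double_or_nan(agg_row, 0)))  # True
print(get_double_or_nan(agg_row, 2))              # 0.0
```

This restores the 1.5.2-style NaN result at the call site without special-casing each aggregate column.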
[jira] [Updated] (SPARK-13410) unionAll throws error with DataFrames containing UDT columns.
[ https://issues.apache.org/jira/browse/SPARK-13410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franklyn Dsouza updated SPARK-13410: Description: Unioning two DataFrames that contain UDTs fails with {quote} AnalysisException: u"unresolved operator 'Union;" {quote} I tracked this down to this line https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala#L202 Which compares datatypes between the output attributes of both logical plans. However for UDTs this will be a new instance of the UserDefinedType or PythonUserDefinedType https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L158 So this equality check will check if the two instances are the same and since they aren't references to a singleton this check fails. **Note: this will work fine if you are unioning the dataframe with itself.** I have a proposed patch for this which overrides the equality operator on the two classes here: https://github.com/apache/spark/pull/11279 Reproduction steps {code} from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT from pyspark.sql import types schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)]) #note they need to be two separate dataframes a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema) b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema) c = a.unionAll(b) {code} was: Unioning two DataFrames that contain UDTs fails with {quote} AnalysisException: u"unresolved operator 'Union;" {quote} I tracked this down to this line https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala#L202 Which compares datatypes between the output attributes of both logical plans. 
However for UDTs this will be a new instance of the UserDefinedType or PythonUserDefinedType https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L158 So this equality check will check if the two instances are the same and since they aren't references to a singleton this check fails. Note: this will work fine if you are unioning the dataframe with itself. I have a proposed patch for this which overrides the equality operator on the two classes here: https://github.com/apache/spark/pull/11279 Reproduction steps {code} from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT from pyspark.sql import types schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)]) #note they need to be two separate dataframes a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema) b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema) c = a.unionAll(b) {code} > unionAll throws error with DataFrames containing UDT columns. > - > > Key: SPARK-13410 > URL: https://issues.apache.org/jira/browse/SPARK-13410 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.6.0 >Reporter: Franklyn Dsouza > Labels: patch > Original Estimate: 3h > Remaining Estimate: 3h > > Unioning two DataFrames that contain UDTs fails with > {quote} > AnalysisException: u"unresolved operator 'Union;" > {quote} > I tracked this down to this line > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala#L202 > Which compares datatypes between the output attributes of both logical plans. 
> However for UDTs this will be a new instance of the UserDefinedType or > PythonUserDefinedType > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L158 > > So this equality check will check if the two instances are the same and since > they aren't references to a singleton this check fails. **Note: this will > work fine if you are unioning the dataframe with itself.** > I have a proposed patch for this which overrides the equality operator on the > two classes here: https://github.com/apache/spark/pull/11279 > Reproduction steps > {code} > from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT > from pyspark.sql import types > schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)]) > #note they need to be two separate dataframes > a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema) > b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema) > c = a.unionAll(b) > {code}
[jira] [Updated] (SPARK-13410) unionAll throws error with DataFrames containing UDT columns.
[ https://issues.apache.org/jira/browse/SPARK-13410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franklyn Dsouza updated SPARK-13410: Description: Unioning two DataFrames that contain UDTs fails with {quote} AnalysisException: u"unresolved operator 'Union;" {quote} I tracked this down to this line https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala#L202 Which compares datatypes between the output attributes of both logical plans. However for UDTs this will be a new instance of the UserDefinedType or PythonUserDefinedType https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L158 So this equality check will check if the two instances are the same and since they aren't references to a singleton this check fails. *Note: this will work fine if you are unioning the dataframe with itself.* I have a proposed patch for this which overrides the equality operator on the two classes here: https://github.com/apache/spark/pull/11279 Reproduction steps {code} from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT from pyspark.sql import types schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)]) #note they need to be two separate dataframes a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema) b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema) c = a.unionAll(b) {code} was: Unioning two DataFrames that contain UDTs fails with {quote} AnalysisException: u"unresolved operator 'Union;" {quote} I tracked this down to this line https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala#L202 Which compares datatypes between the output attributes of both logical plans. 
However for UDTs this will be a new instance of the UserDefinedType or PythonUserDefinedType https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L158 So this equality check will check if the two instances are the same and since they aren't references to a singleton this check fails. **Note: this will work fine if you are unioning the dataframe with itself.** I have a proposed patch for this which overrides the equality operator on the two classes here: https://github.com/apache/spark/pull/11279 Reproduction steps {code} from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT from pyspark.sql import types schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)]) #note they need to be two separate dataframes a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema) b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema) c = a.unionAll(b) {code} > unionAll throws error with DataFrames containing UDT columns. > - > > Key: SPARK-13410 > URL: https://issues.apache.org/jira/browse/SPARK-13410 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.6.0 >Reporter: Franklyn Dsouza > Labels: patch > Original Estimate: 3h > Remaining Estimate: 3h > > Unioning two DataFrames that contain UDTs fails with > {quote} > AnalysisException: u"unresolved operator 'Union;" > {quote} > I tracked this down to this line > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala#L202 > Which compares datatypes between the output attributes of both logical plans. 
> However for UDTs this will be a new instance of the UserDefinedType or > PythonUserDefinedType > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L158 > > So this equality check will check if the two instances are the same and since > they aren't references to a singleton this check fails. > *Note: this will work fine if you are unioning the dataframe with itself.* > I have a proposed patch for this which overrides the equality operator on the > two classes here: https://github.com/apache/spark/pull/11279 > Reproduction steps > {code} > from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT > from pyspark.sql import types > schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)]) > #note they need to be two separate dataframes > a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema) > b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema) > c = a.unionAll(b) > {code}
[jira] [Updated] (SPARK-13410) unionAll AnalysisException with DataFrames containing UDT columns.
[ https://issues.apache.org/jira/browse/SPARK-13410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franklyn Dsouza updated SPARK-13410: Summary: unionAll AnalysisException with DataFrames containing UDT columns. (was: unionAll throws error with DataFrames containing UDT columns.) > unionAll AnalysisException with DataFrames containing UDT columns. > -- > > Key: SPARK-13410 > URL: https://issues.apache.org/jira/browse/SPARK-13410 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.6.0 >Reporter: Franklyn Dsouza > Labels: patch > Original Estimate: 3h > Remaining Estimate: 3h > > Unioning two DataFrames that contain UDTs fails with > {quote} > AnalysisException: u"unresolved operator 'Union;" > {quote} > I tracked this down to this line > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala#L202 > Which compares datatypes between the output attributes of both logical plans. > However for UDTs this will be a new instance of the UserDefinedType or > PythonUserDefinedType > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L158 > > So this equality check will check if the two instances are the same and since > they aren't references to a singleton this check fails. 
> *Note: this will work fine if you are unioning the dataframe with itself.* > I have a proposed patch for this which overrides the equality operator on the > two classes here: https://github.com/apache/spark/pull/11279 > Reproduction steps > {code} > from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT > from pyspark.sql import types > schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)]) > #note they need to be two separate dataframes > a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema) > b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema) > c = a.unionAll(b) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
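The identity-vs-structural equality problem described above is easy to reproduce in miniature. A hedged Python sketch — the class name is invented, and the real patch in PR 11279 is in Scala — of why default identity-based equality makes the Union resolver think two structurally identical UDT schemas differ, and how a class-based `__eq__` fixes the comparison:

```python
class PythonOnlyUDTSketch:
    """Toy stand-in for a UserDefinedType instance (hypothetical name).

    Each DataFrame carries a fresh UDT instance, so the default
    identity-based equality used by the schema comparison fails even
    though the two types are structurally identical.
    """

    def __init__(self, sql_type="struct<x:double,y:double>"):
        self.sql_type = sql_type

    def __eq__(self, other):
        # The proposed fix: compare by class and underlying SQL type,
        # not by object identity.
        return type(other) is type(self) and other.sql_type == self.sql_type

    def __hash__(self):
        return hash((type(self), self.sql_type))

a_type = PythonOnlyUDTSketch()  # schema attached to DataFrame a
b_type = PythonOnlyUDTSketch()  # schema attached to DataFrame b

print(a_type is b_type)  # False: two separate instances, not a singleton
print(a_type == b_type)  # True with __eq__; identity equality would say False
```

Unioning a DataFrame with itself works today precisely because both sides then share the *same* instance, so even identity equality passes.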
[jira] [Commented] (SPARK-13408) Exception in resultHandler will shutdown SparkContext
[ https://issues.apache.org/jira/browse/SPARK-13408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154962#comment-15154962 ] Apache Spark commented on SPARK-13408: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/11280 > Exception in resultHandler will shutdown SparkContext > - > > Key: SPARK-13408 > URL: https://issues.apache.org/jira/browse/SPARK-13408 > Project: Spark > Issue Type: Bug > Reporter: Davies Liu > Assignee: Shixiong Zhu
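One way to address the failure mode in SPARK-13408 is to catch exceptions thrown by the driver-side result handler so they fail only the affected job instead of escaping and shutting down the scheduler's event loop. The following is a minimal Python sketch of that idea only; `JobWaiter` and `task_succeeded` here are hypothetical stand-ins that loosely mirror Spark's Scala classes, not the actual implementation.

```python
# Hypothetical sketch (not Spark's real code): wrap the user-supplied result
# handler so an exception it raises is recorded as a job failure rather than
# propagating out of the event loop and killing the whole context.

class JobWaiter:
    def __init__(self, result_handler):
        self.result_handler = result_handler
        self.failure = None  # set if the handler throws

    def task_succeeded(self, index, result):
        try:
            # The handler may raise, e.g. the NullPointerException from a
            # generated ordering seen in the log above.
            self.result_handler(index, result)
        except Exception as e:
            # Record the failure for this job only; the scheduler loop survives.
            self.failure = e

def broken_handler(index, result):
    raise ValueError("bad comparator")

waiter = JobWaiter(broken_handler)
waiter.task_succeeded(0, "partition result")
print(waiter.failure)  # the job sees the error; the "SparkContext" stays alive
```

With this shape, a doctest failure like the one in the report would fail the single collect() job instead of tearing down the context.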
[jira] [Assigned] (SPARK-13408) Exception in resultHandler will shutdown SparkContext
[ https://issues.apache.org/jira/browse/SPARK-13408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13408: Assignee: Apache Spark (was: Shixiong Zhu) > Exception in resultHandler will shutdown SparkContext > - > > Key: SPARK-13408 > URL: https://issues.apache.org/jira/browse/SPARK-13408 > Project: Spark > Issue Type: Bug > Reporter: Davies Liu > Assignee: Apache Spark
[jira] [Commented] (SPARK-13410) unionAll throws error with DataFrames containing UDT columns.
[ https://issues.apache.org/jira/browse/SPARK-13410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154935#comment-15154935 ] Apache Spark commented on SPARK-13410: -- User 'damnMeddlingKid' has created a pull request for this issue: https://github.com/apache/spark/pull/11279 > unionAll throws error with DataFrames containing UDT columns. > - > > Key: SPARK-13410 > URL: https://issues.apache.org/jira/browse/SPARK-13410 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.6.0 >Reporter: Franklyn Dsouza > Labels: patch > Original Estimate: 3h > Remaining Estimate: 3h > > Unioning two DataFrames that contain UDTs fails with > {quote} > AnalysisException: u"unresolved operator 'Union;" > {quote} > I tracked this down to this line > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala#L202 > Which compares datatypes between the output attributes of both logical plans. > However for UDTs this will be a new instance of the UserDefinedType or > PythonUserDefinedType > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L158 > > So this equality check will check if the two instances are the same and since > they aren't references to a singleton this check fails. Note: this will work > fine if you are unioning the dataframe with itself. 
> I have a patch for this which overrides the equality operator on the two > classes here: https://github.com/damnMeddlingKid/spark/pull/2 > Reproduction steps > {code} > from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT > from pyspark.sql import types > schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)]) > #note they need to be two separate dataframes > a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema) > b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema) > c = a.unionAll(b) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
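The identity-versus-structural equality problem the report describes can be demonstrated outside Spark. Below is a hedged sketch using plain Python classes; `UDTWithoutEq` and `UDTWithEq` are hypothetical stand-ins for `UserDefinedType`/`PythonUserDefinedType`, not the real classes, and only illustrate why overriding the equality operator (as the proposed patch does) makes the Union schema check pass.

```python
# Two separately constructed type objects compare unequal under Python's
# default identity-based equality, so a structural check like Union's schema
# comparison fails even though the types are logically identical.

class UDTWithoutEq:
    def __init__(self, sql_type):
        self.sql_type = sql_type
    # default __eq__: object identity

class UDTWithEq:
    def __init__(self, sql_type):
        self.sql_type = sql_type

    def __eq__(self, other):
        # Structural equality: same class and same underlying SQL type.
        return type(other) is type(self) and self.sql_type == other.sql_type

# Two fresh instances, as produced when two DataFrames each build their schema:
print(UDTWithoutEq("struct<x:double>") == UDTWithoutEq("struct<x:double>"))  # False
print(UDTWithEq("struct<x:double>") == UDTWithEq("struct<x:double>"))        # True
```

This also explains why unioning a DataFrame with itself works: both plans then reference the very same type instance, so even identity equality succeeds.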
[jira] [Assigned] (SPARK-13410) unionAll throws error with DataFrames containing UDT columns.
[ https://issues.apache.org/jira/browse/SPARK-13410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13410: Assignee: Apache Spark > unionAll throws error with DataFrames containing UDT columns. > - > > Key: SPARK-13410 > URL: https://issues.apache.org/jira/browse/SPARK-13410 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 1.5.0, 1.6.0 > Reporter: Franklyn Dsouza > Assignee: Apache Spark > Labels: patch
[jira] [Updated] (SPARK-13410) unionAll throws error with DataFrames containing UDT columns.
[ https://issues.apache.org/jira/browse/SPARK-13410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franklyn Dsouza updated SPARK-13410: Description: Unioning two DataFrames that contain UDTs fails with {quote} AnalysisException: u"unresolved operator 'Union;" {quote} I tracked this down to this line https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala#L202 which compares the datatypes of the output attributes of both logical plans. However, for UDTs each plan holds a new instance of UserDefinedType or PythonUserDefinedType https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L158 so the equality check tests whether the two instances are the same object, and since they aren't references to a singleton the check fails. Note: this works fine if you union the dataframe with itself. I have a proposed patch for this which overrides the equality operator on the two classes here: https://github.com/apache/spark/pull/11279 Reproduction steps {code} from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT from pyspark.sql import types schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)]) #note they need to be two separate dataframes a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema) b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema) c = a.unionAll(b) {code}
[jira] [Assigned] (SPARK-13410) unionAll throws error with DataFrames containing UDT columns.
[ https://issues.apache.org/jira/browse/SPARK-13410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13410: Assignee: (was: Apache Spark) > unionAll throws error with DataFrames containing UDT columns. > - > > Key: SPARK-13410 > URL: https://issues.apache.org/jira/browse/SPARK-13410 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 1.5.0, 1.6.0 > Reporter: Franklyn Dsouza > Labels: patch
[jira] [Updated] (SPARK-12864) Fetch failure from AM restart
[ https://issues.apache.org/jira/browse/SPARK-12864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12864: -- Summary: Fetch failure from AM restart (was: initialize executorIdCounter after ApplicationMaster killed for max number of executor failures reached)
> Fetch failure from AM restart
> -
>
> Key: SPARK-12864
> URL: https://issues.apache.org/jira/browse/SPARK-12864
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.3.1, 1.4.1, 1.5.2
> Reporter: iward
>
> Currently, when the number of executor failures reaches *maxNumExecutorFailures*, the *ApplicationMaster* is killed and a new one re-registers. A new *YarnAllocator* instance is then created, but its *executorIdCounter* property resets to *0*, so the *Id* of each new executor starts from 1 again. These IDs collide with executors created before, which causes FetchFailedException.
> For example, the following is the task log:
> {noformat}
> 2015-12-22 02:33:15 INFO 15/12/22 02:33:15 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 172.22.92.14:45125
> 2015-12-22 02:33:26 INFO 15/12/22 02:33:26 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as AkkaRpcEndpointRef(Actor[akka.tcp://sparkYarnAM@172.22.168.72:54040/user/YarnAM#-1290854604])
> {noformat}
> {noformat}
> 2015-12-22 02:35:02 INFO 15/12/22 02:35:02 INFO YarnClientSchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkexecu...@bjhc-hera-16217.hadoop.jd.local:46538/user/Executor#-790726793]) with ID 1
> {noformat}
> {noformat}
> Lost task 3.0 in stage 102.0 (TID 1963, BJHC-HERA-16217.hadoop.jd.local): FetchFailed(BlockManagerId(1, BJHC-HERA-17030.hadoop.jd.local, 7337), shuffleId=5, mapId=2, reduceId=3, message=
> 2015-12-22 02:43:20 INFO org.apache.spark.shuffle.FetchFailedException: /data3/yarn1/local/usercache/dd_edw/appcache/application_1450438154359_206399/blockmgr-b1fd0363-6d53-4d09-8086-adc4a13f4dc4/0f/shuffle_5_2_0.index (No such file or directory)
> 2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
> 2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
> 2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
> 2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> 2015-12-22 02:43:20 INFO at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> 2015-12-22 02:43:20 INFO at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 2015-12-22 02:43:20 INFO at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:154)
> 2015-12-22 02:43:20 INFO at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:149)
> 2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
> 2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
> 2015-12-22 02:43:20 INFO at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> {noformat}
> As the task log shows, the executor id of *BJHC-HERA-16217.hadoop.jd.local* is the same as that of *BJHC-HERA-17030.hadoop.jd.local*, so the IDs are confused and cause FetchFailedException.
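The counter-reset collision described in SPARK-12864 can be sketched in a few lines of Python. This is an illustrative toy only; `Allocator` is a hypothetical stand-in for `YarnAllocator`'s `executorIdCounter` logic, and initializing the restarted allocator from the highest existing ID is one plausible fix direction, not necessarily the one Spark adopted.

```python
# A restarted allocator whose counter resets to 0 re-issues executor IDs
# that are already taken; one initialized from the highest ID seen so far
# does not collide.

class Allocator:
    def __init__(self, start=0):
        self.counter = start

    def next_id(self):
        self.counter += 1
        return self.counter

first_am = Allocator()
existing = [first_am.next_id() for _ in range(3)]    # IDs 1, 2, 3 in use

restarted_bad = Allocator()                          # bug: counter reset to 0
print(restarted_bad.next_id() in existing)           # True: ID 1 collides

restarted_good = Allocator(start=max(existing))      # resume past existing IDs
print(restarted_good.next_id() in existing)          # False: ID 4 is fresh
```

A colliding ID means shuffle blocks registered under executor 1 of the old AM are looked up on the new, unrelated executor 1, which is exactly the missing-index-file FetchFailedException in the log.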
[jira] [Created] (SPARK-13410) unionAll throws error with DataFrames containing UDT columns.
Franklyn Dsouza created SPARK-13410: --- Summary: unionAll throws error with DataFrames containing UDT columns. Key: SPARK-13410 URL: https://issues.apache.org/jira/browse/SPARK-13410 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0, 1.5.0 Reporter: Franklyn Dsouza Unioning two DataFrames that contain UDTs fails with {quote} AnalysisException: u"unresolved operator 'Union;" {quote} I tracked this down to this line https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala#L202 Which compares datatypes between the output attributes of both logical plans. However for UDTs this will be a new instance of the UserDefinedType or PythonUserDefinedType https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L158 So this equality check will check if the two instances are the same and since they aren't references to a singleton this check fails. Note: this will work fine if you are unioning the dataframe with itself. I have a patch for this which overrides the equality operator on the two classes here: https://github.com/damnMeddlingKid/spark/pull/2 Reproduction steps {code} from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT from pyspark.sql import types schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)]) #note they need to be two separate dataframes a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema) b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema) c = a.unionAll(b) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13382) Update PySpark testing notes
[ https://issues.apache.org/jira/browse/SPARK-13382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154926#comment-15154926 ] Apache Spark commented on SPARK-13382: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/11278 > Update PySpark testing notes > > > Key: SPARK-13382 > URL: https://issues.apache.org/jira/browse/SPARK-13382 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Reporter: holdenk >Priority: Trivial > > As discussed on the mailing list, running the full python tests requires that > Spark is built with the hive assembly. We should update both the wiki and the > build instructions for Python to mention this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13382) Update PySpark testing notes
[ https://issues.apache.org/jira/browse/SPARK-13382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13382: Assignee: Apache Spark > Update PySpark testing notes > > > Key: SPARK-13382 > URL: https://issues.apache.org/jira/browse/SPARK-13382 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark > Reporter: holdenk > Assignee: Apache Spark > Priority: Trivial
[jira] [Assigned] (SPARK-13382) Update PySpark testing notes
[ https://issues.apache.org/jira/browse/SPARK-13382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13382: Assignee: (was: Apache Spark) > Update PySpark testing notes > > > Key: SPARK-13382 > URL: https://issues.apache.org/jira/browse/SPARK-13382 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark > Reporter: holdenk > Priority: Trivial
[jira] [Commented] (SPARK-10001) Allow Ctrl-C in spark-shell to kill running job
[ https://issues.apache.org/jira/browse/SPARK-10001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154885#comment-15154885 ] Ryan Blue commented on SPARK-10001: --- bq. I . . . am uneasy about adopting unusual semantics for a standard signal, which should rightly kill the shell by default. I just checked nodejs, python, and irb (ruby) and the behavior in all three is to return execution to the shell's prompt, not to exit. I think that makes sense since it basically mimics a normal bash shell. I agree with Jon that it would be a nice feature, though I would rather get the current patch in and address the new suggestion in a follow up issue. [~tri...@gmail.com], could you open a follow-up issue that outlines the behavior you suggest? > Allow Ctrl-C in spark-shell to kill running job > --- > > Key: SPARK-10001 > URL: https://issues.apache.org/jira/browse/SPARK-10001 > Project: Spark > Issue Type: Improvement > Components: Spark Shell >Affects Versions: 1.4.1 >Reporter: Cheolsoo Park >Priority: Minor > > Hitting Ctrl-C in spark-sql (and other tools like presto) cancels any running > job and starts a new input line on the prompt. It would be nice if > spark-shell also can do that. Otherwise, in case a user submits a job, say he > made a mistake, and wants to cancel it, he needs to exit the shell and > re-login to continue his work. Re-login can be a pain especially in Spark on > yarn, since it takes a while to allocate AM container and initial executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
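The behavior discussed in this thread (Ctrl-C cancels the running job and returns to the prompt, like in python or irb, instead of killing the shell) can be sketched with Python's standard `signal` module. This is a hedged illustration of the signal-handling pattern only; `active_jobs` and `cancel` are hypothetical stand-ins for the shell's real job tracking, not spark-shell's actual (Scala/JVM) implementation.

```python
import signal

# On SIGINT: if a job is running, cancel it and keep the shell alive;
# otherwise fall back to the default behavior of interrupting the shell.

active_jobs = []  # hypothetical registry of running job IDs

def cancel(job):
    active_jobs.remove(job)

def on_sigint(signum, frame):
    if active_jobs:
        # A job is running: cancel it, return control to the prompt.
        for job in list(active_jobs):
            cancel(job)
    else:
        # No job running: behave like a normal Ctrl-C.
        raise KeyboardInterrupt

signal.signal(signal.SIGINT, on_sigint)

# Simulate Ctrl-C while a job is running:
active_jobs.append("job-0")
on_sigint(signal.SIGINT, None)
print(active_jobs)  # []  -- job cancelled, shell still alive
```

A second Ctrl-C with no job running would then raise `KeyboardInterrupt`, preserving the standard way to exit, which addresses the concern about adopting unusual semantics for a standard signal.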
[jira] [Updated] (SPARK-13409) Remember the stacktrace when stop a SparkContext
[ https://issues.apache.org/jira/browse/SPARK-13409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-13409: --- Component/s: Spark Core > Remember the stacktrace when stop a SparkContext > > > Key: SPARK-13409 > URL: https://issues.apache.org/jira/browse/SPARK-13409 > Project: Spark > Issue Type: Bug > Components: Spark Core > Reporter: Davies Liu > > Sometimes we see a stopped SparkContext and have no idea what stopped it; we should remember the stack trace of the stop() call for troubleshooting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
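The proposal in SPARK-13409 can be sketched in Python: record the call stack at the moment `stop()` is called, and surface it when the stopped context is used later. `Context` here is a hypothetical toy, not PySpark's real `SparkContext`; it only illustrates the capture-at-stop idea.

```python
import traceback

class Context:
    def __init__(self):
        self.stopped = False
        self.stop_site = None  # stack trace of whoever called stop()

    def stop(self):
        if not self.stopped:
            self.stopped = True
            # Remember where the stop came from, for later troubleshooting.
            self.stop_site = "".join(traceback.format_stack())

    def run_job(self):
        if self.stopped:
            raise RuntimeError(
                "Context was stopped; stop() was called from:\n" + self.stop_site)

ctx = Context()
ctx.stop()
try:
    ctx.run_job()
except RuntimeError as e:
    print("stop() was called from" in str(e))  # True: the error names the stopper
```

With this, "who stopped the context?" is answered by the error message itself instead of requiring a debugger session.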
[jira] [Created] (SPARK-13409) Remember the stacktrace when stop a SparkContext
Davies Liu created SPARK-13409: -- Summary: Remember the stacktrace when stop a SparkContext Key: SPARK-13409 URL: https://issues.apache.org/jira/browse/SPARK-13409 Project: Spark Issue Type: Bug Reporter: Davies Liu Sometimes we see a stopped SparkContext and have no idea what stopped it; we should remember the stack trace of the stop() call for troubleshooting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13408) Exception in resultHandler will shutdown SparkContext
Davies Liu created SPARK-13408: -- Summary: Exception in resultHandler will shutdown SparkContext Key: SPARK-13408 URL: https://issues.apache.org/jira/browse/SPARK-13408 Project: Spark Issue Type: Bug Reporter: Davies Liu Assignee: Shixiong Zhu {code} davies@localhost:~/work/spark$ bin/spark-submit python/pyspark/sql/dataframe.py NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. 16/02/19 12:46:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 16/02/19 12:46:24 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.0.1; using 192.168.0.143 instead (on interface en0) 16/02/19 12:46:24 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address ** File "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 554, in pyspark.sql.dataframe.DataFrame.alias Failed example: joined_df.select(col("df_as1.name"), col("df_as2.name"), col("df_as2.age")).collect() Differences (ndiff with -expected +actual): - [Row(name=u'Bob', name=u'Bob', age=5), Row(name=u'Alice', name=u'Alice', age=2)] + [Row(name=u'Alice', name=u'Alice', age=2), Row(name=u'Bob', name=u'Bob', age=5)] org.apache.spark.SparkDriverExecutionException: Execution error at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1189) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1658) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157) at 
org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148) at scala.math.Ordering$$anon$4.compare(Ordering.scala:111) at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649) at java.util.PriorityQueue.siftUp(PriorityQueue.java:627) at java.util.PriorityQueue.offer(PriorityQueue.java:329) at org.apache.spark.util.BoundedPriorityQueue.$plus$eq(BoundedPriorityQueue.scala:47) at org.apache.spark.util.BoundedPriorityQueue$$anonfun$$plus$plus$eq$1.apply(BoundedPriorityQueue.scala:41) at org.apache.spark.util.BoundedPriorityQueue$$anonfun$$plus$plus$eq$1.apply(BoundedPriorityQueue.scala:41) at scala.collection.Iterator$class.foreach(Iterator.scala:742) at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.util.BoundedPriorityQueue.foreach(BoundedPriorityQueue.scala:31) at org.apache.spark.util.BoundedPriorityQueue.$plus$plus$eq(BoundedPriorityQueue.scala:41) at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$apply$46.apply(RDD.scala:1319) at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$apply$46.apply(RDD.scala:1318) at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:932) at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:929) at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:57) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1185) ... 
4 more org.apache.spark.SparkDriverExecutionException: Execution error at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1189) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1658) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157) at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148) at scala.math.Ordering$$anon$4.compare(Ordering.scala:111) at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649) at java.util.PriorityQueue.siftUp(PriorityQueue.java:627) at
[jira] [Assigned] (SPARK-13387) Add support for SPARK_DAEMON_JAVA_OPTS with MesosClusterDispatcher.
[ https://issues.apache.org/jira/browse/SPARK-13387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13387: Assignee: Apache Spark > Add support for SPARK_DAEMON_JAVA_OPTS with MesosClusterDispatcher. > --- > > Key: SPARK-13387 > URL: https://issues.apache.org/jira/browse/SPARK-13387 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Timothy Chen >Assignee: Apache Spark >Priority: Minor > > As SPARK_JAVA_OPTS is being deprecated, MesosClusterDispatcher should also support SPARK_DAEMON_JAVA_OPTS so that Java properties can still be set for it.
[jira] [Assigned] (SPARK-13387) Add support for SPARK_DAEMON_JAVA_OPTS with MesosClusterDispatcher.
[ https://issues.apache.org/jira/browse/SPARK-13387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13387: Assignee: (was: Apache Spark) > Add support for SPARK_DAEMON_JAVA_OPTS with MesosClusterDispatcher. > --- > > Key: SPARK-13387 > URL: https://issues.apache.org/jira/browse/SPARK-13387 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Timothy Chen >Priority: Minor > > As SPARK_JAVA_OPTS is being deprecated, MesosClusterDispatcher should also support SPARK_DAEMON_JAVA_OPTS so that Java properties can still be set for it.
[jira] [Commented] (SPARK-13387) Add support for SPARK_DAEMON_JAVA_OPTS with MesosClusterDispatcher.
[ https://issues.apache.org/jira/browse/SPARK-13387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154842#comment-15154842 ] Apache Spark commented on SPARK-13387: -- User 'tnachen' has created a pull request for this issue: https://github.com/apache/spark/pull/11277 > Add support for SPARK_DAEMON_JAVA_OPTS with MesosClusterDispatcher. > --- > > Key: SPARK-13387 > URL: https://issues.apache.org/jira/browse/SPARK-13387 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Timothy Chen >Priority: Minor > > As SPARK_JAVA_OPTS is being deprecated, MesosClusterDispatcher should also support SPARK_DAEMON_JAVA_OPTS so that Java properties can still be set for it.
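The behavior requested in SPARK-13387 boils down to a precedence rule. A hedged sketch of that rule (the `daemon_java_opts` helper is hypothetical; the real change lands in the dispatcher's startup scripts): prefer SPARK_DAEMON_JAVA_OPTS, fall back to the deprecated SPARK_JAVA_OPTS.

```python
def daemon_java_opts(env):
    # Hypothetical precedence: the daemon-specific variable wins,
    # the deprecated SPARK_JAVA_OPTS is the fallback, else empty.
    return env.get("SPARK_DAEMON_JAVA_OPTS") or env.get("SPARK_JAVA_OPTS", "")

print(daemon_java_opts({"SPARK_DAEMON_JAVA_OPTS": "-Dspark.deploy.zookeeper.url=zk:2181"}))
```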
[jira] [Commented] (SPARK-13407) TaskMetrics.fromAccumulatorUpdates can crash when trying to access garbage-collected accumulators
[ https://issues.apache.org/jira/browse/SPARK-13407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154826#comment-15154826 ] Apache Spark commented on SPARK-13407: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/11276 > TaskMetrics.fromAccumulatorUpdates can crash when trying to access > garbage-collected accumulators > - > > Key: SPARK-13407 > URL: https://issues.apache.org/jira/browse/SPARK-13407 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > TaskMetrics.fromAccumulatorUpdates can fail if accumulators have been > garbage-collected: > {code} > java.lang.IllegalAccessError: Attempted to access garbage collected > accumulator 481596 > at > org.apache.spark.Accumulators$$anonfun$get$1$$anonfun$apply$1.apply(Accumulator.scala:133) > at > org.apache.spark.Accumulators$$anonfun$get$1$$anonfun$apply$1.apply(Accumulator.scala:133) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.Accumulators$$anonfun$get$1.apply(Accumulator.scala:132) > at > org.apache.spark.Accumulators$$anonfun$get$1.apply(Accumulator.scala:130) > at scala.Option.map(Option.scala:145) > at org.apache.spark.Accumulators$.get(Accumulator.scala:130) > at > org.apache.spark.executor.TaskMetrics$$anonfun$9.apply(TaskMetrics.scala:414) > at > org.apache.spark.executor.TaskMetrics$$anonfun$9.apply(TaskMetrics.scala:412) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) > at > 
org.apache.spark.executor.TaskMetrics$.fromAccumulatorUpdates(TaskMetrics.scala:412) > at > org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onExecutorMetricsUpdate$2.apply(JobProgressListener.scala:499) > at > org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onExecutorMetricsUpdate$2.apply(JobProgressListener.scala:493) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > at > org.apache.spark.ui.jobs.JobProgressListener.onExecutorMetricsUpdate(JobProgressListener.scala:493) > at > org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:56) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:35) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:81) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:65) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1178) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:64) > {code} > In order to guard 
against this, we can eliminate the need to access > driver-side accumulators when constructing TaskMetrics.
[jira] [Created] (SPARK-13407) TaskMetrics.fromAccumulatorUpdates can crash when trying to access garbage-collected accumulators
Josh Rosen created SPARK-13407: -- Summary: TaskMetrics.fromAccumulatorUpdates can crash when trying to access garbage-collected accumulators Key: SPARK-13407 URL: https://issues.apache.org/jira/browse/SPARK-13407 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0 Reporter: Josh Rosen Assignee: Josh Rosen TaskMetrics.fromAccumulatorUpdates can fail if accumulators have been garbage-collected: {code} java.lang.IllegalAccessError: Attempted to access garbage collected accumulator 481596 at org.apache.spark.Accumulators$$anonfun$get$1$$anonfun$apply$1.apply(Accumulator.scala:133) at org.apache.spark.Accumulators$$anonfun$get$1$$anonfun$apply$1.apply(Accumulator.scala:133) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.Accumulators$$anonfun$get$1.apply(Accumulator.scala:132) at org.apache.spark.Accumulators$$anonfun$get$1.apply(Accumulator.scala:130) at scala.Option.map(Option.scala:145) at org.apache.spark.Accumulators$.get(Accumulator.scala:130) at org.apache.spark.executor.TaskMetrics$$anonfun$9.apply(TaskMetrics.scala:414) at org.apache.spark.executor.TaskMetrics$$anonfun$9.apply(TaskMetrics.scala:412) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.executor.TaskMetrics$.fromAccumulatorUpdates(TaskMetrics.scala:412) at org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onExecutorMetricsUpdate$2.apply(JobProgressListener.scala:499) at org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onExecutorMetricsUpdate$2.apply(JobProgressListener.scala:493) at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.ui.jobs.JobProgressListener.onExecutorMetricsUpdate(JobProgressListener.scala:493) at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35) at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:35) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:81) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:65) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1178) at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:64) {code} In order to guard against this, we can eliminate the need to access driver-side accumulators when constructing TaskMetrics.
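The failure mode in SPARK-13407 is easy to reproduce in miniature: the driver holds accumulators through weak references, so an entry can vanish between registration and lookup. A minimal Python sketch (names invented here; Spark's registry is not structured exactly like this) showing the guarded lookup that avoids the crash:

```python
import gc
import weakref

class Acc:
    """Toy accumulator holding a value."""
    def __init__(self, value):
        self.value = value

registry = {}  # id -> weak reference, mirroring a driver-side weak map

def register(acc_id, acc):
    registry[acc_id] = weakref.ref(acc)

def lookup(acc_id):
    # Guarded lookup: a collected accumulator yields None instead of an error.
    ref = registry.get(acc_id)
    return ref() if ref is not None else None

a = Acc(42)
register(1, a)
assert lookup(1).value == 42
del a          # simulate the accumulator being garbage-collected
gc.collect()
print(lookup(1))  # prints None once the object is gone
```

The unguarded version is what raises "Attempted to access garbage collected accumulator"; returning None (or avoiding the driver-side lookup entirely, as the ticket proposes) sidesteps it.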
[jira] [Created] (SPARK-13406) NPE in LazilyGeneratedOrdering
Davies Liu created SPARK-13406: -- Summary: NPE in LazilyGeneratedOrdering Key: SPARK-13406 URL: https://issues.apache.org/jira/browse/SPARK-13406 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Josh Rosen {code} File "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line ?, in pyspark.sql.dataframe.DataFrameStatFunctions.sampleBy Failed example: sampled.groupBy("key").count().orderBy("key").show() Exception raised: Traceback (most recent call last): File "//anaconda/lib/python2.7/doctest.py", line 1315, in __run compileflags, 1) in test.globs File "", line 1, in sampled.groupBy("key").count().orderBy("key").show() File "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 217, in show print(self._jdf.showString(n, truncate)) File "/Users/davies/work/spark/python/lib/py4j-0.9.1-src.zip/py4j/java_gateway.py", line 835, in __call__ answer, self.gateway_client, self.target_id, self.name) File "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco return f(*a, **kw) File "/Users/davies/work/spark/python/lib/py4j-0.9.1-src.zip/py4j/protocol.py", line 310, in get_return_value format(target_id, ".", name), value) Py4JJavaError: An error occurred while calling o681.showString. 
: org.apache.spark.SparkDriverExecutionException: Execution error at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1189) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1658) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:623) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1782) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845) at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:937) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) at org.apache.spark.rdd.RDD.withScope(RDD.scala:323) at org.apache.spark.rdd.RDD.reduce(RDD.scala:919) at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1318) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) at org.apache.spark.rdd.RDD.withScope(RDD.scala:323) at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1305) at org.apache.spark.sql.execution.TakeOrderedAndProject.executeCollect(limit.scala:94) at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:157) at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1520) at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1520) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53) at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1769) at 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1519) at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1526) at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1396) at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1395) at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:1782) at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1395) at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1477) at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:167) at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) at py4j.Gateway.invoke(Gateway.java:290) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at
[jira] [Updated] (SPARK-13384) Keep attribute qualifiers after dedup in Analyzer
[ https://issues.apache.org/jira/browse/SPARK-13384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13384: - Assignee: Liang-Chi Hsieh > Keep attribute qualifiers after dedup in Analyzer > - > > Key: SPARK-13384 > URL: https://issues.apache.org/jira/browse/SPARK-13384 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.0.0 > > > When we de-duplicate attributes in the Analyzer, we create new attributes but do not keep the original qualifiers, so some plans fail to be analyzed.
[jira] [Resolved] (SPARK-13384) Keep attribute qualifiers after dedup in Analyzer
[ https://issues.apache.org/jira/browse/SPARK-13384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13384. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11261 [https://github.com/apache/spark/pull/11261] > Keep attribute qualifiers after dedup in Analyzer > - > > Key: SPARK-13384 > URL: https://issues.apache.org/jira/browse/SPARK-13384 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh > Fix For: 2.0.0 > > > When we de-duplicate attributes in the Analyzer, we create new attributes but do not keep the original qualifiers, so some plans fail to be analyzed.
[jira] [Comment Edited] (SPARK-13046) Partitioning looks broken in 1.6
[ https://issues.apache.org/jira/browse/SPARK-13046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152718#comment-15152718 ] Julien Baley edited comment on SPARK-13046 at 2/19/16 7:56 PM: --- Sorry it took me so long to come back to you. We're using Hive (and Java), and I'm calling `hiveContext.createExternalTable("table_name", "s3://bucket/some_path/", "parquet");`, i.e. I believe I'm passing the correct path and then Spark perhaps infers something wrongly in the middle? I've changed my call to: `hiveContext.createExternalTable("table_name", "parquet", ImmutableMap.of("path", "s3://bucket/some_path/", "basePath", "s3://bucket/some_path/"));` and it seems to work. I still feel there's a bug (somewhere between createExternalTable and the partitioning code), since I really shouldn't have to pass the same value twice to createExternalTable, should I? was (Author: julien.baley): Sorry it took me so long to come back to you. We're using Hive (and Java), and I'm calling `hiveContext.createExternalTable("table_name", "s3://bucket/some_path/", "parquet");`, i.e. I believe I'm passing the correct path and then Spark perhaps infers something wrongly in the middle? I've changed my call to: `hiveContext.createExternalTable("table_name", "parquet", ImmutableMap.of("path", "s3://bucket/some_path/", "basePath", "s3://bucket/some_path/"));` Is that what you meant [~yhuai]? It gets me: org.apache.spark.SparkException: Failed to merge incompatible data types StringType and StructType(StructField(name,StringType,true), StructField(version,StringType,true)) when I try to query it afterwards, so I assume things still go wrong underneath.
> Partitioning looks broken in 1.6 > > > Key: SPARK-13046 > URL: https://issues.apache.org/jira/browse/SPARK-13046 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Julien Baley > > Hello, > I have a list of files in s3: > {code} > s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some > parquet files} > s3://bucket/some_path/date_received=2016-01-14/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some > parquet files} > s3://bucket/some_path/date_received=2016-01-15/fingerprint=2f6a09d370b4021d/{_SUCCESS,metadata,some > parquet files} > {code} > Until 1.5.2, it all worked well and passing s3://bucket/some_path/ (the same > for the three lines) would correctly identify 2 pairs of key/value, one > `date_received` and one `fingerprint`. > From 1.6.0, I get the following exception: > {code} > assertion failed: Conflicting directory structures detected. Suspicious paths > s3://bucket/some_path/date_received=2016-01-13 > s3://bucket/some_path/date_received=2016-01-14 > s3://bucket/some_path/date_received=2016-01-15 > {code} > That is to say, the partitioning code now fails to identify > date_received=2016-01-13 as a key/value pair. > I can see that there has been some activity on > spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala > recently, so that seems related (especially the commits > https://github.com/apache/spark/commit/7b5d9051cf91c099458d092a6705545899134b3b > and > https://github.com/apache/spark/commit/de289bf279e14e47859b5fbcd70e97b9d0759f14 > ). > If I read correctly the tests added in those commits: > -they don't seem to actually test the return value, only that it doesn't crash > -they only test cases where the s3 path contain 1 key/value pair (which > otherwise would catch the bug) > This is problematic for us as we're trying to migrate all of our spark > services to 1.6.0 and this bug is a real blocker. 
I know it's possible to > force a 'union', but I'd rather not do that if the bug can be fixed. > Any questions, please shoot.
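The discovery logic at issue in SPARK-13046 can be illustrated with a toy parser (a hypothetical helper, not Spark's actual `PartitioningUtils`): key=value partition pairs are only recoverable relative to a single shared base path, which is why every input path must agree on one base.

```python
def parse_partitions(path, base_path):
    """Hypothetical illustration: extract key=value partition pairs from
    the portion of `path` below `base_path`."""
    if not path.startswith(base_path):
        raise ValueError("path is outside the base path")
    rel = path[len(base_path):].strip("/")
    pairs = {}
    for segment in rel.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            pairs[key] = value
    return pairs

print(parse_partitions(
    "s3://bucket/some_path/date_received=2016-01-13/fingerprint=2f6a09d370b4021d",
    "s3://bucket/some_path/"))
```

Passing `basePath` explicitly, as in the comment above, pins the anchor that discovery would otherwise have to infer from the set of leaf directories.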
[jira] [Commented] (SPARK-13341) Casting Unix timestamp to SQL timestamp fails
[ https://issues.apache.org/jira/browse/SPARK-13341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154694#comment-15154694 ] Srinivasa Reddy Vundela commented on SPARK-13341: - I guess the following commit is the reason for the change: https://github.com/apache/spark/commit/9ed4ad4265cf9d3135307eb62dae6de0b220fc21 It seems HIVE-3454 was fixed in Hive 1.2.0, so customers using earlier versions of Hive will see this problem. > Casting Unix timestamp to SQL timestamp fails > - > > Key: SPARK-13341 > URL: https://issues.apache.org/jira/browse/SPARK-13341 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: William Dee > > The way that unix timestamp casting is handled has been broken between Spark > 1.5.2 and Spark 1.6.0. This can be easily demonstrated via the spark-shell: > {code:title=1.5.2} > scala> sqlContext.sql("SELECT CAST(145558084 AS TIMESTAMP) as ts, > CAST(CAST(145558084 AS TIMESTAMP) AS DATE) as d").show > ++--+ > | ts| d| > ++--+ > |2016-02-16 00:00:...|2016-02-16| > ++--+ > {code} > {code:title=1.6.0} > scala> sqlContext.sql("SELECT CAST(145558084 AS TIMESTAMP) as ts, > CAST(CAST(145558084 AS TIMESTAMP) AS DATE) as d").show > ++--+ > | ts| d| > ++--+ > |48095-07-09 12:06...|095-07-09| > ++--+ > {code} > I'm not sure what exactly is causing this, but this defect has definitely been > introduced in Spark 1.6.0, as jobs that relied on this functionality ran on > 1.5.2 and now don't run on 1.6.0.
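One plausible reading of the symptom above (an assumption on our part, not confirmed in the thread): a millisecond epoch value interpreted as seconds lands roughly 46,000 years in the future, which is exactly the ballpark of the garbage `48095-07-09` timestamp. A quick check with an illustrative value (the literal in the report is truncated, so this number is our own):

```python
from datetime import datetime, timezone

SECONDS_PER_YEAR = 365.25 * 24 * 3600
ms_epoch = 1455580800000  # illustrative millisecond epoch (2016-02-16 UTC)

# Correct interpretation: scale milliseconds down to seconds first.
correct = datetime.fromtimestamp(ms_epoch / 1000, tz=timezone.utc)
print(correct.date())  # 2016-02-16

# Buggy interpretation: the same number read as seconds. datetime cannot
# even represent the result, so estimate the resulting year arithmetically.
wrong_year = 1970 + ms_epoch / SECONDS_PER_YEAR
print(round(wrong_year))  # ~48095, the same ballpark as the report
```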
[jira] [Updated] (SPARK-13375) PySpark API Utils missing item: kFold
[ https://issues.apache.org/jira/browse/SPARK-13375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Wu updated SPARK-13375: - Affects Version/s: (was: 1.6.0) 1.5.0 > PySpark API Utils missing item: kFold > - > > Key: SPARK-13375 > URL: https://issues.apache.org/jira/browse/SPARK-13375 > Project: Spark > Issue Type: Task > Components: MLlib, PySpark >Affects Versions: 1.5.0 >Reporter: Bruno Wu >Priority: Minor > > The kFold function has not been implemented in MLUtils in the Python API for MLlib > (pyspark.mllib.util as of 1.6.0). > This JIRA ticket is opened to add this function to pyspark.mllib.util.
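For reference, the requested function mirrors Scala's MLUtils.kFold. A plain-Python sketch of the semantics (the `k_fold` helper is hypothetical and operates on a list rather than an RDD): shuffle once, slice into k folds, and pair each fold (validation) with the union of the others (training).

```python
import random

def k_fold(data, k, seed=0):
    """Hypothetical sketch of kFold semantics: return k (training,
    validation) splits of `data`."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    return [
        ([x for j, fold in enumerate(folds) if j != i for x in fold], folds[i])
        for i in range(k)
    ]

splits = k_fold(range(10), k=5)
print(len(splits))  # 5 (training, validation) pairs
```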
[jira] [Updated] (SPARK-13405) Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering
[ https://issues.apache.org/jira/browse/SPARK-13405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-13405: - Labels: flaky-test (was: ) > Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering > -- > > Key: SPARK-13405 > URL: https://issues.apache.org/jira/browse/SPARK-13405 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Labels: flaky-test > > {code} > org.scalatest.exceptions.TestFailedException: > Assert failed: : null equaled null onQueryTerminated called before > onQueryStarted > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) > > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) > > org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector$$anonfun$onQueryTerminated$1.apply$mcV$sp(ContinuousQueryListenerSuite.scala:204) > > org.scalatest.concurrent.AsyncAssertions$Waiter.apply(AsyncAssertions.scala:349) > > org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector.onQueryTerminated(ContinuousQueryListenerSuite.scala:203) > > org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:67) > > org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:32) > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > > org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.postToAll(ContinuousQueryListenerBus.scala:32) > {code}
[jira] [Assigned] (SPARK-13405) Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering
[ https://issues.apache.org/jira/browse/SPARK-13405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13405: Assignee: Apache Spark (was: Shixiong Zhu) > Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering > -- > > Key: SPARK-13405 > URL: https://issues.apache.org/jira/browse/SPARK-13405 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Apache Spark > > {code} > org.scalatest.exceptions.TestFailedException: > Assert failed: : null equaled null onQueryTerminated called before > onQueryStarted > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) > > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) > > org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector$$anonfun$onQueryTerminated$1.apply$mcV$sp(ContinuousQueryListenerSuite.scala:204) > > org.scalatest.concurrent.AsyncAssertions$Waiter.apply(AsyncAssertions.scala:349) > > org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector.onQueryTerminated(ContinuousQueryListenerSuite.scala:203) > > org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:67) > > org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:32) > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > > org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.postToAll(ContinuousQueryListenerBus.scala:32) > {code}
[jira] [Commented] (SPARK-13405) Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering
[ https://issues.apache.org/jira/browse/SPARK-13405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154580#comment-15154580 ] Apache Spark commented on SPARK-13405: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/11275 > Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering > -- > > Key: SPARK-13405 > URL: https://issues.apache.org/jira/browse/SPARK-13405 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > {code} > org.scalatest.exceptions.TestFailedException: > Assert failed: : null equaled null onQueryTerminated called before > onQueryStarted > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) > > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) > > org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector$$anonfun$onQueryTerminated$1.apply$mcV$sp(ContinuousQueryListenerSuite.scala:204) > > org.scalatest.concurrent.AsyncAssertions$Waiter.apply(AsyncAssertions.scala:349) > > org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector.onQueryTerminated(ContinuousQueryListenerSuite.scala:203) > > org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:67) > > org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:32) > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > > org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.postToAll(ContinuousQueryListenerBus.scala:32) > {code}
[jira] [Assigned] (SPARK-13405) Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering
[ https://issues.apache.org/jira/browse/SPARK-13405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13405: Assignee: Shixiong Zhu (was: Apache Spark) > Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering > -- > > Key: SPARK-13405 > URL: https://issues.apache.org/jira/browse/SPARK-13405 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > {code} > org.scalatest.exceptions.TestFailedException: > Assert failed: : null equaled null onQueryTerminated called before > onQueryStarted > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) > > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) > > org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector$$anonfun$onQueryTerminated$1.apply$mcV$sp(ContinuousQueryListenerSuite.scala:204) > > org.scalatest.concurrent.AsyncAssertions$Waiter.apply(AsyncAssertions.scala:349) > > org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector.onQueryTerminated(ContinuousQueryListenerSuite.scala:203) > > org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:67) > > org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:32) > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > > org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.postToAll(ContinuousQueryListenerBus.scala:32) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1405: - Description: Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike current machine learning algorithms in MLlib, which use optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core. Algorithm survey from Pedro: https://docs.google.com/document/d/13MfroPXEEGKgaQaZlHkg1wdJMtCN5d8aHJuVkiOrOK4/edit?usp=sharing API design doc from Joseph: https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing was: Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike current machine learning algorithms in MLlib, which use optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core. > parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib > - > > Key: SPARK-1405 > URL: https://issues.apache.org/jira/browse/SPARK-1405 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xusen Yin >Assignee: Joseph K. Bradley >Priority: Critical > Labels: features > Fix For: 1.3.0 > > Attachments: performance_comparison.png > > Original Estimate: 336h > Remaining Estimate: 336h > > Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts > topics from a text corpus. 
Unlike current machine learning algorithms > in MLlib, which use optimization algorithms such as gradient descent, > LDA uses expectation algorithms such as Gibbs sampling. > In this PR, I prepare an LDA implementation based on Gibbs sampling, with a > wholeTextFiles API (solved yet), a word segmentation (import from Lucene), > and a Gibbs sampling core. > Algorithm survey from Pedro: > https://docs.google.com/document/d/13MfroPXEEGKgaQaZlHkg1wdJMtCN5d8aHJuVkiOrOK4/edit?usp=sharing > API design doc from Joseph: > https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
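The description above names collapsed Gibbs sampling as the inference scheme for LDA. As a hedged illustration only (a toy sampler in plain Python, not the MLlib or PR code; the tiny corpus and the values of K, alpha, and beta are invented for the example), the per-token resampling step looks like this:

```python
# Toy collapsed Gibbs sampler for LDA (illustrative sketch, not Spark/MLlib code).
import random

random.seed(0)

docs = [[0, 1, 0, 2], [2, 3, 2, 3], [0, 1, 3, 2]]  # word ids per document (made up)
V, K, alpha, beta = 4, 2, 0.5, 0.5                 # vocab size, topics, priors (made up)

ndk = [[0] * K for _ in docs]        # topic counts per document
nkw = [[0] * V for _ in range(K)]    # word counts per topic
nk = [0] * K                         # total tokens per topic
z = []                               # current topic assignment per token

# Random initialization of topic assignments.
for d, doc in enumerate(docs):
    zs = []
    for w in doc:
        t = random.randrange(K)
        zs.append(t)
        ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    z.append(zs)

for _ in range(50):                  # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            # Remove the token's current assignment from the counts.
            ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
            # Full conditional P(z = k | everything else), up to a constant.
            weights = [(ndk[d][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta)
                       for k in range(K)]
            r = random.random() * sum(weights)
            t, acc = 0, weights[0]
            while acc < r:
                t += 1
                acc += weights[t]
            # Record the newly sampled topic.
            z[d][i] = t
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1

print(sum(nk))  # total token count is invariant across sweeps: 12
```

The counts stay consistent across sweeps because each resample removes and re-adds exactly one token; that invariant is the usual sanity check for a collapsed sampler.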
[jira] [Created] (SPARK-13405) Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering
Shixiong Zhu created SPARK-13405: Summary: Flaky test: o.a.s.sql.util.ContinuousQueryListenerSuite.event ordering Key: SPARK-13405 URL: https://issues.apache.org/jira/browse/SPARK-13405 Project: Spark Issue Type: Bug Components: Streaming Reporter: Shixiong Zhu Assignee: Shixiong Zhu {code} org.scalatest.exceptions.TestFailedException: Assert failed: : null equaled null onQueryTerminated called before onQueryStarted org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector$$anonfun$onQueryTerminated$1.apply$mcV$sp(ContinuousQueryListenerSuite.scala:204) org.scalatest.concurrent.AsyncAssertions$Waiter.apply(AsyncAssertions.scala:349) org.apache.spark.sql.util.ContinuousQueryListenerSuite$QueryStatusCollector.onQueryTerminated(ContinuousQueryListenerSuite.scala:203) org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:67) org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.doPostEvent(ContinuousQueryListenerBus.scala:32) org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) org.apache.spark.sql.execution.streaming.ContinuousQueryListenerBus.postToAll(ContinuousQueryListenerBus.scala:32) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13304) Broadcast join with two ints could be very slow
[ https://issues.apache.org/jira/browse/SPARK-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13304: Assignee: Apache Spark > Broadcast join with two ints could be very slow > --- > > Key: SPARK-13304 > URL: https://issues.apache.org/jira/browse/SPARK-13304 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > If the two join columns have the same value, their combined hash code (a > ^ b) is always 0, so every key lands in a single HashMap bucket and lookups degrade to linear scans, making the HashMap very slow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13304) Broadcast join with two ints could be very slow
[ https://issues.apache.org/jira/browse/SPARK-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154570#comment-15154570 ] Apache Spark commented on SPARK-13304: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/11188 > Broadcast join with two ints could be very slow > --- > > Key: SPARK-13304 > URL: https://issues.apache.org/jira/browse/SPARK-13304 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu > > If the two join columns have the same value, their combined hash code (a > ^ b) is always 0, so every key lands in a single HashMap bucket and lookups degrade to linear scans, making the HashMap very slow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13304) Broadcast join with two ints could be very slow
[ https://issues.apache.org/jira/browse/SPARK-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13304: Assignee: (was: Apache Spark) > Broadcast join with two ints could be very slow > --- > > Key: SPARK-13304 > URL: https://issues.apache.org/jira/browse/SPARK-13304 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu > > If the two join columns have the same value, their combined hash code (a > ^ b) is always 0, so every key lands in a single HashMap bucket and lookups degrade to linear scans, making the HashMap very slow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
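The degenerate hash described in SPARK-13304 is easy to reproduce outside Spark. Below is a minimal plain-Python sketch (not Spark's actual hashing code; the function names are invented for the example) showing why `a ^ b` collapses equal-valued two-int keys into one bucket, while a simple multiplicative mix keeps them apart:

```python
# Why hashing a two-int join key as (a ^ b) is pathological when both
# columns carry the same value: a ^ a == 0 for every a.

def xor_hash(a, b):
    # The problematic scheme: equal pairs all hash to 0.
    return a ^ b

def mixed_hash(a, b):
    # A Java-style multiplicative mix (31 * h + x) spreads equal pairs
    # across distinct values instead of collapsing them.
    return (31 * a + b) & 0x7FFFFFFF

keys = [(i, i) for i in range(1000)]          # join keys where both ints match
xor_buckets = {xor_hash(a, b) for a, b in keys}
mixed_buckets = {mixed_hash(a, b) for a, b in keys}

print(len(xor_buckets))    # 1    -- all 1000 keys collide into one bucket
print(len(mixed_buckets))  # 1000 -- no collisions for this input
```

With every key in one bucket, a hash map's expected O(1) lookup becomes an O(n) chain scan, which matches the "very very slow" behavior the issue reports.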
[jira] [Assigned] (SPARK-13404) Create the variables for input when it's used
[ https://issues.apache.org/jira/browse/SPARK-13404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13404: Assignee: Apache Spark (was: Davies Liu) > Create the variables for input when it's used > - > > Key: SPARK-13404 > URL: https://issues.apache.org/jira/browse/SPARK-13404 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > Right now, we create the variables in the first operator (usually > InputAdapter); that work is wasted if most rows are filtered out immediately. > We should defer creating them until they are used by the following operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13404) Create the variables for input when it's used
[ https://issues.apache.org/jira/browse/SPARK-13404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154559#comment-15154559 ] Apache Spark commented on SPARK-13404: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/11274 > Create the variables for input when it's used > - > > Key: SPARK-13404 > URL: https://issues.apache.org/jira/browse/SPARK-13404 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > Right now, we create the variables in the first operator (usually > InputAdapter); that work is wasted if most rows are filtered out immediately. > We should defer creating them until they are used by the following operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13404) Create the variables for input when it's used
[ https://issues.apache.org/jira/browse/SPARK-13404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13404: Assignee: Davies Liu (was: Apache Spark) > Create the variables for input when it's used > - > > Key: SPARK-13404 > URL: https://issues.apache.org/jira/browse/SPARK-13404 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > Right now, we create the variables in the first operator (usually > InputAdapter); that work is wasted if most rows are filtered out immediately. > We should defer creating them until they are used by the following operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
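SPARK-13404 is about deferring per-column variable creation in generated code until an operator actually consumes the column. As a hedged sketch of the idea only (plain Python, not Spark's whole-stage codegen; the row shape and decode cost are invented stand-ins), the eager and deferred variants differ in how much work survives the filter:

```python
# Model of eager vs. deferred column materialization: decoding a column
# is the "variable creation" work; the filter discards half the rows.

rows = [(i, f"payload-{i}") for i in range(10)]  # (key, expensive column)

def eager(rows):
    # Decode every column up front, even for rows the filter rejects.
    decoded, out = 0, []
    for a, b in rows:
        big = b.upper()          # wasted when the row is filtered out
        decoded += 1
        if a % 2 == 0:
            out.append(big)
    return out, decoded

def deferred(rows):
    # Decode the expensive column only after the filter passes.
    decoded, out = 0, []
    for a, b in rows:
        if a % 2 == 0:
            big = b.upper()
            decoded += 1
            out.append(big)
    return out, decoded

print(eager(rows)[1])     # 10 decodes
print(deferred(rows)[1])  # 5 decodes, identical output rows
```

Both variants return the same rows; the deferred one simply skips the decode for rows the predicate rejects, which is the saving the issue describes.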
[jira] [Comment Edited] (SPARK-13349) adding a split and union to a streaming application cause big performance hit
[ https://issues.apache.org/jira/browse/SPARK-13349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154558#comment-15154558 ] krishna ramachandran edited comment on SPARK-13349 at 2/19/16 6:01 PM: --- enabling "cache" for a DStream causes the app to run out of memory. I believe this is a bug was (Author: ramach1776): enabling "cache" for a DStream causes the app to run out of memory > adding a split and union to a streaming application cause big performance hit > - > > Key: SPARK-13349 > URL: https://issues.apache.org/jira/browse/SPARK-13349 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.4.1 >Reporter: krishna ramachandran >Priority: Critical > > We have a streaming application containing approximately 12 jobs every batch, > running in streaming mode (4 sec batches). Each job writes output to cassandra > each job can contain several stages. > job 1 > ---> receive Stream A --> map --> filter -> (union with another stream B) --> > map --> groupbykey --> transform --> reducebykey --> map > we go thro' few more jobs of transforms and save to database. > Around stage 5, we union the output of Dstream from job 1 (in red) with > another stream (generated by split during job 2) and save that state > It appears the whole execution thus far is repeated which is redundant (I can > see this in execution graph & also performance -> processing time). > Processing time per batch nearly doubles or triples. > This additional & redundant processing cause each batch to run as much as 2.5 > times slower compared to runs without the union - union for most batches does > not alter the original DStream (union with an empty set). If I cache the > DStream from job 1(red block output), performance improves substantially but > hit out of memory errors within few hours. > What is the recommended way to cache/unpersist in such a scenario? 
there is > no dstream level "unpersist" > setting "spark.streaming.unpersist" to true and > streamingContext.remember("duration") did not help. Still seeing out of > memory errors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-13349) adding a split and union to a streaming application cause big performance hit
[ https://issues.apache.org/jira/browse/SPARK-13349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] krishna ramachandran reopened SPARK-13349: -- enabling "cache" for a DStream causes the app to run out of memory > adding a split and union to a streaming application cause big performance hit > - > > Key: SPARK-13349 > URL: https://issues.apache.org/jira/browse/SPARK-13349 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.4.1 >Reporter: krishna ramachandran >Priority: Critical > > We have a streaming application containing approximately 12 jobs every batch, > running in streaming mode (4 sec batches). Each job writes output to cassandra > each job can contain several stages. > job 1 > ---> receive Stream A --> map --> filter -> (union with another stream B) --> > map --> groupbykey --> transform --> reducebykey --> map > we go thro' few more jobs of transforms and save to database. > Around stage 5, we union the output of Dstream from job 1 (in red) with > another stream (generated by split during job 2) and save that state > It appears the whole execution thus far is repeated which is redundant (I can > see this in execution graph & also performance -> processing time). > Processing time per batch nearly doubles or triples. > This additional & redundant processing cause each batch to run as much as 2.5 > times slower compared to runs without the union - union for most batches does > not alter the original DStream (union with an empty set). If I cache the > DStream from job 1(red block output), performance improves substantially but > hit out of memory errors within few hours. > What is the recommended way to cache/unpersist in such a scenario? there is > no dstream level "unpersist" > setting "spark.streaming.unpersist" to true and > streamingContext.remember("duration") did not help. 
Still seeing out of > memory errors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13349) adding a split and union to a streaming application cause big performance hit
[ https://issues.apache.org/jira/browse/SPARK-13349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154555#comment-15154555 ] krishna ramachandran commented on SPARK-13349: -- Hi Sean, I posted to user@ and hit two problems: 1) not much traction, and 2) though I registered multiple times, I keep getting this message at Nabble: "This post has NOT been accepted by the mailing list yet". The message I posted is pasted below. This is not just a question; it is a bug. We have a streaming application containing approximately 12 jobs every batch, running in streaming mode (4 sec batches). Each job has several transformations and one action (output to Cassandra) which triggers execution of the job (DAG). For example, the first job, job 1 ---> receive Stream A --> map --> filter -> (union with another stream B) --> map --> groupbykey --> transform --> reducebykey --> map. Likewise we go through a few more transforms and save to the database (job2, job3...). Recently we added a new transformation further downstream wherein we union the output of the DStream from job 1 (in italics) with the output of a new transformation (job 5). It appears the whole execution thus far is repeated, which is redundant (I can see this in the execution graph and also in performance -> processing time). That is, with this additional transformation (union with a stream processed upstream), each batch runs as much as 2.5 times slower compared to runs without the union. If I cache the DStream from job 1 (italics), performance improves substantially, but we hit out-of-memory errors within a few hours. What is the recommended way to cache/unpersist in such a scenario? There is no DStream-level "unpersist"; setting "spark.streaming.unpersist" to true and streamingContext.remember("duration") did not help. 
> adding a split and union to a streaming application cause big performance hit > - > > Key: SPARK-13349 > URL: https://issues.apache.org/jira/browse/SPARK-13349 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.4.1 >Reporter: krishna ramachandran >Priority: Critical > > We have a streaming application containing approximately 12 jobs every batch, > running in streaming mode (4 sec batches). Each job writes output to cassandra > each job can contain several stages. > job 1 > ---> receive Stream A --> map --> filter -> (union with another stream B) --> > map --> groupbykey --> transform --> reducebykey --> map > we go thro' few more jobs of transforms and save to database. > Around stage 5, we union the output of Dstream from job 1 (in red) with > another stream (generated by split during job 2) and save that state > It appears the whole execution thus far is repeated which is redundant (I can > see this in execution graph & also performance -> processing time). > Processing time per batch nearly doubles or triples. > This additional & redundant processing cause each batch to run as much as 2.5 > times slower compared to runs without the union - union for most batches does > not alter the original DStream (union with an empty set). If I cache the > DStream from job 1(red block output), performance improves substantially but > hit out of memory errors within few hours. > What is the recommended way to cache/unpersist in such a scenario? there is > no dstream level "unpersist" > setting "spark.streaming.unpersist" to true and > streamingContext.remember("duration") did not help. Still seeing out of > memory errors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13404) Create the variables for input when it's used
Davies Liu created SPARK-13404: -- Summary: Create the variables for input when it's used Key: SPARK-13404 URL: https://issues.apache.org/jira/browse/SPARK-13404 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu Right now, we create the variables in the first operator (usually InputAdapter); that work is wasted if most rows are filtered out immediately. We should defer creating them until they are used by the following operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13373) Generate code for sort merge join
[ https://issues.apache.org/jira/browse/SPARK-13373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13373: Assignee: Apache Spark (was: Davies Liu) > Generate code for sort merge join > - > > Key: SPARK-13373 > URL: https://issues.apache.org/jira/browse/SPARK-13373 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13373) Generate code for sort merge join
[ https://issues.apache.org/jira/browse/SPARK-13373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154549#comment-15154549 ] Apache Spark commented on SPARK-13373: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/11248 > Generate code for sort merge join > - > > Key: SPARK-13373 > URL: https://issues.apache.org/jira/browse/SPARK-13373 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13373) Generate code for sort merge join
[ https://issues.apache.org/jira/browse/SPARK-13373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13373: Assignee: Davies Liu (was: Apache Spark) > Generate code for sort merge join > - > > Key: SPARK-13373 > URL: https://issues.apache.org/jira/browse/SPARK-13373 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13403) HiveConf used for SparkSQL is not based on the Hadoop configuration
[ https://issues.apache.org/jira/browse/SPARK-13403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13403: Assignee: Apache Spark > HiveConf used for SparkSQL is not based on the Hadoop configuration > --- > > Key: SPARK-13403 > URL: https://issues.apache.org/jira/browse/SPARK-13403 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Ryan Blue >Assignee: Apache Spark > > The HiveConf instances used by HiveContext are not instantiated by passing in > the SparkContext's Hadoop conf and are instead based only on the config files > in the environment. Hadoop best practice is to instantiate just one > Configuration from the environment and then pass that conf when instantiating > others so that modifications aren't lost. > Spark will set configuration variables that start with "spark.hadoop." from > spark-defaults.conf when creating {{sc.hadoopConfiguration}}, which are not > correctly passed to the HiveConf because of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13403) HiveConf used for SparkSQL is not based on the Hadoop configuration
[ https://issues.apache.org/jira/browse/SPARK-13403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13403: Assignee: (was: Apache Spark) > HiveConf used for SparkSQL is not based on the Hadoop configuration > --- > > Key: SPARK-13403 > URL: https://issues.apache.org/jira/browse/SPARK-13403 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Ryan Blue > > The HiveConf instances used by HiveContext are not instantiated by passing in > the SparkContext's Hadoop conf and are instead based only on the config files > in the environment. Hadoop best practice is to instantiate just one > Configuration from the environment and then pass that conf when instantiating > others so that modifications aren't lost. > Spark will set configuration variables that start with "spark.hadoop." from > spark-defaults.conf when creating {{sc.hadoopConfiguration}}, which are not > correctly passed to the HiveConf because of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13403) HiveConf used for SparkSQL is not based on the Hadoop configuration
[ https://issues.apache.org/jira/browse/SPARK-13403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154533#comment-15154533 ] Apache Spark commented on SPARK-13403: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/11273 > HiveConf used for SparkSQL is not based on the Hadoop configuration > --- > > Key: SPARK-13403 > URL: https://issues.apache.org/jira/browse/SPARK-13403 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Ryan Blue > > The HiveConf instances used by HiveContext are not instantiated by passing in > the SparkContext's Hadoop conf and are instead based only on the config files > in the environment. Hadoop best practice is to instantiate just one > Configuration from the environment and then pass that conf when instantiating > others so that modifications aren't lost. > Spark will set configuration variables that start with "spark.hadoop." from > spark-defaults.conf when creating {{sc.hadoopConfiguration}}, which are not > correctly passed to the HiveConf because of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13403) HiveConf used for SparkSQL is not based on the Hadoop configuration
Ryan Blue created SPARK-13403: - Summary: HiveConf used for SparkSQL is not based on the Hadoop configuration Key: SPARK-13403 URL: https://issues.apache.org/jira/browse/SPARK-13403 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.0 Reporter: Ryan Blue The HiveConf instances used by HiveContext are not instantiated by passing in the SparkContext's Hadoop conf and are instead based only on the config files in the environment. Hadoop best practice is to instantiate just one Configuration from the environment and then pass that conf when instantiating others so that modifications aren't lost. Spark will set configuration variables that start with "spark.hadoop." from spark-defaults.conf when creating {{sc.hadoopConfiguration}}, which are not correctly passed to the HiveConf because of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
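SPARK-13403 above describes the Hadoop convention of building one base Configuration and deriving others from it so programmatic overrides survive. Below is a plain-Python model of that pattern (not Spark or Hive code; the dict-based functions and config keys are hypothetical stand-ins for `new Configuration(base)`-style copy construction):

```python
# Model of config propagation: overrides applied to the live base conf
# are lost if a derived conf re-reads the environment instead of copying.

def load_from_env_files():
    # Stand-in for parsing core-site.xml and friends off disk.
    return {"fs.defaultFS": "hdfs://nn:8020"}

def derive(base, overrides=None):
    # Copy-construct from the base, then layer overrides on top.
    conf = dict(base)
    conf.update(overrides or {})
    return conf

# Base conf plus a "spark.hadoop.*"-style programmatic override.
hadoop_conf = derive(load_from_env_files(),
                     {"io.compression.codecs": "custom"})

# Buggy pattern (what the issue reports): re-reading the environment
# silently drops the override.
hive_conf_bad = load_from_env_files()

# Correct pattern: derive from the live base conf.
hive_conf_good = derive(hadoop_conf)

print("io.compression.codecs" in hive_conf_bad)   # False -- override lost
print("io.compression.codecs" in hive_conf_good)  # True  -- override kept
```

The fix direction the issue implies is the second pattern: construct the Hive-side configuration from `sc.hadoopConfiguration` rather than from the environment files alone.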
[jira] [Closed] (SPARK-13402) List Spark R dependencies
[ https://issues.apache.org/jira/browse/SPARK-13402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk closed SPARK-13402. --- Resolution: Not A Problem > List Spark R dependencies > - > > Key: SPARK-13402 > URL: https://issues.apache.org/jira/browse/SPARK-13402 > Project: Spark > Issue Type: Improvement > Components: Documentation, SparkR >Reporter: holdenk >Priority: Trivial > > Especially for developers in other languages who want to build the docs and > similar it would be good to have a list of packages that SparkR depends on so > they can quickly build the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13402) List Spark R dependencies
[ https://issues.apache.org/jira/browse/SPARK-13402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154507#comment-15154507 ] holdenk commented on SPARK-13402: - Oh let me close this, I was looking in the wrong place (inside of the R documentation and the R setup scripts). > List Spark R dependencies > - > > Key: SPARK-13402 > URL: https://issues.apache.org/jira/browse/SPARK-13402 > Project: Spark > Issue Type: Improvement > Components: Documentation, SparkR >Reporter: holdenk >Priority: Trivial > > Especially for developers in other languages who want to build the docs and > similar it would be good to have a list of packages that SparkR depends on so > they can quickly build the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13402) List Spark R dependencies
[ https://issues.apache.org/jira/browse/SPARK-13402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154505#comment-15154505 ] Sean Owen commented on SPARK-13402: --- https://github.com/apache/spark/blob/master/docs/README.md says knitr and devtools; are there more or was that what you're looking for? > List Spark R dependencies > - > > Key: SPARK-13402 > URL: https://issues.apache.org/jira/browse/SPARK-13402 > Project: Spark > Issue Type: Improvement > Components: Documentation, SparkR >Reporter: holdenk >Priority: Trivial > > Especially for developers in other languages who want to build the docs and > similar it would be good to have a list of packages that SparkR depends on so > they can quickly build the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13402) List Spark R dependencies
holdenk created SPARK-13402: --- Summary: List Spark R dependencies Key: SPARK-13402 URL: https://issues.apache.org/jira/browse/SPARK-13402 Project: Spark Issue Type: Improvement Components: Documentation, SparkR Reporter: holdenk Priority: Trivial Especially for developers in other languages who want to build the docs and similar it would be good to have a list of packages that SparkR depends on so they can quickly build the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13400) Stop using deprecated Octal escape literals
[ https://issues.apache.org/jira/browse/SPARK-13400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154483#comment-15154483 ] holdenk commented on SPARK-13400: - oh yah for sure, I was going to try running some test generation tools and other things. > Stop using deprecated Octal escape literals > --- > > Key: SPARK-13400 > URL: https://issues.apache.org/jira/browse/SPARK-13400 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: holdenk >Priority: Trivial > > We use some deprecated octal literals. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13400) Stop using deprecated Octal escape literals
[ https://issues.apache.org/jira/browse/SPARK-13400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154479#comment-15154479 ] Sean Owen commented on SPARK-13400: --- Yeah all these are good. We need to zap virtually all the warnings before 2.0. Keep going and I'll review. I also want to file some changes to clean up things that aren't compiler warnings but that are bugs identifiable from static analysis, or needlessly slow code that can be simplified. > Stop using deprecated Octal escape literals > --- > > Key: SPARK-13400 > URL: https://issues.apache.org/jira/browse/SPARK-13400 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: holdenk >Priority: Trivial > > We use some deprecated octal literals.
[jira] [Created] (SPARK-13401) Fix SQL test warnings
holdenk created SPARK-13401: --- Summary: Fix SQL test warnings Key: SPARK-13401 URL: https://issues.apache.org/jira/browse/SPARK-13401 Project: Spark Issue Type: Improvement Components: SQL, Tests Reporter: holdenk Priority: Trivial SQL tests have a number of warnings about unreachable code, non-exhaustive matches, and unchecked type casts.
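To illustrate one of the warning classes mentioned above, here is a minimal, self-contained sketch of a non-exhaustive match on a sealed hierarchy and its fix; the `Shape` types are hypothetical stand-ins, not Spark's actual test code:

```scala
sealed trait Shape
case class Circle(r: Double) extends Shape
case class Square(side: Double) extends Shape

object MatchDemo {
  // Matching only Circle would draw a "match may not be exhaustive"
  // warning; covering every case of the sealed trait silences it.
  def area(s: Shape): Double = s match {
    case Circle(r)    => math.Pi * r * r
    case Square(side) => side * side
  }

  def main(args: Array[String]): Unit = {
    println(area(Square(2.0))) // 4.0
  }
}
```

The same principle applies to the Spark SQL test suites: making each match exhaustive (or adding an explicit catch-all case) removes the warning without changing behavior.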
[jira] [Created] (SPARK-13400) Stop using deprecated Octal escape literals
holdenk created SPARK-13400: --- Summary: Stop using deprecated Octal escape literals Key: SPARK-13400 URL: https://issues.apache.org/jira/browse/SPARK-13400 Project: Spark Issue Type: Sub-task Components: SQL Reporter: holdenk Priority: Trivial We use some deprecated octal literals.
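The migration for deprecated octal escapes is mechanical. A minimal sketch, assuming the intent is a NUL character; the literals here are illustrative, not taken from Spark's sources:

```scala
object OctalEscapeDemo {
  // Deprecated style (warns in Scala 2.11+): val nul = "\0"
  // Preferred replacements:
  val nulUnicode: String = "\u0000"           // unicode escape
  val nulFromCode: String = 0.toChar.toString // explicit code point

  def main(args: Array[String]): Unit = {
    assert(nulUnicode == nulFromCode)
    println(nulUnicode.length) // 1
  }
}
```

Either form produces the same single-character string, so the swap is behavior-preserving.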
[jira] [Created] (SPARK-13399) Investigate type erasure warnings in CheckpointSuite
holdenk created SPARK-13399: --- Summary: Investigate type erasure warnings in CheckpointSuite Key: SPARK-13399 URL: https://issues.apache.org/jira/browse/SPARK-13399 Project: Spark Issue Type: Improvement Reporter: holdenk Priority: Trivial
[warn] /home/holden/repos/spark/streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala:154: abstract type V in type org.apache.spark.streaming.TestOutputStreamWithPartitions[V] is unchecked since it is eliminated by erasure
[warn] dstream.isInstanceOf[TestOutputStreamWithPartitions[V]]
[warn] ^
[warn] /home/holden/repos/spark/streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala:911: abstract type V in type org.apache.spark.streaming.TestOutputStreamWithPartitions[V] is unchecked since it is eliminated by erasure
[warn] dstream.isInstanceOf[TestOutputStreamWithPartitions[V]]
[warn] ^
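The warning above exists because the JVM erases type arguments, so an `isInstanceOf` check against a parameterized type can only inspect the raw class. A minimal self-contained sketch, with a hypothetical `Box` class standing in for `TestOutputStreamWithPartitions`:

```scala
class Box[V](val value: V)

object ErasureDemo {
  def main(args: Array[String]): Unit = {
    val boxed: Any = new Box[String]("hello")
    // Only the raw class Box is checked at runtime; the type argument
    // is erased, so this reports true even though V is String, not Int.
    // The compiler emits an "unchecked" warning here, just as it does
    // in CheckpointSuite.
    println(boxed.isInstanceOf[Box[Int]]) // true
  }
}
```

This is why such checks cannot simply be "fixed" by tightening the type test; they need either a raw-class check or a redesign that carries the type evidence explicitly.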
[jira] [Created] (SPARK-13398) Move away from deprecated ThreadPoolTaskSupport
holdenk created SPARK-13398: --- Summary: Move away from deprecated ThreadPoolTaskSupport Key: SPARK-13398 URL: https://issues.apache.org/jira/browse/SPARK-13398 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: holdenk Priority: Trivial ThreadPoolTaskSupport has been replaced by ForkJoinTaskSupport.
[jira] [Created] (SPARK-13397) Cleanup transient annotations which aren't being applied
holdenk created SPARK-13397: --- Summary: Cleanup transient annotations which aren't being applied Key: SPARK-13397 URL: https://issues.apache.org/jira/browse/SPARK-13397 Project: Spark Issue Type: Sub-task Reporter: holdenk Priority: Trivial In a number of places we have transient markers which are discarded as unused.
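For context, `@transient` only has an effect when it lands on a field that actually gets serialized; annotating something that never reaches serialization (a local value, for instance) is what the compiler flags as discarded. A minimal sketch of the annotation doing real work under plain Java serialization; the `Holder` class is hypothetical:

```scala
import java.io._

// @transient on a constructor val marks the backing field transient,
// so Java serialization skips it.
class Holder(@transient val cache: Array[Int], val id: Int) extends Serializable

object TransientDemo {
  def roundTrip(h: Holder): Holder = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(h)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    in.readObject().asInstanceOf[Holder]
  }

  def main(args: Array[String]): Unit = {
    val copy = roundTrip(new Holder(Array(1, 2, 3), 7))
    println(copy.cache == null) // true: the transient field was skipped
    println(copy.id)            // 7
  }
}
```

Markers that the compiler reports as discarded do nothing at all, which is why removing them is safe cleanup.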
[jira] [Created] (SPARK-13396) Stop using our internal deprecated .metrics on ExceptionFailure instead use accumUpdates
holdenk created SPARK-13396: --- Summary: Stop using our internal deprecated .metrics on ExceptionFailure instead use accumUpdates Key: SPARK-13396 URL: https://issues.apache.org/jira/browse/SPARK-13396 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: holdenk Priority: Minor src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala:385: value metrics in class ExceptionFailure is deprecated: use accumUpdates instead
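This is the usual deprecated-accessor migration: the old member stays for compatibility and forwards to its replacement, and internal callers move to the new name. A hypothetical sketch; these class and field shapes are illustrative, not Spark's actual `ExceptionFailure`:

```scala
// Illustrative stand-in for a class carrying a deprecated accessor.
case class TaskFailure(accumUpdates: Seq[Long]) {
  @deprecated("use accumUpdates instead", "2.0.0")
  def metrics: Seq[Long] = accumUpdates
}

object DeprecationDemo {
  def main(args: Array[String]): Unit = {
    val f = TaskFailure(Seq(1L, 2L))
    // Reading the replacement field directly, as the JIRA suggests,
    // avoids the deprecation warning the old accessor would emit.
    println(f.accumUpdates.sum) // 3
  }
}
```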
[jira] [Created] (SPARK-13395) Silence or skip unsafe deprecation warnings
holdenk created SPARK-13395: --- Summary: Silence or skip unsafe deprecation warnings Key: SPARK-13395 URL: https://issues.apache.org/jira/browse/SPARK-13395 Project: Spark Issue Type: Sub-task Reporter: holdenk Priority: Trivial In a number of places inside Spark we use the unsafe API, which produces warnings we probably want to silence if it's possible.
[jira] [Updated] (SPARK-13395) Silence or skip unsafe deprecation warnings
[ https://issues.apache.org/jira/browse/SPARK-13395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-13395: Component/s: (was: Streaming) (was: Examples) > Silence or skip unsafe deprecation warnings > --- > > Key: SPARK-13395 > URL: https://issues.apache.org/jira/browse/SPARK-13395 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Reporter: holdenk >Priority: Trivial > > In a number of places inside Spark we use the unsafe API, which produces > warnings we probably want to silence if it's possible.
[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154393#comment-15154393 ] Herman van Hovell commented on SPARK-13393: --- I think you are encountering the following bug: https://issues.apache.org/jira/browse/SPARK-10737 Could you try to run this on 1.5.2/1.6.0/master? > Column mismatch issue in left_outer join using Spark DataFrame > -- > > Key: SPARK-13393 > URL: https://issues.apache.org/jira/browse/SPARK-13393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Varadharajan > > Consider the below snippet: > {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join, is messed up and shows as below:
> | id|varadha_id|
> | 1| 1|
> | 2| 2 (*This should've been null*)|
> whereas `correctDF` has the correct output after the left join:
> | id|nagaraj_id|
> | 1| null|
> | 2| 2|