[jira] [Created] (SPARK-7993) Improve DataFrame.show() output

2015-06-01 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-7993:
--

 Summary: Improve DataFrame.show() output
 Key: SPARK-7993
 URL: https://issues.apache.org/jira/browse/SPARK-7993
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker


1. Each column should be at least 3 characters wide. Right now, if the
widest value is 1, the column is just 1 character wide, which looks ugly. Example below:

2. If a DataFrame has more than N rows (N = 20 by default for show), we should
display a message at the end like "only showing top 20 rows".

{code}
+--+--+-+
| a| b|c|
+--+--+-+
| 1| 2|3|
| 1| 2|1|
| 1| 2|3|
| 3| 6|3|
| 1| 2|3|
| 5|10|1|
| 1| 2|3|
| 7|14|3|
| 1| 2|3|
| 9|18|1|
| 1| 2|3|
|11|22|3|
| 1| 2|3|
|13|26|1|
| 1| 2|3|
|15|30|3|
| 1| 2|3|
|17|34|1|
| 1| 2|3|
|19|38|3|
+--+--+-+
{code}
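
A rough standalone sketch of the two rules above (the helper name and structure are made up here; this is not the actual DataFrame.show implementation):

{code}
// Rule 1: pad every column to at least 3 characters.
// Rule 2: append a note when more than `numRows` rows exist.
def showString(header: Seq[String], rows: Seq[Seq[Any]], numRows: Int = 20): String = {
  val shown = rows.take(numRows).map(_.map(v => if (v == null) "null" else v.toString))
  val table = header +: shown
  val widths = header.indices.map(i => math.max(3, table.map(_(i).length).max))
  def fmt(row: Seq[String]): String =
    row.zip(widths).map { case (cell, w) => " " * (w - cell.length) + cell }
      .mkString("|", "|", "|")
  val sep = widths.map("-" * _).mkString("+", "+", "+")
  val lines = (sep +: fmt(header) +: sep +: shown.map(fmt)) :+ sep
  val footer = if (rows.length > numRows) Seq(s"only showing top $numRows rows") else Nil
  (lines ++ footer).mkString("\n")
}
{code}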






[jira] [Updated] (SPARK-7799) Move StreamingContext.actorStream to a separate project and deprecate it in StreamingContext

2015-06-01 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-7799:

Summary: Move StreamingContext.actorStream to a separate project and 
deprecate it in StreamingContext  (was: Move StreamingContext.actorStream to 
a separate project)

 Move StreamingContext.actorStream to a separate project and deprecate it in 
 StreamingContext
 --

 Key: SPARK-7799
 URL: https://issues.apache.org/jira/browse/SPARK-7799
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Shixiong Zhu

 Move {{StreamingContext.actorStream}} to a separate project and deprecate it 
 in {{StreamingContext}}






[jira] [Commented] (SPARK-7966) add Spreading Activation algorithm to GraphX

2015-06-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566967#comment-14566967
 ] 

Apache Spark commented on SPARK-7966:
-

User 'tarekauel' has created a pull request for this issue:
https://github.com/apache/spark/pull/6549

 add Spreading Activation algorithm to GraphX
 

 Key: SPARK-7966
 URL: https://issues.apache.org/jira/browse/SPARK-7966
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Tarek Auel
Priority: Minor

 I'm wondering if you would like to add the Spreading Activation algorithm to
 GraphX. I have implemented it using the Pregel API and would love to share
 it with the community.
 Spreading activation is an algorithm that was invented for searching
 associative networks. The basic idea is that you have one (or more)
 starting nodes. The activation spreads out from these nodes to their
 neighbours and to the neighbours of those neighbours, decreasing after every
 hop. Nodes that are reached by many activations end up with a higher total
 activation level.
 Spreading Activation is useful for many use cases. Imagine you have the
 social network of two people. If you apply spreading activation to this
 social graph with the two people as starting nodes, you will get the nodes
 that are most important to both.
 Some resources:
 http://www.websci11.org/fileadmin/websci/posters/105_paper.pdf
 https://webfiles.uci.edu/eloftus/CollinsLoftus_PsychReview_75.pdf?uniq=20ou4w






[jira] [Assigned] (SPARK-7966) add Spreading Activation algorithm to GraphX

2015-06-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7966:
---

Assignee: (was: Apache Spark)

 add Spreading Activation algorithm to GraphX
 

 Key: SPARK-7966
 URL: https://issues.apache.org/jira/browse/SPARK-7966
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Tarek Auel
Priority: Minor

 I'm wondering if you would like to add the Spreading Activation algorithm to
 GraphX. I have implemented it using the Pregel API and would love to share
 it with the community.
 Spreading activation is an algorithm that was invented for searching
 associative networks. The basic idea is that you have one (or more)
 starting nodes. The activation spreads out from these nodes to their
 neighbours and to the neighbours of those neighbours, decreasing after every
 hop. Nodes that are reached by many activations end up with a higher total
 activation level.
 Spreading Activation is useful for many use cases. Imagine you have the
 social network of two people. If you apply spreading activation to this
 social graph with the two people as starting nodes, you will get the nodes
 that are most important to both.
 Some resources:
 http://www.websci11.org/fileadmin/websci/posters/105_paper.pdf
 https://webfiles.uci.edu/eloftus/CollinsLoftus_PsychReview_75.pdf?uniq=20ou4w






[jira] [Assigned] (SPARK-7966) add Spreading Activation algorithm to GraphX

2015-06-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7966:
---

Assignee: Apache Spark

 add Spreading Activation algorithm to GraphX
 

 Key: SPARK-7966
 URL: https://issues.apache.org/jira/browse/SPARK-7966
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Tarek Auel
Assignee: Apache Spark
Priority: Minor

 I'm wondering if you would like to add the Spreading Activation algorithm to
 GraphX. I have implemented it using the Pregel API and would love to share
 it with the community.
 Spreading activation is an algorithm that was invented for searching
 associative networks. The basic idea is that you have one (or more)
 starting nodes. The activation spreads out from these nodes to their
 neighbours and to the neighbours of those neighbours, decreasing after every
 hop. Nodes that are reached by many activations end up with a higher total
 activation level.
 Spreading Activation is useful for many use cases. Imagine you have the
 social network of two people. If you apply spreading activation to this
 social graph with the two people as starting nodes, you will get the nodes
 that are most important to both.
 Some resources:
 http://www.websci11.org/fileadmin/websci/posters/105_paper.pdf
 https://webfiles.uci.edu/eloftus/CollinsLoftus_PsychReview_75.pdf?uniq=20ou4w






[jira] [Created] (SPARK-7994) Remove StreamingContext.actorStream

2015-06-01 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-7994:
---

 Summary: Remove StreamingContext.actorStream
 Key: SPARK-7994
 URL: https://issues.apache.org/jira/browse/SPARK-7994
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Shixiong Zhu









[jira] [Updated] (SPARK-7799) Move StreamingContext.actorStream to a separate project and deprecate it in StreamingContext

2015-06-01 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-7799:

Target Version/s: 1.5.0  (was: 1.6.0)

 Move StreamingContext.actorStream to a separate project and deprecate it in 
 StreamingContext
 --

 Key: SPARK-7799
 URL: https://issues.apache.org/jira/browse/SPARK-7799
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Shixiong Zhu

 Move {{StreamingContext.actorStream}} to a separate project and deprecate it 
 in {{StreamingContext}}






[jira] [Created] (SPARK-7995) Move AkkaRpcEnv to a separate project and remove Akka from the dependencies of Core

2015-06-01 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-7995:
---

 Summary: Move AkkaRpcEnv to a separate project and remove Akka 
from the dependencies of Core
 Key: SPARK-7995
 URL: https://issues.apache.org/jira/browse/SPARK-7995
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Shixiong Zhu









[jira] [Created] (SPARK-7996) Deprecate the developer api SparkEnv.actorSystem

2015-06-01 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-7996:
---

 Summary: Deprecate the developer api SparkEnv.actorSystem
 Key: SPARK-7996
 URL: https://issues.apache.org/jira/browse/SPARK-7996
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Shixiong Zhu









[jira] [Created] (SPARK-7997) Remove the developer api SparkEnv.actorSystem

2015-06-01 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-7997:
---

 Summary: Remove the developer api SparkEnv.actorSystem
 Key: SPARK-7997
 URL: https://issues.apache.org/jira/browse/SPARK-7997
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Shixiong Zhu









[jira] [Created] (SPARK-7998) A better frequent item API

2015-06-01 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-7998:
--

 Summary: A better frequent item API
 Key: SPARK-7998
 URL: https://issues.apache.org/jira/browse/SPARK-7998
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


The current freqItems API is really awkward to use. It returns a DataFrame with 
a single row, in which each value is an array of frequent items. 

This design doesn't work well for exploratory data analysis (running show -- 
when there are more than 2 or 3 frequent values, the values get cut off):
{code}
In [74]: df.stat.freqItems(["a", "b", "c"], 0.4).show()
+------------------+------------------+-----------------+
|       a_freqItems|       b_freqItems|      c_freqItems|
+------------------+------------------+-----------------+
|ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)|
+------------------+------------------+-----------------+
{code}

It also doesn't work well for serious engineering, since it is hard to get the 
value out.

We should just create a new function (so we maintain source/binary 
compatibility) that returns a list of list of values.
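
One possible shape for such a function, sketched against the existing API (the name freqItemsAsSeq and its exact signature are made up for illustration; the issue does not pin them down):

{code}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: return the frequent items as plain Scala collections,
// one inner Seq per requested column, instead of a one-row DataFrame.
def freqItemsAsSeq(df: DataFrame, cols: Seq[String], support: Double): Seq[Seq[Any]] = {
  val row = df.stat.freqItems(cols.toArray, support).collect().head
  cols.indices.map(i => row.getAs[Seq[Any]](i))
}
{code}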







[jira] [Updated] (SPARK-7998) A better frequent item API

2015-06-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7998:
---
Description: 
The current freqItems API is really awkward to use. It returns a DataFrame with 
a single row, in which each value is an array of frequent items. 

This design doesn't work well for exploratory data analysis (running show -- 
when there are more than 2 or 3 frequent values, the values get cut off):
{code}
In [74]: df.stat.freqItems(["a", "b", "c"], 0.4).show()
+------------------+------------------+-----------------+
|       a_freqItems|       b_freqItems|      c_freqItems|
+------------------+------------------+-----------------+
|ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)|
+------------------+------------------+-----------------+
{code}

It also doesn't work well for serious engineering, since it is hard to get the 
value out.

We should create a new function (so we maintain source/binary compatibility) 
that returns a list of list of values.


  was:
The current freqItems API is really awkward to use. It returns a DataFrame with 
a single row, in which each value is an array of frequent items. 

This design doesn't work well for exploratory data analysis (running show -- 
when there are more than 2 or 3 frequent values, the values get cut off):
{code}
In [74]: df.stat.freqItems(["a", "b", "c"], 0.4).show()
+------------------+------------------+-----------------+
|       a_freqItems|       b_freqItems|      c_freqItems|
+------------------+------------------+-----------------+
|ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)|
+------------------+------------------+-----------------+
{code}

It also doesn't work well for serious engineering, since it is hard to get the 
value out.

We should just create a new function (so we maintain source/binary 
compatibility) that returns a list of list of values.



 A better frequent item API
 --

 Key: SPARK-7998
 URL: https://issues.apache.org/jira/browse/SPARK-7998
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 The current freqItems API is really awkward to use. It returns a DataFrame 
 with a single row, in which each value is an array of frequent items. 
 This design doesn't work well for exploratory data analysis (running show -- 
 when there are more than 2 or 3 frequent values, the values get cut off):
 {code}
 In [74]: df.stat.freqItems(["a", "b", "c"], 0.4).show()
 +------------------+------------------+-----------------+
 |       a_freqItems|       b_freqItems|      c_freqItems|
 +------------------+------------------+-----------------+
 |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)|
 +------------------+------------------+-----------------+
 {code}
 It also doesn't work well for serious engineering, since it is hard to get 
 the value out.
 We should create a new function (so we maintain source/binary compatibility) 
 that returns a list of list of values.






[jira] [Updated] (SPARK-7993) Improve DataFrame.show() output

2015-06-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7993:
---
Labels: starter  (was: )

 Improve DataFrame.show() output
 ---

 Key: SPARK-7993
 URL: https://issues.apache.org/jira/browse/SPARK-7993
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker
  Labels: starter

 1. Each column should be at least 3 characters wide. Right now, if the
 widest value is 1, the column is just 1 character wide, which looks ugly. Example below:
 2. If a DataFrame has more than N rows (N = 20 by default for show), we should
 display a message at the end like "only showing top 20 rows".
 {code}
 +--+--+-+
 | a| b|c|
 +--+--+-+
 | 1| 2|3|
 | 1| 2|1|
 | 1| 2|3|
 | 3| 6|3|
 | 1| 2|3|
 | 5|10|1|
 | 1| 2|3|
 | 7|14|3|
 | 1| 2|3|
 | 9|18|1|
 | 1| 2|3|
 |11|22|3|
 | 1| 2|3|
 |13|26|1|
 | 1| 2|3|
 |15|30|3|
 | 1| 2|3|
 |17|34|1|
 | 1| 2|3|
 |19|38|3|
 +--+--+-+
 only showing top 20 rows    <--- add this at the end
 {code}
 3. For array values, instead of printing ArrayBuffer, we should just print 
 square brackets:
 {code}
 +------------------+------------------+-----------------+
 |       a_freqItems|       b_freqItems|      c_freqItems|
 +------------------+------------------+-----------------+
 |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)|
 +------------------+------------------+-----------------+
 {code}
 should be
 {code}
 +-----------+-----------+-----------+
 |a_freqItems|b_freqItems|c_freqItems|
 +-----------+-----------+-----------+
 |    [11, 1]|    [2, 22]|     [1, 3]|
 +-----------+-----------+-----------+
 {code}






[jira] [Updated] (SPARK-7993) Improve DataFrame.show() output

2015-06-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7993:
---
Description: 
1. Each column should be at least 3 characters wide. Right now, if the
widest value is 1, the column is just 1 character wide, which looks ugly. Example below:

2. If a DataFrame has more than N rows (N = 20 by default for show), we should
display a message at the end like "only showing top 20 rows".

{code}
+--+--+-+
| a| b|c|
+--+--+-+
| 1| 2|3|
| 1| 2|1|
| 1| 2|3|
| 3| 6|3|
| 1| 2|3|
| 5|10|1|
| 1| 2|3|
| 7|14|3|
| 1| 2|3|
| 9|18|1|
| 1| 2|3|
|11|22|3|
| 1| 2|3|
|13|26|1|
| 1| 2|3|
|15|30|3|
| 1| 2|3|
|17|34|1|
| 1| 2|3|
|19|38|3|
+--+--+-+
only showing top 20 rows    <--- add this at the end
{code}

3. For array values, instead of printing ArrayBuffer, we should just print 
square brackets:

{code}
+------------------+------------------+-----------------+
|       a_freqItems|       b_freqItems|      c_freqItems|
+------------------+------------------+-----------------+
|ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)|
+------------------+------------------+-----------------+
{code}

should be

{code}
+-----------+-----------+-----------+
|a_freqItems|b_freqItems|c_freqItems|
+-----------+-----------+-----------+
|    [11, 1]|    [2, 22]|     [1, 3]|
+-----------+-----------+-----------+
{code}



  was:
1. Each column should be at least 3 characters wide. Right now, if the
widest value is 1, the column is just 1 character wide, which looks ugly. Example below:

2. If a DataFrame has more than N rows (N = 20 by default for show), we should
display a message at the end like "only showing top 20 rows".

{code}
+--+--+-+
| a| b|c|
+--+--+-+
| 1| 2|3|
| 1| 2|1|
| 1| 2|3|
| 3| 6|3|
| 1| 2|3|
| 5|10|1|
| 1| 2|3|
| 7|14|3|
| 1| 2|3|
| 9|18|1|
| 1| 2|3|
|11|22|3|
| 1| 2|3|
|13|26|1|
| 1| 2|3|
|15|30|3|
| 1| 2|3|
|17|34|1|
| 1| 2|3|
|19|38|3|
+--+--+-+
{code}


 Improve DataFrame.show() output
 ---

 Key: SPARK-7993
 URL: https://issues.apache.org/jira/browse/SPARK-7993
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker
  Labels: starter

 1. Each column should be at least 3 characters wide. Right now, if the
 widest value is 1, the column is just 1 character wide, which looks ugly. Example below:
 2. If a DataFrame has more than N rows (N = 20 by default for show), we should
 display a message at the end like "only showing top 20 rows".
 {code}
 +--+--+-+
 | a| b|c|
 +--+--+-+
 | 1| 2|3|
 | 1| 2|1|
 | 1| 2|3|
 | 3| 6|3|
 | 1| 2|3|
 | 5|10|1|
 | 1| 2|3|
 | 7|14|3|
 | 1| 2|3|
 | 9|18|1|
 | 1| 2|3|
 |11|22|3|
 | 1| 2|3|
 |13|26|1|
 | 1| 2|3|
 |15|30|3|
 | 1| 2|3|
 |17|34|1|
 | 1| 2|3|
 |19|38|3|
 +--+--+-+
 only showing top 20 rows    <--- add this at the end
 {code}
 3. For array values, instead of printing ArrayBuffer, we should just print 
 square brackets:
 {code}
 +------------------+------------------+-----------------+
 |       a_freqItems|       b_freqItems|      c_freqItems|
 +------------------+------------------+-----------------+
 |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)|
 +------------------+------------------+-----------------+
 {code}
 should be
 {code}
 +-----------+-----------+-----------+
 |a_freqItems|b_freqItems|c_freqItems|
 +-----------+-----------+-----------+
 |    [11, 1]|    [2, 22]|     [1, 3]|
 +-----------+-----------+-----------+
 {code}






[jira] [Created] (SPARK-7999) Graph complement

2015-06-01 Thread Tarek Auel (JIRA)
Tarek Auel created SPARK-7999:
-

 Summary: Graph complement
 Key: SPARK-7999
 URL: https://issues.apache.org/jira/browse/SPARK-7999
 Project: Spark
  Issue Type: Improvement
Reporter: Tarek Auel
Priority: Minor









[jira] [Updated] (SPARK-7999) Graph complement

2015-06-01 Thread Tarek Auel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarek Auel updated SPARK-7999:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-7893

 Graph complement
 

 Key: SPARK-7999
 URL: https://issues.apache.org/jira/browse/SPARK-7999
 Project: Spark
  Issue Type: Sub-task
Reporter: Tarek Auel
Priority: Minor








[jira] [Updated] (SPARK-7999) Graph complement

2015-06-01 Thread Tarek Auel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarek Auel updated SPARK-7999:
--
Description: 
This task is for implementing the complement operation (compare to parent task).

http://techieme.in/complex-graph-operations/

 Graph complement
 

 Key: SPARK-7999
 URL: https://issues.apache.org/jira/browse/SPARK-7999
 Project: Spark
  Issue Type: Sub-task
Reporter: Tarek Auel
Priority: Minor

 This task is for implementing the complement operation (compare to parent 
 task).
 http://techieme.in/complex-graph-operations/






[jira] [Commented] (SPARK-7999) Graph complement

2015-06-01 Thread Tarek Auel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566999#comment-14566999
 ] 

Tarek Auel commented on SPARK-7999:
---

I would propose

def complement(attr: ED): Graph[VD, ED]

as interface

 Graph complement
 

 Key: SPARK-7999
 URL: https://issues.apache.org/jira/browse/SPARK-7999
 Project: Spark
  Issue Type: Sub-task
Reporter: Tarek Auel
Priority: Minor

 This task is for implementing the complement operation (compare to parent 
 task).
 http://techieme.in/complex-graph-operations/






[jira] [Comment Edited] (SPARK-7999) Graph complement

2015-06-01 Thread Tarek Auel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14566999#comment-14566999
 ] 

Tarek Auel edited comment on SPARK-7999 at 6/1/15 7:04 AM:
---

I would propose

def complement(attr: ED, selfLoops: Boolean = false): Graph[VD, ED]

as the interface. The selfLoops parameter defines whether self-loops (A--A) should
be created or not.
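
A rough GraphX sketch of what such an operation could look like, taking the proposed signature as given (illustrative only, not an existing API; a real implementation would need something smarter than a full cartesian product):

{code}
import scala.reflect.ClassTag
import org.apache.spark.graphx.{Edge, Graph}

def complement[VD: ClassTag, ED: ClassTag](
    graph: Graph[VD, ED], attr: ED, selfLoops: Boolean = false): Graph[VD, ED] = {
  val ids = graph.vertices.keys
  // Candidate vertex pairs, optionally excluding self-loops (A--A).
  val allPairs = ids.cartesian(ids).filter { case (s, d) => selfLoops || s != d }
  // Directed edges already present in the original graph.
  val existing = graph.edges.map(e => (e.srcId, e.dstId))
  // Keep exactly the missing pairs, attaching the given edge attribute.
  val missing = allPairs.subtract(existing).map { case (s, d) => Edge(s, d, attr) }
  Graph(graph.vertices, missing)
}
{code}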


was (Author: tarekauel):
I would propose

def complement(attr: ED): Graph[VD, ED]

as interface

 Graph complement
 

 Key: SPARK-7999
 URL: https://issues.apache.org/jira/browse/SPARK-7999
 Project: Spark
  Issue Type: Sub-task
Reporter: Tarek Auel
Priority: Minor

 This task is for implementing the complement operation (compare to parent 
 task).
 http://techieme.in/complex-graph-operations/






[jira] [Commented] (SPARK-7980) Support SQLContext.range(end)

2015-06-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567207#comment-14567207
 ] 

Apache Spark commented on SPARK-7980:
-

User 'animeshbaranawal' has created a pull request for this issue:
https://github.com/apache/spark/pull/6553

 Support SQLContext.range(end)
 -

 Key: SPARK-7980
 URL: https://issues.apache.org/jira/browse/SPARK-7980
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 SQLContext.range should also allow only specifying the end position, similar 
 to Python's own range.
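
For context, a minimal sketch of the current call versus the proposed one (assuming Spark 1.4's two-argument SQLContext.range and an in-scope sqlContext):

{code}
// Today both endpoints must be given:
val df = sqlContext.range(0L, 1000L)
// Proposed by this issue (not yet available): only the end, with start defaulting to 0:
// val df2 = sqlContext.range(1000L)
{code}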






[jira] [Commented] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency

2015-06-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567221#comment-14567221
 ] 

Sean Owen commented on SPARK-8008:
--

Isn't this what connection pooling is for? Is that an option?

 sqlContext.jdbc can kill your database due to high concurrency
 --

 Key: SPARK-8008
 URL: https://issues.apache.org/jira/browse/SPARK-8008
 Project: Spark
  Issue Type: Bug
Reporter: Rene Treffer

 Spark tries to load as many partitions as possible in parallel, which can in 
 turn overload the database although it would be possible to load all 
 partitions given a lower concurrency.
 It would be nice to either limit the maximum concurrency or to at least warn 
 about this behavior.






[jira] [Assigned] (SPARK-4782) Add inferSchema support for RDD[Map[String, Any]]

2015-06-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4782:
---

Assignee: (was: Apache Spark)

 Add inferSchema support for RDD[Map[String, Any]]
 -

 Key: SPARK-4782
 URL: https://issues.apache.org/jira/browse/SPARK-4782
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Jianshi Huang
Priority: Minor

 The best way to convert RDD[Map[String, Any]] to SchemaRDD currently seems to
 be converting each Map to a JSON String first and using JsonRDD.inferSchema on it.
 That's very inefficient.
 Instead of JsonRDD, RDD[Map[String, Any]] is a better common denominator for
 schemaless data, as adding a Map-like interface to any serialization format is
 easy.
 So please add inferSchema support to RDD[Map[String, Any]]. *Then, for any new
 serialization format we want to support, we just need to add a Map interface
 wrapper to it.*
 Jianshi






[jira] [Assigned] (SPARK-4782) Add inferSchema support for RDD[Map[String, Any]]

2015-06-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4782:
---

Assignee: Apache Spark

 Add inferSchema support for RDD[Map[String, Any]]
 -

 Key: SPARK-4782
 URL: https://issues.apache.org/jira/browse/SPARK-4782
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Jianshi Huang
Assignee: Apache Spark
Priority: Minor

 The best way to convert RDD[Map[String, Any]] to SchemaRDD currently seems to
 be converting each Map to a JSON String first and using JsonRDD.inferSchema on it.
 That's very inefficient.
 Instead of JsonRDD, RDD[Map[String, Any]] is a better common denominator for
 schemaless data, as adding a Map-like interface to any serialization format is
 easy.
 So please add inferSchema support to RDD[Map[String, Any]]. *Then, for any new
 serialization format we want to support, we just need to add a Map interface
 wrapper to it.*
 Jianshi






[jira] [Created] (SPARK-8011) DecimalType is not a datatype

2015-06-01 Thread Bipin Roshan Nag (JIRA)
Bipin Roshan Nag created SPARK-8011:
---

 Summary: DecimalType is not a datatype
 Key: SPARK-8011
 URL: https://issues.apache.org/jira/browse/SPARK-8011
 Project: Spark
  Issue Type: Bug
  Components: Java API, Spark Core
Affects Versions: 1.3.1
Reporter: Bipin Roshan Nag


When I run the following in spark-shell:

 StructType(StructField("ID",IntegerType,true),
StructField("Value",DecimalType,true))

I get

<console>:50: error: type mismatch;
 found   : org.apache.spark.sql.types.DecimalType.type
 required: org.apache.spark.sql.types.DataType
   StructType(StructField("ID",IntegerType,true),
StructField("Value",DecimalType,true))
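
For reference, a sketch of a schema that should compile, assuming Spark 1.3's parameterized DecimalType (the precision/scale values here are arbitrary):

{code}
import org.apache.spark.sql.types._

// Bare DecimalType refers to the companion object, which is not a DataType;
// parameterize it (or use DecimalType.Unlimited) to get an actual DataType.
val schema = StructType(Seq(
  StructField("ID", IntegerType, nullable = true),
  StructField("Value", DecimalType(10, 2), nullable = true)))
{code}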








[jira] [Comment Edited] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7

2015-06-01 Thread Steven W (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567267#comment-14567267
 ] 

Steven W edited comment on SPARK-5389 at 6/1/15 12:56 PM:
--

I started seeing this when I installed JDK 6 on top of JDK 8. I re-installed
JDK 8 and it worked after that. So, I think "else was unexpected at this time."
just shows up anytime Java can't run. (Spark 1.3.1, Java 6u45)


was (Author: sjwoodard):
I started seeing this when I installed JDK 6 on top of JDK 8. I re-installed
JDK 8 and it worked after that. So, I think "else was unexpected at this time."
just shows up anytime Java can't run.

 spark-shell.cmd does not run from DOS Windows 7
 ---

 Key: SPARK-5389
 URL: https://issues.apache.org/jira/browse/SPARK-5389
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Shell, Windows
Affects Versions: 1.2.0
 Environment: Windows 7
Reporter: Yana Kadiyska
 Attachments: SparkShell_Win7.JPG, spark_bug.png


 spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. 
 spark-shell.cmd works fine for me in v1.1, so this is new in Spark 1.2.
 Marking as trivial since calling spark-shell2.cmd also works fine
 Attaching a screenshot since the error isn't very useful:
 {code}
 spark-1.2.0-bin-cdh4\bin\spark-shell.cmd
 else was unexpected at this time.
 {code}






[jira] [Commented] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency

2015-06-01 Thread Rene Treffer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567230#comment-14567230
 ] 

Rene Treffer commented on SPARK-8008:
-

At the moment each partition uses its own connection, as far as I can tell. I
have to double-check how this works on a cluster where multiple servers
might fetch data.

I'm currently loading year+month-wise, due to the DB schema (index on actual days,
locality based on year/month).

I don't think larger batches would be a solution. Three months may require 160 million
rows, and I don't think batching that into one partition is a good idea.

 sqlContext.jdbc can kill your database due to high concurrency
 --

 Key: SPARK-8008
 URL: https://issues.apache.org/jira/browse/SPARK-8008
 Project: Spark
  Issue Type: Bug
Reporter: Rene Treffer

 Spark tries to load as many partitions as possible in parallel, which can in 
 turn overload the database although it would be possible to load all 
 partitions given a lower concurrency.
 It would be nice to either limit the maximum concurrency or to at least warn 
 about this behavior.






[jira] [Commented] (SPARK-7890) Document that Scala 2.11 now supports Kafka

2015-06-01 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567167#comment-14567167
 ] 

Iulian Dragos commented on SPARK-7890:
--

[~srowen] thanks for fixing it, and sorry for being unresponsive. I've been 
traveling a few days without a good internet connection.

 Document that Scala 2.11 now supports Kafka
 ---

 Key: SPARK-7890
 URL: https://issues.apache.org/jira/browse/SPARK-7890
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Reporter: Patrick Wendell
Assignee: Sean Owen
Priority: Critical
 Fix For: 1.4.1, 1.5.0


 The building-spark.html page needs to be updated. It's a simple fix, just 
 remove the caveat about Kafka.






[jira] [Commented] (SPARK-4782) Add inferSchema support for RDD[Map[String, Any]]

2015-06-01 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567279#comment-14567279
 ] 

Jianshi Huang commented on SPARK-4782:
--

Thanks Luca for the clever fix!

I also noticed that the schema inference in JsonRDD is too JSON-specific, as
JSON's datatypes are quite limited.

Jianshi

 Add inferSchema support for RDD[Map[String, Any]]
 -

 Key: SPARK-4782
 URL: https://issues.apache.org/jira/browse/SPARK-4782
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Jianshi Huang
Priority: Minor

 The best way to convert RDD[Map[String, Any]] to SchemaRDD currently seems to
 be converting each Map to a JSON String first and using JsonRDD.inferSchema on it.
 That's very inefficient.
 Instead of JsonRDD, RDD[Map[String, Any]] is a better common denominator for
 schemaless data, as adding a Map-like interface to any serialization format is
 easy.
 So please add inferSchema support to RDD[Map[String, Any]]. *Then, for any new
 serialization format we want to support, we just need to add a Map interface
 wrapper to it.*
 Jianshi






[jira] [Commented] (SPARK-4782) Add inferSchema support for RDD[Map[String, Any]]

2015-06-01 Thread Luca Rosellini (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567254#comment-14567254
 ] 

Luca Rosellini commented on SPARK-4782:
---

Hi Jianshi,
I've just hit the same problem as you; it seems quite inefficient to have to
serialize to JSON when you already have a {{Map\[String,Any\]}}.

I've opened a PR on GitHub that adds this feature in a generic way; check it
out at: [https://github.com/apache/spark/pull/6554].

Hopefully it will be merged in master.

The patch extends {{inferSchema}} functionality to any RDD of type T for which 
you can provide a function mapping from {{RDD\[T\]}} to 
{{RDD\[Map\[String,Any\]\]}}.

In your case, you already have an {{RDD\[Map\[String,Any\]\]}}, so you can 
simply pass the identity function, something like this:

{{JsonRDD.inferSchema(json, 1.0, conf.columnNameOfCorruptRecord, \{
(a:RDD\[Map\[String,Any\]\],b:String) => a \})}}
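
For comparison, the Map-to-JSON workaround described in the issue can be written against the public API roughly like this (sc/sqlContext and the sample data are assumptions, and read.json stands in for calling JsonRDD.inferSchema directly):

{code}
import scala.util.parsing.json.JSONObject
import org.apache.spark.rdd.RDD

// Serialize each Map to a JSON string, then let the JSON data source infer the schema.
val data: RDD[Map[String, Any]] =
  sc.parallelize(Seq(Map("name" -> "a", "n" -> 1), Map("name" -> "b", "n" -> 2)))
val asJson = data.map(m => JSONObject(m).toString())
val df = sqlContext.read.json(asJson)
df.printSchema()
{code}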

 Add inferSchema support for RDD[Map[String, Any]]
 -

 Key: SPARK-4782
 URL: https://issues.apache.org/jira/browse/SPARK-4782
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Jianshi Huang
Priority: Minor

 The best way to convert RDD[Map[String, Any]] to SchemaRDD currently seems to
 be converting each Map to a JSON String first and using JsonRDD.inferSchema on it.
 That's very inefficient.
 Instead of JsonRDD, RDD[Map[String, Any]] is a better common denominator for
 schemaless data, as adding a Map-like interface to any serialization format is
 easy.
 So please add inferSchema support to RDD[Map[String, Any]]. *Then, for any new
 serialization format we want to support, we just need to add a Map interface
 wrapper to it.*
 Jianshi






[jira] [Commented] (SPARK-6816) Add SparkConf API to configure SparkR

2015-06-01 Thread Rick Moritz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567187#comment-14567187
 ] 

Rick Moritz commented on SPARK-6816:


One current drawback with SparkR's configuration options is the inability to set
driver VM options. These are crucial when attempting to run SparkR on a
Hortonworks HDP, as both the driver and the application master need to be aware of the
hdp.version variable in order to resolve the classpath.

While it is possible to pass this variable to the executors, there's no way to
pass this option to the driver, except for the following exploit/workaround:

The SPARK_MEM variable can be abused to pass the required parameters to the
driver's VM by using string concatenation. Setting the variable to (e.g.)
"512m -Dhdp.version=NNN" appends the -D option to the -X option which is
currently read from this environment variable. Adding a secondary variable to
the System.env which gets parsed for JVM options would be far more obvious and
less hacky, as would adding a separate environment list for the driver, extending
what's currently available for executors.

I'm adding this as a comment to this issue, since I believe it is sufficiently 
closely related not to warrant a separate issue.

 Add SparkConf API to configure SparkR
 -

 Key: SPARK-6816
 URL: https://issues.apache.org/jira/browse/SPARK-6816
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Minor

 Right now the only way to configure SparkR is to pass in arguments to 
 sparkR.init. The goal is to add an API similar to SparkConf on Scala/Python 
 to make configuration easier






[jira] [Commented] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency

2015-06-01 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567219#comment-14567219
 ] 

Michael Armbrust commented on SPARK-8008:
-

I'm okay adding documentation about this behavior wherever you think it would
help, but I would say this is by design.

I'd suggest that if you want lower concurrency use fewer partitions to extract 
the data and then {{repartition}} if you need higher concurrency for subsequent 
operations.
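
A hedged sketch of that suggestion (the URL, table, bounds and partition counts are placeholders, assuming Spark 1.4's DataFrameReader.jdbc):

{code}
import java.util.Properties

// Few JDBC partitions => few concurrent connections hitting the database.
val events = sqlContext.read.jdbc(
  "jdbc:mysql://db-host/mydb", "events",
  "id", 0L, 100000000L, 4, new Properties())

// Then raise the parallelism for the Spark-side work.
val wide = events.repartition(200)
{code}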

 sqlContext.jdbc can kill your database due to high concurrency
 --

 Key: SPARK-8008
 URL: https://issues.apache.org/jira/browse/SPARK-8008
 Project: Spark
  Issue Type: Bug
Reporter: Rene Treffer

 Spark tries to load as many partitions as possible in parallel, which can in 
 turn overload the database although it would be possible to load all 
 partitions given a lower concurrency.
 It would be nice to either limit the maximum concurrency or to at least warn 
 about this behavior.






[jira] [Commented] (SPARK-4782) Add inferSchema support for RDD[Map[String, Any]]

2015-06-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567237#comment-14567237
 ] 

Apache Spark commented on SPARK-4782:
-

User 'lucarosellini' has created a pull request for this issue:
https://github.com/apache/spark/pull/6554

 Add inferSchema support for RDD[Map[String, Any]]
 -

 Key: SPARK-4782
 URL: https://issues.apache.org/jira/browse/SPARK-4782
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Jianshi Huang
Priority: Minor

 The best way to convert RDD[Map[String, Any]] to SchemaRDD currently seems to
 be converting each Map to a JSON String first and using JsonRDD.inferSchema on it.
 That's very inefficient.
 Instead of JsonRDD, RDD[Map[String, Any]] is a better common denominator for
 schemaless data, as adding a Map-like interface to any serialization format is
 easy.
 So please add inferSchema support to RDD[Map[String, Any]]. *Then, for any new
 serialization format we want to support, we just need to add a Map interface
 wrapper to it.*
 Jianshi






[jira] [Commented] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency

2015-06-01 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567227#comment-14567227
 ] 

Michael Armbrust commented on SPARK-8008:
-

I think connection pooling is used primarily to avoid the overhead of making a 
new connection for each operation.  In the case of extracting large amounts of 
data, the user may actually want multiple concurrent connections from the same 
machine.

 sqlContext.jdbc can kill your database due to high concurrency
 --

 Key: SPARK-8008
 URL: https://issues.apache.org/jira/browse/SPARK-8008
 Project: Spark
  Issue Type: Bug
Reporter: Rene Treffer

 Spark tries to load as many partitions as possible in parallel, which can in 
 turn overload the database although it would be possible to load all 
 partitions given a lower concurrency.
 It would be nice to either limit the maximum concurrency or to at least warn 
 about this behavior.






[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests

2015-06-01 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567016#comment-14567016
 ] 

Saisai Shao commented on SPARK-4352:


Hi [~sandyr], thanks a lot for your suggestion. IIUC, the algorithm you describe
tries to make the executor requests proportional to the node preferences:
say the desired tasks on the cluster are 3 : 3 : 2 : 1, then you try to
allocate the executors following this ratio. But I'm curious about the algorithm
in the 7- and 18-executor situations. What you describe is:

requests for 5 executors with nodes = a, b, c, d
requests for 2 executors with nodes = a, b, c

that is 7 : 7 : 7 : 5

is it better like this:

requests for 2 executors with nodes = a, b, c, d
requests for 2 executors with nodes = a, b, c
requests for 3 executors with nodes = a, b

here is 7 : 7 : 4 : 2

Also, for the 18-executor situation, why not:

requests for 6 executors with nodes = a, b, c, d
requests for 6 executors with nodes = a, b, c
requests for 6 executors with nodes = a, b

Would you please help explain it? Maybe I missed something somewhere :).


 Incorporate locality preferences in dynamic allocation requests
 ---

 Key: SPARK-4352
 URL: https://issues.apache.org/jira/browse/SPARK-4352
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Saisai Shao
Priority: Critical
 Attachments: Supportpreferrednodelocationindynamicallocation.pdf


 Currently, achieving data locality in Spark is difficult unless an 
 application takes resources on every node in the cluster.  
 preferredNodeLocalityData provides a sort of hacky workaround that has been 
 broken since 1.0.
 With dynamic executor allocation, Spark requests executors in response to 
 demand from the application.  When this occurs, it would be useful to look at 
 the pending tasks and communicate their location preferences to the cluster 
 resource manager. 






[jira] [Commented] (SPARK-7993) Improve DataFrame.show() output

2015-06-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567020#comment-14567020
 ] 

Reynold Xin commented on SPARK-7993:


Thanks. Note that once you change the show output, you might need to update 
some Python unit tests since some of the functions use show's output.


 Improve DataFrame.show() output
 ---

 Key: SPARK-7993
 URL: https://issues.apache.org/jira/browse/SPARK-7993
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker
  Labels: starter

 1. Each column should be at least 3 characters wide. Right now, if the
 widest value is 1, the column is just 1 character wide, which looks ugly. Example below:
 2. If a DataFrame has more than N rows (N = 20 by default for show), we should
 display a message at the end like "only showing top 20 rows".
 {code}
 +--+--+-+
 | a| b|c|
 +--+--+-+
 | 1| 2|3|
 | 1| 2|1|
 | 1| 2|3|
 | 3| 6|3|
 | 1| 2|3|
 | 5|10|1|
 | 1| 2|3|
 | 7|14|3|
 | 1| 2|3|
 | 9|18|1|
 | 1| 2|3|
 |11|22|3|
 | 1| 2|3|
 |13|26|1|
 | 1| 2|3|
 |15|30|3|
 | 1| 2|3|
 |17|34|1|
 | 1| 2|3|
 |19|38|3|
 +--+--+-+
 only showing top 20 rows    <--- add this at the end
 {code}
 3. For array values, instead of printing ArrayBuffer, we should just print 
 square brackets:
 {code}
 +------------------+------------------+-----------------+
 |       a_freqItems|       b_freqItems|      c_freqItems|
 +------------------+------------------+-----------------+
 |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)|
 +------------------+------------------+-----------------+
 {code}
 should be
 {code}
 +-----------+-----------+-----------+
 |a_freqItems|b_freqItems|c_freqItems|
 +-----------+-----------+-----------+
 |    [11, 1]|    [2, 22]|     [1, 3]|
 +-----------+-----------+-----------+
 {code}






[jira] [Commented] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency

2015-06-01 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567234#comment-14567234
 ] 

Michael Armbrust commented on SPARK-8008:
-

What is the problem with large partitions (as long as you aren't caching them, 
where there is a 2GB limit)?

 sqlContext.jdbc can kill your database due to high concurrency
 --

 Key: SPARK-8008
 URL: https://issues.apache.org/jira/browse/SPARK-8008
 Project: Spark
  Issue Type: Bug
Reporter: Rene Treffer

 Spark tries to load as many partitions as possible in parallel, which can in 
 turn overload the database although it would be possible to load all 
 partitions given a lower concurrency.
 It would be nice to either limit the maximum concurrency or to at least warn 
 about this behavior.






[jira] [Commented] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency

2015-06-01 Thread Rene Treffer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567240#comment-14567240
 ] 

Rene Treffer commented on SPARK-8008:
-

I've seen very poor performance when streaming it as one partition (for example,
WHERE 1=1). I'll retry with different partition counts.

But I still think there should be a warning about the behavior, as I didn't 
naturally understand that partition count == parallelism in this case (although 
it's logical after some thinking).

 sqlContext.jdbc can kill your database due to high concurrency
 --

 Key: SPARK-8008
 URL: https://issues.apache.org/jira/browse/SPARK-8008
 Project: Spark
  Issue Type: Bug
Reporter: Rene Treffer

 Spark tries to load as many partitions as possible in parallel, which can in 
 turn overload the database although it would be possible to load all 
 partitions given a lower concurrency.
 It would be nice to either limit the maximum concurrency or to at least warn 
 about this behavior.






[jira] [Updated] (SPARK-8011) DecimalType is not a datatype

2015-06-01 Thread Bipin Roshan Nag (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bipin Roshan Nag updated SPARK-8011:

Description: 
When I run the following in spark-shell:

 StructType(StructField("ID",IntegerType,true),
StructField("Value",DecimalType,true))
I get

<console>:50: error: type mismatch;
 found   : org.apache.spark.sql.types.DecimalType.type
 required: org.apache.spark.sql.types.DataType
   StructType(StructField("ID",IntegerType,true),
StructField("Value",DecimalType,true))



  was:
When I run the following in spark-shell:

 StructType(StructField("ID",IntegerType,true),
StructField("Value",DecimalType,true))

I get

<console>:50: error: type mismatch;
 found   : org.apache.spark.sql.types.DecimalType.type
 required: org.apache.spark.sql.types.DataType
   StructType(StructField("ID",IntegerType,true),
StructField("Value",DecimalType,true))




 DecimalType is not a datatype
 -

 Key: SPARK-8011
 URL: https://issues.apache.org/jira/browse/SPARK-8011
 Project: Spark
  Issue Type: Bug
  Components: Java API, Spark Core
Affects Versions: 1.3.1
Reporter: Bipin Roshan Nag

 When I run the following in spark-shell:
  StructType(StructField("ID",IntegerType,true),
 StructField("Value",DecimalType,true))
 I get
 <console>:50: error: type mismatch;
  found   : org.apache.spark.sql.types.DecimalType.type
  required: org.apache.spark.sql.types.DataType
 StructType(StructField("ID",IntegerType,true),
 StructField("Value",DecimalType,true))






[jira] [Commented] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7

2015-06-01 Thread Steven W (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567267#comment-14567267
 ] 

Steven W commented on SPARK-5389:
-

I started seeing this when I installed JDK 6 on top of JDK 8. I re-installed
JDK 8 and it worked after that. So, I think "else was unexpected at this time."
just shows up anytime Java can't run.

 spark-shell.cmd does not run from DOS Windows 7
 ---

 Key: SPARK-5389
 URL: https://issues.apache.org/jira/browse/SPARK-5389
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Shell, Windows
Affects Versions: 1.2.0
 Environment: Windows 7
Reporter: Yana Kadiyska
 Attachments: SparkShell_Win7.JPG, spark_bug.png


 spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. 
 spark-shell.cmd works fine for me in v1.1, so this is new in Spark 1.2.
 Marking as trivial since calling spark-shell2.cmd also works fine
 Attaching a screenshot since the error isn't very useful:
 {code}
 spark-1.2.0-bin-cdh4\bin\spark-shell.cmd
 else was unexpected at this time.
 {code}






[jira] [Created] (SPARK-8001) Make AsynchronousListenerBus.waitUntilEmpty throw TimeoutException if timeout

2015-06-01 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-8001:
---

 Summary: Make AsynchronousListenerBus.waitUntilEmpty throw 
TimeoutException if timeout
 Key: SPARK-8001
 URL: https://issues.apache.org/jira/browse/SPARK-8001
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Shixiong Zhu
Priority: Minor


TimeoutException is a more explicit failure. In addition, the caller may forget 
to call {{assert}} to check the return value of 
{{AsynchronousListenerBus.waitUntilEmpty}}.
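
A standalone illustration of the two API styles (the class below is a toy stand-in, not Spark's AsynchronousListenerBus):

{code}
import java.util.concurrent.TimeoutException

class ToyBus {
  @volatile private var pending = 0
  def post(): Unit = synchronized { pending += 1 }
  def done(): Unit = synchronized { pending -= 1 }

  // Current style: returns a Boolean the caller may forget to assert on.
  def waitUntilEmptyBoolean(timeoutMillis: Long): Boolean = {
    val deadline = System.currentTimeMillis() + timeoutMillis
    while (pending > 0 && System.currentTimeMillis() < deadline) Thread.sleep(10)
    pending == 0
  }

  // Proposed style: a timeout cannot be silently ignored.
  def waitUntilEmpty(timeoutMillis: Long): Unit =
    if (!waitUntilEmptyBoolean(timeoutMillis)) {
      throw new TimeoutException(s"listener bus not empty after $timeoutMillis ms")
    }
}
{code}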






[jira] [Created] (SPARK-8005) Support INPUT__FILE__NAME virtual column

2015-06-01 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8005:
--

 Summary: Support INPUT__FILE__NAME virtual column
 Key: SPARK-8005
 URL: https://issues.apache.org/jira/browse/SPARK-8005
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


INPUT__FILE__NAME: input file name.

One way to do this is through a thread-local variable in
SqlNewHadoopRDD.scala, and to read that thread-local variable in an expression
(similar to the SparkPartitionID expression).
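
A minimal sketch of that thread-local idea (all names here are hypothetical; this is not the actual SqlNewHadoopRDD or expression code):

{code}
object InputFileNameHolder {
  private val current = new ThreadLocal[String] {
    override def initialValue(): String = ""
  }
  // The record reader would call this whenever it starts reading a new file/split...
  def setInputFileName(file: String): Unit = current.set(file)
  // ...and an INPUT__FILE__NAME expression would read it for each row it evaluates.
  def getInputFileName(): String = current.get()
}
{code}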







[jira] [Created] (SPARK-8006) Support BLOCK__OFFSET__INSIDE__FILE virtual column

2015-06-01 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8006:
--

 Summary: Support BLOCK__OFFSET__INSIDE__FILE virtual column
 Key: SPARK-8006
 URL: https://issues.apache.org/jira/browse/SPARK-8006
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


See Hive's semantics: 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns







[jira] [Commented] (SPARK-7993) Improve DataFrame.show() output

2015-06-01 Thread Akhil Thatipamula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567135#comment-14567135
 ] 

Akhil Thatipamula commented on SPARK-7993:
--

[~rxin] Does the 3rd modification affect 'List' as well?
For instance,
+--------------------+
|             modules|
+--------------------+
|List(mllib, sql, ...|
+--------------------+
should it be
+-----------------+
|          modules|
+-----------------+
| [mllib, sql, ...|
+-----------------+
?

 Improve DataFrame.show() output
 ---

 Key: SPARK-7993
 URL: https://issues.apache.org/jira/browse/SPARK-7993
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker
  Labels: starter

 1. Each column should be at least 3 characters wide. Right now, if the
 widest value is 1, the column is just 1 character wide, which looks ugly. Example below:
 2. If a DataFrame has more than N rows (N = 20 by default for show), we should
 display a message at the end like "only showing top 20 rows".
 {code}
 +--+--+-+
 | a| b|c|
 +--+--+-+
 | 1| 2|3|
 | 1| 2|1|
 | 1| 2|3|
 | 3| 6|3|
 | 1| 2|3|
 | 5|10|1|
 | 1| 2|3|
 | 7|14|3|
 | 1| 2|3|
 | 9|18|1|
 | 1| 2|3|
 |11|22|3|
 | 1| 2|3|
 |13|26|1|
 | 1| 2|3|
 |15|30|3|
 | 1| 2|3|
 |17|34|1|
 | 1| 2|3|
 |19|38|3|
 +--+--+-+
 only showing top 20 rows    <--- add this at the end
 {code}
 3. For array values, instead of printing ArrayBuffer, we should just print 
 square brackets:
 {code}
 +------------------+------------------+-----------------+
 |       a_freqItems|       b_freqItems|      c_freqItems|
 +------------------+------------------+-----------------+
 |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)|
 +------------------+------------------+-----------------+
 {code}
 should be
 {code}
 +-----------+-----------+-----------+
 |a_freqItems|b_freqItems|c_freqItems|
 +-----------+-----------+-----------+
 |    [11, 1]|    [2, 22]|     [1, 3]|
 +-----------+-----------+-----------+
 {code}






[jira] [Assigned] (SPARK-7980) Support SQLContext.range(end)

2015-06-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7980:
---

Assignee: (was: Apache Spark)

 Support SQLContext.range(end)
 -

 Key: SPARK-7980
 URL: https://issues.apache.org/jira/browse/SPARK-7980
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 SQLContext.range should also allow only specifying the end position, similar 
 to Python's own range.
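A minimal sketch of the proposed overload, assuming it simply delegates to the 
existing two-argument form:
{code}
// In SQLContext: the single-argument form starts at 0, like Python's range(end).
def range(end: Long): DataFrame = range(0, end)

// Usage: sqlContext.range(20) would produce the same rows as sqlContext.range(0, 20).
{code}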



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7246) Rank for DataFrames

2015-06-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567023#comment-14567023
 ] 

Reynold Xin commented on SPARK-7246:


This is done now with window functions, right?
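For reference, a minimal sketch of ranking with the window-function API 
introduced in 1.4 (column names are illustrative):
{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

// Rank rows by the "time" column over the whole frame.
val w = Window.orderBy("time")
val ranked = df.select(df("name"), rank().over(w).as("rank"))
{code}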


 Rank for DataFrames
 ---

 Key: SPARK-7246
 URL: https://issues.apache.org/jira/browse/SPARK-7246
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Xiangrui Meng

 `rank` maps a numeric column to a long column with rankings. `rank` should be 
 an expression. Where it lives is TBD. One suggestion is `funcs.stat`.
 {code}
 df.select(name, rank(time))
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8002) Support virtual columns in SQL and DataFrames

2015-06-01 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8002:
--

 Summary: Support virtual columns in SQL and DataFrames
 Key: SPARK-8002
 URL: https://issues.apache.org/jira/browse/SPARK-8002
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Reporter: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8003) Support SPARK__PARTITION__ID in SQL

2015-06-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8003:
---
Description: 
SPARK__PARTITION__ID column should return the partition index of the Spark 
partition. Note that we already have a DataFrame function for it: 
https://github.com/apache/spark/blob/78a6723e8758b429f877166973cc4f1bbfce73c4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L705



  was:
PARTITION__ID column should return the partition index of the Spark partition. 
Note that we already have a DataFrame function for it: 
https://github.com/apache/spark/blob/78a6723e8758b429f877166973cc4f1bbfce73c4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L705




 Support SPARK__PARTITION__ID in SQL
 ---

 Key: SPARK-8003
 URL: https://issues.apache.org/jira/browse/SPARK-8003
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 SPARK__PARTITION__ID column should return the partition index of the Spark 
 partition. Note that we already have a DataFrame function for it: 
 https://github.com/apache/spark/blob/78a6723e8758b429f877166973cc4f1bbfce73c4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L705



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8007) Support resolving virtual columns in DataFrames

2015-06-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8007:
---
Description: 
Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to the 
SparkPartitionID expression.

A cool use case is to understand physical data skew:
{code}
df.groupBy("SPARK__PARTITION__ID").count()
{code}

  was:Create the infrastructure so we can resolve df(SPARK_PARTITION__ID) to 
SparkPartitionID expression.


 Support resolving virtual columns in DataFrames
 ---

 Key: SPARK-8007
 URL: https://issues.apache.org/jira/browse/SPARK-8007
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to 
 the SparkPartitionID expression.
 A cool use case is to understand physical data skew:
 {code}
 df.groupBy("SPARK__PARTITION__ID").count()
 {code}
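For reference, the skew check is already expressible today with the existing 
DataFrame function, assuming Spark 1.4's org.apache.spark.sql.functions:
{code}
import org.apache.spark.sql.functions.sparkPartitionId

// Rows per physical partition; a heavily skewed partition stands out immediately.
df.groupBy(sparkPartitionId()).count().show()
{code}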



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency

2015-06-01 Thread Rene Treffer (JIRA)
Rene Treffer created SPARK-8008:
---

 Summary: sqlContext.jdbc can kill your database due to high 
concurrency
 Key: SPARK-8008
 URL: https://issues.apache.org/jira/browse/SPARK-8008
 Project: Spark
  Issue Type: Bug
Reporter: Rene Treffer


Spark tries to load as many partitions as possible in parallel, which can 
overload the database even though it would be possible to load all partitions 
at a lower concurrency.

It would be nice to either limit the maximum concurrency or at least warn 
about this behavior.
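Until such a limit exists, one possible workaround sketch is to cap the read 
parallelism on the Spark side (connection details, table and bounds below are 
placeholders):
{code}
// 200 logical JDBC partitions, but at most 8 read concurrently after coalesce.
val df = sqlContext.read.jdbc(
  "jdbc:mysql://db-host/mydb",   // placeholder URL
  "big_table",                   // placeholder table
  "id", 0L, 1000000L, 200,       // partition column, bounds, partition count
  new java.util.Properties())
val throttled = df.coalesce(8)   // coalesce (no shuffle) limits concurrent connections
{code}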



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7993) Improve DataFrame.show() output

2015-06-01 Thread Akhil Thatipamula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567017#comment-14567017
 ] 

Akhil Thatipamula commented on SPARK-7993:
--

[~rxin] I will work on this.

 Improve DataFrame.show() output
 ---

 Key: SPARK-7993
 URL: https://issues.apache.org/jira/browse/SPARK-7993
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker
  Labels: starter

 1. Each column should be at the minimum 3 characters wide. Right now if the 
 widest value is 1, it is just 1 char wide, which looks ugly. Example below:
 2. If a DataFrame have more than N number of rows (N = 20 by default for 
 show), at the end we should display a message like only showing the top 20 
 rows.
 {code}
 +--+--+-+
 | a| b|c|
 +--+--+-+
 | 1| 2|3|
 | 1| 2|1|
 | 1| 2|3|
 | 3| 6|3|
 | 1| 2|3|
 | 5|10|1|
 | 1| 2|3|
 | 7|14|3|
 | 1| 2|3|
 | 9|18|1|
 | 1| 2|3|
 |11|22|3|
 | 1| 2|3|
 |13|26|1|
 | 1| 2|3|
 |15|30|3|
 | 1| 2|3|
 |17|34|1|
 | 1| 2|3|
 |19|38|3|
 +--+--+-+
 only showing top 20 rows    add this at the end
 {code}
 3. For array values, instead of printing ArrayBuffer, we should just print 
 square brackets:
 {code}
 +--+--+-+
 |   a_freqItems|   b_freqItems|  c_freqItems|
 +--+--+-+
 |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)|
 +--+--+-+
 {code}
 should be
 {code}
 +---+---+---+
 |a_freqItems|b_freqItems|c_freqItems|
 +---+---+---+
 |[11, 1]|[2, 22]| [1, 3]|
 +---+---+---+
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8009) [Mesos] Allow provisioning of executor logging configuration

2015-06-01 Thread Gerard Maas (JIRA)
Gerard Maas created SPARK-8009:
--

 Summary: [Mesos] Allow provisioning of executor logging 
configuration 
 Key: SPARK-8009
 URL: https://issues.apache.org/jira/browse/SPARK-8009
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Affects Versions: 1.3.1
 Environment: Mesos executor
Reporter: Gerard Maas


It's currently not possible to provide a custom logging configuration for the 
Mesos executors. 
Upon startup of the executor JVM, it loads a default config file from the Spark 
assembly, visible as this line in stderr: 

 Using Spark's default log4j profile: 
 org/apache/spark/log4j-defaults.properties

That line comes from Logging.scala [1], where a default config is loaded if 
none is found on the classpath when the Spark Mesos executor starts up in the 
Mesos sandbox. At that point in time, none of the application-specific 
resources have been shipped yet, as the executor JVM is just starting up.

To load a custom configuration file, we need it to be in the sandbox before 
the executor JVM starts and to add it to the classpath in the startup 
command.

For the classpath customization, it looks like it should be possible to pass a 
-Dlog4j.configuration property by using 'spark.executor.extraClassPath', which 
is picked up at [2] and added to the command that starts the executor JVM, but 
the resource must already be on the host before we can do that. Therefore we 
need some means of 'shipping' the log4j.configuration file to the allocated 
executor.
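For illustration, the settings under discussion look roughly like this, 
assuming the custom log4j file could already be placed in the executor sandbox 
(which is exactly the missing piece this ticket asks for):
{code}
// Hedged sketch only; the file reference assumes the sandbox working directory.
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-Dlog4j.configuration=file:./custom-log4j.properties")
  .set("spark.executor.extraClassPath", "./")
{code}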

This all boils down to the need of shipping extra files to the sandbox. 

There's a workaround: open up the Spark assembly, replace the 
log4j-defaults.properties and pack it up again. That would work, although it is 
kind of rudimentary, as people may use the same assembly for many jobs. 
Accessing the log4j API programmatically should probably also work (we didn't 
try that yet).

[1] 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Logging.scala#L128
[2] 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala#L77




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7798) Move AkkaRpcEnv to a separate project

2015-06-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567096#comment-14567096
 ] 

Sean Owen commented on SPARK-7798:
--

What do you mean by separate project? I don't think this warrants its own 
module. Can this please be combined with the other move, deprecate and remove 
JIRAs? We don't need three of them.

 Move AkkaRpcEnv to a separate project
 ---

 Key: SPARK-7798
 URL: https://issues.apache.org/jira/browse/SPARK-7798
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Shixiong Zhu





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8010) Implict promote Numeric type to String type in HiveTypeCoercion

2015-06-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567105#comment-14567105
 ] 

Apache Spark commented on SPARK-8010:
-

User 'OopsOutOfMemory' has created a pull request for this issue:
https://github.com/apache/spark/pull/6551

 Implict promote Numeric type to String type in HiveTypeCoercion
 ---

 Key: SPARK-8010
 URL: https://issues.apache.org/jira/browse/SPARK-8010
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Li Sheng
 Fix For: 1.3.1

   Original Estimate: 48h
  Remaining Estimate: 48h

 1. Given a query
 `select coalesce(null, 1, '1') from dual` will cause exception:
   
   java.lang.RuntimeException: Could not determine return type of Coalesce for 
 IntegerType,StringType
 2. Given a query:
 `select case when true then 1 else '1' end from dual` will cause exception:
   java.lang.RuntimeException: Types in CASE WHEN must be the same or 
 coercible to a common type: StringType != IntegerType
 I checked the code, the main cause is the HiveTypeCoercion doesn't do 
 implicit convert when there is a IntegerType and StringType.
 Numeric types can be promoted to string type in case throw exceptions.
 Since Hive will always do this. It need to be fixed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8010) Implict promote Numeric type to String type in HiveTypeCoercion

2015-06-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8010:
---

Assignee: (was: Apache Spark)

 Implict promote Numeric type to String type in HiveTypeCoercion
 ---

 Key: SPARK-8010
 URL: https://issues.apache.org/jira/browse/SPARK-8010
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Li Sheng
 Fix For: 1.3.1

   Original Estimate: 48h
  Remaining Estimate: 48h

 1. Given a query
 `select coalesce(null, 1, '1') from dual` will cause exception:
   
   java.lang.RuntimeException: Could not determine return type of Coalesce for 
 IntegerType,StringType
 2. Given a query:
 `select case when true then 1 else '1' end from dual` will cause exception:
   java.lang.RuntimeException: Types in CASE WHEN must be the same or 
 coercible to a common type: StringType != IntegerType
 I checked the code, the main cause is the HiveTypeCoercion doesn't do 
 implicit convert when there is a IntegerType and StringType.
 Numeric types can be promoted to string type in case throw exceptions.
 Since Hive will always do this. It need to be fixed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7993) Improve DataFrame.show() output

2015-06-01 Thread Akhil Thatipamula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567135#comment-14567135
 ] 

Akhil Thatipamula edited comment on SPARK-7993 at 6/1/15 10:25 AM:
---

[~rxin] Does the 3rd modification affect 'List' as well?
For instance,
++
|modules|
++
|List(mllib, sql, ...|
++
should it be?
++
|   modules|
++
| [mllib, sql, ...|
++



was (Author: 6133d):
[~rxin] Does the 3rd modification effect 'List' as well.
For instance,
++
|modules|
++
|List(mllib, sql, ...|
++
should it be
++
|   modules|
++
| [mllib, sql, ...|
++
?

 Improve DataFrame.show() output
 ---

 Key: SPARK-7993
 URL: https://issues.apache.org/jira/browse/SPARK-7993
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker
  Labels: starter

 1. Each column should be at the minimum 3 characters wide. Right now if the 
 widest value is 1, it is just 1 char wide, which looks ugly. Example below:
 2. If a DataFrame have more than N number of rows (N = 20 by default for 
 show), at the end we should display a message like only showing the top 20 
 rows.
 {code}
 +--+--+-+
 | a| b|c|
 +--+--+-+
 | 1| 2|3|
 | 1| 2|1|
 | 1| 2|3|
 | 3| 6|3|
 | 1| 2|3|
 | 5|10|1|
 | 1| 2|3|
 | 7|14|3|
 | 1| 2|3|
 | 9|18|1|
 | 1| 2|3|
 |11|22|3|
 | 1| 2|3|
 |13|26|1|
 | 1| 2|3|
 |15|30|3|
 | 1| 2|3|
 |17|34|1|
 | 1| 2|3|
 |19|38|3|
 +--+--+-+
 only showing top 20 rows    add this at the end
 {code}
 3. For array values, instead of printing ArrayBuffer, we should just print 
 square brackets:
 {code}
 +--+--+-+
 |   a_freqItems|   b_freqItems|  c_freqItems|
 +--+--+-+
 |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)|
 +--+--+-+
 {code}
 should be
 {code}
 +---+---+---+
 |a_freqItems|b_freqItems|c_freqItems|
 +---+---+---+
 |[11, 1]|[2, 22]| [1, 3]|
 +---+---+---+
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7993) Improve DataFrame.show() output

2015-06-01 Thread Akhil Thatipamula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567135#comment-14567135
 ] 

Akhil Thatipamula edited comment on SPARK-7993 at 6/1/15 10:26 AM:
---

[~rxin] Does the 3rd modification affect 'List' as well?
For instance,
|List(mllib, sql, ...|
should it be?
| [mllib, sql, ...|



was (Author: 6133d):
[~rxin] Does the 3rd modification effect 'List' as well.
For instance,
++
|modules|
++
|List(mllib, sql, ...|
++
should it be?
++
|   modules|
++
| [mllib, sql, ...|
++


 Improve DataFrame.show() output
 ---

 Key: SPARK-7993
 URL: https://issues.apache.org/jira/browse/SPARK-7993
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker
  Labels: starter

 1. Each column should be at the minimum 3 characters wide. Right now if the 
 widest value is 1, it is just 1 char wide, which looks ugly. Example below:
 2. If a DataFrame have more than N number of rows (N = 20 by default for 
 show), at the end we should display a message like only showing the top 20 
 rows.
 {code}
 +--+--+-+
 | a| b|c|
 +--+--+-+
 | 1| 2|3|
 | 1| 2|1|
 | 1| 2|3|
 | 3| 6|3|
 | 1| 2|3|
 | 5|10|1|
 | 1| 2|3|
 | 7|14|3|
 | 1| 2|3|
 | 9|18|1|
 | 1| 2|3|
 |11|22|3|
 | 1| 2|3|
 |13|26|1|
 | 1| 2|3|
 |15|30|3|
 | 1| 2|3|
 |17|34|1|
 | 1| 2|3|
 |19|38|3|
 +--+--+-+
 only showing top 20 rows    add this at the end
 {code}
 3. For array values, instead of printing ArrayBuffer, we should just print 
 square brackets:
 {code}
 +--+--+-+
 |   a_freqItems|   b_freqItems|  c_freqItems|
 +--+--+-+
 |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)|
 +--+--+-+
 {code}
 should be
 {code}
 +---+---+---+
 |a_freqItems|b_freqItems|c_freqItems|
 +---+---+---+
 |[11, 1]|[2, 22]| [1, 3]|
 +---+---+---+
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5302) Add support for SQLContext partition columns

2015-06-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567061#comment-14567061
 ] 

Reynold Xin commented on SPARK-5302:


[~btiernay] is this resolved now that SPARK-5182 is resolved?

 Add support for SQLContext partition columns
 --

 Key: SPARK-5302
 URL: https://issues.apache.org/jira/browse/SPARK-5302
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Bob Tiernay

 For {{SQLContext}} (not {{HiveContext}}) it would be very convenient to 
 support a virtual column that maps to part of the file path, similar to 
 what is done in Hive for partitions (e.g. {{/data/clicks/dt=2015-01-01/}} 
 where {{dt}} is a column of type {{TEXT}}). 
 The API could allow the user to type the column using an appropriate 
 {{DataType}} instance. This new field could be addressed in SQL statements 
 much the same as is done in Hive. 
 As a consequence, pruning of partitions would be possible when executing a 
 query, and it would also remove the need to materialize a column in each 
 logical partition that is already encoded in the path name. Furthermore, this 
 would provide a nice interop and migration strategy for Hive users who may 
 one day use {{SQLContext}} directly.
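For reference, a minimal sketch of the Hive-style layout the request describes, 
using Parquet partition discovery (paths are placeholders):
{code}
// Layout on disk:
//   /data/clicks/dt=2015-01-01/part-00000.parquet
//   /data/clicks/dt=2015-01-02/part-00000.parquet
val clicks = sqlContext.read.parquet("/data/clicks")
// dt is resolved as a column and can be used for partition pruning.
clicks.filter(clicks("dt") === "2015-01-01").count()
{code}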



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8004) Spark does not enclose column names when fetchting from jdbc sources

2015-06-01 Thread Rene Treffer (JIRA)
Rene Treffer created SPARK-8004:
---

 Summary: Spark does not enclose column names when fetchting from 
jdbc sources
 Key: SPARK-8004
 URL: https://issues.apache.org/jira/browse/SPARK-8004
 Project: Spark
  Issue Type: Bug
Reporter: Rene Treffer


Spark fails to load tables that have a keyword as a column name

Sample error:
{code}

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 157.0 failed 1 times, most recent failure: Lost task 0.0 in stage 157.0 
(TID 4322, localhost): 
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in 
your SQL syntax; check the manual that corresponds to your MySQL server version 
for the right syntax to use near 'key,value FROM [XX]'
{code}

A correct query would have been
{code}
SELECT `key`,`value` FROM 
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8003) Support PARTITION__ID in SQL

2015-06-01 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8003:
--

 Summary: Support PARTITION__ID in SQL
 Key: SPARK-8003
 URL: https://issues.apache.org/jira/browse/SPARK-8003
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


PARTITION__ID column should return the partition index of the Spark partition. 
Note that we already have a DataFrame function for it: 
https://github.com/apache/spark/blob/78a6723e8758b429f877166973cc4f1bbfce73c4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L705
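A hedged sketch of the intended SQL usage once the virtual column resolves (the 
table name is a placeholder):
{code}
// Count rows per Spark partition directly from SQL.
sqlContext.sql(
  "SELECT PARTITION__ID, count(*) FROM clicks GROUP BY PARTITION__ID")
{code}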





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8010) Implict promote Numeric type to String type in HiveTypeCoercion

2015-06-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8010:
---

Assignee: Apache Spark

 Implict promote Numeric type to String type in HiveTypeCoercion
 ---

 Key: SPARK-8010
 URL: https://issues.apache.org/jira/browse/SPARK-8010
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Li Sheng
Assignee: Apache Spark
 Fix For: 1.3.1

   Original Estimate: 48h
  Remaining Estimate: 48h

 1. Given a query
 `select coalesce(null, 1, '1') from dual` will cause exception:
   
   java.lang.RuntimeException: Could not determine return type of Coalesce for 
 IntegerType,StringType
 2. Given a query:
 `select case when true then 1 else '1' end from dual` will cause exception:
   java.lang.RuntimeException: Types in CASE WHEN must be the same or 
 coercible to a common type: StringType != IntegerType
 I checked the code, the main cause is the HiveTypeCoercion doesn't do 
 implicit convert when there is a IntegerType and StringType.
 Numeric types can be promoted to string type in case throw exceptions.
 Since Hive will always do this. It need to be fixed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7980) Support SQLContext.range(end)

2015-06-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567142#comment-14567142
 ] 

Apache Spark commented on SPARK-7980:
-

User 'animeshbaranawal' has created a pull request for this issue:
https://github.com/apache/spark/pull/6552

 Support SQLContext.range(end)
 -

 Key: SPARK-7980
 URL: https://issues.apache.org/jira/browse/SPARK-7980
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 SQLContext.range should also allow only specifying the end position, similar 
 to Python's own range.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7980) Support SQLContext.range(end)

2015-06-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7980:
---

Assignee: Apache Spark

 Support SQLContext.range(end)
 -

 Key: SPARK-7980
 URL: https://issues.apache.org/jira/browse/SPARK-7980
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark

 SQLContext.range should also allow only specifying the end position, similar 
 to Python's own range.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7993) Improve DataFrame.show() output

2015-06-01 Thread Akhil Thatipamula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567040#comment-14567040
 ] 

Akhil Thatipamula commented on SPARK-7993:
--

Thanks for mentioning it, I will take care of that.

 Improve DataFrame.show() output
 ---

 Key: SPARK-7993
 URL: https://issues.apache.org/jira/browse/SPARK-7993
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker
  Labels: starter

 1. Each column should be at the minimum 3 characters wide. Right now if the 
 widest value is 1, it is just 1 char wide, which looks ugly. Example below:
 2. If a DataFrame have more than N number of rows (N = 20 by default for 
 show), at the end we should display a message like only showing the top 20 
 rows.
 {code}
 +--+--+-+
 | a| b|c|
 +--+--+-+
 | 1| 2|3|
 | 1| 2|1|
 | 1| 2|3|
 | 3| 6|3|
 | 1| 2|3|
 | 5|10|1|
 | 1| 2|3|
 | 7|14|3|
 | 1| 2|3|
 | 9|18|1|
 | 1| 2|3|
 |11|22|3|
 | 1| 2|3|
 |13|26|1|
 | 1| 2|3|
 |15|30|3|
 | 1| 2|3|
 |17|34|1|
 | 1| 2|3|
 |19|38|3|
 +--+--+-+
 only showing top 20 rows    add this at the end
 {code}
 3. For array values, instead of printing ArrayBuffer, we should just print 
 square brackets:
 {code}
 +--+--+-+
 |   a_freqItems|   b_freqItems|  c_freqItems|
 +--+--+-+
 |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)|
 +--+--+-+
 {code}
 should be
 {code}
 +---+---+---+
 |a_freqItems|b_freqItems|c_freqItems|
 +---+---+---+
 |[11, 1]|[2, 22]| [1, 3]|
 +---+---+---+
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8007) Support resolving virtual columns in DataFrames

2015-06-01 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8007:
--

 Summary: Support resolving virtual columns in DataFrames
 Key: SPARK-8007
 URL: https://issues.apache.org/jira/browse/SPARK-8007
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


Create the infrastructure so we can resolve df(SPARK_PARTITION__ID) to 
SparkPartitionID expression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7798) Move AkkaRpcEnv to a separate project

2015-06-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7798.
--
  Resolution: Duplicate
Target Version/s:   (was: 1.6.0)

You've got some duplication here. I think this is a lot of noise for this one 
task, lots of JIRAs? Can this not just be a couple of steps?

 Move AkkaRpcEnv to a separate project
 ---

 Key: SPARK-7798
 URL: https://issues.apache.org/jira/browse/SPARK-7798
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Shixiong Zhu





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7798) Move AkkaRpcEnv to a separate project

2015-06-01 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567103#comment-14567103
 ] 

Shixiong Zhu commented on SPARK-7798:
-

I want to propose one "move and deprecate" JIRA for 1.5, and one "remove" 
JIRA for 1.6. Thank you for pointing out this duplicate JIRA.

 Move AkkaRpcEnv to a separate project
 ---

 Key: SPARK-7798
 URL: https://issues.apache.org/jira/browse/SPARK-7798
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Shixiong Zhu





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8010) Implict promote Numeric type to String type in HiveTypeCoercion

2015-06-01 Thread Li Sheng (JIRA)
Li Sheng created SPARK-8010:
---

 Summary: Implict promote Numeric type to String type in 
HiveTypeCoercion
 Key: SPARK-8010
 URL: https://issues.apache.org/jira/browse/SPARK-8010
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Li Sheng
 Fix For: 1.3.1


1. The query
`select coalesce(null, 1, '1') from dual` causes an exception:

  java.lang.RuntimeException: Could not determine return type of Coalesce for 
IntegerType,StringType

2. The query
`select case when true then 1 else '1' end from dual` causes an exception:

  java.lang.RuntimeException: Types in CASE WHEN must be the same or coercible 
to a common type: StringType != IntegerType

I checked the code; the main cause is that HiveTypeCoercion does not do an 
implicit conversion when an IntegerType and a StringType are mixed.

Numeric types can be promoted to string type in the cases that currently throw 
exceptions, since Hive always does this. It needs to be fixed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8001) Make AsynchronousListenerBus.waitUntilEmpty throw TimeoutException if timeout

2015-06-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8001:
---

Assignee: (was: Apache Spark)

 Make AsynchronousListenerBus.waitUntilEmpty throw TimeoutException if timeout
 -

 Key: SPARK-8001
 URL: https://issues.apache.org/jira/browse/SPARK-8001
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Shixiong Zhu
Priority: Minor

 TimeoutException is a more explicit failure. In addition, the caller may 
 forget to call {{assert}} to check the return value of 
 {{AsynchronousListenerBus.waitUntilEmpty}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8003) Support SPARK__PARTITION__ID in SQL

2015-06-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8003:
---
Summary: Support SPARK__PARTITION__ID in SQL  (was: Support PARTITION__ID 
in SQL)

 Support SPARK__PARTITION__ID in SQL
 ---

 Key: SPARK-8003
 URL: https://issues.apache.org/jira/browse/SPARK-8003
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 PARTITION__ID column should return the partition index of the Spark 
 partition. Note that we already have a DataFrame function for it: 
 https://github.com/apache/spark/blob/78a6723e8758b429f877166973cc4f1bbfce73c4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L705



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7893) Complex Operators between Graphs

2015-06-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7893.
--
Resolution: Duplicate

I'd prefer to close this kind of overview JIRA, as it doesn't seem to contain 
enough to tie together sub-JIRAs. They're all graph operations, yes, but aren't 
part of a larger piece of work.

 Complex Operators between Graphs
 

 Key: SPARK-7893
 URL: https://issues.apache.org/jira/browse/SPARK-7893
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Andy Huang
  Labels: complex, graph, join, operators, union

 Currently there are 30+ operators in GraphX, but few of them operate on two 
 graphs. The only one is _*mask*_, which takes another graph as a parameter 
 and returns a new graph.
 In many complex cases, such as _*streaming graphs or merging a small graph 
 into a huge graph*_, higher-level operators between graphs can help users 
 focus and think in terms of graphs. Performance optimization can be done 
 internally and be transparent to them.
 A list of complex graph operators is 
 here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]. 
 This issue will focus on two frequently-used operators first: *union* and 
 *join*.
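As a concrete illustration of the kind of operator meant here, a naive union 
sketch (this is not an existing GraphX operator, and the de-duplication policy 
is an assumption):
{code}
import scala.reflect.ClassTag
import org.apache.spark.graphx.Graph

// Keep one attribute per duplicate vertex id (arbitrarily), drop duplicate edges.
def naiveUnion[VD: ClassTag, ED: ClassTag](
    g1: Graph[VD, ED], g2: Graph[VD, ED]): Graph[VD, ED] = {
  val vertices = g1.vertices.union(g2.vertices).reduceByKey((a, _) => a)
  val edges = g1.edges.union(g2.edges).distinct()
  Graph(vertices, edges)
}
{code}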



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7993) Improve DataFrame.show() output

2015-06-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567036#comment-14567036
 ] 

Reynold Xin commented on SPARK-7993:


Please cc me on your pull request (my github id is @rxin)

 Improve DataFrame.show() output
 ---

 Key: SPARK-7993
 URL: https://issues.apache.org/jira/browse/SPARK-7993
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker
  Labels: starter

 1. Each column should be at the minimum 3 characters wide. Right now if the 
 widest value is 1, it is just 1 char wide, which looks ugly. Example below:
 2. If a DataFrame have more than N number of rows (N = 20 by default for 
 show), at the end we should display a message like only showing the top 20 
 rows.
 {code}
 +--+--+-+
 | a| b|c|
 +--+--+-+
 | 1| 2|3|
 | 1| 2|1|
 | 1| 2|3|
 | 3| 6|3|
 | 1| 2|3|
 | 5|10|1|
 | 1| 2|3|
 | 7|14|3|
 | 1| 2|3|
 | 9|18|1|
 | 1| 2|3|
 |11|22|3|
 | 1| 2|3|
 |13|26|1|
 | 1| 2|3|
 |15|30|3|
 | 1| 2|3|
 |17|34|1|
 | 1| 2|3|
 |19|38|3|
 +--+--+-+
 only showing top 20 rows    add this at the end
 {code}
 3. For array values, instead of printing ArrayBuffer, we should just print 
 square brackets:
 {code}
 +--+--+-+
 |   a_freqItems|   b_freqItems|  c_freqItems|
 +--+--+-+
 |ArrayBuffer(11, 1)|ArrayBuffer(2, 22)|ArrayBuffer(1, 3)|
 +--+--+-+
 {code}
 should be
 {code}
 +---+---+---+
 |a_freqItems|b_freqItems|c_freqItems|
 +---+---+---+
 |[11, 1]|[2, 22]| [1, 3]|
 +---+---+---+
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7999) Graph complement

2015-06-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567098#comment-14567098
 ] 

Sean Owen commented on SPARK-7999:
--

So I'm not sure it's clear the parent issue is even something that would be 
accepted, as it's a big umbrella JIRA. 
I would start by reviewing 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark (the 
MLlib part applies here) and argue on the mailing list whether this should be 
included in GraphX, rather than starting with a JIRA or PR.

 Graph complement
 

 Key: SPARK-7999
 URL: https://issues.apache.org/jira/browse/SPARK-7999
 Project: Spark
  Issue Type: Sub-task
Reporter: Tarek Auel
Priority: Minor

 This task is for implementing the complement operation (compare to parent 
 task).
 http://techieme.in/complex-graph-operations/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8012) ArrayIndexOutOfBoundsException in SerializationDebugger

2015-06-01 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-8012:


 Summary: ArrayIndexOutOfBoundsException in SerializationDebugger
 Key: SPARK-8012
 URL: https://issues.apache.org/jira/browse/SPARK-8012
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Jianshi Huang


It makes the underlying NotSerializableException less obvious.

{noformat}
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.getObjFieldValues$extension(SerializationDebugger.scala:248)
at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:158)
at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:107)
at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:166)
at 
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:107)
at 
org.apache.spark.serializer.SerializationDebugger$.find(SerializationDebugger.scala:66)
at 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:683)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:682)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:682)
at 
org.apache.spark.sql.execution.Project.doExecute(basicOperators.scala:40)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87)
at 
org.apache.spark.sql.sources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:159)
at 
org.apache.spark.sql.sources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:131)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at 
org.apache.spark.sql.sources.DataSourceStrategy$.buildPartitionedTableScan(DataSourceStrategy.scala:131)
at 
org.apache.spark.sql.sources.DataSourceStrategy$.apply(DataSourceStrategy.scala:80)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at 
org.apache.spark.sql.execution.SparkStrategies$HashJoin$.apply(SparkStrategies.scala:109)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 

[jira] [Commented] (SPARK-2315) drop, dropRight and dropWhile which take RDD input and return RDD

2015-06-01 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567478#comment-14567478
 ] 

Erik Erlandson commented on SPARK-2315:
---

The 'drop' RDD methods have been made available on the 'silex' project 
(beginning with release 0.0.6):
https://github.com/willb/silex

Documentation:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.rdd.drop.DropRDDFunctions


 drop, dropRight and dropWhile which take RDD input and return RDD
 -

 Key: SPARK-2315
 URL: https://issues.apache.org/jira/browse/SPARK-2315
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Erik Erlandson
Assignee: Erik Erlandson
  Labels: features

 Last time I loaded in a text file, I found myself wanting to just skip the 
 first element as it was a header. I wrote candidate methods drop, 
 dropRight and dropWhile to satisfy this kind of need:
 val txt = sc.textFile("text_with_header.txt")
 val data = txt.drop(1)
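For reference, a minimal sketch of skipping a header with existing RDD 
operations (the file name is a placeholder), which is the boilerplate a 
built-in drop(1) would replace:
{code}
val txt = sc.textFile("text_with_header.txt")
// Drop the first element of the first partition, i.e. the header line.
val data = txt.mapPartitionsWithIndex { (i, iter) =>
  if (i == 0) iter.drop(1) else iter
}
{code}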



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8014) DataFrame.write.mode(error).save(...) should not scan the output folder

2015-06-01 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-8014:


 Summary: DataFrame.write.mode(error).save(...) should not scan 
the output folder
 Key: SPARK-8014
 URL: https://issues.apache.org/jira/browse/SPARK-8014
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Jianshi Huang
Priority: Minor


I had code that set the wrong output location and failed with strange errors: 
it scanned my ~/.Trash folder...

It turned out that save will scan the output folder first, before mode("error") 
does its existence check. 

Scanning is unnecessary for mode = "error".
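For reference, the pattern in question (the path is a placeholder); with 
SaveMode.ErrorIfExists a plain existence check on the target path should be 
enough:
{code}
// Should fail fast if the path already exists, without listing its contents first.
df.write.mode("error").save("/some/output/path")
{code}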





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8013) Get JDBC server working with Scala 2.11

2015-06-01 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567561#comment-14567561
 ] 

Iulian Dragos commented on SPARK-8013:
--

There's a Scala 2.11.7 milestone due in July; hopefully we can get a solution 
in by then.

 Get JDBC server working with Scala 2.11
 ---

 Key: SPARK-8013
 URL: https://issues.apache.org/jira/browse/SPARK-8013
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Patrick Wendell
Assignee: Iulian Dragos
Priority: Critical

 It's worth some investigation here, but I believe the simplest solution is to 
 see if we can get Scala to shade it's use of JLine to avoid JLine conflicts 
 between Hive and the Spark repl.
 It's also possible that there is a simpler internal solution to the conflict 
 (I haven't looked at it in a long time). So doing some investigation of that 
 would be good. IIRC, there is use of Jline in our own repl code, in addition 
 to in Hive and also in the Scala 2.11 repl. Back when we created the 2.11 
 build I couldn't harmonize all the versions in a nice way.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8004) Spark does not enclose column names when fetchting from jdbc sources

2015-06-01 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567467#comment-14567467
 ] 

Liang-Chi Hsieh commented on SPARK-8004:


I think backticks only work for MySQL?
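A dialect-neutral sketch of the idea, using the quote string reported by the 
JDBC driver instead of hard-coding backticks (the helper name is hypothetical):
{code}
def quotedColumnList(conn: java.sql.Connection, columns: Seq[String]): String = {
  // "`" on MySQL, "\"" on most other databases.
  val q = conn.getMetaData.getIdentifierQuoteString
  columns.map(c => q + c + q).mkString(", ")
}
// quotedColumnList(conn, Seq("key", "value"))  =>  `key`, `value`  (on MySQL)
{code}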

 Spark does not enclose column names when fetchting from jdbc sources
 

 Key: SPARK-8004
 URL: https://issues.apache.org/jira/browse/SPARK-8004
 Project: Spark
  Issue Type: Bug
Reporter: Rene Treffer

 Spark failes to load tables that have a keyword as column names
 Sample error:
 {code}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 157.0 failed 1 times, most recent failure: Lost task 0.0 in stage 157.0 
 (TID 4322, localhost): 
 com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error 
 in your SQL syntax; check the manual that corresponds to your MySQL server 
 version for the right syntax to use near 'key,value FROM [XX]'
 {code}
 A correct query would have been
 {code}
 SELECT `key`.`value` FROM 
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7

2015-06-01 Thread Mark Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567468#comment-14567468
 ] 

Mark Smiley commented on SPARK-5389:


I have tried several settings for JAVA_HOME (C:\jdk1.8.0\bin, C:\jdk1.8.0\bin\, 
C:\jdk1.8.0, C:\jdk1.8.0\, even C:\jdk1.8.0\jre). None fixed the issue. I use 
Java a lot, and other apps (e.g., NetBeans) seem to have no issue with the 
JAVA_HOME setting. Note there are no spaces in the JAVA_HOME path. There is a 
space in the path to Scala, but that's the default installation path for Scala.
Also verified the same issue on Windows 8.1.

 spark-shell.cmd does not run from DOS Windows 7
 ---

 Key: SPARK-5389
 URL: https://issues.apache.org/jira/browse/SPARK-5389
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Shell, Windows
Affects Versions: 1.2.0
 Environment: Windows 7
Reporter: Yana Kadiyska
 Attachments: SparkShell_Win7.JPG, spark_bug.png


 spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. 
 spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2
 Marking as trivial since calling spark-shell2.cmd also works fine
 Attaching a screenshot since the error isn't very useful:
 {code}
 spark-1.2.0-bin-cdh4bin\spark-shell.cmd
 else was unexpected at this time.
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7

2015-06-01 Thread Mark Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567468#comment-14567468
 ] 

Mark Smiley edited comment on SPARK-5389 at 6/1/15 3:54 PM:


I have tried several settings for JAVA_HOME (C:\jdk1.8.0\bin, C:\jdk1.8.0\bin\, 
C:\jdk1.8.0, C:\jdk1.8.0\, even C:\jdk1.8.0\jre). None fixed the issue. I use 
Java a lot, and other apps (e.g., NetBeans) seem to have no issue with the 
JAVA_HOME setting. Note there are no spaces in the JAVA_HOME path. There is a 
space in the path to Scala, but that's the default installation path for Scala.
There is no Java 6 on either of these systems.
Also verified the same issue on Windows 8.1.


was (Author: drfractal):
I have tried several settings for JAVA_HOME (C:\jdk1.8.0\bin, C:\jdk1.8.0\bin\, 
C:\jdk1.8.0, C:\jdk1.8.0\, even C:\jdk1.8.0\jre). None fixed the issue. I use 
Java a lot, and other apps (e.g., NetBeans) seem to have no issue with the 
JAVA_HOME setting. Note there are no spaces in the JAVA_HOME path. There is a 
space in the path to Scala, but that's the default installation path for Scala.
Also verified the same issue on Windows 8.1.

 spark-shell.cmd does not run from DOS Windows 7
 ---

 Key: SPARK-5389
 URL: https://issues.apache.org/jira/browse/SPARK-5389
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Shell, Windows
Affects Versions: 1.2.0
 Environment: Windows 7
Reporter: Yana Kadiyska
 Attachments: SparkShell_Win7.JPG, spark_bug.png


 spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. 
 spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2
 Marking as trivial since calling spark-shell2.cmd also works fine
 Attaching a screenshot since the error isn't very useful:
 {code}
 spark-1.2.0-bin-cdh4bin\spark-shell.cmd
 else was unexpected at this time.
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8013) Get JDBC server working with Scala 2.11

2015-06-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-8013:
---
Target Version/s: 1.5.0

 Get JDBC server working with Scala 2.11
 ---

 Key: SPARK-8013
 URL: https://issues.apache.org/jira/browse/SPARK-8013
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Patrick Wendell
Assignee: Iulian Dragos
Priority: Critical

 It's worth some investigation here, but I believe the simplest solution is to 
 see if we can get Scala to shade it's use of JLine to avoid JLine conflicts 
 between Hive and the Spark repl.
 It's also possible that there is a simpler internal solution to the conflict 
 (I haven't looked at it in a long time). So doing some investigation of that 
 would be good. IIRC, there is use of Jline in our own repl code, in addition 
 to in Hive and also in the Scala 2.11 repl. Back when we created the 2.11 
 build I couldn't harmonize all the versions in a nice way.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8013) Get JDBC server working with Scala 2.11

2015-06-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-8013:
---
Priority: Critical  (was: Major)

 Get JDBC server working with Scala 2.11
 ---

 Key: SPARK-8013
 URL: https://issues.apache.org/jira/browse/SPARK-8013
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Patrick Wendell
Assignee: Iulian Dragos
Priority: Critical

 It's worth some investigation here, but I believe the simplest solution is to 
 see if we can get Scala to shade it's use of JLine to avoid JLine conflicts 
 between Hive and the Spark repl.
 It's also possible that there is a simpler internal solution to the conflict 
 (I haven't looked at it in a long time). So doing some investigation of that 
 would be good. IIRC, there is use of Jline in our own repl code, in addition 
 to in Hive and also in the Scala 2.11 repl. Back when we created the 2.11 
 build I couldn't harmonize all the versions in a nice way.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8013) Get JDBC server working with Scala 2.11

2015-06-01 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-8013:
--

 Summary: Get JDBC server working with Scala 2.11
 Key: SPARK-8013
 URL: https://issues.apache.org/jira/browse/SPARK-8013
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Patrick Wendell
Assignee: Iulian Dragos


It's worth some investigation here, but I believe the simplest solution is to 
see if we can get Scala to shade its use of JLine to avoid JLine conflicts 
between Hive and the Spark repl.

It's also possible that there is a simpler internal solution to the conflict (I 
haven't looked at it in a long time), so doing some investigation of that would 
be good. IIRC, there is use of JLine in our own repl code, in addition to in 
Hive and also in the Scala 2.11 repl. Back when we created the 2.11 build I 
couldn't harmonize all the versions in a nice way.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7857) IDF w/ minDocFreq on SparseVectors results in literal zeros

2015-06-01 Thread Karl Higley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567520#comment-14567520
 ] 

Karl Higley commented on SPARK-7857:


This is addressed by the addition of numNonZeros in SPARK-6756.

 IDF w/ minDocFreq on SparseVectors results in literal zeros
 ---

 Key: SPARK-7857
 URL: https://issues.apache.org/jira/browse/SPARK-7857
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Karl Higley
Priority: Minor

 When the IDF model's minDocFreq parameter is set to a non-zero threshold, the 
 IDF for any feature below that threshold is set to zero.  When the model is 
 used to transform a set of SparseVectors containing that feature, the 
 resulting SparseVectors contain entries whose values are zero.  The zero 
 entries should be omitted in order to simplify downstream processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7987) TransportContext.createServer(int port) is missing in Spark 1.4

2015-06-01 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567556#comment-14567556
 ] 

Marcelo Vanzin commented on SPARK-7987:
---

[~joshrosen] that annotation is nice but it cannot live in {{core/}} if this 
module is to use it.

Actually, it would be really nice to have a new top-level module for these 
annotations and other very generic helper code (such as JavaUtils.java, which 
is used in more than the network module).

 TransportContext.createServer(int port) is missing in Spark 1.4
 ---

 Key: SPARK-7987
 URL: https://issues.apache.org/jira/browse/SPARK-7987
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.4.0
Reporter: Patrick Wendell
Priority: Blocker

 From what I can tell the SPARK-6229 patch removed this API:
 https://github.com/apache/spark/commit/38d4e9e446b425ca6a8fe8d8080f387b08683842#diff-d9d4b8d8e82b7d96d5e779353e4b2f4eL85
 I think adding it back should be easy enough, but I cannot figure out why 
 this didn't trigger MIMA errors. I am wondering if MIMA was not enabled 
 properly for some of the new modules:
 /cc [~vanzin] [~rxin] and [~adav]
 I put this as a blocker-level issue because I'm wondering if we just aren't 
 enforcing checks for some reason in some of our APIs. So I think we need to 
 block the 1.4 release on at least making sure no other serious APIs were 
 broken. If it turns out only this API was affected, or I'm just missing 
 something, we can downgrade it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests

2015-06-01 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567345#comment-14567345
 ] 

Steve Loughran commented on SPARK-4352:
---

As usual, when YARN-1042 is done, life gets easier: the AM asks YARN for the 
anti-affine placement.

If you look at how other YARN clients have implemented anti-affinity 
(TWILL-82), the blacklist is used to block off all nodes in use, with a 
request-at-a-time ramp-up to avoid more than one outstanding request being 
granted on the same node.

As well as anti-affinity, life would be even better with dynamic container 
resize: if a single executor could expand/relax CPU capacity on demand, you'd 
only need one per node and then handle multiple tasks by running more work 
there. (This does nothing for RAM consumption though)

Now, for some other fun:

# You may want to consider which surplus containers to release, both 
outstanding requests and those actually granted. In particular, if you want to 
cancel one outstanding request, which do you choose? Any of them? The newest? 
The oldest? The node with the worst reliability statistics? Killing the newest 
works if you assume that the older containers have generated more host-local 
data that you wish to reuse.

# History may also be a factor in placement. If you are starting a session 
which continues/extends previous work, the previous location of the executors 
may be the first locality clue. Ask for containers on those nodes and there's a 
high likelihood that the output data from the previous session will be stored 
locally on one of the nodes a container is assigned to.

# Testing. There aren't any tests, are there? It's possible to simulate some of 
the basic operations; you just need to isolate the code which examines the 
application state and generates container request/release events from the 
actual interaction with the RM.

I've done this before with the request to allocate/cancel [generating a list of 
operations to be submitted or 
simulated|https://github.com/apache/incubator-slider/blob/develop/slider-core/src/main/java/org/apache/slider/server/appmaster/state/AppState.java#L1908].
 Combined with a [mock YARN 
engine|https://github.com/apache/incubator-slider/tree/develop/slider-core/src/test/groovy/org/apache/slider/server/appmaster/model/mock],
 this let us do things like [test historical placement 
logic|https://github.com/apache/incubator-slider/tree/develop/slider-core/src/test/groovy/org/apache/slider/server/appmaster/model/history]
 as well as test decisions about whether to re-request containers on nodes 
where containers have just recently failed. While that mock stuff isn't that 
realistic, it can be used to test basic placement and failure handling logic.

More succinctly: you can write tests for this stuff by splitting request 
generation from the API calls and testing the request/release logic standalone.
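
As a hedged illustration of that split (the names here are hypothetical, not 
Spark's or Slider's actual classes), the decision logic can emit plain 
request/release descriptions that a unit test asserts on without any 
ResourceManager:

{code}
// Sketch: pure planning logic, decoupled from the AMRMClient calls that would act on it.
sealed trait RmOp
case class RequestExecutor(preferredNodes: Seq[String]) extends RmOp
case class ReleaseExecutor(executorId: String) extends RmOp

def planOps(
    pendingTasksPerNode: Map[String, Int],
    idleExecutorsNewestLast: Seq[String],
    currentExecutors: Int,
    targetExecutors: Int): Seq[RmOp] = {
  if (currentExecutors < targetExecutors) {
    // Rank nodes by pending work and spread new requests over the top few.
    val ranked = pendingTasksPerNode.toSeq.sortBy(-_._2).map(_._1)
    Seq.fill(targetExecutors - currentExecutors)(RequestExecutor(ranked.take(3)))
  } else {
    // Release the newest idle executors first (the heuristic discussed above).
    idleExecutorsNewestLast.takeRight(currentExecutors - targetExecutors)
      .map(id => ReleaseExecutor(id))
  }
}

// A plain unit test can assert on the emitted ops; only a thin adapter needs a real RM.
{code}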

 Incorporate locality preferences in dynamic allocation requests
 ---

 Key: SPARK-4352
 URL: https://issues.apache.org/jira/browse/SPARK-4352
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Saisai Shao
Priority: Critical
 Attachments: Supportpreferrednodelocationindynamicallocation.pdf


 Currently, achieving data locality in Spark is difficult unless an 
 application takes resources on every node in the cluster.  
 preferredNodeLocalityData provides a sort of hacky workaround that has been 
 broken since 1.0.
 With dynamic executor allocation, Spark requests executors in response to 
 demand from the application.  When this occurs, it would be useful to look at 
 the pending tasks and communicate their location preferences to the cluster 
 resource manager. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-06-01 Thread Deepak Kumar V (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567389#comment-14567389
 ] 

Deepak Kumar V commented on SPARK-4105:
---

I see this issue when reading a sequence file stored in SequenceFile format 
(the header shows {{org.apache.hadoop.io.Text}} keys and values compressed with 
{{org.apache.hadoop.io.compress.GzipCodec}}).

All I do is
{code}
sc.sequenceFile(dwTable, classOf[Text], classOf[Text])
  .partitionBy(new org.apache.spark.HashPartitioner(2053))
{code}
with the SparkConf configured as
{code}
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", arguments.get("buffersize").get)
  .set("spark.kryoserializer.buffer.max.mb", arguments.get("maxbuffersize").get)
  .set("spark.driver.maxResultSize", arguments.get("maxResultSize").get)
  .set("spark.yarn.maxAppAttempts", "0")
  //.set("spark.akka.askTimeout", arguments.get("askTimeout").get)
  //.set("spark.akka.timeout", arguments.get("akkaTimeout").get)
  //.set("spark.worker.timeout", arguments.get("workerTimeout").get)
  .registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum]))
{code}
and the values are 
buffersize=128 maxbuffersize=1068 maxResultSize=200G 

 FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based 
 shuffle
 -

 Key: SPARK-4105
 URL: https://issues.apache.org/jira/browse/SPARK-4105
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0, 1.2.1, 1.3.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Blocker
 Attachments: JavaObjectToSerialize.java, 
 SparkFailedToUncompressGenerator.scala


 We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during 
 shuffle read.  Here's a sample stacktrace from an executor:
 {code}
 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 
 33053)
 java.io.IOException: FAILED_TO_UNCOMPRESS(5)
   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
   at 
 org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
   at 
 org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
   at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58)
   at 
 org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
   at 
 org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
   at 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
   at 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
   at 
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
   at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
   at 

[jira] [Created] (SPARK-8015) flume-sink should not depend on Guava.

2015-06-01 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-8015:
-

 Summary: flume-sink should not depend on Guava.
 Key: SPARK-8015
 URL: https://issues.apache.org/jira/browse/SPARK-8015
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Marcelo Vanzin
Priority: Minor


The flume-sink module, due to the shared shading code in our build, ends up 
depending on the {{org.spark-project}} Guava classes. That means users who 
deploy the sink in Flume will also need to provide those classes somehow, 
generally by also adding the Spark assembly, which means adding a whole bunch 
of other libraries to Flume, which may or may not cause other unforeseen 
problems.

It would be better for the flume-sink module not to have that dependency at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8014) DataFrame.write.mode(error).save(...) should not scan the output folder

2015-06-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-8014:
--
Description: 
When saving a DataFrame with {{ErrorIfExists}} as save mode, we shouldn't do 
metadata discovery if the destination folder exists.

To reproduce this issue, we may make an empty directory {{/tmp/foo}} and leave 
an empty file {{bar}} there, then execute the following code in Spark shell:
{code}
import sqlContext._
import sqlContext.implicits._

Seq(1 -> "a").toDF("i", "s").write.format("parquet").mode("error").save("file:///tmp/foo")
{code}
From the exception stack trace we can see that metadata discovery code path is 
executed:
{noformat}
java.io.IOException: Could not read footer: java.lang.RuntimeException: 
file:/tmp/foo/bar is not a Parquet file (too small)
at 
parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:238)
at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369)
at 
org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:154)
at 
org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:152)
at 
org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:193)
at 
org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:502)
at 
org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:501)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:331)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135)
...
Caused by: java.lang.RuntimeException: file:/tmp/foo/bar is not a Parquet file 
(too small)
at 
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:408)
at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:228)
at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:224)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}

  was:
When saving a DataFrame with {{ErrorIfExists}} as save mode, we shouldn't do 
metadata discovery if the destination folder exists.

To reproduce this issue, we may make an empty directory {{/tmp/foo}} and leave 
an empty file {{bar}} there, then 


 DataFrame.write.mode(error).save(...) should not scan the output folder
 -

 Key: SPARK-8014
 URL: https://issues.apache.org/jira/browse/SPARK-8014
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Jianshi Huang
Priority: Minor

 When saving a DataFrame with {{ErrorIfExists}} as save mode, we shouldn't do 
 metadata discovery if the destination folder exists.
 To reproduce this issue, we may make an empty directory {{/tmp/foo}} and 
 leave an empty file {{bar}} there, then execute the following code in Spark 
 shell:
 {code}
 import sqlContext._
 import sqlContext.implicits._
 Seq(1 -> "a").toDF("i", "s").write.format("parquet").mode("error").save("file:///tmp/foo")
 {code}
 From the exception stack trace we can see that metadata discovery code path 
 is executed:
 {noformat}
 java.io.IOException: Could not read footer: java.lang.RuntimeException: 
 file:/tmp/foo/bar is not a Parquet file (too small)
 at 
 parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:238)
 at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369)
 at 
 org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:154)
 at 
 org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:152)
 at 
 org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:193)
 at 
 org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:502)
 at 
 org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:501)
 at 
 org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:331)
 at 
 org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
 at 
 org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135)
 ...
 Caused by: java.lang.RuntimeException: file:/tmp/foo/bar is 

[jira] [Updated] (SPARK-8014) DataFrame.write.mode(error).save(...) should not scan the output folder

2015-06-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-8014:
--
Priority: Major  (was: Minor)

 DataFrame.write.mode(error).save(...) should not scan the output folder
 -

 Key: SPARK-8014
 URL: https://issues.apache.org/jira/browse/SPARK-8014
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Jianshi Huang

 When saving a DataFrame with {{ErrorIfExists}} as save mode, we shouldn't do 
 metadata discovery if the destination folder exists.
 To reproduce this issue, we may make an empty directory {{/tmp/foo}} and 
 leave an empty file {{bar}} there, then execute the following code in Spark 
 shell:
 {code}
 import sqlContext._
 import sqlContext.implicits._
 Seq(1 -> "a").toDF("i", "s").write.format("parquet").mode("error").save("file:///tmp/foo")
 {code}
 From the exception stack trace we can see that metadata discovery code path 
 is executed:
 {noformat}
 java.io.IOException: Could not read footer: java.lang.RuntimeException: 
 file:/tmp/foo/bar is not a Parquet file (too small)
 at 
 parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:238)
 at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369)
 at 
 org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:154)
 at 
 org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:152)
 at 
 org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:193)
 at 
 org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:502)
 at 
 org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:501)
 at 
 org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:331)
 at 
 org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
 at 
 org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135)
 ...
 Caused by: java.lang.RuntimeException: file:/tmp/foo/bar is not a Parquet 
 file (too small)
 at 
 parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:408)
 at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:228)
 at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:224)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7

2015-06-01 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567601#comment-14567601
 ] 

Yana Kadiyska commented on SPARK-5389:
--

FWIW I just tried the 1.4-rc3 build 
(http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-bin/) cdh4 
binary and it runs without issues. From the exact same command prompt I can run 
the 1.4 script but not the 1.2 script. So if we can't figure out a consistent 
repro, maybe other folks can confirm if the new cmd files work... 

 spark-shell.cmd does not run from DOS Windows 7
 ---

 Key: SPARK-5389
 URL: https://issues.apache.org/jira/browse/SPARK-5389
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Shell, Windows
Affects Versions: 1.2.0
 Environment: Windows 7
Reporter: Yana Kadiyska
 Attachments: SparkShell_Win7.JPG, spark_bug.png


 spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. 
 spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2
 Marking as trivial since calling spark-shell2.cmd also works fine
 Attaching a screenshot since the error isn't very useful:
 {code}
 spark-1.2.0-bin-cdh4>bin\spark-shell.cmd
 else was unexpected at this time.
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8014) DataFrame.write.mode(error).save(...) should not scan the output folder

2015-06-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-8014:
-

Assignee: Cheng Lian

 DataFrame.write.mode(error).save(...) should not scan the output folder
 -

 Key: SPARK-8014
 URL: https://issues.apache.org/jira/browse/SPARK-8014
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Jianshi Huang
Assignee: Cheng Lian

 When saving a DataFrame with {{ErrorIfExists}} as save mode, we shouldn't do 
 metadata discovery if the destination folder exists.
 To reproduce this issue, we may make an empty directory {{/tmp/foo}} and 
 leave an empty file {{bar}} there, then execute the following code in Spark 
 shell:
 {code}
 import sqlContext._
 import sqlContext.implicits._
 Seq(1 -> "a").toDF("i", "s").write.format("parquet").mode("error").save("file:///tmp/foo")
 {code}
 From the exception stack trace we can see that metadata discovery code path 
 is executed:
 {noformat}
 java.io.IOException: Could not read footer: java.lang.RuntimeException: 
 file:/tmp/foo/bar is not a Parquet file (too small)
 at 
 parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:238)
 at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369)
 at 
 org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:154)
 at 
 org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:152)
 at 
 org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:193)
 at 
 org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:502)
 at 
 org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:501)
 at 
 org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:331)
 at 
 org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
 at 
 org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135)
 ...
 Caused by: java.lang.RuntimeException: file:/tmp/foo/bar is not a Parquet 
 file (too small)
 at 
 parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:408)
 at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:228)
 at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:224)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 {noformat}
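
A minimal sketch of the fast-fail behaviour being asked for here, assuming a 
check on the output path before any footer reading; the helper name and call 
site are illustrative, not the actual patch:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sketch: with ErrorIfExists semantics, test the destination up front and
// throw before any Parquet footer / schema discovery is attempted.
def failFastIfExists(outputPath: String, saveMode: String, hadoopConf: Configuration): Unit = {
  if (saveMode == "error") {
    val path = new Path(outputPath)
    val fs = path.getFileSystem(hadoopConf)
    if (fs.exists(path)) {
      sys.error(s"path $path already exists.")
    }
  }
}
{code}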



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8016) YARN cluster / client modes have different app names for python

2015-06-01 Thread Andrew Or (JIRA)
Andrew Or created SPARK-8016:


 Summary: YARN cluster / client modes have different app names for 
python
 Key: SPARK-8016
 URL: https://issues.apache.org/jira/browse/SPARK-8016
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.4.0
Reporter: Andrew Or


See screenshot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7909) spark-ec2 and associated tools not py3 ready

2015-06-01 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567735#comment-14567735
 ] 

Shivaram Venkataraman commented on SPARK-7909:
--

Yeah feel free to open a PR for the `print` fixes.

 spark-ec2 and associated tools not py3 ready
 

 Key: SPARK-7909
 URL: https://issues.apache.org/jira/browse/SPARK-7909
 Project: Spark
  Issue Type: Improvement
  Components: EC2
 Environment: ec2 python3
Reporter: Matthew Goodman

 At present there is no permutation of tools that supports Python 3 on both the 
 launching computer and the running cluster.  There are a couple of problems 
 involved:
  - There is no prebuilt spark binary with python3 support.
  - spark-ec2/spark/init.sh contains inline py3 unfriendly print statements
  - Config files for cluster processes don't seem to make it to all nodes in a 
 working format.
 I have fixes for some of this, but the config and running context debugging 
 remains elusive to me.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests

2015-06-01 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567755#comment-14567755
 ] 

Sandy Ryza commented on SPARK-4352:
---

[~jerryshao] I wouldn't say that the goal is necessarily to get as close as 
possible to the ratio of requests (3 : 3 : 2 : 1 in the example).  My idea was 
to get as close as possible to sum(cores from all executor requests with that 
node on their preferred list) = the number of tasks that prefer that node.

Why?  Let's look at the situation where we're requesting 18 executors.  Let's 
say we request 6 executors with a preference for a, b, c, d like you 
suggested. YARN would be perfectly happy giving us 6 executors on node d.  But 
we only have 10 tasks (with executors that have 2 cores, this means 5 
executors) that need to run on node d.  So we'd really prefer that the 6th 
executor be scheduled on a, b, or c, because placing it on d confers no 
additional advantage.

For the situation where we're requesting 7 executors I have less of an argument 
for why my 5 : 2 is better than your 2 : 2 : 3.  Thinking about it more now, it 
seems like your approach could be closer to optimal because getting executors 
on a or b means more of our tasks get to run on local data.  So I would 
certainly be open to something that tries to preserve the ratio when the number 
of executors we're allowed to request is under the maximum number of tasks 
targeted for any particular node.
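
A worked sketch of the per-node cap described above; the helper is 
hypothetical, purely to make the arithmetic concrete:

{code}
// Cap the executor requests that list a node at
// ceil(tasks preferring that node / cores per executor).
def maxUsefulExecutors(tasksPreferringNode: Int, coresPerExecutor: Int): Int =
  (tasksPreferringNode + coresPerExecutor - 1) / coresPerExecutor

// 10 pending tasks prefer node "d" and executors have 2 cores each, so only 5
// executors listing "d" can ever run d-local tasks; a 6th on "d" adds nothing.
assert(maxUsefulExecutors(10, 2) == 5)
{code}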

 Incorporate locality preferences in dynamic allocation requests
 ---

 Key: SPARK-4352
 URL: https://issues.apache.org/jira/browse/SPARK-4352
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Saisai Shao
Priority: Critical
 Attachments: Supportpreferrednodelocationindynamicallocation.pdf


 Currently, achieving data locality in Spark is difficult unless an 
 application takes resources on every node in the cluster.  
 preferredNodeLocalityData provides a sort of hacky workaround that has been 
 broken since 1.0.
 With dynamic executor allocation, Spark requests executors in response to 
 demand from the application.  When this occurs, it would be useful to look at 
 the pending tasks and communicate their location preferences to the cluster 
 resource manager. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8008) sqlContext.jdbc can kill your database due to high concurrency

2015-06-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567757#comment-14567757
 ] 

Sean Owen commented on SPARK-8008:
--

I suppose I meant you can block waiting on a new connection after the max is 
hit instead of opening far too many.
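
As a user-side workaround sketch (not the fix discussed here), the concurrency 
can also be bounded by asking for fewer partitions up front, assuming the 
1.3-era {{sqlContext.jdbc}} signature that takes an explicit partition count; 
the URL, table, and bounds below are placeholders:

{code}
// At most `numPartitions` tasks read concurrently, so the database sees at most
// min(numPartitions, total executor cores) simultaneous connections.
val numPartitions = 8
val df = sqlContext.jdbc(
  "jdbc:mysql://db-host:3306/mydb?user=spark&password=secret", // placeholder URL
  "my_table",   // placeholder table name
  "id",         // integral column used to split the table into ranges
  0L,           // lowerBound of id
  1000000L,     // upperBound of id
  numPartitions)
{code}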

 sqlContext.jdbc can kill your database due to high concurrency
 --

 Key: SPARK-8008
 URL: https://issues.apache.org/jira/browse/SPARK-8008
 Project: Spark
  Issue Type: Bug
Reporter: Rene Treffer

 Spark tries to load as many partitions as possible in parallel, which can in 
 turn overload the database although it would be possible to load all 
 partitions given a lower concurrency.
 It would be nice to either limit the maximum concurrency or to at least warn 
 about this behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7909) spark-ec2 and associated tools not py3 ready

2015-06-01 Thread Matthew Goodman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567685#comment-14567685
 ] 

Matthew Goodman commented on SPARK-7909:


Awesome, thanks for all the help on this.  There is one (possibly unrelated) 
issue that remains: httpd seems to fail to start up, giving the following 
output:

{code:title=HTTPD Failure Traceback|borderStyle=solid}
Starting httpd: httpd: Syntax error on line 154 of /etc/httpd/conf/httpd.conf: 
Cannot load /etc/httpd/modules/mod_authz_core.so into server: 
/etc/httpd/modules/mod_authz_core.so: cannot open shared object file: No such 
file or directory
{code}

Should I send in a PR [for this 
change|https://github.com/3Scan/spark-ec2/commit/3416dd07c492b0cddcc98c4fa83f9e4284ed8fc9]?
  

 spark-ec2 and associated tools not py3 ready
 

 Key: SPARK-7909
 URL: https://issues.apache.org/jira/browse/SPARK-7909
 Project: Spark
  Issue Type: Improvement
  Components: EC2
 Environment: ec2 python3
Reporter: Matthew Goodman

  At present there is no permutation of tools that supports Python 3 on both the 
  launching computer and the running cluster.  There are a couple of problems 
  involved:
  - There is no prebuilt spark binary with python3 support.
  - spark-ec2/spark/init.sh contains inline py3 unfriendly print statements
  - Config files for cluster processes don't seem to make it to all nodes in a 
 working format.
 I have fixes for some of this, but the config and running context debugging 
 remains elusive to me.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8016) YARN cluster / client modes have different app names for python

2015-06-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8016:
-
Component/s: PySpark

 YARN cluster / client modes have different app names for python
 ---

 Key: SPARK-8016
 URL: https://issues.apache.org/jira/browse/SPARK-8016
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.4.0
Reporter: Andrew Or
 Attachments: python.png


 See screenshot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4048) Enhance and extend hadoop-provided profile

2015-06-01 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14567752#comment-14567752
 ] 

Marcelo Vanzin commented on SPARK-4048:
---

That is not a regression. The whole point of hadoop-provided is that *you* 
have to provide the needed jars. So if a jar is missing, you are failing to 
provide it.

 Enhance and extend hadoop-provided profile
 --

 Key: SPARK-4048
 URL: https://issues.apache.org/jira/browse/SPARK-4048
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.2.0
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
 Fix For: 1.3.0


 The hadoop-provided profile is used to not package Hadoop dependencies inside 
 the Spark assembly. It works, sort of, but it could use some enhancements. A 
 quick list:
 - It doesn't include all things that could be removed from the assembly
 - It doesn't work well when you're publishing artifacts based on it 
 (SPARK-3812 fixes this)
 - There are other dependencies that could use similar treatment: Hive, HBase 
 (for the examples), Flume, Parquet, maybe others I'm missing at the moment.
 - Unit tests, more specifically, those that use local-cluster mode, do not 
 work when the assembly is built with this profile enabled.
 - The scripts to launch Spark jobs do not add needed provided jars to the 
 classpath when this profile is enabled, leaving it for people to figure that 
 out for themselves.
 - The examples assembly duplicates a lot of things in the main assembly.
 Part of this task is selfish since we build internally with this profile and 
 we'd like to make it easier for us to merge changes without having to keep 
 too many patches on top of upstream. But those feel like good improvements to 
 me, regardless.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8017) YARN cluster python --py-files does not work

2015-06-01 Thread Andrew Or (JIRA)
Andrew Or created SPARK-8017:


 Summary: YARN cluster python --py-files does not work
 Key: SPARK-8017
 URL: https://issues.apache.org/jira/browse/SPARK-8017
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.4.0
Reporter: Andrew Or


When I ran the following, it worked in client mode but not in cluster mode:
{code}
bin/spark-submit --master yarn --deploy-mode X --py-files secondary.py app.py
{code}
where app.py depends on secondary.py.

Python YARN cluster mode was added only recently, so this is not a blocker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8015) flume-sink should not depend on Guava.

2015-06-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8015:
---

Assignee: (was: Apache Spark)

 flume-sink should not depend on Guava.
 --

 Key: SPARK-8015
 URL: https://issues.apache.org/jira/browse/SPARK-8015
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Marcelo Vanzin
Priority: Minor

 The flume-sink module, due to the shared shading code in our build, ends up 
 depending on the {{org.spark-project}} Guava classes. That means users who 
 deploy the sink in Flume will also need to provide those classes somehow, 
 generally by also adding the Spark assembly, which means adding a whole bunch 
 of other libraries to Flume, which may or may not cause other unforeseen 
 problems.
 It would be better for the flume-sink module not to have that dependency at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


