[jira] [Commented] (SPARK-12409) JDBC AND operator push down

2015-12-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063708#comment-15063708
 ] 

Apache Spark commented on SPARK-12409:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/10369

> JDBC AND operator push down 
> 
>
> Key: SPARK-12409
> URL: https://issues.apache.org/jira/browse/SPARK-12409
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Huaxin Gao
>Priority: Minor
>
> For a simple AND such as
> select * from test where THEID = 1 AND NAME = 'fred',
> the filters pushed down to the JDBC layer are EqualTo(THEID,1) and
> EqualTo(NAME,fred). These are handled correctly by the current code.
> For a query such as
> SELECT * FROM foobar WHERE THEID = 1 OR NAME = 'mary' AND THEID = 2,
> the filter is Or(EqualTo(THEID,1),And(EqualTo(NAME,mary),EqualTo(THEID,2))),
> so we need to add an And filter in the JDBC layer.
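A minimal sketch of the translation this implies (not the actual patch; the quote helper is an assumption, not Spark API): recursively compile source filters, including And and Or, into a WHERE-clause fragment, and leave anything unsupported for Spark to evaluate after the fetch.

{code}
import org.apache.spark.sql.sources._

// Assumed helper: quote string literals, pass other values through as-is.
def quote(value: Any): String = value match {
  case s: String => s"'$s'"
  case other     => other.toString
}

// Compile a source Filter into a WHERE fragment; None means "cannot push down",
// so Spark evaluates that filter itself after the rows come back over JDBC.
def compileFilter(filter: Filter): Option[String] = filter match {
  case EqualTo(attr, value) => Some(s"$attr = ${quote(value)}")
  case And(left, right) =>
    for (l <- compileFilter(left); r <- compileFilter(right)) yield s"($l) AND ($r)"
  case Or(left, right) =>
    for (l <- compileFilter(left); r <- compileFilter(right)) yield s"($l) OR ($r)"
  case _ => None
}
{code}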






[jira] [Commented] (SPARK-11148) Unable to create views

2015-12-18 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063721#comment-15063721
 ] 

Cheng Lian commented on SPARK-11148:


Did you mean the Windows ODBC driver provided by Simba? AFAIK Databricks only 
provides download links to Simba's Spark ODBC drivers. If that's the case, you 
might want to check with Simba since these drivers are not open sourced.

> Unable to create views
> --
>
> Key: SPARK-11148
> URL: https://issues.apache.org/jira/browse/SPARK-11148
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: Ubuntu 14.04
> Spark-1.5.1-bin-hadoop2.6
> (I don't have Hadoop or Hive installed)
> Start spark-all.sh and thriftserver with mysql jar driver
>Reporter: Lunen
>Priority: Critical
>
> I am unable to create views within Spark SQL.
> Creating tables without specifying the column names works, e.g.:
> CREATE TABLE trade2 
> USING org.apache.spark.sql.jdbc
> OPTIONS ( 
> url "jdbc:mysql://192.168.30.191:3318/?user=root", 
> dbtable "database.trade", 
> driver "com.mysql.jdbc.Driver" 
> );
> Creating tables with data types gives an error:
> CREATE TABLE trade2( 
> COL1 timestamp, 
> COL2 STRING, 
> COL3 STRING) 
> USING org.apache.spark.sql.jdbc 
> OPTIONS (
>   url "jdbc:mysql://192.168.30.191:3318/?user=root",   
>   dbtable "database.trade",   
>   driver "com.mysql.jdbc.Driver" 
> );
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not allow 
> user-specified schemas.; SQLState: null ErrorCode: 0
> Trying to create a VIEW from the table that was created (the SELECT statement
> below returns data):
> CREATE VIEW viewtrade as Select Col1 from trade2;
> Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
> SemanticException [Error 10004]: Line 1:30 Invalid table alias or column 
> reference 'Col1': (possible column names are: col)
> SQLState:  null
> ErrorCode: 0






[jira] [Created] (SPARK-12421) Fix copy() method of GenericRow

2015-12-18 Thread Burkard Doepfner (JIRA)
Burkard Doepfner created SPARK-12421:


 Summary: Fix copy() method of GenericRow 
 Key: SPARK-12421
 URL: https://issues.apache.org/jira/browse/SPARK-12421
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Burkard Doepfner
Priority: Minor


The copy() method of the GenericRow class does not actually create a copy; the
method just returns the row itself.

Simple reproduction code of the issue:
import org.apache.spark.sql.Row
val row = Row.fromSeq(Array(1, 2, 3, 4, 5))
val arr = row.toSeq.toArray
arr(0) = 6
row // first value changed to 6
val rowCopied = row.copy()
val arrCopied = rowCopied.toSeq.toArray
arrCopied(0) = 7
row // first value still changed (to 7)
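A minimal, self-contained sketch of the intended behavior (not the actual Spark patch; SimpleRow is a made-up stand-in for GenericRow, assumed to wrap an Array[Any]): copy() clones the backing array, so mutating the original values no longer shows up in the copy.

{code}
class SimpleRow(private val values: Array[Any]) {
  // Clone the backing array instead of returning `this`.
  def copy(): SimpleRow = new SimpleRow(values.clone())
  def get(i: Int): Any = values(i)
  override def toString: String = values.mkString("[", ",", "]")
}

val backing = Array[Any](1, 2, 3, 4, 5)
val row = new SimpleRow(backing)
val copied = row.copy()
backing(0) = 6
println(row)    // [6,2,3,4,5] - still shares the mutated array
println(copied) // [1,2,3,4,5] - the copy is insulated from the mutation
{code}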






[jira] [Updated] (SPARK-12420) Have a built-in CSV data source implementation

2015-12-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12420:

Target Version/s: 2.0.0

> Have a built-in CSV data source implementation
> --
>
> Key: SPARK-12420
> URL: https://issues.apache.org/jira/browse/SPARK-12420
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> CSV is the most common data format in the "small data" world. It is often the 
> first format people want to try when they see Spark on a single node. Having 
> to rely on a 3rd party component for this is a very bad user experience for 
> new users.
> We should consider inlining https://github.com/databricks/spark-csv
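For context, a sketch of the difference from a user's point of view. The first form uses the external databricks/spark-csv package (format name and options as documented by that project); the second shows what a built-in reader could look like, with a hypothetical csv() shorthand.

{code}
// Today: requires the external spark-csv package on the classpath.
val withPackage = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("people.csv")

// With a built-in source: no extra dependency, e.g. a csv() shorthand.
val builtIn = sqlContext.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("people.csv")
{code}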






[jira] [Created] (SPARK-12420) Have a built-in CSV data source implementation

2015-12-18 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12420:
---

 Summary: Have a built-in CSV data source implementation
 Key: SPARK-12420
 URL: https://issues.apache.org/jira/browse/SPARK-12420
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Reynold Xin


CSV is the most common data format in the "small data" world. It is often the 
first format people want to try when they see Spark on a single node. Having to 
rely on a 3rd party component for this is a very bad user experience for new 
users.

We should consider inlining https://github.com/databricks/spark-csv








[jira] [Assigned] (SPARK-12421) Fix copy() method of GenericRow

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12421:


Assignee: (was: Apache Spark)

> Fix copy() method of GenericRow 
> 
>
> Key: SPARK-12421
> URL: https://issues.apache.org/jira/browse/SPARK-12421
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Burkard Doepfner
>Priority: Minor
>
> The copy() method of the GenericRow class does actually not copy itself. The 
> method just returns itself.
> Simple reproduction code of the issue:
>  import org.apache.spark.sql.Row;
> val row = Row.fromSeq(Array(1,2,3,4,5))
> val arr = row.toSeq.toArray
> arr(0) = 6
> row // first value changed to 6
> val rowCopied = row.copy()
> val arrCopied = rowCopied.toSeq.toArray
> arrCopied(0) = 7
> row // first value still changed (to 7)






[jira] [Assigned] (SPARK-12421) Fix copy() method of GenericRow

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12421:


Assignee: Apache Spark

> Fix copy() method of GenericRow 
> 
>
> Key: SPARK-12421
> URL: https://issues.apache.org/jira/browse/SPARK-12421
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Burkard Doepfner
>Assignee: Apache Spark
>Priority: Minor
>
> The copy() method of the GenericRow class does actually not copy itself. The 
> method just returns itself.
> Simple reproduction code of the issue:
>  import org.apache.spark.sql.Row;
> val row = Row.fromSeq(Array(1,2,3,4,5))
> val arr = row.toSeq.toArray
> arr(0) = 6
> row // first value changed to 6
> val rowCopied = row.copy()
> val arrCopied = rowCopied.toSeq.toArray
> arrCopied(0) = 7
> row // first value still changed (to 7)






[jira] [Updated] (SPARK-12403) "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore

2015-12-18 Thread Lunen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lunen updated SPARK-12403:
--
Affects Version/s: 1.5.0
Fix Version/s: 1.4.1

> "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore
> 
>
> Key: SPARK-12403
> URL: https://issues.apache.org/jira/browse/SPARK-12403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
> Environment: ODBC connector query 
>Reporter: Lunen
> Fix For: 1.3.1, 1.4.1
>
>
> We are unable to query the SPARK tables using the ODBC driver from Simba 
> Spark(Databricks - "Simba Spark ODBC Driver 1.0")  We are able to do a show 
> databases and show tables, but not any queries. eg.
> Working:
> Select * from openquery(SPARK,'SHOW DATABASES')
> Select * from openquery(SPARK,'SHOW TABLES')
> Not working:
> Select * from openquery(SPARK,'Select * from lunentest')
> The error I get is:
> OLE DB provider "MSDASQL" for linked server "SPARK" returned message 
> "[Simba][SQLEngine] (31740) Table or view not found: spark..lunentest".
> Msg 7321, Level 16, State 2, Line 2
> An error occurred while preparing the query "Select * from lunentest" for 
> execution against OLE DB provider "MSDASQL" for linked server "SPARK"






[jira] [Commented] (SPARK-12421) Fix copy() method of GenericRow

2015-12-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063730#comment-15063730
 ] 

Apache Spark commented on SPARK-12421:
--

User 'Apo1' has created a pull request for this issue:
https://github.com/apache/spark/pull/10374

> Fix copy() method of GenericRow 
> 
>
> Key: SPARK-12421
> URL: https://issues.apache.org/jira/browse/SPARK-12421
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Burkard Doepfner
>Priority: Minor
>
> The copy() method of the GenericRow class does actually not copy itself. The 
> method just returns itself.
> Simple reproduction code of the issue:
>  import org.apache.spark.sql.Row;
> val row = Row.fromSeq(Array(1,2,3,4,5))
> val arr = row.toSeq.toArray
> arr(0) = 6
> row // first value changed to 6
> val rowCopied = row.copy()
> val arrCopied = rowCopied.toSeq.toArray
> arrCopied(0) = 7
> row // first value still changed (to 7)






[jira] [Commented] (SPARK-12420) Have a built-in CSV data source implementation

2015-12-18 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063736#comment-15063736
 ] 

Jeff Zhang commented on SPARK-12420:


+1, this is very common use data format. Not sure why it is not built in at the 
beginning. If there's no license issue, then definitely should make it built-in 

> Have a built-in CSV data source implementation
> --
>
> Key: SPARK-12420
> URL: https://issues.apache.org/jira/browse/SPARK-12420
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> CSV is the most common data format in the "small data" world. It is often the 
> first format people want to try when they see Spark on a single node. Having 
> to rely on a 3rd party component for this is a very bad user experience for 
> new users.
> We should consider inlining https://github.com/databricks/spark-csv






[jira] [Comment Edited] (SPARK-12420) Have a built-in CSV data source implementation

2015-12-18 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063736#comment-15063736
 ] 

Jeff Zhang edited comment on SPARK-12420 at 12/18/15 9:15 AM:
--

+1, this is very common data format. Not sure why it is not built in at the 
beginning. If there's no license issue, then definitely should make it built-in 


was (Author: zjffdu):
+1, this is very common use data format. Not sure why it is not built in at the 
beginning. If there's no license issue, then definitely should make it built-in 

> Have a built-in CSV data source implementation
> --
>
> Key: SPARK-12420
> URL: https://issues.apache.org/jira/browse/SPARK-12420
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> CSV is the most common data format in the "small data" world. It is often the 
> first format people want to try when they see Spark on a single node. Having 
> to rely on a 3rd party component for this is a very bad user experience for 
> new users.
> We should consider inlining https://github.com/databricks/spark-csv






[jira] [Commented] (SPARK-12417) Orc bloom filter options are not propagated during file write in spark

2015-12-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063778#comment-15063778
 ] 

Apache Spark commented on SPARK-12417:
--

User 'rajeshbalamohan' has created a pull request for this issue:
https://github.com/apache/spark/pull/10375

> Orc bloom filter options are not propagated during file write in spark
> --
>
> Key: SPARK-12417
> URL: https://issues.apache.org/jira/browse/SPARK-12417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
> Attachments: SPARK-12417.1.patch
>
>
> ORC bloom filters are supported by the version of Hive used in Spark 1.5.2.
> However, when trying to create an ORC file with the bloom filter option, Spark
> does not make use of it.
> E.g., the following ORC write does not create the bloom filter even though the
> options are specified.
> {noformat}
> Map<String, String> orcOption = new HashMap<String, String>();
> orcOption.put("orc.bloom.filter.columns", "*");
> hiveContext.sql("select * from accounts where 
> effective_date='2015-12-30'").write().
> format("orc").options(orcOption).save("/tmp/accounts");
> {noformat}
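For reference, a Scala equivalent of the snippet above (a sketch only; it exhibits the same problem, since the options map never reaches the ORC writer):

{code}
val orcOptions = Map("orc.bloom.filter.columns" -> "*")
hiveContext.sql("SELECT * FROM accounts WHERE effective_date = '2015-12-30'")
  .write
  .format("orc")
  .options(orcOptions)   // not propagated to the underlying ORC output format
  .save("/tmp/accounts")
{code}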






[jira] [Assigned] (SPARK-12417) Orc bloom filter options are not propagated during file write in spark

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12417:


Assignee: Apache Spark

> Orc bloom filter options are not propagated during file write in spark
> --
>
> Key: SPARK-12417
> URL: https://issues.apache.org/jira/browse/SPARK-12417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
>Assignee: Apache Spark
> Attachments: SPARK-12417.1.patch
>
>
> ORC bloom filter is supported by the version of hive used in Spark 1.5.2. 
> However, when trying to create orc file with bloom filter option, it does not 
> make use of it.
> E.g, following orc output does not create the bloom filter even though the 
> options are specified.
> {noformat}
> Map<String, String> orcOption = new HashMap<String, String>();
> orcOption.put("orc.bloom.filter.columns", "*");
> hiveContext.sql("select * from accounts where 
> effective_date='2015-12-30'").write().
> format("orc").options(orcOption).save("/tmp/accounts");
> {noformat}







[jira] [Updated] (SPARK-12313) getPartitionsByFilter doesnt handle predicates on all / multiple Partition Columns

2015-12-18 Thread Gobinathan SP (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gobinathan SP updated SPARK-12313:
--
Description: 
When spark.sql.hive.metastorePartitionPruning is enabled, getPartitionsByFilter
is used.

For a table partitioned by p1 and p2, when hc.sql("select col from tabl1 where
p1='p1V' and p2='p2V'") is triggered, the HiveShim identifies the predicates and
convertFilters returns p1='p1V' and p2='p2V'. The same is passed to the
getPartitionsByFilter method as the filter string.

In these cases the partitions are not returned from Hive's getPartitionsByFilter
method. As a result, the query always returns zero rows.

However, a filter on a single partition column always works; it probably does
not come through this route.

I'm using Oracle for the metastore, v0.13.1

  was:
When enabled spark.sql.hive.metastorePartitionPruning, the 
getPartitionsByFilter is used

For a table partitioned by p1 and p2, when triggered hc.sql("select col 
from tabl1 where p1='p1V' and p2= 'p2V' ")
The HiveShim identifies the Predicates and ConvertFilters returns p1='p1V' and 
col2= 'p2V' .
On these cases the partitions are not returned from Hive's 
getPartitionsByFilter method. As a result, for the sql, the number of returned 
rows is always zero. 

However, filter on a single column always works. Probalbly  it doesn't come 
through this route

I'm using Oracle for Metstore V0.13.1


> getPartitionsByFilter doesnt handle predicates on all / multiple Partition 
> Columns
> --
>
> Key: SPARK-12313
> URL: https://issues.apache.org/jira/browse/SPARK-12313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Gobinathan SP
>Priority: Critical
>
> When enabled spark.sql.hive.metastorePartitionPruning, the 
> getPartitionsByFilter is used
> For a table partitioned by p1 and p2, when triggered hc.sql("select col 
> from tabl1 where p1='p1V' and p2= 'p2V' ")
> The HiveShim identifies the Predicates and ConvertFilters returns p1='p1V' 
> and col2= 'p2V' . The same is passed to the getPartitionsByFilter method as 
> filter string.
> On these cases the partitions are not returned from Hive's 
> getPartitionsByFilter method. As a result, for the sql, the number of 
> returned rows is always zero. 
> However, filter on a single column always works. Probalbly  it doesn't come 
> through this route
> I'm using Oracle for Metstore V0.13.1
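A sketch of the setup being described, with the flag and query shape taken from the report (table and column names are the reporter's placeholders):

{code}
// Enable metastore-side partition pruning, then filter on both partition columns.
hc.setConf("spark.sql.hive.metastorePartitionPruning", "true")
val rows = hc.sql("SELECT col FROM tabl1 WHERE p1 = 'p1V' AND p2 = 'p2V'").collect()
// Both predicates are compiled into a single filter string and handed to the
// metastore's getPartitionsByFilter; per the report, no partitions come back in
// this multi-column case, so `rows` is always empty.
{code}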






[jira] [Assigned] (SPARK-12400) Avoid writing a shuffle file if a partition has no output (empty)

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12400:


Assignee: (was: Apache Spark)

> Avoid writing a shuffle file if a partition has no output (empty)
> -
>
> Key: SPARK-12400
> URL: https://issues.apache.org/jira/browse/SPARK-12400
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: Reynold Xin
>
> A Spark user was asking for automatic setting of # reducers. When I pushed 
> for more, it turned out the problem for them is that 200 creates too many 
> files, when most partitions are empty.
> It seems like a simple thing we can do is to avoid creating shuffle files if 
> a partition is empty.
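For context, the "200" above is presumably spark.sql.shuffle.partitions, whose default is 200. A stopgap sketch until empty partitions stop producing files:

{code}
// Fewer shuffle partitions means fewer (and larger) shuffle files for SQL jobs.
sqlContext.setConf("spark.sql.shuffle.partitions", "32")

// RDD users can pass an explicit partition count to the shuffle operator instead:
// rdd.reduceByKey(_ + _, numPartitions = 32)
{code}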






[jira] [Assigned] (SPARK-12400) Avoid writing a shuffle file if a partition has no output (empty)

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12400:


Assignee: Apache Spark

> Avoid writing a shuffle file if a partition has no output (empty)
> -
>
> Key: SPARK-12400
> URL: https://issues.apache.org/jira/browse/SPARK-12400
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> A Spark user was asking for automatic setting of # reducers. When I pushed 
> for more, it turned out the problem for them is that 200 creates too many 
> files, when most partitions are empty.
> It seems like a simple thing we can do is to avoid creating shuffle files if 
> a partition is empty.






[jira] [Updated] (SPARK-12413) Mesos ZK persistence throws a NotSerializableException

2015-12-18 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-12413:
---
Assignee: Michael Gummelt

> Mesos ZK persistence throws a NotSerializableException
> --
>
> Key: SPARK-12413
> URL: https://issues.apache.org/jira/browse/SPARK-12413
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: Michael Gummelt
>Assignee: Michael Gummelt
>
> https://github.com/apache/spark/pull/10359 breaks ZK persistence due to 
> https://issues.scala-lang.org/browse/SI-6654
> This line throws a NotSerializable exception: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster
> The MesosClusterDispatcher attempts to serialize MesosDriverDescription 
> objects to ZK, but https://github.com/apache/spark/pull/10359 makes it so the 
> {{command}} property is unserializable
> Offer id: 72f4d1ce-67f7-41b0-95a3-aa6fb208df32-O189, cpu: 3.0, mem: 12995.0
> 15/12/17 21:52:44 DEBUG ClientCnxn: Got ping response for sessionid: 
> 0x151b1d1567e0002 after 0ms
> 15/12/17 21:52:44 DEBUG nio: created 
> SCEP@2e746d70{l(/10.0.6.166:41456)<->r(/10.0.0.240:17386),s=0,open=true,ishut=false,oshut=false,rb=false,wb=false,w=true,i=0}-{AsyncHttpConnection@5dbcebe3,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=-14,l=0,c=0},r=0}
> 15/12/17 21:52:44 DEBUG HttpParser: filled 1591/1591
> 15/12/17 21:52:44 DEBUG Server: REQUEST /v1/submissions/create on 
> AsyncHttpConnection@5dbcebe3,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=2,l=2,c=1174},r=1
> 15/12/17 21:52:44 DEBUG ContextHandler: scope null||/v1/submissions/create @ 
> o.s.j.s.ServletContextHandler{/,null}
> 15/12/17 21:52:44 DEBUG ContextHandler: context=||/v1/submissions/create @ 
> o.s.j.s.ServletContextHandler{/,null}
> 15/12/17 21:52:44 DEBUG ServletHandler: servlet |/v1/submissions/create|null 
> -> org.apache.spark.deploy.rest.mesos.MesosSubmitRequestServlet-368e091
> 15/12/17 21:52:44 DEBUG ServletHandler: chain=null
> 15/12/17 21:52:44 WARN ServletHandler: /v1/submissions/create
> java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$1
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>   at org.apache.spark.util.Utils$.serialize(Utils.scala:83)
>   at 
> org.apache.spark.scheduler.cluster.mesos.ZookeeperMesosClusterPersistenceEngine.persist(MesosClusterPersistenceEngine.scala:110)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.submitDriver(MesosClusterScheduler.scala:166)
>   at 
> org.apache.spark.deploy.rest.mesos.MesosSubmitRequestServlet.handleSubmit(MesosRestServer.scala:132)
>   at 
> org.apache.spark.deploy.rest.SubmitRequestServlet.doPost(RestSubmissionServer.scala:258)
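A minimal reproduction of SI-6654 outside Spark (illustrative values only): mapValues and filterKeys on an immutable Map return lazy views backed by anonymous classes that are not Serializable, which is where a name like scala.collection.immutable.MapLike$$anon$1 in the stack trace above comes from. Forcing a strict map avoids the failure.

{code}
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

def trySerialize(label: String, obj: AnyRef): Unit = {
  val out = new ObjectOutputStream(new ByteArrayOutputStream())
  try { out.writeObject(obj); println(s"$label: serialized fine") }
  catch { case e: NotSerializableException => println(s"$label: $e") }
  finally out.close()
}

val env = Map("SPARK_EXECUTOR_OPTS" -> " -Dfoo=bar ")
trySerialize("mapValues view ", env.mapValues(_.trim))                   // fails
trySerialize("materialized map", env.map { case (k, v) => k -> v.trim }) // succeeds
{code}

The usual workaround is to materialize the result (for example with an explicit map over the key/value pairs) before the object is handed to Java serialization.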






[jira] [Resolved] (SPARK-12413) Mesos ZK persistence throws a NotSerializableException

2015-12-18 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-12413.

Resolution: Fixed

> Mesos ZK persistence throws a NotSerializableException
> --
>
> Key: SPARK-12413
> URL: https://issues.apache.org/jira/browse/SPARK-12413
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: Michael Gummelt
>Assignee: Michael Gummelt
> Fix For: 1.6.0, 2.0.0
>
>
> https://github.com/apache/spark/pull/10359 breaks ZK persistence due to 
> https://issues.scala-lang.org/browse/SI-6654
> This line throws a NotSerializable exception: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster
> The MesosClusterDispatcher attempts to serialize MesosDriverDescription 
> objects to ZK, but https://github.com/apache/spark/pull/10359 makes it so the 
> {{command}} property is unserializable
> Offer id: 72f4d1ce-67f7-41b0-95a3-aa6fb208df32-O189, cpu: 3.0, mem: 12995.0
> 15/12/17 21:52:44 DEBUG ClientCnxn: Got ping response for sessionid: 
> 0x151b1d1567e0002 after 0ms
> 15/12/17 21:52:44 DEBUG nio: created 
> SCEP@2e746d70{l(/10.0.6.166:41456)<->r(/10.0.0.240:17386),s=0,open=true,ishut=false,oshut=false,rb=false,wb=false,w=true,i=0}-{AsyncHttpConnection@5dbcebe3,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=-14,l=0,c=0},r=0}
> 15/12/17 21:52:44 DEBUG HttpParser: filled 1591/1591
> 15/12/17 21:52:44 DEBUG Server: REQUEST /v1/submissions/create on 
> AsyncHttpConnection@5dbcebe3,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=2,l=2,c=1174},r=1
> 15/12/17 21:52:44 DEBUG ContextHandler: scope null||/v1/submissions/create @ 
> o.s.j.s.ServletContextHandler{/,null}
> 15/12/17 21:52:44 DEBUG ContextHandler: context=||/v1/submissions/create @ 
> o.s.j.s.ServletContextHandler{/,null}
> 15/12/17 21:52:44 DEBUG ServletHandler: servlet |/v1/submissions/create|null 
> -> org.apache.spark.deploy.rest.mesos.MesosSubmitRequestServlet-368e091
> 15/12/17 21:52:44 DEBUG ServletHandler: chain=null
> 15/12/17 21:52:44 WARN ServletHandler: /v1/submissions/create
> java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$1
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>   at org.apache.spark.util.Utils$.serialize(Utils.scala:83)
>   at 
> org.apache.spark.scheduler.cluster.mesos.ZookeeperMesosClusterPersistenceEngine.persist(MesosClusterPersistenceEngine.scala:110)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.submitDriver(MesosClusterScheduler.scala:166)
>   at 
> org.apache.spark.deploy.rest.mesos.MesosSubmitRequestServlet.handleSubmit(MesosRestServer.scala:132)
>   at 
> org.apache.spark.deploy.rest.SubmitRequestServlet.doPost(RestSubmissionServer.scala:258)






[jira] [Updated] (SPARK-12413) Mesos ZK persistence throws a NotSerializableException

2015-12-18 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-12413:
---
Fix Version/s: 2.0.0
   1.6.0

> Mesos ZK persistence throws a NotSerializableException
> --
>
> Key: SPARK-12413
> URL: https://issues.apache.org/jira/browse/SPARK-12413
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: Michael Gummelt
>Assignee: Michael Gummelt
> Fix For: 1.6.0, 2.0.0
>
>
> https://github.com/apache/spark/pull/10359 breaks ZK persistence due to 
> https://issues.scala-lang.org/browse/SI-6654
> This line throws a NotSerializable exception: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster
> The MesosClusterDispatcher attempts to serialize MesosDriverDescription 
> objects to ZK, but https://github.com/apache/spark/pull/10359 makes it so the 
> {{command}} property is unserializable
> Offer id: 72f4d1ce-67f7-41b0-95a3-aa6fb208df32-O189, cpu: 3.0, mem: 12995.0
> 15/12/17 21:52:44 DEBUG ClientCnxn: Got ping response for sessionid: 
> 0x151b1d1567e0002 after 0ms
> 15/12/17 21:52:44 DEBUG nio: created 
> SCEP@2e746d70{l(/10.0.6.166:41456)<->r(/10.0.0.240:17386),s=0,open=true,ishut=false,oshut=false,rb=false,wb=false,w=true,i=0}-{AsyncHttpConnection@5dbcebe3,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=-14,l=0,c=0},r=0}
> 15/12/17 21:52:44 DEBUG HttpParser: filled 1591/1591
> 15/12/17 21:52:44 DEBUG Server: REQUEST /v1/submissions/create on 
> AsyncHttpConnection@5dbcebe3,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=2,l=2,c=1174},r=1
> 15/12/17 21:52:44 DEBUG ContextHandler: scope null||/v1/submissions/create @ 
> o.s.j.s.ServletContextHandler{/,null}
> 15/12/17 21:52:44 DEBUG ContextHandler: context=||/v1/submissions/create @ 
> o.s.j.s.ServletContextHandler{/,null}
> 15/12/17 21:52:44 DEBUG ServletHandler: servlet |/v1/submissions/create|null 
> -> org.apache.spark.deploy.rest.mesos.MesosSubmitRequestServlet-368e091
> 15/12/17 21:52:44 DEBUG ServletHandler: chain=null
> 15/12/17 21:52:44 WARN ServletHandler: /v1/submissions/create
> java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$1
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>   at org.apache.spark.util.Utils$.serialize(Utils.scala:83)
>   at 
> org.apache.spark.scheduler.cluster.mesos.ZookeeperMesosClusterPersistenceEngine.persist(MesosClusterPersistenceEngine.scala:110)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.submitDriver(MesosClusterScheduler.scala:166)
>   at 
> org.apache.spark.deploy.rest.mesos.MesosSubmitRequestServlet.handleSubmit(MesosRestServer.scala:132)
>   at 
> org.apache.spark.deploy.rest.SubmitRequestServlet.doPost(RestSubmissionServer.scala:258)






[jira] [Commented] (SPARK-12413) Mesos ZK persistence throws a NotSerializableException

2015-12-18 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063845#comment-15063845
 ] 

Kousuke Saruta commented on SPARK-12413:


Memorandum: If 1.6.0-RC4 is not cut, we should modify Fix Versions from 1.6.0 
to 1.6.1.

> Mesos ZK persistence throws a NotSerializableException
> --
>
> Key: SPARK-12413
> URL: https://issues.apache.org/jira/browse/SPARK-12413
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: Michael Gummelt
>Assignee: Michael Gummelt
> Fix For: 1.6.0, 2.0.0
>
>
> https://github.com/apache/spark/pull/10359 breaks ZK persistence due to 
> https://issues.scala-lang.org/browse/SI-6654
> This line throws a NotSerializable exception: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster
> The MesosClusterDispatcher attempts to serialize MesosDriverDescription 
> objects to ZK, but https://github.com/apache/spark/pull/10359 makes it so the 
> {{command}} property is unserializable
> Offer id: 72f4d1ce-67f7-41b0-95a3-aa6fb208df32-O189, cpu: 3.0, mem: 12995.0
> 15/12/17 21:52:44 DEBUG ClientCnxn: Got ping response for sessionid: 
> 0x151b1d1567e0002 after 0ms
> 15/12/17 21:52:44 DEBUG nio: created 
> SCEP@2e746d70{l(/10.0.6.166:41456)<->r(/10.0.0.240:17386),s=0,open=true,ishut=false,oshut=false,rb=false,wb=false,w=true,i=0}-{AsyncHttpConnection@5dbcebe3,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=-14,l=0,c=0},r=0}
> 15/12/17 21:52:44 DEBUG HttpParser: filled 1591/1591
> 15/12/17 21:52:44 DEBUG Server: REQUEST /v1/submissions/create on 
> AsyncHttpConnection@5dbcebe3,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=2,l=2,c=1174},r=1
> 15/12/17 21:52:44 DEBUG ContextHandler: scope null||/v1/submissions/create @ 
> o.s.j.s.ServletContextHandler{/,null}
> 15/12/17 21:52:44 DEBUG ContextHandler: context=||/v1/submissions/create @ 
> o.s.j.s.ServletContextHandler{/,null}
> 15/12/17 21:52:44 DEBUG ServletHandler: servlet |/v1/submissions/create|null 
> -> org.apache.spark.deploy.rest.mesos.MesosSubmitRequestServlet-368e091
> 15/12/17 21:52:44 DEBUG ServletHandler: chain=null
> 15/12/17 21:52:44 WARN ServletHandler: /v1/submissions/create
> java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$1
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>   at org.apache.spark.util.Utils$.serialize(Utils.scala:83)
>   at 
> org.apache.spark.scheduler.cluster.mesos.ZookeeperMesosClusterPersistenceEngine.persist(MesosClusterPersistenceEngine.scala:110)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.submitDriver(MesosClusterScheduler.scala:166)
>   at 
> org.apache.spark.deploy.rest.mesos.MesosSubmitRequestServlet.handleSubmit(MesosRestServer.scala:132)
>   at 
> org.apache.spark.deploy.rest.SubmitRequestServlet.doPost(RestSubmissionServer.scala:258)






[jira] [Assigned] (SPARK-12393) Add read.text and write.text for SparkR

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12393:


Assignee: (was: Apache Spark)

> Add read.text and write.text for SparkR
> ---
>
> Key: SPARK-12393
> URL: https://issues.apache.org/jira/browse/SPARK-12393
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Add read.text and write.text for SparkR






[jira] [Assigned] (SPARK-12393) Add read.text and write.text for SparkR

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12393:


Assignee: Apache Spark

> Add read.text and write.text for SparkR
> ---
>
> Key: SPARK-12393
> URL: https://issues.apache.org/jira/browse/SPARK-12393
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>
> Add read.text and write.text for SparkR






[jira] [Commented] (SPARK-12400) Avoid writing a shuffle file if a partition has no output (empty)

2015-12-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063891#comment-15063891
 ] 

Apache Spark commented on SPARK-12400:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/10376

> Avoid writing a shuffle file if a partition has no output (empty)
> -
>
> Key: SPARK-12400
> URL: https://issues.apache.org/jira/browse/SPARK-12400
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: Reynold Xin
>
> A Spark user was asking for automatic setting of # reducers. When I pushed 
> for more, it turned out the problem for them is that 200 creates too many 
> files, when most partitions are empty.
> It seems like a simple thing we can do is to avoid creating shuffle files if 
> a partition is empty.






[jira] [Created] (SPARK-12422) Binding Spark Standalone Master to public IP fails

2015-12-18 Thread Bennet Jeutter (JIRA)
Bennet Jeutter created SPARK-12422:
--

 Summary: Binding Spark Standalone Master to public IP fails
 Key: SPARK-12422
 URL: https://issues.apache.org/jira/browse/SPARK-12422
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.5.2
 Environment: Fails on direct deployment on Mac OSX and also in Docker 
Environment (running on OSX or Ubuntu)
Reporter: Bennet Jeutter
Priority: Blocker


Starting the Spark Standalone Master fails when the specified host equals the
public IP address. For example, I created a Docker Machine with public IP
192.168.99.100 and then ran:
/usr/spark/bin/spark-class org.apache.spark.deploy.master.Master -h 192.168.99.100

It'll fail with:
Exception in thread "main" java.net.BindException: Failed to bind to: 
/192.168.99.100:7093: Service 'sparkMaster' failed after 16 retries!
at 
org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at 
akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:393)
at 
akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:389)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)
at scala.util.Try$.apply(Try.scala:161)
at scala.util.Success.map(Try.scala:206)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at 
akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at 
akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
at 
akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
at 
akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
at 
scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at 
akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

So I thought, oh well, let's just bind to the local IP and access it via the
public IP - this doesn't work either; it gives:
dropping message [class akka.actor.ActorSelectionMessage] for non-local 
recipient [Actor[akka.tcp://sparkMaster@192.168.99.100:7077/]] arriving at 
[akka.tcp://sparkMaster@192.168.99.100:7077] inbound addresses are 
[akka.tcp://sparkMaster@spark-master:7077]

So there is currently no way to run this setup. Related Stack Overflow issues:
* 
http://stackoverflow.com/questions/31659228/getting-java-net-bindexception-when-attempting-to-start-spark-master-on-ec2-node
* 
http://stackoverflow.com/questions/33768029/access-apache-spark-standalone-master-via-ip






[jira] [Created] (SPARK-12423) Mesos executor home should not be resolved on the driver's file system

2015-12-18 Thread Iulian Dragos (JIRA)
Iulian Dragos created SPARK-12423:
-

 Summary: Mesos executor home should not be resolved on the 
driver's file system
 Key: SPARK-12423
 URL: https://issues.apache.org/jira/browse/SPARK-12423
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.6.0
Reporter: Iulian Dragos


{{spark.mesos.executor.home}} should be an uninterpreted string. It is very 
possible that this path does not exist on the driver, and if it does, it may be 
a symlink that should not be resolved. Currently, this leads to failures in 
client mode.

For example, setting it to {{/var/spark/spark-1.6.0-bin-hadoop2.6/}} leads to 
executors failing:

{code}
sh: 1: /private/var/spark/spark-1.6.0-bin-hadoop2.6/bin/spark-class: not found
{code}

{{getCanonicalPath}} transforms {{/var/spark...}} into {{/private/var..}} 
because on my system there is a symlink from one to the other.
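An illustration of the resolution being described, assuming a machine where /var is a symlink to /private/var (as on OS X); the path is only an example.

{code}
import java.io.File

val home = "/var/spark/spark-1.6.0-bin-hadoop2.6"
new File(home).getCanonicalPath // "/private/var/spark/..." - symlinks are resolved
new File(home).getAbsolutePath  // "/var/spark/..."         - the string is left alone
{code}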






[jira] [Updated] (SPARK-6936) SQLContext.sql() caused deadlock in multi-thread env

2015-12-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6936:
-
Assignee: Michael Armbrust

> SQLContext.sql() caused deadlock in multi-thread env
> 
>
> Key: SPARK-6936
> URL: https://issues.apache.org/jira/browse/SPARK-6936
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: JDK 1.8.x, RedHat
> Linux version 2.6.32-431.23.3.el6.x86_64 
> (mockbu...@x86-027.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red 
> Hat 4.4.7-4) (GCC) ) #1 SMP Wed Jul 16 06:12:23 EDT 2014
>Reporter: Paul Wu
>Assignee: Michael Armbrust
>  Labels: deadlock, sql, threading
> Fix For: 1.5.0
>
>
> Doing the same query in more than one thread with SQLContext.sql may lead to a
> deadlock. Here is a way to reproduce it (since this is a multi-threading issue,
> the reproduction may or may not be easy):
> 1. Register a relatively big table.
> 2. Create two different classes; in each class, run the same query in a method,
> put the results in a set, and print the set size.
> 3. Create two threads, each using an object of one class in its run method, and
> start them. In my tests, a deadlock occurred within just a few runs (see the
> sketch below).
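A hypothetical sketch of that reproduction (table, column, and class names are placeholders; the report uses two distinct classes, collapsed here into one parameterized Worker for brevity; sqlContext is assumed to be an existing SQLContext with a large table registered):

{code}
class Worker(name: String, sqlContext: org.apache.spark.sql.SQLContext) extends Runnable {
  override def run(): Unit = {
    // Same query in both workers; results collected into a set.
    val keys = sqlContext.sql("SELECT key FROM big_table").collect().map(_.get(0)).toSet
    println(s"$name saw ${keys.size} distinct keys")
  }
}

val t1 = new Thread(new Worker("worker-1", sqlContext))
val t2 = new Thread(new Worker("worker-2", sqlContext))
t1.start(); t2.start()
t1.join();  t2.join()   // per the report, this occasionally hangs on 1.3.0
{code}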






[jira] [Updated] (SPARK-12396) Once driver client registered successfully,it still retry to connected to master.

2015-12-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12396:
--
   Flags:   (was: Patch)
Target Version/s:   (was: 1.5.2)
  Labels:   (was: patch)
   Fix Version/s: (was: 1.5.2)

[~ZhangMei] please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark  Don't 
set target/fix version, and there's no 'patch' type or flag used here. I don't 
see a pull request.

> Once driver client registered successfully,it still retry to connected to 
> master.
> -
>
> Key: SPARK-12396
> URL: https://issues.apache.org/jira/browse/SPARK-12396
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.5.1, 1.5.2
>Reporter: echo
>Priority: Minor
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> As described in AppClient.scala, once the driver connects to a master
> successfully, all scheduling work and Futures will be cancelled. But currently
> it still tries to connect to the master, which should not happen.






[jira] [Updated] (SPARK-12387) JDBC IN operator push down

2015-12-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12387:
--
Fix Version/s: (was: 1.6.0)

> JDBC  IN operator push down
> ---
>
> Key: SPARK-12387
> URL: https://issues.apache.org/jira/browse/SPARK-12387
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>
> For the SQL IN operator, such as
> SELECT column_name(s)
> FROM table_name
> WHERE column_name IN (value1,value2,...)
> this is currently not pushed down to the JDBC data source.
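A sketch of the missing translation (not the actual patch; the quote helper is an assumption, not Spark API): compile a source In filter into an IN clause that the JDBC relation can append to its WHERE clause.

{code}
import org.apache.spark.sql.sources.{Filter, In}

// Assumed helper: quote string literals, pass other values through as-is.
def quote(value: Any): String = value match {
  case s: String => s"'$s'"
  case other     => other.toString
}

def compileIn(filter: Filter): Option[String] = filter match {
  case In(attr, values) if values.nonEmpty =>
    Some(s"$attr IN (${values.map(quote).mkString(", ")})")
  case _ => None   // unsupported or empty IN list: leave the filter to Spark
}
{code}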






[jira] [Updated] (SPARK-12403) "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore

2015-12-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12403:
--
Fix Version/s: (was: 1.4.1)
   (was: 1.3.1)

[~lunendl] please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark  It 
doesn't make sense to set fix version, let alone to 1.3.1/1.4.1

> "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore
> 
>
> Key: SPARK-12403
> URL: https://issues.apache.org/jira/browse/SPARK-12403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
> Environment: ODBC connector query 
>Reporter: Lunen
>
> We are unable to query the SPARK tables using the ODBC driver from Simba 
> Spark(Databricks - "Simba Spark ODBC Driver 1.0")  We are able to do a show 
> databases and show tables, but not any queries. eg.
> Working:
> Select * from openquery(SPARK,'SHOW DATABASES')
> Select * from openquery(SPARK,'SHOW TABLES')
> Not working:
> Select * from openquery(SPARK,'Select * from lunentest')
> The error I get is:
> OLE DB provider "MSDASQL" for linked server "SPARK" returned message 
> "[Simba][SQLEngine] (31740) Table or view not found: spark..lunentest".
> Msg 7321, Level 16, State 2, Line 2
> An error occurred while preparing the query "Select * from lunentest" for 
> execution against OLE DB provider "MSDASQL" for linked server "SPARK"






[jira] [Updated] (SPARK-12401) Add support for enums in postgres

2015-12-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12401:
--
   Priority: Minor  (was: Major)
Component/s: SQL

> Add support for enums in postgres
> -
>
> Key: SPARK-12401
> URL: https://issues.apache.org/jira/browse/SPARK-12401
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Jaka Jancar
>Priority: Minor
>
> JSON and JSONB types [are now 
> converted|https://github.com/apache/spark/pull/8948/files] into strings on 
> the Spark side instead of throwing. It would be great if [enumerated 
> types|http://www.postgresql.org/docs/current/static/datatype-enum.html] were 
> treated similarly instead of failing.






[jira] [Updated] (SPARK-12346) GLM summary crashes with NoSuchElementException if attributes are missing names

2015-12-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12346:
--
Component/s: SparkR

> GLM summary crashes with NoSuchElementException if attributes are missing 
> names
> ---
>
> Key: SPARK-12346
> URL: https://issues.apache.org/jira/browse/SPARK-12346
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Eric Liang
>
> In getModelFeatures() of SparkRWrappers.scala, we call _.name.get on all the 
> feature column attributes. This fails when the attribute name is not defined.
> One way of reproducing this is to perform glm() in R with a vector-type input 
> feature that lacks ML attrs, then trying to call summary() on it, for example:
> {code}
> df <- sql(sqlContext, "SELECT * FROM testData")
> df2 <- withColumnRenamed(df, "f1", "f2") // This drops the ML attrs from f1
> lrModel <- glm(hours_per_week ~ f2, data = df2, family = "gaussian")
> summary(lrModel) // NoSuchElementException
> {code}






[jira] [Resolved] (SPARK-12418) spark shuffle FAILED_TO_UNCOMPRESS

2015-12-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12418.
---
  Resolution: Duplicate
Target Version/s:   (was: 1.5.1)

Please search JIRA first

> spark shuffle FAILED_TO_UNCOMPRESS
> --
>
> Key: SPARK-12418
> URL: https://issues.apache.org/jira/browse/SPARK-12418
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1
> Environment: hadoop 2.3.0
> spark 1.5.1
>Reporter: dirk.zhang
>
> When using the default compression codec (snappy), I get an error while Spark is doing a shuffle:
>   Job aborted due to stage failure: Task 19 in stage 2.3 failed 4 times, 
> most recent failure: Lost task 19.3 in stage 2.3 (TID 10311, 192.168.6.36): 
> java.io.IOException: FAILED_TO_UNCOMPRESS(5)
>   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
>   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
>   at org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
>   at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:135)
>   at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:92)
>   at org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:58)
>   at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:159)
>   at 
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1179)
>   at 
> org.apache.spark.shuffle.hash.HashShuffleReader$$anonfun$3.apply(HashShuffleReader.scala:53)
>   at 
> org.apache.spark.shuffle.hash.HashShuffleReader$$anonfun$3.apply(HashShuffleReader.scala:52)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:217)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)






[jira] [Updated] (SPARK-12370) Documentation should link to examples from its own release version

2015-12-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12370:
--
   Priority: Minor  (was: Major)
Component/s: Documentation

> Documentation should link to examples from its own release version
> --
>
> Key: SPARK-12370
> URL: https://issues.apache.org/jira/browse/SPARK-12370
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Brian London
>Priority: Minor
>
> When documentation is built, it should reference examples from the same build. 
> There are times when the docs have links that point to files at the GitHub 
> head, which may not be valid for the current release.
> As an example the spark streaming page for 1.5.2 (currently at 
> http://spark.apache.org/docs/latest/streaming-programming-guide.html) links 
> to the stateful network word count example (at 
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala).
>   That example file utilizes a number of 1.6 features that are not available 
> in 1.5.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12369) DataFrameReader fails on globbing parquet paths

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12369:


Assignee: Apache Spark

> DataFrameReader fails on globbing parquet paths
> ---
>
> Key: SPARK-12369
> URL: https://issues.apache.org/jira/browse/SPARK-12369
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yana Kadiyska
>Assignee: Apache Spark
>
> Start with a list of parquet paths where some or all do not exist:
> {noformat}
> val paths=List("/foo/month=05/*.parquet","/foo/month=06/*.parquet")
>  sqlContext.read.parquet(paths:_*)
> java.lang.NullPointerException
> at org.apache.hadoop.fs.Globber.glob(Globber.java:218)
> at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1625)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:251)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:258)
> at 
> org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:264)
> at 
> org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:260)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:260)
> {noformat}
> It would be better to produce a DataFrame from the paths that do exist and 
> log a warning that a path was missing. I'm not sure about the "all paths are 
> missing" case -- probably return an empty DataFrame with no schema, since that 
> method already does so on an empty path list. But I would prefer not to have to 
> pre-validate paths.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9057) Add Scala, Java and Python example to show DStream.transform

2015-12-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-9057.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 8431
[https://github.com/apache/spark/pull/8431]

> Add Scala, Java and Python example to show DStream.transform
> 
>
> Key: SPARK-9057
> URL: https://issues.apache.org/jira/browse/SPARK-9057
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>  Labels: starter
> Fix For: 2.0.0
>
>
> Currently there is no example to show the use of transform. It would be good to 
> add an example that uses transform to join a static RDD with the RDDs of a 
> DStream.
> This needs to be done for all 3 supported languages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12369) DataFrameReader fails on globbing parquet paths

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12369:


Assignee: (was: Apache Spark)

> DataFrameReader fails on globbing parquet paths
> ---
>
> Key: SPARK-12369
> URL: https://issues.apache.org/jira/browse/SPARK-12369
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yana Kadiyska
>
> Start with a list of parquet paths where some or all do not exist:
> {noformat}
> val paths=List("/foo/month=05/*.parquet","/foo/month=06/*.parquet")
>  sqlContext.read.parquet(paths:_*)
> java.lang.NullPointerException
> at org.apache.hadoop.fs.Globber.glob(Globber.java:218)
> at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1625)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:251)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:258)
> at 
> org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:264)
> at 
> org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:260)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:260)
> {noformat}
> It would be better to produce a DataFrame from the paths that do exist and 
> log a warning that a path was missing. I'm not sure about the "all paths are 
> missing" case -- probably return an empty DataFrame with no schema, since that 
> method already does so on an empty path list. But I would prefer not to have to 
> pre-validate paths.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8318) Spark Streaming Starter JIRAs

2015-12-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8318.
--
Resolution: Implemented

> Spark Streaming Starter JIRAs
> -
>
> Key: SPARK-8318
> URL: https://issues.apache.org/jira/browse/SPARK-8318
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Tathagata Das
>Priority: Minor
>  Labels: starter
>
> This is a master JIRA to collect together all starter tasks related to Spark 
> Streaming. These are simple tasks that contributors can do to get familiar 
> with the process of contributing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9057) Add Scala, Java and Python example to show DStream.transform

2015-12-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9057:
-
Assignee: Jeff Lam

> Add Scala, Java and Python example to show DStream.transform
> 
>
> Key: SPARK-9057
> URL: https://issues.apache.org/jira/browse/SPARK-9057
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Jeff Lam
>  Labels: starter
> Fix For: 2.0.0
>
>
> Currently there is no example to show the use of transform. It would be good to 
> add an example that uses transform to join a static RDD with the RDDs of a 
> DStream.
> This needs to be done for all 3 supported languages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12319) Address endian specific problems surfaced in 1.6

2015-12-18 Thread Adam Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Roberts updated SPARK-12319:
-
Environment: Problems are evident on BE  (was: BE platforms)

> Address endian specific problems surfaced in 1.6
> 
>
> Key: SPARK-12319
> URL: https://issues.apache.org/jira/browse/SPARK-12319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Problems are evident on BE
>Reporter: Adam Roberts
>Priority: Critical
>
> JIRA to cover endian-specific problems - since testing 1.6 I've noticed 
> problems with DataFrames on BE platforms, e.g. 
> https://issues.apache.org/jira/browse/SPARK-9858
> [~joshrosen] [~yhuai]
> Current progress: using com.google.common.io.LittleEndianDataInputStream and 
> com.google.common.io.LittleEndianDataOutputStream within UnsafeRowSerializer 
> fixes three test failures in ExchangeCoordinatorSuite, but I'm concerned 
> about performance and wider functional implications.
> "org.apache.spark.sql.DatasetAggregatorSuite.typed aggregation: class input 
> with reordering" fails: we expect "one, 1" but instead get "one, 9". We 
> believe the issue lies within BitSetMethods.java, specifically around: return 
> (wi << 6) + subIndex + java.lang.Long.numberOfTrailingZeros(word); 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12369) DataFrameReader fails on globbing parquet paths

2015-12-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064059#comment-15064059
 ] 

Apache Spark commented on SPARK-12369:
--

User 'yanakad' has created a pull request for this issue:
https://github.com/apache/spark/pull/10379

> DataFrameReader fails on globbing parquet paths
> ---
>
> Key: SPARK-12369
> URL: https://issues.apache.org/jira/browse/SPARK-12369
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yana Kadiyska
>
> Start with a list of parquet paths where some or all do not exist:
> {noformat}
> val paths=List("/foo/month=05/*.parquet","/foo/month=06/*.parquet")
>  sqlContext.read.parquet(paths:_*)
> java.lang.NullPointerException
> at org.apache.hadoop.fs.Globber.glob(Globber.java:218)
> at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1625)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:251)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:258)
> at 
> org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:264)
> at 
> org.apache.spark.sql.DataFrameReader$$anonfun$3.apply(DataFrameReader.scala:260)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:260)
> {noformat}
> It would be better to produce a DataFrame from the paths that do exist and 
> log a warning that a path was missing. I'm not sure about the "all paths are 
> missing" case -- probably return an empty DataFrame with no schema, since that 
> method already does so on an empty path list. But I would prefer not to have to 
> pre-validate paths.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12424) The implementation of ParamMap#filter is wrong.

2015-12-18 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-12424:
--

 Summary: The implementation of ParamMap#filter is wrong.
 Key: SPARK-12424
 URL: https://issues.apache.org/jira/browse/SPARK-12424
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.6.0, 2.0.0
Reporter: Kousuke Saruta


ParamMap#filter uses `mutable.Map#filterKeys`. The return type of `filterKeys` 
is collection.Map, not mutable.Map, but the result is cast to mutable.Map 
using `asInstanceOf`, so we get a `ClassCastException`.
Also, the return type of Map#filterKeys is not Serializable. This is a known 
Scala issue (https://issues.scala-lang.org/browse/SI-6654).
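For illustration, here is a minimal standalone sketch of the failure mode using plain Scala collections (this is not the ML code itself; the commented-out cast is what triggers the ClassCastException):

{code}
import scala.collection.mutable

val m = mutable.Map("a" -> 1, "b" -> 2)

// collection.MapLike#filterKeys returns a collection.Map view, not a mutable.Map,
// so forcing it back with asInstanceOf fails at runtime:
// val bad = m.filterKeys(_ == "a").asInstanceOf[mutable.Map[String, Int]]  // ClassCastException

// A safe alternative: filter the entries and rebuild a mutable.Map explicitly.
val ok: mutable.Map[String, Int] =
  mutable.Map(m.toSeq.filter { case (k, _) => k == "a" }: _*)
println(ok)  // Map(a -> 1)
{code}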



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12424) The implementation of ParamMap#filter is wrong.

2015-12-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064113#comment-15064113
 ] 

Apache Spark commented on SPARK-12424:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/10381

> The implementation of ParamMap#filter is wrong.
> ---
>
> Key: SPARK-12424
> URL: https://issues.apache.org/jira/browse/SPARK-12424
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Kousuke Saruta
>
> ParamMap#filter uses `mutable.Map#filterKeys`. The return type of `filterKeys` 
> is collection.Map, not mutable.Map, but the result is cast to mutable.Map 
> using `asInstanceOf`, so we get a `ClassCastException`.
> Also, the return type of Map#filterKeys is not Serializable. This is a known 
> Scala issue (https://issues.scala-lang.org/browse/SI-6654).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12319) Address endian specific problems surfaced in 1.6

2015-12-18 Thread Adam Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Roberts updated SPARK-12319:
-
Environment: Problems apparent on BE, LE could be impacted too  (was: 
Problems are evident on BE)

> Address endian specific problems surfaced in 1.6
> 
>
> Key: SPARK-12319
> URL: https://issues.apache.org/jira/browse/SPARK-12319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Problems apparent on BE, LE could be impacted too
>Reporter: Adam Roberts
>Priority: Critical
>
> JIRA to cover endian-specific problems - since testing 1.6 I've noticed 
> problems with DataFrames on BE platforms, e.g. 
> https://issues.apache.org/jira/browse/SPARK-9858
> [~joshrosen] [~yhuai]
> Current progress: using com.google.common.io.LittleEndianDataInputStream and 
> com.google.common.io.LittleEndianDataOutputStream within UnsafeRowSerializer 
> fixes three test failures in ExchangeCoordinatorSuite, but I'm concerned 
> about performance and wider functional implications.
> "org.apache.spark.sql.DatasetAggregatorSuite.typed aggregation: class input 
> with reordering" fails: we expect "one, 1" but instead get "one, 9". We 
> believe the issue lies within BitSetMethods.java, specifically around: return 
> (wi << 6) + subIndex + java.lang.Long.numberOfTrailingZeros(word); 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12424) The implementation of ParamMap#filter is wrong.

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12424:


Assignee: (was: Apache Spark)

> The implementation of ParamMap#filter is wrong.
> ---
>
> Key: SPARK-12424
> URL: https://issues.apache.org/jira/browse/SPARK-12424
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Kousuke Saruta
>
> ParamMap#filter uses `mutable.Map#filterKeys`. The return type of `filterKeys` 
> is collection.Map, not mutable.Map, but the result is cast to mutable.Map 
> using `asInstanceOf`, so we get a `ClassCastException`.
> Also, the return type of Map#filterKeys is not Serializable. This is a known 
> Scala issue (https://issues.scala-lang.org/browse/SI-6654).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12424) The implementation of ParamMap#filter is wrong.

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12424:


Assignee: Apache Spark

> The implementation of ParamMap#filter is wrong.
> ---
>
> Key: SPARK-12424
> URL: https://issues.apache.org/jira/browse/SPARK-12424
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>
> ParamMap#filter uses `mutable.Map#filterKeys`. The return type of `filterKeys` 
> is collection.Map, not mutable.Map, but the result is cast to mutable.Map 
> using `asInstanceOf`, so we get a `ClassCastException`.
> Also, the return type of Map#filterKeys is not Serializable. This is a known 
> Scala issue (https://issues.scala-lang.org/browse/SI-6654).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12424) The implementation of ParamMap#filter is wrong.

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12424:


Assignee: (was: Apache Spark)

> The implementation of ParamMap#filter is wrong.
> ---
>
> Key: SPARK-12424
> URL: https://issues.apache.org/jira/browse/SPARK-12424
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Kousuke Saruta
>
> ParamMap#filter uses `mutable.Map#filterKeys`. The return type of `filterKeys` 
> is collection.Map, not mutable.Map, but the result is cast to mutable.Map 
> using `asInstanceOf`, so we get a `ClassCastException`.
> Also, the return type of Map#filterKeys is not Serializable. This is a known 
> Scala issue (https://issues.scala-lang.org/browse/SI-6654).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12424) The implementation of ParamMap#filter is wrong.

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12424:


Assignee: Apache Spark

> The implementation of ParamMap#filter is wrong.
> ---
>
> Key: SPARK-12424
> URL: https://issues.apache.org/jira/browse/SPARK-12424
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>
> ParamMap#filter uses `mutable.Map#filterKeys`. The return type of `filterKeys` 
> is collection.Map, not mutable.Map, but the result is cast to mutable.Map 
> using `asInstanceOf`, so we get a `ClassCastException`.
> Also, the return type of Map#filterKeys is not Serializable. This is a known 
> Scala issue (https://issues.scala-lang.org/browse/SI-6654).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12424) The implementation of ParamMap#filter is wrong.

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12424:


Assignee: (was: Apache Spark)

> The implementation of ParamMap#filter is wrong.
> ---
>
> Key: SPARK-12424
> URL: https://issues.apache.org/jira/browse/SPARK-12424
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Kousuke Saruta
>
> ParamMap#filter uses `mutable.Map#filterKeys`. The return type of `filterKeys` 
> is collection.Map, not mutable.Map, but the result is cast to mutable.Map 
> using `asInstanceOf`, so we get a `ClassCastException`.
> Also, the return type of Map#filterKeys is not Serializable. This is a known 
> Scala issue (https://issues.scala-lang.org/browse/SI-6654).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12424) The implementation of ParamMap#filter is wrong.

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12424:


Assignee: Apache Spark

> The implementation of ParamMap#filter is wrong.
> ---
>
> Key: SPARK-12424
> URL: https://issues.apache.org/jira/browse/SPARK-12424
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>
> ParamMap#filter uses `mutable.Map#filterKeys`. The return type of `filterKeys` 
> is collection.Map, not mutable.Map, but the result is cast to mutable.Map 
> using `asInstanceOf`, so we get a `ClassCastException`.
> Also, the return type of Map#filterKeys is not Serializable. This is a known 
> Scala issue (https://issues.scala-lang.org/browse/SI-6654).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11701) YARN - dynamic allocation and speculation active task accounting wrong

2015-12-18 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064200#comment-15064200
 ] 

Thomas Graves commented on SPARK-11701:
---

I ran into another instance of this, and it happens when the job has multiple stages: 
if it's not the last stage and both speculative tasks finish, they are both 
marked as success. One of them gets ignored, which can leave the counts wrong and 
show that an executor still has a task.

15/12/18 16:01:08 INFO scheduler.TaskSetManager: Ignoring task-finished event 
for 8.1 in stage 0.0 because task 8 has already completed successfully

In this case the TaskCommit code and the DAG scheduler won't handle it; 
TaskSetManager.handleSuccessfulTask needs to handle it.

> YARN - dynamic allocation and speculation active task accounting wrong
> --
>
> Key: SPARK-11701
> URL: https://issues.apache.org/jira/browse/SPARK-11701
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Critical
>
> I am using dynamic container allocation and speculation and am seeing issues 
> with the active task accounting. The Executor UI still shows active tasks on 
> an executor even though the job/stage has completed. I think it is also 
> preventing dynamic allocation from releasing containers, because it 
> thinks there are still tasks.
> It is easily reproduced by using spark-shell: turn on dynamic allocation, then 
> run just a wordcount on a decent-sized file, and set the speculation parameters 
> low: 
>  spark.dynamicAllocation.enabled true
>  spark.shuffle.service.enabled true
>  spark.dynamicAllocation.maxExecutors 10
>  spark.dynamicAllocation.minExecutors 2
>  spark.dynamicAllocation.initialExecutors 10
>  spark.dynamicAllocation.executorIdleTimeout 40s
> $SPARK_HOME/bin/spark-shell --conf spark.speculation=true --conf 
> spark.speculation.multiplier=0.2 --conf spark.speculation.quantile=0.1 
> --master yarn --deploy-mode client  --executor-memory 4g --driver-memory 4g



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10291) Add statsByKey method to compute StatCounters for each key in an RDD

2015-12-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064266#comment-15064266
 ] 

Sean Owen commented on SPARK-10291:
---

My POV is that this isn't likely worth adding a method for. I appreciate the 
value of utility methods but have to weigh that against adding another item to a 
core API and how often it'd be used. This is also straightforward to express in 
Spark SQL on a DataFrame, no?
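For illustration, the kind of per-key summary described above can be written directly against a DataFrame in spark-shell (assuming Spark 1.6, where stddev is available as a built-in aggregate; the "key" and "value" column names are made up for the example):

{code}
import org.apache.spark.sql.functions._

// Toy data with a grouping key and a numeric value.
val df = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0))).toDF("key", "value")

// StatCounter-style summary per key using built-in aggregates.
val stats = df.groupBy("key").agg(
  count(col("value")).as("count"),
  mean(col("value")).as("mean"),
  stddev(col("value")).as("stddev"),
  min(col("value")).as("min"),
  max(col("value")).as("max"))

stats.show()
{code}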

> Add statsByKey method to compute StatCounters for each key in an RDD
> 
>
> Key: SPARK-10291
> URL: https://issues.apache.org/jira/browse/SPARK-10291
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Erik Shilts
>Priority: Minor
>
> A common task is to summarize numerical data for different groups. Having a 
> `statsByKey` method would simplify this so the user would not have to write 
> the aggregators for all the statistics or manage collecting by key and 
> computing individual StatCounters.
> This should be a straightforward addition to PySpark. I can look into adding 
> it to Scala and R if we want to maintain feature parity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12409) JDBC AND operator push down

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12409:


Assignee: Apache Spark

> JDBC AND operator push down 
> 
>
> Key: SPARK-12409
> URL: https://issues.apache.org/jira/browse/SPARK-12409
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>
> For a simple AND such as 
> select * from test where THEID = 1 AND NAME = 'fred', 
> the filters pushed down to the JDBC layer are EqualTo(THEID,1) and 
> EqualTo(Name,fred). These are handled OK by the current code. 
> For a query such as 
> SELECT * FROM foobar WHERE THEID = 1 OR NAME = 'mary' AND THEID = 2,
> the filter is Or(EqualTo(THEID,1),And(EqualTo(NAME,mary),EqualTo(THEID,2))),
> so we need to add an And filter in the JDBC layer.
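A rough sketch of the translation the JDBC layer would need (illustrative only, not the actual JDBCRDD implementation; the helper names are made up). The key point is that a conjunction should be pushed only when both of its sides can themselves be compiled:

{code}
import org.apache.spark.sql.sources._

// Turn a data source Filter into a SQL WHERE fragment, returning None for
// filters that cannot be pushed down.
def compileFilter(f: Filter): Option[String] = f match {
  case EqualTo(attr, value) => Some(s"$attr = ${quoteValue(value)}")
  case And(left, right) =>
    // Push the conjunction only if BOTH sides compile; silently dropping one
    // side would change the semantics of the pushed predicate.
    for (l <- compileFilter(left); r <- compileFilter(right)) yield s"($l) AND ($r)"
  case Or(left, right) =>
    for (l <- compileFilter(left); r <- compileFilter(right)) yield s"($l) OR ($r)"
  case _ => None
}

def quoteValue(value: Any): String = value match {
  case s: String => s"'${s.replace("'", "''")}'"
  case other     => other.toString
}
{code}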



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12409) JDBC AND operator push down

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12409:


Assignee: (was: Apache Spark)

> JDBC AND operator push down 
> 
>
> Key: SPARK-12409
> URL: https://issues.apache.org/jira/browse/SPARK-12409
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Huaxin Gao
>Priority: Minor
>
> For a simple AND such as 
> select * from test where THEID = 1 AND NAME = 'fred', 
> the filters pushed down to the JDBC layer are EqualTo(THEID,1) and 
> EqualTo(Name,fred). These are handled OK by the current code. 
> For a query such as 
> SELECT * FROM foobar WHERE THEID = 1 OR NAME = 'mary' AND THEID = 2,
> the filter is Or(EqualTo(THEID,1),And(EqualTo(NAME,mary),EqualTo(THEID,2))),
> so we need to add an And filter in the JDBC layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12350) VectorAssembler#transform() initially throws an exception

2015-12-18 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-12350.

   Resolution: Fixed
 Assignee: Marcelo Vanzin  (was: Apache Spark)
Fix Version/s: 2.0.0

> VectorAssembler#transform() initially throws an exception
> -
>
> Key: SPARK-12350
> URL: https://issues.apache.org/jira/browse/SPARK-12350
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell
> Environment: sparkShell command from sbt
>Reporter: Jakob Odersky
>Assignee: Marcelo Vanzin
> Fix For: 2.0.0
>
>
> Calling VectorAssembler.transform() initially throws an exception; subsequent 
> calls work.
> h3. Steps to reproduce
> In spark-shell,
> 1. Create a dummy dataframe and define an assembler
> {code}
> import org.apache.spark.ml.feature.VectorAssembler
> val df = sc.parallelize(List((1,2), (3,4))).toDF
> val assembler = new VectorAssembler().setInputCols(Array("_1", 
> "_2")).setOutputCol("features")
> {code}
> 2. Run
> {code}
> assembler.transform(df).show
> {code}
> Initially the following exception is thrown:
> {code}
> 15/12/15 16:20:19 ERROR TransportRequestHandler: Error opening stream 
> /classes/org/apache/spark/sql/catalyst/expressions/Object.class for request 
> from /9.72.139.102:60610
> java.lang.IllegalArgumentException: requirement failed: File not found: 
> /classes/org/apache/spark/sql/catalyst/expressions/Object.class
>   at scala.Predef$.require(Predef.scala:233)
>   at 
> org.apache.spark.rpc.netty.NettyStreamManager.openStream(NettyStreamManager.scala:60)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:106)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Subsequent calls work:
> {code}
> +---+---+-+
> | _1| _2| features|
> +---+---+-+
> |  1|  2|[1.0,2.0]|
> |  3|  4|[3.0,4.0]|
> +---+---+-+
> {code}
> It seems as though there is some internal state that is not initialized.
> [~iyounus] originally found this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11619) cannot use UDTF in DataFrame.selectExpr

2015-12-18 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-11619.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 9981
[https://github.com/apache/spark/pull/9981]

> cannot use UDTF in DataFrame.selectExpr
> ---
>
> Key: SPARK-11619
> URL: https://issues.apache.org/jira/browse/SPARK-11619
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently, if we use a UDTF such as `explode` or `json_tuple` in `DataFrame.selectExpr`, 
> it will be parsed into an `UnresolvedFunction` first and then aliased with 
> `expr.prettyString`. However, a UDTF may need a MultiAlias, so we will get an error 
> if we run:
> {code}
> val df = Seq((Map("1" -> 1), 1)).toDF("a", "b")
> df.selectExpr("explode(a)").show()
> {code}
> [info]   org.apache.spark.sql.AnalysisException: Expect multiple names given 
> for org.apache.spark.sql.catalyst.expressions.Explode,
> [info] but only single name ''explode(a)' specified;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11619) cannot use UDTF in DataFrame.selectExpr

2015-12-18 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-11619:
-
Assignee: Dilip Biswal

> cannot use UDTF in DataFrame.selectExpr
> ---
>
> Key: SPARK-11619
> URL: https://issues.apache.org/jira/browse/SPARK-11619
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Dilip Biswal
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently, if we use a UDTF such as `explode` or `json_tuple` in `DataFrame.selectExpr`, 
> it will be parsed into an `UnresolvedFunction` first and then aliased with 
> `expr.prettyString`. However, a UDTF may need a MultiAlias, so we will get an error 
> if we run:
> {code}
> val df = Seq((Map("1" -> 1), 1)).toDF("a", "b")
> df.selectExpr("explode(a)").show()
> {code}
> [info]   org.apache.spark.sql.AnalysisException: Expect multiple names given 
> for org.apache.spark.sql.catalyst.expressions.Explode,
> [info] but only single name ''explode(a)' specified;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12353) wrong output for countByValue and countByValueAndWindow

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12353:


Assignee: Apache Spark

> wrong output for countByValue and countByValueAndWindow
> ---
>
> Key: SPARK-12353
> URL: https://issues.apache.org/jira/browse/SPARK-12353
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Input/Output, PySpark, Streaming
>Affects Versions: 1.5.2
> Environment: Ubuntu 14.04, Python 2.7.6
>Reporter: Bo Jin
>Assignee: Apache Spark
>  Labels: easyfix
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> http://stackoverflow.com/q/34114585/4698425
> In PySpark Streaming, the functions countByValue and countByValueAndWindow return 
> a single number (the count of distinct elements) instead of a list 
> of (k, v) pairs.
> This is inconsistent with the documentation: 
> countByValue: When called on a DStream of elements of type K, return a new 
> DStream of (K, Long) pairs where the value of each key is its frequency in 
> each RDD of the source DStream.
> countByValueAndWindow: When called on a DStream of (K, V) pairs, returns a 
> new DStream of (K, Long) pairs where the value of each key is its frequency 
> within a sliding window. Like in reduceByKeyAndWindow, the number of reduce 
> tasks is configurable through an optional argument.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12353) wrong output for countByValue and countByValueAndWindow

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12353:


Assignee: (was: Apache Spark)

> wrong output for countByValue and countByValueAndWindow
> ---
>
> Key: SPARK-12353
> URL: https://issues.apache.org/jira/browse/SPARK-12353
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Input/Output, PySpark, Streaming
>Affects Versions: 1.5.2
> Environment: Ubuntu 14.04, Python 2.7.6
>Reporter: Bo Jin
>  Labels: easyfix
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> http://stackoverflow.com/q/34114585/4698425
> In PySpark Streaming, the functions countByValue and countByValueAndWindow return 
> a single number (the count of distinct elements) instead of a list 
> of (k, v) pairs.
> This is inconsistent with the documentation: 
> countByValue: When called on a DStream of elements of type K, return a new 
> DStream of (K, Long) pairs where the value of each key is its frequency in 
> each RDD of the source DStream.
> countByValueAndWindow: When called on a DStream of (K, V) pairs, returns a 
> new DStream of (K, Long) pairs where the value of each key is its frequency 
> within a sliding window. Like in reduceByKeyAndWindow, the number of reduce 
> tasks is configurable through an optional argument.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12425) DStream union optimisation

2015-12-18 Thread Guillaume Poulin (JIRA)
Guillaume Poulin created SPARK-12425:


 Summary: DStream union optimisation
 Key: SPARK-12425
 URL: https://issues.apache.org/jira/browse/SPARK-12425
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Guillaume Poulin
Priority: Minor


Currently, `DStream.union` always uses `UnionRDD` on the underlying `RDD`. 
However, using `PartitionerAwareUnionRDD` when possible would yield better 
performance.
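A minimal sketch of the idea (not the actual DStream code; PartitionerAwareUnionRDD is private[spark], so something like this only compiles inside Spark itself):

{code}
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.rdd.{PartitionerAwareUnionRDD, RDD, UnionRDD}

// Union a batch of RDDs, preserving the partitioner when every input shares
// the same defined partitioner, so downstream stages can avoid a shuffle.
def unionPreservingPartitioner[T: ClassTag](sc: SparkContext, rdds: Seq[RDD[T]]): RDD[T] = {
  val partitioners = rdds.map(_.partitioner).distinct
  if (rdds.nonEmpty && partitioners.size == 1 && partitioners.head.isDefined) {
    new PartitionerAwareUnionRDD(sc, rdds)   // keeps the common partitioner
  } else {
    new UnionRDD(sc, rdds)                   // plain union; the partitioner is lost
  }
}
{code}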



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12425) DStream union optimisation

2015-12-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064376#comment-15064376
 ] 

Apache Spark commented on SPARK-12425:
--

User 'gpoulin' has created a pull request for this issue:
https://github.com/apache/spark/pull/10382

> DStream union optimisation
> --
>
> Key: SPARK-12425
> URL: https://issues.apache.org/jira/browse/SPARK-12425
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Guillaume Poulin
>Priority: Minor
>
> Currently, `DStream.union` always uses `UnionRDD` on the underlying `RDD`. 
> However, using `PartitionerAwareUnionRDD` when possible would yield better 
> performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12054) Consider nullable in codegen

2015-12-18 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12054.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10333
[https://github.com/apache/spark/pull/10333]

> Consider nullable in codegen
> 
>
> Key: SPARK-12054
> URL: https://issues.apache.org/jira/browse/SPARK-12054
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> Currently, we always check the nullability of expression results; we 
> could skip that check if the expression is not nullable.
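For illustration only, here is a toy sketch (not Spark's actual code generator) of how generated code might skip the null check when an expression is statically known to be non-nullable; the names below are made up:

{code}
// Holds the variable names a parent expression uses to read a child's result.
case class ExprCode(isNull: String, value: String)

// Emit Java source for "result = child + 1", adding the null check only when
// the child can actually produce null.
def genAddOne(child: ExprCode, childNullable: Boolean): String = {
  if (childNullable) {
    s"""boolean isNull = ${child.isNull};
       |int value = 0;
       |if (!isNull) { value = ${child.value} + 1; }""".stripMargin
  } else {
    // Non-nullable child: no branch and no isNull bookkeeping needed.
    s"int value = ${child.value} + 1;"
  }
}
{code}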



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12391) JDBC OR operator push down

2015-12-18 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-12391:
-
Target Version/s:   (was: 1.6.0)

> JDBC OR operator push down
> --
>
> Key: SPARK-12391
> URL: https://issues.apache.org/jira/browse/SPARK-12391
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>
> For SQL OR operator such as
> SELECT *
> FROM table_name
> WHERE column_name1  =  value1 OR  column_name2  = value2
> Will push down to JDBC datasource



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12409) JDBC AND operator push down

2015-12-18 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-12409:
-
Target Version/s:   (was: 1.6.0)

> JDBC AND operator push down 
> 
>
> Key: SPARK-12409
> URL: https://issues.apache.org/jira/browse/SPARK-12409
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Huaxin Gao
>Priority: Minor
>
> For simple AND such as 
> select * from test where THEID = 1 AND NAME = 'fred', 
> The filters pushed down to JDBC layers are EqualTo(THEID,1), 
> EqualTo(Name,fred). These are handled OK by the current code. 
> For query such as 
> SELECT * FROM foobar WHERE THEID = 1 OR NAME = 'mary' AND THEID = 2" ,
> the filter is Or(EqualTo(THEID,1),And(EqualTo(NAME,mary),EqualTo(THEID,2)))
> So need to add And filter in JDBC layer.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12335) CentralMomentAgg should be nullable

2015-12-18 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-12335:
--

Assignee: Davies Liu  (was: Apache Spark)

> CentralMomentAgg should be nullable
> ---
>
> Key: SPARK-12335
> URL: https://issues.apache.org/jira/browse/SPARK-12335
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Cheng Lian
>Assignee: Davies Liu
>
> According to the {{getStatistics}} method overridden in all its subclasses, 
> {{CentralMomentAgg}} should be nullable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12335) CentralMomentAgg should be nullable

2015-12-18 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12335.

   Resolution: Fixed
Fix Version/s: 2.0.0

https://github.com/apache/spark/pull/10333

> CentralMomentAgg should be nullable
> ---
>
> Key: SPARK-12335
> URL: https://issues.apache.org/jira/browse/SPARK-12335
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Cheng Lian
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> According to the {{getStatistics}} method overridden in all its subclasses, 
> {{CentralMomentAgg}} should be nullable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12336) Outer join using multiple columns results in wrong nullability

2015-12-18 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-12336:
--

Assignee: Davies Liu  (was: Cheng Lian)

> Outer join using multiple columns results in wrong nullability
> --
>
> Key: SPARK-12336
> URL: https://issues.apache.org/jira/browse/SPARK-12336
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.0, 2.0.0
>Reporter: Cheng Lian
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> When joining two DataFrames using multiple columns, a temporary inner join is 
> used to compute the join output. Then the real join operator is created and 
> projected. However, the final projection list is based on the inner join 
> rather than the real join operator. When the real join operator is an outer join, 
> the nullability of the final projection can be wrong, since an outer join may alter 
> the nullability of its child plan(s).
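A minimal way to observe the reported symptom in spark-shell (assuming the Spark 1.6 join(usingColumns, joinType) overload; the column names are made up):

{code}
val left  = sc.parallelize(Seq((1, "a", 10))).toDF("k1", "k2", "lv")
val right = sc.parallelize(Seq((2, "b", 20))).toDF("k1", "k2", "rv")

// Outer join on multiple columns: non-matching rows produce nulls, so the
// value columns in the result should be reported as nullable.
val joined = left.join(right, Seq("k1", "k2"), "outer")
joined.printSchema()   // check whether the nullable flags reflect the outer join
{code}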



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12336) Outer join using multiple columns results in wrong nullability

2015-12-18 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12336.

   Resolution: Fixed
Fix Version/s: 2.0.0

https://github.com/apache/spark/pull/10333

> Outer join using multiple columns results in wrong nullability
> --
>
> Key: SPARK-12336
> URL: https://issues.apache.org/jira/browse/SPARK-12336
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.0, 2.0.0
>Reporter: Cheng Lian
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> When joining two DataFrames using multiple columns, a temporary inner join is 
> used to compute the join output. Then the real join operator is created and 
> projected. However, the final projection list is based on the inner join 
> rather than the real join operator. When the real join operator is an outer join, 
> the nullability of the final projection can be wrong, since an outer join may alter 
> the nullability of its child plan(s).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12342) Corr (Pearson correlation) should be nullable

2015-12-18 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-12342:
--

Assignee: Davies Liu  (was: Cheng Lian)

> Corr (Pearson correlation) should be nullable
> -
>
> Key: SPARK-12342
> URL: https://issues.apache.org/jira/browse/SPARK-12342
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Cheng Lian
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12341) The "comment" field of DESCRIBE result set should be nullable

2015-12-18 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-12341:
--

Assignee: Davies Liu  (was: Apache Spark)

> The "comment" field of DESCRIBE result set should be nullable
> -
>
> Key: SPARK-12341
> URL: https://issues.apache.org/jira/browse/SPARK-12341
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.0.2, 1.1.1, 1.2.2, 1.3.1, 1.4.1, 1.5.2, 1.6.0, 2.0.0
>Reporter: Cheng Lian
>Assignee: Davies Liu
>Priority: Minor
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12341) The "comment" field of DESCRIBE result set should be nullable

2015-12-18 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12341.

   Resolution: Fixed
Fix Version/s: 2.0.0

https://github.com/apache/spark/pull/10333

> The "comment" field of DESCRIBE result set should be nullable
> -
>
> Key: SPARK-12341
> URL: https://issues.apache.org/jira/browse/SPARK-12341
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.0.2, 1.1.1, 1.2.2, 1.3.1, 1.4.1, 1.5.2, 1.6.0, 2.0.0
>Reporter: Cheng Lian
>Assignee: Davies Liu
>Priority: Minor
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12342) Corr (Pearson correlation) should be nullable

2015-12-18 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12342.

   Resolution: Fixed
Fix Version/s: 2.0.0

https://github.com/apache/spark/pull/10333

> Corr (Pearson correlation) should be nullable
> -
>
> Key: SPARK-12342
> URL: https://issues.apache.org/jira/browse/SPARK-12342
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Cheng Lian
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12426) Docker JDBC integration tests are failing again

2015-12-18 Thread Mark Grover (JIRA)
Mark Grover created SPARK-12426:
---

 Summary: Docker JDBC integration tests are failing again
 Key: SPARK-12426
 URL: https://issues.apache.org/jira/browse/SPARK-12426
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 1.6.0
Reporter: Mark Grover


The Docker JDBC integration tests were fixed in SPARK-11796, but they seem to be 
failing again on my machine (Ubuntu Precise). This is the same box that I 
tested my previous commit on. Also, I am not confident this failure has much to 
do with Spark, since a well-known commit where the tests were previously passing 
now fails in the same environment.

[~sowen] mentioned on the Spark 1.6 voting thread that the tests were failing 
on his Ubuntu 15 box as well.

Here's the error, fyi:
{code}
15/12/18 10:12:50 INFO SparkContext: Successfully stopped SparkContext
15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down 
remote daemon.
15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon 
shut down; proceeding with flushing remote transports.
*** RUN ABORTED ***
  com.spotify.docker.client.DockerException: 
java.util.concurrent.ExecutionException: 
com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: 
java.io.IOException: No such file or directory
  at 
com.spotify.docker.client.DefaultDockerClient.propagate(DefaultDockerClient.java:1141)
  at 
com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1082)
  at 
com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)
  at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)
  at 
org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
  at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)
  at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
  at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)
  at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492)
  at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528)
  ...
  Cause: java.util.concurrent.ExecutionException: 
com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: 
java.io.IOException: No such file or directory
  at 
jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
  at 
jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
  at 
jersey.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
  at 
com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1080)
  at 
com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)
  at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)
  at 
org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
  at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)
  at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
  at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)
  ...
  Cause: com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: 
java.io.IOException: No such file or directory
  at 
org.glassfish.jersey.apache.connector.ApacheConnector.apply(ApacheConnector.java:481)
  at 
org.glassfish.jersey.apache.connector.ApacheConnector$1.run(ApacheConnector.java:491)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
  at 
jersey.repackaged.com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:299)
  at 
java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
  at 
jersey.repackaged.com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:50)
  at 
jersey.repackaged.com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:37)
  at 
org.glassfish.jersey.apache.connector.ApacheConnector.apply(ApacheConnector.java:487)
15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut 
down.
  at org.glassfish.jersey.client.ClientRuntime$2.run(ClientRuntime.java:177)
  ...
  Cause: java.io.IOException: No such file or directory
  at jnr.unixsocket.UnixSocketChannel.doConnect(UnixSocketChannel.java:94)
  at jnr.unixsocket.UnixSocketChannel.connect(UnixSocketChannel.java:102)
  at 
com.spotify.docker.client.ApacheUnixSocket.connect(ApacheUnixSocket.java:73)
  at 
com.spotify.docker.client.UnixConnectionSocketFactory.c

[jira] [Commented] (SPARK-7142) Minor enhancement to BooleanSimplification Optimizer rule

2015-12-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064447#comment-15064447
 ] 

Apache Spark commented on SPARK-7142:
-

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/10383

> Minor enhancement to BooleanSimplification Optimizer rule
> -
>
> Key: SPARK-7142
> URL: https://issues.apache.org/jira/browse/SPARK-7142
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yash Datta
>Assignee: Yash Datta
>Priority: Minor
> Fix For: 1.6.0
>
>
> Add simplification using these rules :
> A and (not(A) or B) => A and B
> not(A and B) => not(A) or not(B)
> not(A or B) => not(A) and not(B)
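
For readers who want to see what such rewrites look like in code, here is a
minimal, self-contained Scala sketch of the three rules over a toy boolean
expression tree (illustrative only, not Catalyst's actual BooleanSimplification
rule):

{code}
// Toy expression tree; names are illustrative, not Catalyst's.
sealed trait Expr
case class Var(name: String) extends Expr
case class Not(child: Expr) extends Expr
case class And(left: Expr, right: Expr) extends Expr
case class Or(left: Expr, right: Expr) extends Expr

// One simplification pass applying the three rules listed above.
def simplify(e: Expr): Expr = e match {
  // A and (not(A) or B) => A and B
  case And(a, Or(Not(b), c)) if a == b => And(simplify(a), simplify(c))
  // not(A and B) => not(A) or not(B)
  case Not(And(a, b)) => Or(simplify(Not(a)), simplify(Not(b)))
  // not(A or B) => not(A) and not(B)
  case Not(Or(a, b)) => And(simplify(Not(a)), simplify(Not(b)))
  case And(a, b) => And(simplify(a), simplify(b))
  case Or(a, b)  => Or(simplify(a), simplify(b))
  case Not(a)    => Not(simplify(a))
  case v: Var    => v
}
{code}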



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12218) Invalid splitting of nested AND expressions in Data Source filter API

2015-12-18 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reassigned SPARK-12218:


Assignee: Yin Huai

> Invalid splitting of nested AND expressions in Data Source filter API
> -
>
> Key: SPARK-12218
> URL: https://issues.apache.org/jira/browse/SPARK-12218
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Irakli Machabeli
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.5.3, 1.6.0, 2.0.0
>
>
> Two logically equivalent queries produce different results
> In [2]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( 
> PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 
> 'PreviouslyChargedOff'))").count()
> Out[2]: 18
> In [3]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and ( 
> not(PaymentsReceived=0) or not (ExplicitRoll in ('PreviouslyPaidOff', 
> 'PreviouslyChargedOff')))").count()
> Out[3]: 28
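
The underlying fix is to split predicates into pushable conjuncts only at
top-level ANDs, never inside an OR branch. A hedged sketch of that rule against
the public Data Source filter API (the helper name is illustrative, not the
actual Spark code):

{code}
import org.apache.spark.sql.sources.{And, Filter}

// Split only top-level conjuncts; an And nested under an Or is kept intact,
// so dropping an unhandled conjunct cannot silently weaken the filter.
def splitConjunctivePredicates(filter: Filter): Seq[Filter] = filter match {
  case And(left, right) =>
    splitConjunctivePredicates(left) ++ splitConjunctivePredicates(right)
  case other => Seq(other)
}
{code}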



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12218) Invalid splitting of nested AND expressions in Data Source filter API

2015-12-18 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-12218.
--
   Resolution: Fixed
Fix Version/s: 1.6.0
   1.5.3
   2.0.0

Issue resolved by pull request 10362
[https://github.com/apache/spark/pull/10362]

> Invalid splitting of nested AND expressions in Data Source filter API
> -
>
> Key: SPARK-12218
> URL: https://issues.apache.org/jira/browse/SPARK-12218
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Irakli Machabeli
>Priority: Blocker
> Fix For: 2.0.0, 1.5.3, 1.6.0
>
>
> Two logically equivalent queries produce different results
> In [2]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( 
> PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 
> 'PreviouslyChargedOff'))").count()
> Out[2]: 18
> In [3]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and ( 
> not(PaymentsReceived=0) or not (ExplicitRoll in ('PreviouslyPaidOff', 
> 'PreviouslyChargedOff')))").count()
> Out[3]: 28



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12372) Document limitations of MLlib local linear algebra

2015-12-18 Thread Christos Iraklis Tsatsoulis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064476#comment-15064476
 ] 

Christos Iraklis Tsatsoulis commented on SPARK-12372:
-

You are very welcome

> Document limitations of MLlib local linear algebra
> --
>
> Key: SPARK-12372
> URL: https://issues.apache.org/jira/browse/SPARK-12372
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Christos Iraklis Tsatsoulis
>
> This JIRA is now for documenting limitations of MLlib's local linear algebra 
> types.  Basically, we should make it clear in the user guide that they 
> provide simple functionality but are not a full-fledged local linear algebra library. 
>  We should also recommend libraries for users to use in the meantime: 
> probably Breeze for Scala (and Java?) and numpy/scipy for Python.
> *Original JIRA title*: Unary operator "-" fails for MLlib vectors
> *Original JIRA text, as an example of the need for better docs*:
> Consider the following snippet in pyspark 1.5.2:
> {code:none}
> >>> from pyspark.mllib.linalg import Vectors
> >>> x = Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0])
> >>> x
> DenseVector([0.0, 1.0, 0.0, 7.0, 0.0])
> >>> -x
> Traceback (most recent call last):
>   File "", line 1, in 
> TypeError: func() takes exactly 2 arguments (1 given)
> >>> y = Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0])
> >>> y
> DenseVector([2.0, 0.0, 3.0, 4.0, 5.0])
> >>> x-y
> DenseVector([-2.0, 1.0, -3.0, 3.0, -5.0])
> >>> -y+x
> Traceback (most recent call last):
>   File "", line 1, in 
> TypeError: func() takes exactly 2 arguments (1 given)
> >>> -1*x
> DenseVector([-0.0, -1.0, -0.0, -7.0, -0.0])
> {code}
> Clearly, the unary operator {{-}} (minus) for vectors fails, giving errors 
> for expressions like {{-x}} and {{-y+x}}, despite the fact that {{x-y}} 
> behaves as expected.
> The last operation, {{-1*x}}, although mathematically "correct", includes 
> minus signs for the zero entries, which again is normally not expected.
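
Until the documentation lands, one workaround consistent with the
recommendation above is to do the arithmetic in Breeze and convert back; a
hedged Scala sketch (Breeze ships with Spark itself):

{code}
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg.{DenseVector, Vectors}

val x = Vectors.dense(0.0, 1.0, 0.0, 7.0, 0.0)

// MLlib's local vectors are intentionally minimal; negate in Breeze instead.
val bx = new BDV(x.toArray)
val negated = bx * -1.0                       // element-wise scaling
val back = new DenseVector(negated.toArray)   // back to an MLlib vector
{code}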



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12427) spark builds filling up jenkins' disk

2015-12-18 Thread shane knapp (JIRA)
shane knapp created SPARK-12427:
---

 Summary: spark builds filling up jenkins' disk
 Key: SPARK-12427
 URL: https://issues.apache.org/jira/browse/SPARK-12427
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: shane knapp
Priority: Critical


problem summary:

a few spark builds are filling up the jenkins master's disk with millions of 
little log files as build artifacts.  

currently, we have a raid10 array set up with 5.4T of storage.  we're currently 
using 4.0T, 99.9% of which is spark unit test and junit logs.

the worst offenders, with more than 100G of disk usage per job, are:
193G  ./Spark-1.6-Maven-with-YARN
194G  ./Spark-1.5-Maven-with-YARN
205G  ./Spark-1.6-Maven-pre-YARN
216G  ./Spark-1.5-Maven-pre-YARN
387G  ./Spark-Master-Maven-with-YARN
420G  ./Spark-Master-Maven-pre-YARN
520G  ./Spark-1.6-SBT
733G  ./Spark-1.5-SBT
812G  ./Spark-Master-SBT

i have attached a full report w/all builds listed as well.

each of these builds is keeping their build history for 90 days.

keep in mind that for each new matrix build, we're looking at another 200-500G 
apiece for the SBT/pre-YARN/with-YARN jobs.

a straw man, back of napkin estimate for spark 1.7 is 2T of additional disk 
usage.

on the hardware config side, we can move from raid10 to raid 5 and get ~3T 
additional storage.  if we ditch raid altogether and put in bigger disks, we 
can get a total of 16-20T storage on master.  another option is to have a NFS 
mount to a deep storage server.  all of these options will require significant 
downtime.

questions:
* can we lower the number of days that we keep build information?
* there are other options in jenkins that we can set as well:  max number of 
builds to keep, max # days to keep artifacts, max # of builds to keep 
w/artifacts
* can we make the junit and unit test logs smaller (probably not)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12427) spark builds filling up jenkins' disk

2015-12-18 Thread shane knapp (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp updated SPARK-12427:

Attachment: jenkins_disk_usage.txt

> spark builds filling up jenkins' disk
> -
>
> Key: SPARK-12427
> URL: https://issues.apache.org/jira/browse/SPARK-12427
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: shane knapp
>Priority: Critical
>  Labels: build, jenkins
> Attachments: jenkins_disk_usage.txt
>
>
> problem summary:
> a few spark builds are filling up the jenkins master's disk with millions of 
> little log files as build artifacts.  
> currently, we have a raid10 array set up with 5.4T of storage.  we're 
> currently using 4.0T, 99.9% of which is spark unit test and junit logs.
> the worst offenders, with more than 100G of disk usage per job, are:
> 193G./Spark-1.6-Maven-with-YARN
> 194G./Spark-1.5-Maven-with-YARN
> 205G./Spark-1.6-Maven-pre-YARN
> 216G./Spark-1.5-Maven-pre-YARN
> 387G./Spark-Master-Maven-with-YARN
> 420G./Spark-Master-Maven-pre-YARN
> 520G./Spark-1.6-SBT
> 733G./Spark-1.5-SBT
> 812G./Spark-Master-SBT
> i have attached a full report w/all builds listed as well.
> each of these builds is keeping their build history for 90 days.
> keep in mind that for each new matrix build, we're looking at another 
> 200-500G per for the SBT/pre-YARN/with-YARN jobs.
> a straw man, back of napkin estimate for spark 1.7 is 2T of additional disk 
> usage.
> on the hardware config side, we can move from raid10 to raid 5 and get ~3T 
> additional storage.  if we ditch raid altogether and put in bigger disks, we 
> can get a total of 16-20T storage on master.  another option is to have a NFS 
> mount to a deep storage server.  all of these options will require 
> significant downtime.
> questions:
> * can we lower the number of days that we keep build information?
> * there are other options in jenkins that we can set as well:  max number of 
> builds to keep, max # days to keep artifacts, max # of builds to keep 
> w/artifacts
> * can we make the junit and unit test logs smaller (probably not)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12427) spark builds filling up jenkins' disk

2015-12-18 Thread shane knapp (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp updated SPARK-12427:

Attachment: graph.png

disk usage over the past year, for lols.

> spark builds filling up jenkins' disk
> -
>
> Key: SPARK-12427
> URL: https://issues.apache.org/jira/browse/SPARK-12427
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: shane knapp
>Priority: Critical
>  Labels: build, jenkins
> Attachments: graph.png, jenkins_disk_usage.txt
>
>
> problem summary:
> a few spark builds are filling up the jenkins master's disk with millions of 
> little log files as build artifacts.  
> currently, we have a raid10 array set up with 5.4T of storage.  we're 
> currently using 4.0T, 99.9% of which is spark unit test and junit logs.
> the worst offenders, with more than 100G of disk usage per job, are:
> 193G./Spark-1.6-Maven-with-YARN
> 194G./Spark-1.5-Maven-with-YARN
> 205G./Spark-1.6-Maven-pre-YARN
> 216G./Spark-1.5-Maven-pre-YARN
> 387G./Spark-Master-Maven-with-YARN
> 420G./Spark-Master-Maven-pre-YARN
> 520G./Spark-1.6-SBT
> 733G./Spark-1.5-SBT
> 812G./Spark-Master-SBT
> i have attached a full report w/all builds listed as well.
> each of these builds is keeping their build history for 90 days.
> keep in mind that for each new matrix build, we're looking at another 
> 200-500G per for the SBT/pre-YARN/with-YARN jobs.
> a straw man, back of napkin estimate for spark 1.7 is 2T of additional disk 
> usage.
> on the hardware config side, we can move from raid10 to raid 5 and get ~3T 
> additional storage.  if we ditch raid altogether and put in bigger disks, we 
> can get a total of 16-20T storage on master.  another option is to have a NFS 
> mount to a deep storage server.  all of these options will require 
> significant downtime.
> questions:
> * can we lower the number of days that we keep build information?
> * there are other options in jenkins that we can set as well:  max number of 
> builds to keep, max # days to keep artifacts, max # of builds to keep 
> w/artifacts
> * can we make the junit and unit test logs smaller (probably not)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12427) spark builds filling up jenkins' disk

2015-12-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064568#comment-15064568
 ] 

Sean Owen commented on SPARK-12427:
---

I doubt we really need build history for more than a week or two. Does reducing 
to 2 weeks help enough to keep out of trouble for a while?

If the next major release is 2.0, and it drops support for most old Hadoop 
variations, at least we'll have no more separate pre/post-YARN builds.

> spark builds filling up jenkins' disk
> -
>
> Key: SPARK-12427
> URL: https://issues.apache.org/jira/browse/SPARK-12427
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: shane knapp
>Priority: Critical
>  Labels: build, jenkins
> Attachments: graph.png, jenkins_disk_usage.txt
>
>
> problem summary:
> a few spark builds are filling up the jenkins master's disk with millions of 
> little log files as build artifacts.  
> currently, we have a raid10 array set up with 5.4T of storage.  we're 
> currently using 4.0T, 99.9% of which is spark unit test and junit logs.
> the worst offenders, with more than 100G of disk usage per job, are:
> 193G./Spark-1.6-Maven-with-YARN
> 194G./Spark-1.5-Maven-with-YARN
> 205G./Spark-1.6-Maven-pre-YARN
> 216G./Spark-1.5-Maven-pre-YARN
> 387G./Spark-Master-Maven-with-YARN
> 420G./Spark-Master-Maven-pre-YARN
> 520G./Spark-1.6-SBT
> 733G./Spark-1.5-SBT
> 812G./Spark-Master-SBT
> i have attached a full report w/all builds listed as well.
> each of these builds is keeping their build history for 90 days.
> keep in mind that for each new matrix build, we're looking at another 
> 200-500G per for the SBT/pre-YARN/with-YARN jobs.
> a straw man, back of napkin estimate for spark 1.7 is 2T of additional disk 
> usage.
> on the hardware config side, we can move from raid10 to raid 5 and get ~3T 
> additional storage.  if we ditch raid altogether and put in bigger disks, we 
> can get a total of 16-20T storage on master.  another option is to have a NFS 
> mount to a deep storage server.  all of these options will require 
> significant downtime.
> questions:
> * can we lower the number of days that we keep build information?
> * there are other options in jenkins that we can set as well:  max number of 
> builds to keep, max # days to keep artifacts, max # of builds to keep 
> w/artifacts
> * can we make the junit and unit test logs smaller (probably not)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12427) spark builds filling up jenkins' disk

2015-12-18 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064572#comment-15064572
 ] 

shane knapp commented on SPARK-12427:
-

[~joshrosen]

> spark builds filling up jenkins' disk
> -
>
> Key: SPARK-12427
> URL: https://issues.apache.org/jira/browse/SPARK-12427
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: shane knapp
>Priority: Critical
>  Labels: build, jenkins
> Attachments: graph.png, jenkins_disk_usage.txt
>
>
> problem summary:
> a few spark builds are filling up the jenkins master's disk with millions of 
> little log files as build artifacts.  
> currently, we have a raid10 array set up with 5.4T of storage.  we're 
> currently using 4.0T, 99.9% of which is spark unit test and junit logs.
> the worst offenders, with more than 100G of disk usage per job, are:
> 193G./Spark-1.6-Maven-with-YARN
> 194G./Spark-1.5-Maven-with-YARN
> 205G./Spark-1.6-Maven-pre-YARN
> 216G./Spark-1.5-Maven-pre-YARN
> 387G./Spark-Master-Maven-with-YARN
> 420G./Spark-Master-Maven-pre-YARN
> 520G./Spark-1.6-SBT
> 733G./Spark-1.5-SBT
> 812G./Spark-Master-SBT
> i have attached a full report w/all builds listed as well.
> each of these builds is keeping their build history for 90 days.
> keep in mind that for each new matrix build, we're looking at another 
> 200-500G per for the SBT/pre-YARN/with-YARN jobs.
> a straw man, back of napkin estimate for spark 1.7 is 2T of additional disk 
> usage.
> on the hardware config side, we can move from raid10 to raid 5 and get ~3T 
> additional storage.  if we ditch raid altogether and put in bigger disks, we 
> can get a total of 16-20T storage on master.  another option is to have a NFS 
> mount to a deep storage server.  all of these options will require 
> significant downtime.
> questions:
> * can we lower the number of days that we keep build information?
> * there are other options in jenkins that we can set as well:  max number of 
> builds to keep, max # days to keep artifacts, max # of builds to keep 
> w/artifacts
> * can we make the junit and unit test logs smaller (probably not)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12428) Write a script to run all PySpark MLlib examples for testing

2015-12-18 Thread holdenk (JIRA)
holdenk created SPARK-12428:
---

 Summary: Write a script to run all PySpark MLlib examples for 
testing
 Key: SPARK-12428
 URL: https://issues.apache.org/jira/browse/SPARK-12428
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, Tests
Reporter: holdenk


See parent for design sketch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12428) Write a script to run all PySpark MLlib examples for testing

2015-12-18 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064610#comment-15064610
 ] 

holdenk commented on SPARK-12428:
-

I can start working on this a bit over the holidays :)

> Write a script to run all PySpark MLlib examples for testing
> 
>
> Key: SPARK-12428
> URL: https://issues.apache.org/jira/browse/SPARK-12428
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Reporter: holdenk
>
> See parent for design sketch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12427) spark builds filling up jenkins' disk

2015-12-18 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064628#comment-15064628
 ] 

shane knapp commented on SPARK-12427:
-

if we NEED to store for longer than 2 weeks, we can absolutely rejigger storage.

> spark builds filling up jenkins' disk
> -
>
> Key: SPARK-12427
> URL: https://issues.apache.org/jira/browse/SPARK-12427
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: shane knapp
>Priority: Critical
>  Labels: build, jenkins
> Attachments: graph.png, jenkins_disk_usage.txt
>
>
> problem summary:
> a few spark builds are filling up the jenkins master's disk with millions of 
> little log files as build artifacts.  
> currently, we have a raid10 array set up with 5.4T of storage.  we're 
> currently using 4.0T, 99.9% of which is spark unit test and junit logs.
> the worst offenders, with more than 100G of disk usage per job, are:
> 193G./Spark-1.6-Maven-with-YARN
> 194G./Spark-1.5-Maven-with-YARN
> 205G./Spark-1.6-Maven-pre-YARN
> 216G./Spark-1.5-Maven-pre-YARN
> 387G./Spark-Master-Maven-with-YARN
> 420G./Spark-Master-Maven-pre-YARN
> 520G./Spark-1.6-SBT
> 733G./Spark-1.5-SBT
> 812G./Spark-Master-SBT
> i have attached a full report w/all builds listed as well.
> each of these builds is keeping their build history for 90 days.
> keep in mind that for each new matrix build, we're looking at another 
> 200-500G per for the SBT/pre-YARN/with-YARN jobs.
> a straw man, back of napkin estimate for spark 1.7 is 2T of additional disk 
> usage.
> on the hardware config side, we can move from raid10 to raid 5 and get ~3T 
> additional storage.  if we ditch raid altogether and put in bigger disks, we 
> can get a total of 16-20T storage on master.  another option is to have a NFS 
> mount to a deep storage server.  all of these options will require 
> significant downtime.
> questions:
> * can we lower the number of days that we keep build information?
> * there are other options in jenkins that we can set as well:  max number of 
> builds to keep, max # days to keep artifacts, max # of builds to keep 
> w/artifacts
> * can we make the junit and unit test logs smaller (probably not)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12429) Update documentation to show how to use accumulators and broadcasts with Spark Streaming

2015-12-18 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-12429:


 Summary: Update documentation to show how to use accumulators and 
broadcasts with Spark Streaming
 Key: SPARK-12429
 URL: https://issues.apache.org/jira/browse/SPARK-12429
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, Streaming
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


Accumulators and broadcasts cannot be recovered automatically when a Spark 
Streaming driver restarts after a failure. We need to add an example to guide 
users.
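
For reference, the pattern the documentation will most likely show is a lazily
initialized singleton, so that the broadcast (or accumulator) is re-created on
first use after the driver restarts from a checkpoint rather than being read
from the checkpoint itself. A minimal sketch with illustrative names:

{code}
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Lazily instantiated singleton: re-created after a driver restart because it
// is built on first access instead of being captured in the checkpoint.
object WordBlacklist {
  @volatile private var instance: Broadcast[Seq[String]] = null

  def getInstance(sc: SparkContext): Broadcast[Seq[String]] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = sc.broadcast(Seq("a", "b", "c"))
        }
      }
    }
    instance
  }
}
{code}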



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2015-12-18 Thread Matt Pollock (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064664#comment-15064664
 ] 

Matt Pollock commented on SPARK-6817:
-

Will this only support UDFs that operate on a full DataFrame? A solution that 
operates on columns would perhaps be more useful, e.g. being able to use R 
package functions within filter and mutate.

> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12429) Update documentation to show how to use accumulators and broadcasts with Spark Streaming

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12429:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Update documentation to show how to use accumulators and broadcasts with 
> Spark Streaming
> 
>
> Key: SPARK-12429
> URL: https://issues.apache.org/jira/browse/SPARK-12429
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Streaming
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
>  Accumulators and Broadcasts with Spark Streaming cannot work perfectly when 
> restarting on driver failures. We need to add some example to guide the user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12429) Update documentation to show how to use accumulators and broadcasts with Spark Streaming

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12429:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Update documentation to show how to use accumulators and broadcasts with 
> Spark Streaming
> 
>
> Key: SPARK-12429
> URL: https://issues.apache.org/jira/browse/SPARK-12429
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
>  Accumulators and Broadcasts with Spark Streaming cannot work perfectly when 
> restarting on driver failures. We need to add some example to guide the user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12430) Temporary folders do not get deleted after Task completes causing problems with disk space.

2015-12-18 Thread Fede Bar (JIRA)
Fede Bar created SPARK-12430:


 Summary: Temporary folders do not get deleted after Task completes 
causing problems with disk space.
 Key: SPARK-12430
 URL: https://issues.apache.org/jira/browse/SPARK-12430
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Shuffle, Spark Submit
Affects Versions: 1.5.2, 1.5.1
 Environment: Ubuntu server
Reporter: Fede Bar
 Fix For: 1.4.1


We are experiencing an issue with automatic /tmp folder deletion after the 
framework completes. Completing an M/R job using Spark 1.5.2 (same behavior as 
Spark 1.5.1) over Mesos does not delete some temporary folders, causing free 
disk space on the server to run out.

Behavior of M/R job using Spark 1.4.1 over Mesos cluster:
- Launched using spark-submit on one cluster node.
- Following folders are created: */tmp/mesos/slaves/id#* , */tmp/spark-#/*  ,  
*/tmp/spark-#/blockmgr-#*
- When task is completed */tmp/spark-#/* gets deleted along with 
*/tmp/spark-#/blockmgr-#* sub-folder.

Behavior of M/R job using Spark 1.5.2 over Mesos cluster (same identical job):
- Launched using spark-submit on one cluster node.
- Following folders are created: */tmp/mesos/mesos/slaves/id** * , 
*/tmp/spark-***/ *  ,{color:red} /tmp/blockmgr-***{color}
- When task is completed */tmp/spark-***/ * gets deleted but NOT shuffle 
container folder {color:red} /tmp/blockmgr-***{color}

Unfortunately, {color:red} /tmp/blockmgr-***{color} can account for several GB 
depending on the job that ran. Over time this causes disk space to become full 
with consequences that we all know. 

Running a cleanup shell script would probably work, but it is difficult to 
tell folders in use by a running M/R job apart from stale ones. I did notice 
similar issues opened by other users and marked as "resolved", but none seems 
to exactly match the behavior above. 

I really hope someone has insights on how to fix it.
Thank you very much!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12409) JDBC AND operator push down

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12409:


Assignee: Apache Spark

> JDBC AND operator push down 
> 
>
> Key: SPARK-12409
> URL: https://issues.apache.org/jira/browse/SPARK-12409
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>
> For simple AND such as 
> select * from test where THEID = 1 AND NAME = 'fred', 
> The filters pushed down to JDBC layers are EqualTo(THEID,1), 
> EqualTo(Name,fred). These are handled OK by the current code. 
> For query such as 
> SELECT * FROM foobar WHERE THEID = 1 OR NAME = 'mary' AND THEID = 2" ,
> the filter is Or(EqualTo(THEID,1),And(EqualTo(NAME,mary),EqualTo(THEID,2)))
> So need to add And filter in JDBC layer.  
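
A hedged sketch of what compiling And (and Or) to a WHERE fragment in the JDBC
layer could look like; compileFilter and quoteValue are illustrative names, not
the actual JDBCRDD code:

{code}
import org.apache.spark.sql.sources.{And, EqualTo, Filter, Or}

// Illustrative only: compile a Filter tree to a SQL WHERE fragment,
// returning None for anything that cannot be pushed down.
def compileFilter(f: Filter): Option[String] = f match {
  case EqualTo(attr, value) => Some(s"$attr = ${quoteValue(value)}")
  case And(left, right) =>
    for (l <- compileFilter(left); r <- compileFilter(right)) yield s"($l AND $r)"
  case Or(left, right) =>
    for (l <- compileFilter(left); r <- compileFilter(right)) yield s"($l OR $r)"
  case _ => None
}

// Hypothetical helper: quote strings, leave everything else as-is.
def quoteValue(value: Any): String = value match {
  case s: String => s"'$s'"
  case other     => other.toString
}
{code}

Returning None for a whole unsupported subtree (rather than dropping only part
of it) keeps the pushed-down predicate no stronger than the original, which is
what matters for nested cases like the Or/And example above.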



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12409) JDBC AND operator push down

2015-12-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064742#comment-15064742
 ] 

Apache Spark commented on SPARK-12409:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/10386

> JDBC AND operator push down 
> 
>
> Key: SPARK-12409
> URL: https://issues.apache.org/jira/browse/SPARK-12409
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Huaxin Gao
>Priority: Minor
>
> For simple AND such as 
> select * from test where THEID = 1 AND NAME = 'fred', 
> The filters pushed down to JDBC layers are EqualTo(THEID,1), 
> EqualTo(Name,fred). These are handled OK by the current code. 
> For query such as 
> SELECT * FROM foobar WHERE THEID = 1 OR NAME = 'mary' AND THEID = 2" ,
> the filter is Or(EqualTo(THEID,1),And(EqualTo(NAME,mary),EqualTo(THEID,2)))
> So need to add And filter in JDBC layer.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12425) DStream union optimisation

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12425:


Assignee: (was: Apache Spark)

> DStream union optimisation
> --
>
> Key: SPARK-12425
> URL: https://issues.apache.org/jira/browse/SPARK-12425
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Guillaume Poulin
>Priority: Minor
>
> Currently, `DStream.union` always uses `UnionRDD` on the underlying `RDD`s. 
> However, using `PartitionerAwareUnionRDD` when possible would yield better 
> performance.
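
For context, SparkContext.union already applies this kind of dispatch for plain
RDDs; a sketch of the same check (note that PartitionerAwareUnionRDD is
private[spark], so the real change has to live inside Spark's streaming code):

{code}
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.rdd.{PartitionerAwareUnionRDD, RDD, UnionRDD}

// Use the partitioner-preserving union only when every input shares the same
// partitioner; otherwise fall back to the plain UnionRDD.
def smartUnion[T: ClassTag](sc: SparkContext, rdds: Seq[RDD[T]]): RDD[T] = {
  val partitioners = rdds.flatMap(_.partitioner).toSet
  if (rdds.nonEmpty && rdds.forall(_.partitioner.isDefined) && partitioners.size == 1) {
    new PartitionerAwareUnionRDD(sc, rdds)
  } else {
    new UnionRDD(sc, rdds)
  }
}
{code}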



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11327) spark-dispatcher doesn't pass along some spark properties

2015-12-18 Thread Jo Voordeckers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064744#comment-15064744
 ] 

Jo Voordeckers commented on SPARK-11327:


This PR is now superseded by this one against master:

https://github.com/apache/spark/pull/10370

> spark-dispatcher doesn't pass along some spark properties
> -
>
> Key: SPARK-11327
> URL: https://issues.apache.org/jira/browse/SPARK-11327
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Reporter: Alan Braithwaite
>
> I haven't figured out exactly what's going on yet, but there's something in 
> the spark-dispatcher which is failing to pass along properties to the 
> spark-driver when using spark-submit in a clustered mesos docker environment.
> Most importantly, it's not passing along spark.mesos.executor.docker.image...
> cli:
> {code}
> docker run -t -i --rm --net=host 
> --entrypoint=/usr/local/spark/bin/spark-submit 
> docker.example.com/spark:2015.10.2 --conf spark.driver.memory=8G --conf 
> spark.mesos.executor.docker.image=docker.example.com/spark:2015.10.2 --master 
> mesos://spark-dispatcher.example.com:31262 --deploy-mode cluster 
> --properties-file /usr/local/spark/conf/spark-defaults.conf --class 
> com.example.spark.streaming.MyApp 
> http://jarserver.example.com:8000/sparkapp.jar zk1.example.com:2181 
> spark-testing my-stream 40
> {code}
> submit output:
> {code}
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request to launch 
> an application in mesos://compute1.example.com:31262.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending POST request to server 
> at http://compute1.example.com:31262/v1/submissions/create:
> {
>   "action" : "CreateSubmissionRequest",
>   "appArgs" : [ "zk1.example.com:2181", "spark-testing", "requests", "40" ],
>   "appResource" : "http://jarserver.example.com:8000/sparkapp.jar";,
>   "clientSparkVersion" : "1.5.0",
>   "environmentVariables" : {
> "SPARK_SCALA_VERSION" : "2.10",
> "SPARK_CONF_DIR" : "/usr/local/spark/conf",
> "SPARK_HOME" : "/usr/local/spark",
> "SPARK_ENV_LOADED" : "1"
>   },
>   "mainClass" : "com.example.spark.streaming.MyApp",
>   "sparkProperties" : {
> "spark.serializer" : "org.apache.spark.serializer.KryoSerializer",
> "spark.executorEnv.MESOS_NATIVE_JAVA_LIBRARY" : 
> "/usr/local/lib/libmesos.so",
> "spark.history.fs.logDirectory" : "hdfs://hdfsha.example.com/spark/logs",
> "spark.eventLog.enabled" : "true",
> "spark.driver.maxResultSize" : "0",
> "spark.mesos.deploy.recoveryMode" : "ZOOKEEPER",
> "spark.mesos.deploy.zookeeper.url" : 
> "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181,zk4.example.com:2181,zk5.example.com:2181",
> "spark.jars" : "http://jarserver.example.com:8000/sparkapp.jar";,
> "spark.driver.supervise" : "false",
> "spark.app.name" : "com.example.spark.streaming.MyApp",
> "spark.driver.memory" : "8G",
> "spark.logConf" : "true",
> "spark.deploy.zookeeper.dir" : "/spark_mesos_dispatcher",
> "spark.mesos.executor.docker.image" : 
> "docker.example.com/spark-prod:2015.10.2",
> "spark.submit.deployMode" : "cluster",
> "spark.master" : "mesos://compute1.example.com:31262",
> "spark.executor.memory" : "8G",
> "spark.eventLog.dir" : "hdfs://hdfsha.example.com/spark/logs",
> "spark.mesos.docker.executor.network" : "HOST",
> "spark.mesos.executor.home" : "/usr/local/spark"
>   }
> }
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server:
> {
>   "action" : "CreateSubmissionResponse",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submission successfully created 
> as driver-20151026220353-0011. Polling submission state...
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request for the 
> status of submission driver-20151026220353-0011 in 
> mesos://compute1.example.com:31262.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending GET request to server 
> at 
> http://compute1.example.com:31262/v1/submissions/status/driver-20151026220353-0011.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server:
> {
>   "action" : "SubmissionStatusResponse",
>   "driverState" : "QUEUED",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> 15/10/26 22:03:53 INFO RestSubmissionClient: State of driver 
> driver-20151026220353-0011 is now QUEUED.
> 15/10/26 22:03:53 INFO RestSubmissionClient: Server responded with 
> CreateSubmissionResponse:
> {
>   "action" : "CreateSubmissionResponse",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> {code}
> 

[jira] [Assigned] (SPARK-12409) JDBC AND operator push down

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12409:


Assignee: (was: Apache Spark)

> JDBC AND operator push down 
> 
>
> Key: SPARK-12409
> URL: https://issues.apache.org/jira/browse/SPARK-12409
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Huaxin Gao
>Priority: Minor
>
> For simple AND such as 
> select * from test where THEID = 1 AND NAME = 'fred', 
> The filters pushed down to JDBC layers are EqualTo(THEID,1), 
> EqualTo(Name,fred). These are handled OK by the current code. 
> For query such as 
> SELECT * FROM foobar WHERE THEID = 1 OR NAME = 'mary' AND THEID = 2" ,
> the filter is Or(EqualTo(THEID,1),And(EqualTo(NAME,mary),EqualTo(THEID,2)))
> So need to add And filter in JDBC layer.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12425) DStream union optimisation

2015-12-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12425:


Assignee: Apache Spark

> DStream union optimisation
> --
>
> Key: SPARK-12425
> URL: https://issues.apache.org/jira/browse/SPARK-12425
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Guillaume Poulin
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, `DStream.union` always uses `UnionRDD` on the underlying `RDD`s. 
> However, using `PartitionerAwareUnionRDD` when possible would yield better 
> performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12365) Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called

2015-12-18 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12365:
--
Target Version/s: 2.0.0  (was: 1.6.1, 2.0.0)

> Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called
> 
>
> Key: SPARK-12365
> URL: https://issues.apache.org/jira/browse/SPARK-12365
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Minor
> Fix For: 2.0.0
>
>
> SPARK-9886 fixed call to Runtime.getRuntime.addShutdownHook() in 
> ExternalBlockStore.scala
> This issue intends to address remaining usage of 
> Runtime.getRuntime.addShutdownHook()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12365) Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called

2015-12-18 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12365:
--
Fix Version/s: (was: 1.6.1)

> Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called
> 
>
> Key: SPARK-12365
> URL: https://issues.apache.org/jira/browse/SPARK-12365
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Minor
> Fix For: 2.0.0
>
>
> SPARK-9886 fixed call to Runtime.getRuntime.addShutdownHook() in 
> ExternalBlockStore.scala
> This issue intends to address remaining usage of 
> Runtime.getRuntime.addShutdownHook()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-12-18 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203280#comment-14203280
 ] 

Nicholas Chammas edited comment on SPARK-3821 at 12/18/15 9:08 PM:
---

After much dilly-dallying, I am happy to present:
* A brief proposal / design doc ([fixed JIRA attachment | 
https://issues.apache.org/jira/secure/attachment/12680371/packer-proposal.html],
 [md file on GitHub | 
https://github.com/nchammas/spark-ec2/blob/packer/image-build/proposal.md])
* [Initial implementation | 
https://github.com/nchammas/spark-ec2/tree/packer/image-build] and [README | 
https://github.com/nchammas/spark-ec2/blob/packer/image-build/README.md]
* New AMIs generated by this implementation: [Base AMIs | 
https://github.com/nchammas/spark-ec2/tree/packer/ami-list/base], [Spark 1.1.0 
Pre-Installed | 
https://github.com/nchammas/spark-ec2/tree/packer/ami-list/1.1.0]

To try out the new AMIs with {{spark-ec2}}, you'll need to update [these | 
https://github.com/apache/spark/blob/7e9d975676d56ace0e84c2200137e4cd4eba074a/ec2/spark_ec2.py#L47]
 [two | 
https://github.com/apache/spark/blob/7e9d975676d56ace0e84c2200137e4cd4eba074a/ec2/spark_ec2.py#L593]
 lines (well, really, just the first one) to point to [my {{spark-ec2}} repo on 
the {{packer}} branch | 
https://github.com/nchammas/spark-ec2/tree/packer/image-build].

Your candid feedback and/or improvements are most welcome!


was (Author: nchammas):
After much dilly-dallying, I am happy to present:
* A brief proposal / design doc ([fixed JIRA attachment | 
https://issues.apache.org/jira/secure/attachment/12680371/packer-proposal.html],
 [md file on GitHub | 
https://github.com/nchammas/spark-ec2/blob/packer/packer/proposal.md])
* [Initial implementation | 
https://github.com/nchammas/spark-ec2/tree/packer/packer] and [README | 
https://github.com/nchammas/spark-ec2/blob/packer/packer/README.md]
* New AMIs generated by this implementation: [Base AMIs | 
https://github.com/nchammas/spark-ec2/tree/packer/ami-list/base], [Spark 1.1.0 
Pre-Installed | 
https://github.com/nchammas/spark-ec2/tree/packer/ami-list/1.1.0]

To try out the new AMIs with {{spark-ec2}}, you'll need to update [these | 
https://github.com/apache/spark/blob/7e9d975676d56ace0e84c2200137e4cd4eba074a/ec2/spark_ec2.py#L47]
 [two | 
https://github.com/apache/spark/blob/7e9d975676d56ace0e84c2200137e4cd4eba074a/ec2/spark_ec2.py#L593]
 lines (well, really, just the first one) to point to [my {{spark-ec2}} repo on 
the {{packer}} branch | 
https://github.com/nchammas/spark-ec2/tree/packer/packer].

Your candid feedback and/or improvements are most welcome!

> Develop an automated way of creating Spark images (AMI, Docker, and others)
> ---
>
> Key: SPARK-3821
> URL: https://issues.apache.org/jira/browse/SPARK-3821
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
> Attachments: packer-proposal.html
>
>
> Right now the creation of Spark AMIs or Docker containers is done manually. 
> With tools like [Packer|http://www.packer.io/], we should be able to automate 
> this work, and do so in such a way that multiple types of machine images can 
> be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12431) add local checkpointing to GraphX

2015-12-18 Thread Edward Seidl (JIRA)
Edward Seidl created SPARK-12431:


 Summary: add local checkpointing to GraphX
 Key: SPARK-12431
 URL: https://issues.apache.org/jira/browse/SPARK-12431
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.5.2
Reporter: Edward Seidl


local checkpointing was added to RDD to speed up iterative spark jobs, but this 
capability hasn't been added to GraphX.  Adding localCheckpoint to GraphImpl, 
EdgeRDDImpl, and VertexRDDImpl greatly improved the speed of a k-core algorithm 
I'm using (at the cost of fault tolerance, of course).
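
Until graph-level support exists, a possible stopgap is to locally checkpoint
the two member RDDs directly; a hedged sketch (illustrative, and it trades
fault tolerance for speed exactly as described above):

{code}
import org.apache.spark.graphx.Graph

// Mark both member RDDs for local (non-reliable) checkpointing: lineage is
// truncated and data is kept on executor-local storage only.
def localCheckpointGraph[VD, ED](graph: Graph[VD, ED]): Graph[VD, ED] = {
  graph.vertices.localCheckpoint()
  graph.edges.localCheckpoint()
  graph
}
{code}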



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12404) Ensure objects passed to StaticInvoke is Serializable

2015-12-18 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12404.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 10357
[https://github.com/apache/spark/pull/10357]

> Ensure objects passed to StaticInvoke is Serializable
> -
>
> Key: SPARK-12404
> URL: https://issues.apache.org/jira/browse/SPARK-12404
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Critical
> Fix For: 1.6.0
>
>
> Now `StaticInvoke` receives Any as an object, and while `StaticInvoke` itself 
> can be serialized, sometimes the object passed in is not serializable.
> For example, the following code raises an exception because 
> RowEncoder#extractorsFor, invoked indirectly, creates a `StaticInvoke`.
> {code}
> case class TimestampContainer(timestamp: java.sql.Timestamp)
> val rdd = sc.parallelize(1 to 2).map(_ => 
> TimestampContainer(new java.sql.Timestamp(System.currentTimeMillis)))
> val df = rdd.toDF
> val ds = df.as[TimestampContainer]
> val rdd2 = ds.rdd  // invokes extractorsFor indirectly
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


