[jira] [Created] (SPARK-9814) EqualNotNull not passing to data sources
Hyukjin Kwon created SPARK-9814: --- Summary: EqualNotNull not passing to data sources Key: SPARK-9814 URL: https://issues.apache.org/jira/browse/SPARK-9814 Project: Spark Issue Type: Improvement Components: Input/Output Environment: Centos 6.6 Reporter: Hyukjin Kwon Priority: Minor When a data source (such as Parquet) filters data while reading from HDFS (rather than in memory), the physical planning phase passes the filter objects in `org.apache.spark.sql.sources`, which are appropriately built and picked up by `selectFilters()` in `org.apache.spark.sql.sources.DataSourceStrategy`. However, it does not pass the `EqualNullSafe` filter in `org.apache.spark.sql.catalyst.expressions`, even though passing it appears feasible for data sources such as Parquet and JSON. In more detail, `EqualNullSafe` is not passed to `buildScan` (below) in `PrunedFilteredScan` and `PrunedScan`, ``` def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] ``` even though the binary compatibility issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that `CatalystScan` can take all the raw expressions, giving access to the query planner. However, it is experimental, requires a different interface, and is unstable for reasons such as binary compatibility. In general, the problem below can happen. 1. ``` SELECT * FROM table WHERE field = 1; ``` 2. ``` SELECT * FROM table WHERE field <=> 1; ``` The second query can be hugely slower even though it is functionally almost identical, because data that is not filtered at the source RDD can cause large network traffic, among other costs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
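For context, this is roughly what the interface in question looks like from a data source author's side. The relation below is a hypothetical sketch (class name, schema, and println logging are illustrative only): it shows that only `org.apache.spark.sql.sources` filters such as `EqualTo` ever reach `buildScan()`, while no source-level counterpart of the null-safe comparison is forwarded.
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

// Hypothetical relation used only to illustrate which filters reach buildScan().
class DummyRelation(override val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = StructType(StructField("field", IntegerType) :: Nil)

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    filters.foreach {
      case EqualTo(attr, value) =>
        // `WHERE field = 1` shows up here and can be pushed down to the source.
        println(s"can push down: $attr = $value")
      case other =>
        // `WHERE field <=> 1` never shows up at all: selectFilters() passes no
        // source-level equivalent of catalyst's EqualNullSafe.
        println(s"other filter: $other")
    }
    sqlContext.sparkContext.emptyRDD[Row]
  }
}
{code}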
[jira] [Resolved] (SPARK-9340) CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-9340. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8070 [https://github.com/apache/spark/pull/8070] CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly --- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Assignee: Cheng Lian Fix For: 1.5.0 Attachments: ParquetTypesConverterTest.scala SPARK-6776 and SPARK-6777 followed {{parquet-avro}} to implement backwards-compatibility rules defined in {{parquet-format}} spec. However, both Spark SQL and {{parquet-avro}} neglected the following statement in {{parquet-format}}: {quote} This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a {{LIST}}- or {{MAP}}-annotated group nor annotated by {{LIST}} or {{MAP}} should be interpreted as a required list of required elements where the element type is the type of the field. {quote} One of the consequences is that, Parquet files generated by {{parquet-protobuf}} containing unannotated repeated fields are not correctly converted to Catalyst arrays. For example, the following Parquet schema {noformat} message root { repeated int32 f1 } {noformat} should be converted to {noformat} StructType(StructField(f1, ArrayType(IntegerType, containsNull = false), nullable = false) :: Nil) {noformat} But now it triggers an {{AnalysisException}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
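As a quick sanity check of the fix, a file written with the schema above should now load with the non-nullable array type quoted in the description. Something along these lines would verify it (the path is a placeholder and `sqlContext` is assumed to be a spark-shell context):
{code}
import org.apache.spark.sql.types._

// Catalyst schema expected for `message root { repeated int32 f1 }`:
// a required list of required int32 elements.
val expected = StructType(
  StructField("f1", ArrayType(IntegerType, containsNull = false), nullable = false) :: Nil)

// Placeholder path to a parquet-protobuf generated file with that schema.
val df = sqlContext.read.parquet("/path/to/protobuf-generated.parquet")
assert(df.schema == expected)
{code}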
[jira] [Updated] (SPARK-9814) EqualNotNull not passing to data sources
[ https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-9814: Description: When a data source (such as Parquet) filters data while reading from HDFS (rather than in memory), the physical planning phase passes the filter objects in org.apache.spark.sql.sources, which are appropriately built and picked up by selectFilters() in org.apache.spark.sql.sources.DataSourceStrategy. However, it does not pass the EqualNullSafe filter in org.apache.spark.sql.catalyst.expressions, even though passing it appears feasible for data sources such as Parquet and JSON. In more detail, EqualNullSafe is not passed to buildScan in PrunedFilteredScan and PrunedScan, even though the binary compatibility issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that CatalystScan can take all the raw expressions, giving access to the query planner. However, it is experimental, requires a different interface, and is unstable for reasons such as binary compatibility. In general, the problem below can happen. 1. SELECT * FROM table WHERE field = 1; 2. SELECT * FROM table WHERE field <=> 1; The second query can be hugely slower even though it is functionally almost identical, because data that is not filtered at the source RDD can cause large network traffic, among other costs. was: When a data source (such as Parquet) filters data while reading from HDFS (rather than in memory), the physical planning phase passes the filter objects in `org.apache.spark.sql.sources`, which are appropriately built and picked up by `selectFilters()` in `org.apache.spark.sql.sources.DataSourceStrategy`. However, it does not pass the `EqualNullSafe` filter in `org.apache.spark.sql.catalyst.expressions`, even though passing it appears feasible for data sources such as Parquet and JSON. In more detail, `EqualNullSafe` is not passed to `buildScan` (below) in `PrunedFilteredScan` and `PrunedScan`, ``` def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] ``` even though the binary compatibility issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that `CatalystScan` can take all the raw expressions, giving access to the query planner. However, it is experimental, requires a different interface, and is unstable for reasons such as binary compatibility. In general, the problem below can happen. 1. ``` SELECT * FROM table WHERE field = 1; ``` 2. ``` SELECT * FROM table WHERE field <=> 1; ``` The second query can be hugely slower even though it is functionally almost identical, because data that is not filtered at the source RDD can cause large network traffic, among other costs. EqualNotNull not passing to data sources Key: SPARK-9814 URL: https://issues.apache.org/jira/browse/SPARK-9814 Project: Spark Issue Type: Improvement Components: Input/Output Environment: Centos 6.6 Reporter: Hyukjin Kwon Priority: Minor When a data source (such as Parquet) filters data while reading from HDFS (rather than in memory), the physical planning phase passes the filter objects in org.apache.spark.sql.sources, which are appropriately built and picked up by selectFilters() in org.apache.spark.sql.sources.DataSourceStrategy. However, it does not pass the EqualNullSafe filter in org.apache.spark.sql.catalyst.expressions, even though passing it appears feasible for data sources such as Parquet and JSON. 
In more detail, EqualNullSafe is not passed to buildScan in PrunedFilteredScan and PrunedScan, even though the binary compatibility issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that CatalystScan can take all the raw expressions, giving access to the query planner. However, it is experimental, requires a different interface, and is unstable for reasons such as binary compatibility. In general, the problem below can happen. 1. SELECT * FROM table WHERE field = 1; 2. SELECT * FROM table WHERE field <=> 1; The second query can be hugely slower even though it is functionally almost identical, because data that is not filtered at the source RDD can cause large network traffic, among other costs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9813) Incorrect UNION ALL behavior
[ https://issues.apache.org/jira/browse/SPARK-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681246#comment-14681246 ] Herman van Hovell commented on SPARK-9813: -- So I am not too sure if we want to maintain that level of Hive compatibility. It seems a bit too strict. Any kind of union should be fine as long as the data types match (IMHO). Is there a realistic use case for this? Incorrect UNION ALL behavior Key: SPARK-9813 URL: https://issues.apache.org/jira/browse/SPARK-9813 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.4.1 Environment: Ubuntu on AWS Reporter: Simeon Simeonov Labels: sql, union According to the [Hive Language Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union] for UNION ALL: {quote} The number and names of columns returned by each select_statement have to be the same. Otherwise, a schema error is thrown. {quote} Spark SQL silently swallows an error when the tables being joined with UNION ALL have the same number of columns but different names. Reproducible example:
{code}
// This test is meant to run in spark-shell
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"
def tempTable(name: String, json: String) = {
  val path = dataPath(name)
  new PrintWriter(path) { write(json); close }
  ctx.read.json("file://" + path).registerTempTable(name)
}

// Note category vs. cat names of first column
tempTable("test_one", """{"category" : "A", "num" : 5}""")
tempTable("test_another", """{"cat" : "A", "num" : 5}""")

// +--------+---+
// |category|num|
// +--------+---+
// |       A|  5|
// |       A|  5|
// +--------+---+
//
// Instead, an error should have been generated due to incompatible schema
ctx.sql("select * from test_one union all select * from test_another").show

// Cleanup
new File(dataPath("test_one")).delete()
new File(dataPath("test_another")).delete()
{code}
When the number of columns is different, Spark can even mix in datatypes. 
Reproducible example (requires a new spark-shell session):
{code}
// This test is meant to run in spark-shell
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"
def tempTable(name: String, json: String) = {
  val path = dataPath(name)
  new PrintWriter(path) { write(json); close }
  ctx.read.json("file://" + path).registerTempTable(name)
}

// Note test_another is missing category column
tempTable("test_one", """{"category" : "A", "num" : 5}""")
tempTable("test_another", """{"num" : 5}""")

// +--------+
// |category|
// +--------+
// |       A|
// |       5|
// +--------+
//
// Instead, an error should have been generated due to incompatible schema
ctx.sql("select * from test_one union all select * from test_another").show

// Cleanup
new File(dataPath("test_one")).delete()
new File(dataPath("test_another")).delete()
{code}
At other times, when the schemas are complex, Spark SQL produces a misleading error about an unresolved Union operator:
{code}
scala> ctx.sql("""select * from view_clicks
     | union all
     | select * from view_clicks_aug
     | """)
15/08/11 02:40:25 INFO ParseDriver: Parsing command: select * from view_clicks union all select * from view_clicks_aug
15/08/11 02:40:25 INFO ParseDriver: Parse Completed
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO audit: ugi=ubuntu ip=unknown-ip-addr cmd=get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO audit: ugi=ubuntu ip=unknown-ip-addr cmd=get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks_aug
15/08/11 02:40:25 INFO audit: ugi=ubuntu ip=unknown-ip-addr cmd=get_table : db=default tbl=view_clicks_aug
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks_aug
15/08/11 02:40:25 INFO audit: ugi=ubuntu ip=unknown-ip-addr cmd=get_table : db=default tbl=view_clicks_aug
org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42)
  at
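However the failure surfaces, a caller-side guard makes the mismatch explicit today. The helper below is a hypothetical sketch of the check being discussed, not a proposed fix for the analyzer:
{code}
import org.apache.spark.sql.DataFrame

// Hypothetical guard: fail fast when the two sides of UNION ALL disagree on
// column names or types, instead of relying on silent positional resolution.
def strictUnionAll(left: DataFrame, right: DataFrame): DataFrame = {
  require(left.schema == right.schema,
    s"UNION ALL schema mismatch: ${left.schema.simpleString} vs ${right.schema.simpleString}")
  left.unionAll(right)
}
{code}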
[jira] [Assigned] (SPARK-9815) Rename PlatformDependent.UNSAFE -> Platform
[ https://issues.apache.org/jira/browse/SPARK-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9815: --- Assignee: Apache Spark (was: Reynold Xin) Rename PlatformDependent.UNSAFE -> Platform --- Key: SPARK-9815 URL: https://issues.apache.org/jira/browse/SPARK-9815 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Apache Spark PlatformDependent.UNSAFE is way too verbose. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9815) Rename PlatformDependent.UNSAFE -> Platform
Reynold Xin created SPARK-9815: -- Summary: Rename PlatformDependent.UNSAFE -> Platform Key: SPARK-9815 URL: https://issues.apache.org/jira/browse/SPARK-9815 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin PlatformDependent.UNSAFE is way too verbose. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9815) Rename PlatformDependent.UNSAFE -> Platform
[ https://issues.apache.org/jira/browse/SPARK-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681250#comment-14681250 ] Apache Spark commented on SPARK-9815: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/8094 Rename PlatformDependent.UNSAFE -> Platform --- Key: SPARK-9815 URL: https://issues.apache.org/jira/browse/SPARK-9815 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin PlatformDependent.UNSAFE is way too verbose. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9815) Rename PlatformDependent.UNSAFE -> Platform
[ https://issues.apache.org/jira/browse/SPARK-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9815: --- Assignee: Reynold Xin (was: Apache Spark) Rename PlatformDependent.UNSAFE -> Platform --- Key: SPARK-9815 URL: https://issues.apache.org/jira/browse/SPARK-9815 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin PlatformDependent.UNSAFE is way too verbose. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9814) EqualNotNull not passing to data sources
[ https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-9814: Description: When a data source (such as Parquet) filters data while reading from HDFS (rather than in memory), the physical planning phase passes the filter objects in {{org.apache.spark.sql.sources}}, which are appropriately built and picked up by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}. However, it does not pass the {{EqualNullSafe}} filter in {{org.apache.spark.sql.catalyst.expressions}}, even though passing it appears feasible for data sources such as Parquet and JSON. In more detail, {{EqualNullSafe}} is not passed to {{buildScan()}} (below) in {{PrunedFilteredScan}} and {{PrunedScan}}, {code} def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] {code} even though the binary compatibility issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that {{CatalystScan}} can take all the raw expressions, giving access to the query planner. However, it is experimental, requires a different interface, and is unstable for reasons such as binary compatibility. In general, the problem below can happen. 1. {code:sql} SELECT * FROM table WHERE field = 1; {code} 2. {code:sql} SELECT * FROM table WHERE field <=> 1; {code} The second query can be hugely slower even though it is functionally almost identical, because data that is not filtered at the source RDD can cause large network traffic, among other costs. was: When a data source (such as Parquet) filters data while reading from HDFS (rather than in memory), the physical planning phase passes the filter objects in org.apache.spark.sql.sources, which are appropriately built and picked up by selectFilters() in org.apache.spark.sql.sources.DataSourceStrategy. However, it does not pass the EqualNullSafe filter in org.apache.spark.sql.catalyst.expressions, even though passing it appears feasible for data sources such as Parquet and JSON. In more detail, EqualNullSafe is not passed to buildScan in PrunedFilteredScan and PrunedScan, even though the binary compatibility issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that CatalystScan can take all the raw expressions, giving access to the query planner. However, it is experimental, requires a different interface, and is unstable for reasons such as binary compatibility. In general, the problem below can happen. 1. SELECT * FROM table WHERE field = 1; 2. SELECT * FROM table WHERE field <=> 1; The second query can be hugely slower even though it is functionally almost identical, because data that is not filtered at the source RDD can cause large network traffic, among other costs. EqualNotNull not passing to data sources Key: SPARK-9814 URL: https://issues.apache.org/jira/browse/SPARK-9814 Project: Spark Issue Type: Improvement Components: Input/Output Environment: Centos 6.6 Reporter: Hyukjin Kwon Priority: Minor When a data source (such as Parquet) filters data while reading from HDFS (rather than in memory), the physical planning phase passes the filter objects in {{org.apache.spark.sql.sources}}, which are appropriately built and picked up by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}. However, it does not pass the {{EqualNullSafe}} filter in {{org.apache.spark.sql.catalyst.expressions}}, even though passing it appears feasible for data sources such as Parquet and JSON. 
In more detail, {{EqualNullSafe}} is not passed to {{buildScan()}} (below) in {{PrunedFilteredScan}} and {{PrunedScan}}, {code} def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] {code} even though the binary compatibility issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that {{CatalystScan}} can take all the raw expressions, giving access to the query planner. However, it is experimental, requires a different interface, and is unstable for reasons such as binary compatibility. In general, the problem below can happen. 1. {code:sql} SELECT * FROM table WHERE field = 1; {code} 2. {code:sql} SELECT * FROM table WHERE field <=> 1; {code} The second query can be hugely slower even though it is functionally almost identical, because data that is not filtered at the source RDD can cause large network traffic, among other costs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9790) [YARN] Expose in WebUI if NodeManager is the reason why executors were killed.
[ https://issues.apache.org/jira/browse/SPARK-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681249#comment-14681249 ] Apache Spark commented on SPARK-9790: - User 'markgrover' has created a pull request for this issue: https://github.com/apache/spark/pull/8093 [YARN] Expose in WebUI if NodeManager is the reason why executors were killed. -- Key: SPARK-9790 URL: https://issues.apache.org/jira/browse/SPARK-9790 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.1 Reporter: Mark Grover When an executor is killed by yarn because it exceeds the memory overhead, the only thing spark knows is that the executor is lost. The user has to go track search through the NM logs to figure out that its been killed by yarn. It would be much nicer if the spark-driver could be notified why the executor was killed. Ideally it could both log an explanatory message, and update the UI (and the eventLog) so that it was clear why the executor was lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
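For readers hitting the scenario described here, the usual stop-gap while this UI improvement is pending is to give YARN more headroom per executor. A minimal sketch follows; the 1024 MB value is purely illustrative, not a recommendation:
{code}
import org.apache.spark.SparkConf

// Illustrative only: raise the per-executor memory overhead (in MB) that YARN
// enforces, so executors are less likely to be killed for exceeding it.
val conf = new SparkConf()
  .set("spark.yarn.executor.memoryOverhead", "1024")
{code}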
[jira] [Assigned] (SPARK-9790) [YARN] Expose in WebUI if NodeManager is the reason why executors were killed.
[ https://issues.apache.org/jira/browse/SPARK-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9790: --- Assignee: (was: Apache Spark) [YARN] Expose in WebUI if NodeManager is the reason why executors were killed. -- Key: SPARK-9790 URL: https://issues.apache.org/jira/browse/SPARK-9790 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.1 Reporter: Mark Grover When an executor is killed by yarn because it exceeds the memory overhead, the only thing spark knows is that the executor is lost. The user has to go track search through the NM logs to figure out that its been killed by yarn. It would be much nicer if the spark-driver could be notified why the executor was killed. Ideally it could both log an explanatory message, and update the UI (and the eventLog) so that it was clear why the executor was lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9790) [YARN] Expose in WebUI if NodeManager is the reason why executors were killed.
[ https://issues.apache.org/jira/browse/SPARK-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9790: --- Assignee: Apache Spark [YARN] Expose in WebUI if NodeManager is the reason why executors were killed. -- Key: SPARK-9790 URL: https://issues.apache.org/jira/browse/SPARK-9790 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.1 Reporter: Mark Grover Assignee: Apache Spark When an executor is killed by yarn because it exceeds the memory overhead, the only thing spark knows is that the executor is lost. The user has to go track search through the NM logs to figure out that its been killed by yarn. It would be much nicer if the spark-driver could be notified why the executor was killed. Ideally it could both log an explanatory message, and update the UI (and the eventLog) so that it was clear why the executor was lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9363) SortMergeJoin operator should support UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-9363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-9363. Resolution: Fixed Fix Version/s: 1.5.0 SortMergeJoin operator should support UnsafeRow --- Key: SPARK-9363 URL: https://issues.apache.org/jira/browse/SPARK-9363 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Fix For: 1.5.0 The SortMergeJoin operator should implement the supportsUnsafeRow and outputsUnsafeRow settings when appropriate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9729) Sort Merge Join for Left and Right Outer Join
[ https://issues.apache.org/jira/browse/SPARK-9729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-9729. Resolution: Fixed Fix Version/s: 1.5.0 Sort Merge Join for Left and Right Outer Join - Key: SPARK-9729 URL: https://issues.apache.org/jira/browse/SPARK-9729 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9790) [YARN] Expose in WebUI if NodeManager is the reason why executors were killed.
[ https://issues.apache.org/jira/browse/SPARK-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Grover updated SPARK-9790: --- Attachment: error_showing_in_UI.png Attaching an image of what the error message in the UI would now look like. [YARN] Expose in WebUI if NodeManager is the reason why executors were killed. -- Key: SPARK-9790 URL: https://issues.apache.org/jira/browse/SPARK-9790 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.1 Reporter: Mark Grover Attachments: error_showing_in_UI.png When an executor is killed by yarn because it exceeds the memory overhead, the only thing spark knows is that the executor is lost. The user has to go track search through the NM logs to figure out that its been killed by yarn. It would be much nicer if the spark-driver could be notified why the executor was killed. Ideally it could both log an explanatory message, and update the UI (and the eventLog) so that it was clear why the executor was lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9790) [YARN] Expose in WebUI if NodeManager is the reason why executors were killed.
[ https://issues.apache.org/jira/browse/SPARK-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681263#comment-14681263 ] Mark Grover edited comment on SPARK-9790 at 8/11/15 5:17 AM: - Attaching an [image|https://issues.apache.org/jira/secure/attachment/12749771/error_showing_in_UI.png] of what the error message in the UI would now look like. was (Author: mgrover): Attaching an image of what the error message in the UI would now look like. [YARN] Expose in WebUI if NodeManager is the reason why executors were killed. -- Key: SPARK-9790 URL: https://issues.apache.org/jira/browse/SPARK-9790 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.1 Reporter: Mark Grover Attachments: error_showing_in_UI.png When an executor is killed by yarn because it exceeds the memory overhead, the only thing spark knows is that the executor is lost. The user has to go track search through the NM logs to figure out that its been killed by yarn. It would be much nicer if the spark-driver could be notified why the executor was killed. Ideally it could both log an explanatory message, and update the UI (and the eventLog) so that it was clear why the executor was lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9814) EqualNotNull not passing to data sources
[ https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-9814: Environment: (was: Centos 6.6) EqualNotNull not passing to data sources Key: SPARK-9814 URL: https://issues.apache.org/jira/browse/SPARK-9814 Project: Spark Issue Type: Improvement Components: Input/Output Reporter: Hyukjin Kwon Priority: Minor When a data source (such as Parquet) filters data while reading from HDFS (rather than in memory), the physical planning phase passes the filter objects in {{org.apache.spark.sql.sources}}, which are appropriately built and picked up by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}. However, it does not pass the {{EqualNullSafe}} filter in {{org.apache.spark.sql.catalyst.expressions}}, even though passing it appears feasible for data sources such as Parquet and JSON. In more detail, {{EqualNullSafe}} is not passed to {{buildScan()}} (below) in {{PrunedFilteredScan}} and {{PrunedScan}}, {code} def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] {code} even though the binary compatibility issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that {{CatalystScan}} can take all the raw expressions, giving access to the query planner. However, it is experimental, requires a different interface, and is unstable for reasons such as binary compatibility. In general, the problem below can happen. 1. {code:sql} SELECT * FROM table WHERE field = 1; {code} 2. {code:sql} SELECT * FROM table WHERE field <=> 1; {code} The second query can be hugely slower even though it is functionally almost identical, because data that is not filtered at the source RDD can cause large network traffic, among other costs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9814) EqualNotNull not passing to data sources
[ https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-9814: Component/s: (was: Input/Output) SQL EqualNotNull not passing to data sources Key: SPARK-9814 URL: https://issues.apache.org/jira/browse/SPARK-9814 Project: Spark Issue Type: Improvement Components: SQL Reporter: Hyukjin Kwon Priority: Minor When a data source (such as Parquet) filters data while reading from HDFS (rather than in memory), the physical planning phase passes the filter objects in {{org.apache.spark.sql.sources}}, which are appropriately built and picked up by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}. However, it does not pass the {{EqualNullSafe}} filter in {{org.apache.spark.sql.catalyst.expressions}}, even though passing it appears feasible for data sources such as Parquet and JSON. In more detail, {{EqualNullSafe}} is not passed to {{buildScan()}} (below) in {{PrunedFilteredScan}} and {{PrunedScan}}, {code} def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] {code} even though the binary compatibility issue is solved (https://issues.apache.org/jira/browse/SPARK-8747). I understand that {{CatalystScan}} can take all the raw expressions, giving access to the query planner. However, it is experimental, requires a different interface, and is unstable for reasons such as binary compatibility. In general, the problem below can happen. 1. {code:sql} SELECT * FROM table WHERE field = 1; {code} 2. {code:sql} SELECT * FROM table WHERE field <=> 1; {code} The second query can be hugely slower even though it is functionally almost identical, because data that is not filtered at the source RDD can cause large network traffic, among other costs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7726) Maven Install Breaks When Upgrading Scala 2.11.2-->[2.11.3 or higher]
[ https://issues.apache.org/jira/browse/SPARK-7726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681277#comment-14681277 ] Apache Spark commented on SPARK-7726: - User 'pwendell' has created a pull request for this issue: https://github.com/apache/spark/pull/8095 Maven Install Breaks When Upgrading Scala 2.11.2--[2.11.3 or higher] - Key: SPARK-7726 URL: https://issues.apache.org/jira/browse/SPARK-7726 Project: Spark Issue Type: Bug Components: Build Reporter: Patrick Wendell Assignee: Iulian Dragos Priority: Blocker Fix For: 1.4.0 This one took a long time to track down. The Maven install phase is part of our release process. It runs the scala:doc target to generate doc jars. Between Scala 2.11.2 and Scala 2.11.3, the behavior of this plugin changed in a way that breaks our build. In both cases, it returned an error (there has been a long running error here that we've always ignored), however in 2.11.3 that error became fatal and failed the entire build process. The upgrade occurred in SPARK-7092. Here is a simple reproduction: {code} ./dev/change-version-to-2.11.sh mvn clean install -pl network/common -pl network/shuffle -DskipTests -Dscala-2.11 {code} This command exits success when Spark is at Scala 2.11.2 and fails with 2.11.3 or higher. In either case an error is printed: {code} [INFO] [INFO] --- scala-maven-plugin:3.2.0:doc-jar (attach-scaladocs) @ spark-network-shuffle_2.11 --- /Users/pwendell/Documents/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:56: error: not found: type Type protected Type type() { return Type.UPLOAD_BLOCK; } ^ /Users/pwendell/Documents/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/StreamHandle.java:37: error: not found: type Type protected Type type() { return Type.STREAM_HANDLE; } ^ /Users/pwendell/Documents/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/RegisterExecutor.java:44: error: not found: type Type protected Type type() { return Type.REGISTER_EXECUTOR; } ^ /Users/pwendell/Documents/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/OpenBlocks.java:40: error: not found: type Type protected Type type() { return Type.OPEN_BLOCKS; } ^ model contains 22 documentable templates four errors found {code} Ideally we'd just dig in and fix this error. Unfortunately it's a very confusing error and I have no idea why it is appearing. I'd propose reverting SPARK-7092 in the mean time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9743) Scanning a HadoopFsRelation shouldn't require refreshing
[ https://issues.apache.org/jira/browse/SPARK-9743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-9743. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8035 [https://github.com/apache/spark/pull/8035] Scanning a HadoopFsRelation shouldn't require refreshing - Key: SPARK-9743 URL: https://issues.apache.org/jira/browse/SPARK-9743 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker Fix For: 1.5.0 PR #7969 added {{HadoopFsRelation.refresh()}} calls in {{DataSourceStrategy}} to make the test case {{InsertSuite.save directly to the path of a JSON table}} pass. However, this forces every {{HadoopFsRelation}} table scan to do a refresh, which can be super expensive for tables with a large number of partitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9714) Cannot insert into a table using pySpark
[ https://issues.apache.org/jira/browse/SPARK-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-9714: Sprint: Spark 1.5 doc/QA sprint Cannot insert into a table using pySpark Key: SPARK-9714 URL: https://issues.apache.org/jira/browse/SPARK-9714 Project: Spark Issue Type: Bug Components: SQL Reporter: Yun Park Assignee: Yin Huai Priority: Blocker This is a bug on the master branch. After creating the table (yun is the table name) with the corresponding fields, I ran the following command.
{code}
from pyspark.sql import *
sc.parallelize([Row(id=1, name="test", description="")]).toDF().write.mode("append").saveAsTable("yun")
{code}
I get the following error: {code} Py4JJavaError: An error occurred while calling o100.saveAsTable. : org.apache.spark.SparkException: Task not serializable Caused by: java.io.NotSerializableException: org.apache.hadoop.fs.Path Serialization stack: - object not serializable (class: org.apache.hadoop.fs.Path, value: /user/hive/warehouse/yun) - field (class: org.apache.hadoop.hive.ql.metadata.Table, name: path, type: class org.apache.hadoop.fs.Path) - object (class org.apache.hadoop.hive.ql.metadata.Table, yun) - field (class: org.apache.hadoop.hive.ql.metadata.Partition, name: table, type: class org.apache.hadoop.hive.ql.metadata.Table) - object (class org.apache.hadoop.hive.ql.metadata.Partition, yun()) - field (class: scala.collection.immutable.Stream$Cons, name: hd, type: class java.lang.Object) - object (class scala.collection.immutable.Stream$Cons, Stream(yun())) - field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: $outer, type: class scala.collection.immutable.Stream) - object (class scala.collection.immutable.Stream$$anonfun$map$1, function0) - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: interface scala.Function0) - object (class scala.collection.immutable.Stream$Cons, Stream(HivePartition(List(),HiveStorageDescriptor(/user/hive/warehouse/yun,org.apache.hadoop.mapred.TextInputFormat,org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,Map(serialization.format -> 1) - field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: $outer, type: class scala.collection.immutable.Stream) - object (class scala.collection.immutable.Stream$$anonfun$map$1, function0) - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: interface scala.Function0) - object (class scala.collection.immutable.Stream$Cons, Stream(/user/hive/warehouse/yun)) - field (class: org.apache.spark.sql.hive.MetastoreRelation, name: paths, type: interface scala.collection.Seq) - object (class org.apache.spark.sql.hive.MetastoreRelation, MetastoreRelation default, yun, None ) - field (class: org.apache.spark.sql.hive.execution.InsertIntoHiveTable, name: table, type: class org.apache.spark.sql.hive.MetastoreRelation) - object (class org.apache.spark.sql.hive.execution.InsertIntoHiveTable, InsertIntoHiveTable (MetastoreRelation default, yun, None), Map(), false, false ConvertToSafe TungstenProject [CAST(description#10, FloatType) AS description#16,CAST(id#11L, StringType) AS id#17,name#12] PhysicalRDD [description#10,id#11L,name#12], MapPartitionsRDD[17] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:-2 ) - field (class: org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3, name: $outer, type: class org.apache.spark.sql.hive.execution.InsertIntoHiveTable) - object (class 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3, function2) at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301) ... 30 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9781) KCL Workers should be configurable from Spark configuration
[ https://issues.apache.org/jira/browse/SPARK-9781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Nekhaev updated SPARK-9781: - Description: Currently the KinesisClientLibConfiguration for KCL Workers is created within the KinesisReceiver and user is allowed to change only basic settings such as endpoint URL, stream name, credentials, etc. However, there is no way to tune some advanced settings, e.g. MaxRecords, IdleTimeBetweenReads, FailoverTime, etc. We can add these settings to the Spark configuration and parametrize KinesisClientLibConfiguration with them in KinesisReceiver. was: Currently the KinesisClientLibConfiguration for KCL Workers is created withing the KinesisReceiver and user is allowed to change only basic settings such as endpoint URL, stream name, credentials, etc. However, there is no way to tune some advanced settings, e.g. MaxRecords, IdleTimeBetweenReads, FailoverTime, etc. We can add this settings to the Spark configuration and parametrize KinesisClientLibConfiguration with them in KinesisReceiver. KCL Workers should be configurable from Spark configuration --- Key: SPARK-9781 URL: https://issues.apache.org/jira/browse/SPARK-9781 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.4.1 Reporter: Anton Nekhaev Labels: kinesis Currently the KinesisClientLibConfiguration for KCL Workers is created within the KinesisReceiver and user is allowed to change only basic settings such as endpoint URL, stream name, credentials, etc. However, there is no way to tune some advanced settings, e.g. MaxRecords, IdleTimeBetweenReads, FailoverTime, etc. We can add these settings to the Spark configuration and parametrize KinesisClientLibConfiguration with them in KinesisReceiver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
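A sketch of what the proposal could look like inside KinesisReceiver. The spark.streaming.kinesis.* configuration keys and default values below are hypothetical names chosen for illustration, not existing settings:
{code}
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration
import org.apache.spark.SparkConf

// Hypothetical mapping from Spark configuration to advanced KCL settings.
def kclConfig(conf: SparkConf, base: KinesisClientLibConfiguration): KinesisClientLibConfiguration = {
  base
    .withMaxRecords(conf.getInt("spark.streaming.kinesis.maxRecords", 10000))
    .withIdleTimeBetweenReadsInMillis(
      conf.getLong("spark.streaming.kinesis.idleTimeBetweenReadsMs", 1000L))
    .withFailoverTimeMillis(conf.getLong("spark.streaming.kinesis.failoverTimeMs", 10000L))
}
{code}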
[jira] [Commented] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680355#comment-14680355 ] Ryan Blue commented on SPARK-9340: -- Sorry to jump in late on this issue... I think you're on the right track here, but just to be sure I'll clarify things as I see them. The specs written for PARQUET-113 allow non-LIST/MAP repeated fields because that's what parquet-protobuf uses. But, we didn't implement support for unannotated repeated groups because we wanted to address the compatibility issues between Hive, Thrift, and Avro as quickly as possible (which are still being cleaned up). So for now, unannotated repeated groups throw the AnalysisException noted above. Those should eventually map to required lists of required elements to give the exact same view of the data that you have in parquet-protobuf. I believe [~damianguy], would like to discuss a different mapping from the protobuf schema to a parquet schema, which is a great discussion to have in the upstream Parquet project. That sounds like a reasonable extension to me, but I want to see what the protobuf model maintainers think of it. ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch -- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Attachments: ParquetTypesConverterTest.scala The way ParquetTypesConverter handles primitive repeated types results in an incompatible schema being used for querying data. For example, given a schema like so: message root { repeated int32 repeated_field; } Spark produces a read schema like: message root { optional int32 repeated_field; } These are incompatible and all attempts to read fail. In ParquetTypesConverter.toDataType: if (parquetType.isPrimitive) { toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp) } else {...} The if condition should also have !parquetType.isRepetition(Repetition.REPEATED) And then this case will need to be handled in the else -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9600) DataFrameWriter.saveAsTable always writes data to /user/hive/warehouse
[ https://issues.apache.org/jira/browse/SPARK-9600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-9600: Priority: Blocker (was: Critical) DataFrameWriter.saveAsTable always writes data to /user/hive/warehouse Key: SPARK-9600 URL: https://issues.apache.org/jira/browse/SPARK-9600 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1, 1.5.0 Reporter: Cheng Lian Assignee: Sudhakar Thota Priority: Blocker Attachments: SPARK-9600-fl1.txt Get a clean Spark 1.4.1 build:
{noformat}
$ git checkout v1.4.1
$ ./build/sbt -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 clean assembly/assembly
{noformat}
Stop any running local Hadoop instance and unset all Hadoop environment variables, so that we force Spark to run with the local file system only:
{noformat}
$ unset HADOOP_CONF_DIR
$ unset HADOOP_PREFIX
$ unset HADOOP_LIBEXEC_DIR
$ unset HADOOP_CLASSPATH
{noformat}
In this way we also ensure that the default Hive warehouse location points to local file system {{file:///user/hive/warehouse}}. Now we create warehouse directories for testing:
{noformat}
$ sudo rm -rf /user  # !! WARNING: IT'S /user RATHER THAN /usr !!
$ sudo mkdir -p /user/hive/{warehouse,warehouse_hive13}
$ sudo chown -R lian:staff /user
$ tree /user
/user
└── hive
    ├── warehouse
    └── warehouse_hive13
{noformat}
Create a minimal {{hive-site.xml}}, only override the warehouse location, put it under {{$SPARK_HOME/conf}}:
{noformat}
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>file:///user/hive/warehouse_hive13</value>
  </property>
</configuration>
{noformat}
Now run our test snippets with {{pyspark}}:
{noformat}
$ ./bin/pyspark
In [1]: sqlContext.range(10).coalesce(1).write.saveAsTable("ds")
{noformat}
Check warehouse directories:
{noformat}
$ tree /user
/user
└── hive
    ├── warehouse
    │   └── ds
    │       ├── _SUCCESS
    │       ├── _common_metadata
    │       ├── _metadata
    │       └── part-r-0-46e4b32a-5c4d-4dba-b8d6-8d30ae910dc9.gz.parquet
    └── warehouse_hive13
        └── ds
{noformat}
Here you may notice the weird part: we have {{ds}} under both {{warehouse}} and {{warehouse_hive13}}, but data are only written into the former. Now let's try HiveQL:
{noformat}
In [2]: sqlContext.range(10).coalesce(1).registerTempTable("t")
In [3]: sqlContext.sql("CREATE TABLE ds_ctas AS SELECT * FROM t")
{noformat}
Check the directories again:
{noformat}
$ tree /user
/user
└── hive
    ├── warehouse
    │   └── ds
    │       ├── _SUCCESS
    │       ├── _common_metadata
    │       ├── _metadata
    │       └── part-r-0-46e4b32a-5c4d-4dba-b8d6-8d30ae910dc9.gz.parquet
    └── warehouse_hive13
        ├── ds
        └── ds_ctas
            ├── _SUCCESS
            └── part-0
{noformat}
So HiveQL works fine. (Hive never writes Parquet summary files, so {{_common_metadata}} and {{_metadata}} are missing in {{ds_ctas}}). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9782) Add support for YARN application tags running Spark on YARN
Dennis Huo created SPARK-9782: - Summary: Add support for YARN application tags running Spark on YARN Key: SPARK-9782 URL: https://issues.apache.org/jira/browse/SPARK-9782 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.4.1 Reporter: Dennis Huo https://issues.apache.org/jira/browse/YARN-1390 originally added the new “Application Tags” feature to YARN to help track the sources of applications among many possible YARN clients. https://issues.apache.org/jira/browse/YARN-1399 improved on this to allow a set of tags to be applied, and for comparison, https://issues.apache.org/jira/browse/MAPREDUCE-5699 added support for MapReduce to easily propagate tags through to YARN via Configuration settings. Since the ApplicationSubmissionContext.setApplicationTags method was only added in Hadoop 2.4+, Spark support will invoke the method via reflection the same way other such version-specific methods are called elsewhere in the YARN client. Since the usage of tags is generally not critical to the functionality of older YARN setups, it should be safe to handle NoSuchMethodException with just a logWarning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
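A rough sketch of the reflection approach described above, assuming `appContext` is the ApplicationSubmissionContext, `tags` is a Set[String] parsed from configuration, and `logWarning` comes from Spark's Logging trait; this is illustrative, not the submitted patch:
{code}
import scala.collection.JavaConverters._

// Sketch only: call setApplicationTags when it exists (Hadoop 2.4+), and
// degrade to a warning on older YARN versions.
try {
  val method = appContext.getClass.getMethod(
    "setApplicationTags", classOf[java.util.Set[String]])
  method.invoke(appContext, new java.util.HashSet[String](tags.asJava))
} catch {
  case _: NoSuchMethodException =>
    logWarning("Ignoring application tags; this version of YARN does not support them")
}
{code}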
[jira] [Commented] (SPARK-7751) Add @since to stable and experimental methods in MLlib
[ https://issues.apache.org/jira/browse/SPARK-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680423#comment-14680423 ] Joseph K. Bradley commented on SPARK-7751: -- [~mengxr] I haven't reviewed these PRs, but are people copying docs whenever they add since tags to overridden methods? Before, the overridden methods would inherit documentation, but with a one-line since tag added, they no longer inherit docs. This PR brought the problem to my attention: [https://github.com/apache/spark/pull/8045/files]. Adding since tags to all methods in MLlib will mean we always copy documentation and never rely on it being inherited. Add @since to stable and experimental methods in MLlib -- Key: SPARK-7751 URL: https://issues.apache.org/jira/browse/SPARK-7751 Project: Spark Issue Type: Umbrella Components: Documentation, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor Labels: starter This is useful to check whether a feature exists in some version of Spark. This is an umbrella JIRA to track the progress. We want to have @since tag for both stable (those without any Experimental/DeveloperApi/AlphaComponent annotations) and experimental methods in MLlib: (Do NOT tag private or package private classes or methods.) * an example PR for Scala: https://github.com/apache/spark/pull/6101 * an example PR for Python: https://github.com/apache/spark/pull/6295 We need to dig the history of git commit to figure out what was the Spark version when a method was first introduced. Take `NaiveBayes.setModelType` as an example. We can grep `def setModelType` at different version git tags. {code} meng@xm:~/src/spark $ git show v1.3.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala | grep def setModelType meng@xm:~/src/spark $ git show v1.4.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala | grep def setModelType def setModelType(modelType: String): NaiveBayes = { {code} If there are better ways, please let us know. We cannot add all @since tags in a single PR, which is hard to review. So we made some subtasks for each package, for example `org.apache.spark.classification`. Feel free to add more sub-tasks for Python and the `spark.ml` package. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9783) Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call
[ https://issues.apache.org/jira/browse/SPARK-9783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-9783: -- Sprint: Spark 1.5 doc/QA sprint Environment: (was: PR #8035 made a quick fix for SPARK-9743 by introducing an extra {{refresh()}} call in {{JSONRelation.buildScan}}. Obviously it hurts performance. To overcome this, we can use {{SqlNewHadoopRDD}} there and override {{listStatus()}} to inject cached {{FileStatus}} instances, similar as what we did in {{ParquetRelation}}.) Description: PR #8035 made a quick fix for SPARK-9743 by introducing an extra {{refresh()}} call in {{JSONRelation.buildScan}}. Obviously it hurts performance. To overcome this, we can use {{SqlNewHadoopRDD}} there and override {{listStatus()}} to inject cached {{FileStatus}} instances, similar as what we did in {{ParquetRelation}}. Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call - Key: SPARK-9783 URL: https://issues.apache.org/jira/browse/SPARK-9783 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker PR #8035 made a quick fix for SPARK-9743 by introducing an extra {{refresh()}} call in {{JSONRelation.buildScan}}. Obviously it hurts performance. To overcome this, we can use {{SqlNewHadoopRDD}} there and override {{listStatus()}} to inject cached {{FileStatus}} instances, similar as what we did in {{ParquetRelation}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
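A sketch of the approach being proposed: the class and constructor below are illustrative only, and `cachedStatuses` is assumed to hold the driver-side FileStatus cache that the relation already maintains.
{code}
import scala.collection.JavaConverters._
import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapreduce.JobContext
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Illustrative input format: serve the cached file listing instead of
// re-listing (and hence re-refreshing) the table directory on every scan.
class CachedStatusTextInputFormat(cachedStatuses: Seq[FileStatus]) extends TextInputFormat {
  override protected def listStatus(job: JobContext): java.util.List[FileStatus] =
    new java.util.ArrayList[FileStatus](cachedStatuses.asJava)
}
{code}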
[jira] [Commented] (SPARK-9622) DecisionTreeRegressor: provide variance of prediction
[ https://issues.apache.org/jira/browse/SPARK-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680510#comment-14680510 ] Joseph K. Bradley commented on SPARK-9622: -- OK, before you do though, it'd be worth discussing how those variances should be returned. E.g., just a Double column of variances? Pros: Simple, applicable to other distributions if we ever move beyond Variance (=Gaussian) as an impurity. Cons: Not extensible if we use other distributions and want to return more details about the distribution. Those are my thoughts. Currently, a Double column of variances seems best to me. But it'd be nice to hear your thoughts. DecisionTreeRegressor: provide variance of prediction - Key: SPARK-9622 URL: https://issues.apache.org/jira/browse/SPARK-9622 Project: Spark Issue Type: Sub-task Components: ML Reporter: Joseph K. Bradley Priority: Minor Variance of predicted value, as estimated from training data. Analogous to class probabilities for classification. See [SPARK-3727] for discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
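To make the "plain Double column" option concrete, this is roughly what a caller would see; the column name is hypothetical, since nothing is decided in this discussion:
{code}
import org.apache.spark.sql.DataFrame

// Hypothetical: if variances come back as a plain Double column named
// "variance", consuming them is just an ordinary column selection.
def showPredictionWithVariance(predictions: DataFrame): Unit = {
  predictions.select("prediction", "variance").show()
}
{code}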
[jira] [Updated] (SPARK-9340) CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-9340: -- Sprint: Spark 1.5 doc/QA sprint Target Version/s: 1.5.0 CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly --- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Assignee: Cheng Lian Attachments: ParquetTypesConverterTest.scala SPARK-6776 and SPARK-6777 followed {{parquet-avro}} to implement backwards-compatibility rules defined in {{parquet-format}} spec. However, both Spark SQL and {{parquet-avro}} neglected the following statement in {{parquet-format}}: {quote} This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a {{LIST}}- or {{MAP}}-annotated group nor annotated by {{LIST}} or {{MAP}} should be interpreted as a required list of required elements where the element type is the type of the field. {quote} One of the consequences is that, Parquet files generated by {{parquet-protobuf}} containing unannotated repeated fields are not correctly converted to Catalyst arrays. For example, the following Parquet schema {noformat} message root { repeated int32 f1 } {noformat} should be converted to {noformat} StructType(StructField(f1, ArrayType(IntegerType, containsNull = false), nullable = false) :: Nil) {noformat} But now it triggers an {{AnalysisException}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9785) HashPartitioning guarantees / compatibleWith violate those methods' contracts
Josh Rosen created SPARK-9785: - Summary: HashPartitioning guarantees / compatibleWith violate those methods' contracts Key: SPARK-9785 URL: https://issues.apache.org/jira/browse/SPARK-9785 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker HashPartitioning compatibility is defined w.r.t the _set_ of expressions, but in other contexts the ordering of those expressions matters. This is illustrated by the following regression test: {code} test("HashPartitioning compatibility") { val expressions = Seq(Literal(2), Literal(3)) // Consider two HashPartitionings that have the same _set_ of hash expressions but which are // created with different orderings of those expressions: val partitioningA = HashPartitioning(expressions, 100) val partitioningB = HashPartitioning(expressions.reverse, 100) // These partitionings are not considered equal: assert(partitioningA != partitioningB) // However, they both satisfy the same clustered distribution: val distribution = ClusteredDistribution(expressions) assert(partitioningA.satisfies(distribution)) assert(partitioningB.satisfies(distribution)) // Both partitionings are compatible with and guarantee each other: assert(partitioningA.compatibleWith(partitioningB)) assert(partitioningB.compatibleWith(partitioningA)) assert(partitioningA.guarantees(partitioningB)) assert(partitioningB.guarantees(partitioningA)) // Given all of this, we would expect these partitionings to compute the same hashcode for // any given row: def computeHashCode(partitioning: HashPartitioning): Int = { val hashExprProj = new InterpretedMutableProjection(partitioning.expressions, Seq.empty) hashExprProj.apply(InternalRow.empty).hashCode() } assert(computeHashCode(partitioningA) === computeHashCode(partitioningB)) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
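The order sensitivity at the heart of the test can be seen in isolation with plain Scala collections (not the Catalyst classes): two sequences that are equal as sets still hash differently, which is exactly what the projection-based hash codes above run into.
{code}
// Two sequences that are equal as *sets* but not as *sequences*:
val exprsA = Seq(2, 3)
val exprsB = Seq(3, 2)

assert(exprsA.toSet == exprsB.toSet)            // same set of "expressions"
println(exprsA.hashCode() == exprsB.hashCode()) // false here: Seq hashing is order-dependent
{code}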
[jira] [Created] (SPARK-9788) LDA docConcentration, gammaShape 1.5 binary incompatibility fixes
Joseph K. Bradley created SPARK-9788: Summary: LDA docConcentration, gammaShape 1.5 binary incompatibility fixes Key: SPARK-9788 URL: https://issues.apache.org/jira/browse/SPARK-9788 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley From [SPARK-9658]: 1. LDA.docConcentration It will be nice to keep the old APIs unchanged. Proposal: * Add “asymmetricDocConcentration” and revert docConcentration changes. * If the (internal) doc concentration vector is a single value, “getDocConcentration” returns it. If it is a constant vector, getDocConcentration returns the first item; otherwise it fails. 2. LDAModel.gammaShape This should be given a default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
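A minimal sketch of the proposed accessor semantics, written as a standalone helper; the helper name and error message are illustrative, not the final API.
{code}
import org.apache.spark.mllib.linalg.Vector

// Proposed behavior: a single value or an all-equal vector collapses to one Double;
// a genuinely asymmetric vector should fail and point users at asymmetricDocConcentration.
def symmetricDocConcentration(docConcentration: Vector): Double = {
  require(docConcentration.size > 0, "docConcentration must be non-empty")
  val first = docConcentration(0)
  val isConstant = (1 until docConcentration.size).forall(i => docConcentration(i) == first)
  if (isConstant) {
    first
  } else {
    throw new UnsupportedOperationException(
      "docConcentration is asymmetric; use asymmetricDocConcentration instead")
  }
}
{code}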
[jira] [Commented] (SPARK-9782) Add support for YARN application tags running Spark on YARN
[ https://issues.apache.org/jira/browse/SPARK-9782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680413#comment-14680413 ] Apache Spark commented on SPARK-9782: - User 'dennishuo' has created a pull request for this issue: https://github.com/apache/spark/pull/8072 Add support for YARN application tags running Spark on YARN --- Key: SPARK-9782 URL: https://issues.apache.org/jira/browse/SPARK-9782 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.4.1 Reporter: Dennis Huo https://issues.apache.org/jira/browse/YARN-1390 originally added the new “Application Tags” feature to YARN to help track the sources of applications among many possible YARN clients. https://issues.apache.org/jira/browse/YARN-1399 improved on this to allow a set of tags to be applied, and for comparison, https://issues.apache.org/jira/browse/MAPREDUCE-5699 added support for MapReduce to easily propagate tags through to YARN via Configuration settings. Since the ApplicationSubmissionContext.setApplicationTags method was only added in Hadoop 2.4+, Spark support will invoke the method via reflection the same way other such version-specific methods are called in elsewhere in the YARN client. Since the usage of tags is generally not critical to the functionality of older YARN setups, it should be safe to handle NoSuchMethodException with just a logWarning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
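A hedged sketch of the reflection pattern described above; the helper name and warning text are illustrative, and in Spark the warning would go through logWarning rather than println.
{code}
import scala.collection.JavaConverters._

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext

// Call ApplicationSubmissionContext.setApplicationTags(java.util.Set[String]) reflectively,
// since the method only exists in Hadoop 2.4+; older versions just get a warning.
def trySetApplicationTags(appContext: ApplicationSubmissionContext, tags: Set[String]): Unit = {
  try {
    val method = appContext.getClass
      .getMethod("setApplicationTags", classOf[java.util.Set[String]])
    method.invoke(appContext, tags.asJava)
  } catch {
    case _: NoSuchMethodException =>
      println(s"WARN: this YARN version does not support application tags; ignoring $tags")
  }
}
{code}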
[jira] [Updated] (SPARK-9755) Add method documentation to MultivariateOnlineSummarizer
[ https://issues.apache.org/jira/browse/SPARK-9755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9755: - Shepherd: Joseph K. Bradley Assignee: Feynman Liang Add method documentation to MultivariateOnlineSummarizer Key: SPARK-9755 URL: https://issues.apache.org/jira/browse/SPARK-9755 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Reporter: Feynman Liang Assignee: Feynman Liang Priority: Minor Docs present in 1.4 are lost in current 1.5 branch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9783) Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call
[ https://issues.apache.org/jira/browse/SPARK-9783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680457#comment-14680457 ] Cheng Lian commented on SPARK-9783: --- cc [~yhuai] Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call - Key: SPARK-9783 URL: https://issues.apache.org/jira/browse/SPARK-9783 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker PR #8035 made a quick fix for SPARK-9743 by introducing an extra {{refresh()}} call in {{JSONRelation.buildScan}}. Obviously it hurts performance. To overcome this, we can use {{SqlNewHadoopRDD}} there and override {{listStatus()}} to inject cached {{FileStatus}} instances, similar as what we did in {{ParquetRelation}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods
[ https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9720: - Description: It would be nice to include the UID (instance name) in toString methods. That's the default behavior for Identifiable, but some types override the default toString and do not include the UID. (was: It would be nice to print the UID (instance name) in toString methods. That's the default behavior for Identifiable, but some types override the default toString and do not print the UID.) spark.ml Identifiable types should have UID in toString methods --- Key: SPARK-9720 URL: https://issues.apache.org/jira/browse/SPARK-9720 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Priority: Minor Labels: starter It would be nice to include the UID (instance name) in toString methods. That's the default behavior for Identifiable, but some types override the default toString and do not include the UID. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
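A minimal sketch of the convention being asked for, using a made-up class name; the point is simply that a type overriding toString should still surface the uid that Identifiable's default toString provides.
{code}
import org.apache.spark.ml.util.Identifiable

// Hypothetical example type: keep the uid visible even when adding extra detail.
class ExampleModel(override val uid: String) extends Identifiable {
  private val numFeatures: Int = 10 // illustrative state
  override def toString: String = s"$uid: numFeatures=$numFeatures"
}
{code}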
[jira] [Updated] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods
[ https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9720: - Assignee: Bertrand Dechoux spark.ml Identifiable types should have UID in toString methods --- Key: SPARK-9720 URL: https://issues.apache.org/jira/browse/SPARK-9720 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Assignee: Bertrand Dechoux Priority: Minor Labels: starter It would be nice to include the UID (instance name) in toString methods. That's the default behavior for Identifiable, but some types override the default toString and do not include the UID. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods
[ https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680520#comment-14680520 ] Joseph K. Bradley commented on SPARK-9720: -- Oh sorry! I shouldn't have said print. spark.ml Identifiable types should have UID in toString methods --- Key: SPARK-9720 URL: https://issues.apache.org/jira/browse/SPARK-9720 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Priority: Minor Labels: starter It would be nice to include the UID (instance name) in toString methods. That's the default behavior for Identifiable, but some types override the default toString and do not include the UID. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9785) HashPartitioning compatibility should consider expression ordering
[ https://issues.apache.org/jira/browse/SPARK-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9785: --- Assignee: Apache Spark (was: Josh Rosen) HashPartitioning compatibility should consider expression ordering -- Key: SPARK-9785 URL: https://issues.apache.org/jira/browse/SPARK-9785 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Apache Spark Priority: Blocker HashPartitioning compatibility is defined w.r.t the _set_ of expressions, but in other contexts the ordering of those expressions matters. This is illustrated by the following regression test: {code} test(HashPartitioning compatibility) { val expressions = Seq(Literal(2), Literal(3)) // Consider two HashPartitionings that have the same _set_ of hash expressions but which are // created with different orderings of those expressions: val partitioningA = HashPartitioning(expressions, 100) val partitioningB = HashPartitioning(expressions.reverse, 100) // These partitionings are not considered equal: assert(partitioningA != partitioningB) // However, they both satisfy the same clustered distribution: val distribution = ClusteredDistribution(expressions) assert(partitioningA.satisfies(distribution)) assert(partitioningB.satisfies(distribution)) // Both partitionings are compatible with and guarantee each other: assert(partitioningA.compatibleWith(partitioningB)) assert(partitioningB.compatibleWith(partitioningA)) assert(partitioningA.guarantees(partitioningB)) assert(partitioningB.guarantees(partitioningA)) // Given all of this, we would expect these partitionings to compute the same hashcode for // any given row: def computeHashCode(partitioning: HashPartitioning): Int = { val hashExprProj = new InterpretedMutableProjection(partitioning.expressions, Seq.empty) hashExprProj.apply(InternalRow.empty).hashCode() } assert(computeHashCode(partitioningA) === computeHashCode(partitioningB)) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9790) [YARN] Expose in WebUI if NodeManager is the reason why executors were killed.
Mark Grover created SPARK-9790: -- Summary: [YARN] Expose in WebUI if NodeManager is the reason why executors were killed. Key: SPARK-9790 URL: https://issues.apache.org/jira/browse/SPARK-9790 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.1 Reporter: Mark Grover When an executor is killed by YARN because it exceeds the memory overhead, the only thing Spark knows is that the executor is lost. The user has to search through the NodeManager logs to figure out that it has been killed by YARN. It would be much nicer if the Spark driver could be notified why the executor was killed. Ideally it could both log an explanatory message and update the UI (and the eventLog) so that it was clear why the executor was lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9570) Consistent recommendation for submitting spark apps to YARN, -master yarn --deploy-mode x vs -master yarn-x'.
[ https://issues.apache.org/jira/browse/SPARK-9570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680379#comment-14680379 ] Guru Medasani commented on SPARK-9570: -- I won't have time to look at this. Neelesh, can you look at it? Consistent recommendation for submitting spark apps to YARN, -master yarn --deploy-mode x vs -master yarn-x'. - Key: SPARK-9570 URL: https://issues.apache.org/jira/browse/SPARK-9570 Project: Spark Issue Type: Improvement Components: Documentation, Spark Submit, YARN Affects Versions: 1.4.1 Reporter: Neelesh Srinivas Salian Priority: Minor Labels: starter There are still some inconsistencies in the documentation regarding submission of applications to YARN. SPARK-3629 was done to correct the same, but http://spark.apache.org/docs/latest/submitting-applications.html#master-urls still has the yarn-client and yarn-cluster master URLs as opposed to the norm of having --master yarn and --deploy-mode cluster / client. Need to change this appropriately (if needed) to avoid confusion: https://spark.apache.org/docs/latest/running-on-yarn.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680385#comment-14680385 ] Apache Spark commented on SPARK-9340: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/8070 ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch -- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Attachments: ParquetTypesConverterTest.scala The way ParquetTypesConverter handles primitive repeated types results in an incompatible schema being used for querying data. For example, given a schema like so: message root { repeated int32 repeated_field; } Spark produces a read schema like: message root { optional int32 repeated_field; } These are incompatible and all attempts to read fail. In ParquetTypesConverter.toDataType: if (parquetType.isPrimitive) { toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp) } else {...} The if condition should also have !parquetType.isRepetition(Repetition.REPEATED) And then this case will need to be handled in the else -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
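To make the suggested change concrete, here is a hedged sketch of the decision only; the real toDataType / toPrimitiveDataType methods carry more parameters and cases, and the Parquet package name depends on the Parquet version in use.
{code}
import org.apache.parquet.schema.Type
import org.apache.parquet.schema.Type.Repetition
import org.apache.spark.sql.types.{ArrayType, DataType}

// Treat an unannotated repeated primitive as a required list of required elements,
// instead of falling through to the plain primitive conversion.
def toDataTypeSketch(
    parquetType: Type,
    toPrimitive: Type => DataType,
    toComplex: Type => DataType): DataType = {
  if (parquetType.isPrimitive && !parquetType.isRepetition(Repetition.REPEATED)) {
    toPrimitive(parquetType)
  } else if (parquetType.isPrimitive) {
    ArrayType(toPrimitive(parquetType), containsNull = false)
  } else {
    toComplex(parquetType)
  }
}
{code}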
[jira] [Assigned] (SPARK-9570) Consistent recommendation for submitting spark apps to YARN, -master yarn --deploy-mode x vs -master yarn-x'.
[ https://issues.apache.org/jira/browse/SPARK-9570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9570: --- Assignee: (was: Apache Spark) Consistent recommendation for submitting spark apps to YARN, -master yarn --deploy-mode x vs -master yarn-x'. - Key: SPARK-9570 URL: https://issues.apache.org/jira/browse/SPARK-9570 Project: Spark Issue Type: Improvement Components: Documentation, Spark Submit, YARN Affects Versions: 1.4.1 Reporter: Neelesh Srinivas Salian Priority: Minor Labels: starter There are still some inconsistencies in the documentation regarding submission of the applications for yarn. SPARK-3629 was done to correct the same but http://spark.apache.org/docs/latest/submitting-applications.html#master-urls still has yarn-client and yarn-client as opposed to the nor of having --master yarn and --deploy-mode cluster / client Need to change this appropriately (if needed) to avoid confusion: https://spark.apache.org/docs/latest/running-on-yarn.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9570) Consistent recommendation for submitting spark apps to YARN, -master yarn --deploy-mode x vs -master yarn-x'.
[ https://issues.apache.org/jira/browse/SPARK-9570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680392#comment-14680392 ] Apache Spark commented on SPARK-9570: - User 'nssalian' has created a pull request for this issue: https://github.com/apache/spark/pull/8071 Consistent recommendation for submitting spark apps to YARN, -master yarn --deploy-mode x vs -master yarn-x'. - Key: SPARK-9570 URL: https://issues.apache.org/jira/browse/SPARK-9570 Project: Spark Issue Type: Improvement Components: Documentation, Spark Submit, YARN Affects Versions: 1.4.1 Reporter: Neelesh Srinivas Salian Priority: Minor Labels: starter There are still some inconsistencies in the documentation regarding submission of the applications for yarn. SPARK-3629 was done to correct the same but http://spark.apache.org/docs/latest/submitting-applications.html#master-urls still has yarn-client and yarn-client as opposed to the nor of having --master yarn and --deploy-mode cluster / client Need to change this appropriately (if needed) to avoid confusion: https://spark.apache.org/docs/latest/running-on-yarn.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9570) Consistent recommendation for submitting spark apps to YARN, -master yarn --deploy-mode x vs -master yarn-x'.
[ https://issues.apache.org/jira/browse/SPARK-9570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9570: --- Assignee: Apache Spark Consistent recommendation for submitting spark apps to YARN, -master yarn --deploy-mode x vs -master yarn-x'. - Key: SPARK-9570 URL: https://issues.apache.org/jira/browse/SPARK-9570 Project: Spark Issue Type: Improvement Components: Documentation, Spark Submit, YARN Affects Versions: 1.4.1 Reporter: Neelesh Srinivas Salian Assignee: Apache Spark Priority: Minor Labels: starter There are still some inconsistencies in the documentation regarding submission of the applications for yarn. SPARK-3629 was done to correct the same but http://spark.apache.org/docs/latest/submitting-applications.html#master-urls still has yarn-client and yarn-client as opposed to the nor of having --master yarn and --deploy-mode cluster / client Need to change this appropriately (if needed) to avoid confusion: https://spark.apache.org/docs/latest/running-on-yarn.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680418#comment-14680418 ] Cheng Lian commented on SPARK-9340: --- Thanks for the clarification. In [PR #8070|https://github.com/apache/spark/pull/8070] I just try to do the required list of required elements conversion. I understand that cleaning up all those compatibility stuff can be super time consuming, and making sure the most common scenarios work first totally makes sense. I'm so glad that all the backwards-compatibility rules had already been figured out there when I started to investigate these issues. These rules definitely saved my world! ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch -- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Attachments: ParquetTypesConverterTest.scala The way ParquetTypesConverter handles primitive repeated types results in an incompatible schema being used for querying data. For example, given a schema like so: message root { repeated int32 repeated_field; } Spark produces a read schema like: message root { optional int32 repeated_field; } These are incompatible and all attempts to read fail. In ParquetTypesConverter.toDataType: if (parquetType.isPrimitive) { toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp) } else {...} The if condition should also have !parquetType.isRepetition(Repetition.REPEATED) And then this case will need to be handled in the else -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9340) CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-9340: -- Summary: CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly (was: ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch) CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly --- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Attachments: ParquetTypesConverterTest.scala The way ParquetTypesConverter handles primitive repeated types results in an incompatible schema being used for querying data. For example, given a schema like so: message root { repeated int32 repeated_field; } Spark produces a read schema like: message root { optional int32 repeated_field; } These are incompatible and all attempts to read fail. In ParquetTypesConverter.toDataType: if (parquetType.isPrimitive) { toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp) } else {...} The if condition should also have !parquetType.isRepetition(Repetition.REPEATED) And then this case will need to be handled in the else -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9783) Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call
Cheng Lian created SPARK-9783: - Summary: Use SqlNewHadoopRDD in JSONRelation to eliminate extra refresh() call Key: SPARK-9783 URL: https://issues.apache.org/jira/browse/SPARK-9783 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Environment: PR #8035 made a quick fix for SPARK-9743 by introducing an extra {{refresh()}} call in {{JSONRelation.buildScan}}. Obviously it hurts performance. To overcome this, we can use {{SqlNewHadoopRDD}} there and override {{listStatus()}} to inject cached {{FileStatus}} instances, similar as what we did in {{ParquetRelation}}. Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9784) Exchange.isUnsafe should check whether codegen and unsafe are enabled
[ https://issues.apache.org/jira/browse/SPARK-9784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9784: --- Assignee: Apache Spark (was: Josh Rosen) Exchange.isUnsafe should check whether codegen and unsafe are enabled - Key: SPARK-9784 URL: https://issues.apache.org/jira/browse/SPARK-9784 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Apache Spark Exchange needs to check whether unsafe mode is enabled in its {{tungstenMode}} method: {code} override def nodeName: String = if (tungstenMode) TungstenExchange else Exchange /** * Returns true iff we can support the data type, and we are not doing range partitioning. */ private lazy val tungstenMode: Boolean = { GenerateUnsafeProjection.canSupport(child.schema) !newPartitioning.isInstanceOf[RangePartitioning] } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9340) CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680523#comment-14680523 ] Cheng Lian commented on SPARK-9340: --- Great, would you mind to leave a LGTM on the GitHub PR page? Appreciated! CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly --- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Assignee: Cheng Lian Attachments: ParquetTypesConverterTest.scala SPARK-6776 and SPARK-6777 followed {{parquet-avro}} to implement backwards-compatibility rules defined in {{parquet-format}} spec. However, both Spark SQL and {{parquet-avro}} neglected the following statement in {{parquet-format}}: {quote} This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a {{LIST}}- or {{MAP}}-annotated group nor annotated by {{LIST}} or {{MAP}} should be interpreted as a required list of required elements where the element type is the type of the field. {quote} One of the consequences is that, Parquet files generated by {{parquet-protobuf}} containing unannotated repeated fields are not correctly converted to Catalyst arrays. For example, the following Parquet schema {noformat} message root { repeated int32 f1 } {noformat} should be converted to {noformat} StructType(StructField(f1, ArrayType(IntegerType, containsNull = false), nullable = false) :: Nil) {noformat} But now it triggers an {{AnalysisException}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9785) HashPartitioning compatibility should be sensitive to expression ordering
[ https://issues.apache.org/jira/browse/SPARK-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-9785: -- Summary: HashPartitioning compatibility should be sensitive to expression ordering (was: HashPartitioning guarantees / compatibleWith violate those methods' contracts) HashPartitioning compatibility should be sensitive to expression ordering - Key: SPARK-9785 URL: https://issues.apache.org/jira/browse/SPARK-9785 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker HashPartitioning compatibility is defined w.r.t the _set_ of expressions, but in other contexts the ordering of those expressions matters. This is illustrated by the following regression test: {code} test(HashPartitioning compatibility) { val expressions = Seq(Literal(2), Literal(3)) // Consider two HashPartitionings that have the same _set_ of hash expressions but which are // created with different orderings of those expressions: val partitioningA = HashPartitioning(expressions, 100) val partitioningB = HashPartitioning(expressions.reverse, 100) // These partitionings are not considered equal: assert(partitioningA != partitioningB) // However, they both satisfy the same clustered distribution: val distribution = ClusteredDistribution(expressions) assert(partitioningA.satisfies(distribution)) assert(partitioningB.satisfies(distribution)) // Both partitionings are compatible with and guarantee each other: assert(partitioningA.compatibleWith(partitioningB)) assert(partitioningB.compatibleWith(partitioningA)) assert(partitioningA.guarantees(partitioningB)) assert(partitioningB.guarantees(partitioningA)) // Given all of this, we would expect these partitionings to compute the same hashcode for // any given row: def computeHashCode(partitioning: HashPartitioning): Int = { val hashExprProj = new InterpretedMutableProjection(partitioning.expressions, Seq.empty) hashExprProj.apply(InternalRow.empty).hashCode() } assert(computeHashCode(partitioningA) === computeHashCode(partitioningB)) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9574) Review the contents of uber JARs spark-streaming-XXX-assembly
[ https://issues.apache.org/jira/browse/SPARK-9574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9574: --- Assignee: Apache Spark (was: Shixiong Zhu) Review the contents of uber JARs spark-streaming-XXX-assembly - Key: SPARK-9574 URL: https://issues.apache.org/jira/browse/SPARK-9574 Project: Spark Issue Type: Task Components: Streaming Reporter: Tathagata Das Assignee: Apache Spark It should not contain Spark core and its dependencies, especially the following. - Hadoop and its dependencies - Scala libraries -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9574) Review the contents of uber JARs spark-streaming-XXX-assembly
[ https://issues.apache.org/jira/browse/SPARK-9574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680362#comment-14680362 ] Apache Spark commented on SPARK-9574: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/8069 Review the contents of uber JARs spark-streaming-XXX-assembly - Key: SPARK-9574 URL: https://issues.apache.org/jira/browse/SPARK-9574 Project: Spark Issue Type: Task Components: Streaming Reporter: Tathagata Das Assignee: Shixiong Zhu It should not contain Spark core and its dependencies, especially the following. - Hadoop and its dependencies - Scala libraries -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9782) Add support for YARN application tags running Spark on YARN
[ https://issues.apache.org/jira/browse/SPARK-9782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9782: --- Assignee: (was: Apache Spark) Add support for YARN application tags running Spark on YARN --- Key: SPARK-9782 URL: https://issues.apache.org/jira/browse/SPARK-9782 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.4.1 Reporter: Dennis Huo https://issues.apache.org/jira/browse/YARN-1390 originally added the new “Application Tags” feature to YARN to help track the sources of applications among many possible YARN clients. https://issues.apache.org/jira/browse/YARN-1399 improved on this to allow a set of tags to be applied, and for comparison, https://issues.apache.org/jira/browse/MAPREDUCE-5699 added support for MapReduce to easily propagate tags through to YARN via Configuration settings. Since the ApplicationSubmissionContext.setApplicationTags method was only added in Hadoop 2.4+, Spark support will invoke the method via reflection the same way other such version-specific methods are called in elsewhere in the YARN client. Since the usage of tags is generally not critical to the functionality of older YARN setups, it should be safe to handle NoSuchMethodException with just a logWarning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9782) Add support for YARN application tags running Spark on YARN
[ https://issues.apache.org/jira/browse/SPARK-9782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9782: --- Assignee: Apache Spark Add support for YARN application tags running Spark on YARN --- Key: SPARK-9782 URL: https://issues.apache.org/jira/browse/SPARK-9782 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.4.1 Reporter: Dennis Huo Assignee: Apache Spark https://issues.apache.org/jira/browse/YARN-1390 originally added the new “Application Tags” feature to YARN to help track the sources of applications among many possible YARN clients. https://issues.apache.org/jira/browse/YARN-1399 improved on this to allow a set of tags to be applied, and for comparison, https://issues.apache.org/jira/browse/MAPREDUCE-5699 added support for MapReduce to easily propagate tags through to YARN via Configuration settings. Since the ApplicationSubmissionContext.setApplicationTags method was only added in Hadoop 2.4+, Spark support will invoke the method via reflection the same way other such version-specific methods are called in elsewhere in the YARN client. Since the usage of tags is generally not critical to the functionality of older YARN setups, it should be safe to handle NoSuchMethodException with just a logWarning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9782) Add support for YARN application tags running Spark on YARN
[ https://issues.apache.org/jira/browse/SPARK-9782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680415#comment-14680415 ] Sean Owen commented on SPARK-9782: -- This is different from https://issues.apache.org/jira/browse/SPARK-7173 ? I also doubt this will go in any time soon if it needs Hadoop 2.x, since even 1.x is still supported, even with reflection -- the complexity may not be worth it. Add support for YARN application tags running Spark on YARN --- Key: SPARK-9782 URL: https://issues.apache.org/jira/browse/SPARK-9782 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.4.1 Reporter: Dennis Huo https://issues.apache.org/jira/browse/YARN-1390 originally added the new “Application Tags” feature to YARN to help track the sources of applications among many possible YARN clients. https://issues.apache.org/jira/browse/YARN-1399 improved on this to allow a set of tags to be applied, and for comparison, https://issues.apache.org/jira/browse/MAPREDUCE-5699 added support for MapReduce to easily propagate tags through to YARN via Configuration settings. Since the ApplicationSubmissionContext.setApplicationTags method was only added in Hadoop 2.4+, Spark support will invoke the method via reflection the same way other such version-specific methods are called in elsewhere in the YARN client. Since the usage of tags is generally not critical to the functionality of older YARN setups, it should be safe to handle NoSuchMethodException with just a logWarning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9340) CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reassigned SPARK-9340: - Assignee: Cheng Lian CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly --- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Assignee: Cheng Lian Attachments: ParquetTypesConverterTest.scala SPARK-6776 and SPARK-6777 followed {{parquet-avro}} to implement backwards-compatibility rules defined in {{parquet-format}} spec. However, both Spark SQL and {{parquet-avro}} neglected the following statement in {{parquet-format}}: {quote} This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a {{LIST}}- or {{MAP}}-annotated group nor annotated by {{LIST}} or {{MAP}} should be interpreted as a required list of required elements where the element type is the type of the field. {quote} One of the consequences is that, Parquet files generated by {{parquet-protobuf}} containing unannotated repeated fields are not correctly converted to Catalyst arrays. For example, the following Parquet schema {noformat} message root { repeated int32 f1 } {noformat} should be converted to {noformat} StructType(StructField(f1, ArrayType(IntegerType, containsNull = false), nullable = false) :: Nil) {noformat} But now it triggers an {{AnalysisException}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9450) HashedRelation.get() could return an Iterator[Row] instead of Seq[Row]
[ https://issues.apache.org/jira/browse/SPARK-9450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-9450. --- Resolution: Invalid I'm going to resolve this as Invalid, since it turns out that we need to return an Iterable of Rows in order to support full outer join. HashedRelation.get() could return an Iterator[Row] instead of Seq[Row] -- Key: SPARK-9450 URL: https://issues.apache.org/jira/browse/SPARK-9450 Project: Spark Issue Type: Improvement Components: SQL Reporter: Josh Rosen Assignee: Andrew Or While looking through some HashedRelation code, [~andrewor14] and I noticed that it looks like HashedRelation.get() could return an Iterator of rows instead of a sequence. If we do this, we can reduce object allocation in UnsafeHashedRelation.get(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9784) Exchange.isUnsafe should check whether codegen and unsafe are enabled
Josh Rosen created SPARK-9784: - Summary: Exchange.isUnsafe should check whether codegen and unsafe are enabled Key: SPARK-9784 URL: https://issues.apache.org/jira/browse/SPARK-9784 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Exchange needs to check whether unsafe mode is enabled in its {{tungstenMode}} method: {code} override def nodeName: String = if (tungstenMode) "TungstenExchange" else "Exchange" /** * Returns true iff we can support the data type, and we are not doing range partitioning. */ private lazy val tungstenMode: Boolean = { GenerateUnsafeProjection.canSupport(child.schema) && !newPartitioning.isInstanceOf[RangePartitioning] } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
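A hedged sketch of the intended predicate, written as a standalone function because the real fix lives inside Exchange; the flag names unsafeEnabled and codegenEnabled are assumptions about how the configuration would be surfaced, not the final patch.
{code}
// Tungsten exchange should only be chosen when both features are switched on
// *and* the schema and partitioning allow it.
def shouldUseTungstenExchange(
    unsafeEnabled: Boolean,
    codegenEnabled: Boolean,
    schemaSupported: Boolean,
    isRangePartitioning: Boolean): Boolean = {
  unsafeEnabled && codegenEnabled && schemaSupported && !isRangePartitioning
}
{code}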
[jira] [Comment Edited] (SPARK-9658) ML 1.5 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-9658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659182#comment-14659182 ] Joseph K. Bradley edited comment on SPARK-9658 at 8/10/15 6:49 PM: --- 1 was intentional. I'm OK with creating a new param though. 2. gammaShape could be given a default. topicConcentration is necessary, I'd say. 3. AFAIK this is a Scala compiler bug, not something we can fix easily. 4. That was intentional. We could put it back, though that would create duplicate parameters with sort of confusing semantics. 5. I like having it in a single place to share the implementation. I know it's simple, but it's easy to mess up by swapping the 2 values. was (Author: josephkb): 1 was intentional. I'm OK with creating a new param though. 2. gammaShape could be given a default. topicConcentration is necessary, I'd say. 3. AFAIK this is a Scala compiler bug, not something we can fix easily. 4. That was intentional. We could put it back, though that would create duplicate parameters with sort of confusing semantics. 5. Sounds good. ML 1.5 QA: API: Binary incompatible changes --- Key: SPARK-9658 URL: https://issues.apache.org/jira/browse/SPARK-9658 Project: Spark Issue Type: Sub-task Components: ML, MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Blocker Generated a list of binary incompatible changes using MiMa and filter out some false positives: 1. LDA.docConcentration It will be nice to keep the old APIs unchanged. For example, we can use “asymmetricDocConcentration”. Then “getDocConcentration would return the first item if the concentration vector is a constant vector. 2. LDAModel.gammaShape / topicConcentration Should be okay if we assume that no one extends LDAModel. 3. Params.setDefault If we have time to investigate this issue. We should put it back. 4. LogisticRegressionModel.threshold is missing. 5. LogisticRegression.setThreshold shouldn't be in the Params trait. We need to override it anyway. Will create sub-tasks for each item. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9785) HashPartitioning compatibility should consider expression ordering
[ https://issues.apache.org/jira/browse/SPARK-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-9785: -- Summary: HashPartitioning compatibility should consider expression ordering (was: HashPartitioning compatibility should be sensitive to expression ordering) HashPartitioning compatibility should consider expression ordering -- Key: SPARK-9785 URL: https://issues.apache.org/jira/browse/SPARK-9785 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker HashPartitioning compatibility is defined w.r.t the _set_ of expressions, but in other contexts the ordering of those expressions matters. This is illustrated by the following regression test: {code} test(HashPartitioning compatibility) { val expressions = Seq(Literal(2), Literal(3)) // Consider two HashPartitionings that have the same _set_ of hash expressions but which are // created with different orderings of those expressions: val partitioningA = HashPartitioning(expressions, 100) val partitioningB = HashPartitioning(expressions.reverse, 100) // These partitionings are not considered equal: assert(partitioningA != partitioningB) // However, they both satisfy the same clustered distribution: val distribution = ClusteredDistribution(expressions) assert(partitioningA.satisfies(distribution)) assert(partitioningB.satisfies(distribution)) // Both partitionings are compatible with and guarantee each other: assert(partitioningA.compatibleWith(partitioningB)) assert(partitioningB.compatibleWith(partitioningA)) assert(partitioningA.guarantees(partitioningB)) assert(partitioningB.guarantees(partitioningA)) // Given all of this, we would expect these partitionings to compute the same hashcode for // any given row: def computeHashCode(partitioning: HashPartitioning): Int = { val hashExprProj = new InterpretedMutableProjection(partitioning.expressions, Seq.empty) hashExprProj.apply(InternalRow.empty).hashCode() } assert(computeHashCode(partitioningA) === computeHashCode(partitioningB)) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9789) Reinstate LogisticRegression threshold Param
Joseph K. Bradley created SPARK-9789: Summary: Reinstate LogisticRegression threshold Param Key: SPARK-9789 URL: https://issues.apache.org/jira/browse/SPARK-9789 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley From [SPARK-9658]: LogisticRegression.threshold was replaced by thresholds, but we could keep threshold for backwards compatibility. We should add it back, but we should maintain the current semantics whereby thresholds overrides threshold. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
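One consistent way to let thresholds override threshold in the binary case, sketched as a standalone helper with illustrative names; the conversion shown is one possible mapping from a two-element thresholds array back to a single probability cutoff.
{code}
// If `thresholds` is set it wins; otherwise fall back to the scalar `threshold`.
def effectiveThreshold(threshold: Double, thresholds: Option[Array[Double]]): Double = {
  thresholds match {
    case Some(ts) =>
      require(ts.length == 2, "threshold is only defined for binary classification")
      // Collapse the two per-class thresholds into a single probability cutoff.
      1.0 / (1.0 + ts(0) / ts(1))
    case None => threshold
  }
}
{code}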
[jira] [Resolved] (SPARK-9710) RPackageUtilsSuite fails if R is not installed
[ https://issues.apache.org/jira/browse/SPARK-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-9710. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8008 [https://github.com/apache/spark/pull/8008] RPackageUtilsSuite fails if R is not installed -- Key: SPARK-9710 URL: https://issues.apache.org/jira/browse/SPARK-9710 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.5.0 Reporter: Marcelo Vanzin Fix For: 1.5.0 That's because there's a bug in RUtils.scala. PR soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9710) RPackageUtilsSuite fails if R is not installed
[ https://issues.apache.org/jira/browse/SPARK-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-9710: - Assignee: Marcelo Vanzin RPackageUtilsSuite fails if R is not installed -- Key: SPARK-9710 URL: https://issues.apache.org/jira/browse/SPARK-9710 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.5.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Fix For: 1.5.0 That's because there's a bug in RUtils.scala. PR soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7536) Audit MLlib Python API for 1.4
[ https://issues.apache.org/jira/browse/SPARK-7536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-7536. -- Resolution: Done Thanks [~yanboliang] for putting this (very successful) list together and copying over the few remaining items to the next release list! Audit MLlib Python API for 1.4 -- Key: SPARK-7536 URL: https://issues.apache.org/jira/browse/SPARK-7536 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Joseph K. Bradley Assignee: Yanbo Liang **NOTE: This is targeted at 1.5.0 because it has so many useful links for JIRAs targeted for 1.5.0. In the future, we should create a _new_ JIRA for linking future items.** For new public APIs added to MLlib, we need to check the generated HTML doc and compare the Scala Python versions. We need to track: * Inconsistency: Do class/method/parameter names match? SPARK-7667 * Docs: Is the Python doc missing or just a stub? We want the Python doc to be as complete as the Scala doc. [SPARK-7666], [SPARK-6173] * API breaking changes: These should be very rare but are occasionally either necessary (intentional) or accidental. These must be recorded and added in the Migration Guide for this release. SPARK-7665 ** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, please note that as well. * Missing classes/methods/parameters: We should create to-do JIRAs for functionality missing from Python. ** classification *** StreamingLogisticRegressionWithSGD SPARK-7633 ** clustering *** GaussianMixture SPARK-6258 *** LDA SPARK-6259 *** Power Iteration Clustering SPARK-5962 *** StreamingKMeans SPARK-4118 ** evaluation *** MultilabelMetrics SPARK-6094 ** feature *** ElementwiseProduct SPARK-7605 *** PCA SPARK-7604 ** linalg *** Distributed linear algebra SPARK-6100 ** pmml.export SPARK-7638 ** regression *** StreamingLinearRegressionWithSGD SPARK-4127 ** stat *** KernelDensity SPARK-7639 ** util *** MLUtils SPARK-6263 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9784) Exchange.isUnsafe should check whether codegen and unsafe are enabled
[ https://issues.apache.org/jira/browse/SPARK-9784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680482#comment-14680482 ] Apache Spark commented on SPARK-9784: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/8073 Exchange.isUnsafe should check whether codegen and unsafe are enabled - Key: SPARK-9784 URL: https://issues.apache.org/jira/browse/SPARK-9784 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Exchange needs to check whether unsafe mode is enabled in its {{tungstenMode}} method: {code} override def nodeName: String = if (tungstenMode) TungstenExchange else Exchange /** * Returns true iff we can support the data type, and we are not doing range partitioning. */ private lazy val tungstenMode: Boolean = { GenerateUnsafeProjection.canSupport(child.schema) !newPartitioning.isInstanceOf[RangePartitioning] } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9784) Exchange.isUnsafe should check whether codegen and unsafe are enabled
[ https://issues.apache.org/jira/browse/SPARK-9784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9784: --- Assignee: Josh Rosen (was: Apache Spark) Exchange.isUnsafe should check whether codegen and unsafe are enabled - Key: SPARK-9784 URL: https://issues.apache.org/jira/browse/SPARK-9784 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Exchange needs to check whether unsafe mode is enabled in its {{tungstenMode}} method: {code} override def nodeName: String = if (tungstenMode) TungstenExchange else Exchange /** * Returns true iff we can support the data type, and we are not doing range partitioning. */ private lazy val tungstenMode: Boolean = { GenerateUnsafeProjection.canSupport(child.schema) !newPartitioning.isInstanceOf[RangePartitioning] } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7778) Add standard deviation aggregate expression
[ https://issues.apache.org/jira/browse/SPARK-7778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680491#comment-14680491 ] Rakesh Chalasani commented on SPARK-7778: - Hi SsuTing: The aggregate expression interface has changed in 1.5 and the above PR is obsolete. SPARK-6548 which is tracking this now is still open. I guess that is a better place to keep track of it. Rakesh Add standard deviation aggregate expression Key: SPARK-7778 URL: https://issues.apache.org/jira/browse/SPARK-7778 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Rakesh Chalasani Add standard deviation aggregate expression over data frame columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9711) Unable to run spark after restarting cluster with spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-9711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangyang Li updated SPARK-9711: Description: With Spark 1.4.1 and YARN client mode, my application works at the first time the cluster is built. While if I stop and start the cluster with using spark-ec2, the same command fails. At the end of the spark logs, it's shown that it just keeps trying to connect to master node repeatedly: INFO Client: Retrying connect to server: ec2-54-174-232-129.compute-1.amazonaws.com/172.31.36.29:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) I restarted YARN and dfs manually after restarting the cluster, however, I was unable to restart Tachyon and it fails when running ./bin/tachyon runTests, which might be the possible reason. was: With Spark 1.4.1 and YARN client mode, my application works at the first time the cluster is built. While if I stop and start the cluster with using spark-ec2, the same command fails. At the end of the spark logs, it's shown that it just keeps trying to connect to master node repeatedly: INFO Client: Retrying connect to server: ec2-54-174-232-129.compute-1.amazonaws.com/172.31.36.29:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) Unable to run spark after restarting cluster with spark-ec2 --- Key: SPARK-9711 URL: https://issues.apache.org/jira/browse/SPARK-9711 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.4.1 Reporter: Guangyang Li With Spark 1.4.1 and YARN client mode, my application works at the first time the cluster is built. While if I stop and start the cluster with using spark-ec2, the same command fails. At the end of the spark logs, it's shown that it just keeps trying to connect to master node repeatedly: INFO Client: Retrying connect to server: ec2-54-174-232-129.compute-1.amazonaws.com/172.31.36.29:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) I restarted YARN and dfs manually after restarting the cluster, however, I was unable to restart Tachyon and it fails when running ./bin/tachyon runTests, which might be the possible reason. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9663) ML Python API coverage issues found during 1.5 QA
[ https://issues.apache.org/jira/browse/SPARK-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659207#comment-14659207 ] Joseph K. Bradley edited comment on SPARK-9663 at 8/10/15 6:32 PM: --- (complete): Linked unfinished items from previous release [SPARK-7536] here. was (Author: josephkb): *TODO: We need to link unfinished items from [SPARK-7536] here (linked as contains those items).* ML Python API coverage issues found during 1.5 QA - Key: SPARK-9663 URL: https://issues.apache.org/jira/browse/SPARK-9663 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Joseph K. Bradley This umbrella is for a list of Python API coverage issues which we should fix for the 1.6 release cycle. This list is to be generated from issues found in [SPARK-9662] and from remaining issues from 1.4: [SPARK-7536]. Here we check and compare the Python and Scala API of MLlib/ML, add missing classes/methods/parameters for PySpark. * Missing classes for PySpark(ML): ** feature *** CountVectorizerModel SPARK-9769 *** DCT SPARK-9770 *** ElementwiseProduct SPARK-9768 *** MinMaxScaler SPARK-9771 *** StopWordsRemover SPARK-9679 *** VectorSlicer SPARK-9772 ** classification *** OneVsRest SPARK-7861 *** MultilayerPerceptronClassifier SPARK-9773 ** regression *** IsotonicRegression SPARK-9774 * Missing User Guide documents for PySpark SPARK-8757 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9574) Review the contents of uber JARs spark-streaming-XXX-assembly
[ https://issues.apache.org/jira/browse/SPARK-9574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9574: --- Assignee: Shixiong Zhu (was: Apache Spark) Review the contents of uber JARs spark-streaming-XXX-assembly - Key: SPARK-9574 URL: https://issues.apache.org/jira/browse/SPARK-9574 Project: Spark Issue Type: Task Components: Streaming Reporter: Tathagata Das Assignee: Shixiong Zhu It should not contain Spark core and its dependencies, especially the following. - Hadoop and its dependencies - Scala libraries -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680389#comment-14680389 ] Cheng Lian commented on SPARK-9340: --- [~damianguy] Would you mind to help reviewing [PR #8070|https://github.com/apache/spark/pull/8070] and check whether it works for your case? Thanks in advance! ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch -- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Attachments: ParquetTypesConverterTest.scala The way ParquetTypesConverter handles primitive repeated types results in an incompatible schema being used for querying data. For example, given a schema like so: message root { repeated int32 repeated_field; } Spark produces a read schema like: message root { optional int32 repeated_field; } These are incompatible and all attempts to read fail. In ParquetTypesConverter.toDataType: if (parquetType.isPrimitive) { toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp) } else {...} The if condition should also have !parquetType.isRepetition(Repetition.REPEATED) And then this case will need to be handled in the else -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
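The change suggested in the description can be sketched as follows. This is a minimal illustration rather than the actual ParquetTypesConverter patch: the parquet-mr package name assumes the 1.8 line, and the branch labels stand in for the real Catalyst conversions.
{code}
import org.apache.parquet.schema.Type
import org.apache.parquet.schema.Type.Repetition

// Classifies a Parquet type the way the report proposes: a repeated primitive
// must not fall into the plain-primitive branch. The real converter returns
// Catalyst DataTypes; strings are used here only to keep the sketch self-contained.
def classify(parquetType: Type): String = {
  if (parquetType.isPrimitive && !parquetType.isRepetition(Repetition.REPEATED)) {
    "plain primitive -> toPrimitiveDataType(...)"
  } else if (parquetType.isPrimitive) {
    "unannotated repeated primitive -> ArrayType(elementType, containsNull = false)"
  } else {
    "group type -> handled by the existing group/LIST/MAP rules"
  }
}
{code}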
[jira] [Commented] (SPARK-9782) Add support for YARN application tags running Spark on YARN
[ https://issues.apache.org/jira/browse/SPARK-9782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680431#comment-14680431 ] Dennis Huo commented on SPARK-9782: --- Correct, from what I understand, the node labels JIRA is a more heavyweight behavioral-change feature, for being able to control packing of requested containers onto machines based on node labels. YARN application tags are distinct from node labels, and are only used by workflow orchestrators on top of YARN, without affecting how YARN does packing at all. Add support for YARN application tags running Spark on YARN --- Key: SPARK-9782 URL: https://issues.apache.org/jira/browse/SPARK-9782 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.4.1 Reporter: Dennis Huo https://issues.apache.org/jira/browse/YARN-1390 originally added the new “Application Tags” feature to YARN to help track the sources of applications among many possible YARN clients. https://issues.apache.org/jira/browse/YARN-1399 improved on this to allow a set of tags to be applied, and for comparison, https://issues.apache.org/jira/browse/MAPREDUCE-5699 added support for MapReduce to easily propagate tags through to YARN via Configuration settings. Since the ApplicationSubmissionContext.setApplicationTags method was only added in Hadoop 2.4+, Spark support will invoke the method via reflection the same way other such version-specific methods are called in elsewhere in the YARN client. Since the usage of tags is generally not critical to the functionality of older YARN setups, it should be safe to handle NoSuchMethodException with just a logWarning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
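A minimal sketch of the reflection-based call described above, assuming the tags arrive from some Spark configuration key and using println in place of logWarning; the actual patch may wire this differently.
{code}
import scala.collection.JavaConverters._

// appContext is an o.a.h.yarn.api.records.ApplicationSubmissionContext; it is typed
// as AnyRef here so the sketch compiles without a Hadoop 2.4+ dependency.
def setTagsIfSupported(appContext: AnyRef, tags: Set[String]): Unit = {
  if (tags.nonEmpty) {
    try {
      val method = appContext.getClass
        .getMethod("setApplicationTags", classOf[java.util.Set[String]])
      method.invoke(appContext, new java.util.HashSet[String](tags.asJava))
    } catch {
      case _: NoSuchMethodException =>
        // Hadoop < 2.4 has no application tags; degrade gracefully instead of failing.
        println(s"WARN: ignoring application tags $tags; requires YARN 2.4+")
    }
  }
}
{code}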
[jira] [Updated] (SPARK-9340) CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-9340: -- Description: SPARK-6776 and SPARK-6777 followed {{parquet-avro}} to implement backwards-compatibility rules defined in {{parquet-format}} spec. However, both Spark SQL and {{parquet-avro}} neglected the following statement in {{parquet-format}}: {quote} This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a {{LIST}}- or {{MAP}}-annotated group nor annotated by {{LIST}} or {{MAP}} should be interpreted as a required list of required elements where the element type is the type of the field. {quote} One of the consequences is that, Parquet files generated by {{parquet-protobuf}} containing unannotated repeated fields are not correctly converted to Catalyst arrays. For example, the following Parquet schema {noformat} message root { repeated int32 f1 } {noformat} should be converted to {noformat} StructType(StructField(f1, ArrayType(IntegerType, containsNull = false), nullable = false) :: Nil) {noformat} But now it triggers an {{AnalysisException}}. was: The way ParquetTypesConverter handles primitive repeated types results in an incompatible schema being used for querying data. For example, given a schema like so: message root { repeated int32 repeated_field; } Spark produces a read schema like: message root { optional int32 repeated_field; } These are incompatible and all attempts to read fail. In ParquetTypesConverter.toDataType: if (parquetType.isPrimitive) { toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp) } else {...} The if condition should also have !parquetType.isRepetition(Repetition.REPEATED) And then this case will need to be handled in the else CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly --- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Attachments: ParquetTypesConverterTest.scala SPARK-6776 and SPARK-6777 followed {{parquet-avro}} to implement backwards-compatibility rules defined in {{parquet-format}} spec. However, both Spark SQL and {{parquet-avro}} neglected the following statement in {{parquet-format}}: {quote} This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a {{LIST}}- or {{MAP}}-annotated group nor annotated by {{LIST}} or {{MAP}} should be interpreted as a required list of required elements where the element type is the type of the field. {quote} One of the consequences is that, Parquet files generated by {{parquet-protobuf}} containing unannotated repeated fields are not correctly converted to Catalyst arrays. For example, the following Parquet schema {noformat} message root { repeated int32 f1 } {noformat} should be converted to {noformat} StructType(StructField(f1, ArrayType(IntegerType, containsNull = false), nullable = false) :: Nil) {noformat} But now it triggers an {{AnalysisException}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9450) [INVALID] HashedRelation.get() could return an Iterator[Row] instead of Seq[Row]
[ https://issues.apache.org/jira/browse/SPARK-9450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-9450: -- Summary: [INVALID] HashedRelation.get() could return an Iterator[Row] instead of Seq[Row] (was: HashedRelation.get() could return an Iterator[Row] instead of Seq[Row]) [INVALID] HashedRelation.get() could return an Iterator[Row] instead of Seq[Row] Key: SPARK-9450 URL: https://issues.apache.org/jira/browse/SPARK-9450 Project: Spark Issue Type: Improvement Components: SQL Reporter: Josh Rosen Assignee: Andrew Or While looking through some HashedRelation code, [~andrewor14] and I noticed that it looks like HashedRelation.get() could return an Iterator of rows instead of a sequence. If we do this, we can reduce object allocation in UnsafeHashedRelation.get(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9755) Add method documentation to MultivariateOnlineSummarizer
[ https://issues.apache.org/jira/browse/SPARK-9755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-9755. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8045 [https://github.com/apache/spark/pull/8045] Add method documentation to MultivariateOnlineSummarizer Key: SPARK-9755 URL: https://issues.apache.org/jira/browse/SPARK-9755 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Reporter: Feynman Liang Assignee: Feynman Liang Priority: Minor Fix For: 1.5.0 Docs present in 1.4 are lost in current 1.5 branch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9340) CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680492#comment-14680492 ] Damian Guy commented on SPARK-9340: --- Code looks good and it works as expected. Tests pass. Thanks for your assistance with this. CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly --- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Assignee: Cheng Lian Attachments: ParquetTypesConverterTest.scala SPARK-6776 and SPARK-6777 followed {{parquet-avro}} to implement backwards-compatibility rules defined in {{parquet-format}} spec. However, both Spark SQL and {{parquet-avro}} neglected the following statement in {{parquet-format}}: {quote} This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a {{LIST}}- or {{MAP}}-annotated group nor annotated by {{LIST}} or {{MAP}} should be interpreted as a required list of required elements where the element type is the type of the field. {quote} One of the consequences is that, Parquet files generated by {{parquet-protobuf}} containing unannotated repeated fields are not correctly converted to Catalyst arrays. For example, the following Parquet schema {noformat} message root { repeated int32 f1 } {noformat} should be converted to {noformat} StructType(StructField(f1, ArrayType(IntegerType, containsNull = false), nullable = false) :: Nil) {noformat} But now it triggers an {{AnalysisException}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9786) Test backpressure
Tathagata Das created SPARK-9786: Summary: Test backpressure Key: SPARK-9786 URL: https://issues.apache.org/jira/browse/SPARK-9786 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9787) Test for memory leaks using the streaming tests in spark-perf
[ https://issues.apache.org/jira/browse/SPARK-9787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-9787: - Assignee: Shixiong Zhu Test for memory leaks using the streaming tests in spark-perf - Key: SPARK-9787 URL: https://issues.apache.org/jira/browse/SPARK-9787 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Assignee: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9787) Test for memory leaks using the streaming tests in spark-perf
Tathagata Das created SPARK-9787: Summary: Test for memory leaks using the streaming tests in spark-perf Key: SPARK-9787 URL: https://issues.apache.org/jira/browse/SPARK-9787 Project: Spark Issue Type: Sub-task Reporter: Tathagata Das -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9785) HashPartitioning compatibility should consider expression ordering
[ https://issues.apache.org/jira/browse/SPARK-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680550#comment-14680550 ] Apache Spark commented on SPARK-9785: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/8074 HashPartitioning compatibility should consider expression ordering -- Key: SPARK-9785 URL: https://issues.apache.org/jira/browse/SPARK-9785 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker HashPartitioning compatibility is defined w.r.t the _set_ of expressions, but in other contexts the ordering of those expressions matters. This is illustrated by the following regression test: {code} test(HashPartitioning compatibility) { val expressions = Seq(Literal(2), Literal(3)) // Consider two HashPartitionings that have the same _set_ of hash expressions but which are // created with different orderings of those expressions: val partitioningA = HashPartitioning(expressions, 100) val partitioningB = HashPartitioning(expressions.reverse, 100) // These partitionings are not considered equal: assert(partitioningA != partitioningB) // However, they both satisfy the same clustered distribution: val distribution = ClusteredDistribution(expressions) assert(partitioningA.satisfies(distribution)) assert(partitioningB.satisfies(distribution)) // Both partitionings are compatible with and guarantee each other: assert(partitioningA.compatibleWith(partitioningB)) assert(partitioningB.compatibleWith(partitioningA)) assert(partitioningA.guarantees(partitioningB)) assert(partitioningB.guarantees(partitioningA)) // Given all of this, we would expect these partitionings to compute the same hashcode for // any given row: def computeHashCode(partitioning: HashPartitioning): Int = { val hashExprProj = new InterpretedMutableProjection(partitioning.expressions, Seq.empty) hashExprProj.apply(InternalRow.empty).hashCode() } assert(computeHashCode(partitioningA) === computeHashCode(partitioningB)) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9785) HashPartitioning compatibility should consider expression ordering
[ https://issues.apache.org/jira/browse/SPARK-9785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9785: --- Assignee: Josh Rosen (was: Apache Spark) HashPartitioning compatibility should consider expression ordering -- Key: SPARK-9785 URL: https://issues.apache.org/jira/browse/SPARK-9785 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker HashPartitioning compatibility is defined w.r.t the _set_ of expressions, but in other contexts the ordering of those expressions matters. This is illustrated by the following regression test: {code} test(HashPartitioning compatibility) { val expressions = Seq(Literal(2), Literal(3)) // Consider two HashPartitionings that have the same _set_ of hash expressions but which are // created with different orderings of those expressions: val partitioningA = HashPartitioning(expressions, 100) val partitioningB = HashPartitioning(expressions.reverse, 100) // These partitionings are not considered equal: assert(partitioningA != partitioningB) // However, they both satisfy the same clustered distribution: val distribution = ClusteredDistribution(expressions) assert(partitioningA.satisfies(distribution)) assert(partitioningB.satisfies(distribution)) // Both partitionings are compatible with and guarantee each other: assert(partitioningA.compatibleWith(partitioningB)) assert(partitioningB.compatibleWith(partitioningA)) assert(partitioningA.guarantees(partitioningB)) assert(partitioningB.guarantees(partitioningA)) // Given all of this, we would expect these partitionings to compute the same hashcode for // any given row: def computeHashCode(partitioning: HashPartitioning): Int = { val hashExprProj = new InterpretedMutableProjection(partitioning.expressions, Seq.empty) hashExprProj.apply(InternalRow.empty).hashCode() } assert(computeHashCode(partitioningA) === computeHashCode(partitioningB)) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9750) SparseMatrix should override equals
[ https://issues.apache.org/jira/browse/SPARK-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9750: - Assignee: Feynman Liang SparseMatrix should override equals --- Key: SPARK-9750 URL: https://issues.apache.org/jira/browse/SPARK-9750 Project: Spark Issue Type: Bug Components: MLlib Reporter: Feynman Liang Assignee: Feynman Liang Priority: Blocker [SparseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L479] should override equals to ensure that two instances of the same matrix are equal. This implementation should take into account the {{isTransposed}} flag and {{values}} may not be in the same order. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680277#comment-14680277 ] Cheng Lian commented on SPARK-9340: --- Ah, thanks a lot! I see the problem now. {{parquet-avro}} doesn't allow {{repeated}} fields outside {{LIST}} or {{MAP}}, and I was following {{parquet-avro}} when implementing all the compatibility rules. So I think the real problematic position is [here|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/CatalystSchemaConverter.scala#L102-L104] (and [here|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java#L217] in {{parquet-avro}}). This issue could have a simpler solution, especially the schema conversion part. Row converter needs bigger changes though. I'm working on a simplified version of PR #8063. Will attribute this issue to you since you spot this issue and #8063 inspired me a lot! ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch -- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Attachments: ParquetTypesConverterTest.scala The way ParquetTypesConverter handles primitive repeated types results in an incompatible schema being used for querying data. For example, given a schema like so: message root { repeated int32 repeated_field; } Spark produces a read schema like: message root { optional int32 repeated_field; } These are incompatible and all attempts to read fail. In ParquetTypesConverter.toDataType: if (parquetType.isPrimitive) { toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp) } else {...} The if condition should also have !parquetType.isRepetition(Repetition.REPEATED) And then this case will need to be handled in the else -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9781) KCL Workers should be configurable from Spark configuration
Anton Nekhaev created SPARK-9781: Summary: KCL Workers should be configurable from Spark configuration Key: SPARK-9781 URL: https://issues.apache.org/jira/browse/SPARK-9781 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.4.1 Reporter: Anton Nekhaev Currently the KinesisClientLibConfiguration for KCL Workers is created within the KinesisReceiver and the user is allowed to change only basic settings such as endpoint URL, stream name, credentials, etc. However, there is no way to tune some advanced settings, e.g. MaxRecords, IdleTimeBetweenReads, FailoverTime, etc. We can add these settings to the Spark configuration and parametrize KinesisClientLibConfiguration with them in KinesisReceiver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
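A sketch of the proposed wiring: read tuning knobs from SparkConf and apply them when building the KCL configuration. The spark.streaming.kinesis.* key names and the default values are placeholders; only the KinesisClientLibConfiguration builder methods come from the KCL API.
{code}
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration
import org.apache.spark.SparkConf

// Builds the KCL config from Spark configuration instead of hard-coding the
// advanced settings inside KinesisReceiver.
def buildKclConf(sparkConf: SparkConf, appName: String, streamName: String,
    workerId: String): KinesisClientLibConfiguration = {
  new KinesisClientLibConfiguration(
      appName, streamName, new DefaultAWSCredentialsProviderChain(), workerId)
    .withMaxRecords(sparkConf.getInt("spark.streaming.kinesis.maxRecords", 10000))
    .withIdleTimeBetweenReadsInMillis(
      sparkConf.getLong("spark.streaming.kinesis.idleTimeBetweenReadsMs", 1000L))
    .withFailoverTimeMillis(
      sparkConf.getLong("spark.streaming.kinesis.failoverTimeMs", 10000L))
}
{code}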
[jira] [Commented] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680281#comment-14680281 ] Damian Guy commented on SPARK-9340: --- Thanks. I'm sure there is a simpler solution to someone more familiar with the code! ;-) Thanks for looking further into it, appreciated. ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch -- Key: SPARK-9340 URL: https://issues.apache.org/jira/browse/SPARK-9340 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0 Reporter: Damian Guy Attachments: ParquetTypesConverterTest.scala The way ParquetTypesConverter handles primitive repeated types results in an incompatible schema being used for querying data. For example, given a schema like so: message root { repeated int32 repeated_field; } Spark produces a read schema like: message root { optional int32 repeated_field; } These are incompatible and all attempts to read fail. In ParquetTypesConverter.toDataType: if (parquetType.isPrimitive) { toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp) } else {...} The if condition should also have !parquetType.isRepetition(Repetition.REPEATED) And then this case will need to be handled in the else -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9788) LDA docConcentration, gammaShape 1.5 binary incompatibility fixes
[ https://issues.apache.org/jira/browse/SPARK-9788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680568#comment-14680568 ] Feynman Liang commented on SPARK-9788: -- Assign to me LDA docConcentration, gammaShape 1.5 binary incompatibility fixes - Key: SPARK-9788 URL: https://issues.apache.org/jira/browse/SPARK-9788 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley From [SPARK-9658]: 1. LDA.docConcentration It will be nice to keep the old APIs unchanged. Proposal: * Add “asymmetricDocConcentration” and revert docConcentration changes. * If the (internal) doc concentration vector is a single value, “getDocConcentration returns it. If it is a constant vector, getDocConcentration returns the first item, and fails otherwise. 2. LDAModel.gammaShape This should be given a default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
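A minimal sketch of the accessor behaviour proposed in the description, using a plain Array[Double] in place of the internal vector; the method and parameter names are assumptions, not the final API.
{code}
// Returns the symmetric concentration when one exists, and fails for a truly
// asymmetric vector, which callers would read via asymmetricDocConcentration instead.
def getDocConcentration(asymmetricDocConcentration: Array[Double]): Double = {
  asymmetricDocConcentration match {
    case Array(single) => single
    case arr if arr.nonEmpty && arr.forall(_ == arr.head) => arr.head
    case _ => throw new IllegalStateException(
      "docConcentration is asymmetric; use asymmetricDocConcentration instead")
  }
}
{code}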
[jira] [Commented] (SPARK-9790) [YARN] Expose in WebUI if NodeManager is the reason why executors were killed.
[ https://issues.apache.org/jira/browse/SPARK-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680570#comment-14680570 ] Mark Grover commented on SPARK-9790: I am working on this, will file a Work-In-Progress pull request soon. [YARN] Expose in WebUI if NodeManager is the reason why executors were killed. -- Key: SPARK-9790 URL: https://issues.apache.org/jira/browse/SPARK-9790 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.1 Reporter: Mark Grover When an executor is killed by YARN because it exceeds the memory overhead, the only thing Spark knows is that the executor is lost. The user has to search through the NodeManager logs to figure out that it was killed by YARN. It would be much nicer if the Spark driver could be notified why the executor was killed. Ideally it could both log an explanatory message and update the UI (and the event log) so that it was clear why the executor was lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9793) PySpark DenseVector, SparseVector should override __eq__
Joseph K. Bradley created SPARK-9793: Summary: PySpark DenseVector, SparseVector should override __eq__ Key: SPARK-9793 URL: https://issues.apache.org/jira/browse/SPARK-9793 Project: Spark Issue Type: Bug Components: ML, PySpark Affects Versions: 1.5.0 Reporter: Joseph K. Bradley Priority: Critical See [SPARK-9750]. PySpark DenseVector and SparseVector do not override the equality operator properly. They should use semantics, not representation, for comparison. (This is what Scala currently does.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9795) Dynamic allocation: avoid double counting when killing same executor
Andrew Or created SPARK-9795: Summary: Dynamic allocation: avoid double counting when killing same executor Key: SPARK-9795 URL: https://issues.apache.org/jira/browse/SPARK-9795 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Currently, if we kill the same executor twice in rapid succession, we will lower the executor target by 2 instead of 1. In cases where we don't re-adjust the target upwards frequently, this will result in jobs hanging. This may or may not be the same as SPARK-9745. Until we can verify the correlation, however, this will remain a separate issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
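A toy sketch of the guard the ticket calls for: decrement the target only once per executor id, however many kill requests arrive. The names are illustrative, not the ExecutorAllocationManager API.
{code}
import scala.collection.mutable

class TargetTracker(initialTarget: Int) {
  private var target = initialTarget
  private val pendingToRemove = mutable.Set.empty[String]

  def killExecutor(execId: String): Unit = {
    // Set.add returns false for a duplicate request, so the target is only
    // lowered the first time a given executor is asked to be killed.
    if (pendingToRemove.add(execId)) {
      target -= 1
    }
  }

  def currentTarget: Int = target
}
{code}
With this guard, killing "exec-1" twice in rapid succession lowers the target by 1 rather than 2, which is the behaviour the ticket describes as the fix.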
[jira] [Commented] (SPARK-9788) LDA docConcentration, gammaShape 1.5 binary incompatibility fixes
[ https://issues.apache.org/jira/browse/SPARK-9788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680666#comment-14680666 ] Joseph K. Bradley commented on SPARK-9788: -- Yeah, I guess we should revert getAlpha and setAlpha as well. We can add asymmetric versions. We can fix this duplication for the Pipelines API. LDA docConcentration, gammaShape 1.5 binary incompatibility fixes - Key: SPARK-9788 URL: https://issues.apache.org/jira/browse/SPARK-9788 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Feynman Liang From [SPARK-9658]: 1. LDA.docConcentration It will be nice to keep the old APIs unchanged. Proposal: * Add “asymmetricDocConcentration” and revert docConcentration changes. * If the (internal) doc concentration vector is a single value, “getDocConcentration returns it. If it is a constant vector, getDocConcentration returns the first item, and fails otherwise. 2. LDAModel.gammaShape This should be given a default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9784) Exchange.isUnsafe should check whether codegen and unsafe are enabled
[ https://issues.apache.org/jira/browse/SPARK-9784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-9784. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8073 [https://github.com/apache/spark/pull/8073] Exchange.isUnsafe should check whether codegen and unsafe are enabled - Key: SPARK-9784 URL: https://issues.apache.org/jira/browse/SPARK-9784 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Fix For: 1.5.0 Exchange needs to check whether unsafe mode is enabled in its {{tungstenMode}} method: {code} override def nodeName: String = if (tungstenMode) TungstenExchange else Exchange /** * Returns true iff we can support the data type, and we are not doing range partitioning. */ private lazy val tungstenMode: Boolean = { GenerateUnsafeProjection.canSupport(child.schema) !newPartitioning.isInstanceOf[RangePartitioning] } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
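Note that the plain-text mail dropped the {{&&}} joining the two checks in the snippet above. A self-contained sketch of the amended condition follows; the boolean parameters stand in for the SQLConf flags and the existing schema/partitioning tests, whose exact accessors inside Exchange are assumptions here.
{code}
// Tungsten exchange should only be reported when unsafe and codegen are both on,
// in addition to the schema-support and non-range-partitioning checks that were
// already present.
def tungstenMode(
    unsafeEnabled: Boolean,
    codegenEnabled: Boolean,
    schemaSupported: Boolean,
    isRangePartitioning: Boolean): Boolean = {
  unsafeEnabled && codegenEnabled && schemaSupported && !isRangePartitioning
}
{code}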
[jira] [Updated] (SPARK-9794) ISO DateTime parser is too strict
[ https://issues.apache.org/jira/browse/SPARK-9794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-9794: -- Affects Version/s: 1.2.2 1.3.1 1.4.1 ISO DateTime parser is too strict - Key: SPARK-9794 URL: https://issues.apache.org/jira/browse/SPARK-9794 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.0 Reporter: Alex Angelini The DateTime parser requires 3 millisecond digits, but that is not part of the official ISO8601 spec. https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L132 https://en.wikipedia.org/wiki/ISO_8601 This results in the following exception when trying to parse datetime columns {code} java.text.ParseException: Unparseable date: 0001-01-01T00:00:00GMT-00:00 {code} [~joshrosen] [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
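The strictness is easy to reproduce with java.text.SimpleDateFormat. The pattern below mirrors the ".SSS" requirement described above; treat the exact pattern used at the linked line as an assumption.
{code}
import java.text.SimpleDateFormat

val parser = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSz")
parser.parse("2015-08-10T12:00:00.000GMT-00:00")  // parses
parser.parse("0001-01-01T00:00:00GMT-00:00")      // throws java.text.ParseException
{code}
A fix would need a fallback (or a less strict pattern) for timestamps that omit the millisecond digits, since ISO 8601 does not require them.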
[jira] [Commented] (SPARK-9794) ISO DateTime parser is too strict
[ https://issues.apache.org/jira/browse/SPARK-9794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680688#comment-14680688 ] Josh Rosen commented on SPARK-9794: --- The same code exists in 1.4.0 and 1.4.1: https://github.com/apache/spark/blob/v1.4.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateUtils.scala#L86 It's also present in 1.3.0 / 1.3.1: https://github.com/apache/spark/blob/v1.3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataTypeConversions.scala#L66 And in 1.2.x: https://github.com/apache/spark/blob/v1.2.2/sql/core/src/main/scala/org/apache/spark/sql/types/util/DataTypeConversions.scala#L158 Here's the pull request that originally added that line: https://github.com/apache/spark/pull/3012 ISO DateTime parser is too strict - Key: SPARK-9794 URL: https://issues.apache.org/jira/browse/SPARK-9794 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.0 Reporter: Alex Angelini The DateTime parser requires 3 millisecond digits, but that is not part of the official ISO8601 spec. https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L132 https://en.wikipedia.org/wiki/ISO_8601 This results in the following exception when trying to parse datetime columns {code} java.text.ParseException: Unparseable date: 0001-01-01T00:00:00GMT-00:00 {code} [~joshrosen] [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9795) Dynamic allocation: avoid double counting when killing same executor twice
[ https://issues.apache.org/jira/browse/SPARK-9795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-9795: - Summary: Dynamic allocation: avoid double counting when killing same executor twice (was: Dynamic allocation: avoid double counting when killing same executor) Dynamic allocation: avoid double counting when killing same executor twice -- Key: SPARK-9795 URL: https://issues.apache.org/jira/browse/SPARK-9795 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Currently, if we kill the same executor twice in rapid succession, we will lower the executor target by 2 instead of 1. In cases where we don't re-adjust the target upwards frequently, this will result in jobs hanging. This may or may not be the same as SPARK-9745. Until we can verify the correlation, however, this will remain a separate issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9795) Dynamic allocation: avoid double counting when killing same executor
[ https://issues.apache.org/jira/browse/SPARK-9795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9795: --- Assignee: Andrew Or (was: Apache Spark) Dynamic allocation: avoid double counting when killing same executor Key: SPARK-9795 URL: https://issues.apache.org/jira/browse/SPARK-9795 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Currently, if we kill the same executor twice in rapid succession, we will lower the executor target by 2 instead of 1. In cases where we don't re-adjust the target upwards frequently, this will result in jobs hanging. This may or may not be the same as SPARK-9745. Until we can verify the correlation, however, this will remain a separate issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9795) Dynamic allocation: avoid double counting when killing same executor
[ https://issues.apache.org/jira/browse/SPARK-9795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14680738#comment-14680738 ] Apache Spark commented on SPARK-9795: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/8078 Dynamic allocation: avoid double counting when killing same executor Key: SPARK-9795 URL: https://issues.apache.org/jira/browse/SPARK-9795 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Currently, if we kill the same executor twice in rapid succession, we will lower the executor target by 2 instead of 1. In cases where we don't re-adjust the target upwards frequently, this will result in jobs hanging. This may or may not be the same as SPARK-9745. Until we can verify the correlation, however, this will remain a separate issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9791) Review API for developer and experimental tags
Tathagata Das created SPARK-9791: Summary: Review API for developer and experimental tags Key: SPARK-9791 URL: https://issues.apache.org/jira/browse/SPARK-9791 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9750) DenseMatrix, SparseMatrix should override equals
[ https://issues.apache.org/jira/browse/SPARK-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9750: - Description: [SparseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L479] should override equals to ensure that two instances of the same matrix are equal. Same for DenseMatrix. This implementation should take into account the {{isTransposed}} flag and {{values}} may not be in the same order. was: [SparseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L479] should override equals to ensure that two instances of the same matrix are equal. This implementation should take into account the {{isTransposed}} flag and {{values}} may not be in the same order. DenseMatrix, SparseMatrix should override equals Key: SPARK-9750 URL: https://issues.apache.org/jira/browse/SPARK-9750 Project: Spark Issue Type: Bug Components: MLlib Reporter: Feynman Liang Assignee: Feynman Liang Priority: Blocker [SparseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L479] should override equals to ensure that two instances of the same matrix are equal. Same for DenseMatrix. This implementation should take into account the {{isTransposed}} flag and {{values}} may not be in the same order. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
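A hedged sketch of the semantic comparison the ticket asks for, written as a standalone helper over the public Matrix API; the real change would override equals on DenseMatrix and SparseMatrix themselves.
{code}
import org.apache.spark.mllib.linalg.Matrix

// toArray materializes entries in column-major order for both layouts, so the
// comparison is unaffected by isTransposed or by the physical ordering of `values`.
def semanticallyEqual(a: Matrix, b: Matrix): Boolean = {
  a.numRows == b.numRows &&
    a.numCols == b.numCols &&
    java.util.Arrays.equals(a.toArray, b.toArray)
}
{code}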
[jira] [Updated] (SPARK-9750) DenseMatrix, SparseMatrix should override equals
[ https://issues.apache.org/jira/browse/SPARK-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9750: - Summary: DenseMatrix, SparseMatrix should override equals (was: SparseMatrix should override equals) DenseMatrix, SparseMatrix should override equals Key: SPARK-9750 URL: https://issues.apache.org/jira/browse/SPARK-9750 Project: Spark Issue Type: Bug Components: MLlib Reporter: Feynman Liang Assignee: Feynman Liang Priority: Blocker [SparseMatrix|https://github.com/apache/spark/blob/9897cc5e3d6c70f7e45e887e2c6fc24dfa1adada/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L479] should override equals to ensure that two instances of the same matrix are equal. This implementation should take into account the {{isTransposed}} flag and {{values}} may not be in the same order. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9766) check and add missing docs for PySpark ML
[ https://issues.apache.org/jira/browse/SPARK-9766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9766: - Target Version/s: 1.5.0 check and add missing docs for PySpark ML - Key: SPARK-9766 URL: https://issues.apache.org/jira/browse/SPARK-9766 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 1.5.0 Reporter: Yanbo Liang Assignee: Yanbo Liang Check and add missing docs for PySpark ML (this issue only checks missing docs for o.a.s.ml, not o.a.s.mllib). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org