[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan updated SPARK-3880: --- Attachment: SparkSQLOnHBase_v2.docx Version 2 > HBase as data source to SparkSQL > > > Key: SPARK-3880 > URL: https://issues.apache.org/jira/browse/SPARK-3880 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Yan >Assignee: Yan > Attachments: HBaseOnSpark.docx, SparkSQLOnHBase_v2.docx > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3880) HBase as data source to SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281236#comment-14281236 ] Apache Spark commented on SPARK-3880: - User 'yzhou2001' has created a pull request for this issue: https://github.com/apache/spark/pull/4084 > HBase as data source to SparkSQL > > > Key: SPARK-3880 > URL: https://issues.apache.org/jira/browse/SPARK-3880 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Yan >Assignee: Yan > Attachments: HBaseOnSpark.docx > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5296) Predicate Pushdown (BaseRelation) to have an interface that will accept OR filters
Corey J. Nolet created SPARK-5296: - Summary: Predicate Pushdown (BaseRelation) to have an interface that will accept OR filters Key: SPARK-5296 URL: https://issues.apache.org/jira/browse/SPARK-5296 Project: Spark Issue Type: Improvement Components: SQL Reporter: Corey J. Nolet Currently, the BaseRelation API allows a FilteredRelation to handle an Array[Filter], which represents filter expressions that are implicitly combined with AND. We should support OR operations in a BaseRelation as well. I'm not sure what this would look like in terms of API changes, but it almost seems like a FilteredUnionedScan BaseRelation (the name stinks but you get the idea) would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
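One way the API change could look (a hedged sketch only; the `TreeFilteredScan` name, the single-tree `buildScan` signature, and the filter case classes below are illustrative, not an actual Spark interface):

```scala
// Illustrative filter algebra: today's Array[Filter] is an implicit AND of
// its elements, so a data source never sees an OR between predicates.
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter
case class And(left: Filter, right: Filter) extends Filter
case class Or(left: Filter, right: Filter) extends Filter

// Roughly today's shape: filters(0) AND filters(1) AND ...
// (the real Data Sources API returns an RDD[Row]; elided here)
trait FilteredScan {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): Unit
}

// Sketch of an OR-capable variant: hand the source one filter *tree*,
// so ORs (and nested AND/OR combinations) survive pushdown intact.
trait TreeFilteredScan {
  def buildScan(requiredColumns: Array[String], filter: Option[Filter]): Unit
}
```

A source that cannot push a given subtree down could still fall back to returning a superset of rows and letting Spark re-apply the filter, which is how AND-only pushdown already behaves.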
[jira] [Updated] (SPARK-5295) Only expose leaf data types
[ https://issues.apache.org/jira/browse/SPARK-5295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5295: --- Description: 1. We expose all the stuff in data types right now, including NumericTypes, etc. These should be hidden from users. We should only expose the leaf types. 2. Remove DeveloperAPI tag from the common types. was:We expose all the stuff in data types right now, including NumericTypes, etc. These should be hidden from users. > Only expose leaf data types > --- > > Key: SPARK-5295 > URL: https://issues.apache.org/jira/browse/SPARK-5295 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > 1. We expose all the stuff in data types right now, including NumericTypes, > etc. These should be hidden from users. We should only expose the leaf types. > 2. Remove DeveloperAPI tag from the common types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5295) Stabilize data types
[ https://issues.apache.org/jira/browse/SPARK-5295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5295: --- Summary: Stabilize data types (was: Only expose leaf data types) > Stabilize data types > > > Key: SPARK-5295 > URL: https://issues.apache.org/jira/browse/SPARK-5295 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > 1. We expose all the stuff in data types right now, including NumericTypes, > etc. These should be hidden from users. We should only expose the leaf types. > 2. Remove DeveloperAPI tag from the common types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5295) Only expose leaf data types
Reynold Xin created SPARK-5295: -- Summary: Only expose leaf data types Key: SPARK-5295 URL: https://issues.apache.org/jira/browse/SPARK-5295 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin We expose all the stuff in data types right now, including NumericTypes, etc. These should be hidden from users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5193) Make Spark SQL API usable in Java and remove the Java-specific API
[ https://issues.apache.org/jira/browse/SPARK-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5193. Resolution: Fixed Fix Version/s: 1.3.0 > Make Spark SQL API usable in Java and remove the Java-specific API > -- > > Key: SPARK-5193 > URL: https://issues.apache.org/jira/browse/SPARK-5193 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.3.0 > > > Java version of the SchemaRDD API causes high maintenance burden for Spark > SQL itself and downstream libraries (e.g. MLlib pipeline API needs to support > both JavaSchemaRDD and SchemaRDD). We can audit the Scala API and make it > usable for Java, and then we can remove the Java specific version. > Things to remove include (Java version of): > - data type > - Row > - SQLContext > - HiveContext > Things to consider: > - Scala and Java have a different collection library. > - Scala and Java (8) have different closure interface. > - Scala and Java can have duplicate definitions of common classes, such as > BigDecimal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5278) check ambiguous reference to fields in Spark SQL is incomplete
[ https://issues.apache.org/jira/browse/SPARK-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-5278: --- Summary: check ambiguous reference to fields in Spark SQL is incomplete (was: ambiguous reference to fields in Spark SQL is incomplete) > check ambiguous reference to fields in Spark SQL is incomplete > --- > > Key: SPARK-5278 > URL: https://issues.apache.org/jira/browse/SPARK-5278 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > > In HiveContext, > for a JSON string like > {code}{"a": {"b": 1, "B": 2}}{code} > the SQL `SELECT a.b FROM t` will report an error about an ambiguous reference to > fields. > But for a JSON string like > {code}{"a": [{"b": 1, "B": 2}]}{code} > the SQL `SELECT a[0].b FROM t` will pass and pick the first `b` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5294) Hide tables in AllStagePages for "Active Stages, Completed Stages and Failed Stages" when they are empty
[ https://issues.apache.org/jira/browse/SPARK-5294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281173#comment-14281173 ] Apache Spark commented on SPARK-5294: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/4083 > Hide tables in AllStagePages for "Active Stages, Completed Stages and Failed > Stages" when they are empty > > > Key: SPARK-5294 > URL: https://issues.apache.org/jira/browse/SPARK-5294 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta > > Related to SPARK-5228, AllStagesPage also should hide the table for > ActiveStages, CompleteStages and FailedStages when they are empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5293) Enable Spark user applications to use different versions of Akka
[ https://issues.apache.org/jira/browse/SPARK-5293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5293: --- Description: A lot of Spark user applications are using (or want to use) Akka. Akka as a whole can contribute great architectural simplicity and uniformity. However, because Spark depends on Akka, it is not possible for users to rely on different versions, and we have received many requests in the past asking for help about this specific issue. For example, Spark Streaming might be used as the receiver of Akka messages - but our dependency on Akka requires the upstream Akka actors to also use the identical version of Akka. Since our usage of Akka is limited (mainly for RPC and single-threaded event loop), we can replace it with alternative RPC implementations and a common event loop in Spark. was: A lot of Spark user applications are using (or want to use) Akka. Akka as a whole can contribute great architectural simplicity and uniformity. However, because Spark depends on Akka, it is not possible for users to rely on different versions, and we have received many requests in the past asking for help about this specific issue. Since our usage of Akka is limited (mainly for RPC and single-threaded event loop), we can replace it with alternative RPC implementations and a common event loop in Spark. > Enable Spark user applications to use different versions of Akka > > > Key: SPARK-5293 > URL: https://issues.apache.org/jira/browse/SPARK-5293 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Reynold Xin > > A lot of Spark user applications are using (or want to use) Akka. Akka as a > whole can contribute great architectural simplicity and uniformity. However, > because Spark depends on Akka, it is not possible for users to rely on > different versions, and we have received many requests in the past asking for > help about this specific issue. 
For example, Spark Streaming might be used as > the receiver of Akka messages - but our dependency on Akka requires the > upstream Akka actors to also use the identical version of Akka. > Since our usage of Akka is limited (mainly for RPC and single-threaded event > loop), we can replace it with alternative RPC implementations and a common > event loop in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5293) Enable Spark user applications to use different versions of Akka
[ https://issues.apache.org/jira/browse/SPARK-5293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5293: --- Description: A lot of Spark user applications are using (or want to use) Akka. Akka as a whole can contribute great architectural simplicity and uniformity. However, because Spark depends on Akka, it is not possible for users to rely on different versions, and we have received many requests in the past asking for help about this specific issue. Since our usage of Akka is limited (mainly for RPC and single-threaded event loop), we can replace it with alternative RPC implementations and a common event loop in Spark. was: A lot of Spark user applications are using (or want to use) Akka. Akka as a whole can contribute great architectural simplicity and unification. However, because Spark depends on Akka, it is not possible for users to rely on different versions, and we have received many requests in the past asking for help about this specific issue. Since our usage of Akka is limited (mainly for RPC and single-threaded event loop), we can replace it with alternative RPC implementations and a common event loop in Spark. > Enable Spark user applications to use different versions of Akka > > > Key: SPARK-5293 > URL: https://issues.apache.org/jira/browse/SPARK-5293 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Reynold Xin > > A lot of Spark user applications are using (or want to use) Akka. Akka as a > whole can contribute great architectural simplicity and uniformity. However, > because Spark depends on Akka, it is not possible for users to rely on > different versions, and we have received many requests in the past asking for > help about this specific issue. > Since our usage of Akka is limited (mainly for RPC and single-threaded event > loop), we can replace it with alternative RPC implementations and a common > event loop in Spark. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5294) Hide tables in AllStagePages for "Active Stages, Completed Stages and Failed Stages" when they are empty
Kousuke Saruta created SPARK-5294: - Summary: Hide tables in AllStagePages for "Active Stages, Completed Stages and Failed Stages" when they are empty Key: SPARK-5294 URL: https://issues.apache.org/jira/browse/SPARK-5294 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.3.0 Reporter: Kousuke Saruta Related to SPARK-5228, AllStagesPage also should hide the table for ActiveStages, CompleteStages and FailedStages when they are empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5251) Using `tableIdentifier` in hive metastore
[ https://issues.apache.org/jira/browse/SPARK-5251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281170#comment-14281170 ] Apache Spark commented on SPARK-5251: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/4062 > Using `tableIdentifier` in hive metastore > -- > > Key: SPARK-5251 > URL: https://issues.apache.org/jira/browse/SPARK-5251 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.2.0 >Reporter: wangfei > > Using `tableIdentifier` in hive metastore -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't
[ https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281168#comment-14281168 ] Xuefu Zhang commented on SPARK-1021: This problem also occurred in Hive on Spark (HIVE-9370). Could we take this forward? > sortByKey() launches a cluster job when it shouldn't > > > Key: SPARK-1021 > URL: https://issues.apache.org/jira/browse/SPARK-1021 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 0.8.0, 0.9.0, 1.0.0, 1.1.0 >Reporter: Andrew Ash >Assignee: Erik Erlandson > Labels: starter > > The sortByKey() method is listed as a transformation, not an action, in the > documentation. But it launches a cluster job regardless. > http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html > Some discussion on the mailing list suggested that this is a problem with the > rdd.count() call inside Partitioner.scala's rangeBounds method. > https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102 > Josh Rosen suggests that rangeBounds should be made into a lazy variable: > {quote} > I wonder whether making RangePartitioner.rangeBounds into a lazy val would > fix this > (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95). > We'd need to make sure that rangeBounds() is never called before an action > is performed. This could be tricky because it's called in the > RangePartitioner.equals() method. Maybe it's sufficient to just compare the > number of partitions, the ids of the RDDs used to create the > RangePartitioner, and the sort ordering. This still supports the case where > I range-partition one RDD and pass the same partitioner to a different RDD. 
> It breaks support for the case where two range partitioners created on > different RDDs happened to have the same rangeBounds(), but it seems unlikely > that this would really harm performance since it's probably unlikely that the > range partitioners are equal by chance. > {quote} > Can we please make this happen? I'll send a PR on GitHub to start the > discussion and testing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
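Josh Rosen's suggestion can be sketched like this (an illustrative toy, not Spark's actual Partitioner code; every name below is invented for the example):

```scala
// Sketch: defer computing range bounds until the partitioner is actually
// used by an action, and keep equals() from forcing the computation.
class LazyRangePartitioner[K](val partitions: Int,
                              val rddId: Int,
                              sample: () => Array[K])(implicit ord: Ordering[K]) {

  // Computing range bounds samples/counts the RDD, i.e. launches a job,
  // so make it lazy: nothing runs until getPartition is first called.
  lazy val rangeBounds: Array[K] = sample()

  def getPartition(key: K): Int = {
    val i = rangeBounds.indexWhere(b => ord.lteq(key, b))
    if (i == -1) rangeBounds.length else i
  }

  // equals() must not force rangeBounds, so compare cheap identity fields
  // (partition count, source RDD id) instead. The trade-off noted in the
  // quote: partitioners built on different RDDs compare unequal even when
  // their bounds would have matched by chance.
  override def equals(other: Any): Boolean = other match {
    case o: LazyRangePartitioner[_] =>
      o.partitions == partitions && o.rddId == rddId
    case _ => false
  }

  override def hashCode: Int = (partitions, rddId).hashCode
}
```

With this shape, merely constructing the partitioner (as `sortByKey()` does when building its transformation) no longer launches a job; the first action that calls `getPartition` pays the cost instead.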
[jira] [Updated] (SPARK-5214) Add EventLoop and change DAGScheduler to an EventLoop
[ https://issues.apache.org/jira/browse/SPARK-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5214: --- Assignee: Shixiong Zhu > Add EventLoop and change DAGScheduler to an EventLoop > - > > Key: SPARK-5214 > URL: https://issues.apache.org/jira/browse/SPARK-5214 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > As per discussion in SPARK-5124, DAGScheduler can simply use a queue & event > loop to process events. It would be great when we want to decouple Akka in > the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
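The queue-and-event-loop idea can be sketched minimally (assumed shape only; the `EventLoop` eventually added for SPARK-5214 may differ in detail):

```scala
import java.util.concurrent.LinkedBlockingQueue

// A single daemon thread drains a blocking queue, giving the DAGScheduler
// single-threaded event handling without depending on an Akka actor.
abstract class EventLoop[E](name: String) {
  private val queue = new LinkedBlockingQueue[E]()
  @volatile private var stopped = false

  private val thread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit =
      try {
        while (!stopped) {
          val event = queue.take() // blocks until an event is posted
          if (!stopped) onReceive(event)
        }
      } catch {
        case _: InterruptedException => () // interrupted by stop(); exit quietly
      }
  }

  def start(): Unit = thread.start()
  def post(event: E): Unit = queue.put(event) // callers just enqueue and return
  def stop(): Unit = { stopped = true; thread.interrupt() }

  protected def onReceive(event: E): Unit // subclass handles one event at a time
}
```

Because all events funnel through one thread, `onReceive` implementations need no locking, which is the property the DAGScheduler relies on today via its actor.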
[jira] [Created] (SPARK-5293) Enable Spark user applications to use different versions of Akka
Reynold Xin created SPARK-5293: -- Summary: Enable Spark user applications to use different versions of Akka Key: SPARK-5293 URL: https://issues.apache.org/jira/browse/SPARK-5293 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Reynold Xin A lot of Spark user applications are using (or want to use) Akka. Akka as a whole can contribute great architectural simplicity and unification. However, because Spark depends on Akka, it is not possible for users to rely on different versions, and we have received many requests in the past asking for help about this specific issue. Since our usage of Akka is limited (mainly for RPC and single-threaded event loop), we can replace it with alternative RPC implementations and a common event loop in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5292) optimize join for table that are already sharded/support for hive bucket
gagan taneja created SPARK-5292: --- Summary: optimize join for table that are already sharded/support for hive bucket Key: SPARK-5292 URL: https://issues.apache.org/jira/browse/SPARK-5292 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.2.0 Reporter: gagan taneja Currently joins do not consider the locality of the data and perform the shuffle anyway. If the user takes the responsibility of distributing the data based on some hash, or has sharded the data, Spark joins should be able to leverage the sharding to optimize the join calculation and eliminate the shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5291) Add timestamp and reason why an executor is removed to SparkListenerExecutorAdded and SparkListenerExecutorRemoved
[ https://issues.apache.org/jira/browse/SPARK-5291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281143#comment-14281143 ] Apache Spark commented on SPARK-5291: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/4082 > Add timestamp and reason why an executor is removed to > SparkListenerExecutorAdded and SparkListenerExecutorRemoved > -- > > Key: SPARK-5291 > URL: https://issues.apache.org/jira/browse/SPARK-5291 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta > > Recently SparkListenerExecutorAdded and SparkListenerExecutorRemoved are > added. > I think it's useful if they have timestamp and the reason why an executor is > removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5291) Add timestamp and reason why an executor is removed to SparkListenerExecutorAdded and SparkListenerExecutorRemoved
[ https://issues.apache.org/jira/browse/SPARK-5291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-5291: -- Description: Recently SparkListenerExecutorAdded and SparkListenerExecutorRemoved are added. I think it's useful if they have timestamp and the reason why an executor is removed. was:Recently SparkListenerExecutorAdded and SparkListenerExecutorRemoved are added but I think it's useful if they have timestamp and the reason why an executor is removed. > Add timestamp and reason why an executor is removed to > SparkListenerExecutorAdded and SparkListenerExecutorRemoved > -- > > Key: SPARK-5291 > URL: https://issues.apache.org/jira/browse/SPARK-5291 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta > > Recently SparkListenerExecutorAdded and SparkListenerExecutorRemoved are > added. > I think it's useful if they have timestamp and the reason why an executor is > removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5291) Add timestamp and reason why an executor is removed to SparkListenerExecutorAdded and SparkListenerExecutorRemoved
[ https://issues.apache.org/jira/browse/SPARK-5291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-5291: -- Summary: Add timestamp and reason why an executor is removed to SparkListenerExecutorAdded and SparkListenerExecutorRemoved (was: Add timestamp and reason why an executor is removed) > Add timestamp and reason why an executor is removed to > SparkListenerExecutorAdded and SparkListenerExecutorRemoved > -- > > Key: SPARK-5291 > URL: https://issues.apache.org/jira/browse/SPARK-5291 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta > > Recently SparkListenerExecutorAdded and SparkListenerExecutorRemoved are > added but I think it's useful if they have timestamp and the reason why an > executor is removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5291) Add timestamp and reason why an executor is removed
Kousuke Saruta created SPARK-5291: - Summary: Add timestamp and reason why an executor is removed Key: SPARK-5291 URL: https://issues.apache.org/jira/browse/SPARK-5291 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Kousuke Saruta Recently SparkListenerExecutorAdded and SparkListenerExecutorRemoved are added but I think it's useful if they have timestamp and the reason why an executor is removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5282) RowMatrix easily gets int overflow in the memory size warning
[ https://issues.apache.org/jira/browse/SPARK-5282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5282: - Assignee: yuhao yang > RowMatrix easily gets int overflow in the memory size warning > - > > Key: SPARK-5282 > URL: https://issues.apache.org/jira/browse/SPARK-5282 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.2.0 > Environment: centos, others should be similar >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > The warning in RowMatrix will easily get an int overflow when the number of > columns is larger than 16385. > Minor issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
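The overflow arithmetic, for reference: the warning presumably estimates a matrix size as something like `cols * cols * 8` bytes, and with 32-bit `Int` math that product exceeds `Int.MaxValue` exactly when `cols >= 16385`:

```scala
val cols = 16385
// 16385 * 16385 * 8 = 2147745800, just past Int.MaxValue (2147483647),
// so 32-bit arithmetic wraps around to a negative number.
val overflowed: Int = cols * cols * 8        // -2147221496
val correct: Long   = cols.toLong * cols * 8 // 2147745800 bytes, ~2.0 GB
```

Promoting the first operand to `Long` before multiplying is the usual one-character fix.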
[jira] [Commented] (SPARK-5287) NativeType.defaultSizeOf should have default sizes of all NativeTypes.
[ https://issues.apache.org/jira/browse/SPARK-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280986#comment-14280986 ] Apache Spark commented on SPARK-5287: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/4081 > NativeType.defaultSizeOf should have default sizes of all NativeTypes. > -- > > Key: SPARK-5287 > URL: https://issues.apache.org/jira/browse/SPARK-5287 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > Otherwise, we will fail to do stats estimation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5289) Backport publishing of repl, yarn into branch-1.2
[ https://issues.apache.org/jira/browse/SPARK-5289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280927#comment-14280927 ] Apache Spark commented on SPARK-5289: - User 'pwendell' has created a pull request for this issue: https://github.com/apache/spark/pull/4079 > Backport publishing of repl, yarn into branch-1.2 > - > > Key: SPARK-5289 > URL: https://issues.apache.org/jira/browse/SPARK-5289 > Project: Spark > Issue Type: Improvement >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Blocker > > In SPARK-3452 we did some clean-up of published artifacts that turned out to > adversely affect some users. This has been mostly patched up in master via > SPARK-4925 (hive-thritserver) which was backported. For the repl and yarn > modules, they were fixed in SPARK-4048 as part of a larger change that only > went into master. > Those pieces should be backported to Spark 1.2 to allow publishing in a 1.2.1 > release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5290) Executing functions in sparkSQL registered in sqlcontext gives scala.reflect.internal.MissingRequirementError: class org.apache.spark.sql.catalyst.ScalaReflection
Manoj Samel created SPARK-5290: -- Summary: Executing functions in sparkSQL registered in sqlcontext gives scala.reflect.internal.MissingRequirementError: class org.apache.spark.sql.catalyst.ScalaReflection Key: SPARK-5290 URL: https://issues.apache.org/jira/browse/SPARK-5290 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Environment: Spark 1.2 on centos or Mac Reporter: Manoj Samel Register a function using sqlContext.registerFunction and then use that function in sparkSQL. The execution gives the following stack trace in Spark 1.2 (this works in Spark 1.1.1):
at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
at org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
at scala.reflect.api.Universe.typeOf(Universe.scala:59)
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
at org.apache.spark.sql.UDFRegistration$class.builder$2(UdfRegistration.scala:91)
at org.apache.spark.sql.UDFRegistration$$anonfun$registerFunction$1.apply(UdfRegistration.scala:92)
at org.apache.spark.sql.UDFRegistration$$anonfun$registerFunction$1.apply(UdfRegistration.scala:92)
at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:53)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:220)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:218)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:191)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:147)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:71)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1$$anonfun$apply$1.apply(QueryPlan.scala:85)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:84)
at scala.
[jira] [Updated] (SPARK-5289) Backport publishing of repl, yarn into branch-1.2
[ https://issues.apache.org/jira/browse/SPARK-5289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5289: --- Description: In SPARK-3452 we did some clean-up of published artifacts that turned out to adversely affect some users. This has been mostly patched up in master via SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn modules, they were fixed in SPARK-4048 as part of a larger change that only went into master. Those pieces should be backported to Spark 1.2 to allow publishing in a 1.2.1 release. was: In SPARK-3452 we did some clean-up of published artifacts that turned out to adversely affect some users. This has been mostly patched up in master via SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn modules, they were fixed in SPARK-4048 as part of a larger change that only went into master. Those pieces should be backported. > Backport publishing of repl, yarn into branch-1.2 > - > > Key: SPARK-5289 > URL: https://issues.apache.org/jira/browse/SPARK-5289 > Project: Spark > Issue Type: Improvement >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Blocker > > In SPARK-3452 we did some clean-up of published artifacts that turned out to > adversely affect some users. This has been mostly patched up in master via > SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn > modules, they were fixed in SPARK-4048 as part of a larger change that only > went into master. > Those pieces should be backported to Spark 1.2 to allow publishing in a 1.2.1 > release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5289) Backport publishing of repl, yarn into branch-1.2
[ https://issues.apache.org/jira/browse/SPARK-5289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5289: --- Description: In SPARK-3452 we did some clean-up of published artifacts that turned out to adversely affect some users. This has been mostly patched up in master via SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn modules, they were fixed in SPARK-4048 as part of a larger change that only went into master. (was: In SPARK-3452 we did some clean-up of published artifacts that turned out to adversely affect some users. This has been mostly patched up in master via SPARK-4925 (hive-thriftserver), SPARK-4048 (which inadvertently did this for yarn and repl). But we should go in branch 1.2 and fix this as well so that we can do a 1.2.1 release with these artifacts.) > Backport publishing of repl, yarn into branch-1.2 > - > > Key: SPARK-5289 > URL: https://issues.apache.org/jira/browse/SPARK-5289 > Project: Spark > Issue Type: Improvement >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Blocker > > In SPARK-3452 we did some clean-up of published artifacts that turned out to > adversely affect some users. This has been mostly patched up in master via > SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn > modules, they were fixed in SPARK-4048 as part of a larger change that only > went into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5289) Backport publishing of repl, yarn into branch-1.2
[ https://issues.apache.org/jira/browse/SPARK-5289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5289: --- Summary: Backport publishing of repl, yarn into branch-1.2 (was: Backport publishing of repl, yarn, and hive-thriftserver into branch-1.2) > Backport publishing of repl, yarn into branch-1.2 > - > > Key: SPARK-5289 > URL: https://issues.apache.org/jira/browse/SPARK-5289 > Project: Spark > Issue Type: Improvement >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Blocker > > In SPARK-3452 we did some clean-up of published artifacts that turned out to > adversely affect some users. This has been mostly patched up in master via > SPARK-4925 (hive-thriftserver), SPARK-4048 (which inadvertently did this for > yarn and repl). But we should go in branch 1.2 and fix this as well so that > we can do a 1.2.1 release with these artifacts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5289) Backport publishing of repl, yarn into branch-1.2
[ https://issues.apache.org/jira/browse/SPARK-5289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5289: --- Description: In SPARK-3452 we did some clean-up of published artifacts that turned out to adversely affect some users. This has been mostly patched up in master via SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn modules, they were fixed in SPARK-4048 as part of a larger change that only went into master. Those pieces should be backported. was: In SPARK-3452 we did some clean-up of published artifacts that turned out to adversely affect some users. This has been mostly patched up in master via SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn modules, they were fixed in SPARK-4048 as part of a larger change that only went into master. > Backport publishing of repl, yarn into branch-1.2 > - > > Key: SPARK-5289 > URL: https://issues.apache.org/jira/browse/SPARK-5289 > Project: Spark > Issue Type: Improvement >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Blocker > > In SPARK-3452 we did some clean-up of published artifacts that turned out to > adversely affect some users. This has been mostly patched up in master via > SPARK-4925 (hive-thriftserver) which was backported. For the repl and yarn > modules, they were fixed in SPARK-4048 as part of a larger change that only > went into master. > Those pieces should be backported. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5289) Backport publishing of repl, yarn, and hive-thriftserver into branch-1.2
Patrick Wendell created SPARK-5289: -- Summary: Backport publishing of repl, yarn, and hive-thriftserver into branch-1.2 Key: SPARK-5289 URL: https://issues.apache.org/jira/browse/SPARK-5289 Project: Spark Issue Type: Improvement Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker In SPARK-3452 we did some clean-up of published artifacts that turned out to adversely affect some users. This has been mostly patched up in master via SPARK-4925 (hive-thriftserver), SPARK-4048 (which inadvertently did this for yarn and repl). But we should go in branch 1.2 and fix this as well so that we can do a 1.2.1 release with these artifacts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5287) NativeType.defaultSizeOf should have default sizes of all NativeTypes.
[ https://issues.apache.org/jira/browse/SPARK-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5287: Summary: NativeType.defaultSizeOf should have default sizes of all NativeTypes. (was: NativeType.defaultSizeOf should have all data types.) > NativeType.defaultSizeOf should have default sizes of all NativeTypes. > -- > > Key: SPARK-5287 > URL: https://issues.apache.org/jira/browse/SPARK-5287 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > Otherwise, we will fail to do stats estimation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5288) Stabilize Spark SQL data type API followup
Yin Huai created SPARK-5288: --- Summary: Stabilize Spark SQL data type API followup Key: SPARK-5288 URL: https://issues.apache.org/jira/browse/SPARK-5288 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Several issues we need to address before the 1.3 release: * Do we want to make all classes in org.apache.spark.sql.types.dataTypes.scala public? Seems we do not need to make those abstract classes public. * Seems NativeType is not a very clear and useful concept. Should we just remove it? * We need to stabilize the type hierarchy of our data types. Seems StringType and DecimalType should not be primitive types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5287) NativeType.defaultSizeOf should have all data types.
[ https://issues.apache.org/jira/browse/SPARK-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5287: Description: Otherwise, we will fail to do stats estimation. (was: NativeType.all and NativeType.defaultSizeOf are missing DecimalType, BinaryType, DateType, and TimestampType. ) > NativeType.defaultSizeOf should have all data types. > > > Key: SPARK-5287 > URL: https://issues.apache.org/jira/browse/SPARK-5287 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > Otherwise, we will fail to do stats estimation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5287) NativeType.defaultSizeOf should have all data types.
[ https://issues.apache.org/jira/browse/SPARK-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5287: Summary: NativeType.defaultSizeOf should have all data types. (was: NativeType's companion object should include all native types.) > NativeType.defaultSizeOf should have all data types. > > > Key: SPARK-5287 > URL: https://issues.apache.org/jira/browse/SPARK-5287 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > NativeType.all and NativeType.defaultSizeOf are missing DecimalType, > BinaryType, DateType, and TimestampType. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-733) Add documentation on use of accumulators in lazy transformation
[ https://issues.apache.org/jira/browse/SPARK-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid closed SPARK-733. -- Resolution: Fixed Fixed by https://github.com/apache/spark/pull/4022 > Add documentation on use of accumulators in lazy transformation > --- > > Key: SPARK-733 > URL: https://issues.apache.org/jira/browse/SPARK-733 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Josh Rosen > Fix For: 1.3.0, 1.2.1 > > > Accumulator updates are side-effects of RDD computations. Unlike RDDs, > accumulators do not carry lineage that would allow them to be computed when > their values are accessed on the master. > This can lead to confusion when accumulators are used in lazy transformations > like `map`: > {code} > val acc = sc.accumulator(0) > data.map { x => acc += x; f(x) } > // Here, acc is 0 because no actions have caused the `map` to be computed. > {code} > As far as I can tell, our documentation only includes examples of using > accumulators in `foreach`, for which this problem does not occur. > This pattern of using accumulators in map() occurs in Bagel and other Spark > code found in the wild. > It might be nice to document this behavior in the accumulators section of the > Spark programming guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-733) Add documentation on use of accumulators in lazy transformation
[ https://issues.apache.org/jira/browse/SPARK-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-733: --- Fix Version/s: 1.2.1 1.3.0 > Add documentation on use of accumulators in lazy transformation > --- > > Key: SPARK-733 > URL: https://issues.apache.org/jira/browse/SPARK-733 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Josh Rosen > Fix For: 1.3.0, 1.2.1 > > > Accumulator updates are side-effects of RDD computations. Unlike RDDs, > accumulators do not carry lineage that would allow them to be computed when > their values are accessed on the master. > This can lead to confusion when accumulators are used in lazy transformations > like `map`: > {code} > val acc = sc.accumulator(0) > data.map { x => acc += x; f(x) } > // Here, acc is 0 because no actions have caused the `map` to be computed. > {code} > As far as I can tell, our documentation only includes examples of using > accumulators in `foreach`, for which this problem does not occur. > This pattern of using accumulators in map() occurs in Bagel and other Spark > code found in the wild. > It might be nice to document this behavior in the accumulators section of the > Spark programming guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
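The SPARK-733 pitfall can be sketched end to end. This assumes a live Spark 1.x `SparkContext` named `sc` (e.g. inside spark-shell); it illustrates the behavior described in the ticket and is not wording from the programming guide:

```scala
// Assumes a running Spark 1.x SparkContext `sc`.
val acc = sc.accumulator(0)
val data = sc.parallelize(1 to 10)

// Transformations are lazy: no job has run yet, so the accumulator is untouched.
val mapped = data.map { x => acc += x; x * 2 }
println(acc.value) // still 0 here

// An action forces the computation; only then do the updates arrive on the driver.
mapped.count()
println(acc.value) // 55 for 1 to 10, assuming the stage ran exactly once
```

Note that if `mapped` is recomputed (for example by a second action without caching), the accumulator is incremented again, which is part of why documentation examples tend to pair accumulators with actions like `foreach`.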
[jira] [Created] (SPARK-5287) NativeType's companion object should include all native types.
Yin Huai created SPARK-5287: --- Summary: NativeType's companion object should include all native types. Key: SPARK-5287 URL: https://issues.apache.org/jira/browse/SPARK-5287 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai NativeType.all and NativeType.defaultSizeOf are missing DecimalType, BinaryType, DateType, and TimestampType. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
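To make the gap SPARK-5287 reports concrete, a completed default-size table might look roughly like the following. The object name and the byte values here are illustrative guesses, not Spark's actual `NativeType.defaultSizeOf` contents:

```scala
// Hypothetical sketch only: names and sizes are illustrative, not Spark's.
object NativeTypeSizes {
  val defaultSizeOf: Map[String, Int] = Map(
    "ByteType"      -> 1,
    "ShortType"     -> 2,
    "IntegerType"   -> 4,
    "LongType"      -> 8,
    "FloatType"     -> 4,
    "DoubleType"    -> 8,
    "BooleanType"   -> 1,
    // The types the ticket reports as missing:
    "DecimalType"   -> 8,    // approximate; decimals have no fixed width
    "BinaryType"    -> 4096, // a guess for variable-length binary
    "DateType"      -> 4,
    "TimestampType" -> 8
  )

  // Stats estimation would consult such a table; a missing entry is the
  // kind of lookup failure the ticket describes.
  def sizeOf(dataType: String): Int =
    defaultSizeOf.getOrElse(dataType, sys.error(s"no default size for $dataType"))
}
```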
[jira] [Commented] (SPARK-5284) Insert into Hive throws NPE when an inner complex type field has a null value
[ https://issues.apache.org/jira/browse/SPARK-5284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280835#comment-14280835 ] Apache Spark commented on SPARK-5284: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/4077 > Insert into Hive throws NPE when a inner complex type field has a null value > > > Key: SPARK-5284 > URL: https://issues.apache.org/jira/browse/SPARK-5284 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > For a table like the following one, > {code} > CREATE TABLE nullValuesInInnerComplexTypes > (s struct, > innerArray:array, > innerMap: map>) > {code} > When we want to insert a row like this > {code} > Row(Row(null, null, null)) > {code} > Will get a NPE > {code} > [info] org.apache.spark.SparkException: Job aborted due to stage failure: > Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in > stage 0.0 (TID 1, localhost): java.lang.NullPointerException > [info]at > scala.runtime.Tuple3Zipped$.foreach$extension(Tuple3Zipped.scala:105) > [info]at > org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrapperFor$3.apply(HiveInspectors.scala:351) > [info]at > org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrapperFor$3$$anonfun$apply$4.apply(HiveInspectors.scala:351) > [info]at > org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrapperFor$3$$anonfun$apply$4.apply(HiveInspectors.scala:351) > [info]at > scala.runtime.Tuple3Zipped$$anonfun$foreach$extension$1.apply(Tuple3Zipped.scala:109) > [info]at scala.collection.Iterator$class.foreach(Iterator.scala:727) > [info]at > scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > [info]at > scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > [info]at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > [info]at > scala.runtime.Tuple3Zipped$.foreach$extension(Tuple3Zipped.scala:107) > [info]at > 
org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrapperFor$3.apply(HiveInspectors.scala:351) > [info]at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:108) > [info]at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:105) > [info]at scala.collection.Iterator$class.foreach(Iterator.scala:727) > [info]at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > [info]at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:105) > [info]at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87) > [info]at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87) > [info]at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > [info]at org.apache.spark.scheduler.Task.run(Task.scala:64) > [info]at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:192) > [info]at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > [info]at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > [info]at java.lang.Thread.run(Thread.java:745) > [info] > [info] Driver stacktrace: > [info] at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1199) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1188) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1187) > [info] at > 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > [info] at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1187) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) > [info] at scala.Option.foreach(Option.scala:236) > [info] at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697) > [info]
[jira] [Commented] (SPARK-5286) Fail to drop an invalid table when using the data source API
[ https://issues.apache.org/jira/browse/SPARK-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280827#comment-14280827 ] Apache Spark commented on SPARK-5286: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/4076 > Fail to drop an invalid table when using the data source API > > > Key: SPARK-5286 > URL: https://issues.apache.org/jira/browse/SPARK-5286 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > > Example > {code} > CREATE TABLE jsonTable > USING org.apache.spark.sql.json.DefaultSource > OPTIONS ( > path 'it is not a path at all!' > ) > DROP TABLE jsonTable > {code} > We will get > {code} > [info] com.google.common.util.concurrent.UncheckedExecutionException: > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: > file:/Users/yhuai/Projects/Spark/spark/sql/hive/it is not a path at all! > [info] at > com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4882) > [info] at > com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898) > [info] at > org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:147) > [info] at > org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:241) > [info] at > org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) > [info] at > org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) > [info] at scala.Option.getOrElse(Option.scala:120) > [info] at > org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137) > [info] at > org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:241) > [info] at org.apache.spark.sql.SQLContext.table(SQLContext.scala:332) > 
[info] at > org.apache.spark.sql.hive.execution.DropTable.run(commands.scala:57) > [info] at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:53) > [info] at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:53) > [info] at > org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:61) > [info] at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:472) > [info] at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:472) > [info] at > org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) > [info] at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:108) > [info] at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:73) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply$mcV$sp(MetastoreDataSourcesSuite.scala:258) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246) > [info] at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) > [info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) > [info] at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at > 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(MetastoreDataSourcesSuite.scala:36) > [info] at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite.runTest(MetastoreDataSourcesSuite.scala:36) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info]
[jira] [Updated] (SPARK-5286) Fail to drop an invalid table when using the data source API
[ https://issues.apache.org/jira/browse/SPARK-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5286: Summary: Fail to drop an invalid table when using the data source API (was: Fail to drop a invalid table when using the data source API) > Fail to drop an invalid table when using the data source API > > > Key: SPARK-5286 > URL: https://issues.apache.org/jira/browse/SPARK-5286 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > > Example > {code} > CREATE TABLE jsonTable > USING org.apache.spark.sql.json.DefaultSource > OPTIONS ( > path 'it is not a path at all!' > ) > DROP TABLE jsonTable > {code} > We will get > {code} > [info] com.google.common.util.concurrent.UncheckedExecutionException: > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: > file:/Users/yhuai/Projects/Spark/spark/sql/hive/it is not a path at all! > [info] at > com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4882) > [info] at > com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898) > [info] at > org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:147) > [info] at > org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:241) > [info] at > org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) > [info] at > org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) > [info] at scala.Option.getOrElse(Option.scala:120) > [info] at > org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137) > [info] at > org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:241) > [info] at org.apache.spark.sql.SQLContext.table(SQLContext.scala:332) > [info] at > 
org.apache.spark.sql.hive.execution.DropTable.run(commands.scala:57) > [info] at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:53) > [info] at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:53) > [info] at > org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:61) > [info] at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:472) > [info] at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:472) > [info] at > org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) > [info] at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:108) > [info] at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:73) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply$mcV$sp(MetastoreDataSourcesSuite.scala:258) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246) > [info] at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) > [info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) > [info] at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > 
[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(MetastoreDataSourcesSuite.scala:36) > [info] at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite.runTest(MetastoreDataSourcesSuite.scala:36) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.scalates
[jira] [Created] (SPARK-5286) Fail to drop a invalid table when using the data source API
Yin Huai created SPARK-5286: --- Summary: Fail to drop a invalid table when using the data source API Key: SPARK-5286 URL: https://issues.apache.org/jira/browse/SPARK-5286 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Critical Example {code} CREATE TABLE jsonTable USING org.apache.spark.sql.json.DefaultSource OPTIONS ( path 'it is not a path at all!' ) DROP TABLE jsonTable {code} We will get {code} [info] com.google.common.util.concurrent.UncheckedExecutionException: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/yhuai/Projects/Spark/spark/sql/hive/it is not a path at all! [info] at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4882) [info] at com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898) [info] at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:147) [info] at org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:241) [info] at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) [info] at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137) [info] at org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:241) [info] at org.apache.spark.sql.SQLContext.table(SQLContext.scala:332) [info] at org.apache.spark.sql.hive.execution.DropTable.run(commands.scala:57) [info] at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:53) [info] at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:53) [info] at 
org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:61) [info] at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:472) [info] at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:472) [info] at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) [info] at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:108) [info] at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:73) [info] at org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply$mcV$sp(MetastoreDataSourcesSuite.scala:258) [info] at org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246) [info] at org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) [info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) [info] at org.apache.spark.sql.hive.MetastoreDataSourcesSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(MetastoreDataSourcesSuite.scala:36) [info] at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) [info] at org.apache.spark.sql.hive.MetastoreDataSourcesSuite.runTest(MetastoreDataSourcesSuite.scala:36) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) [info] at scala.collection.immutable.List.foreach(List.scala:318) [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) [info] at
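The trace shows where the drop goes wrong: `DropTable.run` calls `SQLContext.table`, which resolves the relation through `HiveMetastoreCatalog.lookupRelation` and therefore touches the (invalid) data source before anything is dropped. The shape of a fix is to remove the metadata entry without resolving it first. A minimal sketch with toy stand-ins (these are NOT Spark's actual catalog classes):

```scala
import scala.util.Try

// Toy stand-in: a catalog entry whose load() fails when the underlying
// data source is invalid, e.g. a nonexistent path.
final case class CatalogEntry(name: String, load: () => Unit)

final class ToyCatalog {
  private var tables = Map.empty[String, CatalogEntry]
  def register(e: CatalogEntry): Unit = tables += (e.name -> e)
  def contains(name: String): Boolean = tables.contains(name)

  // Mirrors the reported behaviour: DROP resolves the relation first,
  // so an invalid source makes the drop itself fail.
  def dropResolvingFirst(name: String): Try[Unit] =
    Try { tables(name).load(); tables -= name }

  // Sketch of a fix: remove the metadata entry without touching the source.
  def dropWithoutResolving(name: String): Try[Unit] =
    Try { tables -= name }
}
```

With an entry whose `load` throws (standing in for the `InvalidInputException` above), `dropResolvingFirst` fails and leaves the table behind, while `dropWithoutResolving` succeeds.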
[jira] [Comment Edited] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class
[ https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280802#comment-14280802 ] Corey J. Nolet edited comment on SPARK-5260 at 1/16/15 8:52 PM: bq. you can make the change and create a pull request. I'd love to submit a pull request for this. Do you have a proposed name for the utility object? bq. We do not add fix version(s) until it has been merged into our code base. Noted. We're quite different in Accumulo- we require fix versions for each ticket. was (Author: sonixbp): bq. you can make the change and create a pull request. I've love to submit a pull request for this. Do you have a proposed name for the utility object? bq. We do not add fix version(s) until it has been merged into our code base. Noted. We're quite different in Accumulo- we require fix versions for each ticket. > Expose JsonRDD.allKeysWithValueTypes() in a utility class > -- > > Key: SPARK-5260 > URL: https://issues.apache.org/jira/browse/SPARK-5260 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Corey J. Nolet > > I have found this method extremely useful when implementing my own strategy > for inferring a schema from parsed json. For now, I've actually copied the > method right out of the JsonRDD class into my own project but I think it > would be immensely useful to keep the code in Spark and expose it publicly > somewhere else- like an object called JsonSchema. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4502) Spark SQL reads unnecessary nested fields from Parquet
[ https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-4502: --- Summary: Spark SQL reads unnecessary nested fields from Parquet (was: Spark SQL reads unnecessary fields from Parquet) > Spark SQL reads unnecessary nested fields from Parquet > -- > > Key: SPARK-4502 > URL: https://issues.apache.org/jira/browse/SPARK-4502 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Liwen Sun >Priority: Critical > > When reading a field of a nested column from Parquet, SparkSQL reads and > assembles all the fields of that nested column. This is unnecessary, as > Parquet supports fine-grained field reads out of a nested column. This may > degrade the performance significantly when a nested column has many fields. > For example, I loaded json tweets data into SparkSQL and ran the following > query: > {{SELECT User.contributors_enabled from Tweets;}} > User is a nested structure that has 38 primitive fields (for Tweets schema, > see: https://dev.twitter.com/overview/api/tweets), here is the log message: > {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed > 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 > cell/ms}} > For comparison, I also ran: > {{SELECT User FROM Tweets;}} > And here is the log message: > {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed > 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}} > So both queries load 38 columns from Parquet, while the first query only > needs 1 column. I also measured the bytes read within Parquet. In these two > cases, the same number of bytes (99365194 bytes) were read. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
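The pruning the report asks for amounts to mapping a requested field path such as `User.contributors_enabled` to the single Parquet leaf column it needs, instead of assembling every leaf under `User`. A self-contained sketch of that projection over a toy schema tree (this is illustrative only, not Spark's actual Parquet integration):

```scala
// A nested schema as a tree of groups and leaf columns.
sealed trait Field
final case class Leaf(name: String) extends Field
final case class Group(name: String, children: Seq[Field]) extends Field

def fieldName(f: Field): String = f match {
  case Leaf(n)     => n
  case Group(n, _) => n
}

// Leaf columns needed for a field path. An empty path means
// "everything under this node" (the current, wasteful behaviour);
// a non-empty path descends into the one matching child.
def leavesUnder(f: Field, path: List[String]): Seq[String] = (f, path) match {
  case (Leaf(n), Nil)            => Seq(n)
  case (Group(n, cs), Nil)       => cs.flatMap(c => leavesUnder(c, Nil).map(l => s"$n.$l"))
  case (Group(n, cs), p :: rest) =>
    cs.filter(fieldName(_) == p).flatMap(c => leavesUnder(c, rest).map(l => s"$n.$l"))
  case (Leaf(_), _ :: _)         => Seq.empty
}
```

For a struct with many primitive fields, selecting one field path yields a single leaf column, which is exactly the read the Parquet format supports but the reported code path does not exploit.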
[jira] [Commented] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class
[ https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280802#comment-14280802 ] Corey J. Nolet commented on SPARK-5260: --- bq. you can make the change and create a pull request. I've love to submit a pull request for this. Do you have a proposed name for the utility object? bq. We do not add fix version(s) until it has been merged into our code base. Noted, we're quite different in Accumulo- we require fix versions for each ticket. > Expose JsonRDD.allKeysWithValueTypes() in a utility class > -- > > Key: SPARK-5260 > URL: https://issues.apache.org/jira/browse/SPARK-5260 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Corey J. Nolet > > I have found this method extremely useful when implementing my own strategy > for inferring a schema from parsed json. For now, I've actually copied the > method right out of the JsonRDD class into my own project but I think it > would be immensely useful to keep the code in Spark and expose it publicly > somewhere else- like an object called JsonSchema. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class
[ https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280802#comment-14280802 ] Corey J. Nolet edited comment on SPARK-5260 at 1/16/15 8:48 PM: bq. you can make the change and create a pull request. I've love to submit a pull request for this. Do you have a proposed name for the utility object? bq. We do not add fix version(s) until it has been merged into our code base. Noted. We're quite different in Accumulo- we require fix versions for each ticket. was (Author: sonixbp): bq. you can make the change and create a pull request. I've love to submit a pull request for this. Do you have a proposed name for the utility object? bq. We do not add fix version(s) until it has been merged into our code base. Noted, we're quite different in Accumulo- we require fix versions for each ticket. > Expose JsonRDD.allKeysWithValueTypes() in a utility class > -- > > Key: SPARK-5260 > URL: https://issues.apache.org/jira/browse/SPARK-5260 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Corey J. Nolet > > I have found this method extremely useful when implementing my own strategy > for inferring a schema from parsed json. For now, I've actually copied the > method right out of the JsonRDD class into my own project but I think it > would be immensely useful to keep the code in Spark and expose it publicly > somewhere else- like an object called JsonSchema. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class
[ https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5260: --- Fix Version/s: (was: 1.3.0) > Expose JsonRDD.allKeysWithValueTypes() in a utility class > -- > > Key: SPARK-5260 > URL: https://issues.apache.org/jira/browse/SPARK-5260 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Corey J. Nolet > > I have found this method extremely useful when implementing my own strategy > for inferring a schema from parsed json. For now, I've actually copied the > method right out of the JsonRDD class into my own project but I think it > would be immensely useful to keep the code in Spark and expose it publicly > somewhere else- like an object called JsonSchema. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
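For reference, the kind of helper being proposed can be sketched in a few lines over already-parsed JSON. This is a hedged illustration, not Spark's actual `JsonRDD.allKeysWithValueTypes` (whose internal representation differs), and `JsonSchema` is only the reporter's suggested name for the exposed object:

```scala
// Walk parsed JSON (nested Maps) and collect every key path together
// with the runtime type of its value, as a basis for schema inference.
def allKeysWithValueTypes(json: Map[String, Any], prefix: String = ""): Set[(String, String)] =
  json.toSeq.flatMap {
    case (k, v: Map[_, _]) =>
      // Recurse into nested objects, recording the object key itself too.
      allKeysWithValueTypes(v.asInstanceOf[Map[String, Any]], s"$prefix$k.").toSeq :+
        (s"$prefix$k" -> "object")
    case (k, null) => Seq(s"$prefix$k" -> "null")
    case (k, v)    => Seq(s"$prefix$k" -> v.getClass.getSimpleName)
  }.toSet
```

Given `Map("a" -> 1, "b" -> Map("c" -> "x"))`, this yields the key paths `a`, `b`, and `b.c` with their value types, which is the information needed to merge per-record schemas into one.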
[jira] [Updated] (SPARK-5270) Elegantly check if RDD is empty
[ https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5270: --- Target Version/s: 1.3.0 > Elegantly check if RDD is empty > --- > > Key: SPARK-5270 > URL: https://issues.apache.org/jira/browse/SPARK-5270 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.2.0 > Environment: Centos 6 >Reporter: Al M >Priority: Trivial > > Right now there is no clean way to check if an RDD is empty. As discussed > here: > http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679 > I'd like a method rdd.isEmpty that returns a boolean. > This would be especially useful when using streams. Sometimes my batches are > huge in one stream, sometimes I get nothing for hours. Still I have to run > count() to check if there is anything in the RDD. I can process my empty RDD > like the others but it would be more efficient to just skip the empty ones. > I can also run first() and catch the exception; this is neither a clean nor > fast solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
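The efficiency argument can be made concrete with a toy partitioned collection: `count()` must touch every partition, while `take(1)` can stop at the first non-empty one, so an `isEmpty` built on `take(1)` is cheap. (Spark later added `RDD.isEmpty` along these lines; the class below is a stand-in so the idea runs without a SparkContext, not Spark code.)

```scala
// Toy stand-in for a partitioned RDD.
final class ToyRDD[T](partitions: Seq[Seq[T]]) {
  // count() forces every partition.
  def count(): Long = partitions.iterator.map(_.size.toLong).sum

  // take(n) stops as soon as n elements have been seen.
  def take(n: Int): Seq[T] = partitions.iterator.flatten.take(n).toSeq

  // The requested check: cheap, and no exception handling needed
  // (unlike the first()-and-catch workaround).
  def isEmpty: Boolean = take(1).isEmpty
}
```

In a streaming job this lets each batch be skipped with a constant-cost check instead of a full `count()` over data that may be huge.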
[jira] [Resolved] (SPARK-4357) Modify release publishing to work with Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-4357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4357. Resolution: Fixed Sorry this is actually working now. We now publish artifacts for Scala 2.11. It was fixed a while back. > Modify release publishing to work with Scala 2.11 > - > > Key: SPARK-4357 > URL: https://issues.apache.org/jira/browse/SPARK-4357 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Patrick Wendell >Assignee: Patrick Wendell > > We'll need to do some effort to make our publishing work with 2.11 since the > current pipeline assumes a single set of artifacts is published. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5285) Removed GroupExpression in catalyst
[ https://issues.apache.org/jira/browse/SPARK-5285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280767#comment-14280767 ] Apache Spark commented on SPARK-5285: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/4075 > Removed GroupExpression in catalyst > > > Key: SPARK-5285 > URL: https://issues.apache.org/jira/browse/SPARK-5285 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.2.0 >Reporter: wangfei > > Removed GroupExpression in catalyst -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5285) Removed GroupExpression in catalyst
wangfei created SPARK-5285: -- Summary: Removed GroupExpression in catalyst Key: SPARK-5285 URL: https://issues.apache.org/jira/browse/SPARK-5285 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: wangfei Removed GroupExpression in catalyst -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5284) Insert into Hive throws NPE when an inner complex type field has a null value
Yin Huai created SPARK-5284: --- Summary: Insert into Hive throws NPE when an inner complex type field has a null value Key: SPARK-5284 URL: https://issues.apache.org/jira/browse/SPARK-5284 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai For a table like the following one, {code} CREATE TABLE nullValuesInInnerComplexTypes (s struct<innerStruct: struct<s1:string>, innerArray:array<int>, innerMap: map<string, int>>) {code} When we want to insert a row like this {code} Row(Row(null, null, null)) {code} Will get an NPE {code} [info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost): java.lang.NullPointerException [info] at scala.runtime.Tuple3Zipped$.foreach$extension(Tuple3Zipped.scala:105) [info] at org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrapperFor$3.apply(HiveInspectors.scala:351) [info] at org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrapperFor$3$$anonfun$apply$4.apply(HiveInspectors.scala:351) [info] at org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrapperFor$3$$anonfun$apply$4.apply(HiveInspectors.scala:351) [info] at scala.runtime.Tuple3Zipped$$anonfun$foreach$extension$1.apply(Tuple3Zipped.scala:109) [info] at scala.collection.Iterator$class.foreach(Iterator.scala:727) [info] at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) [info] at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) [info] at scala.collection.AbstractIterable.foreach(Iterable.scala:54) [info] at scala.runtime.Tuple3Zipped$.foreach$extension(Tuple3Zipped.scala:107) [info] at org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrapperFor$3.apply(HiveInspectors.scala:351) [info] at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:108) [info] at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:105) [info] at scala.collection.Iterator$class.foreach(Iterator.scala:727) [info] at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) [info] at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:105) [info] at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87) [info] at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87) [info] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) [info] at org.apache.spark.scheduler.Task.run(Task.scala:64) [info] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:192) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [info] at java.lang.Thread.run(Thread.java:745) [info] [info] Driver stacktrace: [info] at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1199) [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1188) [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1187) [info] at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) [info] at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1187) [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) [info] at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) [info] at scala.Option.foreach(Option.scala:236) [info] at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697) [info] at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1399) [info] at akka.actor.Actor$class.aroundReceive(Actor.scala:465) [info] at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1360) [info] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) [info] at akka.actor.ActorCell.invoke(ActorCell.scala:487) [info] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) [info] at akka.dispatch.Mailbox.run(Mailbox.scala:220) [info] at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) [info]
[jira] [Commented] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function
[ https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280731#comment-14280731 ] Andrew Musselman commented on SPARK-4259: - Thinking of picking this up; has there been any work on this already? > Add Spectral Clustering Algorithm with Gaussian Similarity Function > --- > > Key: SPARK-4259 > URL: https://issues.apache.org/jira/browse/SPARK-4259 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Fan Jiang >Assignee: Fan Jiang > Labels: features > > In recent years, spectral clustering has become one of the most popular > modern clustering algorithms. It is simple to implement, can be solved > efficiently by standard linear algebra software, and very often outperforms > traditional clustering algorithms such as the k-means algorithm. > We implemented the unnormalized graph Laplacian matrix using a Gaussian similarity > function. A brief design is sketched below: > Unnormalized spectral clustering > Input: raw data points, number k of clusters to construct: > • Compute the similarity matrix S ∈ Rn×n. > • Construct a similarity graph. Let W be its weighted adjacency matrix. > • Compute the unnormalized Laplacian L = D - W, where D is the degree > diagonal matrix. > • Compute the first k eigenvectors u1, . . . , uk of L. > • Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. > • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th > row of U. > • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into > clusters C1, . . . , Ck. > Output: Clusters A1, . . . , Ak with Ai = { j | yj ∈ Ci }. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
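The first steps of the outlined design (Gaussian similarity matrix W, degree matrix D, unnormalized Laplacian L = D - W) can be sketched self-containedly on plain arrays; the eigen-decomposition and k-means steps are omitted here because they need a linear-algebra library, and this is not the proposed MLlib implementation:

```scala
// W(i)(j) = exp(-||p_i - p_j||^2 / (2 * sigma^2)), the Gaussian kernel.
def gaussianSimilarity(points: Array[Array[Double]], sigma: Double): Array[Array[Double]] = {
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  points.map(p => points.map(q => math.exp(-sqDist(p, q) / (2 * sigma * sigma))))
}

// L = D - W, where D is the diagonal degree matrix of row sums of W.
def unnormalizedLaplacian(w: Array[Array[Double]]): Array[Array[Double]] = {
  val degrees = w.map(_.sum)
  w.indices.toArray.map { i =>
    w.indices.toArray.map { j =>
      (if (i == j) degrees(i) else 0.0) - w(i)(j)
    }
  }
}
```

A useful sanity check on any implementation: every row of L sums to zero, identical points have similarity 1, and far-apart points have similarity near 0.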
[jira] [Commented] (SPARK-4520) SparkSQL exception when reading certain columns from a parquet file
[ https://issues.apache.org/jira/browse/SPARK-4520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280705#comment-14280705 ] Tyler commented on SPARK-4520: -- No rush. Just interested. I figured my problem was something along the lines of: schema != my custom serializer != the spark deserializer But it looks like the problems may lie with the spark deserializer more than my own serialization. > SparkSQL exception when reading certain columns from a parquet file > --- > > Key: SPARK-4520 > URL: https://issues.apache.org/jira/browse/SPARK-4520 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: sadhan sood >Assignee: sadhan sood >Priority: Critical > Attachments: part-r-0.parquet > > > I am seeing this issue with spark sql throwing an exception when trying to > read selective columns from a thrift parquet file and also when caching them. > On some further digging, I was able to narrow it down to at least one > particular column type: map<string, array<string>> to be causing this issue. 
To > reproduce this I created a test thrift file with a very basic schema and > stored some sample data in a parquet file: > Test.thrift > === > {code} > typedef binary SomeId > enum SomeExclusionCause { > WHITELIST = 1, > HAS_PURCHASE = 2, > } > struct SampleThriftObject { > 10: string col_a; > 20: string col_b; > 30: string col_c; > 40: optional map> col_d; > } > {code} > = > And loading the data in spark through schemaRDD: > {code} > import org.apache.spark.sql.SchemaRDD > val sqlContext = new org.apache.spark.sql.SQLContext(sc); > val parquetFile = "/path/to/generated/parquet/file" > val parquetFileRDD = sqlContext.parquetFile(parquetFile) > parquetFileRDD.printSchema > root > |-- col_a: string (nullable = true) > |-- col_b: string (nullable = true) > |-- col_c: string (nullable = true) > |-- col_d: map (nullable = true) > ||-- key: string > ||-- value: array (valueContainsNull = true) > |||-- element: string (containsNull = false) > parquetFileRDD.registerTempTable("test") > sqlContext.cacheTable("test") > sqlContext.sql("select col_a from test").collect() <-- see the exception > stack here > {code} > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 > (TID 0, localhost): parquet.io.ParquetDecodingException: Can not read value > at 0 in block -1 in file file:/tmp/xyz/part-r-0.parquet > at > parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) > at > parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) > at > org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at 
scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) > at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) > at > org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) > at > org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at jav
[jira] [Commented] (SPARK-4520) SparkSQL exception when reading certain columns from a parquet file
[ https://issues.apache.org/jira/browse/SPARK-4520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280698#comment-14280698 ] sadhan sood commented on SPARK-4520: Tyler, Alex - the problem is not with parquet but how we are reading the parquet columns. Just wanted to make sure that you are seeing this problem with thrift-generated parquet files as well? I am going to submit my fix this weekend now that I have some availability; my apologies for the delay. > SparkSQL exception when reading certain columns from a parquet file > --- > > Key: SPARK-4520 > URL: https://issues.apache.org/jira/browse/SPARK-4520 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: sadhan sood >Assignee: sadhan sood >Priority: Critical > Attachments: part-r-0.parquet > > > I am seeing this issue with spark sql throwing an exception when trying to > read selective columns from a thrift parquet file and also when caching them. > On some further digging, I was able to narrow it down to at least one > particular column type: map<string, array<string>> to be causing this issue. 
To > reproduce this I created a test thrift file with a very basic schema and > stored some sample data in a parquet file: > Test.thrift > === > {code} > typedef binary SomeId > enum SomeExclusionCause { > WHITELIST = 1, > HAS_PURCHASE = 2, > } > struct SampleThriftObject { > 10: string col_a; > 20: string col_b; > 30: string col_c; > 40: optional map> col_d; > } > {code} > = > And loading the data in spark through schemaRDD: > {code} > import org.apache.spark.sql.SchemaRDD > val sqlContext = new org.apache.spark.sql.SQLContext(sc); > val parquetFile = "/path/to/generated/parquet/file" > val parquetFileRDD = sqlContext.parquetFile(parquetFile) > parquetFileRDD.printSchema > root > |-- col_a: string (nullable = true) > |-- col_b: string (nullable = true) > |-- col_c: string (nullable = true) > |-- col_d: map (nullable = true) > ||-- key: string > ||-- value: array (valueContainsNull = true) > |||-- element: string (containsNull = false) > parquetFileRDD.registerTempTable("test") > sqlContext.cacheTable("test") > sqlContext.sql("select col_a from test").collect() <-- see the exception > stack here > {code} > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 > (TID 0, localhost): parquet.io.ParquetDecodingException: Can not read value > at 0 in block -1 in file file:/tmp/xyz/part-r-0.parquet > at > parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) > at > parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) > at > org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at 
scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) > at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) > at > org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) > at > org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.Th
[jira] [Reopened] (SPARK-3726) RandomForest: Support for bootstrap options
[ https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reopened SPARK-3726: -- Assignee: Manoj Kumar (was: Manish Amde) This wasn't really fixed actually; my mistake. (The option is overridden for RandomForest.) > RandomForest: Support for bootstrap options > --- > > Key: SPARK-3726 > URL: https://issues.apache.org/jira/browse/SPARK-3726 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Manoj Kumar >Priority: Minor > Fix For: 1.2.0 > > > RandomForest uses BaggedPoint to simulate bootstrapped samples of the data. > The expected size of each sample is the same as the original data (sampling > rate = 1.0), and sampling is done with replacement. Adding support for other > sampling rates and for sampling without replacement would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3726) RandomForest: Support for bootstrap options
[ https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-3726. -- Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Manish Amde (was: Manoj Kumar) Implemented in PR https://github.com/apache/spark/pull/2607 > RandomForest: Support for bootstrap options > --- > > Key: SPARK-3726 > URL: https://issues.apache.org/jira/browse/SPARK-3726 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Manish Amde >Priority: Minor > Fix For: 1.2.0 > > > RandomForest uses BaggedPoint to simulate bootstrapped samples of the data. > The expected size of each sample is the same as the original data (sampling > rate = 1.0), and sampling is done with replacement. Adding support for other > sampling rates and for sampling without replacement would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3726) RandomForest: Support for bootstrap options
[ https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280663#comment-14280663 ] Joseph K. Bradley commented on SPARK-3726: -- IMO I think it should be closed. I'll get someone to fix the JIRA-PR links/tags. Sorry for the wasted effort! > RandomForest: Support for bootstrap options > --- > > Key: SPARK-3726 > URL: https://issues.apache.org/jira/browse/SPARK-3726 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Manoj Kumar >Priority: Minor > > RandomForest uses BaggedPoint to simulate bootstrapped samples of the data. > The expected size of each sample is the same as the original data (sampling > rate = 1.0), and sampling is done with replacement. Adding support for other > sampling rates and for sampling without replacement would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3726) RandomForest: Support for bootstrap options
[ https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280658#comment-14280658 ] Manoj Kumar commented on SPARK-3726: Ah I see. I had my doubts when I started looking at the code, but was in a hurry to send a Pull Request. So this can be closed? > RandomForest: Support for bootstrap options > --- > > Key: SPARK-3726 > URL: https://issues.apache.org/jira/browse/SPARK-3726 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Manoj Kumar >Priority: Minor > > RandomForest uses BaggedPoint to simulate bootstrapped samples of the data. > The expected size of each sample is the same as the original data (sampling > rate = 1.0), and sampling is done with replacement. Adding support for other > sampling rates and for sampling without replacement would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3726) RandomForest: Support for bootstrap options
[ https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280651#comment-14280651 ] Joseph K. Bradley commented on SPARK-3726: -- Sorry! I had forgotten that this was really solved by [https://github.com/apache/spark/commit/8602195510f5821b37746bb7fa24902f43a1bd93]! That commit added subsamplingRate. Thinking more about this, I'm not sure if sampling without replacement is needed (or useful, since it is more expensive and makes for less randomness in the bootstrapped samples). Users can currently set subsamplingRate via Strategy, and I don't think it needs to be added to the train* methods. Let me know if you have a good use case for subsampling without replacement. Thanks! > RandomForest: Support for bootstrap options > --- > > Key: SPARK-3726 > URL: https://issues.apache.org/jira/browse/SPARK-3726 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Manoj Kumar >Priority: Minor > > RandomForest uses BaggedPoint to simulate bootstrapped samples of the data. > The expected size of each sample is the same as the original data (sampling > rate = 1.0), and sampling is done with replacement. Adding support for other > sampling rates and for sampling without replacement would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
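For context on the two sampling modes being discussed, the difference can be sketched in plain Scala. This is an illustration of the general technique only, not Spark's actual BaggedPoint code; the function names are made up:

```scala
import scala.util.Random

// Sampling WITH replacement at a given rate: each point appears in a bagged
// sample a Poisson(rate)-distributed number of times (Knuth's algorithm).
def withReplacementCount(rate: Double, rng: Random): Int = {
  val limit = math.exp(-rate)
  var k = 0
  var p = 1.0
  do {
    k += 1
    p *= rng.nextDouble()
  } while (p > limit)
  k - 1
}

// Sampling WITHOUT replacement: each point appears at most once,
// kept with probability `rate`.
def withoutReplacementCount(rate: Double, rng: Random): Int =
  if (rng.nextDouble() < rate) 1 else 0
```

With rate = 1.0, sampling with replacement still leaves roughly e^-1 (about 37%) of points out of each sample, which is what makes bootstrap samples differ from each other; without replacement at rate 1.0 every point is kept, which is why it gives less randomness, as noted above.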
[jira] [Commented] (SPARK-4766) ML Estimator Params should subclass Transformer Params
[ https://issues.apache.org/jira/browse/SPARK-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280604#comment-14280604 ] Joseph K. Bradley commented on SPARK-4766: -- That is a good point, but I'll put it in another JIRA as a separate issue: [https://issues.apache.org/jira/browse/SPARK-5283] > ML Estimator Params should subclass Transformer Params > -- > > Key: SPARK-4766 > URL: https://issues.apache.org/jira/browse/SPARK-4766 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > Currently, in spark.ml, both Transformers and Estimators extend the same > Params classes. There should be one Params class for the Transformer and one > for the Estimator, where the Estimator params class extends the Transformer > one. > E.g., it is weird to be able to do: > {code} > val model: LogisticRegressionModel = ... > model.getMaxIter() > {code} > (This is the only case where this happens currently, but it is worth setting > a precedent.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5283) ML sharedParams should be public
Joseph K. Bradley created SPARK-5283: Summary: ML sharedParams should be public Key: SPARK-5283 URL: https://issues.apache.org/jira/browse/SPARK-5283 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.2.0 Reporter: Joseph K. Bradley The many shared Params implemented in sharedParams.scala should be made public.
Pros:
* Easier for developers of outside packages
* Standardized parameter and input/output column names
Cons:
* None? Except that we'd need to make sure that the APIs are good enough
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Closed] (SPARK-5231) History Server shows wrong job submission time.
[ https://issues.apache.org/jira/browse/SPARK-5231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5231. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Kousuke Saruta > History Server shows wrong job submission time. > --- > > Key: SPARK-5231 > URL: https://issues.apache.org/jira/browse/SPARK-5231 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta > Fix For: 1.3.0 > > > History Server doesn't show the correct job submission time. > This is because JobProgressListener updates the job submission time every time the > onJobStart method is invoked from ReplayListenerBus. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4766) ML Estimator Params should subclass Transformer Params
[ https://issues.apache.org/jira/browse/SPARK-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280571#comment-14280571 ] Peter Rudenko commented on SPARK-4766: -- Also make the traits that extend Params public. Here's a use case: I want to make custom transformers that take several columns and produce a single one:
{code}
trait HasMultipleInputColumns extends Params {
  val inputColumns: Param[Seq[String]] =
    new Param(this, "input columns", "names of input columns")

  def getInputCols: Seq[String] = paramMap(inputColumns)
}

/*
 * Takes col1, col2, ... and produces a column "features" -> Vector(col1, col2)
 */
class LRFeatureListTransformer extends Transformer
  with HasMultipleInputColumns with HasOutputColumn

/*
 * Takes col1, col2, ... and produces a column "features" -> Vector(col1 + col2 + ...)
 */
class SumFeatureListTransformer extends Transformer
  with HasMultipleInputColumns with HasOutputColumn
{code}
I can't import the HasOutputColumn trait, because it's private to the ml package. > ML Estimator Params should subclass Transformer Params > -- > > Key: SPARK-4766 > URL: https://issues.apache.org/jira/browse/SPARK-4766 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > Currently, in spark.ml, both Transformers and Estimators extend the same > Params classes. There should be one Params class for the Transformer and one > for the Estimator, where the Estimator params class extends the Transformer > one. > E.g., it is weird to be able to do: > {code} > val model: LogisticRegressionModel = ... > model.getMaxIter() > {code} > (This is the only case where this happens currently, but it is worth setting > a precedent.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5201) ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing with inclusive range
[ https://issues.apache.org/jira/browse/SPARK-5201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5201: - Affects Version/s: (was: 1.2.0) 1.0.0 > ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing > with inclusive range > -- > > Key: SPARK-5201 > URL: https://issues.apache.org/jira/browse/SPARK-5201 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Ye Xianjin >Assignee: Ye Xianjin > Labels: rdd > Fix For: 1.3.0, 1.2.1 > > Original Estimate: 2h > Remaining Estimate: 2h > > {code} > sc.makeRDD(1 to (Int.MaxValue)).count // result = 0 > sc.makeRDD(1 to (Int.MaxValue - 1)).count // result = 2147483646 = > Int.MaxValue - 1 > sc.makeRDD(1 until (Int.MaxValue)).count// result = 2147483646 = > Int.MaxValue - 1 > {code} > More details on the discussion https://github.com/apache/spark/pull/2874 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5201) ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing with inclusive range
[ https://issues.apache.org/jira/browse/SPARK-5201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5201. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Ye Xianjin Target Version/s: 1.3.0, 1.2.1 (was: 1.0.2, 1.1.1, 1.2.0) > ParallelCollectionRDD.slice(seq, numSlices) has int overflow when dealing > with inclusive range > -- > > Key: SPARK-5201 > URL: https://issues.apache.org/jira/browse/SPARK-5201 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Ye Xianjin >Assignee: Ye Xianjin > Labels: rdd > Fix For: 1.3.0, 1.2.1 > > Original Estimate: 2h > Remaining Estimate: 2h > >
> {code}
> sc.makeRDD(1 to Int.MaxValue).count        // result = 0
> sc.makeRDD(1 to (Int.MaxValue - 1)).count  // result = 2147483646 = Int.MaxValue - 1
> sc.makeRDD(1 until Int.MaxValue).count     // result = 2147483646 = Int.MaxValue - 1
> {code}
> More details on the discussion: https://github.com/apache/spark/pull/2874 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
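The root cause is ordinary Int overflow when the inclusive range's bound is converted to an exclusive one. A simplified illustration in plain Scala (not Spark's actual slice code):

```scala
// An inclusive range ending at Int.MaxValue cannot be converted to an
// exclusive range with an Int bound: end + 1 wraps around to Int.MinValue.
val r = 1 to Int.MaxValue
val exclusiveEnd = r.end + 1          // Int overflow: Int.MaxValue + 1 == Int.MinValue
val asExclusive = r.start until exclusiveEnd

// The start (1) is already past the wrapped-around end, so the exclusive
// range is empty -- which is why every slice, and hence count, came out as 0.
val sliceIsEmpty = asExclusive.isEmpty
```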
[jira] [Closed] (SPARK-1507) Spark on Yarn: Add support for user to specify # cores for ApplicationMaster
[ https://issues.apache.org/jira/browse/SPARK-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-1507. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: WangTaoTheTonic Target Version/s: 1.3.0 > Spark on Yarn: Add support for user to specify # cores for ApplicationMaster > > > Key: SPARK-1507 > URL: https://issues.apache.org/jira/browse/SPARK-1507 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.0.0 >Reporter: Thomas Graves >Assignee: WangTaoTheTonic > Fix For: 1.3.0 > > > Now that Hadoop 2.x can schedule cores as a resource we should allow the user > to specify the # of cores for the ApplicationMaster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5270) Elegantly check if RDD is empty
[ https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280491#comment-14280491 ] Apache Spark commented on SPARK-5270: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4074 > Elegantly check if RDD is empty > --- > > Key: SPARK-5270 > URL: https://issues.apache.org/jira/browse/SPARK-5270 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.2.0 > Environment: Centos 6 >Reporter: Al M >Priority: Trivial > > Right now there is no clean way to check if an RDD is empty. As discussed > here: > http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679 > I'd like a method rdd.isEmpty that returns a boolean. > This would be especially useful when using streams. Sometimes my batches are > huge in one stream, sometimes I get nothing for hours. Still I have to run > count() to check if there is anything in the RDD. I can process my empty RDD > like the others but it would be more efficient to just skip the empty ones. > I can also run first() and catch the exception; this is neither a clean nor > fast solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3726) RandomForest: Support for bootstrap options
[ https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280486#comment-14280486 ] Apache Spark commented on SPARK-3726: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/4073 > RandomForest: Support for bootstrap options > --- > > Key: SPARK-3726 > URL: https://issues.apache.org/jira/browse/SPARK-3726 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Manoj Kumar >Priority: Minor > > RandomForest uses BaggedPoint to simulate bootstrapped samples of the data. > The expected size of each sample is the same as the original data (sampling > rate = 1.0), and sampling is done with replacement. Adding support for other > sampling rates and for sampling without replacement would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5270) Elegantly check if RDD is empty
[ https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280260#comment-14280260 ] Al M commented on SPARK-5270: - I don't mind at all. I'd be really happy to have such a utility method in Spark. > Elegantly check if RDD is empty > --- > > Key: SPARK-5270 > URL: https://issues.apache.org/jira/browse/SPARK-5270 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.2.0 > Environment: Centos 6 >Reporter: Al M >Priority: Trivial > > Right now there is no clean way to check if an RDD is empty. As discussed > here: > http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679 > I'd like a method rdd.isEmpty that returns a boolean. > This would be especially useful when using streams. Sometimes my batches are > huge in one stream, sometimes I get nothing for hours. Still I have to run > count() to check if there is anything in the RDD. I can process my empty RDD > like the others but it would be more efficient to just skip the empty ones. > I can also run first() and catch the exception; this is neither a clean nor > fast solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4630) Dynamically determine optimal number of partitions
[ https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280173#comment-14280173 ] Apache Spark commented on SPARK-4630: - User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/4070 > Dynamically determine optimal number of partitions > -- > > Key: SPARK-4630 > URL: https://issues.apache.org/jira/browse/SPARK-4630 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Kostas Sakellis >Assignee: Kostas Sakellis > >
> Partition sizes play a big part in how fast stages execute during a Spark job. There is a direct relationship between the size of partitions and the number of tasks: larger partitions, fewer tasks. For better performance, Spark has a sweet spot for how large the partitions executed by a task should be. If partitions are too small, then the user pays a disproportionate cost in scheduling overhead. If the partitions are too large, then task execution slows down due to GC pressure and spilling to disk.
> To increase the performance of jobs, users often hand-optimize the number (size) of partitions that the next stage gets. Factors that come into play are:
> * incoming partition sizes from the previous stage
> * number of available executors
> * available memory per executor (taking into account spark.shuffle.memoryFraction)
> Spark has access to this data and so should be able to do the partition sizing for the user automatically. This feature can be turned on/off with a configuration option.
> To make this happen, we propose modifying the DAGScheduler to take partition sizes into account upon stage completion. Before scheduling the next stage, the scheduler can examine the sizes of the partitions and determine the appropriate number of tasks to create. 
Since this change requires > non-trivial modifications to the DAGScheduler, a detailed design doc will be > attached before proceeding with the work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
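The kind of heuristic the proposal describes can be sketched in a few lines. The 64 MB target and the function itself are illustrative assumptions for this digest, not part of the JIRA or its design doc:

```scala
// Pick a task count for the next stage from the byte sizes of the incoming
// partitions, aiming for roughly `targetBytes` of input per task.
// BigInt is used for the sum so the total itself cannot overflow.
def suggestedNumTasks(incomingPartitionBytes: Seq[Long],
                      targetBytes: Long = 64L * 1024 * 1024): Int = {
  val totalBytes = incomingPartitionBytes.map(BigInt(_)).sum
  if (totalBytes == 0) 1
  else ((totalBytes + targetBytes - 1) / targetBytes).toInt.max(1) // ceiling division
}
```

The real scheduler would also need to cap the result by cluster parallelism and fold in executor memory, which is exactly why a design doc is warranted.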
[jira] [Commented] (SPARK-5282) RowMatrix easily gets int overflow in the memory size warning
[ https://issues.apache.org/jira/browse/SPARK-5282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280167#comment-14280167 ] Apache Spark commented on SPARK-5282: - User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/4069 > RowMatrix easily gets int overflow in the memory size warning > - > > Key: SPARK-5282 > URL: https://issues.apache.org/jira/browse/SPARK-5282 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.2.0 > Environment: centos, others should be similar >Reporter: yuhao yang >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > The warning in the RowMatrix will easily get int overflow when the cols is > larger than 16385. > minor issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5282) RowMatrix easily gets int overflow in the memory size warning
[ https://issues.apache.org/jira/browse/SPARK-5282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280159#comment-14280159 ] yuhao yang commented on SPARK-5282: --- typical wrong message: Row matrix: 17000 cloumns will require at least -1982967296 bytes of memory! PR on the way. > RowMatrix easily gets int overflow in the memory size warning > - > > Key: SPARK-5282 > URL: https://issues.apache.org/jira/browse/SPARK-5282 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.2.0 > Environment: centos, others should be similar >Reporter: yuhao yang >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > The warning in the RowMatrix will easily get int overflow when the cols is > larger than 16385. > minor issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
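The reported figure is exactly what 32-bit wraparound predicts. A quick check in plain Scala (this is just the arithmetic, not the RowMatrix code itself):

```scala
val cols = 17000
// The memory estimate is roughly cols * cols * 8 bytes (one Double per entry).
val intEstimate  = cols * cols * 8        // Int arithmetic wraps around to a negative number
val longEstimate = cols.toLong * cols * 8 // 2312000000 bytes, about 2.3 GB
```

2,312,000,000 exceeds Int.MaxValue (2,147,483,647), so the Int result wraps to -1,982,967,296 -- the exact number in the warning. Promoting one operand to Long before multiplying fixes it.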
[jira] [Created] (SPARK-5282) RowMatrix easily gets int overflow in the memory size warning
yuhao yang created SPARK-5282: - Summary: RowMatrix easily gets int overflow in the memory size warning Key: SPARK-5282 URL: https://issues.apache.org/jira/browse/SPARK-5282 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Environment: centos, others should be similar Reporter: yuhao yang Priority: Trivial The warning in RowMatrix easily gets int overflow when the number of columns is larger than 16385. Minor issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5281) Registering table on RDD is giving MissingRequirementError
[ https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sarsol updated SPARK-5281: -- Component/s: SQL > Registering table on RDD is giving MissingRequirementError > -- > > Key: SPARK-5281 > URL: https://issues.apache.org/jira/browse/SPARK-5281 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: sarsol >Priority: Critical > > Application crashes on this line rdd.registerTempTable("temp") in 1.2 > version when using sbt or Eclipse SCALA IDE > Stacktrace > Exception in thread "main" scala.reflect.internal.MissingRequirementError: > class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with > primordial classloader with boot classpath > [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program > Files\Java\jre7\lib\resources.jar;C:\Program > Files\Java\jre7\lib\rt.jar;C:\Program > Files\Java\jre7\lib\sunrsasign.jar;C:\Program > Files\Java\jre7\lib\jsse.jar;C:\Program > Files\Java\jre7\lib\jce.jar;C:\Program > Files\Java\jre7\lib\charsets.jar;C:\Program > Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found. 
> at > scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) > at > scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) > at > scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) > at > scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) > at > scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) > at > scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) > at > scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) > at > org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) > at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335) > at scala.reflect.api.Universe.typeOf(Universe.scala:59) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115) > at > org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100) > at > org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94) > at > org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33) > at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111) > at > com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43) > at scala.Function0$class.apply$mcV$sp(Function0.scala:40) > at > scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12) > at scala.App$$anonfun$main$1.apply(App.scala:71) > at scala.App$$anonfun$main$1.apply(App.scala:71) > at 
scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32) > at scala.App$class.main(App.scala:71) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5270) Elegantly check if RDD is empty
[ https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280145#comment-14280145 ] Sean Owen commented on SPARK-5270: -- I think it would be nice to have a utility method like this indeed since it can wrap up all these options. Check for 0 partitions then check for first element. Mind if I make a PR? > Elegantly check if RDD is empty > --- > > Key: SPARK-5270 > URL: https://issues.apache.org/jira/browse/SPARK-5270 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.2.0 > Environment: Centos 6 >Reporter: Al M >Priority: Trivial > > Right now there is no clean way to check if an RDD is empty. As discussed > here: > http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679 > I'd like a method rdd.isEmpty that returns a boolean. > This would be especially useful when using streams. Sometimes my batches are > huge in one stream, sometimes I get nothing for hours. Still I have to run > count() to check if there is anything in the RDD. I can process my empty RDD > like the others but it would be more efficient to just skip the empty ones. > I can also run first() and catch the exception; this is neither a clean nor > fast solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
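Sean's outline could look roughly like this. This is a sketch against the public RDD API of the time, not the method Spark eventually shipped, which may differ:

```scala
import org.apache.spark.rdd.RDD

// An RDD is empty when it has no partitions at all, or when asking for a
// single element returns nothing. take(1) only scans partitions until it
// finds one element, so this avoids the full pass that count() makes.
def isEmpty[T](rdd: RDD[T]): Boolean =
  rdd.partitions.length == 0 || rdd.take(1).isEmpty
```

Checking the partition count first is a cheap driver-side short-circuit: it avoids launching any job at all for an RDD that was built with zero partitions.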
[jira] [Updated] (SPARK-5281) Registering table on RDD is giving MissingRequirementError
[ https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sarsol updated SPARK-5281: -- Priority: Critical (was: Major) > Registering table on RDD is giving MissingRequirementError > -- > > Key: SPARK-5281 > URL: https://issues.apache.org/jira/browse/SPARK-5281 > Project: Spark > Issue Type: Bug >Affects Versions: 1.2.0 >Reporter: sarsol >Priority: Critical > > Application crashes on this line rdd.registerTempTable("temp") in 1.2 > version when using sbt or Eclipse SCALA IDE > Stacktrace > Exception in thread "main" scala.reflect.internal.MissingRequirementError: > class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with > primordial classloader with boot classpath > [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program > Files\Java\jre7\lib\resources.jar;C:\Program > Files\Java\jre7\lib\rt.jar;C:\Program > Files\Java\jre7\lib\sunrsasign.jar;C:\Program > Files\Java\jre7\lib\jsse.jar;C:\Program > Files\Java\jre7\lib\jce.jar;C:\Program > Files\Java\jre7\lib\charsets.jar;C:\Program > Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found. 
> at > scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) > at > scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) > at > scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) > at > scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) > at > scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) > at > scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) > at > scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) > at > org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) > at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335) > at scala.reflect.api.Universe.typeOf(Universe.scala:59) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115) > at > org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100) > at > org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94) > at > org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33) > at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111) > at > com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43) > at scala.Function0$class.apply$mcV$sp(Function0.scala:40) > at > scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12) > at scala.App$$anonfun$main$1.apply(App.scala:71) > at scala.App$$anonfun$main$1.apply(App.scala:71) > at 
scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32) > at scala.App$class.main(App.scala:71) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5234) examples for ml don't have sparkContext.stop
[ https://issues.apache.org/jira/browse/SPARK-5234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang closed SPARK-5234. - fixed > examples for ml don't have sparkContext.stop > > > Key: SPARK-5234 > URL: https://issues.apache.org/jira/browse/SPARK-5234 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.2.0 > Environment: all >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Trivial > Fix For: 1.3.0, 1.2.1 > > Original Estimate: 1h > Remaining Estimate: 1h > > Not sure why sc.stop() is not in the > org.apache.spark.examples.ml {CrossValidatorExample, SimpleParamsExample, > SimpleTextClassificationPipeline}. > I can prepare a PR if it's not intentional to omit the call to stop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5281) Registering table on RDD is giving MissingRequirementError
sarsol created SPARK-5281:
-

Summary: Registering table on RDD is giving MissingRequirementError
Key: SPARK-5281
URL: https://issues.apache.org/jira/browse/SPARK-5281
Project: Spark
Issue Type: Bug
Affects Versions: 1.2.0
Reporter: sarsol

The application crashes on the line rdd.registerTempTable("temp") in version 1.2 when run with sbt or the Eclipse Scala IDE.

Stack trace:

Exception in thread "main" scala.reflect.internal.MissingRequirementError: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program Files\Java\jre7\lib\resources.jar;C:\Program Files\Java\jre7\lib\rt.jar;C:\Program Files\Java\jre7\lib\sunrsasign.jar;C:\Program Files\Java\jre7\lib\jsse.jar;C:\Program Files\Java\jre7\lib\jce.jar;C:\Program Files\Java\jre7\lib\charsets.jar;C:\Program Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found.
	at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
	at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
	at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
	at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
	at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
	at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
	at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
	at org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115)
	at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
	at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
	at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
	at scala.reflect.api.Universe.typeOf(Universe.scala:59)
	at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115)
	at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
	at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100)
	at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
	at org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94)
	at org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33)
	at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111)
	at com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App$$anonfun$main$1.apply(App.scala:71)
	at scala.App$$anonfun$main$1.apply(App.scala:71)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
	at scala.App$class.main(App.scala:71)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
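The boot classpath in the error points at the IDE's plugin jars rather than a normal application classpath, which is how scala-reflect fails to find a class that is clearly present. A commonly suggested workaround for this class of MissingRequirementError under sbt (an assumption on my part; the report itself does not state a fix) is to run the application in a forked JVM and to pin the Scala version Spark 1.2 was built against:

```scala
// build.sbt -- hedged workaround sketch, not taken from the report.
// Forking gives the program an ordinary JVM classpath instead of sbt's
// layered classloaders, which runtime reflection handles poorly.
fork := true

// Spark 1.2.x is built against Scala 2.10, so a mismatched project
// scalaVersion can produce similar reflection failures.
scalaVersion := "2.10.4"
```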
[jira] [Commented] (SPARK-4357) Modify release publishing to work with Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-4357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280037#comment-14280037 ]

François Garillot commented on SPARK-4357:
--

Scala 2.11.5 [has been released|http://scala-lang.org/news/2.11.5]. What would be the next step, and how can we help with this?

> Modify release publishing to work with Scala 2.11
> -
>
> Key: SPARK-4357
> URL: https://issues.apache.org/jira/browse/SPARK-4357
> Project: Spark
> Issue Type: Improvement
> Components: Project Infra
> Reporter: Patrick Wendell
> Assignee: Patrick Wendell
>
> We'll need to put in some effort to make our publishing work with 2.11, since
> the current pipeline assumes a single set of artifacts is published.
[jira] [Commented] (SPARK-5147) write ahead logs from streaming receiver are not purged because cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called
[ https://issues.apache.org/jira/browse/SPARK-5147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280034#comment-14280034 ]

François Garillot commented on SPARK-5147:
--

I see. Thanks for your answers!

For the locality issue, how about running recovery from the WAL as if it were replication? In that sense, we would be using the WAL's HDFS write as a transport mechanism (since it replicates to two other executors), and then recreating a block at the endpoint. Perhaps it's worth noting this idea in a JIRA as a possible future enhancement?

> write ahead logs from streaming receiver are not purged because
> cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called
> --
>
> Key: SPARK-5147
> URL: https://issues.apache.org/jira/browse/SPARK-5147
> Project: Spark
> Issue Type: Sub-task
> Components: Streaming
> Affects Versions: 1.2.0
> Reporter: Max Xu
> Priority: Blocker
>
> Hi all,
> We are running a Spark Streaming application with ReliableKafkaReceiver. We
> have "spark.streaming.receiver.writeAheadLog.enable" set to true, so write
> ahead logs (WALs) for received data are created under the receivedData/streamId
> folder in the checkpoint directory.
> However, old WALs are never purged over time; receivedBlockMetadata and
> checkpoint files are purged correctly, though. I went through the code: the
> WriteAheadLogBasedBlockHandler class in ReceivedBlockHandler.scala is
> responsible for cleaning up the old blocks. It has a method cleanupOldBlocks,
> which is never called by any class. The ReceiverSupervisorImpl class holds a
> WriteAheadLogBasedBlockHandler instance; however, it only calls the storeBlock
> method to create WALs and never calls cleanupOldBlocks to purge old ones.
> The size of the WAL folder increases constantly on HDFS. This is preventing
> us from running the ReliableKafkaReceiver 24x7. Can somebody please take a
> look?
>
> Thanks,
> Max
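For intuition, the purge that the unused cleanupOldBlocks is meant to perform is a simple age-based sweep over log segments. A minimal sketch of that logic, with illustrative names (LogSegment, purgeOlderThan) that are not Spark's actual classes:

```scala
// Hedged sketch: delete every WAL segment whose end time falls before a
// threshold, keep the rest. Spark's real handler works on files in the
// checkpoint directory; this stand-in just models the selection step.
case class LogSegment(path: String, startTime: Long, endTime: Long)

def purgeOlderThan(
    segments: Seq[LogSegment],
    threshTime: Long): (Seq[LogSegment], Seq[LogSegment]) = {
  // partition returns (matching, non-matching): (to delete, to keep)
  segments.partition(_.endTime < threshTime)
}

val segs = Seq(
  LogSegment("wal-0", 0L, 100L),
  LogSegment("wal-1", 100L, 200L),
  LogSegment("wal-2", 200L, 300L)
)
val (toDelete, kept) = purgeOlderThan(segs, 150L)
// toDelete holds only "wal-0"; the two newer segments are retained
```

The bug reported above is simply that nothing ever invokes this step, so `toDelete` is never computed and the folder grows without bound.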
[jira] [Commented] (SPARK-5186) Vector.equals and Vector.hashCode are very inefficient and fail on SparseVectors with large size
[ https://issues.apache.org/jira/browse/SPARK-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280025#comment-14280025 ]

yuhao yang commented on SPARK-5186:
---

I just updated the PR with a hashCode fix. Please help review it when you can.

> Vector.equals and Vector.hashCode are very inefficient and fail on
> SparseVectors with large size
> -
>
> Key: SPARK-5186
> URL: https://issues.apache.org/jira/browse/SPARK-5186
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.2.0
> Reporter: Derrick Burns
> Original Estimate: 0.25h
> Remaining Estimate: 0.25h
>
> The implementations of Vector.equals and Vector.hashCode are correct but slow
> for SparseVectors that are truly sparse.
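The inefficiency comes from comparing vectors slot by slot over their full logical size. A sketch of the faster approach, assuming we only walk the stored entries; SparseVec here is an illustrative stand-in, not MLlib's SparseVector:

```scala
// Hedged sketch: compare two sparse vectors by their nonzero (index, value)
// pairs instead of materializing dense arrays. Explicitly stored zeros are
// filtered out so that equal vectors with different storage compare equal.
case class SparseVec(size: Int, indices: Array[Int], values: Array[Double])

def sparseEquals(a: SparseVec, b: SparseVec): Boolean = {
  if (a.size != b.size) return false
  // Nonzero entries as (index, value) pairs, already in index order.
  def active(v: SparseVec): Seq[(Int, Double)] =
    v.indices.zip(v.values).filter(_._2 != 0.0).toSeq
  active(a) == active(b)
}

val x = SparseVec(1000000, Array(3, 7), Array(1.0, 2.0))
val y = SparseVec(1000000, Array(3, 5, 7), Array(1.0, 0.0, 2.0)) // explicit zero at 5
// sparseEquals(x, y) is true, and only the stored entries are touched,
// whereas a dense comparison would scan all 1,000,000 slots.
```

A hashCode consistent with this equality would likewise fold over only the nonzero pairs, which is presumably the shape of the fix in the PR.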
[jira] [Created] (SPARK-5280) Import RDF graphs into GraphX
lukovnikov created SPARK-5280:
-

Summary: Import RDF graphs into GraphX
Key: SPARK-5280
URL: https://issues.apache.org/jira/browse/SPARK-5280
Project: Spark
Issue Type: New Feature
Components: GraphX
Reporter: lukovnikov

RDF (Resource Description Framework) models knowledge in a graph and is heavily used on the Semantic Web and beyond. GraphX should include a way to import RDF data easily.
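As a sense of what such an importer might start from: RDF's simplest serialization, N-Triples, is one (subject, predicate, object) statement per line, which maps naturally onto GraphX edges. A simplified parsing sketch, with illustrative names and no handling of literals containing spaces, comments, or escapes; this is not a proposed GraphX API:

```scala
// Hedged sketch: parse an N-Triples line into a (subject, predicate, object)
// statement. In GraphX terms, subject and object would become vertices and
// the predicate the edge attribute.
case class Triple(subj: String, pred: String, obj: String)

def parseNTriple(line: String): Option[Triple] = {
  // N-Triples statements end with " ." and separate terms with whitespace.
  val body = line.trim.stripSuffix(".").trim
  body.split("\\s+", 3) match {
    case Array(s, p, o) => Some(Triple(s, p, o))
    case _              => None
  }
}

val stmt = parseNTriple(
  "<http://ex.org/a> <http://ex.org/knows> <http://ex.org/b> .")
// stmt contains Triple("<http://ex.org/a>", "<http://ex.org/knows>", "<http://ex.org/b>")
```

A real importer would also need IRI/blank-node/literal distinctions and a vertex-ID assignment step before building the GraphX Graph.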
[jira] [Commented] (SPARK-4867) UDF clean up
[ https://issues.apache.org/jira/browse/SPARK-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279983#comment-14279983 ]

Reynold Xin commented on SPARK-4867:
--

BTW, if we plan to implement most SQL functions using this new UDF interface, then we should consider making mutable primitive types first-class citizens. Otherwise we will incur a huge performance hit whenever functions on primitives are invoked.

> UDF clean up
> -
>
> Key: SPARK-4867
> URL: https://issues.apache.org/jira/browse/SPARK-4867
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Michael Armbrust
> Priority: Blocker
>
> Right now our support for, and internal implementation of, many functions have
> a few issues. Specifically:
> - UDFs don't know their input types and thus don't do type coercion.
> - We hard-code a bunch of built-in functions into the parser. This is bad
> because in SQL it creates new reserved words for things that aren't actually
> keywords. Also, it means that for each function we need to add support to
> both SQLContext and HiveContext separately.
> For this JIRA I propose we do the following:
> - Change the interfaces for registerFunction and ScalaUdf to include types
> for the input arguments as well as the output type.
> - Add a rule to analysis that does type coercion for UDFs.
> - Add a parse rule for functions to SQLParser.
> - Rewrite all the UDFs that are currently hacked into the various parsers
> using this new functionality.
> Depending on how big this refactoring becomes, we could split parts 1 & 2 from
> part 3 above.
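The first two proposal items amount to a registry in which each UDF declares its input and output types, so an analysis rule can insert coercions instead of failing at runtime. A minimal sketch under that assumption; the types and registry below are illustrative stand-ins, not Catalyst's actual classes:

```scala
// Hedged sketch of typed UDF registration: a UDF carries declared input and
// output types alongside its function, so an analyzer rule can compare each
// argument's type against inputTypes and wrap mismatches in casts.
sealed trait SqlType
case object IntType extends SqlType
case object StringType extends SqlType

case class TypedUdf(inputTypes: Seq[SqlType], outputType: SqlType, fn: Seq[Any] => Any)

object UdfRegistry {
  private var udfs = Map.empty[String, TypedUdf]
  def register(name: String, udf: TypedUdf): Unit = udfs += name -> udf
  def lookup(name: String): Option[TypedUdf] = udfs.get(name)
}

UdfRegistry.register("strlen",
  TypedUdf(Seq(StringType), IntType, args => args.head.asInstanceOf[String].length))

val udf = UdfRegistry.lookup("strlen").get
// udf.inputTypes tells the analyzer to coerce the argument to a string;
// udf.fn(Seq("spark")) evaluates to 5
```

Reynold's comment then points at the evaluation side: with fn taking boxed `Any` values as above, every call on primitives pays boxing costs, which is why mutable primitive types would need first-class treatment.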
[jira] [Commented] (SPARK-5270) Elegantly check if RDD is empty
[ https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279962#comment-14279962 ]

Al M commented on SPARK-5270:
-

Good point, it's not a catch-all solution. The rdd.partitions.size approach does work well for the empty RDDs created by Spark Streaming.

> Elegantly check if RDD is empty
> ---
>
> Key: SPARK-5270
> URL: https://issues.apache.org/jira/browse/SPARK-5270
> Project: Spark
> Issue Type: Improvement
> Affects Versions: 1.2.0
> Environment: Centos 6
> Reporter: Al M
> Priority: Trivial
>
> Right now there is no clean way to check if an RDD is empty. As discussed
> here:
> http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679
> I'd like a method rdd.isEmpty that returns a boolean.
> This would be especially useful when using streams. Sometimes my batches are
> huge in one stream; sometimes I get nothing for hours. Still, I have to run
> count() to check if there is anything in the RDD. I can process my empty RDDs
> like the others, but it would be more efficient to just skip the empty ones.
> I can also run first() and catch the exception; this is neither a clean nor a
> fast solution.
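One way to get the requested check without a full count() is to look at only the first element. A sketch of the idea using a tiny stand-in class so it stays self-contained; on a real RDD the same check would be spelled rdd.take(1).isEmpty, which stops at the first non-empty partition:

```scala
// Hedged sketch: isEmpty as "is there a first element?". MiniRdd is an
// illustrative stand-in for an RDD, modeled as a sequence of partitions.
class MiniRdd[T](partitions: Seq[Seq[T]]) {
  // Lazily walk partitions and stop as soon as n elements are found,
  // mirroring how take(1) avoids scanning the whole dataset.
  def take(n: Int): Seq[T] =
    partitions.iterator.flatMap(_.iterator).take(n).toSeq

  def isEmpty: Boolean = take(1).isEmpty
}

val idleBatch = new MiniRdd[Int](Seq(Seq(), Seq()))   // e.g. a quiet streaming interval
val busyBatch = new MiniRdd[Int](Seq(Seq(), Seq(42)))
// idleBatch.isEmpty is true; busyBatch.isEmpty is false
```

Note the caveat from the comment above: checking partitions alone (the rdd.partitions.size idea) covers the zero-partition RDDs Spark Streaming produces, but an RDD can also have partitions that contain no records, which is why the take(1)-style check is the more general one.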