[jira] [Resolved] (SPARK-1293) Support for reading/writing complex types in Parquet

2014-06-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-1293.


   Resolution: Fixed
Fix Version/s: 1.0.1

> Support for reading/writing complex types in Parquet
> 
>
> Key: SPARK-1293
> URL: https://issues.apache.org/jira/browse/SPARK-1293
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Andre Schumacher
> Fix For: 1.0.1, 1.1.0
>
>
> Complex types include: Arrays, Maps, and Nested rows (structs).
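For illustration, a minimal sketch of reading and writing such types with the
SchemaRDD API of this era (the Person/Contact classes and the path are
hypothetical):

{code}
case class Contact(phone: String, email: String)                // nested struct
case class Person(name: String, scores: Seq[Int],               // array
                  props: Map[String, String], contact: Contact) // map + struct

import sqlContext.createSchemaRDD
val people = sc.parallelize(Seq(Person("alice", Seq(1, 2, 3),
  Map("team" -> "sql"), Contact("555-0100", "a@example.com"))))
people.saveAsParquetFile("people.parquet")            // write complex types
val loaded = sqlContext.parquetFile("people.parquet") // read them back
{code}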



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-768) Fail a task when the remote block it is fetching is not serializable

2014-06-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-768.
---

Resolution: Cannot Reproduce
  Assignee: Raymond Liu  (was: Reynold Xin)

> Fail a task when the remote block it is fetching is not serializable
> 
>
> Key: SPARK-768
> URL: https://issues.apache.org/jira/browse/SPARK-768
> Project: Spark
>  Issue Type: Bug
>Reporter: Reynold Xin
>Assignee: Raymond Liu
>
> When a task is fetching a remote block (e.g. locality wait exceeded), and if 
> the block is not serializable, the task would hang.
> The block manager should fail the task instead of hanging the task ... once 
> the task fails, eventually it will get scheduled to the local node to be 
> executed successfully. 
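For illustration, a minimal sketch of the reported scenario (the class and RDD
are hypothetical):

{code}
// A value kept in memory in deserialized form, but not serializable:
class NotSerializable(val x: Int) // deliberately does not extend Serializable

val rdd = sc.parallelize(1 to 100, 4).map(new NotSerializable(_)).cache()
rdd.count() // materialize the deserialized in-memory blocks
// A later task scheduled on a different node (e.g. after the locality wait
// is exceeded) must fetch a block remotely; serializing the block for the
// transfer then fails, and per this report the task hung instead of failing.
rdd.map(_.x).count()
{code}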



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-768) Fail a task when the remote block it is fetching is not serializable

2014-06-19 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038536#comment-14038536
 ] 

Reynold Xin commented on SPARK-768:
---

Thanks for confirming. I'm going to close this issue then.


> Fail a task when the remote block it is fetching is not serializable
> 
>
> Key: SPARK-768
> URL: https://issues.apache.org/jira/browse/SPARK-768
> Project: Spark
>  Issue Type: Bug
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> When a task is fetching a remote block (e.g. locality wait exceeded), and if 
> the block is not serializable, the task would hang.
> The block manager should fail the task instead of hanging the task ... once 
> the task fails, eventually it will get scheduled to the local node to be 
> executed successfully. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-768) Fail a task when the remote block it is fetching is not serializable

2014-06-19 Thread Raymond Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038534#comment-14038534
 ] 

Raymond Liu commented on SPARK-768:
---

Hi Reynold

If this is the first case, then yes, I think it won't hang, at least from what 
I observe in my test and in the code along this path. The only concern is that 
the recompute might be a problem: if I do the same thing on the cached RDD for 
many iterations, eventually every node will end up storing a local copy of 
each partition's block. We can either accept this behavior, or modify the 
block ack message to identify this specific case instead of returning None as 
if the block were not found.

> Fail a task when the remote block it is fetching is not serializable
> 
>
> Key: SPARK-768
> URL: https://issues.apache.org/jira/browse/SPARK-768
> Project: Spark
>  Issue Type: Bug
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> When a task is fetching a remote block (e.g. locality wait exceeded), and if 
> the block is not serializable, the task would hang.
> The block manager should fail the task instead of hanging the task ... once 
> the task fails, eventually it will get scheduled to the local node to be 
> executed successfully. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2177) describe table result contains only one column

2014-06-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-2177.


   Resolution: Fixed
Fix Version/s: 1.1.0
   1.0.1

> describe table result contains only one column
> --
>
> Key: SPARK-2177
> URL: https://issues.apache.org/jira/browse/SPARK-2177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Yin Huai
> Fix For: 1.0.1, 1.1.0
>
>
> {code}
> scala> hql("describe src").collect().foreach(println)
> [key  string  None]
> [valuestring  None]
> {code}
> The result should contain 3 columns instead of one. This screws up JDBC or 
> even the downstream consumer of the Scala/Java/Python APIs.
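Illustratively, a check the fixed behavior should satisfy (a sketch; in this
era's API a Row behaves as a Seq, so its length counts its columns):

{code}
// Each DESCRIBE row should expose three separate fields,
// e.g. (col_name, data_type, comment), not one concatenated string.
val rows = hql("describe src").collect()
assert(rows.forall(_.length == 3))
{code}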



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1477) Add the lifecycle interface

2014-06-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-1477:
---

Assignee: Guoqiang Li

> Add the lifecycle interface
> ---
>
> Key: SPARK-1477
> URL: https://issues.apache.org/jira/browse/SPARK-1477
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>
> In the current Spark code, many interfaces and classes define their own stop 
> and start methods, e.g. 
> [SchedulerBackend|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala], [HttpServer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala], [ContextCleaner|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala]. 
> We should introduce a common lifecycle interface to improve the code.
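For illustration, a minimal sketch of what such a lifecycle interface could
look like (the trait name and placement are hypothetical, not a committed
design):

{code}
// Components such as SchedulerBackend, HttpServer, or ContextCleaner
// would implement this instead of declaring their own start/stop pairs.
trait LifeCycle {
  def start(): Unit
  def stop(): Unit
}
{code}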



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1477) Add the lifecycle interface

2014-06-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-1477:
---

 Target Version/s: 1.1.0
Affects Version/s: 1.0.1

> Add the lifecycle interface
> ---
>
> Key: SPARK-1477
> URL: https://issues.apache.org/jira/browse/SPARK-1477
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Guoqiang Li
>
> In the current Spark code, many interfaces and classes define their own stop 
> and start methods, e.g. 
> [SchedulerBackend|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala], [HttpServer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala], [ContextCleaner|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala]. 
> We should introduce a common lifecycle interface to improve the code.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2201) Improve FlumeInputDStream's stability

2014-06-19 Thread chao.wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038525#comment-14038525
 ] 

chao.wu commented on SPARK-2201:


good idea

> Improve FlumeInputDStream's stability
> -
>
> Key: SPARK-2201
> URL: https://issues.apache.org/jira/browse/SPARK-2201
> Project: Spark
>  Issue Type: Improvement
>Reporter: sunshangchun
>
> Currently only one Flume receiver can work with FlumeInputDStream, and I am 
> willing to do some work to improve it. My ideas are as follows: 
> an IP and port denote a physical host, and a logical host consists of one or 
> more physical hosts.
> In our case, Spark Flume receivers bind themselves to a logical host when 
> started, and a Flume agent looks up the physical hosts and pushes events to 
> them.
> Two classes are introduced: LogicalHostRouter supplies a map between logical 
> hosts and physical hosts, and LogicalHostRouterListener makes relation 
> changes watchable.
> Some work needs to be done here: 
> 1. LogicalHostRouter and LogicalHostRouterListener can be implemented with 
> ZooKeeper: when a physical host starts, it creates a temporary node in ZK, 
> and listeners just watch those temporary nodes.
> 2. When Spark FlumeReceivers start, they acquire a physical host (localhost's 
> IP and an idle port) and register themselves with ZooKeeper.
> 3. A new Flume sink: in its appendEvents method, it gets the physical hosts 
> and pushes data to them in a round-robin manner.
> Is this a feasible plan? Thanks.
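A minimal sketch of the two proposed classes (all signatures are hypothetical;
the actual implementation would sit on top of ZooKeeper ephemeral nodes):

{code}
// Notified when the logical-host -> physical-host mapping changes.
trait LogicalHostRouterListener {
  def physicalHostAdded(logicalHost: String, ipAndPort: String): Unit
  def physicalHostRemoved(logicalHost: String, ipAndPort: String): Unit
}

// Maps a logical host to the physical hosts (ip:port) currently registered.
trait LogicalHostRouter {
  def getPhysicalHosts(logicalHost: String): Seq[String]
  def registerPhysicalHost(logicalHost: String, ipAndPort: String): Unit
  def addListener(listener: LogicalHostRouterListener): Unit
}
{code}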



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-768) Fail a task when the remote block it is fetching is not serializable

2014-06-19 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038523#comment-14038523
 ] 

Reynold Xin commented on SPARK-768:
---

I think it was the first case. It used to be that when a block was kept in 
memory in deserialized form and a task scheduled on a remote node tried to 
fetch it, the whole thing would hang if the block was not serializable.

Maybe we have already fixed it. If you can verify this is no longer a problem, 
we can close the ticket. Thanks!


> Fail a task when the remote block it is fetching is not serializable
> 
>
> Key: SPARK-768
> URL: https://issues.apache.org/jira/browse/SPARK-768
> Project: Spark
>  Issue Type: Bug
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> When a task is fetching a remote block (e.g. locality wait exceeded), and if 
> the block is not serializable, the task would hang.
> The block manager should fail the task instead of hanging the task ... once 
> the task fails, eventually it will get scheduled to the local node to be 
> executed successfully. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2201) Improve FlumeInputDStream's stability

2014-06-19 Thread sunshangchun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sunshangchun updated SPARK-2201:


Summary: Improve FlumeInputDStream's stability  (was: Improve 
FlumeInputDStream)

> Improve FlumeInputDStream's stability
> -
>
> Key: SPARK-2201
> URL: https://issues.apache.org/jira/browse/SPARK-2201
> Project: Spark
>  Issue Type: Improvement
>Reporter: sunshangchun
>
> Currently only one Flume receiver can work with FlumeInputDStream, and I am 
> willing to do some work to improve it. My ideas are as follows: 
> an IP and port denote a physical host, and a logical host consists of one or 
> more physical hosts.
> In our case, Spark Flume receivers bind themselves to a logical host when 
> started, and a Flume agent looks up the physical hosts and pushes events to 
> them.
> Two classes are introduced: LogicalHostRouter supplies a map between logical 
> hosts and physical hosts, and LogicalHostRouterListener makes relation 
> changes watchable.
> Some work needs to be done here: 
> 1. LogicalHostRouter and LogicalHostRouterListener can be implemented with 
> ZooKeeper: when a physical host starts, it creates a temporary node in ZK, 
> and listeners just watch those temporary nodes.
> 2. When Spark FlumeReceivers start, they acquire a physical host (localhost's 
> IP and an idle port) and register themselves with ZooKeeper.
> 3. A new Flume sink: in its appendEvents method, it gets the physical hosts 
> and pushes data to them in a round-robin manner.
> Is this a feasible plan? Thanks.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2212) HashJoin

2014-06-19 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038514#comment-14038514
 ] 

Cheng Hao commented on SPARK-2212:
--

https://github.com/apache/spark/pull/1147

> HashJoin
> 
>
> Key: SPARK-2212
> URL: https://issues.apache.org/jira/browse/SPARK-2212
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2215) Multi-way join

2014-06-19 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038513#comment-14038513
 ] 

Cheng Hao commented on SPARK-2215:
--

The multi-way join implementation in Shark is quite complicated, but we have 
real cases showing that it can improve join performance dramatically. I can 
start working on a prototype for it soon.

> Multi-way join
> --
>
> Key: SPARK-2215
> URL: https://issues.apache.org/jira/browse/SPARK-2215
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Minor
>
> Support the multi-way join (multiple table joins) in a single reduce stage if 
> they have the same join keys.
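For illustration, the query shape this targets (tables a, b, c are
hypothetical):

{code}
hql("SELECT * FROM a JOIN b ON a.k = b.k JOIN c ON a.k = c.k")
// All joins share the key `k`, so the three tables can be co-partitioned on k
// and joined in one reduce stage instead of two chained shuffle joins.
{code}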



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2215) Multi-way join

2014-06-19 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-2215:
-

Description: Support the multi-way join (multiple table joins) in a single 
reduce stage if they have the same join key.  (was: Support the multi-way join 
(multiple table joins) in a single reduce stage if they has the same join key.)

> Multi-way join
> --
>
> Key: SPARK-2215
> URL: https://issues.apache.org/jira/browse/SPARK-2215
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Minor
>
> Support the multi-way join (multiple table joins) in a single reduce stage if 
> they have the same join key.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2215) Multi-way join

2014-06-19 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-2215:
-

Description: Support the multi-way join (multiple table joins) in a single 
reduce stage if they have the same join keys.  (was: Support the multi-way join 
(multiple table joins) in a single reduce stage if they have the same join key.)

> Multi-way join
> --
>
> Key: SPARK-2215
> URL: https://issues.apache.org/jira/browse/SPARK-2215
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Minor
>
> Support the multi-way join (multiple table joins) in a single reduce stage if 
> they have the same join keys.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2216) Cost-based join reordering

2014-06-19 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038510#comment-14038510
 ] 

Cheng Hao commented on SPARK-2216:
--

Yes, this could be a big change. I think we need to add some sub-tasks for it 
and implement it gradually.

> Cost-based join reordering
> --
>
> Key: SPARK-2216
> URL: https://issues.apache.org/jira/browse/SPARK-2216
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>
> Cost-based join reordering



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2218) rename Equals to EqualTo in Spark SQL expressions

2014-06-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2218:
---

Summary: rename Equals to EqualTo in Spark SQL expressions  (was: rename 
Equals to EqualsTo in Spark SQL expressions)

> rename Equals to EqualTo in Spark SQL expressions
> -
>
> Key: SPARK-2218
> URL: https://issues.apache.org/jira/browse/SPARK-2218
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> The class name Equals is very error prone because there exists scala.Equals. 
> I just wasted a bunch of time debugging the optimizer because of this.
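A sketch of the hazard (scala.Equals is always in scope because the scala
package is auto-imported):

{code}
// In a file that doesn't import the Catalyst class explicitly, this
// typechecks but tests against the standard-library trait scala.Equals,
// not the SQL expression:
def looksEqual(x: Any): Boolean = x.isInstanceOf[Equals]
// Renaming the expression class to EqualTo removes the collision.
{code}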



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2218) rename Equals to EqualsTo in Spark SQL expressions

2014-06-19 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038500#comment-14038500
 ] 

Reynold Xin commented on SPARK-2218:


Michael has a PR here https://github.com/apache/spark/pull/734

It is not fully ready yet.

> rename Equals to EqualsTo in Spark SQL expressions
> --
>
> Key: SPARK-2218
> URL: https://issues.apache.org/jira/browse/SPARK-2218
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> The class name Equals is very error prone because there exists scala.Equals. 
> I just wasted a bunch of time debugging the optimizer because of this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2214) Broadcast Join (aka map join)

2014-06-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2214:
---

Summary: Broadcast Join (aka map join)  (was: MapSide Join)
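For context, a broadcast (map-side) join ships the small table to every
executor so the large side can join without a shuffle; a sketch of the idea
with plain RDDs (smallRdd/largeRdd are hypothetical, and this is not the SQL
operator itself):

{code}
val small: Map[Int, String] = smallRdd.collect().toMap // small side to driver
val bcast = sc.broadcast(small)                        // ship to every executor
val joined = largeRdd.flatMap { case (k, v) =>
  bcast.value.get(k).map(w => (k, (v, w)))             // map-side hash lookup
}
{code}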

> Broadcast Join (aka map join)
> -
>
> Key: SPARK-2214
> URL: https://issues.apache.org/jira/browse/SPARK-2214
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2215) Multi-way join

2014-06-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2215:
---

Priority: Minor  (was: Major)

> Multi-way join
> --
>
> Key: SPARK-2215
> URL: https://issues.apache.org/jira/browse/SPARK-2215
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Minor
>
> Support the multi-way join (multiple table joins) in a single reduce stage if 
> they have the same join key.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2215) Multi-way join

2014-06-19 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038497#comment-14038497
 ] 

Reynold Xin commented on SPARK-2215:


I personally find the multi-way join operator extremely complicated and am not 
sure it is the best idea. We implemented it in Shark, but I think there are 
only 2 people in this world that understand that code ...

> Multi-way join
> --
>
> Key: SPARK-2215
> URL: https://issues.apache.org/jira/browse/SPARK-2215
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>
> Support the multi-way join (multiple table joins) in a single reduce stage if 
> they have the same join key.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2216) Cost-based join reordering

2014-06-19 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038494#comment-14038494
 ] 

Reynold Xin commented on SPARK-2216:


The prerequisite of this change is to design the APIs for cardinality and size 
estimation for operators. 
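A minimal sketch of what such an estimation API might look like (the names are
hypothetical, not a committed design):

{code}
trait Statistics {
  def estimatedRowCount: Option[Long]     // cardinality estimate
  def estimatedSizeInBytes: Option[Long]  // physical size estimate
}
// Each operator would expose a Statistics instance; the optimizer compares
// the estimated cost of candidate join orders and picks the cheapest.
{code}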

> Cost-based join reordering
> --
>
> Key: SPARK-2216
> URL: https://issues.apache.org/jira/browse/SPARK-2216
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>
> Cost-based join reordering



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2218) rename Equals to EqualsTo in Spark SQL expressions

2014-06-19 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2218:
--

 Summary: rename Equals to EqualsTo in Spark SQL expressions
 Key: SPARK-2218
 URL: https://issues.apache.org/jira/browse/SPARK-2218
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Reynold Xin
Assignee: Reynold Xin


The class name Equals is very error prone because there exists scala.Equals. I 
just wasted a bunch of time debugging the optimizer because of this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2217) When casting BigDecimal to Timestamp, BigDecimal.longValue() may be negative

2014-06-19 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-2217:
-

 Summary: When casting BigDecimal to Timestamp, 
BigDecimal.longValue() may be negative
 Key: SPARK-2217
 URL: https://issues.apache.org/jira/browse/SPARK-2217
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Cheng Lian


Please refer to this PR comment: 
https://github.com/apache/spark/pull/1143/files#discussion_r14007203
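The underlying hazard, illustrated: longValue() is a narrowing conversion, so
a value whose magnitude exceeds Long keeps only its low-order 64 bits and can
wrap around to a negative number.

{code}
scala> BigDecimal("1e19").longValue
res0: Long = -8446744073709551616  // 1e19 > Long.MaxValue, wraps negative
{code}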



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2216) Cost-based join reordering

2014-06-19 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-2216:


 Summary: Cost-based join reordering
 Key: SPARK-2216
 URL: https://issues.apache.org/jira/browse/SPARK-2216
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao


Cost-based join reordering



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2215) Multi-way join

2014-06-19 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-2215:


 Summary: Multi-way join
 Key: SPARK-2215
 URL: https://issues.apache.org/jira/browse/SPARK-2215
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao


Support the multi-way join (multiple table joins) in a single reduce stage if 
they have the same join key.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2214) MapSide Join

2014-06-19 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-2214:


 Summary: MapSide Join
 Key: SPARK-2214
 URL: https://issues.apache.org/jira/browse/SPARK-2214
 Project: Spark
  Issue Type: Sub-task
Reporter: Cheng Hao






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2213) Sort Merge Join

2014-06-19 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-2213:


 Summary: Sort Merge Join
 Key: SPARK-2213
 URL: https://issues.apache.org/jira/browse/SPARK-2213
 Project: Spark
  Issue Type: Sub-task
Reporter: Cheng Hao






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2212) HashJoin

2014-06-19 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-2212:


 Summary: HashJoin
 Key: SPARK-2212
 URL: https://issues.apache.org/jira/browse/SPARK-2212
 Project: Spark
  Issue Type: Sub-task
Reporter: Cheng Hao
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2211) Join Optimization

2014-06-19 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-2211:


 Summary: Join Optimization
 Key: SPARK-2211
 URL: https://issues.apache.org/jira/browse/SPARK-2211
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Priority: Minor


This includes a couple of sub-tasks for Join Optimization in Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2210) cast to boolean on boolean value gets turned into NOT((boolean_condition) = 0)

2014-06-19 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2210:
--

 Summary: cast to boolean on boolean value gets turned into 
NOT((boolean_condition) = 0)
 Key: SPARK-2210
 URL: https://issues.apache.org/jira/browse/SPARK-2210
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Reynold Xin
Assignee: Reynold Xin


{code}
explain select cast(cast(key=0 as boolean) as boolean) aaa from src
{code}

should be

{code}
[Physical execution plan:]
[Project [(key#10:0 = 0) AS aaa#7]]
[ HiveTableScan [key#10], (MetastoreRelation default, src, None), None]
{code}

However, it is currently
{code}
[Physical execution plan:]
[Project [NOT((key#10=0) = 0) AS aaa#7]]
[ HiveTableScan [key#10], (MetastoreRelation default, src, None), None]
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1949) Servlet 2.5 vs 3.0 conflict in SBT build

2014-06-19 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038441#comment-14038441
 ] 

Andrew Ash commented on SPARK-1949:
---

Sean's PR: https://github.com/apache/spark/pull/906

> Servlet 2.5 vs 3.0 conflict in SBT build
> 
>
> Key: SPARK-1949
> URL: https://issues.apache.org/jira/browse/SPARK-1949
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Sean Owen
>Priority: Minor
>
> [~kayousterhout] mentioned that:
> {quote}
> I had some trouble compiling an application (Shark) against Spark 1.0,
> where Shark had a runtime exception (at the bottom of this message) because
> it couldn't find the javax.servlet classes.  SBT seemed to have trouble
> downloading the servlet APIs that are dependencies of Jetty (used by the
> Spark web UI), so I had to manually add them to the application's build
> file:
> libraryDependencies += "org.mortbay.jetty" % "servlet-api" % "3.0.20100224"
> Not exactly sure why this happens but thought it might be useful in case
> others run into the same problem.
> {quote}
> This is a symptom of Servlet API conflict which we battled in the Maven 
> build. The resolution is to nix Servlet 2.5 and odd old Jetty / Netty 3.x 
> dependencies. It looks like the Hive part of the assembly in the SBT build 
> doesn't exclude all these entirely.
> I'll open a suggested PR to band-aid the SBT build.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2208) local metrics tests can fail on fast machines

2014-06-19 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038414#comment-14038414
 ] 

Patrick Wendell commented on SPARK-2208:


A hotfix was merged here, but we should really fix the test:
https://github.com/apache/spark/pull/1141

> local metrics tests can fail on fast machines
> -
>
> Key: SPARK-2208
> URL: https://issues.apache.org/jira/browse/SPARK-2208
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>  Labels: starter
>
> I'm temporarily disabling this check. I think the issue is that on fast 
> machines the fetch wait time can actually be zero, even across all tasks.
> We should see if we can write this in a different way to make sure there is a 
> delay.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1209) SparkHadoopUtil should not use package org.apache.hadoop

2014-06-19 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038328#comment-14038328
 ] 

Mark Grover commented on SPARK-1209:


OK, I will take over. Thanks, Sandy.

> SparkHadoopUtil should not use package org.apache.hadoop
> 
>
> Key: SPARK-1209
> URL: https://issues.apache.org/jira/browse/SPARK-1209
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 0.9.0
>Reporter: Sandy Pérez González
>Assignee: Mark Grover
>
> It's private, so the change won't break compatibility



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-768) Fail a task when the remote block it is fetching is not serializable

2014-06-19 Thread Raymond Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038307#comment-14038307
 ] 

Raymond Liu commented on SPARK-768:
---

And for case 2, the problem is that the current code does not seem to 
distinguish between a NotSerializableException thrown while fetching a remote 
block during computation and one thrown while serializing the task result. It 
treats both as the task result being non-serializable and aborts the whole 
taskset, so I think the job will fail in the end. Is this what you mean by 
hanging?

> Fail a task when the remote block it is fetching is not serializable
> 
>
> Key: SPARK-768
> URL: https://issues.apache.org/jira/browse/SPARK-768
> Project: Spark
>  Issue Type: Bug
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> When a task is fetching a remote block (e.g. locality wait exceeded), and if 
> the block is not serializable, the task would hang.
> The block manager should fail the task instead of hanging the task ... once 
> the task fails, eventually it will get scheduled to the local node to be 
> executed successfully. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2209) Cast shouldn't do null check twice

2014-06-19 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038295#comment-14038295
 ] 

Reynold Xin commented on SPARK-2209:


https://github.com/apache/spark/pull/1143

> Cast shouldn't do null check twice
> --
>
> Key: SPARK-2209
> URL: https://issues.apache.org/jira/browse/SPARK-2209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.0.1, 1.1.0
>
>
> Cast does two null checks, one in eval and another one in the function 
> returned by nullOrCast. It's best to get rid of the one in nullOrCast (since 
> eval will be the more common code path).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2209) Cast shouldn't do null check twice

2014-06-19 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2209:
--

 Summary: Cast shouldn't do null check twice
 Key: SPARK-2209
 URL: https://issues.apache.org/jira/browse/SPARK-2209
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.0.1, 1.1.0


Cast does two null checks, one in eval and another one in the function returned 
by nullOrCast. It's best to get rid of the one in nullOrCast (since eval will 
be the more common code path).
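A simplified sketch of the redundancy (not the actual Catalyst code; it
assumes Catalyst's Expression and Row types are in scope):

{code}
abstract class CastSketch(child: Expression, castFn: Any => Any) {
  def eval(input: Row): Any = {
    val value = child.eval(input)
    if (value == null) null else castFn(value) // null check #1, common path
  }
  // nullOrCast wraps castFn in a second `if (value == null)` guard,
  // which is redundant once eval has already filtered out nulls.
}
{code}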



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-768) Fail a task when the remote block it is fetching is not serializable

2014-06-19 Thread Raymond Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038288#comment-14038288
 ] 

Raymond Liu commented on SPARK-768:
---

Hi Reynold

I am trying to figure out this issue. Here is my understanding: when the 
situation you mentioned happens, it means the block is stored at a memory 
storage level without serialization; otherwise the exception would already 
have been thrown in a previous step. Under this condition, I can see two cases 
that might run into this problem: 

1. The RDD is cached in memory and, as you mentioned, a task gets run on 
another node. In this case, it seems to me that the BlockManager's remote 
fetch will catch the exception in the ConnectionManager and return None to the 
CacheManager, so the task falls back to the compute code path. This leads to 
redundant computation and a second copy of the block being stored, but it does 
not hang the task, and the job eventually gets done. I have written some test 
cases to verify this. For this case, we might find some way to optimize it?

2. You are using a BlockRDD in the DStream case, and the storage level is 
memory. Then, upon computing the BlockRDD on another node, the exception is 
thrown, and in this case I think the task executor will catch the exception 
and fail the task?

So either case, it seems to me, will eventually finish the job. I am wondering 
what case I am missing that would lead to the task hanging. Can you kindly 
give me an example?

> Fail a task when the remote block it is fetching is not serializable
> 
>
> Key: SPARK-768
> URL: https://issues.apache.org/jira/browse/SPARK-768
> Project: Spark
>  Issue Type: Bug
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> When a task is fetching a remote block (e.g. locality wait exceeded), and if 
> the block is not serializable, the task would hang.
> The block manager should fail the task instead of hanging the task ... once 
> the task fails, eventually it will get scheduled to the local node to be 
> executed successfully. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2200) breeze DenseVector not serializable with KryoSerializer

2014-06-19 Thread Neville Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038223#comment-14038223
 ] 

Neville Li edited comment on SPARK-2200 at 6/20/14 1:23 AM:


With 0.7 the error went away when reference tracking was set to true.
With 0.8.1 it works either way.

Turns out in 0.7 the recursive references were caused by this:
{code}
  private final val innerUpdate: ((Int,E) => Unit) = if ((offset == 0) && 
(stride == 1)) { (i:Int,v:E) => {data(i) = v} } else {(i:Int,v:E) => 
{data(offset+i*stride)=v}  }
{code}

The function val has a closure $outer that references itself. It was removed 
in 0.8.1.


was (Author: sinisa_lyh):
With 0.7 the error went away when reference tracking is set to true.
With 0.8.1 it works either way.

Turns out in 0.7 the recursive references was caused by this:
  private final val innerUpdate: ((Int,E) => Unit) = if ((offset == 0) && 
(stride == 1)) { (i:Int,v:E) => {data(i) = v} } else {(i:Int,v:E) => 
{data(offset+i*stride)=v}  }

The function val has an closure $outer that references itself. It was removed 
in 0.8.1.

> breeze DenseVector not serializable with KryoSerializer
> ---
>
> Key: SPARK-2200
> URL: https://issues.apache.org/jira/browse/SPARK-2200
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Neville Li
>Priority: Minor
>
> Spark 1.0.0 depends on breeze 0.7 and for some reason serializing DenseVector 
> with KryoSerializer throws the following stack trace. Looks like some 
> recursive field in the object. Upgrading to 0.8.1 solved this.
> {code}
> java.lang.StackOverflowError
>   at java.lang.reflect.Field.getDeclaringClass(Field.java:154)
>   at 
> sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:54)
>   at 
> sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38)
>   at java.lang.reflect.Field.get(Field.java:379)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
> ...
> {code}
> Code to reproduce:
> {code}
> import breeze.linalg.DenseVector
> import org.apache.spark.SparkConf
> import org.apache.spark.serializer.KryoSerializer
> object SerializerTest {
>   def main(args: Array[String]) {
> val conf = new SparkConf()
>   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   .set("spark.kryo.registrator", classOf[MyRegistrator].getName)
>   .set("spark.kryo.referenceTracking", "false")
>   .set("spark.kryoserializer.buffer.mb", "8")
> val serializer = new KryoSerializer(conf).newInstance()
> serializer.serialize(DenseVector.rand(10))
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2200) breeze DenseVector not serializable with KryoSerializer

2014-06-19 Thread Neville Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038223#comment-14038223
 ] 

Neville Li commented on SPARK-2200:
---

With 0.7 the error went away when reference tracking was set to true.
With 0.8.1 it works either way.

Turns out in 0.7 the recursive references were caused by this:
  private final val innerUpdate: ((Int,E) => Unit) = if ((offset == 0) && 
(stride == 1)) { (i:Int,v:E) => {data(i) = v} } else {(i:Int,v:E) => 
{data(offset+i*stride)=v}  }

The function val has a closure $outer that references itself. It was removed 
in 0.8.1.
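In other words, on breeze 0.7 the repro in the issue description can be worked
around by flipping the reference-tracking flag:

{code}
// Let Kryo track references so the self-referencing closure field
// doesn't recurse infinitely during serialization.
conf.set("spark.kryo.referenceTracking", "true")
{code}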

> breeze DenseVector not serializable with KryoSerializer
> ---
>
> Key: SPARK-2200
> URL: https://issues.apache.org/jira/browse/SPARK-2200
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Neville Li
>Priority: Minor
>
> Spark 1.0.0 depends on breeze 0.7 and for some reason serializing DenseVector 
> with KryoSerializer throws the following stack trace. Looks like some 
> recursive field in the object. Upgrading to 0.8.1 solved this.
> {code}
> java.lang.StackOverflowError
>   at java.lang.reflect.Field.getDeclaringClass(Field.java:154)
>   at 
> sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:54)
>   at 
> sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38)
>   at java.lang.reflect.Field.get(Field.java:379)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
> ...
> {code}
> Code to reproduce:
> {code}
> import breeze.linalg.DenseVector
> import org.apache.spark.SparkConf
> import org.apache.spark.serializer.KryoSerializer
> object SerializerTest {
>   def main(args: Array[String]) {
> val conf = new SparkConf()
>   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   .set("spark.kryo.registrator", classOf[MyRegistrator].getName)
>   .set("spark.kryo.referenceTracking", "false")
>   .set("spark.kryoserializer.buffer.mb", "8")
> val serializer = new KryoSerializer(conf).newInstance()
> serializer.serialize(DenseVector.rand(10))
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2208) local metrics tests can fail on fast machines

2014-06-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2208:
---

Labels: starter  (was: )

> local metrics tests can fail on fast machines
> -
>
> Key: SPARK-2208
> URL: https://issues.apache.org/jira/browse/SPARK-2208
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>  Labels: starter
>
> I'm temporarily disabling this check. I think the issue is that on fast 
> machines the fetch wait time can actually be zero, even across all tasks.
> We should see if we can write this in a different way to make sure there is a 
> delay.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2208) local metrics tests can fail on fast machines

2014-06-19 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-2208:
--

 Summary: local metrics tests can fail on fast machines
 Key: SPARK-2208
 URL: https://issues.apache.org/jira/browse/SPARK-2208
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Patrick Wendell


I'm temporarily disabling this check. I think the issue is that on fast 
machines the fetch wait time can actually be zero, even across all tasks.

We should see if we can write this in a different way to make sure there is a 
delay.
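One possible direction (a sketch, not the actual test; it assumes the
shuffle-read metrics fields of this era):

{code}
// Zero fetch wait is legitimate on a fast machine, so assert non-negativity
// (or inject an artificial delay in the fetch path) instead of requiring > 0:
assert(taskMetrics.flatMap(_.shuffleReadMetrics).forall(_.fetchWaitTime >= 0L))
{code}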



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2192) Examples Data Not in Binary Distribution

2014-06-19 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038200#comment-14038200
 ] 

Patrick Wendell commented on SPARK-2192:


It might be good to have all the example data in src/main/resources.

> Examples Data Not in Binary Distribution
> 
>
> Key: SPARK-2192
> URL: https://issues.apache.org/jira/browse/SPARK-2192
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Pat McDonough
>
> The data used by examples is not packaged up with the binary distribution. 
> The data subdirectory of spark should make its way into the distribution 
> somewhere so the examples can use it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2202) saveAsTextFile hangs on final 2 tasks

2014-06-19 Thread Suren Hiraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038156#comment-14038156
 ] 

Suren Hiraman commented on SPARK-2202:
--

Will do tomorrow. Interesting problem.

> saveAsTextFile hangs on final 2 tasks
> -
>
> Key: SPARK-2202
> URL: https://issues.apache.org/jira/browse/SPARK-2202
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
> Environment: CentOS 5.7
> 16 nodes, 24 cores per node, 14g RAM per executor
>Reporter: Suren Hiraman
>
> I have a flow that takes in about 10 GB of data and writes out about 10 GB of 
> data.
> The final step is saveAsTextFile() to HDFS. This seems to hang on 2 remaining 
> tasks, always on the same node.
> It seems that the 2 tasks are waiting for data from a remote task/RDD 
> partition.
> After about 2 hours or so, the stuck tasks get a closed connection exception 
> and you can see the remote side logging that as well. Log lines are below.
> My custom settings are:
> conf.set("spark.executor.memory", "14g") // TODO make this 
> configurable
> 
> // shuffle configs
> conf.set("spark.default.parallelism", "320")
> conf.set("spark.shuffle.file.buffer.kb", "200")
> conf.set("spark.reducer.maxMbInFlight", "96")
> 
> conf.set("spark.rdd.compress","true")
> 
> conf.set("spark.worker.timeout","180")
> 
> // akka settings
> conf.set("spark.akka.threads", "300")
> conf.set("spark.akka.timeout", "180")
> conf.set("spark.akka.frameSize", "100")
> conf.set("spark.akka.batchSize", "30")
> conf.set("spark.akka.askTimeout", "30")
> 
> // block manager
> conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18")
> conf.set("spark.blockManagerHeartBeatMs", "8")
> "STUCK" WORKER
> 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from 
> connection to ConnectionManagerId(172.16.25.103,57626)
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcher.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
> at sun.nio.ch.IOUtil.read(IOUtil.java:224)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
> at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496)
> REMOTE WORKER
> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing 
> ReceivingConnection to ConnectionManagerId(172.16.25.124,55610)
> 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding 
> SendingConnectionManagerId not found



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2151) spark-submit issue (int format expected for memory parameter)

2014-06-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2151:
---

Description: 
Get this exception when invoking spark-submit in standalone cluster mode:

{code}
Exception in thread "main" java.lang.NumberFormatException: For input string: 
"38g"
at 
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.parseInt(Integer.java:527)
at 
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
at 
org.apache.spark.deploy.ClientArguments.parse(ClientArguments.scala:55)
at 
org.apache.spark.deploy.ClientArguments.<init>(ClientArguments.scala:47)
at org.apache.spark.deploy.Client$.main(Client.scala:148)
at org.apache.spark.deploy.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

  was:
Get this exception when invoking spark-submit in standalone cluster mode:

Exception in thread "main" java.lang.NumberFormatException: For input string: 
"38g"
at 
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.parseInt(Integer.java:527)
at 
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
at 
org.apache.spark.deploy.ClientArguments.parse(ClientArguments.scala:55)
at 
org.apache.spark.deploy.ClientArguments.<init>(ClientArguments.scala:47)
at org.apache.spark.deploy.Client$.main(Client.scala:148)
at org.apache.spark.deploy.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



> spark-submit issue (int format expected for memory parameter)
> -
>
> Key: SPARK-2151
> URL: https://issues.apache.org/jira/browse/SPARK-2151
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Nishkam Ravi
> Fix For: 1.0.1, 1.1.0
>
>
> Get this exception when invoking spark-submit in standalone cluster mode:
> {code}
> Exception in thread "main" java.lang.NumberFormatException: For input string: 
> "38g"
>   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Integer.parseInt(Integer.java:492)
>   at java.lang.Integer.parseInt(Integer.java:527)
>   at 
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
>   at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
>   at 
> org.apache.spark.deploy.ClientArguments.parse(ClientArguments.scala:55)
>   at 
> org.apache.spark.deploy.ClientArguments.<init>(ClientArguments.scala:47)
>   at org.apache.spark.deploy.Client$.main(Client.scala:148)
>   at org.apache.spark.deploy.Client.main(Client.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2151) spark-submit issue (int format expected for memory parameter)

2014-06-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-2151.


   Resolution: Fixed
Fix Version/s: 1.1.0
   1.0.1
 Assignee: Nishkam Ravi

> spark-submit issue (int format expected for memory parameter)
> -
>
> Key: SPARK-2151
> URL: https://issues.apache.org/jira/browse/SPARK-2151
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Nishkam Ravi
>Assignee: Nishkam Ravi
> Fix For: 1.0.1, 1.1.0
>
>
> Get this exception when invoking spark-submit in standalone cluster mode:
> {code}
> Exception in thread "main" java.lang.NumberFormatException: For input string: 
> "38g"
>   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Integer.parseInt(Integer.java:492)
>   at java.lang.Integer.parseInt(Integer.java:527)
>   at 
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
>   at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
>   at 
> org.apache.spark.deploy.ClientArguments.parse(ClientArguments.scala:55)
>   at 
> org.apache.spark.deploy.ClientArguments.<init>(ClientArguments.scala:47)
>   at org.apache.spark.deploy.Client$.main(Client.scala:148)
>   at org.apache.spark.deploy.Client.main(Client.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
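A plausible direction for the fix (a sketch; it assumes reusing Spark's
existing helper for JVM-style memory strings):

{code}
import org.apache.spark.util.Utils
// Parse "38g" as a memory string instead of calling toInt on it:
val memoryMb = Utils.memoryStringToMb("38g") // 38912
{code}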



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2202) saveAsTextFile hangs on final 2 tasks

2014-06-19 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038101#comment-14038101
 ] 

Patrick Wendell commented on SPARK-2202:


Yes, please do!

> saveAsTextFile hangs on final 2 tasks
> -
>
> Key: SPARK-2202
> URL: https://issues.apache.org/jira/browse/SPARK-2202
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
> Environment: CentOS 5.7
> 16 nodes, 24 cores per node, 14g RAM per executor
>Reporter: Suren Hiraman
>
> I have a flow that takes in about 10 GB of data and writes out about 10 GB of 
> data.
> The final step is saveAsTextFile() to HDFS. This seems to hang on 2 remaining 
> tasks, always on the same node.
> It seems that the 2 tasks are waiting for data from a remote task/RDD 
> partition.
> After about 2 hours or so, the stuck tasks get a closed connection exception 
> and you can see the remote side logging that as well. Log lines are below.
> My custom settings are:
> conf.set("spark.executor.memory", "14g") // TODO make this 
> configurable
> 
> // shuffle configs
> conf.set("spark.default.parallelism", "320")
> conf.set("spark.shuffle.file.buffer.kb", "200")
> conf.set("spark.reducer.maxMbInFlight", "96")
> 
> conf.set("spark.rdd.compress","true")
> 
> conf.set("spark.worker.timeout","180")
> 
> // akka settings
> conf.set("spark.akka.threads", "300")
> conf.set("spark.akka.timeout", "180")
> conf.set("spark.akka.frameSize", "100")
> conf.set("spark.akka.batchSize", "30")
> conf.set("spark.akka.askTimeout", "30")
> 
> // block manager
> conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18")
> conf.set("spark.blockManagerHeartBeatMs", "8")
> "STUCK" WORKER
> 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from 
> connection to ConnectionManagerId(172.16.25.103,57626)
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcher.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
> at sun.nio.ch.IOUtil.read(IOUtil.java:224)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
> at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496)
> REMOTE WORKER
> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing 
> ReceivingConnection to ConnectionManagerId(172.16.25.124,55610)
> 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding 
> SendingConnectionManagerId not found



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2204) Scheduler for Mesos in fine-grained mode launches tasks on wrong executors

2014-06-19 Thread Sebastien Rainville (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastien Rainville updated SPARK-2204:
---

Summary: Scheduler for Mesos in fine-grained mode launches tasks on wrong 
executors  (was: Scheduler for Mesos in fine-grained mode launches tasks on 
random executors)

> Scheduler for Mesos in fine-grained mode launches tasks on wrong executors
> --
>
> Key: SPARK-2204
> URL: https://issues.apache.org/jira/browse/SPARK-2204
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Sebastien Rainville
>Priority: Blocker
>
> MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer]) is 
> assuming that TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer]) is returning 
> task lists in the same order as the offers it was passed, but in the current 
> implementation TaskSchedulerImpl.resourceOffers shuffles the offers to avoid 
> assigning the tasks always to the same executors. The result is that the 
> tasks are launched on random executors.
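A sketch of the positional-pairing hazard (offerA, offerB, toWorkerOffers and
launch are hypothetical placeholders):

{code}
val offers = Seq(offerA, offerB) // order handed to the Mesos backend
val tasks  = scheduler.resourceOffers(toWorkerOffers(offers)) // reordered!
// Pairing by position attributes task lists to the wrong offers:
offers.zip(tasks).foreach { case (offer, ts) => launch(offer, ts) } // buggy
// The fix: key the returned task lists by executor/slave id, not by index.
{code}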



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1545) Add Random Forest algorithm to MLlib

2014-06-19 Thread Manish Amde (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manish Amde updated SPARK-1545:
---

Target Version/s: 1.1.0

> Add Random Forest algorithm to MLlib
> 
>
> Key: SPARK-1545
> URL: https://issues.apache.org/jira/browse/SPARK-1545
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Manish Amde
>Assignee: Manish Amde
>
> This task requires adding Random Forest support to Spark MLlib. The 
> implementation needs to adapt the classic algorithm to the scalable tree 
> implementation.
> The task involves:
> - Comparing the various tradeoffs and finalizing the algorithm before 
> implementation
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1536) Add multiclass classification support to MLlib

2014-06-19 Thread Manish Amde (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manish Amde updated SPARK-1536:
---

Target Version/s: 1.1.0

> Add multiclass classification support to MLlib
> --
>
> Key: SPARK-1536
> URL: https://issues.apache.org/jira/browse/SPARK-1536
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 0.9.0
>Reporter: Manish Amde
>Assignee: Manish Amde
>
> The current decision tree implementation in MLlib only supports binary 
> classification. This task involves adding multiclass classification support 
> to the decision tree implementation.
> The tasks involves:
> - Choosing a good strategy for multiclass classification among multiple 
> options:
>   -- add multiclass support to the impurity calculation, though it won't work 
> well with categorical features since the centroid-based ordering assumptions 
> won't hold true
>   -- error-correcting output codes
>   -- one-vs-all
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1546) Add AdaBoost algorithm to Spark MLlib

2014-06-19 Thread Manish Amde (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manish Amde updated SPARK-1546:
---

Affects Version/s: (was: 1.0.0)
   1.1.0

> Add AdaBoost algorithm to Spark MLlib
> -
>
> Key: SPARK-1546
> URL: https://issues.apache.org/jira/browse/SPARK-1546
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Manish Amde
>Assignee: Manish Amde
>
> This task requires adding the AdaBoost algorithm to Spark MLlib. The 
> implementation needs to adapt the classic AdaBoost algorithm to the scalable 
> tree implementation.
> The tasks involves:
> - Comparing the various tradeoffs and finalizing the algorithm before 
> implementation
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1547) Add gradient boosting algorithm to MLlib

2014-06-19 Thread Manish Amde (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manish Amde updated SPARK-1547:
---

Target Version/s: 1.1.0

> Add gradient boosting algorithm to MLlib
> 
>
> Key: SPARK-1547
> URL: https://issues.apache.org/jira/browse/SPARK-1547
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Manish Amde
>Assignee: Manish Amde
>
> This task requires adding the gradient boosting algorithm to Spark MLlib. The 
> implementation needs to adapt the gradient boosting algorithm to the scalable 
> tree implementation.
> The tasks involves:
> - Comparing the various tradeoffs and finalizing the algorithm before 
> implementation
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2207) Add minimum information gain and minimum instances per node as training parameters for decision tree.

2014-06-19 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-2207:
-

Assignee: Manish Amde

> Add minimum information gain and minimum instances per node as training 
> parameters for decision tree.
> -
>
> Key: SPARK-2207
> URL: https://issues.apache.org/jira/browse/SPARK-2207
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Manish Amde
>Assignee: Manish Amde
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2206) Automatically infer the number of classification classes in multiclass classification

2014-06-19 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-2206:
-

Assignee: Manish Amde

> Automatically infer the number of classification classes in multiclass 
> classification
> -
>
> Key: SPARK-2206
> URL: https://issues.apache.org/jira/browse/SPARK-2206
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Manish Amde
>Assignee: Manish Amde
>
> Currently, the user needs to specify the numClassesForClassification 
> parameter explicitly during multiclass classification for decision trees. 
> This feature will automatically infer this information (and possibly class 
> histograms) from the training data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2202) saveAsTextFile hangs on final 2 tasks

2014-06-19 Thread Suren Hiraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037979#comment-14037979
 ] 

Suren Hiraman commented on SPARK-2202:
--

So it turns out that when we remove all of our custom settings (leaving only 
executor memory and default parallelism), the flow completes.

Would you like me to re-run with the above settings and provide you with JStack 
output?


> saveAsTextFile hangs on final 2 tasks
> -
>
> Key: SPARK-2202
> URL: https://issues.apache.org/jira/browse/SPARK-2202
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
> Environment: CentOS 5.7
> 16 nodes, 24 cores per node, 14g RAM per executor
>Reporter: Suren Hiraman
>
> I have a flow that takes in about 10 GB of data and writes out about 10 GB of 
> data.
> The final step is saveAsTextFile() to HDFS. This seems to hang on 2 remaining 
> tasks, always on the same node.
> It seems that the 2 tasks are waiting for data from a remote task/RDD 
> partition.
> After about 2 hours or so, the stuck tasks get a closed connection exception 
> and you can see the remote side logging that as well. Log lines are below.
> My custom settings are:
> conf.set("spark.executor.memory", "14g") // TODO make this 
> configurable
> 
> // shuffle configs
> conf.set("spark.default.parallelism", "320")
> conf.set("spark.shuffle.file.buffer.kb", "200")
> conf.set("spark.reducer.maxMbInFlight", "96")
> 
> conf.set("spark.rdd.compress","true")
> 
> conf.set("spark.worker.timeout","180")
> 
> // akka settings
> conf.set("spark.akka.threads", "300")
> conf.set("spark.akka.timeout", "180")
> conf.set("spark.akka.frameSize", "100")
> conf.set("spark.akka.batchSize", "30")
> conf.set("spark.akka.askTimeout", "30")
> 
> // block manager
> conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18")
> conf.set("spark.blockManagerHeartBeatMs", "8")
> "STUCK" WORKER
> 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from 
> connection to ConnectionManagerId(172.16.25.103,57626)
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcher.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
> at sun.nio.ch.IOUtil.read(IOUtil.java:224)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
> at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496)
> REMOTE WORKER
> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing 
> ReceivingConnection to ConnectionManagerId(172.16.25.124,55610)
> 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding 
> SendingConnectionManagerId not found



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2207) Add minimum information gain and minimum instances per node as training parameters for decision tree.

2014-06-19 Thread Manish Amde (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manish Amde updated SPARK-2207:
---

Summary: Add minimum information gain and minimum instances per node as 
training parameters for decision tree.  (was: Add minimum info gain and min 
instances per node as training parameters for decision tree)

> Add minimum information gain and minimum instances per node as training 
> parameters for decision tree.
> -
>
> Key: SPARK-2207
> URL: https://issues.apache.org/jira/browse/SPARK-2207
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Manish Amde
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2207) Add minimum info gain and min instances per node as training parameters for decision tree

2014-06-19 Thread Manish Amde (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manish Amde updated SPARK-2207:
---

Target Version/s: 1.1.0

> Add minimum info gain and min instances per node as training parameters for 
> decision tree
> -
>
> Key: SPARK-2207
> URL: https://issues.apache.org/jira/browse/SPARK-2207
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Manish Amde
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2206) Automatically infer the number of classification classes in multiclass classification

2014-06-19 Thread Manish Amde (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manish Amde updated SPARK-2206:
---

 Target Version/s: 1.1.0
Affects Version/s: 1.0.0

> Automatically infer the number of classification classes in multiclass 
> classification
> -
>
> Key: SPARK-2206
> URL: https://issues.apache.org/jira/browse/SPARK-2206
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Manish Amde
>
> Currently, the user needs to specify the numClassesForClassification 
> parameter explicitly during multiclass classification for decision trees. 
> This feature will automatically infer this information (and possibly class 
> histograms) from the training data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2207) Add minimum info gain and min instances per node as training parameters for decision tree

2014-06-19 Thread Manish Amde (JIRA)
Manish Amde created SPARK-2207:
--

 Summary: Add minimum info gain and min instances per node as 
training parameters for decision tree
 Key: SPARK-2207
 URL: https://issues.apache.org/jira/browse/SPARK-2207
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Manish Amde






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2206) Automatically infer the number of classification classes in multiclass classification

2014-06-19 Thread Manish Amde (JIRA)
Manish Amde created SPARK-2206:
--

 Summary: Automatically infer the number of classification classes 
in multiclass classification
 Key: SPARK-2206
 URL: https://issues.apache.org/jira/browse/SPARK-2206
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Manish Amde


Currently, the user needs to specify the numClassesForClassification parameter 
explicitly during multiclass classification for decision trees. This feature 
will automatically infer this information (and possibly class histograms) from 
the training data.
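
A minimal sketch of the inference described above, assuming labels are encoded 
as 0.0, 1.0, ..., k-1 (the helper name and shape are illustrative, not the 
eventual MLlib API):

{code}
// Hypothetical helper: derive the number of classes and a class histogram
// directly from the training labels. Assumes a non-empty, 0-based label set.
def inferNumClasses(labels: Seq[Double]): (Int, Map[Double, Long]) = {
  val histogram = labels.groupBy(identity).mapValues(_.size.toLong).toMap
  val numClasses = labels.max.toInt + 1  // 0-based encoding assumed
  (numClasses, histogram)
}
{code}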



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-1544) Add support for deep decision trees.

2014-06-19 Thread Manish Amde (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manish Amde closed SPARK-1544.
--


The PR has been accepted.

> Add support for deep decision trees.
> 
>
> Key: SPARK-1544
> URL: https://issues.apache.org/jira/browse/SPARK-1544
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Manish Amde
>Assignee: Manish Amde
> Fix For: 1.0.0
>
>
> The current tree implementation stores an Array[Double] of size O(#features * 
> #splits * 2^maxDepth) in memory for aggregating histograms over partitions. 
> The current implementation might not scale to very deep trees since the 
> memory requirement grows exponentially with tree depth. 
> This task enables construction of arbitrarily deep trees.
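
For scale, a quick sketch of that bound (a hypothetical helper, just to make 
the exponential growth concrete):

{code}
// Size of the aggregation array for the quoted bound: doubles per level of
// tree depth, which is why deep trees exhaust memory.
def histogramArraySize(numFeatures: Long, numSplits: Long, maxDepth: Int): Long =
  numFeatures * numSplits * (1L << maxDepth)

// e.g. 100 features, 32 splits, depth 20 => ~3.4 billion doubles (~27 GB)
{code}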



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2205) Unnecessary exchange operators in a join on multiple tables with the same join key.

2014-06-19 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037909#comment-14037909
 ] 

Yin Huai commented on SPARK-2205:
-

The cause of this bug is that in HashJoin, outputPartitioning returns the 
output partitioning of its left child. 
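
One way to model why the extra Exchange shows up, as a toy sketch (plain Scala, 
not the actual Catalyst classes): after an inner equi-join the output is 
clustered on both equivalent key sets, so reporting only the left child's 
partitioning makes a parent join on the right-side key look unsatisfied.

{code}
sealed trait Partitioning
case class HashPartitioning(keys: Set[String], n: Int) extends Partitioning
case object UnknownPartitioning extends Partitioning

// Illustrative fix direction: report the partitioning in terms of all keys
// that are equal after the join, instead of forwarding the left child's.
def joinOutputPartitioning(leftKeys: Set[String], rightKeys: Set[String],
                           leftChild: Partitioning): Partitioning =
  leftChild match {
    case HashPartitioning(ks, n) if ks == leftKeys =>
      HashPartitioning(leftKeys ++ rightKeys, n)
    case _ => UnknownPartitioning
  }
{code}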

> Unnecessary exchange operators in a join on multiple tables with the same 
> join key.
> ---
>
> Key: SPARK-2205
> URL: https://issues.apache.org/jira/browse/SPARK-2205
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> {code}
> hql("select * from src x join src y on (x.key=y.key) join src z on 
> (y.key=z.key)")
> SchemaRDD[1] at RDD at SchemaRDD.scala:100
> == Query Plan ==
> Project [key#4:0,value#5:1,key#6:2,value#7:3,key#8:4,value#9:5]
>  HashJoin [key#6], [key#8], BuildRight
>   Exchange (HashPartitioning [key#6], 200)
>HashJoin [key#4], [key#6], BuildRight
> Exchange (HashPartitioning [key#4], 200)
>  HiveTableScan [key#4,value#5], (MetastoreRelation default, src, 
> Some(x)), None
> Exchange (HashPartitioning [key#6], 200)
>  HiveTableScan [key#6,value#7], (MetastoreRelation default, src, 
> Some(y)), None
>   Exchange (HashPartitioning [key#8], 200)
>HiveTableScan [key#8,value#9], (MetastoreRelation default, src, Some(z)), 
> None
> {code}
> However, this is fine...
> {code}
> hql("select * from src x join src y on (x.key=y.key) join src z on 
> (x.key=z.key)")
> res5: org.apache.spark.sql.SchemaRDD = 
> SchemaRDD[5] at RDD at SchemaRDD.scala:100
> == Query Plan ==
> Project [key#26:0,value#27:1,key#28:2,value#29:3,key#30:4,value#31:5]
>  HashJoin [key#26], [key#30], BuildRight
>   HashJoin [key#26], [key#28], BuildRight
>Exchange (HashPartitioning [key#26], 200)
> HiveTableScan [key#26,value#27], (MetastoreRelation default, src, 
> Some(x)), None
>Exchange (HashPartitioning [key#28], 200)
> HiveTableScan [key#28,value#29], (MetastoreRelation default, src, 
> Some(y)), None
>   Exchange (HashPartitioning [key#30], 200)
>HiveTableScan [key#30,value#31], (MetastoreRelation default, src, 
> Some(z)), None
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-704) ConnectionManager sometimes cannot detect loss of sending connections

2014-06-19 Thread Henry Saputra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037904#comment-14037904
 ] 

Henry Saputra commented on SPARK-704:
-

Trying to reproduce and understand the issue. 
After a new SendingConnection is created, it creates its own channel and then 
registers with the ConnectionManager#selector to listen for state changes. 
When the SendingConnection is asked to send a message, it calls 
Connection#registerInterest to mark itself ready for write.

Detecting that a SendingConnection has disconnected happens when an attempt to 
write to the channel throws an exception, which I believe should be sufficient 
for the purpose of this issue?

Just want to clarify if I understand the problem correctly.

> ConnectionManager sometimes cannot detect loss of sending connections
> -
>
> Key: SPARK-704
> URL: https://issues.apache.org/jira/browse/SPARK-704
> Project: Spark
>  Issue Type: Bug
>Reporter: Charles Reiss
>Assignee: Henry Saputra
>
> ConnectionManager currently does not detect when SendingConnections 
> disconnect except if it is trying to send through them. As a result, a node 
> failure just after a connection is initiated but before any acknowledgement 
> messages can be sent may result in a hang.
> ConnectionManager has code intended to detect this case by detecting the 
> failure of a corresponding ReceivingConnection, but this code assumes that 
> the remote host:port of the ReceivingConnection is the same as the 
> ConnectionManagerId, which is almost never true. Additionally, there does not 
> appear to be any reason to assume a corresponding ReceivingConnection will 
> exist.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2204) Scheduler for Mesos in fine-grained mode launches tasks on random executors

2014-06-19 Thread Sebastien Rainville (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037903#comment-14037903
 ] 

Sebastien Rainville commented on SPARK-2204:


Created PR: https://github.com/apache/spark/pull/1140

> Scheduler for Mesos in fine-grained mode launches tasks on random executors
> ---
>
> Key: SPARK-2204
> URL: https://issues.apache.org/jira/browse/SPARK-2204
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Sebastien Rainville
>Priority: Blocker
>
> MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer]) is 
> assuming that TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer]) is returning 
> task lists in the same order as the offers it was passed, but in the current 
> implementation TaskSchedulerImpl.resourceOffers shuffles the offers to avoid 
> assigning the tasks always to the same executors. The result is that the 
> tasks are launched on random executors.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2191) Double execution with CREATE TABLE AS SELECT

2014-06-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-2191.


   Resolution: Fixed
Fix Version/s: 1.1.0
   1.0.1
 Assignee: Michael Armbrust

> Double execution with CREATE TABLE AS SELECT
> 
>
> Key: SPARK-2191
> URL: https://issues.apache.org/jira/browse/SPARK-2191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
> Fix For: 1.0.1, 1.1.0
>
>
> Reproduction:
> {code}
> scala> hql("CREATE TABLE foo AS select unix_timestamp() from src limit 
> 1").collect()
> res5: Array[org.apache.spark.sql.Row] = Array()
> scala> hql("SELECT * FROM foo").collect()
> res6: Array[org.apache.spark.sql.Row] = Array([1403159129], [1403159130])
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2205) Unnecessary exchange operators in a join on multiple tables with the same join key.

2014-06-19 Thread Yin Huai (JIRA)
Yin Huai created SPARK-2205:
---

 Summary: Unnecessary exchange operators in a join on multiple 
tables with the same join key.
 Key: SPARK-2205
 URL: https://issues.apache.org/jira/browse/SPARK-2205
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai


{code}
hql("select * from src x join src y on (x.key=y.key) join src z on 
(y.key=z.key)")

SchemaRDD[1] at RDD at SchemaRDD.scala:100
== Query Plan ==
Project [key#4:0,value#5:1,key#6:2,value#7:3,key#8:4,value#9:5]
 HashJoin [key#6], [key#8], BuildRight
  Exchange (HashPartitioning [key#6], 200)
   HashJoin [key#4], [key#6], BuildRight
Exchange (HashPartitioning [key#4], 200)
 HiveTableScan [key#4,value#5], (MetastoreRelation default, src, Some(x)), 
None
Exchange (HashPartitioning [key#6], 200)
 HiveTableScan [key#6,value#7], (MetastoreRelation default, src, Some(y)), 
None
  Exchange (HashPartitioning [key#8], 200)
   HiveTableScan [key#8,value#9], (MetastoreRelation default, src, Some(z)), 
None
{code}

However, this is fine...
{code}
hql("select * from src x join src y on (x.key=y.key) join src z on 
(x.key=z.key)")

res5: org.apache.spark.sql.SchemaRDD = 
SchemaRDD[5] at RDD at SchemaRDD.scala:100
== Query Plan ==
Project [key#26:0,value#27:1,key#28:2,value#29:3,key#30:4,value#31:5]
 HashJoin [key#26], [key#30], BuildRight
  HashJoin [key#26], [key#28], BuildRight
   Exchange (HashPartitioning [key#26], 200)
HiveTableScan [key#26,value#27], (MetastoreRelation default, src, Some(x)), 
None
   Exchange (HashPartitioning [key#28], 200)
HiveTableScan [key#28,value#29], (MetastoreRelation default, src, Some(y)), 
None
  Exchange (HashPartitioning [key#30], 200)
   HiveTableScan [key#30,value#31], (MetastoreRelation default, src, Some(z)), 
None
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2204) Scheduler for Mesos in fine-grained mode launches tasks on random executors

2014-06-19 Thread Sebastien Rainville (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastien Rainville updated SPARK-2204:
---

Description: MesosSchedulerBackend.resourceOffers(SchedulerDriver, 
List[Offer]) is assuming that 
TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer]) is returning task lists in 
the same order as the offers it was passed, but in the current implementation 
TaskSchedulerImpl.resourceOffers shuffles the offers to avoid assigning the 
tasks always to the same executors. The result is that the tasks are launched 
on random executors.  (was: 
MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer]) is assuming 
that TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer]) is returning task lists 
in the same order as the offers it was passed, but in the current 
implementation TaskSchedulerImpl.resourceOffers shuffles the offers to avoid 
assigning the tasks always to the same executors. The result is that the tasks 
are launched on random executors.6)
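
A self-contained sketch of the general fix idea (illustrative types only, not 
the code in the linked PR): pair tasks with offers by an explicit key such as 
the slave ID, rather than assuming the two lists stay in the same order.

{code}
case class Offer(slaveId: String, cores: Int)
case class TaskDesc(slaveId: String, name: String)

// Group tasks by the slave they were scheduled for, then launch each group
// against the matching offer; positional correspondence is never assumed.
def launchOnMatchingOffers(offers: Seq[Offer], tasks: Seq[TaskDesc]): Unit = {
  val tasksBySlave = tasks.groupBy(_.slaveId)
  for {
    offer <- offers
    ts    <- tasksBySlave.get(offer.slaveId)
  } println(s"launching ${ts.size} task(s) on slave ${offer.slaveId}")
}
{code}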

> Scheduler for Mesos in fine-grained mode launches tasks on random executors
> ---
>
> Key: SPARK-2204
> URL: https://issues.apache.org/jira/browse/SPARK-2204
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Sebastien Rainville
>Priority: Blocker
>
> MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer]) is 
> assuming that TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer]) is returning 
> task lists in the same order as the offers it was passed, but in the current 
> implementation TaskSchedulerImpl.resourceOffers shuffles the offers to avoid 
> assigning the tasks always to the same executors. The result is that the 
> tasks are launched on random executors.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2204) Scheduler for Mesos in fine-grained mode launches tasks on random executors

2014-06-19 Thread Sebastien Rainville (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastien Rainville updated SPARK-2204:
---

Fix Version/s: (was: 1.0.1)

> Scheduler for Mesos in fine-grained mode launches tasks on random executors
> ---
>
> Key: SPARK-2204
> URL: https://issues.apache.org/jira/browse/SPARK-2204
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Sebastien Rainville
>Priority: Blocker
>
> MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer]) is 
> assuming that TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer]) is returning 
> task lists in the same order as the offers it was passed, but in the current 
> implementation TaskSchedulerImpl.resourceOffers shuffles the offers to avoid 
> assigning the tasks always to the same executors. The result is that the 
> tasks are launched on random executors.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1800) Add broadcast hash join operator

2014-06-19 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037880#comment-14037880
 ] 

Yin Huai commented on SPARK-1800:
-

Maybe add an improvement in the future so that tasks on the same node can 
share those hashtables.

Also, if we have a star join, maybe we want to limit the total size of those 
hashtables, so they will not occupy too much space.

> Add broadcast hash join operator
> 
>
> Key: SPARK-1800
> URL: https://issues.apache.org/jira/browse/SPARK-1800
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
> Fix For: 1.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2204) Scheduler for Mesos in fine-grained mode launches tasks on random executors

2014-06-19 Thread Sebastien Rainville (JIRA)
Sebastien Rainville created SPARK-2204:
--

 Summary: Scheduler for Mesos in fine-grained mode launches tasks 
on random executors
 Key: SPARK-2204
 URL: https://issues.apache.org/jira/browse/SPARK-2204
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.0.0
Reporter: Sebastien Rainville
Priority: Blocker
 Fix For: 1.0.1


MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer]) is assuming 
that TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer]) is returning task lists 
in the same order as the offers it was passed, but in the current 
implementation TaskSchedulerImpl.resourceOffers shuffles the offers to avoid 
assigning the tasks always to the same executors. The result is that the tasks 
are launched on random executors.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2177) describe table result contains only one column

2014-06-19 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037809#comment-14037809
 ] 

Yin Huai commented on SPARK-2177:
-

Generally, Hive generates the results of DDL statements as plain text (unless 
we use "set hive.ddl.output.format=json"). Those plain strings are not easy to 
parse, and I don't think it is a good idea to work out how Hive behaves for 
every describe command and write our own code to generate exactly the same 
output. With the changes made in this PR, Spark SQL supports a commonly used 
subset of describe commands. This subset is defined by 
{code}
DESCRIBE [EXTENDED] [db_name.]table_name
{code}
All other cases are still treated as native commands.

> describe table result contains only one column
> --
>
> Key: SPARK-2177
> URL: https://issues.apache.org/jira/browse/SPARK-2177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Yin Huai
>
> {code}
> scala> hql("describe src").collect().foreach(println)
> [key  string  None]
> [value  string  None]
> {code}
> The result should contain 3 columns instead of one. This screws up JDBC or 
> even the downstream consumer of the Scala/Java/Python APIs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2177) describe table result contains only one column

2014-06-19 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037810#comment-14037810
 ] 

Yin Huai commented on SPARK-2177:
-

We should also document which cases we support in the release notes. But where 
is that field?

> describe table result contains only one column
> --
>
> Key: SPARK-2177
> URL: https://issues.apache.org/jira/browse/SPARK-2177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Yin Huai
>
> {code}
> scala> hql("describe src").collect().foreach(println)
> [key  string  None]
> [value  string  None]
> {code}
> The result should contain 3 columns instead of one. This screws up JDBC or 
> even the downstream consumer of the Scala/Java/Python APIs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2203) PySpark does not infer default numPartitions in same way as Spark

2014-06-19 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-2203:
-

 Summary: PySpark does not infer default numPartitions in same way 
as Spark
 Key: SPARK-2203
 URL: https://issues.apache.org/jira/browse/SPARK-2203
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0
Reporter: Aaron Davidson
Assignee: Aaron Davidson


For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(), PySpark 
will always assume that the default parallelism to use for the reduce side is 
ctx.defaultParallelism, which is a constant typically determined by the number 
of cores in cluster.

In contrast, Spark's Partitioner#defaultPartitioner will use the same number of 
reduce partitions as map partitions unless the defaultParallelism config is 
explicitly set. This tends to be a better default in order to avoid OOMs, and 
should also be the behavior of PySpark.
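
A rough model of the Scala-side default described above (simplified; not the 
actual Partitioner#defaultPartitioner source):

{code}
// Use spark.default.parallelism only when it is explicitly set; otherwise
// inherit the largest upstream partition count instead of a constant.
def defaultNumPartitions(explicitParallelism: Option[Int],
                         upstreamPartitionCounts: Seq[Int]): Int =
  explicitParallelism.getOrElse(upstreamPartitionCounts.max)
{code}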



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2126) Move MapOutputTracker behind ShuffleManager interface

2014-06-19 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037746#comment-14037746
 ] 

Nan Zhu commented on SPARK-2126:


[~pwendell] Yes, [~markhamstra] just emailed me  

Yes, I have been working on it for two evenings. It's a big change and I 
haven't made any significant progress yet, so I don't mind a core developer 
coming in to lead this, and I'm still willing to contribute anything I can.




> Move MapOutputTracker behind ShuffleManager interface
> -
>
> Key: SPARK-2126
> URL: https://issues.apache.org/jira/browse/SPARK-2126
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Reporter: Matei Zaharia
>Assignee: Nan Zhu
>
> This will require changing the interface between the DAGScheduler and 
> MapOutputTracker to be method calls on the ShuffleManager instead. However, 
> it will make it easier to do push-based shuffle and other ideas requiring 
> changes to map output tracking.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2038) Don't shadow "conf" variable in saveAsHadoop functions

2014-06-19 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037739#comment-14037739
 ] 

Nan Zhu commented on SPARK-2038:


[~pwendell] Yeah, it's a good idea. I just submitted a new PR: 
https://github.com/apache/spark/pull/1137

> Don't shadow "conf" variable in saveAsHadoop functions
> --
>
> Key: SPARK-2038
> URL: https://issues.apache.org/jira/browse/SPARK-2038
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Patrick Wendell
>Assignee: Nan Zhu
>Priority: Minor
>  Labels: api-breaking
> Fix For: 1.1.0
>
>
> This could lead to a lot of bugs. We should just change it to hadoopConf. I 
> noticed this when reviewing SPARK-1677.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Reopened] (SPARK-2038) Don't shadow "conf" variable in saveAsHadoop functions

2014-06-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reopened SPARK-2038:



> Don't shadow "conf" variable in saveAsHadoop functions
> --
>
> Key: SPARK-2038
> URL: https://issues.apache.org/jira/browse/SPARK-2038
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Patrick Wendell
>Assignee: Nan Zhu
>Priority: Minor
>  Labels: api-breaking
> Fix For: 1.1.0
>
>
> This could lead to a lot of bugs. We should just change it to hadoopConf. I 
> noticed this when reviewing SPARK-1677.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2038) Don't shadow "conf" variable in saveAsHadoop functions

2014-06-19 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037703#comment-14037703
 ] 

Patrick Wendell commented on SPARK-2038:


Hey [~CodingCat] - I realized there is actually an intermediate fix. Don't 
change the name of the method argument, but inside the method immediately do 
`val hadoopConf = conf`, then add a comment that it's there to avoid a naming 
collision.

So I think you could still submit your patch with that change. Does that make 
sense?
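
A minimal sketch of that suggestion (the method shown is illustrative, not the 
exact PairRDDFunctions signature):

{code}
import org.apache.hadoop.mapred.JobConf

def saveAsHadoopDataset(conf: JobConf): Unit = {
  // Rebind immediately to avoid a naming collision with the SparkConf
  // commonly named `conf` in the enclosing scope; the public API is unchanged.
  val hadoopConf = conf
  // ... use hadoopConf from here on ...
}
{code}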

> Don't shadow "conf" variable in saveAsHadoop functions
> --
>
> Key: SPARK-2038
> URL: https://issues.apache.org/jira/browse/SPARK-2038
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Patrick Wendell
>Assignee: Nan Zhu
>Priority: Minor
>  Labels: api-breaking
> Fix For: 1.1.0
>
>
> This could lead to a lot of bugs. We should just change it to hadoopConf. I 
> noticed this when reviewing SPARK-1677.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2202) saveAsTextFile hangs on final 2 tasks

2014-06-19 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037696#comment-14037696
 ] 

Patrick Wendell commented on SPARK-2202:


When the tasks are hanging, could you go to the individual node and run 
`jstack` on the Executor process? It's possible there is a bug in the HDFS 
client library, in Spark, or somewhere else.

> saveAsTextFile hangs on final 2 tasks
> -
>
> Key: SPARK-2202
> URL: https://issues.apache.org/jira/browse/SPARK-2202
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
> Environment: CentOS 5.7
> 16 nodes, 24 cores per node, 14g RAM per executor
>Reporter: Suren Hiraman
>Priority: Blocker
>
> I have a flow that takes in about 10 GB of data and writes out about 10 GB of 
> data.
> The final step is saveAsTextFile() to HDFS. This seems to hang on 2 remaining 
> tasks, always on the same node.
> It seems that the 2 tasks are waiting for data from a remote task/RDD 
> partition.
> After about 2 hours or so, the stuck tasks get a closed connection exception 
> and you can see the remote side logging that as well. Log lines are below.
> My custom settings are:
> conf.set("spark.executor.memory", "14g") // TODO make this 
> configurable
> 
> // shuffle configs
> conf.set("spark.default.parallelism", "320")
> conf.set("spark.shuffle.file.buffer.kb", "200")
> conf.set("spark.reducer.maxMbInFlight", "96")
> 
> conf.set("spark.rdd.compress","true")
> 
> conf.set("spark.worker.timeout","180")
> 
> // akka settings
> conf.set("spark.akka.threads", "300")
> conf.set("spark.akka.timeout", "180")
> conf.set("spark.akka.frameSize", "100")
> conf.set("spark.akka.batchSize", "30")
> conf.set("spark.akka.askTimeout", "30")
> 
> // block manager
> conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18")
> conf.set("spark.blockManagerHeartBeatMs", "8")
> "STUCK" WORKER
> 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from 
> connection to ConnectionManagerId(172.16.25.103,57626)
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcher.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
> at sun.nio.ch.IOUtil.read(IOUtil.java:224)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
> at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496)
> REMOTE WORKER
> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing 
> ReceivingConnection to ConnectionManagerId(172.16.25.124,55610)
> 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding 
> SendingConnectionManagerId not found



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2202) saveAsTextFile hangs on final 2 tasks

2014-06-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2202:
---

Priority: Major  (was: Blocker)

> saveAsTextFile hangs on final 2 tasks
> -
>
> Key: SPARK-2202
> URL: https://issues.apache.org/jira/browse/SPARK-2202
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
> Environment: CentOS 5.7
> 16 nodes, 24 cores per node, 14g RAM per executor
>Reporter: Suren Hiraman
>
> I have a flow that takes in about 10 GB of data and writes out about 10 GB of 
> data.
> The final step is saveAsTextFile() to HDFS. This seems to hang on 2 remaining 
> tasks, always on the same node.
> It seems that the 2 tasks are waiting for data from a remote task/RDD 
> partition.
> After about 2 hours or so, the stuck tasks get a closed connection exception 
> and you can see the remote side logging that as well. Log lines are below.
> My custom settings are:
> conf.set("spark.executor.memory", "14g") // TODO make this 
> configurable
> 
> // shuffle configs
> conf.set("spark.default.parallelism", "320")
> conf.set("spark.shuffle.file.buffer.kb", "200")
> conf.set("spark.reducer.maxMbInFlight", "96")
> 
> conf.set("spark.rdd.compress","true")
> 
> conf.set("spark.worker.timeout","180")
> 
> // akka settings
> conf.set("spark.akka.threads", "300")
> conf.set("spark.akka.timeout", "180")
> conf.set("spark.akka.frameSize", "100")
> conf.set("spark.akka.batchSize", "30")
> conf.set("spark.akka.askTimeout", "30")
> 
> // block manager
> conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18")
> conf.set("spark.blockManagerHeartBeatMs", "8")
> "STUCK" WORKER
> 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from 
> connection to ConnectionManagerId(172.16.25.103,57626)
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcher.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
> at sun.nio.ch.IOUtil.read(IOUtil.java:224)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
> at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496)
> REMOTE WORKER
> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing 
> ReceivingConnection to ConnectionManagerId(172.16.25.124,55610)
> 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding 
> SendingConnectionManagerId not found



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2202) saveAsTextFile hangs on final 2 tasks

2014-06-19 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037698#comment-14037698
 ] 

Patrick Wendell commented on SPARK-2202:


I changed the priority because we usually wait until we've diagnosed the exact 
issue to assign something as a blocker.

> saveAsTextFile hangs on final 2 tasks
> -
>
> Key: SPARK-2202
> URL: https://issues.apache.org/jira/browse/SPARK-2202
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
> Environment: CentOS 5.7
> 16 nodes, 24 cores per node, 14g RAM per executor
>Reporter: Suren Hiraman
>
> I have a flow that takes in about 10 GB of data and writes out about 10 GB of 
> data.
> The final step is saveAsTextFile() to HDFS. This seems to hang on 2 remaining 
> tasks, always on the same node.
> It seems that the 2 tasks are waiting for data from a remote task/RDD 
> partition.
> After about 2 hours or so, the stuck tasks get a closed connection exception 
> and you can see the remote side logging that as well. Log lines are below.
> My custom settings are:
> conf.set("spark.executor.memory", "14g") // TODO make this 
> configurable
> 
> // shuffle configs
> conf.set("spark.default.parallelism", "320")
> conf.set("spark.shuffle.file.buffer.kb", "200")
> conf.set("spark.reducer.maxMbInFlight", "96")
> 
> conf.set("spark.rdd.compress","true")
> 
> conf.set("spark.worker.timeout","180")
> 
> // akka settings
> conf.set("spark.akka.threads", "300")
> conf.set("spark.akka.timeout", "180")
> conf.set("spark.akka.frameSize", "100")
> conf.set("spark.akka.batchSize", "30")
> conf.set("spark.akka.askTimeout", "30")
> 
> // block manager
> conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18")
> conf.set("spark.blockManagerHeartBeatMs", "8")
> "STUCK" WORKER
> 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from 
> connection to ConnectionManagerId(172.16.25.103,57626)
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcher.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
> at sun.nio.ch.IOUtil.read(IOUtil.java:224)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
> at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496)
> REMOTE WORKER
> 14/06/18 19:41:18 INFO network.ConnectionManager: Removing 
> ReceivingConnection to ConnectionManagerId(172.16.25.124,55610)
> 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding 
> SendingConnectionManagerId not found



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2180) HiveQL doesn't support GROUP BY with HAVING clauses

2014-06-19 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037697#comment-14037697
 ] 

William Benton commented on SPARK-2180:
---

PR is here:  https://github.com/apache/spark/pull/1136

> HiveQL doesn't support GROUP BY with HAVING clauses
> ---
>
> Key: SPARK-2180
> URL: https://issues.apache.org/jira/browse/SPARK-2180
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: William Benton
>Priority: Minor
>
> The HiveQL implementation doesn't support HAVING clauses for aggregations.  
> This prevents some of the TPCDS benchmarks from running.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2126) Move MapOutputTracker behind ShuffleManager interface

2014-06-19 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037692#comment-14037692
 ] 

Patrick Wendell commented on SPARK-2126:


Hey All,

This proposal is a fairly hairy refactoring of Spark internals. It might not be 
the best candidate for an external contribution. [~CodingCat] if you wanted to 
take an initial attempt at this, go right ahead! Just a warning though, it might 
be that we use your code as a starting point for the design. 

The final version of this patch will probably need to be written by someone who 
has worked a lot on these internals ([~markhamstra] you'd actually be a good 
candidate yourself! but not sure you have the cycles).

> Move MapOutputTracker behind ShuffleManager interface
> -
>
> Key: SPARK-2126
> URL: https://issues.apache.org/jira/browse/SPARK-2126
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Reporter: Matei Zaharia
>Assignee: Nan Zhu
>
> This will require changing the interface between the DAGScheduler and 
> MapOutputTracker to be method calls on the ShuffleManager instead. However, 
> it will make it easier to do push-based shuffle and other ideas requiring 
> changes to map output tracking.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2199) Distributed probabilistic latent semantic analysis in MLlib

2014-06-19 Thread Valeriy Avanesov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037659#comment-14037659
 ] 

Valeriy Avanesov commented on SPARK-2199:
-

Here is the implementation we currently have. https://github.com/akopich/dplsa
Robust and non-robust PLSA are implemented, but no regularizers are currently 
supported. 

> Distributed probabilistic latent semantic analysis in MLlib
> ---
>
> Key: SPARK-2199
> URL: https://issues.apache.org/jira/browse/SPARK-2199
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Denis Turdakov
>  Labels: features
>
> Probabilistic latent semantic analysis (PLSA) is a topic model which extracts 
> topics from a text corpus. PLSA was historically a predecessor of LDA; however, 
> recent research shows that modifications of PLSA sometimes perform better 
> than LDA[1]. Furthermore, the most recent paper by the same authors shows that 
> there is a clear way to extend PLSA to LDA and beyond[2].
> We should implement a distributed version of PLSA. In addition, it should be 
> possible to easily add user-defined regularizers or combinations of them. We 
> will implement regularizers that allow us to
> * extract sparse topics
> * extract human-interpretable topics 
> * perform semi-supervised training 
> * sort out non-topic-specific terms. 
> [1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In 
> Proceedings of ECIR'13.
> [2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive 
> Regularization for Stochastic Matrix Factorization. 
> http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2126) Move MapOutputTracker behind ShuffleManager interface

2014-06-19 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-2126:


Assignee: Nan Zhu

> Move MapOutputTracker behind ShuffleManager interface
> -
>
> Key: SPARK-2126
> URL: https://issues.apache.org/jira/browse/SPARK-2126
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Reporter: Matei Zaharia
>Assignee: Nan Zhu
>
> This will require changing the interface between the DAGScheduler and 
> MapOutputTracker to be method calls on the ShuffleManager instead. However, 
> it will make it easier to do push-based shuffle and other ideas requiring 
> changes to map output tracking.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2200) breeze DenseVector not serializable with KryoSerializer

2014-06-19 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037637#comment-14037637
 ] 

Xiangrui Meng commented on SPARK-2200:
--

[~neville] Do you know the root cause and how this is fixed in breeze 0.8.1? 
You disabled reference tracking, which may be the reason.

> breeze DenseVector not serializable with KryoSerializer
> ---
>
> Key: SPARK-2200
> URL: https://issues.apache.org/jira/browse/SPARK-2200
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Neville Li
>Priority: Minor
>
> Spark 1.0.0 depends on breeze 0.7 and for some reason serializing DenseVector 
> with KryoSerializer throws the following stack trace. Looks like some 
> recursive field in the object. Upgrading to 0.8.1 solved this.
> {code}
> java.lang.StackOverflowError
>   at java.lang.reflect.Field.getDeclaringClass(Field.java:154)
>   at 
> sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:54)
>   at 
> sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38)
>   at java.lang.reflect.Field.get(Field.java:379)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
> ...
> {code}
> Code to reproduce:
> {code}
> import breeze.linalg.DenseVector
> import org.apache.spark.SparkConf
> import org.apache.spark.serializer.KryoSerializer
> object SerializerTest {
>   def main(args: Array[String]) {
> val conf = new SparkConf()
>   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   .set("spark.kryo.registrator", classOf[MyRegistrator].getName)
>   .set("spark.kryo.referenceTracking", "false")
>   .set("spark.kryoserializer.buffer.mb", "8")
> val serializer = new KryoSerializer(conf).newInstance()
> serializer.serialize(DenseVector.rand(10))
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2202) saveAsTextFile hangs on final 2 tasks

2014-06-19 Thread Suren Hiraman (JIRA)
Suren Hiraman created SPARK-2202:


 Summary: saveAsTextFile hangs on final 2 tasks
 Key: SPARK-2202
 URL: https://issues.apache.org/jira/browse/SPARK-2202
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: CentOS 5.7
16 nodes, 24 cores per node, 14g RAM per executor
Reporter: Suren Hiraman
Priority: Blocker


I have a flow that takes in about 10 GB of data and writes out about 10 GB of 
data.

The final step is saveAsTextFile() to HDFS. This seems to hang on 2 remaining 
tasks, always on the same node.

It seems that the 2 tasks are waiting for data from a remote task/RDD partition.

After about 2 hours or so, the stuck tasks get a closed connection exception 
and you can see the remote side logging that as well. Log lines are below.

My custom settings are:

conf.set("spark.executor.memory", "14g") // TODO make this 
configurable

// shuffle configs
conf.set("spark.default.parallelism", "320")
conf.set("spark.shuffle.file.buffer.kb", "200")
conf.set("spark.reducer.maxMbInFlight", "96")

conf.set("spark.rdd.compress","true")

conf.set("spark.worker.timeout","180")

// akka settings
conf.set("spark.akka.threads", "300")
conf.set("spark.akka.timeout", "180")
conf.set("spark.akka.frameSize", "100")
conf.set("spark.akka.batchSize", "30")
conf.set("spark.akka.askTimeout", "30")

// block manager
conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18")
conf.set("spark.blockManagerHeartBeatMs", "8")


"STUCK" WORKER
14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from 
connection to ConnectionManagerId(172.16.25.103,57626)

java.io.IOException: Connection reset by peer

at sun.nio.ch.FileDispatcher.read0(Native Method)

at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)

at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)

at sun.nio.ch.IOUtil.read(IOUtil.java:224)

at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)

at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496)


REMOTE WORKER

14/06/18 19:41:18 INFO network.ConnectionManager: Removing ReceivingConnection 
to ConnectionManagerId(172.16.25.124,55610)

14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding 
SendingConnectionManagerId not found



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2051) spark.yarn.dist.* configs are not supported in yarn-cluster mode

2014-06-19 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-2051.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

> spark.yarn.dist.* configs are not supported in yarn-cluster mode
> 
>
> Key: SPARK-2051
> URL: https://issues.apache.org/jira/browse/SPARK-2051
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
> Fix For: 1.1.0
>
>
>   Spark configuration
> {{conf/spark-defaults.conf}}:
> {quote}
> spark.yarn.dist.archives /toona/conf
> spark.executor.extraClassPath ./conf
> spark.driver.extraClassPath  ./conf
> {quote}
> 
> HDFS directory
> {{hadoop dfs -cat /toona/conf/toona.conf}} :
> {quote}
>  redis.num=4
> {quote}
> 
> The following command execution fails
> {code}
> YARN_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit  --num-executors 2 
> --driver-memory 2g --executor-memory 2g --master yarn-cluster --class 
> toona.DeployTest toona-assembly.jar  
> {code}
> 
> The following is  the test code
> {code}
> package toona
> import com.typesafe.config.Config
> import com.typesafe.config.ConfigFactory
> object DeployTest {
>   def main(args: Array[String]) {
> val conf = ConfigFactory.load("toona.conf")
> val redisNum = conf.getInt("redis.num") // This will throw a 
> `ConfigException`
> assert(redisNum == 4)
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2198) Partition the scala build file so that it is easier to maintain

2014-06-19 Thread Helena Edelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037467#comment-14037467
 ] 

Helena Edelson commented on SPARK-2198:
---

I am sad to hear that the Maven POMs will be primary (vs. the Scala SBT build) 
and are staying. 
It was very odd to see the SBT/Maven redundancies, however.

> Partition the scala build file so that it is easier to maintain
> ---
>
> Key: SPARK-2198
> URL: https://issues.apache.org/jira/browse/SPARK-2198
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: Helena Edelson
>Priority: Minor
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> Partition into standard Dependencies, Version, Settings, and Publish.scala, 
> keeping SparkBuild clean to describe the modules and their deps, so that 
> changes in versions, for example, need only be made in Version.scala, settings 
> changes such as scalac options in Settings.scala, etc.
> I'd be happy to do this ([~helena_e])



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2201) Improve FlumeInputDStream

2014-06-19 Thread sunshangchun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sunshangchun updated SPARK-2201:


Description: 
Currently only one Flume receiver can work with FlumeInputDStream, and I am 
willing to do some work to improve it. My ideas are described as follows: 

An IP and port denote a physical host, and a logical host consists of one or 
more physical hosts.

In our case, Spark Flume receivers bind themselves to a logical host when 
started, and a Flume agent gets the physical hosts and pushes events to them.
Two classes are introduced: LogicalHostRouter supplies a map between logical 
hosts and physical hosts, and LogicalHostRouterListener makes relation changes 
watchable.

Some work needs to be done here: 
1. LogicalHostRouter and LogicalHostRouterListener can be implemented with 
ZooKeeper: when a physical host starts, create a temporary node in ZK, and 
listeners just watch those temporary nodes.
2. When Spark FlumeReceivers start, they acquire a physical host (localhost's 
IP and an idle port) and register themselves with ZooKeeper.
3. A new Flume sink: in its appendEvents method, it gets the physical hosts 
and pushes data to them in a round-robin manner.

Is this a feasible plan? Thanks.
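
A skeleton of the two proposed classes (only the names come from the proposal; 
the signatures are guesses):

{code}
case class PhysicalHost(ip: String, port: Int)

// Notified as the ZooKeeper-backed membership of a logical host changes.
trait LogicalHostRouterListener {
  def physicalHostAdded(logicalHost: String, host: PhysicalHost): Unit
  def physicalHostRemoved(logicalHost: String, host: PhysicalHost): Unit
}

// Maps a logical host to the physical hosts currently registered under it.
trait LogicalHostRouter {
  def getPhysicalHosts(logicalHost: String): Seq[PhysicalHost]
  def registerListener(listener: LogicalHostRouterListener): Unit
}
{code}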


> Improve FlumeInputDStream
> -
>
> Key: SPARK-2201
> URL: https://issues.apache.org/jira/browse/SPARK-2201
> Project: Spark
>  Issue Type: Improvement
>Reporter: sunshangchun
>
> Currently only one Flume receiver can work with FlumeInputDStream, and I am 
> willing to do some work to improve it. My ideas are described as follows: 
> An IP and port denote a physical host, and a logical host consists of one or 
> more physical hosts.
> In our case, Spark Flume receivers bind themselves to a logical host when 
> started, and a Flume agent gets the physical hosts and pushes events to them.
> Two classes are introduced: LogicalHostRouter supplies a map between logical 
> hosts and physical hosts, and LogicalHostRouterListener makes relation changes 
> watchable.
> Some work needs to be done here: 
> 1. LogicalHostRouter and LogicalHostRouterListener can be implemented with 
> ZooKeeper: when a physical host starts, create a temporary node in ZK, and 
> listeners just watch those temporary nodes.
> 2. When Spark FlumeReceivers start, they acquire a physical host 
> (localhost's IP and an idle port) and register themselves with ZooKeeper.
> 3. A new Flume sink: in its appendEvents method, it gets the physical hosts 
> and pushes data to them in a round-robin manner.
> Is this a feasible plan? Thanks.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2201) Improve FlumeInputDStream

2014-06-19 Thread sunshangchun (JIRA)
sunshangchun created SPARK-2201:
---

 Summary: Improve FlumeInputDStream
 Key: SPARK-2201
 URL: https://issues.apache.org/jira/browse/SPARK-2201
 Project: Spark
  Issue Type: Improvement
Reporter: sunshangchun






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2198) Partition the scala build file so that it is easier to maintain

2014-06-19 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037431#comment-14037431
 ] 

Mark Hamstra commented on SPARK-2198:
-

While this is an admirable goal, I'm afraid that hand editing the SBT build 
files won't be a very durable solution.  That is because it is currently our 
goal to consolidate the Maven and SBT builds by deriving the SBT build 
configuration from the Maven POMs: 
https://issues.apache.org/jira/browse/SPARK-1776.  As such, any partitioning of 
the SBT build file will really need to be incorporated into the code that is 
generating that file from the Maven input. 

> Partition the scala build file so that it is easier to maintain
> ---
>
> Key: SPARK-2198
> URL: https://issues.apache.org/jira/browse/SPARK-2198
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: Helena Edelson
>Priority: Minor
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> Partition the build into the standard Dependencies, Version, Settings, and 
> Publish.scala files, keeping SparkBuild clean to describe the modules and 
> their deps, so that version changes, for example, need only be made in 
> Version.scala, scalac settings changes in Settings.scala, etc.
> I'd be happy to do this ([~helena_e])



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2200) breeze DenseVector not serializable with KryoSerializer

2014-06-19 Thread Neville Li (JIRA)
Neville Li created SPARK-2200:
-

 Summary: breeze DenseVector not serializable with KryoSerializer
 Key: SPARK-2200
 URL: https://issues.apache.org/jira/browse/SPARK-2200
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Neville Li
Priority: Minor


Spark 1.0.0 depends on breeze 0.7, and serializing a DenseVector with 
KryoSerializer throws the stack trace below, apparently because of a recursive 
field in the object. Upgrading breeze to 0.8.1 solved this.
{code}
java.lang.StackOverflowError
at java.lang.reflect.Field.getDeclaringClass(Field.java:154)
at 
sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:54)
at 
sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38)
at java.lang.reflect.Field.get(Field.java:379)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:552)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
...
{code}

Code to reproduce:
{code}
import breeze.linalg.DenseVector
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

object SerializerTest {
  def main(args: Array[String]) {
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[MyRegistrator].getName)
  .set("spark.kryo.referenceTracking", "false")
  .set("spark.kryoserializer.buffer.mb", "8")

val serializer = new KryoSerializer(conf).newInstance()
serializer.serialize(DenseVector.rand(10))
  }
}
{code}
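
MyRegistrator is referenced above but not shown in the report; a minimal 
definition that makes the snippet self-contained could look like this (which 
class to register is an assumption):

{code}
import breeze.linalg.DenseVector
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical stand-in for the MyRegistrator referenced in the repro above.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[DenseVector[Double]])
  }
}
{code}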



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2200) breeze DenseVector not serializable with KryoSerializer

2014-06-19 Thread Neville Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037424#comment-14037424
 ] 

Neville Li commented on SPARK-2200:
---

https://github.com/apache/spark/pull/940 addresses this.

> breeze DenseVector not serializable with KryoSerializer
> ---
>
> Key: SPARK-2200
> URL: https://issues.apache.org/jira/browse/SPARK-2200
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Neville Li
>Priority: Minor
>
> Spark 1.0.0 depends on breeze 0.7, and serializing a DenseVector with 
> KryoSerializer throws the stack trace below, apparently because of a 
> recursive field in the object. Upgrading breeze to 0.8.1 solved this.
> {code}
> java.lang.StackOverflowError
>   at java.lang.reflect.Field.getDeclaringClass(Field.java:154)
>   at 
> sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:54)
>   at 
> sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38)
>   at java.lang.reflect.Field.get(Field.java:379)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
> ...
> {code}
> Code to reproduce:
> {code}
> import breeze.linalg.DenseVector
> import org.apache.spark.SparkConf
> import org.apache.spark.serializer.KryoSerializer
> object SerializerTest {
>   def main(args: Array[String]) {
> val conf = new SparkConf()
>   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   .set("spark.kryo.registrator", classOf[MyRegistrator].getName)
>   .set("spark.kryo.referenceTracking", "false")
>   .set("spark.kryoserializer.buffer.mb", "8")
> val serializer = new KryoSerializer(conf).newInstance()
> serializer.serialize(DenseVector.rand(10))
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2181) The keys for sorting the columns of Executor page in SparkUI are incorrect

2014-06-19 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037420#comment-14037420
 ] 

Guoqiang Li commented on SPARK-2181:


PR: https://github.com/apache/spark/pull/1135

> The keys for sorting the columns of Executor page in SparkUI are incorrect
> --
>
> Key: SPARK-2181
> URL: https://issues.apache.org/jira/browse/SPARK-2181
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shuo Xiang
>Assignee: Guoqiang Li
>Priority: Minor
>
> Under the Executor page of SparkUI, each column is sorted alphabetically when 
> clicked. However, numeric columns should be sorted by value, not as strings.
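
A quick illustration of the reported behavior in plain Scala (not the actual 
SparkUI code; the values are made up): string ordering puts "10" before "9", so 
a duration column sorted alphabetically comes out in the wrong order.

{code}
scala> Seq("10 ms", "9 ms", "100 ms").sorted                          // string sort
res0: Seq[String] = List(10 ms, 100 ms, 9 ms)                         // wrong order

scala> Seq("10 ms", "9 ms", "100 ms").sortBy(_.split(" ")(0).toLong)  // sort by value
res1: Seq[String] = List(9 ms, 10 ms, 100 ms)                         // expected
{code}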



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2199) Distributed probabilistic latent semantic analysis in MLlib

2014-06-19 Thread Denis Turdakov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Turdakov updated SPARK-2199:
--

Description: 
Probabilistic latent semantic analysis (PLSA) is a topic model which extracts 
topics from a text corpus. PLSA was historically a predecessor of LDA, but 
recent research shows that modifications of PLSA sometimes perform better than 
LDA [1]. Furthermore, the most recent paper by the same authors shows that 
there is a clear way to extend PLSA to LDA and beyond [2].

We should implement distributed versions of PLSA. In addition, it should be 
possible to easily add user-defined regularizers or combinations of them. We 
will implement regularizers that allow us to
* extract sparse topics
* extract human-interpretable topics 
* perform semi-supervised training 
* sort out non-topic-specific terms. 

[1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In 
Proceedings of ECIR'13.
[2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive 
Regularization for Stochastic Matrix Factorization. 
http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf 


  was:
Probabilistic latent semantic analysis (PLSA) is a topic model which extracts 
topics from a text corpus. PLSA was historically a predecessor of LDA, but 
recent research shows that modifications of PLSA sometimes perform better than 
LDA [1]. Furthermore, the most recent paper by the same authors shows that 
there is a clear way to extend PLSA to LDA and beyond [2].
(empty line)
We should implement distributed versions of PLSA. In addition, it should be 
possible to easily add user-defined regularizers or combinations of them. We 
will implement regularizers that allow us to
* extract sparse topics
* extract human-interpretable topics 
* perform semi-supervised training 
* sort out non-topic-specific terms. 

[1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In 
Proceedings of ECIR'13.
[2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive 
Regularization for Stochastic Matrix Factorization. 
http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf 
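
For reference, the additive-regularization framework from [2] that this 
proposal builds on can be sketched as follows (a summary of the cited 
formulation, not a design detail of this ticket):

{code}
% PLSA models p(w|d) as a mixture over topics t; [2] adds weighted regularizers
% R_i on the topic matrices Phi = (phi_wt) and Theta = (theta_td):
\sum_{d,w} n_{dw} \ln \sum_{t} \phi_{wt}\,\theta_{td}
    \;+\; \sum_{i} \tau_i\, R_i(\Phi,\Theta) \;\to\; \max_{\Phi,\Theta}
{code}

Plain PLSA is the special case with all tau_i = 0; the sparsity, 
interpretability, and semi-supervised behaviors listed above correspond to 
particular choices of R_i.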



> Distributed probabilistic latent semantic analysis in MLlib
> ---
>
> Key: SPARK-2199
> URL: https://issues.apache.org/jira/browse/SPARK-2199
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Denis Turdakov
>  Labels: features
>
> Probabilistic latent semantic analysis (PLSA) is a topic model which extracts 
> topics from a text corpus. PLSA was historically a predecessor of LDA, but 
> recent research shows that modifications of PLSA sometimes perform better 
> than LDA [1]. Furthermore, the most recent paper by the same authors shows 
> that there is a clear way to extend PLSA to LDA and beyond [2].
> We should implement distributed versions of PLSA. In addition, it should be 
> possible to easily add user-defined regularizers or combinations of them. We 
> will implement regularizers that allow us to
> * extract sparse topics
> * extract human-interpretable topics 
> * perform semi-supervised training 
> * sort out non-topic-specific terms. 
> [1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In 
> Proceedings of ECIR'13.
> [2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive 
> Regularization for Stochastic Matrix Factorization. 
> http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2199) Distributed probabilistic latent semantic analysis in MLlib

2014-06-19 Thread Denis Turdakov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Turdakov updated SPARK-2199:
--

Description: 
Probabilistic latent semantic analysis (PLSA) is a topic model which extracts 
topics from a text corpus. PLSA was historically a predecessor of LDA, but 
recent research shows that modifications of PLSA sometimes perform better than 
LDA [1]. Furthermore, the most recent paper by the same authors shows that 
there is a clear way to extend PLSA to LDA and beyond [2].
(empty line)
We should implement distributed versions of PLSA. In addition, it should be 
possible to easily add user-defined regularizers or combinations of them. We 
will implement regularizers that allow us to
* extract sparse topics
* extract human-interpretable topics 
* perform semi-supervised training 
* sort out non-topic-specific terms. 

[1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In 
Proceedings of ECIR'13.
[2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive 
Regularization for Stochastic Matrix Factorization. 
http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf 


  was:
Probabilistic latent semantic analysis (PLSA) is a topic model which extracts 
topics from a text corpus. PLSA was historically a predecessor of LDA, but 
recent research shows that modifications of PLSA sometimes perform better than 
LDA [1]. Furthermore, the most recent paper by the same authors shows that 
there is a clear way to extend PLSA to LDA and beyond [2].
We should implement distributed versions of PLSA. In addition, it should be 
possible to easily add user-defined regularizers or combinations of them. We 
will implement regularizers that allow us to
•   extract sparse topics
•   extract human-interpretable topics 
•   perform semi-supervised training 
•   sort out non-topic-specific terms. 

[1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In 
Proceedings of ECIR'13.
[2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive 
Regularization for Stochastic Matrix Factorization. 
http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf 



> Distributed probabilistic latent semantic analysis in MLlib
> ---
>
> Key: SPARK-2199
> URL: https://issues.apache.org/jira/browse/SPARK-2199
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Denis Turdakov
>  Labels: features
>
> Probabilistic latent semantic analysis (PLSA) is a topic model which extracts 
> topics from a text corpus. PLSA was historically a predecessor of LDA, but 
> recent research shows that modifications of PLSA sometimes perform better 
> than LDA [1]. Furthermore, the most recent paper by the same authors shows 
> that there is a clear way to extend PLSA to LDA and beyond [2].
> (empty line)
> We should implement distributed versions of PLSA. In addition, it should be 
> possible to easily add user-defined regularizers or combinations of them. We 
> will implement regularizers that allow us to
> * extract sparse topics
> * extract human-interpretable topics 
> * perform semi-supervised training 
> * sort out non-topic-specific terms. 
> [1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In 
> Proceedings of ECIR'13.
> [2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive 
> Regularization for Stochastic Matrix Factorization. 
> http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2199) Distributed probabilistic latent semantic analysis in MLlib

2014-06-19 Thread Denis Turdakov (JIRA)
Denis Turdakov created SPARK-2199:
-

 Summary: Distributed probabilistic latent semantic analysis in 
MLlib
 Key: SPARK-2199
 URL: https://issues.apache.org/jira/browse/SPARK-2199
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Denis Turdakov


Probabilistic latent semantic analysis (PLSA) is a topic model which extracts 
topics from a text corpus. PLSA was historically a predecessor of LDA, but 
recent research shows that modifications of PLSA sometimes perform better than 
LDA [1]. Furthermore, the most recent paper by the same authors shows that 
there is a clear way to extend PLSA to LDA and beyond [2].
We should implement distributed versions of PLSA. In addition, it should be 
possible to easily add user-defined regularizers or combinations of them. We 
will implement regularizers that allow us to
•   extract sparse topics
•   extract human-interpretable topics 
•   perform semi-supervised training 
•   sort out non-topic-specific terms. 

[1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In 
Proceedings of ECIR'13.
[2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive 
Regularization for Stochastic Matrix Factorization. 
http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf 




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2194) EC2 Scripts don't work in europe

2014-06-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2194.
-

Resolution: Cannot Reproduce

After waiting a few hours the error message went away.

> EC2 Scripts don't work in europe
> 
>
> Key: SPARK-2194
> URL: https://issues.apache.org/jira/browse/SPARK-2194
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.0.0
>Reporter: Michael Armbrust
>
> When I tried to create a cluster I got:
> {code}
> Setting up security groups...
> ERROR:boto:400 Bad Request
> ERROR:boto:
> InvalidParameterValue: Invalid value 'null' for protocol. VPC security group 
> rules must specify protocols explicitly. 
> (request id: a9a2a9b3-bcc4-443b-889b-61b0e459f54d)
> {code}
> Switching back to US-EAST fixed the issue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2198) Partition the scala build file so that it is easier to maintain

2014-06-19 Thread Helena Edelson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Helena Edelson updated SPARK-2198:
--

Remaining Estimate: 3h  (was: 2h)
 Original Estimate: 3h  (was: 2h)

> Partition the scala build file so that it is easier to maintain
> ---
>
> Key: SPARK-2198
> URL: https://issues.apache.org/jira/browse/SPARK-2198
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: Helena Edelson
>Priority: Minor
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> Partition the build into the standard Dependencies, Version, Settings, and 
> Publish.scala files, keeping SparkBuild clean to describe the modules and 
> their deps, so that version changes, for example, need only be made in 
> Version.scala, scalac settings changes in Settings.scala, etc.
> I'd be happy to do this ([~helena_e]



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2198) Partition the scala build file so that it is easier to maintain

2014-06-19 Thread Helena Edelson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Helena Edelson updated SPARK-2198:
--

Remaining Estimate: 2h  (was: 1m)
 Original Estimate: 2h  (was: 1m)

> Partition the scala build file so that it is easier to maintain
> ---
>
> Key: SPARK-2198
> URL: https://issues.apache.org/jira/browse/SPARK-2198
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: Helena Edelson
>Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Partition the build into the standard Dependencies, Version, Settings, and 
> Publish.scala files, keeping SparkBuild clean to describe the modules and 
> their deps, so that version changes, for example, need only be made in 
> Version.scala, scalac settings changes in Settings.scala, etc.
> I'd be happy to do this ([~helena_e]



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2198) Partition the scala build file so that it is easier to maintain

2014-06-19 Thread Helena Edelson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Helena Edelson updated SPARK-2198:
--

Description: 
Partition the build into the standard Dependencies, Version, Settings, and 
Publish.scala files, keeping SparkBuild clean to describe the modules and their 
deps, so that version changes, for example, need only be made in Version.scala, 
scalac settings changes in Settings.scala, etc.

I'd be happy to do this ([~helena_e])

  was:
Partition the build into the standard Dependencies, Version, Settings, and 
Publish.scala files, keeping SparkBuild clean to describe the modules and their 
deps, so that version changes, for example, need only be made in Version.scala, 
scalac settings changes in Settings.scala, etc.

I'd be happy to do this ([~helena_e]


> Partition the scala build file so that it is easier to maintain
> ---
>
> Key: SPARK-2198
> URL: https://issues.apache.org/jira/browse/SPARK-2198
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: Helena Edelson
>Priority: Minor
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> Partition the build into the standard Dependencies, Version, Settings, and 
> Publish.scala files, keeping SparkBuild clean to describe the modules and 
> their deps, so that version changes, for example, need only be made in 
> Version.scala, scalac settings changes in Settings.scala, etc.
> I'd be happy to do this ([~helena_e])



--
This message was sent by Atlassian JIRA
(v6.2#6252)

