[jira] [Closed] (SPARK-2351) Add Artificial Neural Network (ANN) to Spark

2014-07-02 Thread Bert Greevenbosch (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bert Greevenbosch closed SPARK-2351.


Resolution: Duplicate

Duplicate of SPARK-2352.

> Add Artificial Neural Network (ANN) to Spark
> 
>
> Key: SPARK-2351
> URL: https://issues.apache.org/jira/browse/SPARK-2351
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
> Environment: MLLIB code
>Reporter: Bert Greevenbosch
>
> It would be good if the Machine Learning Library contained Artificial Neural 
> Networks (ANNs).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2352) Add Artificial Neural Network (ANN) to Spark

2014-07-02 Thread Bert Greevenbosch (JIRA)
Bert Greevenbosch created SPARK-2352:


 Summary: Add Artificial Neural Network (ANN) to Spark
 Key: SPARK-2352
 URL: https://issues.apache.org/jira/browse/SPARK-2352
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
 Environment: MLLIB code
Reporter: Bert Greevenbosch


It would be good if the Machine Learning Library contained Artificial Neural 
Networks (ANNs).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2351) Add Artificial Neural Network (ANN) to Spark

2014-07-02 Thread Bert Greevenbosch (JIRA)
Bert Greevenbosch created SPARK-2351:


 Summary: Add Artificial Neural Network (ANN) to Spark
 Key: SPARK-2351
 URL: https://issues.apache.org/jira/browse/SPARK-2351
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
 Environment: MLLIB code
Reporter: Bert Greevenbosch


It would be good if the Machine Learning Library contained Artificial Neural 
Networks (ANNs).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2342) Evaluation helper's output type doesn't conform to input type

2014-07-02 Thread Yijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050982#comment-14050982
 ] 

Yijie Shen edited comment on SPARK-2342 at 7/3/14 1:52 AM:
---

[~marmbrus], I fixed the typo in PR: https://github.com/apache/spark/pull/1283.
Please check it, thanks.


was (Author: yijieshen):
[~marmbrus] I fixed the typo in PR: https://github.com/apache/spark/pull/1283.
Please check it, thanks.

> Evaluation helper's output type doesn't conform to input type
> -
>
> Key: SPARK-2342
> URL: https://issues.apache.org/jira/browse/SPARK-2342
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Yijie Shen
>Priority: Minor
>  Labels: easyfix
>
> In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala
> {code}protected final def n2 ( i: Row, e1: Expression, e2: Expression, f: 
> ((Numeric[Any], Any, Any) => Any)): Any  {code}
> is intended to do computations for numeric Add/Minus/Multiply.
> Just as the comment suggests: {quote}Those expressions are supposed to be in 
> the same data type, and also the return type.{quote}
> But in the code, function f is cast to the function signature:
> {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int{code}
> I think this is a typo; the correct signature should be:
> {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2342) Evaluation helper's output type doesn't conform to input type

2014-07-02 Thread Yijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050982#comment-14050982
 ] 

Yijie Shen edited comment on SPARK-2342 at 7/3/14 1:51 AM:
---

[~marmbrus] Fixed the typo in PR: https://github.com/apache/spark/pull/1283.
Please check it, thanks.


was (Author: yijieshen):
Fixed the typo in PR: https://github.com/apache/spark/pull/1283.
Please check it, thanks.

> Evaluation helper's output type doesn't conform to input type
> -
>
> Key: SPARK-2342
> URL: https://issues.apache.org/jira/browse/SPARK-2342
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Yijie Shen
>Priority: Minor
>  Labels: easyfix
>
> In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala
> {code}protected final def n2 ( i: Row, e1: Expression, e2: Expression, f: 
> ((Numeric[Any], Any, Any) => Any)): Any  {code}
> is intended to do computations for numeric Add/Minus/Multiply.
> Just as the comment suggests: {quote}Those expressions are supposed to be in 
> the same data type, and also the return type.{quote}
> But in the code, function f is cast to the function signature:
> {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int{code}
> I think this is a typo; the correct signature should be:
> {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2342) Evaluation helper's output type doesn't conform to input type

2014-07-02 Thread Yijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050982#comment-14050982
 ] 

Yijie Shen edited comment on SPARK-2342 at 7/3/14 1:52 AM:
---

[~marmbrus] I fixed the typo in PR: https://github.com/apache/spark/pull/1283.
Please check it, thanks.


was (Author: yijieshen):
[~marmbrus] Fixed the typo in PR: https://github.com/apache/spark/pull/1283.
Please check it, thanks.

> Evaluation helper's output type doesn't conform to input type
> -
>
> Key: SPARK-2342
> URL: https://issues.apache.org/jira/browse/SPARK-2342
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Yijie Shen
>Priority: Minor
>  Labels: easyfix
>
> In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala
> {code}protected final def n2 ( i: Row, e1: Expression, e2: Expression, f: 
> ((Numeric[Any], Any, Any) => Any)): Any  {code}
> is intended to do computations for numeric Add/Minus/Multiply.
> Just as the comment suggests: {quote}Those expressions are supposed to be in 
> the same data type, and also the return type.{quote}
> But in the code, function f is cast to the function signature:
> {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int{code}
> I think this is a typo; the correct signature should be:
> {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2342) Evaluation helper's output type doesn't conform to input type

2014-07-02 Thread Yijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050982#comment-14050982
 ] 

Yijie Shen commented on SPARK-2342:
---

Fixed the typo in PR: https://github.com/apache/spark/pull/1283.
Please check it, thanks.

> Evaluation helper's output type doesn't conform to input type
> -
>
> Key: SPARK-2342
> URL: https://issues.apache.org/jira/browse/SPARK-2342
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Yijie Shen
>Priority: Minor
>  Labels: easyfix
>
> In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala
> {code}protected final def n2 ( i: Row, e1: Expression, e2: Expression, f: 
> ((Numeric[Any], Any, Any) => Any)): Any  {code}
> is intended to do computations for numeric Add/Minus/Multiply.
> Just as the comment suggests: {quote}Those expressions are supposed to be in 
> the same data type, and also the return type.{quote}
> But in the code, function f is cast to the function signature:
> {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int{code}
> I think this is a typo; the correct signature should be:
> {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2277) Make TaskScheduler track whether there's a host on a rack

2014-07-02 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050952#comment-14050952
 ] 

Rui Li commented on SPARK-2277:
---

PR created at:
https://github.com/apache/spark/pull/1212

> Make TaskScheduler track whether there's a host on a rack
> ---
>
> Key: SPARK-2277
> URL: https://issues.apache.org/jira/browse/SPARK-2277
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Rui Li
>
> When TaskSetManager adds a pending task, it checks whether the task's 
> preferred location is available. For a RACK_LOCAL task, we consider the 
> preferred rack available if such a rack is defined for the preferred host. 
> This is incorrect, as there may be no alive hosts on that rack at all. 
> Therefore, TaskScheduler should track the hosts on each rack and provide an 
> API for TaskSetManager to check if there's a host alive on a specific rack.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2277) Make TaskScheduler track whether there's a host on a rack

2014-07-02 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050951#comment-14050951
 ] 

Rui Li commented on SPARK-2277:
---

Suppose task1 prefers node1 but node1 is not available at the moment. However, 
we know node1 is on rack1, which makes task1 prefer rack1 for RACK_LOCAL 
locality. The problem is, we don't know if there's an alive host on rack1, so we 
cannot check the availability of this preference.
Please let me know if I misunderstand anything :)

> Make TaskScheduler track whether there's a host on a rack
> ---
>
> Key: SPARK-2277
> URL: https://issues.apache.org/jira/browse/SPARK-2277
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Rui Li
>
> When TaskSetManager adds a pending task, it checks whether the task's 
> preferred location is available. For a RACK_LOCAL task, we consider the 
> preferred rack available if such a rack is defined for the preferred host. 
> This is incorrect, as there may be no alive hosts on that rack at all. 
> Therefore, TaskScheduler should track the hosts on each rack and provide an 
> API for TaskSetManager to check if there's a host alive on a specific rack.
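
A minimal sketch of what such tracking could look like inside the scheduler, with hypothetical field and method names (an illustration only, not the patch in the PR above):

{code}
import scala.collection.mutable.{HashMap, HashSet}

object RackTrackingSketch {
  // rack -> hosts currently alive on that rack
  private val hostsByRack = new HashMap[String, HashSet[String]]

  def executorAdded(host: String, rack: Option[String]): Unit =
    rack.foreach(r => hostsByRack.getOrElseUpdate(r, new HashSet[String]) += host)

  def executorLost(host: String, rack: Option[String]): Unit =
    rack.foreach { r =>
      hostsByRack.get(r).foreach { hosts =>
        hosts -= host
        if (hosts.isEmpty) hostsByRack -= r
      }
    }

  // What TaskSetManager would call before treating a RACK_LOCAL preference as live.
  def hasHostAliveOnRack(rack: String): Boolean = hostsByRack.contains(rack)
}
{code}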



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2350) Master throws NPE

2014-07-02 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050894#comment-14050894
 ] 

Andrew Or commented on SPARK-2350:
--

This is the root cause of SPARK-2154

> Master throws NPE
> -
>
> Key: SPARK-2350
> URL: https://issues.apache.org/jira/browse/SPARK-2350
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
> Fix For: 1.1.0
>
>
> ... if we launch a driver and there are more waiting drivers to be launched. 
> This is because we remove from the list while iterating over it.
> Here is the culprit from Master.scala (L487 as of the creation of this JIRA, 
> commit bc7041a42dfa84312492ea8cae6fdeaeac4f6d1c).
> {code}
> for (driver <- waitingDrivers) {
>   if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= 
> driver.desc.cores) {
> launchDriver(worker, driver)
> waitingDrivers -= driver
>   }
> }
> {code}
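
A minimal sketch of one way to sidestep the concurrent modification in the loop quoted above (an illustration only, not necessarily the fix that was applied): iterate over a snapshot of the buffer so that removing a launched driver no longer perturbs the traversal.

{code}
// Same names as in the snippet above; waitingDrivers is assumed to be a
// mutable Buffer. toList takes a snapshot, so the removal below is safe.
for (driver <- waitingDrivers.toList) {
  if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
    launchDriver(worker, driver)
    waitingDrivers -= driver
  }
}
{code}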



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2350) Master throws NPE

2014-07-02 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050891#comment-14050891
 ] 

Andrew Or commented on SPARK-2350:
--

In general, if the Master dies because of an exception, it automatically restarts 
and the exception message is hidden in the logs. It took a while for 
[~ilikerps] and me to find the exception as we were scrolling through the logs. 

> Master throws NPE
> -
>
> Key: SPARK-2350
> URL: https://issues.apache.org/jira/browse/SPARK-2350
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
> Fix For: 1.1.0
>
>
> ... if we launch a driver and there are more waiting drivers to be launched. 
> This is because we remove from the list while iterating over it.
> Here is the culprit from Master.scala (L487 as of the creation of this JIRA, 
> commit bc7041a42dfa84312492ea8cae6fdeaeac4f6d1c).
> {code}
> for (driver <- waitingDrivers) {
>   if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= 
> driver.desc.cores) {
> launchDriver(worker, driver)
> waitingDrivers -= driver
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2350) Master throws NPE

2014-07-02 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050891#comment-14050891
 ] 

Andrew Or edited comment on SPARK-2350 at 7/3/14 12:07 AM:
---

In general, if the Master dies because of an exception, it automatically restarts 
and the exception message is hidden in the logs. In the meantime, the symptoms 
are not indicative of a Master having thrown an exception and restarted. It 
took a while for [~ilikerps] and me to find the exception as we were scrolling 
through the logs.


was (Author: andrewor):
In general, if the Master dies because of an exception, it automatically restarts 
and the exception message is hidden in the logs. It took a while for 
[~ilikerps] and me to find the exception as we were scrolling through the logs. 

> Master throws NPE
> -
>
> Key: SPARK-2350
> URL: https://issues.apache.org/jira/browse/SPARK-2350
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
> Fix For: 1.1.0
>
>
> ... if we launch a driver and there are more waiting drivers to be launched. 
> This is because we remove from the list while iterating over it.
> Here is the culprit from Master.scala (L487 as of the creation of this JIRA, 
> commit bc7041a42dfa84312492ea8cae6fdeaeac4f6d1c).
> {code}
> for (driver <- waitingDrivers) {
>   if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= 
> driver.desc.cores) {
> launchDriver(worker, driver)
> waitingDrivers -= driver
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2350) Master throws NPE

2014-07-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2350:
-

Description: 
... if we launch a driver and there are more waiting drivers to be launched. 
This is because we remove from the list while iterating over it.

Here is the culprit from Master.scala (L487 as of the creation of this JIRA, 
commit bc7041a42dfa84312492ea8cae6fdeaeac4f6d1c).

{code}
for (driver <- waitingDrivers) {
  if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= 
driver.desc.cores) {
launchDriver(worker, driver)
waitingDrivers -= driver
  }
}
{code}

  was:
... if we launch a driver and there are more waiting drivers to be launched. 
This is because we remove from the list while iterating over it.

{code}
for (driver <- waitingDrivers) {
  if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= 
driver.desc.cores) {
launchDriver(worker, driver)
waitingDrivers -= driver
  }
}
{code}


> Master throws NPE
> -
>
> Key: SPARK-2350
> URL: https://issues.apache.org/jira/browse/SPARK-2350
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
> Fix For: 1.1.0
>
>
> ... if we launch a driver and there are more waiting drivers to be launched. 
> This is because we remove from the list while iterating over it.
> Here is the culprit from Master.scala (L487 as of the creation of this JIRA, 
> commit bc7041a42dfa84312492ea8cae6fdeaeac4f6d1c).
> {code}
> for (driver <- waitingDrivers) {
>   if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= 
> driver.desc.cores) {
> launchDriver(worker, driver)
> waitingDrivers -= driver
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2350) Master throws NPE

2014-07-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2350:
-

Description: 
... if we launch a driver and there are more waiting drivers to be launched. 
This is because we remove from the list while iterating over it.

{code}
  for (driver <- waitingDrivers) {
if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= 
driver.desc.cores) {
  launchDriver(worker, driver)
  waitingDrivers -= driver
}
  }
{code}

  was:... if we launch a driver and there are more waiting drivers to be 
launched. This is because we remove from the list while iterating over it.


> Master throws NPE
> -
>
> Key: SPARK-2350
> URL: https://issues.apache.org/jira/browse/SPARK-2350
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
> Fix For: 1.1.0
>
>
> ... if we launch a driver and there are more waiting drivers to be launched. 
> This is because we remove from the list while iterating over it.
> {code}
>   for (driver <- waitingDrivers) {
> if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= 
> driver.desc.cores) {
>   launchDriver(worker, driver)
>   waitingDrivers -= driver
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2350) Master throws NPE

2014-07-02 Thread Andrew Or (JIRA)
Andrew Or created SPARK-2350:


 Summary: Master throws NPE
 Key: SPARK-2350
 URL: https://issues.apache.org/jira/browse/SPARK-2350
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or
 Fix For: 1.1.0


... if we launch a driver and there are more waiting drivers to be launched. 
This is because we remove from the list while iterating over it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2350) Master throws NPE

2014-07-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2350:
-

Description: 
... if we launch a driver and there are more waiting drivers to be launched. 
This is because we remove from the list while iterating over it.

{code}
for (driver <- waitingDrivers) {
  if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= 
driver.desc.cores) {
launchDriver(worker, driver)
waitingDrivers -= driver
  }
}
{code}

  was:
... if we launch a driver and there are more waiting drivers to be launched. 
This is because we remove from the list while iterating over it.

{code}
  for (driver <- waitingDrivers) {
if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= 
driver.desc.cores) {
  launchDriver(worker, driver)
  waitingDrivers -= driver
}
  }
{code}


> Master throws NPE
> -
>
> Key: SPARK-2350
> URL: https://issues.apache.org/jira/browse/SPARK-2350
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
> Fix For: 1.1.0
>
>
> ... if we launch a driver and there are more waiting drivers to be launched. 
> This is because we remove from the list while iterating over it.
> {code}
> for (driver <- waitingDrivers) {
>   if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= 
> driver.desc.cores) {
> launchDriver(worker, driver)
> waitingDrivers -= driver
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2277) Make TaskScheduler track whether there's a host on a rack

2014-07-02 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050886#comment-14050886
 ] 

Mridul Muralidharan commented on SPARK-2277:


I am not sure I follow this requirement.
For preferred locations, we populate their corresponding racks (if available) 
as preferred racks.

For the available executor hosts, we look up the rack they belong to - and then 
see if that rack is preferred or not.

This, of course, assumes a host is only on a single rack.


What exactly is the behavior you are expecting from the scheduler?

> Make TaskScheduler track whether there's a host on a rack
> ---
>
> Key: SPARK-2277
> URL: https://issues.apache.org/jira/browse/SPARK-2277
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Rui Li
>
> When TaskSetManager adds a pending task, it checks whether the task's 
> preferred location is available. For a RACK_LOCAL task, we consider the 
> preferred rack available if such a rack is defined for the preferred host. 
> This is incorrect, as there may be no alive hosts on that rack at all. 
> Therefore, TaskScheduler should track the hosts on each rack and provide an 
> API for TaskSetManager to check if there's a host alive on a specific rack.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2349) Fix NPE in ExternalAppendOnlyMap

2014-07-02 Thread Andrew Or (JIRA)
Andrew Or created SPARK-2349:


 Summary: Fix NPE in ExternalAppendOnlyMap
 Key: SPARK-2349
 URL: https://issues.apache.org/jira/browse/SPARK-2349
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or


It throws an NPE on null keys.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1614) Move Mesos protobufs out of TaskState

2014-07-02 Thread Martin Zapletal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050804#comment-14050804
 ] 

Martin Zapletal commented on SPARK-1614:


I am considering moving the protobufs to a new object - something like object 
org.apache.spark.MesosTaskState. Is that an acceptable solution with regard to 
the requirements (to avoid the conflicts)? If not, can you please suggest which 
place would be best for it?

> Move Mesos protobufs out of TaskState
> -
>
> Key: SPARK-1614
> URL: https://issues.apache.org/jira/browse/SPARK-1614
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 0.9.1
>Reporter: Shivaram Venkataraman
>Priority: Minor
>  Labels: Starter
>
> To isolate usage of Mesos protobufs it would be good to move them out of 
> TaskState into either a new class (MesosUtils ?) or 
> CoarseGrainedMesos{Executor, Backend}.
> This would allow applications to build Spark to run without including 
> protobuf from Mesos in their shaded jars.  This is one way to avoid protobuf 
> conflicts between Mesos and Hadoop 
> (https://issues.apache.org/jira/browse/MESOS-1203)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2346) Error parsing table names that start with numbers

2014-07-02 Thread Alexander Albul (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Albul updated SPARK-2346:
---

Summary: Error parsing table names that start with numbers  (was: Error 
parsing table names that start from numbers)

> Error parsing table names that start with numbers
> --
>
> Key: SPARK-2346
> URL: https://issues.apache.org/jira/browse/SPARK-2346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Alexander Albul
>  Labels: Parser, SQL
>
> Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names 
> when they start with numbers.
> Steps to reproduce:
> {code:title=Test.scala|borderStyle=solid}
> case class Data(value: String)
> object Test {
>   def main(args: Array[String]) {
> val sc = new SparkContext("local", "sql")
> val sqlSc = new SQLContext(sc)
> import sqlSc._
> sc.parallelize(List(Data("one"), 
> Data("two"))).registerAsTable("123_table")
> sql("SELECT * FROM '123_table'").collect().foreach(println)
>   }
> }
> {code}
> And here is an exception:
> {quote}
> Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' 
> expected but "123_table" found
> SELECT * FROM '123_table'
>   ^
>   at scala.sys.package$.error(package.scala:27)
>   at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150)
>   at io.ubix.spark.Test$.main(Test.scala:24)
>   at io.ubix.spark.Test.main(Test.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
> {quote}
> When I change 123_table to table_123, the problem disappears.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2348) In Windows, having an environment variable named 'classpath' gives an error

2014-07-02 Thread Chirag Todarka (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050757#comment-14050757
 ] 

Chirag Todarka commented on SPARK-2348:
---

[~pwendell]
[~cheffpj]

Hi Patrick/Pat,

I am new to the project and want to contribute to it. 
I hope this will be a great starting point for me, so please assign it to me if 
possible.

Regards,
Chirag Todarka

> In Windows, having an environment variable named 'classpath' gives an error
> ---
>
> Key: SPARK-2348
> URL: https://issues.apache.org/jira/browse/SPARK-2348
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
> Environment: Windows 7 Enterprise
>Reporter: Chirag Todarka
>
> Operating System: Windows 7 Enterprise
> If there is an environment variable named 'classpath', then starting 
> 'spark-shell' gives the error below:
> \spark\bin>spark-shell
> Failed to initialize compiler: object scala.runtime in compiler mirror not 
> found
> .
> ** Note that as of 2.8 scala does not assume use of the java classpath.
> ** For the old behavior pass -usejavacp to scala, or if using a Settings
> ** object programatically, settings.usejavacp.value = true.
> 14/07/02 14:22:06 WARN SparkILoop$SparkILoopInterpreter: Warning: compiler 
> acces
> sed before init set up.  Assuming no postInit code.
> Failed to initialize compiler: object scala.runtime in compiler mirror not 
> found
> .
> ** Note that as of 2.8 scala does not assume use of the java classpath.
> ** For the old behavior pass -usejavacp to scala, or if using a Settings
> ** object programatically, settings.usejavacp.value = true.
> Exception in thread "main" java.lang.AssertionError: assertion failed: null
> at scala.Predef$.assert(Predef.scala:179)
> at 
> org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.sca
> la:202)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(Spar
> kILoop.scala:929)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.
> scala:884)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.
> scala:884)
> at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClass
> Loader.scala:135)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Issue Comment Deleted] (SPARK-1305) Support persisting RDD's directly to Tachyon

2014-07-02 Thread Henry Saputra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henry Saputra updated SPARK-1305:
-

Comment: was deleted

(was: Never mind, found it; it was when Spark was in the incubator)

> Support persisting RDD's directly to Tachyon
> 
>
> Key: SPARK-1305
> URL: https://issues.apache.org/jira/browse/SPARK-1305
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager
>Reporter: Patrick Wendell
>Assignee: Haoyuan Li
>Priority: Blocker
> Fix For: 1.0.0
>
>
> This is already an ongoing pull request - in a nutshell we want to support 
> Tachyon as a storage level in Spark.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Issue Comment Deleted] (SPARK-1305) Support persisting RDD's directly to Tachyon

2014-07-02 Thread Henry Saputra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henry Saputra updated SPARK-1305:
-

Comment: was deleted

(was: Sorry to comment on an old JIRA, but does anyone have a PR for this ticket?)

> Support persisting RDD's directly to Tachyon
> 
>
> Key: SPARK-1305
> URL: https://issues.apache.org/jira/browse/SPARK-1305
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager
>Reporter: Patrick Wendell
>Assignee: Haoyuan Li
>Priority: Blocker
> Fix For: 1.0.0
>
>
> This is already an ongoing pull request - in a nutshell we want to support 
> Tachyon as a storage level in Spark.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2348) In Windows, having an environment variable named 'classpath' gives an error

2014-07-02 Thread Chirag Todarka (JIRA)
Chirag Todarka created SPARK-2348:
-

 Summary: In Windows, having an environment variable named 
'classpath' gives an error
 Key: SPARK-2348
 URL: https://issues.apache.org/jira/browse/SPARK-2348
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: Windows 7 Enterprise
Reporter: Chirag Todarka


Operating System: Windows 7 Enterprise
If there is an environment variable named 'classpath', then starting 
'spark-shell' gives the error below:

\spark\bin>spark-shell

Failed to initialize compiler: object scala.runtime in compiler mirror not found
.
** Note that as of 2.8 scala does not assume use of the java classpath.
** For the old behavior pass -usejavacp to scala, or if using a Settings
** object programatically, settings.usejavacp.value = true.
14/07/02 14:22:06 WARN SparkILoop$SparkILoopInterpreter: Warning: compiler acces
sed before init set up.  Assuming no postInit code.

Failed to initialize compiler: object scala.runtime in compiler mirror not found
.
** Note that as of 2.8 scala does not assume use of the java classpath.
** For the old behavior pass -usejavacp to scala, or if using a Settings
** object programatically, settings.usejavacp.value = true.
Exception in thread "main" java.lang.AssertionError: assertion failed: null
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.sca
la:202)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(Spar
kILoop.scala:929)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.
scala:884)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.
scala:884)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClass
Loader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2339) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery

2014-07-02 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050721#comment-14050721
 ] 

Yin Huai commented on SPARK-2339:
-

Also, names of those registered tables are case sensitive, but names of Hive 
tables are case insensitive. This will cause confusion when a user is using 
HiveContext. I think it may be good to treat all identifiers as case insensitive 
when a user is using HiveContext and make HiveContext.sql an alias of 
HiveContext.hql (basically, do not expose catalyst's SQLParser in HiveContext).

> SQL parser in sql-core is case sensitive, but a table alias is converted to 
> lower case when we create Subquery
> --
>
> Key: SPARK-2339
> URL: https://issues.apache.org/jira/browse/SPARK-2339
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Yin Huai
> Fix For: 1.1.0
>
>
> Reported by 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html
> After we get the table from the catalog, because the table has an alias, we 
> will temporarily insert a Subquery. Then, we convert the table alias to lower 
> case regardless of whether the parser is case sensitive or not.
> To see the issue ...
> {code}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.createSchemaRDD
> case class Person(name: String, age: Int)
> val people = 
> sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p 
> => Person(p(0), p(1).trim.toInt))
> people.registerAsTable("people")
> sqlContext.sql("select PEOPLE.name from people PEOPLE")
> {code}
> The plan is ...
> {code}
> == Query Plan ==
> Project ['PEOPLE.name]
>  ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at 
> basicOperators.scala:176
> {code}
> You can find that "PEOPLE.name" is not resolved.
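
A hedged user-side workaround sketch until this is fixed (assuming the behaviour described above): keep the alias, and any references qualified by it, in lower case so they still match after the alias is lower-cased.

{code}
// The alias "p" survives the lower-casing unchanged, so "p.name" resolves.
sqlContext.sql("select p.name from people p").collect()
{code}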



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2347) Graph object can not be set to StorageLevel.MEMORY_ONLY_SER

2014-07-02 Thread Baoxu Shi (JIRA)
Baoxu Shi created SPARK-2347:


 Summary: Graph object can not be set to 
StorageLevel.MEMORY_ONLY_SER
 Key: SPARK-2347
 URL: https://issues.apache.org/jira/browse/SPARK-2347
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.0
 Environment: Spark standalone with 5 workers and 1 driver
Reporter: Baoxu Shi


I'm creating a Graph object using 

Graph(vertices, edges, null, StorageLevel.MEMORY_ONLY, StorageLevel.MEMORY_ONLY)

But that throws a not-serializable exception on both the workers and the driver. 

14/07/02 16:30:26 ERROR BlockManagerWorker: Exception handling buffer message
java.io.NotSerializableException: org.apache.spark.graphx.impl.VertexPartition
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
at 
org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:106)
at 
org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:30)
at 
org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:988)
at 
org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:997)
at org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102)
at 
org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:392)
at 
org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:358)
at 
org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
at 
org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at 
org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at 
org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
at 
org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
at 
org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:662)
at 
org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:504)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

Even when the driver does not throw this exception, it sometimes throws 

java.io.FileNotFoundException: 
/tmp/spark-local-20140702151845-9620/2a/shuffle_2_25_3 (No such file or 
directory)

I know that VertexPartition is not supposed to be serializable, so is there any 
workaround for this?
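
A hedged configuration sketch, not a confirmed fix for this ticket: serialized storage levels and remote block transfer go through the configured serializer, GraphX's internal partition classes are not java.io.Serializable, and the usual GraphX advice for 1.0 is Kryo with GraphX's registrator, so that is the first thing worth trying.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Enable Kryo and GraphX's registrator before creating the context.
val conf = new SparkConf()
  .setAppName("graph-memory-only-ser")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "org.apache.spark.graphx.GraphKryoRegistrator")
val sc = new SparkContext(conf)
{code}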



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2346) Error parsing table names that start from numbers

2014-07-02 Thread Alexander Albul (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Albul updated SPARK-2346:
---

Description: 
Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names 
when they start with numbers.

Steps to reproduce:

{code:title=Test.scala|borderStyle=solid}
case class Data(value: String)

object Test {
  def main(args: Array[String]) {
val sc = new SparkContext("local", "sql")
val sqlSc = new SQLContext(sc)
import sqlSc._

sc.parallelize(List(Data("one"), Data("two"))).registerAsTable("123_table")
sql("SELECT * FROM '123_table'").collect().foreach(println)
  }
}
{code}

And here is an exception:

{quote}
Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' 
expected but "123_table" found

SELECT * FROM '123_table'
  ^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150)
at io.ubix.spark.Test$.main(Test.scala:24)
at io.ubix.spark.Test.main(Test.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
{quote}

When I change 123_table to table_123, the problem disappears.

  was:
Looks like org.apache.spark.sql.catalyst.SqlParser cannot parse table names 
when they start with numbers.

Steps to reproduce:

{code:title=Test.scala|borderStyle=solid}
case class Data(value: String)

object Test {
  def main(args: Array[String]) {
val sc = new SparkContext("local", "sql")
val sqlSc = new SQLContext(sc)
import sqlSc._

sc.parallelize(List(Data("one"), Data("two"))).registerAsTable("123_table")
sql("SELECT * FROM '123_table'").collect().foreach(println)
  }
}
{code}

And here is an exception:

{quote}
Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' 
expected but "123_table" found

SELECT * FROM '123_table'
  ^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150)
at io.ubix.spark.Test$.main(Test.scala:24)
at io.ubix.spark.Test.main(Test.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
{quote}

When I change 123_table to table_123, the problem disappears.


> Error parsing table names that start from numbers
> --
>
> Key: SPARK-2346
> URL: https://issues.apache.org/jira/browse/SPARK-2346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Alexander Albul
>  Labels: Parser, SQL
>
> Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names 
> when they start with numbers.
> Steps to reproduce:
> {code:title=Test.scala|borderStyle=solid}
> case class Data(value: String)
> object Test {
>   def main(args: Array[String]) {
> val sc = new SparkContext("local", "sql")
> val sqlSc = new SQLContext(sc)
> import sqlSc._
> sc.parallelize(List(Data("one"), 
> Data("two"))).registerAsTable("123_table")
> sql("SELECT * FROM '123_table'").collect().foreach(println)
>   }
> }
> {code}
> And here is an exception:
> {quote}
> Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' 
> expected but "123_table" found
> SELECT * FROM '123_table'
>   ^
>   at scala.sys.package$.error(package.scala:27)
>   at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150)
>   at io.ubix.spark.Test$.main(Test.scala:24)
>   at io.ubix.spark.Test.main(Test.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   a

[jira] [Commented] (SPARK-2345) ForEachDStream should have an option of running the foreachfunc on Spark

2014-07-02 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050670#comment-14050670
 ] 

Hari Shreedharan commented on SPARK-2345:
-

Looks like we'd have to do this in a new DStream, since ForEachDStream 
takes an (RDD[T], Time) => Unit, but to call runJob we'd have to pass in an 
(Iterator[T], Time) => Unit. I am not sure how much value this adds, but it does 
seem that if you are not using one of the built-in save/collect methods, you'd 
have to specifically run this function via context.runJob(...).

Do you think this makes sense, [~tdas], [~pwendell]?

> ForEachDStream should have an option of running the foreachfunc on Spark
> 
>
> Key: SPARK-2345
> URL: https://issues.apache.org/jira/browse/SPARK-2345
> Project: Spark
>  Issue Type: Bug
>Reporter: Hari Shreedharan
>
> Today the generated Job simply calls the foreachfunc, but does not run it on 
> Spark itself using the sparkContext.runJob method.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2346) Error parsing table names that start from numbers

2014-07-02 Thread Alexander Albul (JIRA)
Alexander Albul created SPARK-2346:
--

 Summary: Error parsing table names that start from numbers
 Key: SPARK-2346
 URL: https://issues.apache.org/jira/browse/SPARK-2346
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Alexander Albul


Looks like org.apache.spark.sql.catalyst.SqlParser cannot parse table names 
when they start with numbers.

Steps to reproduce:

{code:title=Test.scala|borderStyle=solid}
case class Data(value: String)

object Test {
  def main(args: Array[String]) {
val sc = new SparkContext("local", "sql")
val sqlSc = new SQLContext(sc)
import sqlSc._

sc.parallelize(List(Data("one"), Data("two"))).registerAsTable("123_table")
sql("SELECT * FROM '123_table'").collect().foreach(println)
  }
}
{code}

And here is an exception:

{quote}
Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' 
expected but "123_table" found

SELECT * FROM '123_table'
  ^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150)
at io.ubix.spark.Test$.main(Test.scala:24)
at io.ubix.spark.Test.main(Test.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
{quote}

When I change 123_table to table_123, the problem disappears.
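
Until the parser handles such names, a hedged workaround sketch (the single quotes are also dropped, since the parser presumably treats them as the start of a string literal rather than an identifier):

{code}
// Register the table under a name that does not begin with a digit and
// reference it unquoted; Data, sc and sql are as in the reproduction above.
sc.parallelize(List(Data("one"), Data("two"))).registerAsTable("table_123")
sql("SELECT * FROM table_123").collect().foreach(println)
{code}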



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2345) ForEachDStream should have an option of running the foreachfunc on Spark

2014-07-02 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050659#comment-14050659
 ] 

Hari Shreedharan commented on SPARK-2345:
-

Currently, a job like saveAsTextFile or saveAsHadoopFile on the DStream 
will cause the rdd.save calls to be executed via sparkContext.runJob, which in 
turn calls the foreachfunc that is passed to the ForEachDStream. So the case 
where this DStream is saved off works fine. 

But if you simply register the DStream and have the foreachfunc do some 
processing and custom writes, that may cause the work to run locally.
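
A hedged sketch of what forcing the work onto the cluster looks like from the user side today (an illustration, not the proposed change to ForEachDStream): call runJob explicitly from inside foreachRDD so the custom per-partition writes run on the executors.

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// `stream` is any DStream[String]; the closure handed to runJob executes
// once per partition on the executors rather than on the driver.
def writeOnExecutors(stream: DStream[String]): Unit =
  stream.foreachRDD { (rdd: RDD[String], time: Time) =>
    rdd.sparkContext.runJob(rdd, (iter: Iterator[String]) =>
      iter.foreach(record => println(s"$time: $record")))
  }
{code}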

> ForEachDStream should have an option of running the foreachfunc on Spark
> 
>
> Key: SPARK-2345
> URL: https://issues.apache.org/jira/browse/SPARK-2345
> Project: Spark
>  Issue Type: Bug
>Reporter: Hari Shreedharan
>
> Today the generated Job simply calls the foreachfunc, but does not run it on 
> Spark itself using the sparkContext.runJob method.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2345) ForEachDStream should have an option of running the foreachfunc on Spark

2014-07-02 Thread Hari Shreedharan (JIRA)
Hari Shreedharan created SPARK-2345:
---

 Summary: ForEachDStream should have an option of running the 
foreachfunc on Spark
 Key: SPARK-2345
 URL: https://issues.apache.org/jira/browse/SPARK-2345
 Project: Spark
  Issue Type: Bug
Reporter: Hari Shreedharan


Today the generated Job simply calls the foreachfunc, but does not run it on 
Spark itself using the sparkContext.runJob method.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib

2014-07-02 Thread Alex (JIRA)
Alex created SPARK-2344:
---

 Summary: Add Fuzzy C-Means algorithm to MLlib
 Key: SPARK-2344
 URL: https://issues.apache.org/jira/browse/SPARK-2344
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Alex


I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib.

FCM is very similar to K-Means, which is already implemented; they differ 
only in the degree of membership each point has with each cluster 
(in FCM the membership is a value in [0..1], whereas in K-Means it is 0/1).

As part of the implementation I would like to:
- create a base class for K-Means and FCM
- implement the membership computation for each algorithm differently (in its 
own class)
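
For reference, a minimal sketch of the standard FCM membership update (plain Scala, not an existing MLlib API; m > 1 is the fuzzifier, and hard assignment in the limit recovers K-Means):

{code}
object FuzzyCMeansSketch {
  private def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (p, q) => (p - q) * (p - q) }.sum)

  /** u(j) = 1 / sum_k (d(x, c_j) / d(x, c_k))^(2 / (m - 1)); the values sum to 1. */
  def memberships(x: Array[Double], centers: Array[Array[Double]], m: Double): Array[Double] = {
    val d = centers.map(c => math.max(dist(x, c), 1e-12)) // guard against zero distance
    d.map(dj => 1.0 / d.map(dk => math.pow(dj / dk, 2.0 / (m - 1.0))).sum)
  }
}
{code}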



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1054) Get Cassandra support in Spark Core/Spark Cassandra Module

2014-07-02 Thread Rohit Rai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050544#comment-14050544
 ] 

Rohit Rai commented on SPARK-1054:
--

With the https://github.com/datastax/cassandra-driver-spark project from DataStax, we 
should work on getting a unified standard API in Spark, taking the good things 
from both worlds.

> Get Cassandra support in Spark Core/Spark Cassandra Module
> --
>
> Key: SPARK-1054
> URL: https://issues.apache.org/jira/browse/SPARK-1054
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Rohit Rai
>  Labels: calliope, cassandra
>
> Calliope is a library providing an interface to consume data from Cassandra 
> into Spark and store RDDs from Spark to Cassandra. 
> Built as a wrapper over Cassandra's Hadoop I/O, it provides a simplified and 
> very generic API to consume and produce data from and to Cassandra. It 
> allows you to consume data from legacy as well as CQL3 Cassandra storage. It 
> can also harness C* to speed up your processing by fetching only the relevant 
> data from C*, using CQL3 and C*'s secondary indexes. Though it currently 
> uses only the Hadoop I/O formats for Cassandra, in the near future we see the 
> same API harnessing other means of consuming Cassandra data, such as using 
> the StorageProxy or even reading from SSTables directly.
> Beyond the basic data fetch functionality, the Calliope API harnesses Scala 
> and its implicit parameters and conversions to let you work at a higher 
> abstraction, dealing with tuples/objects instead of Cassandra's Rows/Columns 
> in your MapReduce jobs.
> Over the past few months we have seen the combination of Spark+Cassandra 
> gaining a lot of traction, and we feel Calliope provides the path of least 
> friction for developers to start working with this combination.
> We have been using this in production for over a year now, and the Calliope 
> early-access repository has 30+ users. I am filing this issue to start a 
> discussion around whether we would want Calliope to be a part of Spark and, 
> if yes, what would be involved in doing so.
> You can read more about Calliope here:
> http://tuplejump.github.io/calliope



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1054) Get Cassandra support in Spark Core/Spark Cassandra Module

2014-07-02 Thread Rohit Rai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohit Rai updated SPARK-1054:
-

Summary: Get Cassandra support in Spark Core/Spark Cassandra Module  (was: 
Contribute Calliope Core to Spark as spark-cassandra)

> Get Cassandra support in Spark Core/Spark Cassandra Module
> --
>
> Key: SPARK-1054
> URL: https://issues.apache.org/jira/browse/SPARK-1054
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Rohit Rai
>  Labels: calliope, cassandra
>
> Calliope is a library providing an interface to consume data from Cassandra 
> into Spark and store RDDs from Spark to Cassandra. 
> Built as a wrapper over Cassandra's Hadoop I/O, it provides a simplified and 
> very generic API to consume and produce data from and to Cassandra. It 
> allows you to consume data from legacy as well as CQL3 Cassandra storage. It 
> can also harness C* to speed up your processing by fetching only the relevant 
> data from C*, using CQL3 and C*'s secondary indexes. Though it currently 
> uses only the Hadoop I/O formats for Cassandra, in the near future we see the 
> same API harnessing other means of consuming Cassandra data, such as using 
> the StorageProxy or even reading from SSTables directly.
> Beyond the basic data fetch functionality, the Calliope API harnesses Scala 
> and its implicit parameters and conversions to let you work at a higher 
> abstraction, dealing with tuples/objects instead of Cassandra's Rows/Columns 
> in your MapReduce jobs.
> Over the past few months we have seen the combination of Spark+Cassandra 
> gaining a lot of traction, and we feel Calliope provides the path of least 
> friction for developers to start working with this combination.
> We have been using this in production for over a year now, and the Calliope 
> early-access repository has 30+ users. I am filing this issue to start a 
> discussion around whether we would want Calliope to be a part of Spark and, 
> if yes, what would be involved in doing so.
> You can read more about Calliope here:
> http://tuplejump.github.io/calliope



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2277) Make TaskScheduler track whether there's a host on a rack

2014-07-02 Thread Chen He (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050381#comment-14050381
 ] 

Chen He commented on SPARK-2277:


This is interesting. I will take a look.

> Make TaskScheduler track whether there's a host on a rack
> ---
>
> Key: SPARK-2277
> URL: https://issues.apache.org/jira/browse/SPARK-2277
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Rui Li
>
> When TaskSetManager adds a pending task, it checks whether the task's 
> preferred location is available. For a RACK_LOCAL task, we consider the 
> preferred rack available if such a rack is defined for the preferred host. 
> This is incorrect, as there may be no alive hosts on that rack at all. 
> Therefore, TaskScheduler should track the hosts on each rack and provide an 
> API for TaskSetManager to check if there's a host alive on a specific rack.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2342) Evaluation helper's output type doesn't conform to input type

2014-07-02 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050347#comment-14050347
 ] 

Michael Armbrust commented on SPARK-2342:
-

This does look like a typo (though maybe one that doesn't matter due to 
erasure?).  That said, if you make a PR I'll certainly merge it.  Thanks!
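
As a standalone illustration of the intended contract (a hypothetical helper, not the catalyst code): the combining function should return the same numeric type it consumes, which is exactly what the n.JvmType return type expresses.

{code}
// Hedged sketch: the combiner's return type is the element type itself, not Int.
def combine[A](n: Numeric[A], x: A, y: A)(f: (Numeric[A], A, A) => A): A = f(n, x, y)

// Works uniformly for any Numeric instance:
val sumLong   = combine(implicitly[Numeric[Long]], 3L, 4L)((num, a, b) => num.plus(a, b))     // 7L
val sumDouble = combine(implicitly[Numeric[Double]], 1.5, 2.5)((num, a, b) => num.plus(a, b)) // 4.0
{code}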

> Evaluation helper's output type doesn't conform to input type
> -
>
> Key: SPARK-2342
> URL: https://issues.apache.org/jira/browse/SPARK-2342
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Yijie Shen
>Priority: Minor
>  Labels: easyfix
>
> In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala
> {code}protected final def n2 ( i: Row, e1: Expression, e2: Expression, f: 
> ((Numeric[Any], Any, Any) => Any)): Any  {code}
> is intended to do computations for numeric Add/Minus/Multiply.
> Just as the comment suggests: {quote}Those expressions are supposed to be in 
> the same data type, and also the return type.{quote}
> But in the code, function f is cast to the function signature:
> {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int{code}
> I think this is a typo; the correct signature should be:
> {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType{code}
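
For readers outside catalyst, a standalone Scala sketch of the pattern being 
discussed, with the result type matching the operand type as the report 
proposes (an illustration under that assumption, not the actual catalyst code):

{code}
// Generic binary arithmetic helper whose result type matches its operand type.
def binaryNumericOp[T: Numeric](x: T, y: T)(f: (Numeric[T], T, T) => T): T =
  f(implicitly[Numeric[T]], x, y)

// Usage: binaryNumericOp(3, 4)((n, a, b) => n.plus(a, b)) evaluates to 7.
{code}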



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2287) Make ScalaReflection be able to handle Generic case classes.

2014-07-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2287.
-

   Resolution: Fixed
Fix Version/s: 1.1.0
   1.0.1
 Assignee: Takuya Ueshin

> Make ScalaReflection be able to handle Generic case classes.
> 
>
> Key: SPARK-2287
> URL: https://issues.apache.org/jira/browse/SPARK-2287
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 1.0.1, 1.1.0
>
>
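
The ticket body is empty, so purely as an illustration of the title, here is 
the kind of generic case class the change lets ScalaReflection map to a schema 
(the class name is made up for this example):

{code}
// Example only: a case class with a type parameter.
case class KeyValue[T](key: String, value: T)

// After the fix, a concrete instantiation such as KeyValue[Int] can be used
// with createSchemaRDD / registerAsTable like an ordinary case class, e.g.:
//   val rdd = sc.parallelize(Seq(KeyValue("a", 1), KeyValue("b", 2)))
//   rdd.registerAsTable("kv")
{code}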




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2328) Add execution of `SHOW TABLES` before `TestHive.reset()`.

2014-07-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2328.
-

   Resolution: Fixed
Fix Version/s: 1.1.0
   1.0.1
 Assignee: Takuya Ueshin

> Add execution of `SHOW TABLES` before `TestHive.reset()`.
> -
>
> Key: SPARK-2328
> URL: https://issues.apache.org/jira/browse/SPARK-2328
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 1.0.1, 1.1.0
>
>
> Unfortunately, when {{PruningSuite}} is executed first among the Hive tests, 
> {{TestHive.reset()}} breaks the test environment.
> To prevent this, we must run a query before calling reset the first time.
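
A minimal sketch of that workaround, assuming the TestHive.hql entry point 
available in Spark 1.0:

{code}
import org.apache.spark.sql.hive.test.TestHive

// Run any query once so the Hive test environment is fully initialized
// before the first reset() call.
TestHive.hql("SHOW TABLES").collect()
TestHive.reset()
{code}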



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2186) Spark SQL DSL support for simple aggregations such as SUM and AVG

2014-07-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2186.
-

   Resolution: Fixed
Fix Version/s: 1.1.0
   1.0.1

> Spark SQL DSL support for simple aggregations such as SUM and AVG
> -
>
> Key: SPARK-2186
> URL: https://issues.apache.org/jira/browse/SPARK-2186
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Zongheng Yang
>Priority: Minor
> Fix For: 1.0.1, 1.1.0
>
>
> Inspired by this thread 
> (http://apache-spark-user-list.1001560.n3.nabble.com/Patterns-for-making-multiple-aggregations-in-one-pass-td7874.html):
>  Spark SQL doesn't seem to have DSL support for simple aggregations such as 
> AVG and SUM. It'd be nice if the user could avoid writing a SQL query and 
> instead write something like:
> {code}
> data.select('country, 'age.avg, 'hits.sum).groupBy('country).collect()
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-1850) Bad exception if multiple jars exist when running PySpark

2014-07-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-1850.


Resolution: Fixed

> Bad exception if multiple jars exist when running PySpark
> -
>
> Key: SPARK-1850
> URL: https://issues.apache.org/jira/browse/SPARK-1850
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.0
>Reporter: Andrew Or
> Fix For: 1.0.1
>
>
> {code}
> Found multiple Spark assembly jars in 
> /Users/andrew/Documents/dev/andrew-spark/assembly/target/scala-2.10:
> Traceback (most recent call last):
>   File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/shell.py", 
> line 43, in <module>
> sc = SparkContext(os.environ.get("MASTER", "local[*]"), "PySparkShell", 
> pyFiles=add_files)
>   File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", 
> line 94, in __init__
> SparkContext._ensure_initialized(self, gateway=gateway)
>   File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", 
> line 180, in _ensure_initialized
> SparkContext._gateway = gateway or launch_gateway()
>   File 
> "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/java_gateway.py", 
> line 49, in launch_gateway
> gateway_port = int(proc.stdout.readline())
> ValueError: invalid literal for int() with base 10: 
> 'spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n'
> {code}
> It's trying to read the Java gateway port as an int from the sub-process' 
> STDOUT. However, what it read was an error message, which is clearly not an 
> int. We should differentiate between these cases and just propagate the 
> original message if it's not an int. Right now, this exception is not very 
> helpful.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1850) Bad exception if multiple jars exist when running PySpark

2014-07-02 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050318#comment-14050318
 ] 

Andrew Or commented on SPARK-1850:
--

Yes, I will change it.

> Bad exception if multiple jars exist when running PySpark
> -
>
> Key: SPARK-1850
> URL: https://issues.apache.org/jira/browse/SPARK-1850
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.0
>Reporter: Andrew Or
> Fix For: 1.0.1
>
>
> {code}
> Found multiple Spark assembly jars in 
> /Users/andrew/Documents/dev/andrew-spark/assembly/target/scala-2.10:
> Traceback (most recent call last):
>   File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/shell.py", 
> line 43, in <module>
> sc = SparkContext(os.environ.get("MASTER", "local[*]"), "PySparkShell", 
> pyFiles=add_files)
>   File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", 
> line 94, in __init__
> SparkContext._ensure_initialized(self, gateway=gateway)
>   File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", 
> line 180, in _ensure_initialized
> SparkContext._gateway = gateway or launch_gateway()
>   File 
> "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/java_gateway.py", 
> line 49, in launch_gateway
> gateway_port = int(proc.stdout.readline())
> ValueError: invalid literal for int() with base 10: 
> 'spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n'
> {code}
> It's trying to read the Java gateway port as an int from the sub-process' 
> STDOUT. However, what it read was an error message, which is clearly not an 
> int. We should differentiate between these cases and just propagate the 
> original message if it's not an int. Right now, this exception is not very 
> helpful.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2343) QueueInputDStream with oneAtATime=false does not dequeue items

2014-07-02 Thread Manuel Laflamme (JIRA)
Manuel Laflamme created SPARK-2343:
--

 Summary: QueueInputDStream with oneAtATime=false does not dequeue 
items
 Key: SPARK-2343
 URL: https://issues.apache.org/jira/browse/SPARK-2343
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.0, 0.9.1, 0.9.0
Reporter: Manuel Laflamme
Priority: Minor


QueueInputDStream does not dequeue items when used with the oneAtATime flag 
disabled. The same items are reprocessed for every batch. 
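
A minimal Scala sketch that reproduces the setup described above (app name, 
master and batch interval are arbitrary):

{code}
import scala.collection.mutable.Queue

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// With oneAtATime = false, every queued RDD should be consumed in the next
// batch and then removed; the reported bug is that the same RDDs keep being
// reprocessed on every batch.
object QueueStreamRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("queue-repro").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    val queue = new Queue[RDD[Int]]
    queue += ssc.sparkContext.parallelize(1 to 3)
    queue += ssc.sparkContext.parallelize(4 to 6)

    val stream = ssc.queueStream(queue, oneAtATime = false)
    stream.foreachRDD(rdd => println(s"batch saw ${rdd.count()} records"))

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}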



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1989) Exit executors faster if they get into a cycle of heavy GC

2014-07-02 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050005#comment-14050005
 ] 

Guoqiang Li commented on SPARK-1989:


In this case we should also trigger driver garbage collection.
The related work: 
https://github.com/witgo/spark/compare/taskEvent

> Exit executors faster if they get into a cycle of heavy GC
> --
>
> Key: SPARK-1989
> URL: https://issues.apache.org/jira/browse/SPARK-1989
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
> Fix For: 1.1.0
>
>
> I've seen situations where an application is allocating too much memory 
> across its tasks + cache to proceed, but Java gets into a cycle where it 
> repeatedly runs full GCs, frees up a bit of the heap, and continues instead 
> of giving up. This then leads to timeouts and confusing error messages. It 
> would be better to crash with OOM sooner. The JVM has options to support 
> this: http://java.dzone.com/articles/tracking-excessive-garbage.
> The right solution would probably be:
> - Add some config options used by spark-submit to set -XX:GCTimeLimit and 
> -XX:GCHeapFreeLimit, with more conservative values than the defaults (e.g. 90% 
> time limit, 5% free limit)
> - Make sure we pass these into the Java options for executors in each 
> deployment mode
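
A hedged sketch of one way to wire those flags in today, using the standard 
spark.executor.extraJavaOptions setting; the limit values follow the suggestion 
above (90% time limit, 5% free limit) and are not tuned defaults:

{code}
import org.apache.spark.SparkConf

// Pass GC-overhead flags to executors so the JVM aborts with an OOM instead
// of looping in back-to-back full GCs.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseGCOverheadLimit -XX:GCTimeLimit=90 -XX:GCHeapFreeLimit=5")
{code}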



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049942#comment-14049942
 ] 

Sean Owen commented on SPARK-2341:
--

I've been a bit uncomfortable with how the MLlib API conflates categorical 
values and numbers, since they aren't numbers in general. Treating them as 
numbers is a convenience in some cases, and common in papers, but feels like 
suboptimal software design -- should a user have to convert categoricals to 
some numeric representation? To me it invites confusion, and this is one 
symptom. So I am not sure "multiclass" should mean "parse target as double" to 
begin with?

OK, it's not the issue here. But since we're on the subject of an experimental 
API that is subject to change, this is a related example of something that could 
be improved along the way, and it's my #1 wish for MLlib at the moment. I'd 
really like to work on a change that accommodates classes as, say, strings at 
least, and doesn't presume doubles. But I am trying to figure out whether anyone 
agrees with that. 

> loadLibSVMFile doesn't handle regression datasets
> -
>
> Key: SPARK-2341
> URL: https://issues.apache.org/jira/browse/SPARK-2341
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Eustache
>Priority: Minor
>  Labels: easyfix
>
> Many datasets exist in LibSVM format for regression tasks [1] but currently 
> the loadLibSVMFile primitive doesn't handle regression datasets.
> More precisely, the LabelParser is either a MulticlassLabelParser or a 
> BinaryLabelParser. What happens then is that the file is loaded but in 
> multiclass mode: each target value is interpreted as a class name!
> The fix would be to write a RegressionLabelParser which converts target 
> values to Double and plug it into the loadLibSVMFile routine.
> [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets

2014-07-02 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049939#comment-14049939
 ] 

Alexander Ulanov commented on SPARK-1473:
-

Does anybody work on this issue?

> Feature selection for high dimensional datasets
> ---
>
> Key: SPARK-1473
> URL: https://issues.apache.org/jira/browse/SPARK-1473
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Ignacio Zendejas
>Priority: Minor
>  Labels: features
> Fix For: 1.1.0
>
>
> For classification tasks involving large feature spaces in the order of tens 
> of thousands or higher (e.g., text classification with n-grams, where n > 1), 
> it is often useful to rank and filter features that are irrelevant thereby 
> reducing the feature space by at least one or two orders of magnitude without 
> impacting performance on key evaluation metrics (accuracy/precision/recall).
> A feature evaluation interface which is flexible needs to be designed and at 
> least two methods should be implemented with Information Gain being a 
> priority as it has been shown to be amongst the most reliable.
> Special consideration should be taken in the design to account for wrapper 
> methods (see research papers below) which are more practical for lower 
> dimensional data.
> Relevant research:
> * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
> likelihood maximisation: a unifying framework for information theoretic
> feature selection. *The Journal of Machine Learning Research*, *13*, 27-66.
> * Forman, George. "An extensive empirical study of feature selection metrics 
> for text classification." The Journal of machine learning research 3 (2003): 
> 1289-1305.
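
A design sketch of what the flexible evaluation interface might look like; none 
of these names exist in MLlib, they are hypothetical:

{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// A ranker scores every feature index; a higher score means more relevant.
trait FeatureRanker extends Serializable {
  def rank(data: RDD[LabeledPoint]): Array[(Int, Double)]
}

// A filter-style selector keeps the k best-scoring feature indices.
object FeatureSelection {
  def selectTopK(ranker: FeatureRanker, data: RDD[LabeledPoint], k: Int): Array[Int] =
    ranker.rank(data).sortBy(p => -p._2).take(k).map(_._1)
}
{code}

An Information Gain implementation would then be one concrete FeatureRanker, 
and wrapper methods could be layered on top of the same trait.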



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1284) pyspark hangs after IOError on Executor

2014-07-02 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049937#comment-14049937
 ] 

Matthew Farrellee commented on SPARK-1284:
--

[~jblomo] -

will you add a reproducer script to this issue?

i did a simple test based on what you suggested w/ the tip of master and could 
not reproduce -

{code}
$ ./dist/bin/pyspark
Python 2.7.5 (default, Feb 19 2014, 13:47:28) 
[GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.0.0-SNAPSHOT
      /_/

Using Python version 2.7.5 (default, Feb 19 2014 13:47:28)
SparkContext available as sc.
>>> data = sc.textFile('/etc/passwd')
14/07/02 07:03:59 INFO MemoryStore: ensureFreeSpace(32816) called with 
curMem=0, maxMem=308910489
14/07/02 07:03:59 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 32.0 KB, free 294.6 MB)
>>> data.cache()
/etc/passwd MappedRDD[1] at textFile at NativeMethodAccessorImpl.java:-2
>>> data.take(10)
...[expected output]...
>>> data.flatMap(lambda line: line.split(':')).map(lambda word: (word, 
>>> 1)).reduceByKey(lambda x, y: x + y).collect()
...[expected output, no hang]...
{code}

> pyspark hangs after IOError on Executor
> ---
>
> Key: SPARK-1284
> URL: https://issues.apache.org/jira/browse/SPARK-1284
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Jim Blomo
>
> When running a reduceByKey over a cached RDD, Python fails with an exception, 
> but the failure is not detected by the task runner.  Spark and the pyspark 
> shell hang waiting for the task to finish.
> The error is:
> {code}
> PySpark worker failed with exception:
> Traceback (most recent call last):
>   File "/home/hadoop/spark/python/pyspark/worker.py", line 77, in main
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/home/hadoop/spark/python/pyspark/serializers.py", line 182, in 
> dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/home/hadoop/spark/python/pyspark/serializers.py", line 118, in 
> dump_stream
> self._write_with_length(obj, stream)
>   File "/home/hadoop/spark/python/pyspark/serializers.py", line 130, in 
> _write_with_length
> stream.write(serialized)
> IOError: [Errno 104] Connection reset by peer
> 14/03/19 22:48:15 INFO scheduler.TaskSetManager: Serialized task 4.0:0 as 
> 4257 bytes in 47 ms
> Traceback (most recent call last):
>   File "/home/hadoop/spark/python/pyspark/daemon.py", line 117, in 
> launch_worker
> worker(listen_sock)
>   File "/home/hadoop/spark/python/pyspark/daemon.py", line 107, in worker
> outfile.flush()
> IOError: [Errno 32] Broken pipe
> {code}
> I can reproduce the error by running take(10) on the cached RDD before 
> running reduceByKey (which looks at the whole input file).
> Affects Version 1.0.0-SNAPSHOT (4d88030486)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1030) unneeded file required when running pyspark program using yarn-client

2014-07-02 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049929#comment-14049929
 ] 

Matthew Farrellee commented on SPARK-1030:
--

using pyspark to submit is deprecated in spark 1.0 in favor of spark-submit. i 
think this should be closed as resolved/workfix. /cc: [~pwendell] [~joshrosen]

> unneeded file required when running pyspark program using yarn-client
> -
>
> Key: SPARK-1030
> URL: https://issues.apache.org/jira/browse/SPARK-1030
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, PySpark, YARN
>Affects Versions: 0.8.1
>Reporter: Diana Carroll
>Assignee: Josh Rosen
>
> I can successfully run a pyspark program using the yarn-client master using 
> the following command:
> {code}
> SPARK_JAR=$SPARK_HOME/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.1-incubating-hadoop2.2.0.jar
>  \
> SPARK_YARN_APP_JAR=~/testdata.txt pyspark \
> test1.py
> {code}
> However, the SPARK_YARN_APP_JAR doesn't make any sense; it's a Python 
> program, and therefore there's no JAR.  If I don't set the value, or if I set 
> the value to a non-existent file, Spark gives me an error message.  
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : org.apache.spark.SparkException: env SPARK_YARN_APP_JAR is not set
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:46)
> {code}
> or
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : java.io.FileNotFoundException: File file:dummy.txt does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
> {code}
> My program is very simple:
> {code}
> from pyspark import SparkContext
> def main():
> sc = SparkContext("yarn-client", "Simple App")
> logData = 
> sc.textFile("hdfs://localhost/user/training/weblogs/2013-09-15.log")
> numjpgs = logData.filter(lambda s: '.jpg' in s).count()
> print "Number of JPG requests: " + str(numjpgs)
> {code}
> Although it reads the SPARK_YARN_APP_JAR file, it doesn't use the file at 
> all; I can point it at anything, as long as it's a valid, accessible file, 
> and it works the same.
> Although there's an obvious workaround for this bug, it's high priority from 
> my perspective because I'm working on a course to teach people how to do 
> this, and it's really hard to explain why this variable is needed!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1257) Endless running task when using pyspark with input file containing a long line

2014-07-02 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049933#comment-14049933
 ] 

Matthew Farrellee commented on SPARK-1257:
--

recommend close as resolved w/ option for filer to reopen if the issue 
reproduces in 1.0 /cc: [~pwendell] [~joshrosen]

> Endless running task when using pyspark with input file containing a long line
> --
>
> Key: SPARK-1257
> URL: https://issues.apache.org/jira/browse/SPARK-1257
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.9.0
>Reporter: Hanchen Su
>
> When launching any pyspark application with an input file containing a very 
> long line (about 7 characters), the job hangs and never stops. 
> The application UI shows that there is a task running endlessly.
> There will be no problem using the scala version with the same input.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1550) Successive creation of spark context fails in pyspark, if the previous initialization of spark context had failed.

2014-07-02 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049918#comment-14049918
 ] 

Matthew Farrellee commented on SPARK-1550:
--

this issue as reported is no longer present in spark 1.0, where defaults are 
provided for app name and master.

{code}
$ SPARK_HOME=dist 
PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.1-src.zip python
Python 2.7.5 (default, Feb 19 2014, 13:47:28) 
[GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyspark import SparkContext
>>> sc=SparkContext('local')
[successful creation of context]
{code}

i believe this should be closed as resolved. /cc: [~pwendell]

> Successive creation of spark context fails in pyspark, if the previous 
> initialization of spark context had failed.
> --
>
> Key: SPARK-1550
> URL: https://issues.apache.org/jira/browse/SPARK-1550
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Prabin Banka
>  Labels: pyspark, sparkcontext
>
> For example;-
> In PySpark, if we try to initialize spark context with insufficient 
> arguments, >>>sc=SparkContext('local')
> it fails with an exception 
> Exception: An application name must be set in your configuration
> This is all fine. 
> However, any successive creation of spark context with correct arguments, 
> also fails,
> >>>s1=SparkContext('local', 'test1')
> AttributeError: 'SparkContext' object has no attribute 'master'



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1850) Bad exception if multiple jars exist when running PySpark

2014-07-02 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049895#comment-14049895
 ] 

Matthew Farrellee commented on SPARK-1850:
--

[~andrewor14] -

i think this should be closed as resolved in SPARK-2242

the current output for the error is,

{noformat}
$ ./dist/bin/pyspark
Python 2.7.5 (default, Feb 19 2014, 13:47:28) 
[GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/shell.py", 
line 43, in <module>
sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
  File 
"/home/matt/Documents/Repositories/spark/dist/python/pyspark/context.py", line 
95, in __init__
SparkContext._ensure_initialized(self, gateway=gateway)
  File 
"/home/matt/Documents/Repositories/spark/dist/python/pyspark/context.py", line 
191, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
  File 
"/home/matt/Documents/Repositories/spark/dist/python/pyspark/java_gateway.py", 
line 66, in launch_gateway
raise Exception(error_msg)
Exception: Launching GatewayServer failed with exit code 1!(Warning: unexpected 
output detected.)

Found multiple Spark assembly jars in 
/home/matt/Documents/Repositories/spark/dist/lib:
spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4-.jar
spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4.jar
Please remove all but one jar.
{noformat}

> Bad exception if multiple jars exist when running PySpark
> -
>
> Key: SPARK-1850
> URL: https://issues.apache.org/jira/browse/SPARK-1850
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.0
>Reporter: Andrew Or
> Fix For: 1.0.1
>
>
> {code}
> Found multiple Spark assembly jars in 
> /Users/andrew/Documents/dev/andrew-spark/assembly/target/scala-2.10:
> Traceback (most recent call last):
>   File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/shell.py", 
> line 43, in <module>
> sc = SparkContext(os.environ.get("MASTER", "local[*]"), "PySparkShell", 
> pyFiles=add_files)
>   File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", 
> line 94, in __init__
> SparkContext._ensure_initialized(self, gateway=gateway)
>   File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", 
> line 180, in _ensure_initialized
> SparkContext._gateway = gateway or launch_gateway()
>   File 
> "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/java_gateway.py", 
> line 49, in launch_gateway
> gateway_port = int(proc.stdout.readline())
> ValueError: invalid literal for int() with base 10: 
> 'spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n'
> {code}
> It's trying to read the Java gateway port as an int from the sub-process' 
> STDOUT. However, what it read was an error message, which is clearly not an 
> int. We should differentiate between these cases and just propagate the 
> original message if it's not an int. Right now, this exception is not very 
> helpful.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1884) Shark failed to start

2014-07-02 Thread Pete MacKinnon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049877#comment-14049877
 ] 

Pete MacKinnon commented on SPARK-1884:
---

This is due to the version of protobuf-java provided by Shark being older 
(2.4.1) than what's needed by Hadoop 2.4 (2.5.0). See SPARK-2338.

> Shark failed to start
> -
>
> Key: SPARK-1884
> URL: https://issues.apache.org/jira/browse/SPARK-1884
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 0.9.1
> Environment: ubuntu 14.04, spark 0.9.1, hive 0.13.0, hadoop 2.4.0 
> (stand alone), scala 2.11.0
>Reporter: Wei Cui
>Priority: Blocker
>
> The hadoop, hive and spark setup works fine.
> When starting shark, it failed with the following messages:
> Starting the Shark Command Line Client
> 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.input.dir.recursive 
> is deprecated. Instead, use 
> mapreduce.input.fileinputformat.input.dir.recursive
> 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.max.split.size is 
> deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
> 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.min.split.size is 
> deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
> 14/05/19 16:47:21 INFO Configuration.deprecation: 
> mapred.min.split.size.per.rack is deprecated. Instead, use 
> mapreduce.input.fileinputformat.split.minsize.per.rack
> 14/05/19 16:47:21 INFO Configuration.deprecation: 
> mapred.min.split.size.per.node is deprecated. Instead, use 
> mapreduce.input.fileinputformat.split.minsize.per.node
> 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.reduce.tasks is 
> deprecated. Instead, use mapreduce.job.reduces
> 14/05/19 16:47:21 INFO Configuration.deprecation: 
> mapred.reduce.tasks.speculative.execution is deprecated. Instead, use 
> mapreduce.reduce.speculative
> 14/05/19 16:47:22 WARN conf.Configuration: 
> org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to 
> override final parameter: mapreduce.job.end-notification.max.retry.interval;  
> Ignoring.
> 14/05/19 16:47:22 WARN conf.Configuration: 
> org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to 
> override final parameter: mapreduce.cluster.local.dir;  Ignoring.
> 14/05/19 16:47:22 WARN conf.Configuration: 
> org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to 
> override final parameter: mapreduce.job.end-notification.max.attempts;  
> Ignoring.
> 14/05/19 16:47:22 WARN conf.Configuration: 
> org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to 
> override final parameter: mapreduce.cluster.temp.dir;  Ignoring.
> Logging initialized using configuration in 
> jar:file:/usr/local/shark/lib_managed/jars/edu.berkeley.cs.shark/hive-common/hive-common-0.11.0-shark-0.9.1.jar!/hive-log4j.properties
> Hive history 
> file=/tmp/root/hive_job_log_root_14857@ubuntu_201405191647_897494215.txt
> 6.004: [GC 279616K->18440K(1013632K), 0.0438980 secs]
> 6.445: [Full GC 59125K->7949K(1013632K), 0.0685160 secs]
> Reloading cached RDDs from previous Shark sessions... (use -skipRddReload 
> flag to skip reloading)
> 7.535: [Full GC 104136K->13059K(1013632K), 0.0885820 secs]
> 8.459: [Full GC 61237K->18031K(1013632K), 0.0820400 secs]
> 8.662: [Full GC 29832K->8958K(1013632K), 0.0869700 secs]
> 8.751: [Full GC 13433K->8998K(1013632K), 0.0856520 secs]
> 10.435: [Full GC 72246K->14140K(1013632K), 0.1797530 secs]
> Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1072)
>   at shark.memstore2.TableRecovery$.reloadRdds(TableRecovery.scala:49)
>   at shark.SharkCliDriver.<init>(SharkCliDriver.scala:283)
>   at shark.SharkCliDriver$.main(SharkCliDriver.scala:162)
>   at shark.SharkCliDriver.main(SharkCliDriver.scala)
> Caused by: java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1139)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:51)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:61)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2288)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2299)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1070)
>   ... 4 more
> Caused by: java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeConstructorA

[jira] [Commented] (SPARK-2306) BoundedPriorityQueue is private and not registered with Kryo

2014-07-02 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049818#comment-14049818
 ] 

Daniel Darabos commented on SPARK-2306:
---

You're the best, Ankit! Thanks!

> BoundedPriorityQueue is private and not registered with Kryo
> 
>
> Key: SPARK-2306
> URL: https://issues.apache.org/jira/browse/SPARK-2306
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Daniel Darabos
>
> Because BoundedPriorityQueue is private and not registered with Kryo, RDD.top 
> cannot be used when using Kryo (the recommended configuration).
> Curiously BoundedPriorityQueue is registered by GraphKryoRegistrator. But 
> that's the wrong registrator. (Is there one for Spark Core?)
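
A sketch of the usual workaround until the class is registered by Spark itself: 
register it by name from a custom KryoRegistrator. This assumes the class lives 
at org.apache.spark.util.BoundedPriorityQueue; the registrator name is made up:

{code}
import com.esotericsoftware.kryo.Kryo

import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Register the private class by name so RDD.top's result can be serialized.
class TopRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(Class.forName("org.apache.spark.util.BoundedPriorityQueue"))
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[TopRegistrator].getName)
{code}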



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1681) Handle hive support correctly in ./make-distribution.sh

2014-07-02 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-1681:
---

Summary: Handle hive support correctly in ./make-distribution.sh  (was: 
Handle hive support correctly in ./make-distribution)

> Handle hive support correctly in ./make-distribution.sh
> ---
>
> Key: SPARK-1681
> URL: https://issues.apache.org/jira/browse/SPARK-1681
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Blocker
> Fix For: 1.0.0
>
>
> When Hive support is enabled we should copy the datanucleus jars to the 
> packaged distribution. The simplest way would be to create a lib_managed 
> folder in the final distribution so that the compute-classpath script 
> searches in exactly the same way whether or not it's a release.
> A slightly nicer solution is to put the jars inside of `/lib` and have some 
> fancier check for the jar location in the compute-classpath script.
> We should also document how to run Spark SQL on YARN when hive support is 
> enabled. In particular how to add the necessary jars to spark-submit.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-02 Thread Eustache (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049778#comment-14049778
 ] 

Eustache commented on SPARK-2341:
-

Ok then, would you mind if I work on a doc improvement for this?

Perhaps a simple no-brainer like "for regression, set this to true" could do 
the job...

Personally I think `multiclassOrRegression` is a good option, but I leave it 
to you to decide :)


> loadLibSVMFile doesn't handle regression datasets
> -
>
> Key: SPARK-2341
> URL: https://issues.apache.org/jira/browse/SPARK-2341
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Eustache
>Priority: Minor
>  Labels: easyfix
>
> Many datasets exist in LibSVM format for regression tasks [1] but currently 
> the loadLibSVMFile primitive doesn't handle regression datasets.
> More precisely, the LabelParser is either a MulticlassLabelParser or a 
> BinaryLabelParser. What happens then is that the file is loaded but in 
> multiclass mode: each target value is interpreted as a class name!
> The fix would be to write a RegressionLabelParser which converts target 
> values to Double and plug it into the loadLibSVMFile routine.
> [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-02 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049765#comment-14049765
 ] 

Xiangrui Meng edited comment on SPARK-2341 at 7/2/14 9:09 AM:
--

It is a little awkward to have both `regression` and `multiclass` as input 
arguments. I agree that a correct name should be `multiclassOrRegression` or 
`multiclassOrContinuous`. But it is certainly too long. We tried to make this 
clear in the doc:

{code}
multiclass: whether the input labels contain more than two classes. If false, 
any label with value greater than 0.5 will be mapped to 1.0, or 0.0 otherwise. 
So it works for both +1/-1 and 1/0 cases. If true, the double value parsed 
directly from the label string will be used as the label value.
{code}

It would be good if we can improve the documentation to make it clearer. But 
for the API, I don't feel that it is necessary to change.



was (Author: mengxr):
It is a little awkward to have both `regression` and `multiclass` as input 
arguments. I agree that a correct name should be `multiclassOrRegression`. But 
it is certainly too long. We tried to make this clear in the doc:

{code}
multiclass: whether the input labels contain more than two classes. If false, 
any label with value greater than 0.5 will be mapped to 1.0, or 0.0 otherwise. 
So it works for both +1/-1 and 1/0 cases. If true, the double value parsed 
directly from the label string will be used as the label value.
{code}

It would be good if we can improve the documentation to make it clearer. But 
for the API, I don't feel that it is necessary to change.


> loadLibSVMFile doesn't handle regression datasets
> -
>
> Key: SPARK-2341
> URL: https://issues.apache.org/jira/browse/SPARK-2341
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Eustache
>Priority: Minor
>  Labels: easyfix
>
> Many datasets exist in LibSVM format for regression tasks [1] but currently 
> the loadLibSVMFile primitive doesn't handle regression datasets.
> More precisely, the LabelParser is either a MulticlassLabelParser or a 
> BinaryLabelParser. What happens then is that the file is loaded but in 
> multiclass mode: each target value is interpreted as a class name!
> The fix would be to write a RegressionLabelParser which converts target 
> values to Double and plug it into the loadLibSVMFile routine.
> [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-02 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049765#comment-14049765
 ] 

Xiangrui Meng commented on SPARK-2341:
--

It is a little awkward to have both `regression` and `multiclass` as input 
arguments. I agree that a correct name should be `multiclassOrRegression`. But 
it is certainly too long. We tried to make this clear in the doc:

{code}
multiclass: whether the input labels contain more than two classes. If false, 
any label with value greater than 0.5 will be mapped to 1.0, or 0.0 otherwise. 
So it works for both +1/-1 and 1/0 cases. If true, the double value parsed 
directly from the label string will be used as the label value.
{code}

It would be good if we can improve the documentation to make it clearer. But 
for the API, I don't feel that it is necessary to change.


> loadLibSVMFile doesn't handle regression datasets
> -
>
> Key: SPARK-2341
> URL: https://issues.apache.org/jira/browse/SPARK-2341
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Eustache
>Priority: Minor
>  Labels: easyfix
>
> Many datasets exist in LibSVM format for regression tasks [1] but currently 
> the loadLibSVMFile primitive doesn't handle regression datasets.
> More precisely, the LabelParser is either a MulticlassLabelParser or a 
> BinaryLabelParser. What happens then is that the file is loaded but in 
> multiclass mode: each target value is interpreted as a class name!
> The fix would be to write a RegressionLabelParser which converts target 
> values to Double and plug it into the loadLibSVMFile routine.
> [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-02 Thread Eustache (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049755#comment-14049755
 ] 

Eustache commented on SPARK-2341:
-

I see that LabelParser with multiclass=true works for the regression 
setting.

What I fail to understand is how this is related to multiclass? Is the 
naming proper?

In any case, shouldn't we provide a name that explicitly mentions 
regression?






> loadLibSVMFile doesn't handle regression datasets
> -
>
> Key: SPARK-2341
> URL: https://issues.apache.org/jira/browse/SPARK-2341
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Eustache
>Priority: Minor
>  Labels: easyfix
>
> Many datasets exist in LibSVM format for regression tasks [1] but currently 
> the loadLibSVMFile primitive doesn't handle regression datasets.
> More precisely, the LabelParser is either a MulticlassLabelParser or a 
> BinaryLabelParser. What happens then is that the file is loaded but in 
> multiclass mode: each target value is interpreted as a class name!
> The fix would be to write a RegressionLabelParser which converts target 
> values to Double and plug it into the loadLibSVMFile routine.
> [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-02 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049732#comment-14049732
 ] 

Xiangrui Meng commented on SPARK-2341:
--

Just set `multiclass = true` to load double values.
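
For reference, a minimal sketch of that call, assuming the Spark 1.0 
loadLibSVMFile overload that takes a multiclass flag (the path is a 
placeholder):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.util.MLUtils

// Parse labels as plain doubles, which is what a regression dataset needs.
def loadRegressionData(sc: SparkContext, path: String) =
  MLUtils.loadLibSVMFile(sc, path, multiclass = true)
{code}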

> loadLibSVMFile doesn't handle regression datasets
> -
>
> Key: SPARK-2341
> URL: https://issues.apache.org/jira/browse/SPARK-2341
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Eustache
>Priority: Minor
>  Labels: easyfix
>
> Many datasets exist in LibSVM format for regression tasks [1] but currently 
> the loadLibSVMFile primitive doesn't handle regression datasets.
> More precisely, the LabelParser is either a MulticlassLabelParser or a 
> BinaryLabelParser. What happens then is that the file is loaded but in 
> multiclass mode: each target value is interpreted as a class name!
> The fix would be to write a RegressionLabelParser which converts target 
> values to Double and plug it into the loadLibSVMFile routine.
> [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2342) Evaluation helper's output type doesn't conform to input type

2014-07-02 Thread Yijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yijie Shen updated SPARK-2342:
--

Description: 
In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala
{code}protected final def n2 ( i: Row, e1: Expression, e2: Expression, f: 
((Numeric[Any], Any, Any) => Any)): Any  {code}
is intended to do computations for Numeric Add/Minus/Multiply.
Just as the comment suggests: {quote}Those expressions are supposed to be in 
the same data type, and also the return type.{quote}
But in the code, function f was cast to the function signature:
{code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int{code}
I think this is a typo and the correct signature should be:
{code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType{code}

  was:
In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala
protected final def n2 ( i: Row, e1: Expression, e2: Expression, f: 
((Numeric[Any], Any, Any) => Any)): Any  
is intended  to do computations for Numeric add/Minus/Multipy.
Just as the comment suggest : "Those expressions are supposed to be in the same 
data type, and also the return type."
But in code, function f was casted to function signature:
(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int
I thought it as a typo and the correct should be:
(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType


> Evaluation helper's output type doesn't conform to input type
> -
>
> Key: SPARK-2342
> URL: https://issues.apache.org/jira/browse/SPARK-2342
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Yijie Shen
>Priority: Minor
>  Labels: easyfix
>
> In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala
> {code}protected final def n2 ( i: Row, e1: Expression, e2: Expression, f: 
> ((Numeric[Any], Any, Any) => Any)): Any  {code}
> is intended to do computations for Numeric Add/Minus/Multiply.
> Just as the comment suggests: {quote}Those expressions are supposed to be in 
> the same data type, and also the return type.{quote}
> But in the code, function f was cast to the function signature:
> {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int{code}
> I think this is a typo and the correct signature should be:
> {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2342) Evaluation helper's output type doesn't conform to input type

2014-07-02 Thread Yijie Shen (JIRA)
Yijie Shen created SPARK-2342:
-

 Summary: Evaluation helper's output type doesn't conform to input 
type
 Key: SPARK-2342
 URL: https://issues.apache.org/jira/browse/SPARK-2342
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Yijie Shen
Priority: Minor


In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala
protected final def n2 ( i: Row, e1: Expression, e2: Expression, f: 
((Numeric[Any], Any, Any) => Any)): Any  
is intended to do computations for Numeric Add/Minus/Multiply.
Just as the comment suggests: "Those expressions are supposed to be in the same 
data type, and also the return type."
But in the code, function f was cast to the function signature:
(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int
I think this is a typo and the correct signature should be:
(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2339) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery

2014-07-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2339:


Fix Version/s: 1.1.0

> SQL parser in sql-core is case sensitive, but a table alias is converted to 
> lower case when we create Subquery
> --
>
> Key: SPARK-2339
> URL: https://issues.apache.org/jira/browse/SPARK-2339
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Yin Huai
> Fix For: 1.1.0
>
>
> Reported by 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html
> After we get the table from the catalog, because the table has an alias, we 
> will temporarily insert a Subquery. Then, we convert the table alias to lower 
> case no matter if the parser is case sensitive or not.
> To see the issue ...
> {code}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.createSchemaRDD
> case class Person(name: String, age: Int)
> val people = 
> sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p 
> => Person(p(0), p(1).trim.toInt))
> people.registerAsTable("people")
> sqlContext.sql("select PEOPLE.name from people PEOPLE")
> {code}
> The plan is ...
> {code}
> == Query Plan ==
> Project ['PEOPLE.name]
>  ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at 
> basicOperators.scala:176
> {code}
> You can find that "PEOPLE.name" is not resolved.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-02 Thread Eustache (JIRA)
Eustache created SPARK-2341:
---

 Summary: loadLibSVMFile doesn't handle regression datasets
 Key: SPARK-2341
 URL: https://issues.apache.org/jira/browse/SPARK-2341
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Eustache
Priority: Minor


Many datasets exist in LibSVM format for regression tasks [1] but currently the 
loadLibSVMFile primitive doesn't handle regression datasets.

More precisely, the LabelParser is either a MulticlassLabelParser or a 
BinaryLabelParser. What happens then is that the file is loaded but in 
multiclass mode: each target value is interpreted as a class name!

The fix would be to write a RegressionLabelParser which converts target values 
to Double and plug it into the loadLibSVMFile routine.

[1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html 



--
This message was sent by Atlassian JIRA
(v6.2#6252)