[jira] [Created] (SPARK-3228) When DStream saves RDDs to HDFS, don't create a directory and empty file if no data is received from the source in the batch duration

2014-08-25 Thread Leo (JIRA)
Leo created SPARK-3228:
--

 Summary: When DStream saves RDDs to HDFS, don't create a directory 
and empty file if no data is received from the source in the batch duration
 Key: SPARK-3228
 URL: https://issues.apache.org/jira/browse/SPARK-3228
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Leo


When I use DStream to save files to HDFS, it will create a directory and an 
empty file named "_SUCCESS" for each job generated in the batch duration.
But if there is no data from the source for a long time, and the duration is very 
short (e.g. 10s), it will create many directories and empty files in HDFS.
I don't think this is necessary. So I want to modify DStream's methods 
saveAsObjectFiles and saveAsTextFiles so that they create the directory and files only 
when the RDD's partitions size > 0.
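A minimal sketch of the proposed behavior, written here as an application-side helper using foreachRDD rather than as the actual DStream change; the saveNonEmptyAsTextFiles name and the output prefix are illustrative only.

{code}
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Hedged sketch: only write a per-batch directory when the batch RDD actually
// has partitions, so empty batches leave nothing behind in HDFS.
def saveNonEmptyAsTextFiles(stream: DStream[String], prefix: String): Unit = {
  stream.foreachRDD { (rdd, time: Time) =>
    if (rdd.partitions.length > 0) {
      rdd.saveAsTextFile(prefix + "-" + time.milliseconds)
    }
  }
}
{code}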






[jira] [Assigned] (SPARK-2886) Use more specific actor system name than "spark"

2014-08-25 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reassigned SPARK-2886:


Assignee: Andrew Or

> Use more specific actor system name than "spark"
> 
>
> Key: SPARK-2886
> URL: https://issues.apache.org/jira/browse/SPARK-2886
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Minor
> Fix For: 1.1.0
>
>
> With a recent PR (https://github.com/apache/spark/pull/1777) we log the name 
> of the actor system when it binds to a port. We should use a more specific 
> name instead of "spark."






[jira] [Resolved] (SPARK-2886) Use more specific actor system name than "spark"

2014-08-25 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-2886.
--

Resolution: Fixed

Fixed by https://github.com/apache/spark/pull/1810

> Use more specific actor system name than "spark"
> 
>
> Key: SPARK-2886
> URL: https://issues.apache.org/jira/browse/SPARK-2886
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: Andrew Or
>Priority: Minor
> Fix For: 1.1.0
>
>
> With a recent PR (https://github.com/apache/spark/pull/1777) we log the name 
> of the actor system when it binds to a port. We should use a more specific 
> name instead of "spark."






[jira] [Commented] (SPARK-3167) Port recent spark-submit changes to windows

2014-08-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110382#comment-14110382
 ] 

Apache Spark commented on SPARK-3167:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/2129

> Port recent spark-submit changes to windows
> ---
>
> Key: SPARK-3167
> URL: https://issues.apache.org/jira/browse/SPARK-3167
> Project: Spark
>  Issue Type: Bug
>Reporter: Patrick Wendell
>Assignee: Andrew Or
>Priority: Blocker
>







[jira] [Commented] (SPARK-3145) Hive on Spark umbrella

2014-08-25 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110362#comment-14110362
 ] 

Patrick Wendell commented on SPARK-3145:


[~bcwalrus] hey BC I made a minor change to the title since this concerns 
broader issues than dependencies. Hope that's alright!

> Hive on Spark umbrella
> --
>
> Key: SPARK-3145
> URL: https://issues.apache.org/jira/browse/SPARK-3145
> Project: Spark
>  Issue Type: Epic
>  Components: Build, Shuffle, Spark Core
>Reporter: bc Wong
>
> This is an umbrella jira to point to dependency & asks from the Hive-on-Spark 
> project (HIVE-7292).






[jira] [Commented] (SPARK-3213) spark_ec2.py cannot find slave instances launched with "Launch More Like This"

2014-08-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110363#comment-14110363
 ] 

Joseph K. Bradley commented on SPARK-3213:
--

Vida, that sounds fine; I'll show you how I did it tomorrow.  (I think it was 
not a temporary thing since I have not seen the spot instances get tags like 
that before.)
Patrick, good to know!  I'll use the script from now on.

> spark_ec2.py cannot find slave instances launched with "Launch More Like This"
> --
>
> Key: SPARK-3213
> URL: https://issues.apache.org/jira/browse/SPARK-3213
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
> Attachments: Screen Shot 2014-08-25 at 6.45.35 PM.png
>
>
> spark_ec2.py cannot find all slave instances.  In particular:
> * I created a master & slave and configured them.
> * I created new slave instances from the original slave ("Launch More Like 
> This").
> * I tried to relaunch the cluster, and it could only find the original slave.
> Old versions of the script worked.  The latest working commit which edited 
> that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1
> There may be a problem with this PR: 
> [https://github.com/apache/spark/pull/1899].






[jira] [Updated] (SPARK-3145) Hive on Spark umbrella

2014-08-25 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3145:
---

Summary: Hive on Spark umbrella  (was: Hive on Spark dependency umbrella)

> Hive on Spark umbrella
> --
>
> Key: SPARK-3145
> URL: https://issues.apache.org/jira/browse/SPARK-3145
> Project: Spark
>  Issue Type: Epic
>  Components: Build, Shuffle, Spark Core
>Reporter: bc Wong
>
> This is an umbrella jira to point to dependency & asks from the Hive-on-Spark 
> project (HIVE-7292).






[jira] [Commented] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib

2014-08-25 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110359#comment-14110359
 ] 

Yu Ishikawa commented on SPARK-2344:


Hi Alex, 

Noted, with thanks!
I am very interested in the design of a standardized clustering algorithm API.

I'm trying to implement an approximate hierarchical clustering algorithm now 
too. A standardized API would help me implement that. I look forward to seeing 
this included in MLlib.
https://issues.apache.org/jira/browse/SPARK-2966

If you have a branch for implementing FCM on GitHub, would you please let me 
know?

> Add Fuzzy C-Means algorithm to MLlib
> 
>
> Key: SPARK-2344
> URL: https://issues.apache.org/jira/browse/SPARK-2344
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Alex
>Priority: Minor
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib.
> FCM is very similar to K-Means, which is already implemented; they differ 
> only in the degree of relationship each point has with each cluster 
> (in FCM the relationship is in the range [0..1], whereas in K-Means it is 0/1).
> As part of the implementation I would like to:
> - create a base class for K-Means and FCM
> - implement the relationship for each algorithm differently (in its own class)
> I'd like this to be assigned to me.
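For context, a small sketch of the fuzzy membership computation that separates FCM from K-Means (degrees in [0..1] instead of a hard 0/1 assignment); this is illustrative Scala, not a proposed MLlib API, and the fuzzifier m is the usual parameter from the FCM objective.

{code}
// Hedged sketch: membership degrees of one point w.r.t. all cluster centers.
// u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)); the degrees sum to 1.
def memberships(point: Array[Double],
                centers: Array[Array[Double]],
                m: Double = 2.0): Array[Double] = {
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  val d = centers.map(c => math.max(dist(point, c), 1e-12)) // guard against division by zero
  val exp = 2.0 / (m - 1.0)
  d.map(di => 1.0 / d.map(dk => math.pow(di / dk, exp)).sum)
}
{code}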






[jira] [Updated] (SPARK-3178) setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the worker memory limit to zero

2014-08-25 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3178:
---

Labels: starter  (was: )

> setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the 
> worker memory limit to zero
> 
>
> Key: SPARK-3178
> URL: https://issues.apache.org/jira/browse/SPARK-3178
> Project: Spark
>  Issue Type: Bug
> Environment: osx
>Reporter: Jon Haddad
>  Labels: starter
>
> This should either default to m or just completely fail.  Starting a worker 
> with zero memory isn't very helpful.
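A minimal sketch of the stricter parsing the reporter asks for, assuming a bare number should be treated as megabytes and anything unparsable should fail loudly; the helper name is illustrative, not the actual Spark utility.

{code}
// Hedged sketch: parse a worker-memory string, defaulting an unlabeled number
// to megabytes and failing instead of silently producing a zero limit.
def parseWorkerMemoryMb(setting: String): Int = {
  val s = setting.trim.toLowerCase
  val mb =
    if (s.endsWith("g")) s.dropRight(1).toInt * 1024
    else if (s.endsWith("m")) s.dropRight(1).toInt
    else s.toInt // assumption: a bare number already means megabytes
  require(mb > 0, s"SPARK_WORKER_MEMORY must be positive, got '$setting'")
  mb
}
{code}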






[jira] [Updated] (SPARK-3213) spark_ec2.py cannot find slave instances launched with "Launch More Like This"

2014-08-25 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3213:
---

Issue Type: Improvement  (was: Bug)

> spark_ec2.py cannot find slave instances launched with "Launch More Like This"
> --
>
> Key: SPARK-3213
> URL: https://issues.apache.org/jira/browse/SPARK-3213
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
> Attachments: Screen Shot 2014-08-25 at 6.45.35 PM.png
>
>
> spark_ec2.py cannot find all slave instances.  In particular:
> * I created a master & slave and configured them.
> * I created new slave instances from the original slave ("Launch More Like 
> This").
> * I tried to relaunch the cluster, and it could only find the original slave.
> Old versions of the script worked.  The latest working commit which edited 
> that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1
> There may be a problem with this PR: 
> [https://github.com/apache/spark/pull/1899].






[jira] [Commented] (SPARK-3213) spark_ec2.py cannot find slave instances launched with "Launch More Like This"

2014-08-25 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110353#comment-14110353
 ] 

Patrick Wendell commented on SPARK-3213:


Hey I don't think we previously supported adding slaves like this, so I'm 
renaming this from a bug to a feature :)

> spark_ec2.py cannot find slave instances launched with "Launch More Like This"
> --
>
> Key: SPARK-3213
> URL: https://issues.apache.org/jira/browse/SPARK-3213
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
> Attachments: Screen Shot 2014-08-25 at 6.45.35 PM.png
>
>
> spark_ec2.py cannot find all slave instances.  In particular:
> * I created a master & slave and configured them.
> * I created new slave instances from the original slave ("Launch More Like 
> This").
> * I tried to relaunch the cluster, and it could only find the original slave.
> Old versions of the script worked.  The latest working commit which edited 
> that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1
> There may be a problem with this PR: 
> [https://github.com/apache/spark/pull/1899].






[jira] [Updated] (SPARK-3223) runAsSparkUser cannot change HDFS write permission properly in mesos cluster mode

2014-08-25 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3223:
---

Priority: Critical  (was: Major)

> runAsSparkUser cannot change HDFS write permission properly in mesos cluster 
> mode
> -
>
> Key: SPARK-3223
> URL: https://issues.apache.org/jira/browse/SPARK-3223
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Mesos
>Affects Versions: 1.0.2
>Reporter: Jongyoul Lee
>Priority: Critical
> Fix For: 1.0.3
>
>
> While running Mesos with the --no-switch_user option, the HDFS account name 
> differs between the driver and the executor. This causes a permission error at the 
> last stage: the executor's id is the Mesos user id, while the driver's id is the user 
> who runs spark-submit, so moving output from _temporary/path/to/output/part- to 
> /output/path/part- fails because of a permission error. The solution is simply to 
> set SPARK_USER to HADOOP_USER_NAME when MesosExecutorBackend 
> calls runAsSparkUser. HADOOP_USER_NAME is used when FileSystem gets the user.






[jira] [Updated] (SPARK-3224) FetchFailed stages could show up multiple times in failed stages in web ui

2014-08-25 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3224:
---

Priority: Blocker  (was: Critical)

> FetchFailed stages could show up multiple times in failed stages in web ui
> --
>
> Key: SPARK-3224
> URL: https://issues.apache.org/jira/browse/SPARK-3224
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
>
> Today I saw a job in which a reduce stage failed and showed up many times 
> in the failed stages. I think the reason is that the DAGScheduler fires the stage 
> completion (with failure) event multiple times in the case of FetchFailed.






[jira] [Commented] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib

2014-08-25 Thread Alex (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110342#comment-14110342
 ] 

Alex commented on SPARK-2344:
-

Hi,
I'm currently working on the implementation of FCM myself.
Also see this: https://issues.apache.org/jira/browse/SPARK-2430
(JIRA for Standarized Clustering Algorithm API)

> Add Fuzzy C-Means algorithm to MLlib
> 
>
> Key: SPARK-2344
> URL: https://issues.apache.org/jira/browse/SPARK-2344
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Alex
>Priority: Minor
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib.
> FCM is very similar to K-Means, which is already implemented; they differ 
> only in the degree of relationship each point has with each cluster 
> (in FCM the relationship is in the range [0..1], whereas in K-Means it is 0/1).
> As part of the implementation I would like to:
> - create a base class for K-Means and FCM
> - implement the relationship for each algorithm differently (in its own class)
> I'd like this to be assigned to me.






[jira] [Commented] (SPARK-3226) Doc update for MLlib dependencies

2014-08-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110335#comment-14110335
 ] 

Apache Spark commented on SPARK-3226:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/2128

> Doc update for MLlib dependencies
> -
>
> Key: SPARK-3226
> URL: https://issues.apache.org/jira/browse/SPARK-3226
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> to mention `-Pnetlib-lgpl` option.






[jira] [Created] (SPARK-3227) Add MLlib migration guide (1.0 -> 1.1)

2014-08-25 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-3227:


 Summary: Add MLlib migration guide (1.0 -> 1.1)
 Key: SPARK-3227
 URL: https://issues.apache.org/jira/browse/SPARK-3227
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Assignee: Joseph K. Bradley


Most API changes happen in decision tree.






[jira] [Created] (SPARK-3226) Doc update for MLlib dependencies

2014-08-25 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-3226:


 Summary: Doc update for MLlib dependencies
 Key: SPARK-3226
 URL: https://issues.apache.org/jira/browse/SPARK-3226
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


to mention `-Pnetlib-lgpl` option.






[jira] [Updated] (SPARK-2839) Documentation for statistical functions

2014-08-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2839:
-

Assignee: Burak Yavuz  (was: Xiangrui Meng)

> Documentation for statistical functions
> ---
>
> Key: SPARK-2839
> URL: https://issues.apache.org/jira/browse/SPARK-2839
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Reporter: Xiangrui Meng
>Assignee: Burak Yavuz
>
> Add documentation and code examples for statistical functions to MLlib's 
> programming guide.






[jira] [Updated] (SPARK-3223) runAsSparkUser cannot change HDFS write permission properly in mesos cluster mode

2014-08-25 Thread Jongyoul Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jongyoul Lee updated SPARK-3223:


Description: While running mesos with --no-switch_user option, HDFS account 
name is different from driver and executor. It makes a permission error at last 
stage. Executor's id is mesos' user id and driver's id is who runs 
spark-submit. So, moving output from _temporary/path/to/output/part- to 
/output/path/part- fails because of permission error. The solution for this 
is only setting SPARK_USER to HADOOP_USER_NAME when MesosExecutorBackend calls 
runAsSparkUser. HADOOP_USER_NAME is used when FileSystem get user.  (was: While 
running mesos with --no-switch_user option, HDFS account name is different from 
driver and executor. It makes a permission error at last stage. Executor's id 
is mesos' user id and driver's id is who runs spark-submit. So, moving output 
from _temporary/path/to/output/part- to /output/path/part- fails 
because of permission error. The solution for this is only setting SPARK_USER 
to HADOOP_USER_NAME when MesosExecutorBackend calls runAsSparkUser.)

> runAsSparkUser cannot change HDFS write permission properly in mesos cluster 
> mode
> -
>
> Key: SPARK-3223
> URL: https://issues.apache.org/jira/browse/SPARK-3223
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Mesos
>Affects Versions: 1.0.2
>Reporter: Jongyoul Lee
> Fix For: 1.0.3
>
>
> While running Mesos with the --no-switch_user option, the HDFS account name 
> differs between the driver and the executor. This causes a permission error at the 
> last stage: the executor's id is the Mesos user id, while the driver's id is the user 
> who runs spark-submit, so moving output from _temporary/path/to/output/part- to 
> /output/path/part- fails because of a permission error. The solution is simply to 
> set SPARK_USER to HADOOP_USER_NAME when MesosExecutorBackend 
> calls runAsSparkUser. HADOOP_USER_NAME is used when FileSystem gets the user.






[jira] [Updated] (SPARK-3223) runAsSparkUser cannot change HDFS write permission properly in mesos cluster mode

2014-08-25 Thread Jongyoul Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jongyoul Lee updated SPARK-3223:


Target Version/s: 1.0.3  (was: 1.1.0)

> runAsSparkUser cannot change HDFS write permission properly in mesos cluster 
> mode
> -
>
> Key: SPARK-3223
> URL: https://issues.apache.org/jira/browse/SPARK-3223
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Mesos
>Affects Versions: 1.0.2
>Reporter: Jongyoul Lee
> Fix For: 1.0.3
>
>
> While running Mesos with the --no-switch_user option, the HDFS account name 
> differs between the driver and the executor. This causes a permission error at the 
> last stage: the executor's id is the Mesos user id, while the driver's id is the user 
> who runs spark-submit, so moving output from _temporary/path/to/output/part- to 
> /output/path/part- fails because of a permission error. The solution is simply to 
> set SPARK_USER to HADOOP_USER_NAME when MesosExecutorBackend calls 
> runAsSparkUser.






[jira] [Updated] (SPARK-3223) runAsSparkUser cannot change HDFS write permission properly in mesos cluster mode

2014-08-25 Thread Jongyoul Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jongyoul Lee updated SPARK-3223:


Fix Version/s: 1.0.3

> runAsSparkUser cannot change HDFS write permission properly in mesos cluster 
> mode
> -
>
> Key: SPARK-3223
> URL: https://issues.apache.org/jira/browse/SPARK-3223
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Mesos
>Affects Versions: 1.0.2
>Reporter: Jongyoul Lee
> Fix For: 1.0.3
>
>
> While running Mesos with the --no-switch_user option, the HDFS account name 
> differs between the driver and the executor. This causes a permission error at the 
> last stage: the executor's id is the Mesos user id, while the driver's id is the user 
> who runs spark-submit, so moving output from _temporary/path/to/output/part- to 
> /output/path/part- fails because of a permission error. The solution is simply to 
> set SPARK_USER to HADOOP_USER_NAME when MesosExecutorBackend calls 
> runAsSparkUser.






[jira] [Updated] (SPARK-3224) FetchFailed stages could show up multiple times in failed stages in web ui

2014-08-25 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3224:
---

Priority: Critical  (was: Major)

> FetchFailed stages could show up multiple times in failed stages in web ui
> --
>
> Key: SPARK-3224
> URL: https://issues.apache.org/jira/browse/SPARK-3224
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
>
> Today I saw a job in which a reduce stage failed and showed up many times 
> in the failed stages. I think the reason is that the DAGScheduler fires the stage 
> completion (with failure) event multiple times in the case of FetchFailed.






[jira] [Created] (SPARK-3225) Typo in script

2014-08-25 Thread WangTaoTheTonic (JIRA)
WangTaoTheTonic created SPARK-3225:
--

 Summary: Typo in script
 Key: SPARK-3225
 URL: https://issues.apache.org/jira/browse/SPARK-3225
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Reporter: WangTaoTheTonic
Priority: Minor


use_conf_dir => user_conf_dir in load-spark-env.sh.






[jira] [Commented] (SPARK-3224) FetchFailed stages could show up multiple times in failed stages in web ui

2014-08-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110315#comment-14110315
 ] 

Apache Spark commented on SPARK-3224:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2127

> FetchFailed stages could show up multiple times in failed stages in web ui
> --
>
> Key: SPARK-3224
> URL: https://issues.apache.org/jira/browse/SPARK-3224
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Today I saw a job in which a reduce stage failed and showed up many times 
> in the failed stages. I think the reason is that the DAGScheduler fires the stage 
> completion (with failure) event multiple times in the case of FetchFailed.






[jira] [Commented] (SPARK-3223) runAsSparkUser cannot change HDFS write permission properly in mesos cluster mode

2014-08-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110314#comment-14110314
 ] 

Apache Spark commented on SPARK-3223:
-

User 'jongyoul' has created a pull request for this issue:
https://github.com/apache/spark/pull/2126

> runAsSparkUser cannot change HDFS write permission properly in mesos cluster 
> mode
> -
>
> Key: SPARK-3223
> URL: https://issues.apache.org/jira/browse/SPARK-3223
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Mesos
>Affects Versions: 1.0.2
>Reporter: Jongyoul Lee
>
> While running Mesos with the --no-switch_user option, the HDFS account name 
> differs between the driver and the executor. This causes a permission error at the 
> last stage: the executor's id is the Mesos user id, while the driver's id is the user 
> who runs spark-submit, so moving output from _temporary/path/to/output/part- to 
> /output/path/part- fails because of a permission error. The solution is simply to 
> set SPARK_USER to HADOOP_USER_NAME when MesosExecutorBackend calls 
> runAsSparkUser.






[jira] [Updated] (SPARK-3223) runAsSparkUser cannot change HDFS write permission properly in mesos cluster mode

2014-08-25 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3223:
---

Priority: Major  (was: Blocker)

> runAsSparkUser cannot change HDFS write permission properly in mesos cluster 
> mode
> -
>
> Key: SPARK-3223
> URL: https://issues.apache.org/jira/browse/SPARK-3223
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Mesos
>Affects Versions: 1.0.2
>Reporter: Jongyoul Lee
>
> While running Mesos with the --no-switch_user option, the HDFS account name 
> differs between the driver and the executor. This causes a permission error at the 
> last stage: the executor's id is the Mesos user id, while the driver's id is the user 
> who runs spark-submit, so moving output from _temporary/path/to/output/part- to 
> /output/path/part- fails because of a permission error. The solution is simply to 
> set SPARK_USER to HADOOP_USER_NAME when MesosExecutorBackend calls 
> runAsSparkUser.






[jira] [Updated] (SPARK-3223) runAsSparkUser cannot change HDFS write permission properly in mesos cluster mode

2014-08-25 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3223:
---

Fix Version/s: (was: 1.1.0)

> runAsSparkUser cannot change HDFS write permission properly in mesos cluster 
> mode
> -
>
> Key: SPARK-3223
> URL: https://issues.apache.org/jira/browse/SPARK-3223
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Mesos
>Affects Versions: 1.0.2
>Reporter: Jongyoul Lee
>Priority: Blocker
>
> While running Mesos with the --no-switch_user option, the HDFS account name 
> differs between the driver and the executor. This causes a permission error at the 
> last stage: the executor's id is the Mesos user id, while the driver's id is the user 
> who runs spark-submit, so moving output from _temporary/path/to/output/part- to 
> /output/path/part- fails because of a permission error. The solution is simply to 
> set SPARK_USER to HADOOP_USER_NAME when MesosExecutorBackend calls 
> runAsSparkUser.






[jira] [Created] (SPARK-3224) FetchFailed stages could show up multiple times in failed stages in web ui

2014-08-25 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-3224:
--

 Summary: FetchFailed stages could show up multiple times in failed 
stages in web ui
 Key: SPARK-3224
 URL: https://issues.apache.org/jira/browse/SPARK-3224
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: Reynold Xin
Assignee: Reynold Xin


Today I saw a job in which a reduce stage failed and showed up many times 
in the failed stages. I think the reason is that the DAGScheduler fires the stage 
completion (with failure) event multiple times in the case of FetchFailed.








[jira] [Created] (SPARK-3223) runAsSparkUser cannot change HDFS write permission properly in mesos cluster mode

2014-08-25 Thread Jongyoul Lee (JIRA)
Jongyoul Lee created SPARK-3223:
---

 Summary: runAsSparkUser cannot change HDFS write permission 
properly in mesos cluster mode
 Key: SPARK-3223
 URL: https://issues.apache.org/jira/browse/SPARK-3223
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, Mesos
Affects Versions: 1.0.2
Reporter: Jongyoul Lee
Priority: Blocker
 Fix For: 1.1.0


While running Mesos with the --no-switch_user option, the HDFS account name 
differs between the driver and the executor. This causes a permission error at the last 
stage: the executor's id is the Mesos user id, while the driver's id is the user who runs 
spark-submit, so moving output from _temporary/path/to/output/part- to 
/output/path/part- fails because of a permission error. The solution is simply to set 
SPARK_USER to HADOOP_USER_NAME when MesosExecutorBackend calls 
runAsSparkUser.






[jira] [Updated] (SPARK-3222) cross join support in HiveQl

2014-08-25 Thread Adrian Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Wang updated SPARK-3222:
---

Component/s: SQL

> cross join support in HiveQl
> 
>
> Key: SPARK-3222
> URL: https://issues.apache.org/jira/browse/SPARK-3222
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Adrian Wang
>
> Spark SQL hiveQl should support cross join.






[jira] [Comment Edited] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib

2014-08-25 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110268#comment-14110268
 ] 

Yu Ishikawa edited comment on SPARK-2344 at 8/26/14 5:15 AM:
-

Hi Alex, 

It seems that the fuzzy c-means algorithm has not been merged into Spark yet.
I am implementing that algorithm and creating a base class for k-means and FCM.
Would you please assign this issue to me?


was (Author: yuu.ishik...@gmail.com):
HI Alex, 

It seems that fuzzy c-means algorithm has been merged into Spark yet.
I am implementing that algorithm and create a base class for k-means and FCM.
Would you please assign this issue to me.

> Add Fuzzy C-Means algorithm to MLlib
> 
>
> Key: SPARK-2344
> URL: https://issues.apache.org/jira/browse/SPARK-2344
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Alex
>Priority: Minor
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib.
> FCM is very similar to K-Means, which is already implemented; they differ 
> only in the degree of relationship each point has with each cluster 
> (in FCM the relationship is in the range [0..1], whereas in K-Means it is 0/1).
> As part of the implementation I would like to:
> - create a base class for K-Means and FCM
> - implement the relationship for each algorithm differently (in its own class)
> I'd like this to be assigned to me.






[jira] [Commented] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib

2014-08-25 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110268#comment-14110268
 ] 

Yu Ishikawa commented on SPARK-2344:


HI Alex, 

It seems that fuzzy c-means algorithm has been merged into Spark yet.
I am implementing that algorithm and create a base class for k-means and FCM.
Would you please assign this issue to me.

> Add Fuzzy C-Means algorithm to MLlib
> 
>
> Key: SPARK-2344
> URL: https://issues.apache.org/jira/browse/SPARK-2344
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Alex
>Priority: Minor
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib.
> FCM is very similar to K-Means, which is already implemented; they differ 
> only in the degree of relationship each point has with each cluster 
> (in FCM the relationship is in the range [0..1], whereas in K-Means it is 0/1).
> As part of the implementation I would like to:
> - create a base class for K-Means and FCM
> - implement the relationship for each algorithm differently (in its own class)
> I'd like this to be assigned to me.






[jira] [Commented] (SPARK-2541) Standalone mode can't access secure HDFS anymore

2014-08-25 Thread qingtang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110250#comment-14110250
 ] 

qingtang commented on SPARK-2541:
-

Hi Thomas, could you share how you access secure HDFS from a standalone 
deployment of Spark? 

> Standalone mode can't access secure HDFS anymore
> 
>
> Key: SPARK-2541
> URL: https://issues.apache.org/jira/browse/SPARK-2541
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Thomas Graves
>
> In Spark 0.9.x you could access secure HDFS from a Standalone deployment; that 
> doesn't work in 1.X anymore. 
> It looks like the issue is in SparkHadoopUtil.runAsSparkUser.  Previously it 
> wouldn't do the doAs if the currentUser == user.  Not sure how this behaves 
> when the daemons run as a super user but SPARK_USER is set to someone else.






[jira] [Commented] (SPARK-3155) Support DecisionTree pruning

2014-08-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110241#comment-14110241
 ] 

Joseph K. Bradley commented on SPARK-3155:
--

Hi Qiping, thanks very much for the offer!  It would be great to get your help. 
 [~mengxr] Could you please assign this?

Coordination: I just submitted a PR for DecisionTree 
[https://github.com/apache/spark/pull/2125] which does some major changes.  
After that PR, I hope to work on other parts of MLlib.  However, [~manishamde] 
plans to work on generalizing DecisionTree to include random forests, so you 
may want to coordinate with him.

More thoughts on pruning: In my mind, pruning is related to this JIRA or 
[https://issues.apache.org/jira/browse/SPARK-3161], which would change the 
example--node mapping for the training data.  I figure the example--node 
mapping should be treated the same way for the training and pruning/validation 
sets.

> Support DecisionTree pruning
> 
>
> Key: SPARK-3155
> URL: https://issues.apache.org/jira/browse/SPARK-3155
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Improvement: accuracy, computation
> Summary: Pruning is a common method for preventing overfitting with decision 
> trees.  A smart implementation can prune the tree during training in order to 
> avoid training parts of the tree which would be pruned eventually anyways.  
> DecisionTree does not currently support pruning.
> Pruning:  A “pruning” of a tree is a subtree with the same root node, but 
> with zero or more branches removed.
> A naive implementation prunes as follows:
> (1) Train a depth K tree using a training set.
> (2) Compute the optimal prediction at each node (including internal nodes) 
> based on the training set.
> (3) Take a held-out validation set, and use the tree to make predictions for 
> each validation example.  This allows one to compute the validation error 
> made at each node in the tree (based on the predictions computed in step (2).)
> (4) For each pair of leaves with the same parent, compare the total error on 
> the validation set made by the leaves’ predictions with the error made by the 
> parent’s predictions.  Remove the leaves if the parent has lower error.
> A smarter implementation prunes during training, computing the error on the 
> validation set made by each node as it is trained.  Whenever two children 
> increase the validation error, they are pruned, and no more training is 
> required on that branch.
> It is common to use about 1/3 of the data for pruning.  Note that pruning is 
> important when using a tree directly for prediction.  It is less important 
> when combining trees via ensemble methods.
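A compact sketch of step (4), reduced-error pruning done bottom-up over a toy node type; the Node class here is illustrative, not MLlib's actual tree representation, and validationError is assumed to be the total error the node's own prediction makes on the validation examples that reach it.

{code}
// Hedged sketch of bottom-up reduced-error pruning on a toy binary tree.
case class Node(prediction: Double,
                validationError: Double,      // total validation error of this node's own prediction
                left: Option[Node] = None,
                right: Option[Node] = None) {
  def isLeaf: Boolean = left.isEmpty && right.isEmpty
}

def prune(node: Node): Node = (node.left, node.right) match {
  case (Some(l), Some(r)) =>
    val (pl, pr) = (prune(l), prune(r))
    // Collapse two sibling leaves when the parent's own prediction does at
    // least as well on the validation set as the two leaves combined.
    if (pl.isLeaf && pr.isLeaf &&
        node.validationError <= pl.validationError + pr.validationError) {
      node.copy(left = None, right = None)
    } else {
      node.copy(left = Some(pl), right = Some(pr))
    }
  case _ => node
}
{code}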






[jira] [Commented] (SPARK-3086) Use 1-indexing for decision tree nodes

2014-08-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110232#comment-14110232
 ] 

Apache Spark commented on SPARK-3086:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/2125

> Use 1-indexing for decision tree nodes
> --
>
> Key: SPARK-3086
> URL: https://issues.apache.org/jira/browse/SPARK-3086
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> 1-indexing is good for binary trees. The root node gets index 1. And for any 
> node with index i, its left child is (i << 1), right child is (i << 1) + 1, 
> parent is (i >> 1), and its level is `java.lang.Integer.highestOneBit(idx)` 
> (also 1-indexing for levels).
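The arithmetic above in a few lines of illustrative Scala (standalone helpers, not the MLlib Node API):

{code}
// Hedged sketch of the 1-indexed binary-tree layout described above.
def leftChildIndex(i: Int): Int  = i << 1
def rightChildIndex(i: Int): Int = (i << 1) + 1
def parentIndex(i: Int): Int     = i >> 1
// Per the description, the (1-indexed) level is encoded by the highest set bit:
// 1 for the root, 2 for its children, 4 for the next level, and so on.
def levelOf(i: Int): Int = java.lang.Integer.highestOneBit(i)
{code}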






[jira] [Commented] (SPARK-3156) DecisionTree: Order categorical features adaptively

2014-08-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110234#comment-14110234
 ] 

Apache Spark commented on SPARK-3156:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/2125

> DecisionTree: Order categorical features adaptively
> ---
>
> Key: SPARK-3156
> URL: https://issues.apache.org/jira/browse/SPARK-3156
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> Improvement: accuracy
> Currently, ordered categorical features use a fixed bin ordering chosen 
> before training based on a subsample of the data.  (See the code using 
> centroids in findSplitsBins().)
> Proposal: Choose the ordering adaptively for every split.  This would require 
> a bit more computation on the master, but could improve results by splitting 
> more intelligently.
> Required changes: The result of aggregation is used in 
> findAggForOrderedFeatureClassification() to compute running totals over the 
> pre-set ordering of categorical feature values.  The stats should instead be 
> used to choose a new ordering of categories, before computing running totals.






[jira] [Commented] (SPARK-3043) DecisionTree aggregation is inefficient

2014-08-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110233#comment-14110233
 ] 

Apache Spark commented on SPARK-3043:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/2125

> DecisionTree aggregation is inefficient
> ---
>
> Key: SPARK-3043
> URL: https://issues.apache.org/jira/browse/SPARK-3043
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> 2 major efficiency issues in computation and storage:
> (1) DecisionTree aggregation involves reshaping data unnecessarily.
> E.g., the internal methods extractNodeInfo() and getBinDataForNode() involve 
> reshaping the data multiple times without real computation.
> (2) DecisionTree splits and aggregate bins can include many unused 
> bins/splits.
> The same number of splits/bins are used for all features.  E.g., if there is 
> a continuous feature which uses 100 bins, then there will also be 100 bins 
> allocated for all binary features, even though only 2 are necessary.






[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results

2014-08-25 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110229#comment-14110229
 ] 

Guoqiang Li commented on SPARK-3098:


Now the bug is this:
After the shuffle fetches, multiple calls to the {{zip}}, {{zipWithIndex}}, or 
{{zipWithUniqueId}} operations return inconsistent results.

[The PR 2083|https://github.com/apache/spark/pull/2083] will affect performance. 
I am testing the specific performance impact.

Another solution is to re-implement the above operations.


>  In some cases, operation zipWithIndex get a wrong results
> --
>
> Key: SPARK-3098
> URL: https://issues.apache.org/jira/browse/SPARK-3098
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.1
>Reporter: Guoqiang Li
>Priority: Critical
>
> The reproduce code:
> {code}
>  val c = sc.parallelize(1 to 7899).flatMap { i =>
>   (1 to 1).toSeq.map(p => i * 6000 + p)
> }.distinct().zipWithIndex() 
> c.join(c).filter(t => t._2._1 != t._2._2).take(3)
> {code}
>  => 
> {code}
>  Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), 
> (36579712,(13,14)))
> {code}






[jira] [Created] (SPARK-3222) cross join support in HiveQl

2014-08-25 Thread Adrian Wang (JIRA)
Adrian Wang created SPARK-3222:
--

 Summary: cross join support in HiveQl
 Key: SPARK-3222
 URL: https://issues.apache.org/jira/browse/SPARK-3222
 Project: Spark
  Issue Type: New Feature
Reporter: Adrian Wang


Spark SQL hiveQl should support cross join.






[jira] [Resolved] (SPARK-2976) Replace tabs with spaces

2014-08-25 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-2976.
--

   Resolution: Fixed
Fix Version/s: 1.2.0

> Replace tabs with spaces
> 
>
> Key: SPARK-2976
> URL: https://issues.apache.org/jira/browse/SPARK-2976
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 1.2.0
>
>
> Currently, there are too many tabs in source files, which does not conform 
> to the coding style.
> I saw that the following 3 files have tabs:
> * sorttable.js
> * JavaPageRank.java
> * JavaKinesisWordCountASL.java






[jira] [Commented] (SPARK-3155) Support DecisionTree pruning

2014-08-25 Thread Qiping Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110194#comment-14110194
 ] 

Qiping Li commented on SPARK-3155:
--

Hi Joseph, glad to see you have considered supporting pruning in MLlib's 
decision tree. 
Is someone already working on this issue, or could you assign it to me? 
I'm ready to help on this module.

> Support DecisionTree pruning
> 
>
> Key: SPARK-3155
> URL: https://issues.apache.org/jira/browse/SPARK-3155
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Improvement: accuracy, computation
> Summary: Pruning is a common method for preventing overfitting with decision 
> trees.  A smart implementation can prune the tree during training in order to 
> avoid training parts of the tree which would be pruned eventually anyways.  
> DecisionTree does not currently support pruning.
> Pruning:  A “pruning” of a tree is a subtree with the same root node, but 
> with zero or more branches removed.
> A naive implementation prunes as follows:
> (1) Train a depth K tree using a training set.
> (2) Compute the optimal prediction at each node (including internal nodes) 
> based on the training set.
> (3) Take a held-out validation set, and use the tree to make predictions for 
> each validation example.  This allows one to compute the validation error 
> made at each node in the tree (based on the predictions computed in step (2).)
> (4) For each pair of leaves with the same parent, compare the total error on 
> the validation set made by the leaves’ predictions with the error made by the 
> parent’s predictions.  Remove the leaves if the parent has lower error.
> A smarter implementation prunes during training, computing the error on the 
> validation set made by each node as it is trained.  Whenever two children 
> increase the validation error, they are pruned, and no more training is 
> required on that branch.
> It is common to use about 1/3 of the data for pruning.  Note that pruning is 
> important when using a tree directly for prediction.  It is less important 
> when combining trees via ensemble methods.






[jira] [Updated] (SPARK-2976) Replace tabs with spaces

2014-08-25 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-2976:
-

Summary: Replace tabs with spaces  (was: Too many ugly tabs instead of 
white spaces)

> Replace tabs with spaces
> 
>
> Key: SPARK-2976
> URL: https://issues.apache.org/jira/browse/SPARK-2976
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> Currently, there are too many tabs in source files, which does not conform 
> to the coding style.
> I saw that the following 3 files have tabs:
> * sorttable.js
> * JavaPageRank.java
> * JavaKinesisWordCountASL.java






[jira] [Updated] (SPARK-2976) Replace tabs with spaces

2014-08-25 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-2976:
-

Assignee: Kousuke Saruta

> Replace tabs with spaces
> 
>
> Key: SPARK-2976
> URL: https://issues.apache.org/jira/browse/SPARK-2976
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> Currently, there are too many tabs in source files, which does not conform 
> to the coding style.
> I saw that the following 3 files have tabs:
> * sorttable.js
> * JavaPageRank.java
> * JavaKinesisWordCountASL.java






[jira] [Commented] (SPARK-2481) The environment variables SPARK_HISTORY_OPTS is covered in start-history-server.sh

2014-08-25 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110187#comment-14110187
 ] 

Andrew Or commented on SPARK-2481:
--

Resolved by https://github.com/apache/spark/pull/1341

> The environment variables SPARK_HISTORY_OPTS is covered in 
> start-history-server.sh
> --
>
> Key: SPARK-2481
> URL: https://issues.apache.org/jira/browse/SPARK-2481
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
> Fix For: 1.1.0
>
>
> If we have the following code in the conf/spark-env.sh  
> {{export SPARK_HISTORY_OPTS="-DSpark.history.XX=XX"}}
> The environment variable SPARK_HISTORY_OPTS is overridden in 
> [start-history-server.sh|https://github.com/apache/spark/blob/master/sbin/start-history-server.sh]
>  
> {code}
> if [ $# != 0 ]; then
>   echo "Using command line arguments for setting the log directory is 
> deprecated. Please "
>   echo "set the spark.history.fs.logDirectory configuration option instead."
>   export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS 
> -Dspark.history.fs.logDirectory=$1"
> fi
> {code}






[jira] [Resolved] (SPARK-2481) The environment variables SPARK_HISTORY_OPTS is covered in start-history-server.sh

2014-08-25 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-2481.
--

  Resolution: Fixed
   Fix Version/s: 1.1.0
Target Version/s: 1.1.0

> The environment variables SPARK_HISTORY_OPTS is covered in 
> start-history-server.sh
> --
>
> Key: SPARK-2481
> URL: https://issues.apache.org/jira/browse/SPARK-2481
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
> Fix For: 1.1.0
>
>
> If we have the following code in the conf/spark-env.sh  
> {{export SPARK_HISTORY_OPTS="-DSpark.history.XX=XX"}}
> The environment variable SPARK_HISTORY_OPTS is overridden in 
> [start-history-server.sh|https://github.com/apache/spark/blob/master/sbin/start-history-server.sh]
>  
> {code}
> if [ $# != 0 ]; then
>   echo "Using command line arguments for setting the log directory is 
> deprecated. Please "
>   echo "set the spark.history.fs.logDirectory configuration option instead."
>   export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS 
> -Dspark.history.fs.logDirectory=$1"
> fi
> {code}






[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results

2014-08-25 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110183#comment-14110183
 ] 

Matei Zaharia commented on SPARK-3098:
--

Sorry, I don't understand -- what exactly is the bug here? There's no guarantee 
about the ordering of elements in distinct(). If you're relying on zipWithIndex 
creating specific values, that's a wrong assumption to make. The question is 
just whether the *set* of elements returned by zipWithIndex is correct.

I don't think we should change our randomize() to be more deterministic here 
just because you want zipWithIndex. We have to allow shuffle fetches to occur 
in a random order, or else we can get inefficiency when there are hotspots. If 
you'd like to make sure values land in specific partitions and in a specific 
order in each partition, you can partition the data with your own Partitioner, 
and run a mapPartitions that sorts them within each one.
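For illustration, a minimal sketch of that approach (assuming an existing SparkContext {{sc}}; the modulo key and the HashPartitioner are just placeholders):

{code}
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._

// Pin each record to a partition with an explicit Partitioner, then impose a
// deterministic order inside every partition with mapPartitions.
val pairs = sc.parallelize(1 to 1000).map(x => (x % 10, x))    // (key, value)
val partitioned = pairs.partitionBy(new HashPartitioner(10))   // key decides the partition
val ordered = partitioned.mapPartitions { iter =>
  iter.toArray.sortBy(_._2).iterator                           // sort within each partition
}
ordered.zipWithIndex()                                         // indices are now reproducible
{code}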

>  In some cases, operation zipWithIndex get a wrong results
> --
>
> Key: SPARK-3098
> URL: https://issues.apache.org/jira/browse/SPARK-3098
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.1
>Reporter: Guoqiang Li
>Priority: Critical
>
> The reproduce code:
> {code}
>  val c = sc.parallelize(1 to 7899).flatMap { i =>
>   (1 to 1).toSeq.map(p => i * 6000 + p)
> }.distinct().zipWithIndex() 
> c.join(c).filter(t => t._2._1 != t._2._2).take(3)
> {code}
>  => 
> {code}
>  Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), 
> (36579712,(13,14)))
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3037) Add ArrayType containing null value support to Parquet.

2014-08-25 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3037:


Assignee: Takuya Ueshin

> Add ArrayType containing null value support to Parquet.
> ---
>
> Key: SPARK-3037
> URL: https://issues.apache.org/jira/browse/SPARK-3037
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Blocker
>
> Parquet support should handle {{ArrayType}} when {{containsNull}} is {{true}}.
> When {{containsNull}} is {{true}}, the schema should be as follows:
> {noformat}
> message root {
>   optional group a (LIST) {
> repeated group bag {
>   optional int32 array_element;
> }
>   }
> }
> {noformat}
> FYI:
> Hive's Parquet writer *always* uses this schema, and its reader can only read 
> from this schema, i.e. the current Parquet support in Spark SQL is not compatible 
> with Hive.
> NOTICE:
> If Hive compatibility is top priority, we also have to use this schema 
> regardless of {{containsNull}}, which will break backward compatibility.
> But using this schema could affect performance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3036) Add MapType containing null value support to Parquet.

2014-08-25 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3036:


Assignee: Takuya Ueshin

> Add MapType containing null value support to Parquet.
> -
>
> Key: SPARK-3036
> URL: https://issues.apache.org/jira/browse/SPARK-3036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Blocker
>
> Current Parquet schema for {{MapType}} is as follows regardless of 
> {{valueContainsNull}}:
> {noformat}
> message root {
>   optional group a (MAP) {
> repeated group map (MAP_KEY_VALUE) {
>   required int32 key;
>   required int32 value;
> }
>   }
> }
> {noformat}
> and if the map contains a {{null}} value, it throws a runtime exception.
> To handle {{MapType}} containing {{null}} value, the schema should be as 
> follows if {{valueContainsNull}} is {{true}}:
> {noformat}
> message root {
>   optional group a (MAP) {
> repeated group map (MAP_KEY_VALUE) {
>   required int32 key;
>   optional int32 value;
> }
>   }
> }
> {noformat}
> FYI:
> Hive's Parquet writer *always* uses the latter schema, but its reader can read 
> from both schemas.
> NOTICE:
> This change will break backward compatibility when the schema is read from 
> Parquet metadata ({{"org.apache.spark.sql.parquet.row.metadata"}}).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2636) no where to get job identifier while submit spark job through spark API

2014-08-25 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110172#comment-14110172
 ] 

Rui Li commented on SPARK-2636:
---

Just want to make sure I understand everything correctly:

I think the user submits a job via an RDD action, which in turn calls 
{{SparkContext.runJob -> DAGScheduler.runJob -> DAGScheduler.submitJob -> 
DAGScheduler.handleJobSubmitted}}. The requirement is that we should return some job 
ID to the user. So I think putting that in a DAGScheduler method doesn't help? 
BTW, {{DAGScheduler.submitJob}} returns a {{JobWaiter}} which contains the job 
ID.

Also, by "job ID", do we mean {{org.apache.spark.streaming.scheduler.Job.id}} 
or {{org.apache.spark.scheduler.ActiveJob.jobId}}?

Please let me know if I misunderstand anything.
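
For what it's worth, a small sketch of the listener side (assuming an existing SparkContext {{sc}}; the printlns just stand in for whatever bookkeeping Hive would do):

{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerJobEnd}

// Records the job IDs that the listener bus reports for submitted jobs.
class JobIdListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart) {
    println("job started: " + jobStart.jobId)
  }
  override def onJobEnd(jobEnd: SparkListenerJobEnd) {
    println("job ended: " + jobEnd.jobId + ", result: " + jobEnd.jobResult)
  }
}

sc.addSparkListener(new JobIdListener)
{code}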

> no where to get job identifier while submit spark job through spark API
> ---
>
> Key: SPARK-2636
> URL: https://issues.apache.org/jira/browse/SPARK-2636
> Project: Spark
>  Issue Type: New Feature
>  Components: Java API
>Reporter: Chengxiang Li
>  Labels: hive
>
> In Hive on Spark, we want to track spark job status through Spark API, the 
> basic idea is as following:
> # create an hive-specified spark listener and register it to spark listener 
> bus.
> # hive-specified spark listener generate job status by spark listener events.
> # hive driver track job status through hive-specified spark listener. 
> the current problem is that hive driver need job identifier to track 
> specified job status through spark listener, but there is no spark API to get 
> job identifier(like job id) while submit spark job.
> I think other project whoever try to track job status with spark API would 
> suffer from this as well.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3221) Support JRuby as a language for using Spark

2014-08-25 Thread Rasik Pandey (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110163#comment-14110163
 ] 

Rasik Pandey commented on SPARK-3221:
-

Currently this isn't possible due to closure and object serialization 
limitations, but since JRuby is a JVM language that has closures, it should be 
possible. Spark would have to be updated to support serialization/deserialization 
(or marshalling/unmarshalling) of JRuby objects that aren't necessarily backed by 
class files. For example, the current ClosureCleaner code expects to resolve 
actual class files, yet in JRuby class files don't always exist for objects.

> Support JRuby as a language for using Spark
> ---
>
> Key: SPARK-3221
> URL: https://issues.apache.org/jira/browse/SPARK-3221
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: Rasik Pandey
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3221) Support JRuby as a language for using Spark

2014-08-25 Thread Rasik Pandey (JIRA)
Rasik Pandey created SPARK-3221:
---

 Summary: Support JRuby as a language for using Spark
 Key: SPARK-3221
 URL: https://issues.apache.org/jira/browse/SPARK-3221
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Rasik Pandey






--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2839) Documentation for statistical functions

2014-08-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110154#comment-14110154
 ] 

Apache Spark commented on SPARK-2839:
-

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/2123

> Documentation for statistical functions
> ---
>
> Key: SPARK-2839
> URL: https://issues.apache.org/jira/browse/SPARK-2839
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Reporter: Xiangrui Meng
>Assignee: Burak Yavuz
>
> Add documentation and code examples for statistical functions to MLlib's 
> programming guide.
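
As a flavour of what the guide could cover, a small example against the {{Statistics}} API (a sketch only, assuming an existing SparkContext {{sc}}):

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0),
  Vectors.dense(2.0, 20.0),
  Vectors.dense(3.0, 30.0)))

// Column-wise summary statistics.
val summary = Statistics.colStats(observations)
println(summary.mean)       // mean of each column
println(summary.variance)   // variance of each column
println(summary.numNonzeros)
{code}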



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3213) spark_ec2.py cannot find slave instances launched with "Launch More Like This"

2014-08-25 Thread Vida Ha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110149#comment-14110149
 ] 

Vida Ha edited comment on SPARK-3213 at 8/26/14 1:51 AM:
-

Hi Joseph,

Can you tell me more about how you launched these, without copying the tags?  I 
used "Launch More Like This", and the name and tags were copied over correctly 
- see my screenshot above.  I'm wondering whether, when you were using EC2, you 
could have been so unlucky as to have triggered a temporary outage in copying 
tags...

Let's sync up in person tomorrow and figure out if this was a one time problem 
or happens each time "Launch More Like This" is used or perhaps if we used 
different ways to launch more slaves.






was (Author: vidaha):
Hi Joseph,

Can you tell me more about how you launched these, without copying the tags?  I 
used "Launch More Like This", and the name and tags were copied over correctly. 
 I'm wondering whether, when you were using EC2, you could have been so unlucky 
as to have triggered a temporary outage in copying tags...

Let's sync up in person tomorrow and figure out if this was a one time problem 
or happens each time "Launch More Like This" is used.





> spark_ec2.py cannot find slave instances launched with "Launch More Like This"
> --
>
> Key: SPARK-3213
> URL: https://issues.apache.org/jira/browse/SPARK-3213
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
> Attachments: Screen Shot 2014-08-25 at 6.45.35 PM.png
>
>
> spark_ec2.py cannot find all slave instances.  In particular:
> * I created a master & slave and configured them.
> * I created new slave instances from the original slave ("Launch More Like 
> This").
> * I tried to relaunch the cluster, and it could only find the original slave.
> Old versions of the script worked.  The latest working commit which edited 
> that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1
> There may be a problem with this PR: 
> [https://github.com/apache/spark/pull/1899].



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3213) spark_ec2.py cannot find slave instances launched with "Launch More Like This"

2014-08-25 Thread Vida Ha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vida Ha updated SPARK-3213:
---

Attachment: Screen Shot 2014-08-25 at 6.45.35 PM.png

> spark_ec2.py cannot find slave instances launched with "Launch More Like This"
> --
>
> Key: SPARK-3213
> URL: https://issues.apache.org/jira/browse/SPARK-3213
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
> Attachments: Screen Shot 2014-08-25 at 6.45.35 PM.png
>
>
> spark_ec2.py cannot find all slave instances.  In particular:
> * I created a master & slave and configured them.
> * I created new slave instances from the original slave ("Launch More Like 
> This").
> * I tried to relaunch the cluster, and it could only find the original slave.
> Old versions of the script worked.  The latest working commit which edited 
> that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1
> There may be a problem with this PR: 
> [https://github.com/apache/spark/pull/1899].



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3213) spark_ec2.py cannot find slave instances launched with "Launch More Like This"

2014-08-25 Thread Vida Ha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110149#comment-14110149
 ] 

Vida Ha commented on SPARK-3213:


Hi Joseph,

Can you tell me more about how you launched these, without copying the tags?  I 
used "Launch More Like This", and the name and tags were copied over correctly. 
 I'm wondering whether, when you were using EC2, you could have been so unlucky 
as to have triggered a temporary outage in copying tags...

Let's sync up in person tomorrow and figure out if this was a one time problem 
or happens each time "Launch 





> spark_ec2.py cannot find slave instances launched with "Launch More Like This"
> --
>
> Key: SPARK-3213
> URL: https://issues.apache.org/jira/browse/SPARK-3213
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> spark_ec2.py cannot find all slave instances.  In particular:
> * I created a master & slave and configured them.
> * I created new slave instances from the original slave ("Launch More Like 
> This").
> * I tried to relaunch the cluster, and it could only find the original slave.
> Old versions of the script worked.  The latest working commit which edited 
> that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1
> There may be a problem with this PR: 
> [https://github.com/apache/spark/pull/1899].



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3213) spark_ec2.py cannot find slave instances launched with "Launch More Like This"

2014-08-25 Thread Vida Ha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110149#comment-14110149
 ] 

Vida Ha edited comment on SPARK-3213 at 8/26/14 1:49 AM:
-

Hi Joseph,

Can you tell me more about how you launched these, without copying the tags?  I 
used "Launch More Like This", and the name and tags were copied over correctly. 
 I'm wondering whether, when you were using EC2, you could have been so unlucky 
as to have triggered a temporary outage in copying tags...

Let's sync up in person tomorrow and figure out if this was a one time problem 
or happens each time "Launch More Like This" is used.






was (Author: vidaha):
Hi Joseph,

Can you tell me more about how you launched these, without copying the tags?  I 
used "Launch More Like This", and the name and tags were copied over correctly. 
 I'm wondering whether, when you were using EC2, you could have been so unlucky 
as to have triggered a temporary outage in copying tags...

Let's sync up in person tomorrow and figure out if this was a one time problem 
or happens each time "Launch 





> spark_ec2.py cannot find slave instances launched with "Launch More Like This"
> --
>
> Key: SPARK-3213
> URL: https://issues.apache.org/jira/browse/SPARK-3213
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> spark_ec2.py cannot find all slave instances.  In particular:
> * I created a master & slave and configured them.
> * I created new slave instances from the original slave ("Launch More Like 
> This").
> * I tried to relaunch the cluster, and it could only find the original slave.
> Old versions of the script worked.  The latest working commit which edited 
> that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1
> There may be a problem with this PR: 
> [https://github.com/apache/spark/pull/1899].



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build

2014-08-25 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110142#comment-14110142
 ] 

Cheng Lian commented on SPARK-3217:
---

[~vanzin] Thanks, I did set {{SPARK_PREPEND_CLASSES}}. Will change the title 
and description of this issue after verifying it.

> Shaded Guava jar doesn't play well with Maven build
> ---
>
> Key: SPARK-3217
> URL: https://issues.apache.org/jira/browse/SPARK-3217
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.2.0
>Reporter: Cheng Lian
>Priority: Blocker
>
> PR [#1813|https://github.com/apache/spark/pull/1813] shaded Guava jar file 
> and moved Guava classes to package {{org.spark-project.guava}} when Spark is 
> built by Maven. But code in {{org.apache.spark.util.Utils}} still refers to 
> classes (e.g. {{ThreadFactoryBuilder}}) in package {{com.google.common}}.
> The result is that, when Spark is built with Maven (or 
> {{make-distribution.sh}}), commands like {{bin/spark-shell}} throw 
> {{ClassNotFoundException}}:
> {code}
> # Build Spark with Maven
> $ mvn clean package -Phive,hadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests
> ...
> # Then spark-shell complains
> $ ./bin/spark-shell
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> com/google/common/util/concurrent/ThreadFactoryBuilder
> at org.apache.spark.util.Utils$.(Utils.scala:636)
> at org.apache.spark.util.Utils$.(Utils.scala)
> at org.apache.spark.repl.SparkILoop.(SparkILoop.scala:134)
> at org.apache.spark.repl.SparkILoop.(SparkILoop.scala:65)
> at org.apache.spark.repl.Main$.main(Main.scala:30)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:317)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: 
> com.google.common.util.concurrent.ThreadFactoryBuilder
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 13 more
> # Check the assembly jar file
> $ jar tf 
> assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar | 
> grep -i ThreadFactoryBuilder
> org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder$1.class
> org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder.class
> {code}
> SBT build is fine since we don't shade Guava with SBT right now (and that's 
> why Jenkins didn't complain about this).
> Possible solutions can be:
> # revert PR #1813 to be safe, or
> # also shade Guava in SBT build and only use {{org.spark-project.guava}} in 
> Spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3220) K-Means clusterer should perform K-Means initialization in parallel

2014-08-25 Thread Derrick Burns (JIRA)
Derrick Burns created SPARK-3220:


 Summary: K-Means clusterer should perform K-Means initialization 
in parallel
 Key: SPARK-3220
 URL: https://issues.apache.org/jira/browse/SPARK-3220
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns


The LocalKMeans method should be replaced with a parallel implementation.  As 
it stands now, it becomes a bottleneck for large data sets. 

I have implemented this functionality in my version of the clusterer.  However, 
I see that there are hundreds of outstanding pull requests.  If someone on the 
team wants to sponsor the pull request, I will create one.  Otherwise, I will 
just maintain my own private fork of the clusterer.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2921) Mesos doesn't handle spark.executor.extraJavaOptions correctly (among other things)

2014-08-25 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110127#comment-14110127
 ] 

Cheng Lian commented on SPARK-2921:
---

[~andrewor14] {{spark.executor.extraLibraryPath}} is affected. But 
{{spark.executor.extraClassPath}} should be OK since it's finally added to the 
environment variable {{SPARK_CLASSPATH}}. 

> Mesos doesn't handle spark.executor.extraJavaOptions correctly (among other 
> things)
> ---
>
> Key: SPARK-2921
> URL: https://issues.apache.org/jira/browse/SPARK-2921
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.0.2
>Reporter: Andrew Or
>Priority: Blocker
> Fix For: 1.1.0
>
>
> The code path to handle this exists only for the coarse grained mode, and 
> even in this mode the java options aren't passed to the executors properly. 
> We currently pass the entire value of spark.executor.extraJavaOptions to the 
> executors as a string without splitting it. We need to use 
> Utils.splitCommandString as in standalone mode.
> I have not confirmed this, but I would assume spark.executor.extraClassPath 
> and spark.executor.extraLibraryPath are also not propagated correctly in 
> either mode.
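
A rough illustration of why the splitting matters ({{Utils.splitCommandString}} is Spark-internal; the naive whitespace split below is only a stand-in for it and ignores quoting):

{code}
val extraJavaOptions = "-XX:+UseG1GC -verbose:gc"

// Passed as one string, the executor launch command sees a single bogus argument.
val unsplit = Seq("java", extraJavaOptions)

// Split first, each flag reaches the JVM separately.
val split = Seq("java") ++ extraJavaOptions.split("\\s+")

println(unsplit)  // List(java, -XX:+UseG1GC -verbose:gc)
println(split)    // List(java, -XX:+UseG1GC, -verbose:gc)
{code}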



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3219) K-Means clusterer should support Bregman distance metrics

2014-08-25 Thread Derrick Burns (JIRA)
Derrick Burns created SPARK-3219:


 Summary: K-Means clusterer should support Bregman distance metrics
 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns


The K-Means clusterer supports the Euclidean distance metric.  However, it is 
rather straightforward to support Bregman 
(http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
distance functions which would increase the utility of the clusterer 
tremendously.

I have modified the clusterer to support pluggable distance functions.  
However, I notice that there are hundreds of outstanding pull requests.  If 
someone is willing to work with me to sponsor the work through the process, I 
will create a pull request.  Otherwise, I will just keep my own fork.
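
To make the idea concrete, a hypothetical sketch of what a pluggable divergence could look like (this is not MLlib's actual API, just an illustration):

{code}
import org.apache.spark.mllib.linalg.Vector

trait Divergence extends Serializable {
  def distance(x: Vector, y: Vector): Double
}

// Ordinary squared Euclidean distance, the case MLlib supports today.
object SquaredEuclidean extends Divergence {
  def distance(x: Vector, y: Vector): Double =
    x.toArray.zip(y.toArray).map { case (a, b) => (a - b) * (a - b) }.sum
}

// Kullback-Leibler divergence, a Bregman divergence; assumes strictly positive
// components that sum to one.
object KullbackLeibler extends Divergence {
  def distance(x: Vector, y: Vector): Double =
    x.toArray.zip(y.toArray).map { case (p, q) => p * math.log(p / q) }.sum
}
{code}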



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3218) K-Means clusterer can fail on degenerate data

2014-08-25 Thread Derrick Burns (JIRA)
Derrick Burns created SPARK-3218:


 Summary: K-Means clusterer can fail on degenerate data
 Key: SPARK-3218
 URL: https://issues.apache.org/jira/browse/SPARK-3218
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Derrick Burns


The KMeans parallel implementation selects points to be cluster centers with 
probability weighted by their distance to cluster centers.  However, if there 
are fewer than k DISTINCT points in the data set, this approach will fail.  

Further, the recent checkin to work around this problem results in selection of 
the same point repeatedly as a cluster center. 

The fix is to allow fewer than k cluster centers to be selected.  This requires 
several changes to the code, as the number of cluster centers is woven into the 
implementation.

I have a version of the code that addresses this problem, AND generalizes the 
distance metric.  However, I see that there are literally hundreds of 
outstanding pull requests.  If someone will commit to working with me to 
sponsor the pull request, I will create it.
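
As an illustration of the degenerate input described above (a sketch only, assuming an existing SparkContext {{sc}}):

{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// 100 copies of a single point, but 5 cluster centers requested: only one
// distinct point exists, so fewer than k meaningful centers can be chosen.
val data = sc.parallelize(Seq.fill(100)(Vectors.dense(1.0, 1.0)))
val model = KMeans.train(data, 5, 10)   // k = 5, maxIterations = 10
println(model.clusterCenters.toSeq)
{code}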




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-3193) output errer info when Process exitcode not zero

2014-08-25 Thread wangfei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangfei reopened SPARK-3193:



> output errer info when Process exitcode not zero
> 
>
> Key: SPARK-3193
> URL: https://issues.apache.org/jira/browse/SPARK-3193
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: wangfei
>
> I noticed that sometimes PR tests fail because the Process exit code != 0:
> DriverSuite: 
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath 
> - driver should exit after finishing *** FAILED *** 
>SparkException was thrown during property evaluation. 
> (DriverSuite.scala:40) 
>  Message: Process List(./bin/spark-class, 
> org.apache.spark.DriverWithoutCleanup, local) exited with code 1 
>  Occurred at table row 0 (zero based, not counting headings), which had 
> values ( 
>master = local 
>  ) 
>  
> [info] SparkSubmitSuite:
> [info] - prints usage on empty input
> [info] - prints usage with only --help
> [info] - prints error with unrecognized options
> [info] - handle binary specified but not class
> [info] - handles arguments with --key=val
> [info] - handles arguments to user program
> [info] - handles arguments to user program with name collision
> [info] - handles YARN cluster mode
> [info] - handles YARN client mode
> [info] - handles standalone cluster mode
> [info] - handles standalone client mode
> [info] - handles mesos client mode
> [info] - handles confs with flag equivalents
> [info] - launch simple application with spark-submit *** FAILED ***
> [info]   org.apache.spark.SparkException: Process List(./bin/spark-submit, 
> --class, org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, 
> --master, local, file:/tmp/1408854098404-0/testJar-1408854098404.jar) exited 
> with code 1
> [info]   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:872)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291)
> [info]   at 
> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
> [info]   at org.apacSpark assembly has been built with Hive, including 
> Datanucleus jars on classpath
> refer to 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18688/consoleFull
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19118/consoleFull
> we should output the process error info when it fails; this can be helpful for 
> diagnosis.
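
A minimal sketch of the behaviour being asked for, using plain {{scala.sys.process}} rather than Spark's internal {{Utils.executeAndGetOutput}}:

{code}
import scala.sys.process._

def runAndReport(cmd: Seq[String]): String = {
  val out = new StringBuilder
  val err = new StringBuilder
  val logger = ProcessLogger(
    line => out.append(line).append('\n'),   // capture stdout
    line => err.append(line).append('\n'))   // capture stderr
  val exitCode = Process(cmd).!(logger)
  if (exitCode != 0) {
    // Surface stderr in the failure message instead of only the exit code.
    throw new RuntimeException(
      "Process " + cmd.mkString(" ") + " exited with code " + exitCode + ":\n" + err)
  }
  out.toString
}
{code}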



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3178) setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the worker memory limit to zero

2014-08-25 Thread Helena Edelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110116#comment-14110116
 ] 

Helena Edelson commented on SPARK-3178:
---

+1. It doesn't look like the input is validated to fail fast when the m/g suffix is 
missing.
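
Something along these lines would fail fast (a hypothetical helper for illustration, not the current Spark code):

{code}
// Accept values like "512m" or "2g"; reject anything without a recognised suffix
// instead of silently falling back to zero.
def parseWorkerMemoryMb(s: String): Int = {
  val v = s.trim.toLowerCase
  if (v.endsWith("g")) v.dropRight(1).toInt * 1024
  else if (v.endsWith("m")) v.dropRight(1).toInt
  else throw new IllegalArgumentException(
    "SPARK_WORKER_MEMORY must end in 'm' or 'g', got: " + s)
}

// parseWorkerMemoryMb("2g")    // 2048
// parseWorkerMemoryMb("4096")  // throws instead of starting a worker with 0 MB
{code}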

> setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the 
> worker memory limit to zero
> 
>
> Key: SPARK-3178
> URL: https://issues.apache.org/jira/browse/SPARK-3178
> Project: Spark
>  Issue Type: Bug
> Environment: osx
>Reporter: Jon Haddad
>
> This should either default to m or just completely fail.  Starting a worker 
> with zero memory isn't very helpful.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build

2014-08-25 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3217:
---

Affects Version/s: 1.2.0

> Shaded Guava jar doesn't play well with Maven build
> ---
>
> Key: SPARK-3217
> URL: https://issues.apache.org/jira/browse/SPARK-3217
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.2.0
>Reporter: Cheng Lian
>Priority: Blocker
>
> PR [#1813|https://github.com/apache/spark/pull/1813] shaded Guava jar file 
> and moved Guava classes to package {{org.spark-project.guava}} when Spark is 
> built by Maven. But code in {{org.apache.spark.util.Utils}} still refers to 
> classes (e.g. {{ThreadFactoryBuilder}}) in package {{com.google.common}}.
> The result is that, when Spark is built with Maven (or 
> {{make-distribution.sh}}), commands like {{bin/spark-shell}} throw 
> {{ClassNotFoundException}}:
> {code}
> # Build Spark with Maven
> $ mvn clean package -Phive,hadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests
> ...
> # Then spark-shell complains
> $ ./bin/spark-shell
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> com/google/common/util/concurrent/ThreadFactoryBuilder
> at org.apache.spark.util.Utils$.(Utils.scala:636)
> at org.apache.spark.util.Utils$.(Utils.scala)
> at org.apache.spark.repl.SparkILoop.(SparkILoop.scala:134)
> at org.apache.spark.repl.SparkILoop.(SparkILoop.scala:65)
> at org.apache.spark.repl.Main$.main(Main.scala:30)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:317)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: 
> com.google.common.util.concurrent.ThreadFactoryBuilder
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 13 more
> # Check the assembly jar file
> $ jar tf 
> assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar | 
> grep -i ThreadFactoryBuilder
> org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder$1.class
> org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder.class
> {code}
> SBT build is fine since we don't shade Guava with SBT right now (and that's 
> why Jenkins didn't complain about this).
> Possible solutions can be:
> # revert PR #1813 to be safe, or
> # also shade Guava in SBT build and only use {{org.spark-project.guava}} in 
> Spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build

2014-08-25 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3217:
---

Labels:   (was: 1.2.0)

> Shaded Guava jar doesn't play well with Maven build
> ---
>
> Key: SPARK-3217
> URL: https://issues.apache.org/jira/browse/SPARK-3217
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.2.0
>Reporter: Cheng Lian
>Priority: Blocker
>
> PR [#1813|https://github.com/apache/spark/pull/1813] shaded Guava jar file 
> and moved Guava classes to package {{org.spark-project.guava}} when Spark is 
> built by Maven. But code in {{org.apache.spark.util.Utils}} still refers to 
> classes (e.g. {{ThreadFactoryBuilder}}) in package {{com.google.common}}.
> The result is that, when Spark is built with Maven (or 
> {{make-distribution.sh}}), commands like {{bin/spark-shell}} throw 
> {{ClassNotFoundException}}:
> {code}
> # Build Spark with Maven
> $ mvn clean package -Phive,hadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests
> ...
> # Then spark-shell complains
> $ ./bin/spark-shell
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> com/google/common/util/concurrent/ThreadFactoryBuilder
> at org.apache.spark.util.Utils$.(Utils.scala:636)
> at org.apache.spark.util.Utils$.(Utils.scala)
> at org.apache.spark.repl.SparkILoop.(SparkILoop.scala:134)
> at org.apache.spark.repl.SparkILoop.(SparkILoop.scala:65)
> at org.apache.spark.repl.Main$.main(Main.scala:30)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:317)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: 
> com.google.common.util.concurrent.ThreadFactoryBuilder
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 13 more
> # Check the assembly jar file
> $ jar tf 
> assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar | 
> grep -i ThreadFactoryBuilder
> org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder$1.class
> org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder.class
> {code}
> SBT build is fine since we don't shade Guava with SBT right now (and that's 
> why Jenkins didn't complain about this).
> Possible solutions can be:
> # revert PR #1813 to be safe, or
> # also shade Guava in SBT build and only use {{org.spark-project.guava}} in 
> Spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3058) Support EXTENDED for EXPLAIN command

2014-08-25 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3058.
-

   Resolution: Fixed
Fix Version/s: 1.1.0

> Support EXTENDED for EXPLAIN command
> 
>
> Key: SPARK-3058
> URL: https://issues.apache.org/jira/browse/SPARK-3058
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Minor
> Fix For: 1.1.0
>
>
> Currently, there is no difference when running the "EXPLAIN" command with or without 
> the "EXTENDED" keyword; this patch will show more details of the query plan when the 
> "EXTENDED" keyword is provided.
> {panel:title=EXPLAIN with EXTENDED}
> explain extended select key as a1, value as a2 from src where key=1;
> == Parsed Logical Plan ==
> Project ['key AS a1#3,'value AS a2#4]
>  Filter ('key = 1)
>   UnresolvedRelation None, src, None
> == Analyzed Logical Plan ==
> Project [key#8 AS a1#3,value#9 AS a2#4]
>  Filter (CAST(key#8, DoubleType) = CAST(1, DoubleType))
>   MetastoreRelation default, src, None
> == Optimized Logical Plan ==
> Project [key#8 AS a1#3,value#9 AS a2#4]
>  Filter (CAST(key#8, DoubleType) = 1.0)
>   MetastoreRelation default, src, None
> == Physical Plan ==
> Project [key#8 AS a1#3,value#9 AS a2#4]
>  Filter (CAST(key#8, DoubleType) = 1.0)
>   HiveTableScan [key#8,value#9], (MetastoreRelation default, src, None), None
> Code Generation: false
> == RDD ==
> (2) MappedRDD[14] at map at HiveContext.scala:350
>   MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:42
>   MapPartitionsRDD[12] at mapPartitions at basicOperators.scala:57
>   MapPartitionsRDD[11] at mapPartitions at TableReader.scala:112
>   MappedRDD[10] at map at TableReader.scala:240
>   HadoopRDD[9] at HadoopRDD at TableReader.scala:230
> {panel}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build

2014-08-25 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-3217:
--

Target Version/s: 1.2.0  (was: 1.1.0)

> Shaded Guava jar doesn't play well with Maven build
> ---
>
> Key: SPARK-3217
> URL: https://issues.apache.org/jira/browse/SPARK-3217
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Cheng Lian
>Priority: Blocker
>  Labels: 1.2.0
>
> PR [#1813|https://github.com/apache/spark/pull/1813] shaded Guava jar file 
> and moved Guava classes to package {{org.spark-project.guava}} when Spark is 
> built by Maven. But code in {{org.apache.spark.util.Utils}} still refers to 
> classes (e.g. {{ThreadFactoryBuilder}}) in package {{com.google.common}}.
> The result is that, when Spark is built with Maven (or 
> {{make-distribution.sh}}), commands like {{bin/spark-shell}} throw 
> {{ClassNotFoundException}}:
> {code}
> # Build Spark with Maven
> $ mvn clean package -Phive,hadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests
> ...
> # Then spark-shell complains
> $ ./bin/spark-shell
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> com/google/common/util/concurrent/ThreadFactoryBuilder
> at org.apache.spark.util.Utils$.(Utils.scala:636)
> at org.apache.spark.util.Utils$.(Utils.scala)
> at org.apache.spark.repl.SparkILoop.(SparkILoop.scala:134)
> at org.apache.spark.repl.SparkILoop.(SparkILoop.scala:65)
> at org.apache.spark.repl.Main$.main(Main.scala:30)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:317)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: 
> com.google.common.util.concurrent.ThreadFactoryBuilder
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 13 more
> # Check the assembly jar file
> $ jar tf 
> assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar | 
> grep -i ThreadFactoryBuilder
> org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder$1.class
> org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder.class
> {code}
> SBT build is fine since we don't shade Guava with SBT right now (and that's 
> why Jenkins didn't complain about this).
> Possible solutions can be:
> # revert PR #1813 to be safe, or
> # also shade Guava in SBT build and only use {{org.spark-project.guava}} in 
> Spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build

2014-08-25 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-3217:
--

Labels: 1.2.0  (was: )

> Shaded Guava jar doesn't play well with Maven build
> ---
>
> Key: SPARK-3217
> URL: https://issues.apache.org/jira/browse/SPARK-3217
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Cheng Lian
>Priority: Blocker
>  Labels: 1.2.0
>
> PR [#1813|https://github.com/apache/spark/pull/1813] shaded Guava jar file 
> and moved Guava classes to package {{org.spark-project.guava}} when Spark is 
> built by Maven. But code in {{org.apache.spark.util.Utils}} still refers to 
> classes (e.g. {{ThreadFactoryBuilder}}) in package {{com.google.common}}.
> The result is that, when Spark is built with Maven (or 
> {{make-distribution.sh}}), commands like {{bin/spark-shell}} throw 
> {{ClassNotFoundException}}:
> {code}
> # Build Spark with Maven
> $ mvn clean package -Phive,hadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests
> ...
> # Then spark-shell complains
> $ ./bin/spark-shell
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> com/google/common/util/concurrent/ThreadFactoryBuilder
> at org.apache.spark.util.Utils$.(Utils.scala:636)
> at org.apache.spark.util.Utils$.(Utils.scala)
> at org.apache.spark.repl.SparkILoop.(SparkILoop.scala:134)
> at org.apache.spark.repl.SparkILoop.(SparkILoop.scala:65)
> at org.apache.spark.repl.Main$.main(Main.scala:30)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:317)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: 
> com.google.common.util.concurrent.ThreadFactoryBuilder
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 13 more
> # Check the assembly jar file
> $ jar tf 
> assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar | 
> grep -i ThreadFactoryBuilder
> org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder$1.class
> org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder.class
> {code}
> SBT build is fine since we don't shade Guava with SBT right now (and that's 
> why Jenkins didn't complain about this).
> Possible solutions can be:
> # revert PR #1813 to be safe, or
> # also shade Guava in SBT build and only use {{org.spark-project.guava}} in 
> Spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build

2014-08-25 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-3217:
--

Affects Version/s: (was: 1.0.2)

> Shaded Guava jar doesn't play well with Maven build
> ---
>
> Key: SPARK-3217
> URL: https://issues.apache.org/jira/browse/SPARK-3217
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Cheng Lian
>Priority: Blocker
>
> PR [#1813|https://github.com/apache/spark/pull/1813] shaded Guava jar file 
> and moved Guava classes to package {{org.spark-project.guava}} when Spark is 
> built by Maven. But code in {{org.apache.spark.util.Utils}} still refers to 
> classes (e.g. {{ThreadFactoryBuilder}}) in package {{com.google.common}}.
> The result is that, when Spark is built with Maven (or 
> {{make-distribution.sh}}), commands like {{bin/spark-shell}} throw 
> {{ClassNotFoundException}}:
> {code}
> # Build Spark with Maven
> $ mvn clean package -Phive,hadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests
> ...
> # Then spark-shell complains
> $ ./bin/spark-shell
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> com/google/common/util/concurrent/ThreadFactoryBuilder
> at org.apache.spark.util.Utils$.(Utils.scala:636)
> at org.apache.spark.util.Utils$.(Utils.scala)
> at org.apache.spark.repl.SparkILoop.(SparkILoop.scala:134)
> at org.apache.spark.repl.SparkILoop.(SparkILoop.scala:65)
> at org.apache.spark.repl.Main$.main(Main.scala:30)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:317)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: 
> com.google.common.util.concurrent.ThreadFactoryBuilder
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 13 more
> # Check the assembly jar file
> $ jar tf 
> assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar | 
> grep -i ThreadFactoryBuilder
> org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder$1.class
> org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder.class
> {code}
> SBT build is fine since we don't shade Guava with SBT right now (and that's 
> why Jenkins didn't complain about this).
> Possible solutions can be:
> # revert PR #1813 to be safe, or
> # also shade Guava in SBT build and only use {{org.spark-project.guava}} in 
> Spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build

2014-08-25 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110035#comment-14110035
 ] 

Marcelo Vanzin commented on SPARK-3217:
---

Just did a "git clean -dfx" on master and rebuilt using maven. This works fine 
for me.

Did you by any chance do one of the following:
- forget to "clean" after pulling that change
- mix sbt and mvn built artifacts in the same build
- set SPARK_PREPEND_CLASSES

I can see any of those causing this issue. I think only the last one is 
something we need to worry about; we now need to figure out a way to add the 
guava jar to the classpath when using that option.

> Shaded Guava jar doesn't play well with Maven build
> ---
>
> Key: SPARK-3217
> URL: https://issues.apache.org/jira/browse/SPARK-3217
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Cheng Lian
>Priority: Blocker
>
> PR [#1813|https://github.com/apache/spark/pull/1813] shaded Guava jar file 
> and moved Guava classes to package {{org.spark-project.guava}} when Spark is 
> built by Maven. But code in {{org.apache.spark.util.Utils}} still refers to 
> classes (e.g. {{ThreadFactoryBuilder}}) in package {{com.google.common}}.
> The result is that, when Spark is built with Maven (or 
> {{make-distribution.sh}}), commands like {{bin/spark-shell}} throw 
> {{ClassNotFoundException}}:
> {code}
> # Build Spark with Maven
> $ mvn clean package -Phive,hadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests
> ...
> # Then spark-shell complains
> $ ./bin/spark-shell
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> com/google/common/util/concurrent/ThreadFactoryBuilder
> at org.apache.spark.util.Utils$.(Utils.scala:636)
> at org.apache.spark.util.Utils$.(Utils.scala)
> at org.apache.spark.repl.SparkILoop.(SparkILoop.scala:134)
> at org.apache.spark.repl.SparkILoop.(SparkILoop.scala:65)
> at org.apache.spark.repl.Main$.main(Main.scala:30)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:317)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: 
> com.google.common.util.concurrent.ThreadFactoryBuilder
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 13 more
> # Check the assembly jar file
> $ jar tf 
> assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar | 
> grep -i ThreadFactoryBuilder
> org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder$1.class
> org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder.class
> {code}
> SBT build is fine since we don't shade Guava with SBT right now (and that's 
> why Jenkins didn't complain about this).
> Possible solutions can be:
> # revert PR #1813 to be safe, or
> # also shade Guava in SBT build and only use {{org.spark-project.guava}} in 
> Spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2087) Clean Multi-user semantics for thrift JDBC/ODBC server.

2014-08-25 Thread Yi Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110026#comment-14110026
 ] 

Yi Tian commented on SPARK-2087:


You mean the "CACHE TABLE ... AS SELECT..." syntax will create temporary table, 
and could not be found by other session? 
I'm still confusing about the different between temporary table and cached 
tables.
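
Roughly, in the 1.1 API (a sketch only, assuming an existing SparkContext {{sc}}): a temporary table is just a name registered in one SQLContext, while caching pins that table's data in memory:

{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD

case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 17)))

people.registerTempTable("people")   // temporary: visible only through this SQLContext
sqlContext.cacheTable("people")      // cached: the table's data is kept in memory
sqlContext.sql("SELECT name FROM people WHERE age >= 18").collect().foreach(println)
{code}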

> Clean Multi-user semantics for thrift JDBC/ODBC server.
> ---
>
> Key: SPARK-2087
> URL: https://issues.apache.org/jira/browse/SPARK-2087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Zongheng Yang
>Priority: Minor
>
> Configuration and temporary tables should exist per-user.  Cached tables 
> should be shared across users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3061) Maven build fails in Windows OS

2014-08-25 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3061:
--

Affects Version/s: 1.1.0

Maybe we can use a Maven plugin to unzip?  
http://stackoverflow.com/questions/3264064/unpack-zip-in-zip-with-maven

> Maven build fails in Windows OS
> ---
>
> Key: SPARK-3061
> URL: https://issues.apache.org/jira/browse/SPARK-3061
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.2, 1.1.0
> Environment: Windows
>Reporter: Masayoshi TSUZUKI
>Priority: Minor
>
> Maven build fails in Windows OS with this error message.
> {noformat}
> [ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:exec 
> (default) on project spark-core_2.10: Command execution failed. Cannot run 
> program "unzip" (in directory "C:\path\to\gitofspark\python"): CreateProcess 
> error=2, Žw’肳‚ꂽƒtƒ@ƒ -> [Help 1]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3179) Add task OutputMetrics

2014-08-25 Thread Michael Yannakopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109997#comment-14109997
 ] 

Michael Yannakopoulos commented on SPARK-3179:
--

Hi Sandy,

I am willing to help with this issue. I am new to Apache Spark and have made a 
few contributions so far. Under your supervision I can work on this issue.

Thanks,
Michael

> Add task OutputMetrics
> --
>
> Key: SPARK-3179
> URL: https://issues.apache.org/jira/browse/SPARK-3179
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Sandy Ryza
>
> Track the bytes that tasks write to HDFS or other output destinations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2929) Rewrite HiveThriftServer2Suite and CliSuite

2014-08-25 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2929.
-

   Resolution: Fixed
Fix Version/s: 1.1.0

> Rewrite HiveThriftServer2Suite and CliSuite
> ---
>
> Key: SPARK-2929
> URL: https://issues.apache.org/jira/browse/SPARK-2929
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.1, 1.0.2
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 1.1.0
>
>
> {{HiveThriftServer2Suite}} and {{CliSuite}} were inherited from Shark and 
> contain too many hard-coded timeouts and timing assumptions when doing IPC. 
> This makes these tests both flaky and slow.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3204) MaxOf would be foldable if both left and right are foldable.

2014-08-25 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3204.
-

   Resolution: Fixed
Fix Version/s: 1.1.0
 Assignee: Takuya Ueshin

> MaxOf would be foldable if both left and right are foldable.
> 
>
> Key: SPARK-3204
> URL: https://issues.apache.org/jira/browse/SPARK-3204
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 1.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3188) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)

2014-08-25 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-3188:
-

Description: 
Linear least squares estimates assume the errors are normally distributed and 
can behave badly when the errors are heavy-tailed. In practice we encounter 
various types of data. We need to include Robust Regression to employ a fitting 
criterion that is not as vulnerable as least squares.

The Tukey bisquare weight function, also referred to as the biweight function, 
produces an M-estimator that is more resistant to regression outliers than the 
Huber M-estimator (Andersen 2008: 19).



  was:
Linear least square estimates assume the error has normal distribution and can 
behave badly when the errors are heavy-tailed. In practical we get various 
types of data. We need to include Robust Regression to employ a fitting 
criterion that is not as vulnerable as least square.

The Turkey bisquare weight function, also referred to as the biweight function, 
produces an M-estimator that is more resistant to regression outliers than the 
Huber M-estimator (Andersen 2008: 19).




> Add Robust Regression Algorithm with Tukey bisquare weight  function 
> (Biweight Estimates) 
> --
>
> Key: SPARK-3188
> URL: https://issues.apache.org/jira/browse/SPARK-3188
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Fan Jiang
>Priority: Critical
>  Labels: features
> Fix For: 1.1.1, 1.2.0
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Linear least squares estimates assume the errors are normally distributed and 
> can behave badly when the errors are heavy-tailed. In practice we encounter 
> various types of data. We need to include Robust Regression to employ a 
> fitting criterion that is not as vulnerable as least squares.
> The Tukey bisquare weight function, also referred to as the biweight 
> function, produces an M-estimator that is more resistant to regression 
> outliers than the Huber M-estimator (Andersen 2008: 19).
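
For readers unfamiliar with the estimator, a minimal standalone Scala sketch of 
the Tukey bisquare (biweight) weight function follows, assuming standardized 
residuals and the conventional tuning constant c ≈ 4.685; neither assumption is 
prescribed by this issue.

{code}
// Tukey bisquare (biweight) weight: w(u) = (1 - (u/c)^2)^2 for |u| < c, else 0.
// Large residuals get zero weight, which is what makes the resulting
// M-estimator resistant to outliers.
object TukeyBisquareSketch {
  def weight(residual: Double, c: Double = 4.685): Double = {
    val u = residual / c
    if (math.abs(u) >= 1.0) 0.0
    else {
      val t = 1.0 - u * u
      t * t
    }
  }

  def main(args: Array[String]): Unit = {
    Seq(0.0, 1.0, 3.0, 5.0, 10.0).foreach { r =>
      println(f"residual = $r%5.1f  weight = ${weight(r)}%.4f")
    }
  }
}
{code}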



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3188) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)

2014-08-25 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-3188:
-

Summary: Add Robust Regression Algorithm with Tukey bisquare weight  
function (Biweight Estimates)   (was: Add Robust Regression Algorithm with 
Turkey bisquare weight  function (Biweight Estimates) )

> Add Robust Regression Algorithm with Tukey bisquare weight  function 
> (Biweight Estimates) 
> --
>
> Key: SPARK-3188
> URL: https://issues.apache.org/jira/browse/SPARK-3188
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Fan Jiang
>Priority: Critical
>  Labels: features
> Fix For: 1.1.1, 1.2.0
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Linear least squares estimates assume the errors are normally distributed and 
> can behave badly when the errors are heavy-tailed. In practice we encounter 
> various types of data. We need to include Robust Regression to employ a 
> fitting criterion that is not as vulnerable as least squares.
> The Turkey bisquare weight function, also referred to as the biweight 
> function, produces an M-estimator that is more resistant to regression 
> outliers than the Huber M-estimator (Andersen 2008: 19).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3216) Spark-shell is broken for branch-1.0

2014-08-25 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3216:
-

Description: 
This fails when EC2 tries to clone the most recent version of Spark from 
branch-1.0. I marked this a blocker because this is completely broken, but it 
is technically not "blocking" anything.

This was caused by https://github.com/apache/spark/pull/1831, which broke 
spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 was 
only merged into branch-1.1 and master, but not branch-1.0

  was:This fails when EC2 tries to clone the most recent version of Spark from 
branch-1.0. I marked this a blocker because this is completely broken, but it 
is technically not "blocking" anything.


> Spark-shell is broken for branch-1.0
> 
>
> Key: SPARK-3216
> URL: https://issues.apache.org/jira/browse/SPARK-3216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Andrew Or
>Priority: Blocker
>
> This fails when EC2 tries to clone the most recent version of Spark from 
> branch-1.0. I marked this a blocker because this is completely broken, but it 
> is technically not "blocking" anything.
> This was caused by https://github.com/apache/spark/pull/1831, which broke 
> spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 
> was only merged into branch-1.1 and master, but not branch-1.0



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3216) Spark-shell is broken for branch-1.0

2014-08-25 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3216:
-

Description: 
This fails when EC2 tries to clone the most recent version of Spark from 
branch-1.0. This does not actually affect any released distributions, and so I 
did not set the affected/fix/target versions. I marked this a blocker because 
this is completely broken, but it is technically not "blocking" anything.

This was caused by https://github.com/apache/spark/pull/1831, which broke 
spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 was 
only merged into branch-1.1 and master, but not branch-1.0.

  was:
This fails when EC2 tries to clone the most recent version of Spark from 
branch-1.0. I marked this a blocker because this is completely broken, but it 
is technically not "blocking" anything.

This was caused by https://github.com/apache/spark/pull/1831, which broke 
spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 was 
only merged into branch-1.1 and master, but not branch-1.0


> Spark-shell is broken for branch-1.0
> 
>
> Key: SPARK-3216
> URL: https://issues.apache.org/jira/browse/SPARK-3216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Andrew Or
>Priority: Blocker
>
> This fails when EC2 tries to clone the most recent version of Spark from 
> branch-1.0. This does not actually affect any released distributions, and so 
> I did not set the affected/fix/target versions. I marked this a blocker 
> because this is completely broken, but it is technically not "blocking" 
> anything.
> This was caused by https://github.com/apache/spark/pull/1831, which broke 
> spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 
> was only merged into branch-1.1 and master, but not branch-1.0.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3189) Add Robust Regression Algorithm with Turkey bisquare weight function (Biweight Estimates)

2014-08-25 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-3189:
-

Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-3188

> Add Robust Regression Algorithm with Turkey bisquare weight  function 
> (Biweight Estimates) 
> ---
>
> Key: SPARK-3189
> URL: https://issues.apache.org/jira/browse/SPARK-3189
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Fan Jiang
>Priority: Critical
>  Labels: features
> Fix For: 1.1.1, 1.2.0
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Linear least squares estimates assume the errors are normally distributed and 
> can behave badly when the errors are heavy-tailed. In practice we encounter 
> various types of data. We need to include Robust Regression to employ a 
> fitting criterion that is not as vulnerable as least squares.
> The Turkey bisquare weight function, also referred to as the biweight 
> function, produces an M-estimator that is more resistant to regression 
> outliers than the Huber M-estimator (Andersen 2008: 19).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3216) Spark-shell is broken for branch-1.0

2014-08-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109947#comment-14109947
 ] 

Apache Spark commented on SPARK-3216:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/2122

> Spark-shell is broken for branch-1.0
> 
>
> Key: SPARK-3216
> URL: https://issues.apache.org/jira/browse/SPARK-3216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Andrew Or
>Priority: Blocker
>
> This fails when EC2 tries to clone the most recent version of Spark from 
> branch-1.0. I marked this a blocker because this is completely broken, but it 
> is technically not "blocking" anything.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build

2014-08-25 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-3217:
-

 Summary: Shaded Guava jar doesn't play well with Maven build
 Key: SPARK-3217
 URL: https://issues.apache.org/jira/browse/SPARK-3217
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2
Reporter: Cheng Lian
Priority: Blocker


PR [#1813|https://github.com/apache/spark/pull/1813] shaded Guava jar file and 
moved Guava classes to package {{org.spark-project.guava}} when Spark is built 
by Maven. But code in {{org.apache.spark.util.Utils}} still refers to classes 
(e.g. {{ThreadFactoryBuilder}}) in package {{com.google.common}}.

The result is that, when Spark is built with Maven (or 
{{make-distribution.sh}}), commands like {{bin/spark-shell}} throws 
{{ClassNotFoundException}}:
{code}
# Build Spark with Maven
$ mvn clean package -Phive,hadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests
...

# Then spark-shell complains
$ ./bin/spark-shell
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Exception in thread "main" java.lang.NoClassDefFoundError: 
com/google/common/util/concurrent/ThreadFactoryBuilder
at org.apache.spark.util.Utils$.(Utils.scala:636)
at org.apache.spark.util.Utils$.(Utils.scala)
at org.apache.spark.repl.SparkILoop.(SparkILoop.scala:134)
at org.apache.spark.repl.SparkILoop.(SparkILoop.scala:65)
at org.apache.spark.repl.Main$.main(Main.scala:30)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:317)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: 
com.google.common.util.concurrent.ThreadFactoryBuilder
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 13 more

# Check the assembly jar file
$ jar tf 
assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar | grep 
-i ThreadFactoryBuilder
org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder$1.class
org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder.class
{code}
SBT build is fine since we don't shade Guava with SBT right now (and that's why 
Jenkins didn't complain about this).

Possible solutions can be:
> # revert PR #1813 to be safe, or
# also shade Guava in SBT build and only use {{org.spark-project.guava}} in 
Spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3216) Spark-shell is broken for branch-1.0

2014-08-25 Thread Andrew Or (JIRA)
Andrew Or created SPARK-3216:


 Summary: Spark-shell is broken for branch-1.0
 Key: SPARK-3216
 URL: https://issues.apache.org/jira/browse/SPARK-3216
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Andrew Or
Priority: Blocker


This fails when EC2 tries to clone the most recent version of Spark from 
branch-1.0. I marked this a blocker because this is completely broken, but it 
is technically not "blocking" anything.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3215) Add remote interface for SparkContext

2014-08-25 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-3215:
--

Attachment: RemoteSparkContext.pdf

Initial proposal for a remote context interface.

Note that this is not a formal design document, just a high-level proposal, so 
it doesn't go deeply into what APIs would be exposed or anything like that.

> Add remote interface for SparkContext
> -
>
> Key: SPARK-3215
> URL: https://issues.apache.org/jira/browse/SPARK-3215
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>  Labels: hive
> Attachments: RemoteSparkContext.pdf
>
>
> A quick description of the issue: as part of running Hive jobs on top of 
> Spark, it's desirable to have a SparkContext that is running in the 
> background and listening for job requests for a particular user session.
> Running multiple contexts in the same JVM is not a very good solution. Not 
> only does SparkContext currently have issues sharing the same JVM among 
> multiple instances, but it also turns the JVM running the contexts into a 
> huge bottleneck in the system.
> So I'm proposing a solution where we have a SparkContext that is running in a 
> separate process, and listening for requests from the client application via 
> some RPC interface (most probably Akka).
> I'll attach a document shortly with the current proposal. Let's use this bug 
> to discuss the proposal and any other suggestions.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3215) Add remote interface for SparkContext

2014-08-25 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-3215:
-

 Summary: Add remote interface for SparkContext
 Key: SPARK-3215
 URL: https://issues.apache.org/jira/browse/SPARK-3215
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Marcelo Vanzin


A quick description of the issue: as part of running Hive jobs on top of Spark, 
it's desirable to have a SparkContext that is running in the background and 
listening for job requests for a particular user session.

Running multiple contexts in the same JVM is not a very good solution. Not only 
does SparkContext currently have issues sharing the same JVM among multiple 
instances, but it also turns the JVM running the contexts into a huge 
bottleneck in the system.

So I'm proposing a solution where we have a SparkContext that is running in a 
separate process, and listening for requests from the client application via 
some RPC interface (most probably Akka).

I'll attach a document shortly with the current proposal. Let's use this bug to 
discuss the proposal and any other suggestions.
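
To make the shape of the idea concrete, here is a purely illustrative Scala 
sketch of a long-lived SparkContext driven by Akka messages. The message and 
actor names ({{CountRange}}, {{JobResult}}, "remote-context") are made up for 
this example and are not part of the attached proposal; a real client would 
reach the actor over Akka remoting rather than in-process.

{code}
import akka.actor.{Actor, ActorSystem, Props}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical request/response messages for the sketch.
case class CountRange(n: Int)
case class JobResult(count: Long)

// One long-lived SparkContext hosted in its own process, serving job requests
// sent as messages from client applications.
class RemoteContextActor extends Actor {
  private val sc = new SparkContext(
    new SparkConf().setAppName("remote-context-sketch").setMaster("local[2]"))

  def receive = {
    case CountRange(n) =>
      val count = sc.parallelize(1 to n).count()
      sender ! JobResult(count)
  }

  override def postStop(): Unit = sc.stop()
}

object RemoteContextServer {
  def main(args: Array[String]): Unit = {
    val system = ActorSystem("remote-spark-context")
    system.actorOf(Props[RemoteContextActor], name = "remote-context")
    // Clients would look up this actor (over Akka remoting) and send requests.
  }
}
{code}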



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3213) spark_ec2.py cannot find slave instances launched with "Launch More Like This"

2014-08-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3213:
-

Summary: spark_ec2.py cannot find slave instances launched with "Launch 
More Like This"  (was: spark_ec2.py cannot find slave instances)

> spark_ec2.py cannot find slave instances launched with "Launch More Like This"
> --
>
> Key: SPARK-3213
> URL: https://issues.apache.org/jira/browse/SPARK-3213
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> spark_ec2.py cannot find all slave instances.  In particular:
> * I created a master & slave and configured them.
> * I created new slave instances from the original slave ("Launch More Like 
> This").
> * I tried to relaunch the cluster, and it could only find the original slave.
> Old versions of the script worked.  The latest working commit which edited 
> that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1
> There may be a problem with this PR: 
> [https://github.com/apache/spark/pull/1899].



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3213) spark_ec2.py cannot find slave instances

2014-08-25 Thread Vida Ha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109818#comment-14109818
 ] 

Vida Ha edited comment on SPARK-3213 at 8/25/14 9:57 PM:
-

Joseph, Josh, & I discussed in person. 

There is a quick workaround:

1) Use an old version of the spark_ec2 scripts that uses security groups to 
identify the slaves, if using "Launch more like this"

2) Avoid using "Launch more like this"

But now I need to investigate:

If using "launch more like this", it does seem like amazon tries to reuse the 
tags, but I'm wondering if it doesn't like having multiple machines with the 
same "Name" tag.  I will try using a different tag, like "spark-ec2-cluster-id" 
or something like that to identify the machine.  If that tag does copy over, 
then we can properly support "Launch more like this".


was (Author: vidaha):
Joseph, Josh, & I discussed in person. 

There is a quick workarounds:

1) Use an old version of the spark_ec2 scripts that uses security groups to 
identify the slaves, if using "Launch more like this"

But now I need to investigate:

If using "launch more like this", it does seem like amazon tries to reuse the 
tags, but I'm wondering if it doesn't like having multiple machines with the 
same "Name" tag.  I will try using a different tag, like "spark-ec2-cluster-id" 
or something like that to identify the machine.  If that tag does copy over, 
then we can properly support "Launch more like this".

> spark_ec2.py cannot find slave instances
> 
>
> Key: SPARK-3213
> URL: https://issues.apache.org/jira/browse/SPARK-3213
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> spark_ec2.py cannot find all slave instances.  In particular:
> * I created a master & slave and configured them.
> * I created new slave instances from the original slave ("Launch More Like 
> This").
> * I tried to relaunch the cluster, and it could only find the original slave.
> Old versions of the script worked.  The latest working commit which edited 
> that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1
> There may be a problem with this PR: 
> [https://github.com/apache/spark/pull/1899].



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3213) spark_ec2.py cannot find slave instances

2014-08-25 Thread Vida Ha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109828#comment-14109828
 ] 

Vida Ha commented on SPARK-3213:


Can someone rename this issue to:

spark_ec2.py cannot find slave instances launched with "Launch More Like This"

I think that's more indicative of the issue - it's not wider than that.

> spark_ec2.py cannot find slave instances
> 
>
> Key: SPARK-3213
> URL: https://issues.apache.org/jira/browse/SPARK-3213
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> spark_ec2.py cannot find all slave instances.  In particular:
> * I created a master & slave and configured them.
> * I created new slave instances from the original slave ("Launch More Like 
> This").
> * I tried to relaunch the cluster, and it could only find the original slave.
> Old versions of the script worked.  The latest working commit which edited 
> that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1
> There may be a problem with this PR: 
> [https://github.com/apache/spark/pull/1899].



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3213) spark_ec2.py cannot find slave instances

2014-08-25 Thread Vida Ha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109818#comment-14109818
 ] 

Vida Ha commented on SPARK-3213:


Joseph, Josh, & I discussed in person. 

There is a quick workaround:

1) Use an old version of the spark_ec2 scripts that uses security groups to 
identify the slaves, if using "Launch more like this"

But now I need to investigate:

If using "launch more like this", it does seem like amazon tries to reuse the 
tags, but I'm wondering if it doesn't like having multiple machines with the 
same "Name" tag.  I will try using a different tag, like "spark-ec2-cluster-id" 
or something like that to identify the machine.  If that tag does copy over, 
then we can properly support "Launch more like this".

> spark_ec2.py cannot find slave instances
> 
>
> Key: SPARK-3213
> URL: https://issues.apache.org/jira/browse/SPARK-3213
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> spark_ec2.py cannot find all slave instances.  In particular:
> * I created a master & slave and configured them.
> * I created new slave instances from the original slave ("Launch More Like 
> This").
> * I tried to relaunch the cluster, and it could only find the original slave.
> Old versions of the script worked.  The latest working commit which edited 
> that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1
> There may be a problem with this PR: 
> [https://github.com/apache/spark/pull/1899].



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3214) Argument parsing loop in make-distribution.sh ends prematurely

2014-08-25 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109765#comment-14109765
 ] 

Cheng Lian commented on SPARK-3214:
---

Didn't realize all Maven options must go after other {{make-distribution.sh}} 
options. Closing this.

> Argument parsing loop in make-distribution.sh ends prematurely
> --
>
> Key: SPARK-3214
> URL: https://issues.apache.org/jira/browse/SPARK-3214
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.2
>Reporter: Cheng Lian
>Priority: Minor
>
> Running {{make-distribution.sh}} in this way:
> {code}
> ./make-distribution.sh --hadoop -Pyarn
> {code}
> results in a proper error message:
> {code}
> Error: '--hadoop' is no longer supported:
> Error: use Maven options -Phadoop.version and -Pyarn.version
> {code}
> But if you run it with the options in reverse order, it just passes:
> {code}
> ./make-distribution.sh -Pyarn --hadoop
> {code}
> The reason is that the {{while}} loop ends prematurely before checking all 
> potentially deprecated command line options.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3214) Argument parsing loop in make-distribution.sh ends prematurely

2014-08-25 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian closed SPARK-3214.
-

Resolution: Not a Problem

> Argument parsing loop in make-distribution.sh ends prematurely
> --
>
> Key: SPARK-3214
> URL: https://issues.apache.org/jira/browse/SPARK-3214
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.2
>Reporter: Cheng Lian
>Priority: Minor
>
> Running {{make-distribution.sh}} in this way:
> {code}
> ./make-distribution.sh --hadoop -Pyarn
> {code}
> results in a proper error message:
> {code}
> Error: '--hadoop' is no longer supported:
> Error: use Maven options -Phadoop.version and -Pyarn.version
> {code}
> But if you run it with the options in reverse order, it just passes:
> {code}
> ./make-distribution.sh -Pyarn --hadoop
> {code}
> The reason is that the {{while}} loop ends prematurely before checking all 
> potentially deprecated command line options.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2798) Correct several small errors in Flume module pom.xml files

2014-08-25 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109735#comment-14109735
 ] 

Tathagata Das commented on SPARK-2798:
--

Naah, that was already closed by the fix I did on Friday 
(https://github.com/apache/spark/pull/2101). Maven, and therefore 
make-distribution, should work fine with that fix. 

> Correct several small errors in Flume module pom.xml files
> --
>
> Key: SPARK-2798
> URL: https://issues.apache.org/jira/browse/SPARK-2798
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.1.0
>
>
> (EDIT) Since the scalatest issue was since resolved, this is now about a few 
> small problems in the Flume Sink pom.xml 
> - scalatest is not declared as a test-scope dependency
> - Its Avro version doesn't match the rest of the build
> - Its Flume version is not synced with the other Flume module
> - The other Flume module declares its dependency on Flume Sink slightly 
> incorrectly, hard-coding the Scala 2.10 version
> - It depends on Scala Lang directly, which it shouldn't



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3214) Argument parsing loop in make-distribution.sh ends prematurely

2014-08-25 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-3214:
-

 Summary: Argument parsing loop in make-distribution.sh ends 
prematurely
 Key: SPARK-3214
 URL: https://issues.apache.org/jira/browse/SPARK-3214
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2
Reporter: Cheng Lian
Priority: Minor


Running {{make-distribution.sh}} in this way:
{code}
./make-distribution.sh --hadoop -Pyarn
{code}
results in a proper error message:
{code}
Error: '--hadoop' is no longer supported:
Error: use Maven options -Phadoop.version and -Pyarn.version
{code}
But if you run it with the options in reverse order, it just passes:
{code}
./make-distribution.sh -Pyarn --hadoop
{code}
The reason is that the {{while}} loop ends prematurely before checking all 
potentially deprecated command line options.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3180) Better control of security groups

2014-08-25 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3180.
---

   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 2088
[https://github.com/apache/spark/pull/2088]

> Better control of security groups
> -
>
> Key: SPARK-3180
> URL: https://issues.apache.org/jira/browse/SPARK-3180
> Project: Spark
>  Issue Type: Improvement
>Reporter: Allan Douglas R. de Oliveira
> Fix For: 1.3.0
>
>
> Two features can be combined to provide better control of security group 
> policies:
> - The ability to specify the address authorized to access the default 
> security group (instead of allowing everyone: 0.0.0.0/0)
> - The ability to place the created machines in a custom security group
> One can combine the two flags to restrict external access to 
> the provided security group (e.g. by setting the authorized address to 
> 127.0.0.1/32) while maintaining compatibility with the current behavior.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3156) DecisionTree: Order categorical features adaptively

2014-08-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3156:
-

Assignee: Joseph K. Bradley

> DecisionTree: Order categorical features adaptively
> ---
>
> Key: SPARK-3156
> URL: https://issues.apache.org/jira/browse/SPARK-3156
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> Improvement: accuracy
> Currently, ordered categorical features use a fixed bin ordering chosen 
> before training based on a subsample of the data.  (See the code using 
> centroids in findSplitsBins().)
> Proposal: Choose the ordering adaptively for every split.  This would require 
> a bit more computation on the master, but could improve results by splitting 
> more intelligently.
> Required changes: The result of aggregation is used in 
> findAggForOrderedFeatureClassification() to compute running totals over the 
> pre-set ordering of categorical feature values.  The stats should instead be 
> used to choose a new ordering of categories, before computing running totals.
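
As a toy illustration of the adaptive ordering described above (not Spark's 
actual implementation, and assuming binary labels summarized as per-category 
(label sum, count) pairs), one could order a categorical feature's values by 
mean label at each split:

{code}
// Toy sketch: order the values of one categorical feature by the mean label
// computed from aggregated statistics, category -> (sum of labels, count).
object OrderCategoriesByLabelSketch {
  def orderedCategories(stats: Map[Int, (Double, Long)]): Seq[Int] =
    stats.toSeq
      .map { case (category, (labelSum, count)) => (category, labelSum / count) }
      .sortBy(_._2)
      .map(_._1)

  def main(args: Array[String]): Unit = {
    val stats = Map(0 -> (4.0, 10L), 1 -> (9.0, 10L), 2 -> (1.0, 10L))
    // Categories sorted by fraction of positive labels: expect List(2, 0, 1).
    println(orderedCategories(stats))
  }
}
{code}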



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3213) spark_ec2.py cannot find slave instances

2014-08-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109700#comment-14109700
 ] 

Joseph K. Bradley commented on SPARK-3213:
--

The security group name I was using was "joseph-r3.2xlarge-slaves".  It may be a 
regex/matching issue.

> spark_ec2.py cannot find slave instances
> 
>
> Key: SPARK-3213
> URL: https://issues.apache.org/jira/browse/SPARK-3213
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> spark_ec2.py cannot find all slave instances.  In particular:
> * I created a master & slave and configured them.
> * I created new slave instances from the original slave ("Launch More Like 
> This").
> * I tried to relaunch the cluster, and it could only find the original slave.
> Old versions of the script worked.  The latest working commit which edited 
> that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1
> There may be a problem with this PR: 
> [https://github.com/apache/spark/pull/1899].



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3213) spark_ec2.py cannot find slave instances

2014-08-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109697#comment-14109697
 ] 

Joseph K. Bradley commented on SPARK-3213:
--

[~vidaha]  Please take a look.  Thanks!

> spark_ec2.py cannot find slave instances
> 
>
> Key: SPARK-3213
> URL: https://issues.apache.org/jira/browse/SPARK-3213
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> spark_ec2.py cannot find all slave instances.  In particular:
> * I created a master & slave and configured them.
> * I created new slave instances from the original slave ("Launch More Like 
> This").
> * I tried to relaunch the cluster, and it could only find the original slave.
> Old versions of the script worked.  The latest working commit which edited 
> that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1
> There may be a problem with this PR: 
> [https://github.com/apache/spark/pull/1899].



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3213) spark_ec2.py cannot find slave instances

2014-08-25 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-3213:


 Summary: spark_ec2.py cannot find slave instances
 Key: SPARK-3213
 URL: https://issues.apache.org/jira/browse/SPARK-3213
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.1.0
Reporter: Joseph K. Bradley
Priority: Blocker


spark_ec2.py cannot find all slave instances.  In particular:
* I created a master & slave and configured them.
* I created new slave instances from the original slave ("Launch More Like 
This").
* I tried to relaunch the cluster, and it could only find the original slave.

Old versions of the script worked.  The latest working commit which edited that 
.py script is: a0bcbc159e89be868ccc96175dbf1439461557e1

There may be a problem with this PR: 
[https://github.com/apache/spark/pull/1899].



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3044) Create RSS feed for Spark News

2014-08-25 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109677#comment-14109677
 ] 

Nicholas Chammas commented on SPARK-3044:
-

Hi Michael,

I don't know if the site itself is open-source. We might need someone from 
Databricks to update it.

[~pwendell], [~rxin] - Is it possible for contributors to contribute to the 
[main Spark site|http://spark.apache.org/]?

> Create RSS feed for Spark News
> --
>
> Key: SPARK-3044
> URL: https://issues.apache.org/jira/browse/SPARK-3044
> Project: Spark
>  Issue Type: Documentation
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Project updates are often posted here: http://spark.apache.org/news/
> Currently, there is no way to subscribe to a feed of these updates. It would 
> be nice there was a way people could be notified of new posts there without 
> having to check manually.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2798) Correct several small errors in Flume module pom.xml files

2014-08-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109675#comment-14109675
 ] 

Sean Owen commented on SPARK-2798:
--

[~tdas] Cool, I think this closes SPARK-3169 too if I understand correctly

> Correct several small errors in Flume module pom.xml files
> --
>
> Key: SPARK-2798
> URL: https://issues.apache.org/jira/browse/SPARK-2798
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.1.0
>
>
> (EDIT) Since the scalatest issue was since resolved, this is now about a few 
> small problems in the Flume Sink pom.xml 
> - scalatest is not declared as a test-scope dependency
> - Its Avro version doesn't match the rest of the build
> - Its Flume version is not synced with the other Flume module
> - The other Flume module declares its dependency on Flume Sink slightly 
> incorrectly, hard-coding the Scala 2.10 version
> - It depends on Scala Lang directly, which it shouldn't



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


