[jira] [Commented] (SPARK-15487) Spark Master UI to reverse proxy Application and Workers UI

2016-10-24 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601794#comment-15601794
 ] 

Matthew Farrellee commented on SPARK-15487:
---

well, unless you're putting another proxy in front of your master and want it 
to show up in a subsection of your domain, you should only need "/", which 
would be a great default. in the case of a site proxy on 
www.mydomain.com/spark, i'd expect you only need to set the url to "/spark"

fyi, the master passes the proxy url to the workers - 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L401
 - so you should not need to set it on the workers

if you're continuing to have a problem you should definitely open another issue 
and leave this one as resolved.
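
a minimal sketch of the standalone reverse-proxy settings this issue introduced 
(assuming the spark.ui.reverseProxy / spark.ui.reverseProxyUrl configuration 
keys; the "/spark" prefix below is only a hypothetical example):

{code}
import org.apache.spark.SparkConf

// enable the master-side reverse proxy for worker and application UIs, and
// only set the url when another proxy (e.g. www.mydomain.com/spark) sits in
// front of the master
val conf = new SparkConf()
  .set("spark.ui.reverseProxy", "true")
  .set("spark.ui.reverseProxyUrl", "/spark") // omit or use "/" with no front proxy
{code}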

> Spark Master UI to reverse proxy Application and Workers UI
> ---
>
> Key: SPARK-15487
> URL: https://issues.apache.org/jira/browse/SPARK-15487
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gurvinder
>Assignee: Gurvinder
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently, when running in Standalone mode, the Spark UI's links to workers and 
> application drivers point to internal/protected network endpoints, so to access 
> the worker/application UIs a user's machine has to connect via VPN or have 
> direct access to the internal network.
> Therefore the proposal is to make the Spark master UI reverse proxy this 
> information back to the user, so only the Spark master UI needs to be opened up 
> to the internet. 
> The minimal change can be done by adding another route, e.g. 
> http://spark-master.com/target//, so when a request goes to that route the 
> ProxyServlet kicks in, takes the target worker/application address from the 
> path, forwards the request to it, and sends the response back to the user 
> (a sketch of such a route follows this description).
> More information about discussions for this feature can be found on this 
> mailing list thread: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-kubernetes-tc17599.html
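
a rough sketch of such a route (an illustration only, not the merged 
implementation; it assumes Jetty 9.3+'s org.eclipse.jetty.proxy.ProxyServlet 
and its rewriteTarget hook, and the /target/* mount point is hypothetical):

{code}
import javax.servlet.http.HttpServletRequest
import org.eclipse.jetty.proxy.ProxyServlet

// hypothetical servlet mounted at /target/* on the master UI: the first path
// segment names the worker/application UI address to forward to
class UiProxyServlet extends ProxyServlet {
  override def rewriteTarget(request: HttpServletRequest): String = {
    val path = Option(request.getPathInfo).getOrElse("").stripPrefix("/")
    val (target, rest) = path.span(_ != '/')   // e.g. "10.0.0.5:8081" and "/jobs"
    s"http://$target$rest"
  }
}
{code}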



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15487) Spark Master UI to reverse proxy Application and Workers UI

2016-10-23 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15600710#comment-15600710
 ] 

Matthew Farrellee commented on SPARK-15487:
---

try just setting the proxy url to "/"

> Spark Master UI to reverse proxy Application and Workers UI
> ---
>
> Key: SPARK-15487
> URL: https://issues.apache.org/jira/browse/SPARK-15487
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gurvinder
>Assignee: Gurvinder
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently, when running in Standalone mode, the Spark UI's links to workers and 
> application drivers point to internal/protected network endpoints, so to access 
> the worker/application UIs a user's machine has to connect via VPN or have 
> direct access to the internal network.
> Therefore the proposal is to make the Spark master UI reverse proxy this 
> information back to the user, so only the Spark master UI needs to be opened up 
> to the internet. 
> The minimal change can be done by adding another route, e.g. 
> http://spark-master.com/target//, so when a request goes to that route the 
> ProxyServlet kicks in, takes the target worker/application address from the 
> path, forwards the request to it, and sends the response back to the user.
> More information about discussions for this feature can be found on this 
> mailing list thread: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-kubernetes-tc17599.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [openstack-dev] [sahara] Proposing Vitaly Gridnev to core reviewer team

2015-10-13 Thread Matthew Farrellee

+1!

On 10/12/2015 07:19 AM, Sergey Lukjanov wrote:

Hi folks,

I'd like to propose Vitaly Gridnev as a member of the Sahara core
reviewer team.

Vitaly has been contributing to Sahara for a long time and doing a great job of
reviewing and improving Sahara. Here are the statistics for reviews
[0][1][2] and commits [3].

Existing Sahara core reviewers, please vote +1/-1 for the addition of
Vitaly to the core reviewer team.

Thanks.

[0]
https://review.openstack.org/#/q/reviewer:%22Vitaly+Gridnev+%253Cvgridnev%2540mirantis.com%253E%22,n,z
[1] http://stackalytics.com/report/contribution/sahara-group/180
[2] http://stackalytics.com/?metric=marks_id=vgridnev
[3]
https://review.openstack.org/#/q/status:merged+owner:%22Vitaly+Gridnev+%253Cvgridnev%2540mirantis.com%253E%22,n,z

--
Sincerely yours,
Sergey Lukjanov
Sahara Technical Lead
(OpenStack Data Processing)
Principal Software Engineer
Mirantis Inc.


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev




__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[jira] [Created] (FLINK-2709) line editing in scala shell

2015-09-18 Thread Matthew Farrellee (JIRA)
Matthew Farrellee created FLINK-2709:


 Summary: line editing in scala shell
 Key: FLINK-2709
 URL: https://issues.apache.org/jira/browse/FLINK-2709
 Project: Flink
  Issue Type: New Feature
  Components: Scala Shell
Reporter: Matthew Farrellee


it would be very helpful to be able to edit lines in the shell. for instance, 
up/down arrow to navigate history and left/right to navigate a line.

bonus for history search and advanced single line editing (e.g. emacs bindings)
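
a small sketch of the kind of line editing being asked for, using JLine's 
ConsoleReader (an assumption for illustration only, not a statement about how 
the Flink scala shell is actually wired):

{code}
import jline.console.ConsoleReader

// ConsoleReader supplies arrow-key history and in-line editing that a raw
// stdin loop does not
val reader = new ConsoleReader()
reader.setPrompt("flink> ")
var line = reader.readLine()
while (line != null) {
  println(s"echo: $line")
  line = reader.readLine()
}
{code}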



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)



Re: [openstack-dev] [sahara] Proposing Ethan Gafford for the core reviewer team

2015-08-13 Thread Matthew Farrellee

On 08/13/2015 10:56 AM, Sergey Lukjanov wrote:

Hi folks,

I'd like to propose Ethan Gafford as a member of the Sahara core
reviewer team.

Ethan has been contributing to Sahara for a long time and doing a great job of
reviewing and improving Sahara. Here are the statistics for reviews
[0][1][2] and commits [3]. BTW, Ethan is already a stable-maint team core
for Sahara.

Existing Sahara core reviewers, please vote +1/-1 for the addition of
Ethan to the core reviewer team.

Thanks.

[0]
https://review.openstack.org/#/q/reviewer:%22Ethan+Gafford+%253Cegafford%2540redhat.com%253E%22,n,z
[1] http://stackalytics.com/report/contribution/sahara-group/90
[2] http://stackalytics.com/?user_id=egaffordmetric=marks
[3]
https://review.openstack.org/#/q/owner:%22Ethan+Gafford+%253Cegafford%2540redhat.com%253E%22+status:merged,n,z

--
Sincerely yours,
Sergey Lukjanov
Sahara Technical Lead
(OpenStack Data Processing)
Principal Software Engineer
Mirantis Inc.


+1 ethan has really taken to sahara, providing valuable input to both 
development and deployments, as well as taking on the manila integration



__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[jira] [Commented] (SPARK-5368) Spark should support NAT (via akka improvements)

2015-03-23 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377096#comment-14377096
 ] 

Matthew Farrellee commented on SPARK-5368:
--

[~jayunit100] the relevant config is {{LOCAL_HOSTNAME}}
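
a minimal illustration, assuming the config referenced above is the 
SPARK_LOCAL_HOSTNAME environment variable (it has to be exported before the 
JVM starts, e.g. SPARK_LOCAL_HOSTNAME=worker-public.example.com; the hostname 
here is a made-up example):

{code}
// from inside a running job the setting can only be inspected, not changed
println(sys.env.getOrElse("SPARK_LOCAL_HOSTNAME", "<not set>"))
{code}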

 Spark should  support NAT (via akka improvements)
 -

 Key: SPARK-5368
 URL: https://issues.apache.org/jira/browse/SPARK-5368
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: jay vyas
 Fix For: 1.2.2


 Spark sets up actors for akka with a set of variables which are defined in 
 the {{AkkaUtils.scala}} class.  
 A snippet:
 {noformat}
  98   |akka.loggers = [akka.event.slf4j.Slf4jLogger]
  99   |akka.stdout-loglevel = ERROR
 100   |akka.jvm-exit-on-fatal-error = off
 101   |akka.remote.require-cookie = $requireCookie
 102   |akka.remote.secure-cookie = $secureCookie
 {noformat}
 We should allow users to pass in custom settings, for example, so that 
 arbitrary akka modifications can be used at runtime for security, 
 performance, logging, and so on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5368) Spark should support NAT (via akka improvements)

2015-03-22 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375157#comment-14375157
 ] 

Matthew Farrellee commented on SPARK-5368:
--

[~srowen] i was able to work around my issue using SPARK-5078. i assume it was 
the core of jay's problem that triggered opening this issue. if he agrees, this 
could simply be closed.

 Spark should  support NAT (via akka improvements)
 -

 Key: SPARK-5368
 URL: https://issues.apache.org/jira/browse/SPARK-5368
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: jay vyas
 Fix For: 1.2.2


 Spark sets up actors for akka with a set of variables which are defined in 
 the {{AkkaUtils.scala}} class.  
 A snippet:
 {noformat}
  98   |akka.loggers = [akka.event.slf4j.Slf4jLogger]
  99   |akka.stdout-loglevel = ERROR
 100   |akka.jvm-exit-on-fatal-error = off
 101   |akka.remote.require-cookie = $requireCookie
 102   |akka.remote.secure-cookie = $secureCookie
 {noformat}
 We should allow users to pass in custom settings, for example, so that 
 arbitrary akka modifications can be used at runtime for security, 
 performance, logging, and so on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5113) Audit and document use of hostnames and IP addresses in Spark

2015-03-22 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375158#comment-14375158
 ] 

Matthew Farrellee commented on SPARK-5113:
--

[~pwendell] would SPARK_INTERNAL_HOSTNAME operate like SPARK_LOCAL_HOSTNAME?

 Audit and document use of hostnames and IP addresses in Spark
 -

 Key: SPARK-5113
 URL: https://issues.apache.org/jira/browse/SPARK-5113
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Patrick Wendell
Priority: Critical

 Spark has multiple network components that start servers and advertise their 
 network addresses to other processes.
 We should go through each of these components and make sure they have 
 consistent and/or documented behavior wrt (a) what interface(s) they bind to 
 and (b) what hostname they use to advertise themselves to other processes. We 
 should document this clearly and explain to people what to do in different 
 cases (e.g. EC2, dockerized containers, etc).
 When Spark initializes, it will search for a network interface until it finds 
 one that is not a loopback address. Then it will do a reverse DNS lookup for 
 a hostname associated with that interface. Then the network components will 
 use that hostname to advertise the component to other processes. That 
 hostname is also the one used for the akka system identifier (akka supports 
 only supplying a single name which it uses both as the bind interface and as 
 the actor identifier). In some cases, that hostname is used as the bind 
 hostname also (e.g. I think this happens in the connection manager and 
 possibly akka) - which will likely internally result in a re-resolution of 
 this to an IP address. In other cases (the web UI and netty shuffle) we seem 
 to bind to all interfaces.
 The best outcome would be to have three configs that can be set on each 
 machine:
 {code}
 SPARK_LOCAL_IP # Ip address we bind to for all services
 SPARK_INTERNAL_HOSTNAME # Hostname we advertise to remote processes within 
 the cluster
 SPARK_EXTERNAL_HOSTNAME # Hostname we advertise to processes outside the 
 cluster (e.g. the UI)
 {code}
 It's not clear how easily we can support that scheme while providing 
 backwards compatibility. The last one (SPARK_EXTERNAL_HOSTNAME) is easy - 
 it's just an alias for what is now SPARK_PUBLIC_DNS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6245) jsonRDD() of empty RDD results in exception

2015-03-16 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363290#comment-14363290
 ] 

Matthew Farrellee commented on SPARK-6245:
--

[~srowen] thanks for fixing this. it's nice to file a bug, go on vacation and 
see it fixed when you get back!

what do you think about adding this to 1.3.1?

 jsonRDD() of empty RDD results in exception
 ---

 Key: SPARK-6245
 URL: https://issues.apache.org/jira/browse/SPARK-6245
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
Reporter: Matthew Farrellee
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.4.0


 converting an empty RDD to a JSON RDD results in an exception. this case is 
 common when using spark streaming.
 {code}
 from pyspark import SparkContext
 from pyspark.sql import SQLContext
 sc = SparkContext()
 qsc = SQLContext(sc)
 qsc.jsonRDD(sc.parallelize([]))
 {code}
 exception:
 {noformat}
 Traceback (most recent call last):
   
   File "/tmp/bug.py", line 5, in <module>
 qsc.jsonRDD(sc.parallelize([]))
   File "/usr/share/spark/python/pyspark/sql.py", line 1605, in jsonRDD
 srdd = self._ssql_ctx.jsonRDD(jrdd.rdd(), samplingRatio)
   File 
 "/usr/share/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 
 538, in __call__
   File "/usr/share/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", 
 line 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling o27.jsonRDD.
 : java.lang.UnsupportedOperationException: empty collection
   at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:886)
   at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:886)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.reduce(RDD.scala:886)
   at org.apache.spark.sql.json.JsonRDD$.inferSchema(JsonRDD.scala:57)
   at org.apache.spark.sql.SQLContext.jsonRDD(SQLContext.scala:232)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
   at py4j.Gateway.invoke(Gateway.java:259)
   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
   at py4j.commands.CallCommand.execute(CallCommand.java:79)
   at py4j.GatewayConnection.run(GatewayConnection.java:207)
   at java.lang.Thread.run(Thread.java:745)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6245) jsonRDD() of empty RDD results in exception

2015-03-10 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14355843#comment-14355843
 ] 

Matthew Farrellee commented on SPARK-6245:
--

this is an issue for the scala interface as well.

{code}
scala> val qsc = new org.apache.spark.sql.SQLContext(sc)
qsc: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@36c77da5
scala> qsc.jsonRDD(sc.parallelize(List()))
java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:886)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:886)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:886)
at org.apache.spark.sql.json.JsonRDD$.inferSchema(JsonRDD.scala:57)
at org.apache.spark.sql.SQLContext.jsonRDD(SQLContext.scala:232)
at org.apache.spark.sql.SQLContext.jsonRDD(SQLContext.scala:204)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
at $iwC$$iwC$$iwC.<init>(<console>:20)
at $iwC$$iwC.<init>(<console>:22)
at $iwC.<init>(<console>:24)
at <init>(<console>:26)
at .<init>(<console>:30)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:628)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:636)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:641)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:968)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:916)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1011)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

{{org.apache.spark.sql.json.JsonRDD$.inferSchema(JsonRDD.scala:57)}} is surely 
the guilty party
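
for anyone hitting this on an affected release, a hedged workaround sketch 
(not the fix that was eventually merged): skip schema inference when the RDD 
is empty instead of calling jsonRDD directly.

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

// returns None for an empty RDD rather than letting inferSchema throw
def safeJsonRDD(sqlContext: SQLContext, json: RDD[String]) =
  if (json.take(1).isEmpty) None else Some(sqlContext.jsonRDD(json))
{code}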

 jsonRDD() of empty RDD results in exception
 ---

 Key: SPARK-6245
 URL: https://issues.apache.org/jira/browse/SPARK-6245
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL, Streaming
Affects Versions: 1.2.1
Reporter: Matthew Farrellee

 converting an empty RDD to a JSON RDD results in an exception. this case is 
 common when using spark streaming.
 {code}
 from pyspark import SparkContext
 from pyspark.sql import SQLContext
 sc = SparkContext()
 qsc = SQLContext(sc)
 qsc.jsonRDD(sc.parallelize([]))
 {code}
 exception:
 {noformat}
 Traceback (most recent call last):
   
   File "/tmp/bug.py", line 5, in <module>
 qsc.jsonRDD(sc.parallelize([]))
   File "/usr/share/spark/python/pyspark/sql.py", line 1605, in jsonRDD
 srdd = self._ssql_ctx.jsonRDD(jrdd.rdd(), samplingRatio)
   File 
 "/usr/share/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 
 538, in __call__
   File /usr/share/spark/python/lib/py4j

[jira] [Commented] (SPARK-5368) Spark should support NAT (via akka improvements)

2015-03-09 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353534#comment-14353534
 ] 

Matthew Farrellee commented on SPARK-5368:
--

[~srowen] will you take a look at this? i'm trying to run spark via kubernetes 
(master pod + master service + slave replicationcontroller), and the service 
layer is creating a NAT-like environment.

 Spark should  support NAT (via akka improvements)
 -

 Key: SPARK-5368
 URL: https://issues.apache.org/jira/browse/SPARK-5368
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: jay vyas
 Fix For: 1.2.2


 Spark sets up actors for akka with a set of variables which are defined in 
 the {{AkkaUtils.scala}} class.  
 A snippet:
 {noformat}
  98   |akka.loggers = [akka.event.slf4j.Slf4jLogger]
  99   |akka.stdout-loglevel = ERROR
 100   |akka.jvm-exit-on-fatal-error = off
 101   |akka.remote.require-cookie = $requireCookie
 102   |akka.remote.secure-cookie = $secureCookie
 {noformat}
 We should allow users to pass in custom settings, for example, so that 
 arbitrary akka modifications can be used at runtime for security, 
 performance, logging, and so on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2313) PySpark should accept port via a command line argument rather than STDIN

2015-02-12 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318221#comment-14318221
 ] 

Matthew Farrellee commented on SPARK-2313:
--

that'd work, also requires a py4j change
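
purely to illustrate the difference being discussed (a hypothetical bootstrap, 
not Spark's actual worker code): taking the port from argv is less brittle 
than reading it from stdin, which is what keeps breaking.

{code}
object PortBootstrap {
  def main(args: Array[String]): Unit = {
    // prefer an explicit command-line argument; fall back to the old
    // stdin handshake only if none was given
    val port = args.headOption.map(_.toInt)
      .getOrElse(scala.io.StdIn.readLine().trim.toInt)
    println(s"connecting back to the gateway on port $port")
  }
}
{code}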

 PySpark should accept port via a command line argument rather than STDIN
 

 Key: SPARK-2313
 URL: https://issues.apache.org/jira/browse/SPARK-2313
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Patrick Wendell

 Relying on stdin is a brittle mechanism and has broken several times in the 
 past. From what I can tell this is used only to bootstrap worker.py one time. 
 It would be strictly simpler to just pass it as a command line argument.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-927) PySpark sample() doesn't work if numpy is installed on master but not on workers

2015-01-05 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265290#comment-14265290
 ] 

Matthew Farrellee commented on SPARK-927:
-

PR #2313 was subsumed by PR #3351, which resolved SPARK-4477 and this issue

the resolution was to remove the use of numpy altogether

 PySpark sample() doesn't work if numpy is installed on master but not on 
 workers
 

 Key: SPARK-927
 URL: https://issues.apache.org/jira/browse/SPARK-927
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.8.0, 0.9.1, 1.0.2, 1.1.2
Reporter: Josh Rosen
Assignee: Matthew Farrellee
Priority: Minor

 PySpark's sample() method crashes with ImportErrors on the workers if numpy 
 is installed on the driver machine but not on the workers.  I'm not sure 
 what's the best way to fix this.  A general mechanism for automatically 
 shipping libraries from the master to the workers would address this, but 
 that could be complicated to implement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-927) PySpark sample() doesn't work if numpy is installed on master but not on workers

2015-01-05 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee resolved SPARK-927.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

 PySpark sample() doesn't work if numpy is installed on master but not on 
 workers
 

 Key: SPARK-927
 URL: https://issues.apache.org/jira/browse/SPARK-927
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.8.0, 0.9.1, 1.0.2, 1.1.2
Reporter: Josh Rosen
Assignee: Matthew Farrellee
Priority: Minor
 Fix For: 1.2.0


 PySpark's sample() method crashes with ImportErrors on the workers if numpy 
 is installed on the driver machine but not on the workers.  I'm not sure 
 what's the best way to fix this.  A general mechanism for automatically 
 shipping libraries from the master to the workers would address this, but 
 that could be complicated to implement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [openstack-dev] [sahara] team meeting Nov 27 1800 UTC

2014-11-26 Thread Matthew Farrellee

On 11/26/2014 01:10 PM, Sergey Lukjanov wrote:

Hi folks,

We'll be having the Sahara team meeting as usual in
#openstack-meeting-alt channel.

Agenda: https://wiki.openstack.org/wiki/Meetings/SaharaAgenda#Next_meetings

http://www.timeanddate.com/worldclock/fixedtime.html?msg=Sahara+Meetingiso=20141127T18

--
Sincerely yours,
Sergey Lukjanov
Sahara Technical Lead
(OpenStack Data Processing)
Principal Software Engineer
Mirantis Inc.


fyi, it's the Thanksgiving holiday for folks in the US, so we'll be absent.

best,


matt

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [Openstack] Sahara: No images available after registering UbuntuVanilla Image when launching cluster.‏

2014-11-24 Thread Matthew Farrellee

On 11/24/2014 02:28 PM, Edward HUANG wrote:

Hi all,
   I'm setting up a local cloud environment on servers in my lab.
   I installed OpenStack with devstack, and i installed it with sahara.
   Data processing appears in the dashboard, and i did add a
ubuntu-vanilla qcow2 image according to
http://docs.openstack.org/developer/sahara/userdoc/vanilla_plugin.html.
I downloaded and registered the Ubuntu-Vanilla-2.4.1.qcow2.
   And I set up two node templates, one master node with namenode, oozie,
resourcemanager, nodemanager, historyserver.
   One worker node template with datanode.
   And I set up a cluster template with one master node and one worker node.
   But when I try to launch the cluster, I cannot select images. In the
slot where the base image should be selected, it shows 'no images available'.
   Does anyone have experience regarding this? Am i missing something
in my configuration?

Thanks!
Edward ZILONG HUANG
MS@ECE department, Carnegie Mellon University
http://www.andrew.cmu.edu/user/zilongh/


could you have missed the tagging step that's part of the image 
registration?


sahara uses the tags to filter incompatible images. if you select the 
vanilla 2.x plugin you'll only see images that are tagged w/ vanilla and 
2.x.


the tagging step is error prone because you not only have to select 
which tags you want to have on the image, but you also have to apply the 
tags before clicking through to register the image.


best,


matt


___
Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
Post to : openstack@lists.openstack.org
Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack


Re: [openstack-dev] [sahara] Nominate Sergey Reshetniak to sahara-core

2014-11-11 Thread Matthew Farrellee

On 11/11/2014 12:35 PM, Sergey Lukjanov wrote:

Hi folks,

I'd like to propose Sergey to sahara-core. He's done a lot of work on
different parts of Sahara and has a very good knowledge of the codebase,
especially in the plugins area. Sergey has been consistently giving us very
well thought out and constructive reviews for the Sahara project.

Sahara core team members, please, vote +/- 2.

Thanks.


--
Sincerely yours,
Sergey Lukjanov
Sahara Technical Lead
(OpenStack Data Processing)
Principal Software Engineer
Mirantis Inc.


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



+2

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [sahara] Nominate Michael McCune to sahara-core

2014-11-11 Thread Matthew Farrellee

On 11/11/2014 12:37 PM, Sergey Lukjanov wrote:

Hi folks,

I'd like to propose Michael McCune to sahara-core. He has a good
knowledge of the codebase and has implemented important features such as Swift
auth using trusts. Mike has been consistently giving us very well
thought out and constructive reviews for the Sahara project.

Sahara core team members, please, vote +/- 2.

Thanks.


--
Sincerely yours,
Sergey Lukjanov
Sahara Technical Lead
(OpenStack Data Processing)
Principal Software Engineer
Mirantis Inc.


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



+2

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[jira] [Closed] (SPARK-2256) pyspark: RDD.take doesn't work ... sometimes ...

2014-10-03 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-2256.

   Resolution: Fixed
Fix Version/s: 1.1.0

 pyspark: RDD.take doesn't work ... sometimes ...
 --

 Key: SPARK-2256
 URL: https://issues.apache.org/jira/browse/SPARK-2256
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0
 Environment: local file/remote HDFS
Reporter: Ángel Álvarez
  Labels: RDD, pyspark, take, windows
 Fix For: 1.1.0

 Attachments: A_test.zip


 If I try to take some lines from a file, sometimes it doesn't work
 Code: 
 myfile = sc.textFile("A_ko")
 print myfile.take(10)
 Stacktrace:
 14/06/24 09:29:27 INFO DAGScheduler: Failed to run take at mytest.py:19
 Traceback (most recent call last):
   File "mytest.py", line 19, in <module>
 print myfile.take(10)
   File "spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py", line 868, in take
 iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator()
   File 
 "spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\java_gateway.py", 
 line 537, in __call__
   File 
 "spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\protocol.py", 
 line 300, in get_return_value
 Test data:
 START TEST DATA
 A
 A
 A
 
 
 
 
 
 
 
 
 

[jira] [Commented] (SPARK-3733) Support for programmatically submitting Spark jobs

2014-09-30 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153142#comment-14153142
 ] 

Matthew Farrellee commented on SPARK-3733:
--

will you describe what you mean by submitting Spark jobs and what 
expectations you have for supporting this feature?

 Support for programmatically submitting Spark jobs
 --

 Key: SPARK-3733
 URL: https://issues.apache.org/jira/browse/SPARK-3733
 Project: Spark
  Issue Type: New Feature
Affects Versions: 1.1.0
Reporter: Sotos Matzanas

 Right now it's impossible to programmatically submit Spark jobs via a Scala 
 (or Java) API. We would like to see that in a future version of Spark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3685) Spark's local dir should accept only local paths

2014-09-29 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151691#comment-14151691
 ] 

Matthew Farrellee commented on SPARK-3685:
--

[~andrewor] thanks for the info. afaik the executor is also in charge of the 
shuffle file life-cycle, and breaking that would be complicated. it's probably 
a cleaner implementation to allow executors to remain and use a policy to prune 
unused/little-used executors where unused/little-used factors in amount of data 
they are holding as well as cpu used. you could also go down the path of 
aging-out executors - let their resources go back to the node's pool for 
reallocation, but don't kill off the process. however, approaches like that 
become very complex and push implementation details of the workload, which 
often don't generalize, into the scheduling system.

[~andrewor] btw, it should be a warning case (hey you might have messed up, i 
see you used hdfs:/ in your file name) instead of an error case.

 Spark's local dir should accept only local paths
 

 Key: SPARK-3685
 URL: https://issues.apache.org/jira/browse/SPARK-3685
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.1.0
Reporter: Andrew Or

 When you try to set local dirs to hdfs:/tmp/foo it doesn't work. What it 
 will try to do is create a folder called hdfs: and put tmp inside it. 
 This is because in Util#getOrCreateLocalRootDirs we use java.io.File instead 
 of Hadoop's file system to parse this path. We also need to resolve the path 
 appropriately.
 This may not have an urgent use case, but it fails silently and does what is 
 least expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3685) Spark's local dir should accept only local paths

2014-09-29 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152152#comment-14152152
 ] 

Matthew Farrellee commented on SPARK-3685:
--

the root of the resource problem is how they're handed out. yarn is giving you 
a whole cpu, some amount of memory, some amount of network and some amount of 
disk to work with. your executor (like any program) uses different amounts of 
resources throughout its execution. at points in the execution the resource 
profile changes, call the demarcated regions phases. so an executor may 
transition from a high resource phase to a low resource phase. in a low 
resource phase, you may want to free up resources for other executors, but 
maintain enough to do basic operations (say: serve a shuffle file). this is a 
problem that should be solved by the resource manager. in my opinion, a 
solution w/i spark that isn't facilitated by the RM is a workaround/hack and 
should be avoided. an example of an RM-facilitated solution might be a message 
the executor can send to yarn to indicate its resources can be freed, except 
for some minimum amount.

 Spark's local dir should accept only local paths
 

 Key: SPARK-3685
 URL: https://issues.apache.org/jira/browse/SPARK-3685
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.1.0
Reporter: Andrew Or

 When you try to set local dirs to hdfs:/tmp/foo it doesn't work. What it 
 will try to do is create a folder called hdfs: and put tmp inside it. 
 This is because in Util#getOrCreateLocalRootDirs we use java.io.File instead 
 of Hadoop's file system to parse this path. We also need to resolve the path 
 appropriately.
 This may not have an urgent use case, but it fails silently and does what is 
 least expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3685) Spark's local dir should accept only local paths

2014-09-29 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152526#comment-14152526
 ] 

Matthew Farrellee commented on SPARK-3685:
--

if you're going to go down this path the best (i'd say correct) way to 
implement it is to have support from yarn - a way to tell yarn i'm only going 
to need X,Y,Z resources from now on without giving up the execution container. 
i bet there's a way to re-exec the jvm into a smaller form factor.

 Spark's local dir should accept only local paths
 

 Key: SPARK-3685
 URL: https://issues.apache.org/jira/browse/SPARK-3685
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.1.0
Reporter: Andrew Or

 When you try to set local dirs to hdfs:/tmp/foo it doesn't work. What it 
 will try to do is create a folder called hdfs: and put tmp inside it. 
 This is because in Util#getOrCreateLocalRootDirs we use java.io.File instead 
 of Hadoop's file system to parse this path. We also need to resolve the path 
 appropriately.
 This may not have an urgent use case, but it fails silently and does what is 
 least expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3685) Spark's local dir scheme is not configurable

2014-09-28 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151148#comment-14151148
 ] 

Matthew Farrellee commented on SPARK-3685:
--

i'm skeptical. what would be the benefit of using HDFS for temporary storage?

 Spark's local dir scheme is not configurable
 

 Key: SPARK-3685
 URL: https://issues.apache.org/jira/browse/SPARK-3685
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Andrew Or

 When you try to set local dirs to hdfs:/tmp/foo it doesn't work. What it 
 will try to do is create a folder called hdfs: and put tmp inside it. 
 This is because in Util#getOrCreateLocalRootDirs we use java.io.File instead 
 of Hadoop's file system to parse this path. We also need to resolve the path 
 appropriately.
 This may not have an urgent use case, but it fails silently and does what is 
 least expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [openstack-dev] [Sahara] Verbosity of Sahara overview image

2014-09-27 Thread Matthew Farrellee

On 09/26/2014 02:27 PM, Sharan Kumar M wrote:

Hi all,

I am trying to modify the diagram in
http://docs.openstack.org/developer/sahara/overview.html so that it
syncs with the contents. In the diagram, is it nice to mark the
connections between the openstack components like, Nova with Cinder,
Nova with Swift, components with Keystone, Nova with Neutron, etc? Or
would it be too verbose for this diagram and should I be focusing on
links between Sahara and other components?

Thanks,
Sharan Kumar M


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



http://docs.openstack.org/developer/sahara/architecture.html has a 
better diagram imho


i think the diagram should focus on links between sahara and other 
components only.


best,


matt

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[jira] [Commented] (SPARK-3639) Kinesis examples set master as local

2014-09-24 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146245#comment-14146245
 ] 

Matthew Farrellee commented on SPARK-3639:
--

seems reasonable to me

 Kinesis examples set master as local
 

 Key: SPARK-3639
 URL: https://issues.apache.org/jira/browse/SPARK-3639
 Project: Spark
  Issue Type: Bug
  Components: Examples, Streaming
Affects Versions: 1.0.2, 1.1.0
Reporter: Aniket Bhatnagar
Priority: Minor
  Labels: examples

 Kinesis examples set master as local thus not allowing the example to be 
 tested on a cluster



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1443) Unable to Access MongoDB GridFS data with Spark using mongo-hadoop API

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee resolved SPARK-1443.
--
   Resolution: Done
Fix Version/s: (was: 0.9.0)

 Unable to Access MongoDB GridFS data with Spark using mongo-hadoop API
 --

 Key: SPARK-1443
 URL: https://issues.apache.org/jira/browse/SPARK-1443
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output, Java API, Spark Core
Affects Versions: 0.9.0
 Environment: Java 1.7,Hadoop 2.2.0,Spark 0.9.0,Ubuntu 12.4,
Reporter: Pavan Kumar Varma
Priority: Critical
  Labels: GridFS, MongoDB, Spark, hadoop2, java
   Original Estimate: 12h
  Remaining Estimate: 12h

 I saved a 2GB pdf file into MongoDB using GridFS. now i want to process that 
 GridFS collection data using the Java Spark MapReduce API. previously i have 
 successfully processed mongoDB collections with Apache Spark using the 
 Mongo-Hadoop connector. now i'm unable to process GridFS collections with the 
 following code.
 MongoConfigUtil.setInputURI(config, 
 "mongodb://localhost:27017/pdfbooks.fs.chunks");
  MongoConfigUtil.setOutputURI(config, "mongodb://localhost:27017/" + output);
  JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
 com.mongodb.hadoop.MongoInputFormat.class, Object.class,
 BSONObject.class);
  JavaRDD<String> words = mongoRDD.flatMap(new 
 FlatMapFunction<Tuple2<Object, BSONObject>, String>() {
@Override
public Iterable<String> call(Tuple2<Object, BSONObject> arg) {
System.out.println(arg._2.toString());
...
 Please suggest/provide  better API methods to access MongoDB GridFS data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1443) Unable to Access MongoDB GridFS data with Spark using mongo-hadoop API

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142446#comment-14142446
 ] 

Matthew Farrellee commented on SPARK-1443:
--

[~PavanKumarVarma] i hope you've been able to resolve your issue over the past 
5 months. since you'll get a better response asking on the spark user list than 
in jira, see http://spark.apache.org/community.html, i'm going to close this 
out.

 Unable to Access MongoDB GridFS data with Spark using mongo-hadoop API
 --

 Key: SPARK-1443
 URL: https://issues.apache.org/jira/browse/SPARK-1443
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output, Java API, Spark Core
Affects Versions: 0.9.0
 Environment: Java 1.7,Hadoop 2.2.0,Spark 0.9.0,Ubuntu 12.4,
Reporter: Pavan Kumar Varma
Priority: Critical
  Labels: GridFS, MongoDB, Spark, hadoop2, java
   Original Estimate: 12h
  Remaining Estimate: 12h

 I saved a 2GB pdf file into MongoDB using GridFS. now i want to process that 
 GridFS collection data using the Java Spark MapReduce API. previously i have 
 successfully processed mongoDB collections with Apache Spark using the 
 Mongo-Hadoop connector. now i'm unable to process GridFS collections with the 
 following code.
 MongoConfigUtil.setInputURI(config, 
 "mongodb://localhost:27017/pdfbooks.fs.chunks");
  MongoConfigUtil.setOutputURI(config, "mongodb://localhost:27017/" + output);
  JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
 com.mongodb.hadoop.MongoInputFormat.class, Object.class,
 BSONObject.class);
  JavaRDD<String> words = mongoRDD.flatMap(new 
 FlatMapFunction<Tuple2<Object, BSONObject>, String>() {
@Override
public Iterable<String> call(Tuple2<Object, BSONObject> arg) {
System.out.println(arg._2.toString());
...
 Please suggest/provide  better API methods to access MongoDB GridFS data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1177) Allow SPARK_JAR to be set in system properties

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142447#comment-14142447
 ] 

Matthew Farrellee commented on SPARK-1177:
--

[~epakhomov] it looks like this has been resolved in another change, for instance 
being able to use spark.yarn.jar. i'm going to close this, but feel free to 
re-open if you think it is still important.

 Allow SPARK_JAR to be set in system properties
 --

 Key: SPARK-1177
 URL: https://issues.apache.org/jira/browse/SPARK-1177
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Egor Pakhomov
Priority: Minor
 Fix For: 0.9.0


 I'd like to be able to do from my scala code:
   System.setProperty("SPARK_YARN_APP_JAR", 
 SparkContext.jarOfClass(this.getClass).head)
   System.setProperty("SPARK_JAR", 
 SparkContext.jarOfClass(SparkContext.getClass).head)
 And do nothing on OS level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-1177) Allow SPARK_JAR to be set in system properties

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-1177.

   Resolution: Fixed
Fix Version/s: (was: 0.9.0)

 Allow SPARK_JAR to be set in system properties
 --

 Key: SPARK-1177
 URL: https://issues.apache.org/jira/browse/SPARK-1177
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Egor Pakhomov
Priority: Minor

 I'd like to be able to do from my scala code:
   System.setProperty("SPARK_YARN_APP_JAR", 
 SparkContext.jarOfClass(this.getClass).head)
   System.setProperty("SPARK_JAR", 
 SparkContext.jarOfClass(SparkContext.getClass).head)
 And do nothing on OS level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1176) Adding port configuration for HttpBroadcast

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142454#comment-14142454
 ] 

Matthew Farrellee commented on SPARK-1176:
--

[~epakhomov] it looks like this was resolved by SPARK-2157. i'm going to close 
this, but please feel free to re-open if it is still an issue for you.

 Adding port configuration for HttpBroadcast
 ---

 Key: SPARK-1176
 URL: https://issues.apache.org/jira/browse/SPARK-1176
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Egor Pakhomov
Priority: Minor
 Fix For: 0.9.0


 I run spark in big organization, where to open port accessible to other 
 computers in network, I need to create a ticket on DevOps and it executes for 
 days. I can't have port for some spark service to be changed all the time. I 
 need ability to configure this port.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1176) Adding port configuration for HttpBroadcast

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee resolved SPARK-1176.
--
   Resolution: Fixed
Fix Version/s: (was: 0.9.0)
   1.1.0

 Adding port configuration for HttpBroadcast
 ---

 Key: SPARK-1176
 URL: https://issues.apache.org/jira/browse/SPARK-1176
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Egor Pakhomov
Priority: Minor
 Fix For: 1.1.0


 I run spark in big organization, where to open port accessible to other 
 computers in network, I need to create a ticket on DevOps and it executes for 
 days. I can't have port for some spark service to be changed all the time. I 
 need ability to configure this port.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-1748) I installed the spark_standalone,but I did not know how to use sbt to compile the programme of spark?

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-1748.

   Resolution: Done
Fix Version/s: (was: 0.8.1)

 I installed the spark_standalone,but I did not know how to use sbt to compile 
 the programme of spark?
 -

 Key: SPARK-1748
 URL: https://issues.apache.org/jira/browse/SPARK-1748
 Project: Spark
  Issue Type: Test
  Components: Build
Affects Versions: 0.8.1
 Environment: spark standalone 
Reporter: lxflyl

 I installed the mode of spark standalone ,but I did not understand how to use 
 sbt to compile the program of spark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1748) I installed the spark_standalone,but I did not know how to use sbt to compile the programme of spark?

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142455#comment-14142455
 ] 

Matthew Farrellee commented on SPARK-1748:
--

thanks for the question. you'll get a better response asking on the mailing 
lists, see http://spark.apache.org/community.html, so i'm going to close this 
out.

 I installed the spark_standalone,but I did not know how to use sbt to compile 
 the programme of spark?
 -

 Key: SPARK-1748
 URL: https://issues.apache.org/jira/browse/SPARK-1748
 Project: Spark
  Issue Type: Test
  Components: Build
Affects Versions: 0.8.1
 Environment: spark standalone 
Reporter: lxflyl

 I installed the mode of spark standalone ,but I did not understand how to use 
 sbt to compile the program of spark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-614) Make last 4 digits of framework id in standalone mode logging monotonically increasing

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-614.
---
   Resolution: Unresolved
Fix Version/s: (was: 0.7.1)

 Make last 4 digits of framework id in standalone mode logging monotonically 
 increasing
 --

 Key: SPARK-614
 URL: https://issues.apache.org/jira/browse/SPARK-614
 Project: Spark
  Issue Type: Improvement
Reporter: Reynold Xin
Assignee: Denny Britz

 In mesos mode, the work log directories are monotonically increasing, and 
 makes it very easy to spot a folder and go into it (e.g. only need to type 
 *[last4digit]).
 We lost this in the standalone mode, as seen in this example. The last four 
 digits would go up and down 
 drwxr-xr-x 3 root root 4096 Nov  8 08:03 job-20121108080355-
 drwxr-xr-x 3 root root 4096 Nov  8 08:04 job-20121108080450-0001
 drwxr-xr-x 3 root root 4096 Nov  8 08:07 job-20121108080757-0002
 drwxr-xr-x 3 root root 4096 Nov  8 08:10 job-20121108081014-0003
 drwxr-xr-x 3 root root 4096 Nov  8 08:23 job-20121108082316-0004
 drwxr-xr-x 3 root root 4096 Nov  8 08:26 job-20121108082616-0005
 drwxr-xr-x 3 root root 4096 Nov  8 08:30 job-20121108083034-0006
 drwxr-xr-x 3 root root 4096 Nov  8 08:35 job-20121108083514-0007
 drwxr-xr-x 3 root root 4096 Nov  8 08:38 job-20121108083807-0008
 drwxr-xr-x 3 root root 4096 Nov  8 08:41 job-20121108084105-0009
 drwxr-xr-x 3 root root 4096 Nov  8 08:42 job-20121108084242-0010
 drwxr-xr-x 3 root root 4096 Nov  8 08:45 job-20121108084512-0011
 drwxr-xr-x 3 root root 4096 Nov  8 09:01 job-20121108090113-
 drwxr-xr-x 3 root root 4096 Nov  8 09:15 job-20121108091536-0001
 drwxr-xr-x 3 root root 4096 Nov  8 09:24 job-20121108092341-0003
 drwxr-xr-x 3 root root 4096 Nov  8 09:27 job-20121108092703-
 drwxr-xr-x 3 root root 4096 Nov  8 09:46 job-20121108094629-0001
 drwxr-xr-x 3 root root 4096 Nov  8 09:48 job-20121108094809-0002
 drwxr-xr-x 3 root root 4096 Nov  8 10:04 job-20121108100418-0003
 drwxr-xr-x 3 root root 4096 Nov  8 10:18 job-20121108101814-0004
 drwxr-xr-x 3 root root 4096 Nov  8 10:22 job-20121108102207-0005
 drwxr-xr-x 3 root root 4096 Nov  8 18:48 job-20121108184842-0006
 drwxr-xr-x 3 root root 4096 Nov  8 18:49 job-20121108184932-0007
 drwxr-xr-x 3 root root 4096 Nov  8 18:50 job-20121108185007-0008
 drwxr-xr-x 3 root root 4096 Nov  8 18:50 job-20121108185040-0009
 drwxr-xr-x 3 root root 4096 Nov  8 18:51 job-20121108185127-0010
 drwxr-xr-x 3 root root 4096 Nov  8 18:54 job-20121108185428-0011
 drwxr-xr-x 3 root root 4096 Nov  8 18:58 job-20121108185837-0012
 drwxr-xr-x 3 root root 4096 Nov  8 18:58 job-20121108185854-0013
 drwxr-xr-x 3 root root 4096 Nov  8 19:00 job-20121108190005-0014
 drwxr-xr-x 3 root root 4096 Nov  8 19:00 job-20121108190059-0015
 drwxr-xr-x 3 root root 4096 Nov  8 19:10 job-20121108191010-0016
 drwxr-xr-x 3 root root 4096 Nov  8 19:15 job-20121108191508-0017
 drwxr-xr-x 3 root root 4096 Nov  8 19:21 job-20121108192125-0018
 drwxr-xr-x 3 root root 4096 Nov  8 19:23 job-20121108192329-0019
 drwxr-xr-x 3 root root 4096 Nov  8 19:26 job-20121108192638-0020
 drwxr-xr-x 3 root root 4096 Nov  8 19:35 job-20121108193554-0022



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-614) Make last 4 digits of framework id in standalone mode logging monotonically increasing

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142456#comment-14142456
 ] 

Matthew Farrellee commented on SPARK-614:
-

it looks like nothing has happened with this in the past 23 months. i'm going 
to close this, but feel free to re-open.

 Make last 4 digits of framework id in standalone mode logging monotonically 
 increasing
 --

 Key: SPARK-614
 URL: https://issues.apache.org/jira/browse/SPARK-614
 Project: Spark
  Issue Type: Improvement
Reporter: Reynold Xin
Assignee: Denny Britz

 In mesos mode, the work log directories are monotonically increasing, which 
 makes it very easy to spot a folder and go into it (e.g. you only need to type 
 *[last4digit]).
 We lost this in standalone mode, as seen in this example, where the last four 
 digits go up and down: 
 drwxr-xr-x 3 root root 4096 Nov  8 08:03 job-20121108080355-0000
 drwxr-xr-x 3 root root 4096 Nov  8 08:04 job-20121108080450-0001
 drwxr-xr-x 3 root root 4096 Nov  8 08:07 job-20121108080757-0002
 drwxr-xr-x 3 root root 4096 Nov  8 08:10 job-20121108081014-0003
 drwxr-xr-x 3 root root 4096 Nov  8 08:23 job-20121108082316-0004
 drwxr-xr-x 3 root root 4096 Nov  8 08:26 job-20121108082616-0005
 drwxr-xr-x 3 root root 4096 Nov  8 08:30 job-20121108083034-0006
 drwxr-xr-x 3 root root 4096 Nov  8 08:35 job-20121108083514-0007
 drwxr-xr-x 3 root root 4096 Nov  8 08:38 job-20121108083807-0008
 drwxr-xr-x 3 root root 4096 Nov  8 08:41 job-20121108084105-0009
 drwxr-xr-x 3 root root 4096 Nov  8 08:42 job-20121108084242-0010
 drwxr-xr-x 3 root root 4096 Nov  8 08:45 job-20121108084512-0011
 drwxr-xr-x 3 root root 4096 Nov  8 09:01 job-20121108090113-0000
 drwxr-xr-x 3 root root 4096 Nov  8 09:15 job-20121108091536-0001
 drwxr-xr-x 3 root root 4096 Nov  8 09:24 job-20121108092341-0003
 drwxr-xr-x 3 root root 4096 Nov  8 09:27 job-20121108092703-0000
 drwxr-xr-x 3 root root 4096 Nov  8 09:46 job-20121108094629-0001
 drwxr-xr-x 3 root root 4096 Nov  8 09:48 job-20121108094809-0002
 drwxr-xr-x 3 root root 4096 Nov  8 10:04 job-20121108100418-0003
 drwxr-xr-x 3 root root 4096 Nov  8 10:18 job-20121108101814-0004
 drwxr-xr-x 3 root root 4096 Nov  8 10:22 job-20121108102207-0005
 drwxr-xr-x 3 root root 4096 Nov  8 18:48 job-20121108184842-0006
 drwxr-xr-x 3 root root 4096 Nov  8 18:49 job-20121108184932-0007
 drwxr-xr-x 3 root root 4096 Nov  8 18:50 job-20121108185007-0008
 drwxr-xr-x 3 root root 4096 Nov  8 18:50 job-20121108185040-0009
 drwxr-xr-x 3 root root 4096 Nov  8 18:51 job-20121108185127-0010
 drwxr-xr-x 3 root root 4096 Nov  8 18:54 job-20121108185428-0011
 drwxr-xr-x 3 root root 4096 Nov  8 18:58 job-20121108185837-0012
 drwxr-xr-x 3 root root 4096 Nov  8 18:58 job-20121108185854-0013
 drwxr-xr-x 3 root root 4096 Nov  8 19:00 job-20121108190005-0014
 drwxr-xr-x 3 root root 4096 Nov  8 19:00 job-20121108190059-0015
 drwxr-xr-x 3 root root 4096 Nov  8 19:10 job-20121108191010-0016
 drwxr-xr-x 3 root root 4096 Nov  8 19:15 job-20121108191508-0017
 drwxr-xr-x 3 root root 4096 Nov  8 19:21 job-20121108192125-0018
 drwxr-xr-x 3 root root 4096 Nov  8 19:23 job-20121108192329-0019
 drwxr-xr-x 3 root root 4096 Nov  8 19:26 job-20121108192638-0020
 drwxr-xr-x 3 root root 4096 Nov  8 19:35 job-20121108193554-0022



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-719) Add FAQ page to documentation or webpage

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-719.
---
   Resolution: Done
Fix Version/s: (was: 0.7.1)

 Add FAQ page to documentation or webpage
 

 Key: SPARK-719
 URL: https://issues.apache.org/jira/browse/SPARK-719
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Andy Konwinski
Assignee: Andy Konwinski

 Lots of issues on the mailing list are redundant (e.g., Patrick mentioned 
 this question has been asked/answered multiple times 
 https://groups.google.com/d/msg/spark-users/-mYn6BF-Y5Y/8qeXuxs8_d0J).
 We should put the solutions to common problems on an FAQ page in the 
 documentation or on the webpage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-719) Add FAQ page to documentation or webpage

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142459#comment-14142459
 ] 

Matthew Farrellee commented on SPARK-719:
-

it looks like this has some good content, but it's stale and likely needs 
vetting.

the new FAQ location is http://spark.apache.org/faq.html

i'm going to close this since there has been no progress. note - it'll still be 
available via search

feel free to re-open if you disagree.

 Add FAQ page to documentation or webpage
 

 Key: SPARK-719
 URL: https://issues.apache.org/jira/browse/SPARK-719
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Andy Konwinski
Assignee: Andy Konwinski

 Lots of issues on the mailing list are redundant (e.g., Patrick mentioned 
 this question has been asked/answered multiple times 
 https://groups.google.com/d/msg/spark-users/-mYn6BF-Y5Y/8qeXuxs8_d0J).
 We should put the solutions to common problems on an FAQ page in the 
 documentation or on the webpage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-637) Create troubleshooting checklist

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-637.
---
Resolution: Later

 Create troubleshooting checklist
 

 Key: SPARK-637
 URL: https://issues.apache.org/jira/browse/SPARK-637
 Project: Spark
  Issue Type: New Feature
  Components: Documentation
Reporter: Josh Rosen

 We should provide a checklist for troubleshooting common Spark problems.
 For example, it could include steps like check that the Spark code was 
 copied to all nodes and check that the workers successfully connect to the 
 master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-637) Create troubleshooting checklist

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142463#comment-14142463
 ] 

Matthew Farrellee commented on SPARK-637:
-

this is a good idea, and it will take a significant amount of effort. it looks 
like nothing has happened for almost 2 years. i'm going to close this, but feel 
free to re-open and push forward with it.

 Create troubleshooting checklist
 

 Key: SPARK-637
 URL: https://issues.apache.org/jira/browse/SPARK-637
 Project: Spark
  Issue Type: New Feature
  Components: Documentation
Reporter: Josh Rosen

 We should provide a checklist for troubleshooting common Spark problems.
 For example, it could include steps like check that the Spark code was 
 copied to all nodes and check that the workers successfully connect to the 
 master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3593) Support Sorting of Binary Type Data

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142468#comment-14142468
 ] 

Matthew Farrellee commented on SPARK-3593:
--

[~pmagid] will you provide some example code that demonstrates your issue?
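
for reference, a minimal sketch of the kind of query involved (spark 1.1-era SQLContext api; the case class, data, and table name here are made up):

{code}
import org.apache.spark.sql.SQLContext

case class Blob(id: Int, payload: Array[Byte])

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD

val blobs = sc.parallelize(Seq(Blob(1, Array[Byte](2, 1)), Blob(2, Array[Byte](1, 2))))
blobs.registerTempTable("blobs")

// ordering by the BinaryType column is the operation this issue asks to support
sqlContext.sql("SELECT id FROM blobs ORDER BY payload").collect()
{code}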

 Support Sorting of Binary Type Data
 ---

 Key: SPARK-3593
 URL: https://issues.apache.org/jira/browse/SPARK-3593
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.1.0
Reporter: Paul Magid

 If you try sorting on a binary field you currently get an exception.   Please 
 add support for binary data type sorting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-537) driver.run() returned with code DRIVER_ABORTED

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142474#comment-14142474
 ] 

Matthew Farrellee commented on SPARK-537:
-

this should be resolved by a number of fixes in 1.0. please re-open if it still 
reproduces.

 driver.run() returned with code DRIVER_ABORTED
 --

 Key: SPARK-537
 URL: https://issues.apache.org/jira/browse/SPARK-537
 Project: Spark
  Issue Type: Bug
Reporter: yshaw

 Hi there,
 When I try to run Spark on Mesos as a cluster, some errors happen like this:
 ```
  ./run spark.examples.SparkPi *.*.*.*:5050
 12/09/07 14:49:28 INFO spark.BoundedMemoryCache: BoundedMemoryCache.maxBytes 
 = 994836480
 12/09/07 14:49:28 INFO spark.CacheTrackerActor: Registered actor on port 7077
 12/09/07 14:49:28 INFO spark.CacheTrackerActor: Started slave cache (size 
 948.8MB) on shawpc
 12/09/07 14:49:28 INFO spark.MapOutputTrackerActor: Registered actor on port 
 7077
 12/09/07 14:49:28 INFO spark.ShuffleManager: Shuffle dir: 
 /tmp/spark-local-81220c47-bc43-4809-ac48-5e3e8e023c8a/shuffle
 12/09/07 14:49:28 INFO server.Server: jetty-7.5.3.v20111011
 12/09/07 14:49:28 INFO server.AbstractConnector: Started 
 SelectChannelConnector@0.0.0.0:57595 STARTING
 12/09/07 14:49:28 INFO spark.ShuffleManager: Local URI: http://127.0.1.1:57595
 12/09/07 14:49:28 INFO server.Server: jetty-7.5.3.v20111011
 12/09/07 14:49:28 INFO server.AbstractConnector: Started 
 SelectChannelConnector@0.0.0.0:60113 STARTING
 12/09/07 14:49:28 INFO broadcast.HttpBroadcast: Broadcast server started at 
 http://127.0.1.1:60113
 12/09/07 14:49:28 INFO spark.MesosScheduler: Temp directory for JARs: 
 /tmp/spark-d541f37c-ae35-476c-b2fc-9908b0739f50
 12/09/07 14:49:28 INFO server.Server: jetty-7.5.3.v20111011
 12/09/07 14:49:28 INFO server.AbstractConnector: Started 
 SelectChannelConnector@0.0.0.0:50511 STARTING
 12/09/07 14:49:28 INFO spark.MesosScheduler: JAR server started at 
 http://127.0.1.1:50511
 12/09/07 14:49:28 INFO spark.MesosScheduler: Registered as framework ID 
 201209071448-846324308-5050-26925-0000
 12/09/07 14:49:29 INFO spark.SparkContext: Starting job...
 12/09/07 14:49:29 INFO spark.CacheTracker: Registering RDD ID 1 with cache
 12/09/07 14:49:29 INFO spark.CacheTrackerActor: Registering RDD 1 with 2 
 partitions
 12/09/07 14:49:29 INFO spark.CacheTracker: Registering RDD ID 0 with cache
 12/09/07 14:49:29 INFO spark.CacheTrackerActor: Registering RDD 0 with 2 
 partitions
 12/09/07 14:49:29 INFO spark.CacheTrackerActor: Asked for current cache 
 locations
 12/09/07 14:49:29 INFO spark.MesosScheduler: Final stage: Stage 0
 12/09/07 14:49:29 INFO spark.MesosScheduler: Parents of final stage: List()
 12/09/07 14:49:29 INFO spark.MesosScheduler: Missing parents: List()
 12/09/07 14:49:29 INFO spark.MesosScheduler: Submitting Stage 0, which has no 
 missing parents
 12/09/07 14:49:29 INFO spark.MesosScheduler: Got a job with 2 tasks
 12/09/07 14:49:29 INFO spark.MesosScheduler: Adding job with ID 0
 12/09/07 14:49:29 INFO spark.SimpleJob: Starting task 0:0 as TID 0 on slave 
 201209071448-846324308-5050-26925-0: shawpc (preferred)
 12/09/07 14:49:29 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
 took 52 ms to serialize by spark.JavaSerializerInstance
 12/09/07 14:49:29 INFO spark.SimpleJob: Starting task 0:1 as TID 1 on slave 
 201209071448-846324308-5050-26925-0: shawpc (preferred)
 12/09/07 14:49:29 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and 
 took 1 ms to serialize by spark.JavaSerializerInstance
 12/09/07 14:49:30 INFO spark.SimpleJob: Lost TID 0 (task 0:0)
 12/09/07 14:49:30 INFO spark.SimpleJob: Starting task 0:0 as TID 2 on slave 
 201209071448-846324308-5050-26925-0: shawpc (preferred)
 12/09/07 14:49:30 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
 took 0 ms to serialize by spark.JavaSerializerInstance
 12/09/07 14:49:30 INFO spark.SimpleJob: Lost TID 1 (task 0:1)
 12/09/07 14:49:30 INFO spark.SimpleJob: Lost TID 2 (task 0:0)
 12/09/07 14:49:30 INFO spark.SimpleJob: Starting task 0:0 as TID 3 on slave 
 201209071448-846324308-5050-26925-0: shawpc (preferred)
 12/09/07 14:49:30 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
 took 2 ms to serialize by spark.JavaSerializerInstance
 12/09/07 14:49:32 INFO spark.SimpleJob: Starting task 0:1 as TID 4 on slave 
 201209071448-846324308-5050-26925-0: shawpc (preferred)
 12/09/07 14:49:32 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and 
 took 1 ms to serialize by spark.JavaSerializerInstance
 12/09/07 14:49:32 INFO spark.SimpleJob: Lost TID 3 (task 0:0)
 12/09/07 14:49:32 INFO spark.SimpleJob: Starting task 0:0 as TID 5 on slave 
 201209071448-846324308-5050-26925-0: shawpc (preferred)
 12/09/07 14:49:32 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
 took 0 ms

[jira] [Resolved] (SPARK-537) driver.run() returned with code DRIVER_ABORTED

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee resolved SPARK-537.
-
   Resolution: Fixed
Fix Version/s: 1.0.0

 driver.run() returned with code DRIVER_ABORTED
 --

 Key: SPARK-537
 URL: https://issues.apache.org/jira/browse/SPARK-537
 Project: Spark
  Issue Type: Bug
Reporter: yshaw
 Fix For: 1.0.0


 Hi there,
 When I try to run Spark on Mesos as a cluster, some errors happen like this:
 ```
  ./run spark.examples.SparkPi *.*.*.*:5050
 12/09/07 14:49:28 INFO spark.BoundedMemoryCache: BoundedMemoryCache.maxBytes 
 = 994836480
 12/09/07 14:49:28 INFO spark.CacheTrackerActor: Registered actor on port 7077
 12/09/07 14:49:28 INFO spark.CacheTrackerActor: Started slave cache (size 
 948.8MB) on shawpc
 12/09/07 14:49:28 INFO spark.MapOutputTrackerActor: Registered actor on port 
 7077
 12/09/07 14:49:28 INFO spark.ShuffleManager: Shuffle dir: 
 /tmp/spark-local-81220c47-bc43-4809-ac48-5e3e8e023c8a/shuffle
 12/09/07 14:49:28 INFO server.Server: jetty-7.5.3.v20111011
 12/09/07 14:49:28 INFO server.AbstractConnector: Started 
 SelectChannelConnector@0.0.0.0:57595 STARTING
 12/09/07 14:49:28 INFO spark.ShuffleManager: Local URI: http://127.0.1.1:57595
 12/09/07 14:49:28 INFO server.Server: jetty-7.5.3.v20111011
 12/09/07 14:49:28 INFO server.AbstractConnector: Started 
 SelectChannelConnector@0.0.0.0:60113 STARTING
 12/09/07 14:49:28 INFO broadcast.HttpBroadcast: Broadcast server started at 
 http://127.0.1.1:60113
 12/09/07 14:49:28 INFO spark.MesosScheduler: Temp directory for JARs: 
 /tmp/spark-d541f37c-ae35-476c-b2fc-9908b0739f50
 12/09/07 14:49:28 INFO server.Server: jetty-7.5.3.v20111011
 12/09/07 14:49:28 INFO server.AbstractConnector: Started 
 SelectChannelConnector@0.0.0.0:50511 STARTING
 12/09/07 14:49:28 INFO spark.MesosScheduler: JAR server started at 
 http://127.0.1.1:50511
 12/09/07 14:49:28 INFO spark.MesosScheduler: Registered as framework ID 
 201209071448-846324308-5050-26925-0000
 12/09/07 14:49:29 INFO spark.SparkContext: Starting job...
 12/09/07 14:49:29 INFO spark.CacheTracker: Registering RDD ID 1 with cache
 12/09/07 14:49:29 INFO spark.CacheTrackerActor: Registering RDD 1 with 2 
 partitions
 12/09/07 14:49:29 INFO spark.CacheTracker: Registering RDD ID 0 with cache
 12/09/07 14:49:29 INFO spark.CacheTrackerActor: Registering RDD 0 with 2 
 partitions
 12/09/07 14:49:29 INFO spark.CacheTrackerActor: Asked for current cache 
 locations
 12/09/07 14:49:29 INFO spark.MesosScheduler: Final stage: Stage 0
 12/09/07 14:49:29 INFO spark.MesosScheduler: Parents of final stage: List()
 12/09/07 14:49:29 INFO spark.MesosScheduler: Missing parents: List()
 12/09/07 14:49:29 INFO spark.MesosScheduler: Submitting Stage 0, which has no 
 missing parents
 12/09/07 14:49:29 INFO spark.MesosScheduler: Got a job with 2 tasks
 12/09/07 14:49:29 INFO spark.MesosScheduler: Adding job with ID 0
 12/09/07 14:49:29 INFO spark.SimpleJob: Starting task 0:0 as TID 0 on slave 
 201209071448-846324308-5050-26925-0: shawpc (preferred)
 12/09/07 14:49:29 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
 took 52 ms to serialize by spark.JavaSerializerInstance
 12/09/07 14:49:29 INFO spark.SimpleJob: Starting task 0:1 as TID 1 on slave 
 201209071448-846324308-5050-26925-0: shawpc (preferred)
 12/09/07 14:49:29 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and 
 took 1 ms to serialize by spark.JavaSerializerInstance
 12/09/07 14:49:30 INFO spark.SimpleJob: Lost TID 0 (task 0:0)
 12/09/07 14:49:30 INFO spark.SimpleJob: Starting task 0:0 as TID 2 on slave 
 201209071448-846324308-5050-26925-0: shawpc (preferred)
 12/09/07 14:49:30 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
 took 0 ms to serialize by spark.JavaSerializerInstance
 12/09/07 14:49:30 INFO spark.SimpleJob: Lost TID 1 (task 0:1)
 12/09/07 14:49:30 INFO spark.SimpleJob: Lost TID 2 (task 0:0)
 12/09/07 14:49:30 INFO spark.SimpleJob: Starting task 0:0 as TID 3 on slave 
 201209071448-846324308-5050-26925-0: shawpc (preferred)
 12/09/07 14:49:30 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
 took 2 ms to serialize by spark.JavaSerializerInstance
 12/09/07 14:49:32 INFO spark.SimpleJob: Starting task 0:1 as TID 4 on slave 
 201209071448-846324308-5050-26925-0: shawpc (preferred)
 12/09/07 14:49:32 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and 
 took 1 ms to serialize by spark.JavaSerializerInstance
 12/09/07 14:49:32 INFO spark.SimpleJob: Lost TID 3 (task 0:0)
 12/09/07 14:49:32 INFO spark.SimpleJob: Starting task 0:0 as TID 5 on slave 
 201209071448-846324308-5050-26925-0: shawpc (preferred)
 12/09/07 14:49:32 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
 took 0 ms to serialize by spark.JavaSerializerInstance
 12/09/07 14:49:32 INFO

[jira] [Commented] (SPARK-538) INFO spark.MesosScheduler: Ignoring update from TID 9 because its job is gone

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142475#comment-14142475
 ] 

Matthew Farrellee commented on SPARK-538:
-

this is a reasonable question for the user list, see 
http://spark.apache.org/community.html. i'm going to close this in favor of 
user list interaction. if you disagree, please re-open.

 INFO spark.MesosScheduler: Ignoring update from TID 9 because its job is gone
 -

 Key: SPARK-538
 URL: https://issues.apache.org/jira/browse/SPARK-538
 Project: Spark
  Issue Type: Bug
Reporter: vince67

 Hi Matei,
 Maybe I can't describe it clearly.
 We run the master and slaves on different machines, and that works fine.
 But when we run spark.examples.SparkPi on the master, our 
 process hangs and we never get the result.
 Description follows:
  

 12/09/02 16:47:54 INFO spark.BoundedMemoryCache: BoundedMemoryCache.maxBytes 
 = 339585269
 12/09/02 16:47:54 INFO spark.CacheTrackerActor: Registered actor on port 7077
 12/09/02 16:47:54 INFO spark.CacheTrackerActor: Started slave cache (size 
 323.9MB) on vince67-ThinkCentre-
 12/09/02 16:47:54 INFO spark.MapOutputTrackerActor: Registered actor on port 
 7077
 12/09/02 16:47:54 INFO spark.ShuffleManager: Shuffle dir: 
 /tmp/spark-local-3e79b235-1b94-44d1-823b-0369f6698688/shuffle
 12/09/02 16:47:54 INFO server.Server: jetty-7.5.3.v20111011
 12/09/02 16:47:54 INFO server.AbstractConnector: Started 
 SelectChannelConnector@0.0.0.0:49578 STARTING
 12/09/02 16:47:54 INFO spark.ShuffleManager: Local URI: 
 http://ip.ip.ip.ip:49578
 12/09/02 16:47:55 INFO server.Server: jetty-7.5.3.v20111011
 12/09/02 16:47:55 INFO server.AbstractConnector: Started 
 SelectChannelConnector@0.0.0.0:49600 STARTING
 12/09/02 16:47:55 INFO broadcast.HttpBroadcast: Broadcast server started at 
 http://ip.ip.ip.ip:49600
 12/09/02 16:47:55 INFO spark.MesosScheduler: Registered as framework ID 
 201209021640-74572372-5050-16898-0004
 12/09/02 16:47:55 INFO spark.SparkContext: Starting job...
 12/09/02 16:47:55 INFO spark.CacheTracker: Registering RDD ID 1 with cache
 12/09/02 16:47:55 INFO spark.CacheTrackerActor: Registering RDD 1 with 2 
 partitions
 12/09/02 16:47:55 INFO spark.CacheTracker: Registering RDD ID 0 with cache
 12/09/02 16:47:55 INFO spark.CacheTrackerActor: Registering RDD 0 with 2 
 partitions
 12/09/02 16:47:55 INFO spark.CacheTrackerActor: Asked for current cache 
 locations
 12/09/02 16:47:55 INFO spark.MesosScheduler: Final stage: Stage 0
 12/09/02 16:47:55 INFO spark.MesosScheduler: Parents of final stage: List()
 12/09/02 16:47:55 INFO spark.MesosScheduler: Missing parents: List()
 12/09/02 16:47:55 INFO spark.MesosScheduler: Submitting Stage 0, which has no 
 missing parents
 12/09/02 16:47:55 INFO spark.MesosScheduler: Got a job with 2 tasks
 12/09/02 16:47:55 INFO spark.MesosScheduler: Adding job with ID 0
 12/09/02 16:47:55 INFO spark.SimpleJob: Starting task 0:0 as TID 0 on slave 
 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred)
 12/09/02 16:47:55 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
 took 151 ms to serialize by spark.JavaSerializerInstance
 12/09/02 16:47:55 INFO spark.SimpleJob: Starting task 0:1 as TID 1 on slave 
 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred)
 12/09/02 16:47:55 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and 
 took 1 ms to serialize by spark.JavaSerializerInstance
 12/09/02 16:47:56 INFO spark.SimpleJob: Lost TID 0 (task 0:0)
 12/09/02 16:47:56 INFO spark.SimpleJob: Starting task 0:0 as TID 2 on slave 
 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred)
 12/09/02 16:47:56 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
 took 1 ms to serialize by spark.JavaSerializerInstance
 12/09/02 16:47:56 INFO spark.SimpleJob: Lost TID 1 (task 0:1)
 12/09/02 16:47:56 INFO spark.SimpleJob: Starting task 0:1 as TID 3 on slave 
 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred)
 12/09/02 16:47:56 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and 
 took 5 ms to serialize by spark.JavaSerializerInstance
 12/09/02 16:47:57 INFO spark.SimpleJob: Lost TID 2 (task 0:0)
 12/09/02 16:47:57 INFO spark.SimpleJob: Starting task 0:0 as TID 4 on slave 
 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred)
 12/09/02 16:47:57 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
 took 1 ms to serialize by spark.JavaSerializerInstance
 12/09/02 16:47:57 INFO spark.SimpleJob: Lost TID 3 (task 0:1)
 12/09/02 16:47:57 INFO spark.SimpleJob: Starting task 0:1 as TID 5 on slave 
 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred)
 12/09/02 16:47:57 INFO spark.SimpleJob: Size of task

[jira] [Closed] (SPARK-538) INFO spark.MesosScheduler: Ignoring update from TID 9 because its job is gone

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-538.
---
Resolution: Done

 INFO spark.MesosScheduler: Ignoring update from TID 9 because its job is gone
 -

 Key: SPARK-538
 URL: https://issues.apache.org/jira/browse/SPARK-538
 Project: Spark
  Issue Type: Bug
Reporter: vince67

 Hi Matei,
 Maybe I can't describe it clearly.
 We run the master and slaves on different machines, and that works fine.
 But when we run spark.examples.SparkPi on the master, our 
 process hangs and we never get the result.
 Description follows:
  

 12/09/02 16:47:54 INFO spark.BoundedMemoryCache: BoundedMemoryCache.maxBytes 
 = 339585269
 12/09/02 16:47:54 INFO spark.CacheTrackerActor: Registered actor on port 7077
 12/09/02 16:47:54 INFO spark.CacheTrackerActor: Started slave cache (size 
 323.9MB) on vince67-ThinkCentre-
 12/09/02 16:47:54 INFO spark.MapOutputTrackerActor: Registered actor on port 
 7077
 12/09/02 16:47:54 INFO spark.ShuffleManager: Shuffle dir: 
 /tmp/spark-local-3e79b235-1b94-44d1-823b-0369f6698688/shuffle
 12/09/02 16:47:54 INFO server.Server: jetty-7.5.3.v20111011
 12/09/02 16:47:54 INFO server.AbstractConnector: Started 
 SelectChannelConnector@0.0.0.0:49578 STARTING
 12/09/02 16:47:54 INFO spark.ShuffleManager: Local URI: 
 http://ip.ip.ip.ip:49578
 12/09/02 16:47:55 INFO server.Server: jetty-7.5.3.v20111011
 12/09/02 16:47:55 INFO server.AbstractConnector: Started 
 SelectChannelConnector@0.0.0.0:49600 STARTING
 12/09/02 16:47:55 INFO broadcast.HttpBroadcast: Broadcast server started at 
 http://ip.ip.ip.ip:49600
 12/09/02 16:47:55 INFO spark.MesosScheduler: Registered as framework ID 
 201209021640-74572372-5050-16898-0004
 12/09/02 16:47:55 INFO spark.SparkContext: Starting job...
 12/09/02 16:47:55 INFO spark.CacheTracker: Registering RDD ID 1 with cache
 12/09/02 16:47:55 INFO spark.CacheTrackerActor: Registering RDD 1 with 2 
 partitions
 12/09/02 16:47:55 INFO spark.CacheTracker: Registering RDD ID 0 with cache
 12/09/02 16:47:55 INFO spark.CacheTrackerActor: Registering RDD 0 with 2 
 partitions
 12/09/02 16:47:55 INFO spark.CacheTrackerActor: Asked for current cache 
 locations
 12/09/02 16:47:55 INFO spark.MesosScheduler: Final stage: Stage 0
 12/09/02 16:47:55 INFO spark.MesosScheduler: Parents of final stage: List()
 12/09/02 16:47:55 INFO spark.MesosScheduler: Missing parents: List()
 12/09/02 16:47:55 INFO spark.MesosScheduler: Submitting Stage 0, which has no 
 missing parents
 12/09/02 16:47:55 INFO spark.MesosScheduler: Got a job with 2 tasks
 12/09/02 16:47:55 INFO spark.MesosScheduler: Adding job with ID 0
 12/09/02 16:47:55 INFO spark.SimpleJob: Starting task 0:0 as TID 0 on slave 
 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred)
 12/09/02 16:47:55 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
 took 151 ms to serialize by spark.JavaSerializerInstance
 12/09/02 16:47:55 INFO spark.SimpleJob: Starting task 0:1 as TID 1 on slave 
 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred)
 12/09/02 16:47:55 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and 
 took 1 ms to serialize by spark.JavaSerializerInstance
 12/09/02 16:47:56 INFO spark.SimpleJob: Lost TID 0 (task 0:0)
 12/09/02 16:47:56 INFO spark.SimpleJob: Starting task 0:0 as TID 2 on slave 
 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred)
 12/09/02 16:47:56 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
 took 1 ms to serialize by spark.JavaSerializerInstance
 12/09/02 16:47:56 INFO spark.SimpleJob: Lost TID 1 (task 0:1)
 12/09/02 16:47:56 INFO spark.SimpleJob: Starting task 0:1 as TID 3 on slave 
 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred)
 12/09/02 16:47:56 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and 
 took 5 ms to serialize by spark.JavaSerializerInstance
 12/09/02 16:47:57 INFO spark.SimpleJob: Lost TID 2 (task 0:0)
 12/09/02 16:47:57 INFO spark.SimpleJob: Starting task 0:0 as TID 4 on slave 
 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred)
 12/09/02 16:47:57 INFO spark.SimpleJob: Size of task 0:0 is 1606 bytes and 
 took 1 ms to serialize by spark.JavaSerializerInstance
 12/09/02 16:47:57 INFO spark.SimpleJob: Lost TID 3 (task 0:1)
 12/09/02 16:47:57 INFO spark.SimpleJob: Starting task 0:1 as TID 5 on slave 
 201209021640-74572372-5050-16898-2: lmrspark-G41MT-S2 (preferred)
 12/09/02 16:47:57 INFO spark.SimpleJob: Size of task 0:1 is 1606 bytes and 
 took 2 ms to serialize by spark.JavaSerializerInstance
 12/09/02 16:47:58 INFO spark.SimpleJob: Lost TID 4 (task 0:0)
 12/09/02 16:47:58 INFO spark.SimpleJob: Starting task 0:0 as TID 6 on slave

[jira] [Updated] (SPARK-542) Cache Miss when machine have multiple hostname

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee updated SPARK-542:

Component/s: Mesos
   Priority: Blocker

 Cache Miss when machine have multiple hostname
 --

 Key: SPARK-542
 URL: https://issues.apache.org/jira/browse/SPARK-542
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Reporter: frankvictor
Priority: Blocker

 Hi, I encountered weirdly slow PageRank runtimes over the last few days.
 After debugging the job, I found it was caused by DNS names.
 The machines in my cluster have multiple hostnames; for example, slave 1 has 
 the names c001 and c001.cm.cluster.
 When Spark adds a cache entry in CacheTracker, it gets c001 and records the cache under it.
 But when SimpleJob schedules a task, the Mesos offer gives Spark 
 c001.cm.cluster,
 so it will never get a preferred location!
 I think Spark should handle the multiple-hostname case (by using the IP instead 
 of the hostname, or some other method).
 Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-550) Hiding the default spark context in the spark shell creates serialization issues

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-550.
---
Resolution: Done

 Hiding the default spark context in the spark shell creates serialization 
 issues
 

 Key: SPARK-550
 URL: https://issues.apache.org/jira/browse/SPARK-550
 Project: Spark
  Issue Type: Bug
Reporter: tjhunter

 I copy-pasted a piece of code along these lines in the spark shell:
 ...
 val sc = new SparkContext("local[%d]" format num_splits, "myframework")
 val my_rdd = sc.textFile(...)
 my_rdd.count()
 This leads to the shell crashing with a java.io.NotSerializableException: 
 spark.SparkContext
 It took me a while to realize it was due to the new spark context being created. 
 Maybe a warning/error should be triggered if the user tries to change the 
 definition of sc?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-550) Hiding the default spark context in the spark shell creates serialization issues

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142477#comment-14142477
 ] 

Matthew Farrellee commented on SPARK-550:
-

a lot of code has changed in this space over the past 2 years. i'm going to 
close this, but feel free to re-open if you feel it's still an issue.
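
for anyone who still hits something like this: the usual fix is to reuse the context the shell already provides instead of shadowing it. a minimal sketch (the input path is hypothetical):

{code}
// in spark-shell, 'sc' is already bound to a working SparkContext; reuse it
// rather than defining a new one with the same name
val my_rdd = sc.textFile("/tmp/input.txt")
my_rdd.count()
{code}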

 Hiding the default spark context in the spark shell creates serialization 
 issues
 

 Key: SPARK-550
 URL: https://issues.apache.org/jira/browse/SPARK-550
 Project: Spark
  Issue Type: Bug
Reporter: tjhunter

 I copy-pasted a piece of code along these lines in the spark shell:
 ...
 val sc = new SparkContext("local[%d]" format num_splits, "myframework")
 val my_rdd = sc.textFile(...)
 my_rdd.count()
 This leads to the shell crashing with a java.io.NotSerializableException: 
 spark.SparkContext
 It took me a while to realize it was due to the new spark context being created. 
 Maybe a warning/error should be triggered if the user tries to change the 
 definition of sc?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-559) Automatically register all classes used in fields of a class with Kryo

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-559.
---
Resolution: Done

 Automatically register all classes used in fields of a class with Kryo
 --

 Key: SPARK-559
 URL: https://issues.apache.org/jira/browse/SPARK-559
 Project: Spark
  Issue Type: Bug
Reporter: Matei Zaharia





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-559) Automatically register all classes used in fields of a class with Kryo

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142479#comment-14142479
 ] 

Matthew Farrellee commented on SPARK-559:
-

the last comment on this, from 2 years ago, suggests this is resolved w/ an 
upgrade to kryo 2.x. i'm going to close this, but please re-open if you 
disagree.
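
for context, a minimal sketch of the manual registration this issue would automate, using the standard kryo hooks (the class names are hypothetical):

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

case class Inner(x: Int)
case class Outer(inner: Inner)   // field types like Inner must also be registered by hand today

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[Outer])
    kryo.register(classOf[Inner])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")
{code}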

 Automatically register all classes used in fields of a class with Kryo
 --

 Key: SPARK-559
 URL: https://issues.apache.org/jira/browse/SPARK-559
 Project: Spark
  Issue Type: Bug
Reporter: Matei Zaharia





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-567) Unified directory structure for temporary data

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-567.
---
Resolution: Incomplete

please re-open with additional details for how this could be implemented

 Unified directory structure for temporary data
 --

 Key: SPARK-567
 URL: https://issues.apache.org/jira/browse/SPARK-567
 Project: Spark
  Issue Type: Improvement
Reporter: Mosharaf Chowdhury

 Broadcast, shuffle, and unforeseen use cases should use the same directory 
 structure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-718) NPE when performing action during transformation

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-718.
---
Resolution: Done

 NPE when performing action during transformation
 

 Key: SPARK-718
 URL: https://issues.apache.org/jira/browse/SPARK-718
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Krzywicki

 Running the spark shell:
 The following code fails with a NPE when trying to collect the resulting RDD:
 {code:java}
 val data = sc.parallelize(1 to 10)
 data.map(i => data.count).collect
 {code}
 {code:java}
 ERROR local.LocalScheduler: Exception in task 0
 java.lang.NullPointerException
 at spark.RDD.count(RDD.scala:490)
 at 
 $line16.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply$mcJI$sp(console:15)
 at $line16.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:15)
 at $line16.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:15)
 at scala.collection.Iterator$$anon$19.next(Iterator.scala:401)
 at scala.collection.Iterator$class.foreach(Iterator.scala:772)
 at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:399)
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:102)
 at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:250)
 at scala.collection.Iterator$$anon$19.toBuffer(Iterator.scala:399)
 at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:237)
 at scala.collection.Iterator$$anon$19.toArray(Iterator.scala:399)
 at spark.RDD$$anonfun$1.apply(RDD.scala:389)
 at spark.RDD$$anonfun$1.apply(RDD.scala:389)
 at spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:610)
 at spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:610)
 at spark.scheduler.ResultTask.run(ResultTask.scala:76)
 at 
 spark.scheduler.local.LocalScheduler.runTask$1(LocalScheduler.scala:74)
 at 
 spark.scheduler.local.LocalScheduler$$anon$1.run(LocalScheduler.scala:50)
 at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-718) NPE when performing action during transformation

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142506#comment-14142506
 ] 

Matthew Farrellee commented on SPARK-718:
-

Spark simply does not support nesting RDDs in this fashion. you'll get a more 
prompt response and information with the user list, see 
http://spark.apache.org/community.html. i'm going to close this issue, but if 
you want feel free to re-open it.
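
a minimal sketch of the usual restructuring: run the inner action on the driver first, then refer only to its plain result inside the transformation:

{code}
val data = sc.parallelize(1 to 10)
val total = data.count()            // the action runs on the driver
data.map(i => total).collect()      // the closure captures a plain Long, not an RDD
{code}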

 NPE when performing action during transformation
 

 Key: SPARK-718
 URL: https://issues.apache.org/jira/browse/SPARK-718
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Krzywicki

 Running the spark shell:
 The following code fails with a NPE when trying to collect the resulting RDD:
 {code:java}
 val data = sc.parallelize(1 to 10)
 data.map(i => data.count).collect
 {code}
 {code:java}
 ERROR local.LocalScheduler: Exception in task 0
 java.lang.NullPointerException
 at spark.RDD.count(RDD.scala:490)
 at 
 $line16.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply$mcJI$sp(console:15)
 at $line16.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:15)
 at $line16.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:15)
 at scala.collection.Iterator$$anon$19.next(Iterator.scala:401)
 at scala.collection.Iterator$class.foreach(Iterator.scala:772)
 at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:399)
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:102)
 at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:250)
 at scala.collection.Iterator$$anon$19.toBuffer(Iterator.scala:399)
 at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:237)
 at scala.collection.Iterator$$anon$19.toArray(Iterator.scala:399)
 at spark.RDD$$anonfun$1.apply(RDD.scala:389)
 at spark.RDD$$anonfun$1.apply(RDD.scala:389)
 at spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:610)
 at spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:610)
 at spark.scheduler.ResultTask.run(ResultTask.scala:76)
 at 
 spark.scheduler.local.LocalScheduler.runTask$1(LocalScheduler.scala:74)
 at 
 spark.scheduler.local.LocalScheduler$$anon$1.run(LocalScheduler.scala:50)
 at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-690) Stack overflow when running pagerank more than 10000 iterators

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142511#comment-14142511
 ] 

Matthew Farrellee commented on SPARK-690:
-

[~andrew xia] this is reported against a very old version. i'm going to close 
it out, but if you can reproduce please re-open
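
for anyone who does reproduce it: very long iterative jobs grow the RDD lineage without bound, and the usual remedy on current releases is to checkpoint periodically. a minimal sketch against the standard pagerank recursion (the link data and checkpoint directory are made up):

{code}
sc.setCheckpointDir("/tmp/spark-checkpoints")   // any writable directory

val links = sc.parallelize(Seq(1 -> Seq(2), 2 -> Seq(1, 3), 3 -> Seq(1))).cache()
var ranks = links.mapValues(_ => 1.0)

for (i <- 1 to 10000) {
  val contribs = links.join(ranks).values.flatMap {
    case (urls, rank) => urls.map(url => (url, rank / urls.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(r => 0.15 + 0.85 * r)
  if (i % 100 == 0) {
    ranks.checkpoint()   // truncate the lineage so the DAG doesn't keep growing
    ranks.count()        // force evaluation so the checkpoint is actually written
  }
}
{code}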

 Stack overflow when running pagerank more than 10000 iterators
 --

 Key: SPARK-690
 URL: https://issues.apache.org/jira/browse/SPARK-690
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.6.1
Reporter: xiajunluan

 when I run the PageRank example for more than 10000 iterations, the job client reports 
 stack overflow errors.
 13/02/07 13:41:40 INFO CacheTracker: Registering RDD ID 57993 with cache
 Exception in thread DAGScheduler java.lang.StackOverflowError
   at 
 java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryAcquireShared(ReentrantReadWriteLock.java:467)
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1281)
   at 
 java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:731)
   at 
 org.jboss.netty.akka.util.HashedWheelTimer.scheduleTimeout(HashedWheelTimer.java:277)
   at 
 org.jboss.netty.akka.util.HashedWheelTimer.newTimeout(HashedWheelTimer.java:264)
   at akka.actor.DefaultScheduler.scheduleOnce(Scheduler.scala:186)
   at akka.pattern.PromiseActorRef$.apply(AskSupport.scala:274)
   at akka.pattern.AskSupport$class.ask(AskSupport.scala:83)
   at akka.pattern.package$.ask(package.scala:43)
   at akka.pattern.AskSupport$AskableActorRef.ask(AskSupport.scala:123)
   at spark.CacheTracker.askTracker(CacheTracker.scala:121)
   at spark.CacheTracker.communicate(CacheTracker.scala:131)
   at spark.CacheTracker.registerRDD(CacheTracker.scala:142)
   at spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:149)
   at 
 spark.scheduler.DAGScheduler$$anonfun$visit$1$2.apply(DAGScheduler.scala:155)
   at 
 spark.scheduler.DAGScheduler$$anonfun$visit$1$2.apply(DAGScheduler.scala:150)
   at 
 scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
   at scala.collection.immutable.List.foreach(List.scala:76)
   at spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:150)
   at spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:160)
   at spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:131)
   at 
 spark.scheduler.DAGScheduler.getShuffleMapStage(DAGScheduler.scala:111)
   at 
 spark.scheduler.DAGScheduler$$anonfun$visit$1$2.apply(DAGScheduler.scala:153)
   at 
 spark.scheduler.DAGScheduler$$anonfun$visit$1$2.apply(DAGScheduler.scala:150)
   at 
 scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-690) Stack overflow when running pagerank more than 10000 iterators

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-690.
---
Resolution: Unresolved

 Stack overflow when running pagerank more than 10000 iterators
 --

 Key: SPARK-690
 URL: https://issues.apache.org/jira/browse/SPARK-690
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.6.1
Reporter: xiajunluan

 when I run the PageRank example for more than 10000 iterations, the job client reports 
 stack overflow errors.
 13/02/07 13:41:40 INFO CacheTracker: Registering RDD ID 57993 with cache
 Exception in thread DAGScheduler java.lang.StackOverflowError
   at 
 java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryAcquireShared(ReentrantReadWriteLock.java:467)
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1281)
   at 
 java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:731)
   at 
 org.jboss.netty.akka.util.HashedWheelTimer.scheduleTimeout(HashedWheelTimer.java:277)
   at 
 org.jboss.netty.akka.util.HashedWheelTimer.newTimeout(HashedWheelTimer.java:264)
   at akka.actor.DefaultScheduler.scheduleOnce(Scheduler.scala:186)
   at akka.pattern.PromiseActorRef$.apply(AskSupport.scala:274)
   at akka.pattern.AskSupport$class.ask(AskSupport.scala:83)
   at akka.pattern.package$.ask(package.scala:43)
   at akka.pattern.AskSupport$AskableActorRef.ask(AskSupport.scala:123)
   at spark.CacheTracker.askTracker(CacheTracker.scala:121)
   at spark.CacheTracker.communicate(CacheTracker.scala:131)
   at spark.CacheTracker.registerRDD(CacheTracker.scala:142)
   at spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:149)
   at 
 spark.scheduler.DAGScheduler$$anonfun$visit$1$2.apply(DAGScheduler.scala:155)
   at 
 spark.scheduler.DAGScheduler$$anonfun$visit$1$2.apply(DAGScheduler.scala:150)
   at 
 scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
   at scala.collection.immutable.List.foreach(List.scala:76)
   at spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:150)
   at spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:160)
   at spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:131)
   at 
 spark.scheduler.DAGScheduler.getShuffleMapStage(DAGScheduler.scala:111)
   at 
 spark.scheduler.DAGScheduler$$anonfun$visit$1$2.apply(DAGScheduler.scala:153)
   at 
 spark.scheduler.DAGScheduler$$anonfun$visit$1$2.apply(DAGScheduler.scala:150)
   at 
 scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-610) Support master failover in standalone mode

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142528#comment-14142528
 ] 

Matthew Farrellee commented on SPARK-610:
-

[~matei] given YARN and Mesos implementations, is this something the standalone 
mode should strive to do?

 Support master failover in standalone mode
 --

 Key: SPARK-610
 URL: https://issues.apache.org/jira/browse/SPARK-610
 Project: Spark
  Issue Type: New Feature
Reporter: Matei Zaharia

 The standalone deploy mode is quite simple, which shouldn't make it too bad 
 to add support for master failover using ZooKeeper or something similar. This 
 would really up its usefulness.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-604) reconnect if mesos slaves dies

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee updated SPARK-604:

Component/s: Mesos

 reconnect if mesos slaves dies
 --

 Key: SPARK-604
 URL: https://issues.apache.org/jira/browse/SPARK-604
 Project: Spark
  Issue Type: Bug
  Components: Mesos

 when running on mesos, if a slave goes down, spark doesn't try to reassign 
 the work to another machine.  Even if the slave comes back up, the job is 
 doomed.
 Currently when this happens, we just see this in the driver logs:
 12/11/01 16:48:56 INFO mesos.MesosSchedulerBackend: Mesos slave lost: 
 201210312057-1560611338-5050-24091-52
 Exception in thread Thread-346 java.util.NoSuchElementException: key not 
 found: value: 201210312057-1560611338-5050-24091-52
 at scala.collection.MapLike$class.default(MapLike.scala:224)
 at scala.collection.mutable.HashMap.default(HashMap.scala:43)
 at scala.collection.MapLike$class.apply(MapLike.scala:135)
 at scala.collection.mutable.HashMap.apply(HashMap.scala:43)
 at 
 spark.scheduler.cluster.ClusterScheduler.slaveLost(ClusterScheduler.scala:255)
 at 
 spark.scheduler.mesos.MesosSchedulerBackend.slaveLost(MesosSchedulerBackend.scala:275)
 12/11/01 16:48:56 INFO mesos.MesosSchedulerBackend: driver.run() returned 
 with code DRIVER_ABORTED



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-584) Pass slave ip address when starting a cluster

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142545#comment-14142545
 ] 

Matthew Farrellee commented on SPARK-584:
-

what's the use case for this?

 Pass slave ip address when starting a cluster 
 --

 Key: SPARK-584
 URL: https://issues.apache.org/jira/browse/SPARK-584
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.6.0
Priority: Minor
 Attachments: 0001-fix-for-SPARK-584.patch


 Pass slave ip address from conf while starting a cluster:
 bin/start-slaves.sh is used to start all the slaves in the cluster. While the 
 slave class takes a --ip argument, we don't pass the ip address from the 
 conf/slaves. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-575) Maintain a cache of JARs on each node to avoid unnecessary copying

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142553#comment-14142553
 ] 

Matthew Farrellee commented on SPARK-575:
-

[~joshrosen] is quite correct.

this issue looks inactive. i'm going to close it out, but as always feel free 
to re-open. i can think of a few ways this could be done, and not all need 
spark code to be changed.

 Maintain a cache of JARs on each node to avoid unnecessary copying
 --

 Key: SPARK-575
 URL: https://issues.apache.org/jira/browse/SPARK-575
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-575) Maintain a cache of JARs on each node to avoid unnecessary copying

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-575.
---
Resolution: Incomplete

 Maintain a cache of JARs on each node to avoid unnecessary copying
 --

 Key: SPARK-575
 URL: https://issues.apache.org/jira/browse/SPARK-575
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-578) Fix interpreter code generation to only capture needed dependencies

2014-09-21 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142558#comment-14142558
 ] 

Matthew Farrellee commented on SPARK-578:
-

[~matei] is this related to slimming down the assembly?

 Fix interpreter code generation to only capture needed dependencies
 ---

 Key: SPARK-578
 URL: https://issues.apache.org/jira/browse/SPARK-578
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-542) Cache Miss when machine have multiple hostname

2014-09-21 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee updated SPARK-542:

Priority: Minor  (was: Blocker)

 Cache Miss when machine have multiple hostname
 --

 Key: SPARK-542
 URL: https://issues.apache.org/jira/browse/SPARK-542
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Reporter: frankvictor
Priority: Minor

 Hi, I encountered weirdly slow PageRank runtimes over the last few days.
 After debugging the job, I found it was caused by DNS names.
 The machines in my cluster have multiple hostnames; for example, slave 1 has 
 the names c001 and c001.cm.cluster.
 When Spark adds a cache entry in CacheTracker, it gets c001 and records the cache under it.
 But when SimpleJob schedules a task, the Mesos offer gives Spark 
 c001.cm.cluster,
 so it will never get a preferred location!
 I think Spark should handle the multiple-hostname case (by using the IP instead 
 of the hostname, or some other method).
 Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Spark + Mahout

2014-09-19 Thread Matthew Farrellee

On 09/19/2014 05:06 AM, Sean Owen wrote:

No, it is actually a quite different 'alpha' project under the same
name: linear algebra DSL on top of H2O and also Spark. It is not really
about algorithm implementations now.

On Sep 19, 2014 1:25 AM, Matthew Farrellee m...@redhat.com wrote:

On 09/18/2014 05:40 PM, Sean Owen wrote:

No, the architectures are entirely different. The Mahout
implementations
have been deprecated and are not being updated, so there won't
be a port
or anything. You would have to create these things from scratch
on Spark
if they don't already exist.

On Sep 18, 2014 7:50 PM, Daniel Takabayashi takabaya...@scanboo.com.br wrote:

 Hi guys,

 Is possible to run a mahout kmeans throws spark infrastructure?


 Thanks,
 taka (Brazil)


from what i've read, mahout isn't accepting changes to MR-based
implementations. would mahout accept an implementation on Spark?

best,


matt


oic. where's a good place to see progress on that?

best,


matt


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[jira] [Commented] (SPARK-3321) Defining a class within python main script

2014-09-18 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138867#comment-14138867
 ] 

Matthew Farrellee commented on SPARK-3321:
--

[~guoxu1231] i think so too. ok if i close this?

 Defining a class within python main script
 --

 Key: SPARK-3321
 URL: https://issues.apache.org/jira/browse/SPARK-3321
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.1
 Environment: Python version 2.6.6
 Spark version version 1.0.1
 jdk1.6.0_43
Reporter: Shawn Guo
Priority: Minor

 *leftOuterJoin(self, other, numPartitions=None)*
 Perform a left outer join of self and other.
 For each element (k, v) in self, the resulting RDD will either contain all 
 pairs (k, (v, w)) for w in other, or the pair (k, (v, None)) if no elements 
 in other have key k.
 *Background*: leftOuterJoin will produce None elements in the result dataset.
 I define a new class 'Null' in the main script to replace all native Python 
 None values with the new 'Null' object. The 'Null' object overloads the [] operator.
 {code:title=Class Null|borderStyle=solid}
 class Null(object):
     def __getitem__(self, key): return None
     def __getstate__(self): pass
     def __setstate__(self, dict): pass

 def convert_to_null(x):
     return Null() if x is None else x

 X = A.leftOuterJoin(B)
 X.mapValues(lambda line: (line[0], convert_to_null(line[1])))
 {code}
 The code seems to run fine in the pyspark console; however, spark-submit fails 
 with the error messages below:
 /spark-1.0.1-bin-hadoop1/bin/spark-submit --master local[2] 
 /tmp/python_test.py
 {noformat}
   File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/worker.py, line 
 77, in main
 serializer.dump_stream(func(split_index, iterator), outfile)
   File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, 
 line 191, in dump_stream
 self.serializer.dump_stream(self._batched(iterator), stream)
   File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, 
 line 124, in dump_stream
 self._write_with_length(obj, stream)
   File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, 
 line 134, in _write_with_length
 serialized = self.dumps(obj)
   File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, 
 line 279, in dumps
 def dumps(self, obj): return cPickle.dumps(obj, 2)
 PicklingError: Can't pickle class '__main__.Null': attribute lookup 
 __main__.Null failed
 
 org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115)
 
 org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:145)
 org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:33)
 org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:74)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
 
 org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)
 Driver stacktrace:
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
 at scala.Option.foreach(Option.scala:236)
 at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
 at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
 at akka.actor.ActorCell.receiveMessage

[jira] [Commented] (SPARK-3580) Add Consistent Method To Get Number of RDD Partitions Across Different Languages

2014-09-18 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139043#comment-14139043
 ] 

Matthew Farrellee commented on SPARK-3580:
--

what do you think about going the other direction, adding a partitions property 
to RDDs in python?

given that an RDD is a list of partitions, a function for computing each split, 
a list of deps on other RDDs, etc, it makes sense that you could access a 
someRDD.partitions, and doing so looks to be the preferred method in scala. so, 
instead of a someRDD.getNumPartitions(), python code could use a more idiomatic 
len(someRDD.partitions).
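
a rough sketch of the idea, monkey-patched on here just for illustration (hypothetical, not the current pyspark api; it assumes the wrapped jvm rdd exposes partitions(), which getNumPartitions() already relies on):

{code:python}
from pyspark import SparkContext
from pyspark.rdd import RDD

# hypothetical property, mirroring scala's someRDD.partitions
RDD.partitions = property(lambda self: list(self._jrdd.partitions()))

sc = SparkContext("local", "partitions-sketch")
some_rdd = sc.parallelize(range(100), 4)
print(len(some_rdd.partitions))  # 4 -- the idiomatic form suggested above
sc.stop()
{code}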

 Add Consistent Method To Get Number of RDD Partitions Across Different 
 Languages
 

 Key: SPARK-3580
 URL: https://issues.apache.org/jira/browse/SPARK-3580
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Pat McDonough
  Labels: starter

 Programmatically retrieving the number of partitions is not consistent 
 between python and scala. A consistent method should be defined and made 
 public across both languages.
 RDD.partitions.size is also used quite frequently throughout the internal 
 code, so that might be worth refactoring as well once the new method is 
 available.
 What we have today is below.
 In Scala:
 {code}
 scala someRDD.partitions.size
 res0: Int = 30
 {code}
 In Python:
 {code}
 In [2]: someRDD.getNumPartitions()
 Out[2]: 30
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3562) Periodic cleanup event logs

2014-09-18 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139786#comment-14139786
 ] 

Matthew Farrellee commented on SPARK-3562:
--

is logrotate an option for you?
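
if not, a small cron'd script would do it. a sketch in python (it assumes each application's event log is a file or directory directly under spark.eventLog.dir, and that anything older than N days can go):

{code:python}
import os
import shutil
import time

EVENT_LOG_DIR = "/var/spark/event-logs"  # wherever spark.eventLog.dir points
MAX_AGE_DAYS = 14

cutoff = time.time() - MAX_AGE_DAYS * 24 * 3600
for name in os.listdir(EVENT_LOG_DIR):
    path = os.path.join(EVENT_LOG_DIR, name)
    if os.path.getmtime(path) < cutoff:
        if os.path.isdir(path):
            shutil.rmtree(path)   # one application's event log directory
        else:
            os.remove(path)       # or a single event log file
{code}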

 Periodic cleanup event logs
 ---

 Key: SPARK-3562
 URL: https://issues.apache.org/jira/browse/SPARK-3562
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: xukun

  If we run Spark applications frequently, many event logs get written to 
 spark.eventLog.dir. After a long time there will be many event logs in 
 spark.eventLog.dir that we no longer care about. Periodic cleanup would ensure 
 that logs older than a configured duration are forgotten, so there is no need 
 to clean logs by hand.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3581) RDD API(distinct/subtract) does not work for RDD of Dictionaries

2014-09-18 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-3581.

Resolution: Not a Problem

 RDD API(distinct/subtract) does not work for RDD of Dictionaries
 

 Key: SPARK-3581
 URL: https://issues.apache.org/jira/browse/SPARK-3581
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0, 1.0.2, 1.1.0
 Environment: Spark 1.0 1.1
 JDK 1.6
Reporter: Shawn Guo
Priority: Minor

 Construct a RDD of dictionaries(dictRDD), 
 try to use the RDD API, RDD.distinct() or RDD.subtract().
 {code:title=PySpark RDD API Test|borderStyle=solid}
 dictRDD = sc.parallelize(({'MOVIE_ID': 1, 'MOVIE_NAME': 'Lord of the 
 Rings','MOVIE_DIRECTOR': 'Peter Jackson'},{'MOVIE_ID': 2, 'MOVIE_NAME': 'King 
 King', 'MOVIE_DIRECTOR': 'Peter Jackson'},{'MOVIE_ID': 2, 'MOVIE_NAME': 'King 
 King', 'MOVIE_DIRECTOR': 'Peter Jackson'}))
 dictRDD.distinct().collect()
 dictRDD.subtract(dictRDD).collect()
 {code}
 An error occurred while calling, 
 TypeError: unhashable type: 'dict'
 I'm not sure if it is a bug or expected results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3321) Defining a class within python main script

2014-09-18 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-3321.

Resolution: Not a Problem

 Defining a class within python main script
 --

 Key: SPARK-3321
 URL: https://issues.apache.org/jira/browse/SPARK-3321
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.1
 Environment: Python version 2.6.6
 Spark version version 1.0.1
 jdk1.6.0_43
Reporter: Shawn Guo
Priority: Minor

 *leftOuterJoin(self, other, numPartitions=None)*
 Perform a left outer join of self and other.
 For each element (k, v) in self, the resulting RDD will either contain all 
 pairs (k, (v, w)) for w in other, or the pair (k, (v, None)) if no elements 
 in other have key k.
 *Background*: leftOuterJoin will produce None element in result dataset.
 I define a new class 'Null' in the main script to replace all python native 
 None to new 'Null' object. 'Null' object overload the [] operator.
 {code:title=Class Null|borderStyle=solid}
 class Null(object):
 def __getitem__(self,key): return None;
 def __getstate__(self): pass;
 def __setstate__(self, dict): pass;
 def convert_to_null(x):
 return Null() if x is None else x
 X = A.leftOuterJoin(B)
 X.mapValues(lambda line: (line[0], convert_to_null(line[1])))
 {code}
 The code seems running good in pyspark console, however spark-submit failed 
 with below error messages:
 /spark-1.0.1-bin-hadoop1/bin/spark-submit --master local[2] 
 /tmp/python_test.py
 {noformat}
   File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/worker.py, line 
 77, in main
 serializer.dump_stream(func(split_index, iterator), outfile)
   File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, 
 line 191, in dump_stream
 self.serializer.dump_stream(self._batched(iterator), stream)
   File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, 
 line 124, in dump_stream
 self._write_with_length(obj, stream)
   File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, 
 line 134, in _write_with_length
 serialized = self.dumps(obj)
   File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, 
 line 279, in dumps
 def dumps(self, obj): return cPickle.dumps(obj, 2)
 PicklingError: Can't pickle class '__main__.Null': attribute lookup 
 __main__.Null failed
 
 org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115)
 
 org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:145)
 org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:33)
 org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:74)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
 
 org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)
 Driver stacktrace:
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
 at scala.Option.foreach(Option.scala:236)
 at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
 at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456

[jira] [Closed] (SPARK-2022) Spark 1.0.0 is failing if mesos.coarse set to true

2014-09-17 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-2022.

Resolution: Fixed

 Spark 1.0.0 is failing if mesos.coarse set to true
 --

 Key: SPARK-2022
 URL: https://issues.apache.org/jira/browse/SPARK-2022
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.0.0
Reporter: Marek Wiewiorka
Assignee: Tim Chen
Priority: Critical

 more stderr
 ---
 WARNING: Logging before InitGoogleLogging() is written to STDERR
 I0603 16:07:53.721132 61192 exec.cpp:131] Version: 0.18.2
 I0603 16:07:53.725230 61200 exec.cpp:205] Executor registered on slave 
 201405220917-134217738-5050-27119-0
 Exception in thread main java.lang.NumberFormatException: For input string: 
 sparkseq003.cloudapp.net
 at 
 java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
 at java.lang.Integer.parseInt(Integer.java:492)
 at java.lang.Integer.parseInt(Integer.java:527)
 at 
 scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
 at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
 at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:135)
 at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
 more stdout
 ---
 Registered executor on sparkseq003.cloudapp.net
 Starting task 5
 Forked command at 61202
 sh -c '/home/mesos/spark-1.0.0/bin/spark-class 
 org.apache.spark.executor.CoarseGrainedExecutorBackend 
 -Dspark.mesos.coarse=true 
 akka.tcp://sp...@sparkseq001.cloudapp.net:40312/user/CoarseG
 rainedScheduler 201405220917-134217738-5050-27119-0 sparkseq003.cloudapp.net 
 4'
 Command exited with status 1 (pid: 61202)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3508) annotate the Spark configs to indicate which ones are meant for the end user

2014-09-16 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135631#comment-14135631
 ] 

Matthew Farrellee commented on SPARK-3508:
--

documented == public is a good metric. to handle the case of committers not 
knowing what should be public, specifically calling out newly documented config 
params at release provides an opportunity for extra review.

+1 config as api

 annotate the Spark configs to indicate which ones are meant for the end user
 

 Key: SPARK-3508
 URL: https://issues.apache.org/jira/browse/SPARK-3508
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Thomas Graves

 Spark has lots of configs floating around.  To me configs are like api's and 
 we should make it clear which ones are meant for the end user and which ones 
 are only used internally.  We should decide on exactly how we want to do this.
 I've seen in the past users looking at the code and then using a config that 
 was meant to be internal and file a jira to document it.  Since there are 
 many committers it's easy for someone who doesn't have the history with that 
 config to just think we forgot to document it and then it becomes public.
 Perhaps we need to name internal configs specially (spark.internal.) or we 
 need to annotate them or something else.
 thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2377) Create a Python API for Spark Streaming

2014-09-15 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134225#comment-14134225
 ] 

Matthew Farrellee commented on SPARK-2377:
--

it's a little tricky. you need to clone tdas' or giwa's repository, make 
changes on master (it's far from current spark master) and submit pull requests 
to giwa or tdas.

imho, it'd be much simpler if the PR was tagged [WIP] and directed toward the 
apache/spark repo! (pls!)

 Create a Python API for Spark Streaming
 ---

 Key: SPARK-2377
 URL: https://issues.apache.org/jira/browse/SPARK-2377
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Streaming
Reporter: Nicholas Chammas
Assignee: Kenichi Takagiwa

 [Spark 
 Streaming|http://spark.apache.org/docs/latest/streaming-programming-guide.html]
  currently offers APIs in Scala and Java. It would be great feature add to 
 have a Python API as well.
 This is probably a large task that will span many issues if undertaken. This 
 ticket should provide some place to track overall progress towards an initial 
 Python API for Spark Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3538) Provide way for workers to log messages to driver's out/err

2014-09-15 Thread Matthew Farrellee (JIRA)
Matthew Farrellee created SPARK-3538:


 Summary: Provide way for workers to log messages to driver's 
out/err
 Key: SPARK-3538
 URL: https://issues.apache.org/jira/browse/SPARK-3538
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core, Spark Shell
Reporter: Matthew Farrellee
Priority: Minor


As part of SPARK-927 we encountered a use case for the code running on a worker 
to be able to emit messages back to the driver. The communication channel is 
for trace/debug messages to an application's (shell or app) user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: yet another jenkins restart early thursday morning -- 730am PDT (and a brief update on our new jenkins infra)

2014-09-11 Thread Matthew Farrellee

shane,

is there anything we should do for pull requests that failed, but for 
unrelated issues?


best,


matt

On 09/11/2014 11:29 AM, shane knapp wrote:

...and the restart is done.

On Thu, Sep 11, 2014 at 7:38 AM, shane knapp skn...@berkeley.edu wrote:


jenkins is now in quiet mode, and a restart is happening soon.

On Wed, Sep 10, 2014 at 3:44 PM, shane knapp skn...@berkeley.edu wrote:


that's kinda what we're hoping as well.  :)

On Wed, Sep 10, 2014 at 2:46 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:


I'm looking forward to this. :)

Looks like Jenkins is having trouble triggering builds for new commits
or after user requests (e.g.
https://github.com/apache/spark/pull/2339#issuecomment-55165937).
Hopefully that will be resolved tomorrow.

Nick

On Tue, Sep 9, 2014 at 5:00 PM, shane knapp skn...@berkeley.edu wrote:


since the power incident last thursday, the github pull request builder
plugin is still not really working 100%.  i found an open issue
w/jenkins[1] that could definitely be affecting us, i will be pausing
builds early thursday morning and then restarting jenkins.
i'll send out a reminder tomorrow, and if this causes any problems for
you,
please let me know and we can work out a better time.

but, now for some good news!  yesterday morning, we racked and stacked
the
systems for the new jenkins instance in the berkeley datacenter.
tomorrow
i should be able to log in to them and start getting them set up and
configured.  this is a major step in getting us in to a much more
'production' style environment!

anyways:  thanks for your patience, and i think we've all learned that
hard
powering down your build system is a definite recipe for disaster.  :)

shane

[1] -- https://issues.jenkins-ci.org/browse/JENKINS-22509













-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: yet another jenkins restart early thursday morning -- 730am PDT (and a brief update on our new jenkins infra)

2014-09-11 Thread Matthew Farrellee
it was part of the review queue, but it looks like the runs have been 
gc'd. oh well!


best,


matt

On 09/11/2014 12:18 PM, shane knapp wrote:

you can just click on 'rebuild', if you'd like.  what project
specifically?  (i had forgotten that i'd killed
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/557/,
which i just started a rebuild on)

On Thu, Sep 11, 2014 at 9:15 AM, Matthew Farrellee m...@redhat.com
mailto:m...@redhat.com wrote:

shane,

is there anything we should do for pull requests that failed, but
for unrelated issues?

best,


matt

On 09/11/2014 11:29 AM, shane knapp wrote:

...and the restart is done.

On Thu, Sep 11, 2014 at 7:38 AM, shane knapp
skn...@berkeley.edu mailto:skn...@berkeley.edu wrote:

jenkins is now in quiet mode, and a restart is happening soon.

On Wed, Sep 10, 2014 at 3:44 PM, shane knapp
skn...@berkeley.edu mailto:skn...@berkeley.edu wrote:

that's kinda what we're hoping as well.  :)

On Wed, Sep 10, 2014 at 2:46 PM, Nicholas Chammas 
nicholas.cham...@gmail.com
mailto:nicholas.cham...@gmail.com wrote:

I'm looking forward to this. :)

Looks like Jenkins is having trouble triggering
builds for new commits
or after user requests (e.g.


https://github.com/apache/spark/pull/2339#issuecomment-55165937).
Hopefully that will be resolved tomorrow.

Nick

On Tue, Sep 9, 2014 at 5:00 PM, shane knapp
skn...@berkeley.edu mailto:skn...@berkeley.edu
wrote:

since the power incident last thursday, the
github pull request builder
plugin is still not really working 100%.  i
found an open issue
w/jenkins[1] that could definitely be affecting
us, i will be pausing
builds early thursday morning and then
restarting jenkins.
i'll send out a reminder tomorrow, and if this
causes any problems for
you,
please let me know and we can work out a better
time.

but, now for some good news!  yesterday morning,
we racked and stacked
the
systems for the new jenkins instance in the
berkeley datacenter.
tomorrow
i should be able to log in to them and start
getting them set up and
configured.  this is a major step in getting us
in to a much more
'production' style environment!

anyways:  thanks for your patience, and i think
we've all learned that
hard
powering down your build system is a definite
recipe for disaster.  :)

shane

[1] --
https://issues.jenkins-ci.org/browse/JENKINS-22509











-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Commented] (SPARK-3470) Have JavaSparkContext implement Closeable/AutoCloseable

2014-09-10 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128751#comment-14128751
 ] 

Matthew Farrellee commented on SPARK-3470:
--

while you can implement Closeable in java 7+ and use try (Closeable c = new 
...) { ... } (at least w/ openjdk 1.8), since spark targets java 7+, why not 
just use AutoCloseable?

 Have JavaSparkContext implement Closeable/AutoCloseable
 ---

 Key: SPARK-3470
 URL: https://issues.apache.org/jira/browse/SPARK-3470
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Shay Rojansky
Priority: Minor

 After discussion in SPARK-2972, it seems like a good idea to allow Java 
 developers to use Java 7 automatic resource management with JavaSparkContext, 
 like so:
 {code:java}
 try (JavaSparkContext ctx = new JavaSparkContext(...)) {
return br.readLine();
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped

2014-09-09 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127016#comment-14127016
 ] 

Matthew Farrellee commented on SPARK-2972:
--

 I suggest having context implement the language-specific dispose patterns 
 ('using' in Java, 'with' in Python), so at least the code looks better?

that's a great idea. i'll spec this out for python, would you care to do it for 
java / scala?

 APPLICATION_COMPLETE not created in Python unless context explicitly stopped
 

 Key: SPARK-2972
 URL: https://issues.apache.org/jira/browse/SPARK-2972
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.2
 Environment: Cloudera 5.1, yarn master on ubuntu precise
Reporter: Shay Rojansky

 If you don't explicitly stop a SparkContext at the end of a Python 
 application with sc.stop(), an APPLICATION_COMPLETE file isn't created and 
 the job doesn't get picked up by the history server.
 This can be easily reproduced with pyspark (but affects scripts as well).
 The current workaround is to wrap the entire script with a try/finally and 
 stop manually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3458) enable use of python's with statements for SparkContext management

2014-09-09 Thread Matthew Farrellee (JIRA)
Matthew Farrellee created SPARK-3458:


 Summary: enable use of python's with statements for SparkContext 
management
 Key: SPARK-3458
 URL: https://issues.apache.org/jira/browse/SPARK-3458
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Reporter: Matthew Farrellee


best practice for managing SparkContexts involves exception handling, e.g.

```
sc = SparkContext()
try:
  app(sc)
finally:
  sc.stop()
```

python provides the with statement to simplify this code, e.g.

```
with SparkContext() as sc:
  app(sc)
```

the SparkContext should be usable in a with statement
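
a minimal sketch of the semantics, bolted on from user code just for illustration (the real change would add __enter__/__exit__ to pyspark.context.SparkContext itself; only the existing stop() method is assumed)

```
from pyspark import SparkContext

# the standard context-manager protocol
SparkContext.__enter__ = lambda self: self

def _exit(self, exc_type, exc_value, traceback):
    self.stop()  # runs even if the body raised; the exception still propagates

SparkContext.__exit__ = _exit

with SparkContext("local", "with-sketch") as sc:
    print(sc.parallelize(range(10)).count())
```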



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped

2014-09-09 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127187#comment-14127187
 ] 

Matthew Farrellee commented on SPARK-2972:
--

+1 close this and open 2 feature requests, one for java and one for scala that 
mirror SPARK-3458

 APPLICATION_COMPLETE not created in Python unless context explicitly stopped
 

 Key: SPARK-2972
 URL: https://issues.apache.org/jira/browse/SPARK-2972
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.2
 Environment: Cloudera 5.1, yarn master on ubuntu precise
Reporter: Shay Rojansky

 If you don't explicitly stop a SparkContext at the end of a Python 
 application with sc.stop(), an APPLICATION_COMPLETE file isn't created and 
 the job doesn't get picked up by the history server.
 This can be easily reproduced with pyspark (but affects scripts as well).
 The current workaround is to wrap the entire script with a try/finally and 
 stop manually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped

2014-09-08 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14125938#comment-14125938
 ] 

Matthew Farrellee commented on SPARK-2972:
--

 Thanks for answering. I guess it's a debatable question. I admit I expected 
 the context to shut itself down at application exit, a bit in the way that 
 files and other resources get closed.

i can understand that. those resources are ones that are cleaned up by the 
kernel, which doesn't have external dependencies on their cleanup, e.g. closing 
a file handle need not depend on writing to a log. it's always nice to have the 
lower level library handle things like this for you.

 Note that the way the examples are currently written (pi.py), an exception 
 anywhere in the code would bypass sc.stop() and the Spark application 
 disappears without leaving a trace in the history server. For this reason, my 
 scripts all contain try/finally blocks around the application code, which 
 seems like needless boilerplate that complicates life and can easily be 
 forgotten.

you're right! imho, this means your program is written better than the 
examples. it would be good to enhance the examples w/ try/finally semantics. 
however,

 Is there any specific reason not to use the application shutdown hooks 
 available in python/java to close the context(s)?

getting the shutdown semantics right is difficult, and may not apply broadly 
across applications. for instance, your application may want to catch a failure 
in stop() and retry to make sure that a history record is written. another 
application may be ok w/ best effort writing history events. still another 
application may want to exit w/o stop() to avoid having a history event written.

asking the context creator to do context destruction shifts burden to the 
application writer and maintains flexibility for applications.

that's my 2c
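
for completeness: an application that wants best-effort cleanup can opt into a hook itself today with python's standard atexit. a sketch, with the caveats above, and explicitly not something spark does for you:

{code:python}
import atexit
from pyspark import SparkContext

sc = SparkContext("local", "atexit-sketch")
atexit.register(sc.stop)  # best-effort stop() at interpreter exit

# ... application code ...
{code}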

 APPLICATION_COMPLETE not created in Python unless context explicitly stopped
 

 Key: SPARK-2972
 URL: https://issues.apache.org/jira/browse/SPARK-2972
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.2
 Environment: Cloudera 5.1, yarn master on ubuntu precise
Reporter: Shay Rojansky

 If you don't explicitly stop a SparkContext at the end of a Python 
 application with sc.stop(), an APPLICATION_COMPLETE file isn't created and 
 the job doesn't get picked up by the history server.
 This can be easily reproduced with pyspark (but affects scripts as well).
 The current workaround is to wrap the entire script with a try/finally and 
 stop manually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1087) Separate file for traceback and callsite related functions

2014-09-08 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14125957#comment-14125957
 ] 

Matthew Farrellee commented on SPARK-1087:
--

[~jyotiska] please do!

 Separate file for traceback and callsite related functions
 --

 Key: SPARK-1087
 URL: https://issues.apache.org/jira/browse/SPARK-1087
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Reporter: Jyotiska NK

 Right now, _extract_concise_traceback() is written inside rdd.py which 
 provides the callsite information. But for 
 [SPARK-972](https://spark-project.atlassian.net/browse/SPARK-972) in PR #581, 
 we used the function from context.py. Also some issues were faced regarding 
 the return string format. 
 It would be a good idea to move the the traceback function from rdd and 
 create a separate file for future developments. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped

2014-09-07 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124872#comment-14124872
 ] 

Matthew Farrellee commented on SPARK-2972:
--

[~roji] this was addressed for a pyspark shell in 
https://issues.apache.org/jira/browse/SPARK-2435. as for applications, it is 
the programmer's responsibility to stop the context before exit. this can be 
seen in all the example code provided with spark. are you looking for the 
SparkContext to stop itself?

 APPLICATION_COMPLETE not created in Python unless context explicitly stopped
 

 Key: SPARK-2972
 URL: https://issues.apache.org/jira/browse/SPARK-2972
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.2
 Environment: Cloudera 5.1, yarn master on ubuntu precise
Reporter: Shay Rojansky

 If you don't explicitly stop a SparkContext at the end of a Python 
 application with sc.stop(), an APPLICATION_COMPLETE file isn't created and 
 the job doesn't get picked up by the history server.
 This can be easily reproduced with pyspark (but affects scripts as well).
 The current workaround is to wrap the entire script with a try/finally and 
 stop manually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition

2014-09-06 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124458#comment-14124458
 ] 

Matthew Farrellee commented on SPARK-1701:
--

slice vs partition has also come up on stackoverflow and just recently the user 
list.

i'm going to write up a patch for the programming-guide to at least clarify the 
situation.

i intend my pr to partially address this jira.

 Inconsistent naming: slice or partition
 ---

 Key: SPARK-1701
 URL: https://issues.apache.org/jira/browse/SPARK-1701
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Reporter: Daniel Darabos
Priority: Minor
  Labels: starter

 Throughout the documentation and code slice and partition are used 
 interchangeably. (Or so it seems to me.) It would avoid some confusion for 
 new users to settle on one name. I think partition is winning, since that 
 is the name of the class representing the concept.
 This should not be much more complicated to do than a search & replace. I can 
 take a stab at it, if you agree.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3425) OpenJDK - when run with jvm 1.8, should not set MaxPermSize

2014-09-06 Thread Matthew Farrellee (JIRA)
Matthew Farrellee created SPARK-3425:


 Summary: OpenJDK - when run with jvm 1.8, should not set 
MaxPermSize
 Key: SPARK-3425
 URL: https://issues.apache.org/jira/browse/SPARK-3425
 Project: Spark
  Issue Type: Improvement
Reporter: Matthew Farrellee
Assignee: Adrian Wang
Priority: Minor
 Fix For: 1.2.0


In JVM 1.8.0, MaxPermSize is no longer supported.
In spark stderr output, there would be a line of

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; 
support was removed in 8.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3425) OpenJDK - when run with jvm 1.8, should not set MaxPermSize

2014-09-06 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124467#comment-14124467
 ] 

Matthew Farrellee commented on SPARK-3425:
--

this is still an issue for openjdk

  spark-class: line 111: [: openjdk18: integer expression expected
  OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support 
was removed in 8.0

because the version test is specific to oracle java

 OpenJDK - when run with jvm 1.8, should not set MaxPermSize
 ---

 Key: SPARK-3425
 URL: https://issues.apache.org/jira/browse/SPARK-3425
 Project: Spark
  Issue Type: Improvement
Reporter: Matthew Farrellee
Assignee: Adrian Wang
Priority: Minor
 Fix For: 1.2.0


 In JVM 1.8.0, MaxPermSize is no longer supported.
 In spark stderr output, there would be a line of
 Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; 
 support was removed in 8.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition

2014-09-06 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124510#comment-14124510
 ] 

Matthew Farrellee commented on SPARK-1701:
--

ok, and one more

https://github.com/apache/spark/pull/2304 to remove slice terminology from the 
python examples

imho, all 4 of the PRs can be applied to master independently and in any order

 Inconsistent naming: slice or partition
 ---

 Key: SPARK-1701
 URL: https://issues.apache.org/jira/browse/SPARK-1701
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Reporter: Daniel Darabos
Priority: Minor
  Labels: starter

 Throughout the documentation and code slice and partition are used 
 interchangeably. (Or so it seems to me.) It would avoid some confusion for 
 new users to settle on one name. I think partition is winning, since that 
 is the name of the class representing the concept.
 This should not be much more complicated to do than a search & replace. I can 
 take a stab at it, if you agree.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-1701) Inconsistent naming: slice or partition

2014-09-06 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee updated SPARK-1701:
-
Comment: was deleted

(was: ok, i also created 2 other PRs

https://github.com/apache/spark/pull/2302 aims to deprecate numSlices

and

https://github.com/apache/spark/pull/2303 is independent, removing the use of 
numSlices in pyspark/tests.py)

 Inconsistent naming: slice or partition
 ---

 Key: SPARK-1701
 URL: https://issues.apache.org/jira/browse/SPARK-1701
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Reporter: Daniel Darabos
Priority: Minor
  Labels: starter

 Throughout the documentation and code slice and partition are used 
 interchangeably. (Or so it seems to me.) It would avoid some confusion for 
 new users to settle on one name. I think partition is winning, since that 
 is the name of the class representing the concept.
 This should not be much more complicated to do than a search & replace. I can 
 take a stab at it, if you agree.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-1701) Inconsistent naming: slice or partition

2014-09-06 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee updated SPARK-1701:
-
Comment: was deleted

(was: ok, and one more

https://github.com/apache/spark/pull/2304 to remove slice terminology from the 
python examples

imho, all 4 of the PRs can be applied to master independently and in any order)

 Inconsistent naming: slice or partition
 ---

 Key: SPARK-1701
 URL: https://issues.apache.org/jira/browse/SPARK-1701
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Reporter: Daniel Darabos
Priority: Minor
  Labels: starter

 Throughout the documentation and code slice and partition are used 
 interchangeably. (Or so it seems to me.) It would avoid some confusion for 
 new users to settle on one name. I think partition is winning, since that 
 is the name of the class representing the concept.
 This should not be much more complicated to do than a search & replace. I can 
 take a stab at it, if you agree.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3321) Defining a class within python main script

2014-09-06 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124704#comment-14124704
 ] 

Matthew Farrellee commented on SPARK-3321:
--

this has come up a few times. it's not a problem with spark, but rather an 
artifact of how python operates.

do you have a specific suggestion on how the python interface to spark could 
work around this python limitation automatically?
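
for reference, the usual manual workaround is to move the class out of the main script into its own module and ship that module to the workers. a sketch (the file names are made up; sc.addPyFile / spark-submit --py-files are the standard ways to ship the module):

{code:python}
# nulltype.py -- a real module, so the class is importable on the workers
class Null(object):
    def __getitem__(self, key):
        return None

# python_test.py -- the main script
from pyspark import SparkContext
from nulltype import Null          # imported, not defined in __main__

def convert_to_null(x):
    return Null() if x is None else x

sc = SparkContext()
sc.addPyFile("nulltype.py")        # or: spark-submit --py-files nulltype.py python_test.py
A = sc.parallelize([(1, "a"), (2, "b")])
B = sc.parallelize([(1, "x")])
X = A.leftOuterJoin(B).mapValues(lambda v: (v[0], convert_to_null(v[1])))
print(X.collect())
sc.stop()
{code}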

 Defining a class within python main script
 --

 Key: SPARK-3321
 URL: https://issues.apache.org/jira/browse/SPARK-3321
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.1
 Environment: Python version 2.6.6
 Spark version version 1.0.1
 jdk1.6.0_43
Reporter: Shawn Guo
Priority: Critical

 *leftOuterJoin(self, other, numPartitions=None)*
 Perform a left outer join of self and other.
 For each element (k, v) in self, the resulting RDD will either contain all 
 pairs (k, (v, w)) for w in other, or the pair (k, (v, None)) if no elements 
 in other have key k.
 *Background*: leftOuterJoin will produce None element in result dataset.
 I define a new class 'Null' in the main script to replace all python native 
 None to new 'Null' object. 'Null' object overload the [] operator.
 {code:title=Class Null|borderStyle=solid}
 class Null(object):
 def __getitem__(self,key): return None;
 def __getstate__(self): pass;
 def __setstate__(self, dict): pass;
 def convert_to_null(x):
 return Null() if x is None else x
 X = A.leftOuterJoin(B)
 X.mapValues(lambda line: (line[0], convert_to_null(line[1])))
 {code}
 The code seems running good in pyspark console, however spark-submit failed 
 with below error messages:
 /spark-1.0.1-bin-hadoop1/bin/spark-submit --master local[2] 
 /tmp/python_test.py
 {noformat}
   File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/worker.py, line 
 77, in main
 serializer.dump_stream(func(split_index, iterator), outfile)
   File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, 
 line 191, in dump_stream
 self.serializer.dump_stream(self._batched(iterator), stream)
   File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, 
 line 124, in dump_stream
 self._write_with_length(obj, stream)
   File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, 
 line 134, in _write_with_length
 serialized = self.dumps(obj)
   File /data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py, 
 line 279, in dumps
 def dumps(self, obj): return cPickle.dumps(obj, 2)
 PicklingError: Can't pickle class '__main__.Null': attribute lookup 
 __main__.Null failed
 
 org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115)
 
 org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:145)
 org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:33)
 org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:74)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
 
 org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)
 Driver stacktrace:
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
 at scala.Option.foreach(Option.scala:236)
 at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala

[jira] [Commented] (SPARK-3401) Wrong usage of tee command in python/run-tests

2014-09-06 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124705#comment-14124705
 ] 

Matthew Farrellee commented on SPARK-3401:
--

nice catch

 Wrong usage of tee command in python/run-tests
 --

 Key: SPARK-3401
 URL: https://issues.apache.org/jira/browse/SPARK-3401
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
 Fix For: 1.1.1


 In python/run-tests, the tee command is used with the -a option to append to 
 unit-tests.log for logging, but the usage is wrong.
 In the current implementation, the output of the tee command is redirected to 
 unit-tests.log, as in tee -a > unit-tests.log.
 The tee command does not need its output redirected; this mistake causes 
 unit-tests.log to be truncated instead of appended to.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3401) Wrong usage of tee command in python/run-tests

2014-09-06 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee updated SPARK-3401:
-
Fix Version/s: 1.1.1

 Wrong usage of tee command in python/run-tests
 --

 Key: SPARK-3401
 URL: https://issues.apache.org/jira/browse/SPARK-3401
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
 Fix For: 1.1.1


 In python/run-tests, the tee command is used with the -a option to append to 
 unit-tests.log for logging, but the usage is wrong.
 In the current implementation, the output of the tee command is redirected to 
 unit-tests.log, as in tee -a > unit-tests.log.
 The tee command does not need its output redirected; this mistake causes 
 unit-tests.log to be truncated instead of appended to.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3401) Wrong usage of tee command in python/run-tests

2014-09-06 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee resolved SPARK-3401.
--
Resolution: Fixed

 Wrong usage of tee command in python/run-tests
 --

 Key: SPARK-3401
 URL: https://issues.apache.org/jira/browse/SPARK-3401
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
 Fix For: 1.1.1


 In python/run-tests, the tee command is used with the -a option to append to 
 unit-tests.log for logging, but the usage is wrong.
 In the current implementation, the output of the tee command is redirected to 
 unit-tests.log, as in tee -a > unit-tests.log.
 The tee command does not need its output redirected; this mistake causes 
 unit-tests.log to be truncated instead of appended to.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: How spark parallelize maps Slices to tasks/executors/workers

2014-09-06 Thread Matthew Farrellee

On 09/04/2014 09:55 PM, Mozumder, Monir wrote:

I have this 2-node cluster setup, where each node has 4-cores.

 MASTER

 (Worker-on-master)  (Worker-on-node1)

(slaves(master,node1))

SPARK_WORKER_INSTANCES=1

I am trying to understand Spark's parallelize behavior. The sparkPi
example has this code:

 val slices = 8

 val n = 10 * slices

 val count = spark.parallelize(1 to n, slices).map { i =>

   val x = random * 2 - 1

   val y = random * 2 - 1

   if (x*x + y*y < 1) 1 else 0

 }.reduce(_ + _)

As per documentation: Spark will run one task for each slice of the
cluster. Typically you want 2-4 slices for each CPU in your cluster. I
set slices to be 8 which means the workingset will be divided among 8
tasks on the cluster, in turn each worker node gets 4 tasks (1:1 per core)

Questions:

i)  Where can I see task level details? Inside executors I don't see a
task breakdown, so I can't see the effect of slices on the UI.


under http://localhost:4040/stages/ you can drill into individual stages 
to see task details




ii) How to  programmatically find the working set size for the map
function above? I assume it is n/slices (10 above)


it'll be roughly n/slices. you can mapPartitions() and check their lengths
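
for example, in pyspark (a sketch -- assuming sc is a SparkContext and n,
slices are as above; the scala mapPartitions equivalent is the same idea):

  sizes = sc.parallelize(range(1, n + 1), slices) \
            .mapPartitions(lambda it: [sum(1 for _ in it)]) \
            .collect()
  print(sizes)  # roughly n/slices elements in each partition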



iii) Are the multiple tasks run by an executor run sequentially or
in parallel across multiple threads?


parallel. have a look at 
https://spark.apache.org/docs/latest/cluster-overview.html




iv) Reasoning behind 2-4 slices per CPU.


typically things like 2-4 slices per CPU are general rules of thumb 
because tasks are more io bound than not. depending on your workload 
this might change. it's probably one of the last things you'll want to 
optimize, first being the transformation ordering in your dag.




v) I assume ideally we should tune SPARK_WORKER_INSTANCES to
correspond to number of

Bests,

-Monir



best,


matt

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.1.0 (RC4)

2014-09-03 Thread Matthew Farrellee

+1

built from sha w/ make-distribution.sh
tested basic examples (0 data) w/ local on fedora 20 (openjdk 1.7, 
python 2.7.5)
tested detection and log processing (25GB data) w/ mesos (0.19.0) & nfs 
on rhel 7 (openjdk 1.7, python 2.7.5)


On 09/03/2014 03:24 AM, Patrick Wendell wrote:

Please vote on releasing the following candidate as Apache Spark version 1.1.0!

The tag to be voted on is v1.1.0-rc4 (commit 2f9b2bd):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=2f9b2bd7844ee8393dc9c319f4fefedf95f5e460

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.1.0-rc4/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1031/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.1.0-rc4-docs/

Please vote on releasing this package as Apache Spark 1.1.0!

The vote is open until Saturday, September 06, at 08:30 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.1.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== Regressions fixed since RC3 ==
SPARK-3332 - Issue with tagging in EC2 scripts
SPARK-3358 - Issue with regression for m3.XX instances

== What justifies a -1 vote for this release? ==
This vote is happening very late into the QA period compared with
previous votes, so -1 votes should only occur for significant
regressions from 1.0.2. Bugs already present in 1.0.X will not block
this release.

== What default changes should I be aware of? ==
1. The default value of spark.io.compression.codec is now snappy
-- Old behavior can be restored by switching to lzf

2. PySpark now performs external spilling during aggregations.
-- Old behavior can be restored by setting spark.shuffle.spill to false.

3. PySpark uses a new heuristic for determining the parallelism of
shuffle operations.
-- Old behavior can be restored by setting
spark.default.parallelism to the number of cores in the cluster.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org




-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: spark-ec2 depends on stuff in the Mesos repo

2014-09-03 Thread Matthew Farrellee
that's not a bad idea. it would also break the circular dep in versions 
that results in spark X's ec2 script installing spark X-1 by default.


best,


matt

On 09/03/2014 01:17 PM, Shivaram Venkataraman wrote:

The spark-ec2 repository isn't a part of Mesos. Back in the days, Spark
used to be hosted in the Mesos github organization as well and so we put
scripts that were used by Spark under the same organization.

FWIW I don't think these scripts belong in the Spark repository. They are
helper scripts that setup EC2 clusters with different components like HDFS,
Spark, Tachyon etc. Also one of the motivations for creating this
repository was the ability to change these scripts without requiring a new
Spark release or a new AMI etc.

We can move the repository to a different github organization like AMPLab
if that'll make sense.

Thanks
Shivaram


On Wed, Sep 3, 2014 at 10:06 AM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:


Spawned by this discussion
https://github.com/apache/spark/pull/1120#issuecomment-54305831.

See these 2 lines in spark_ec2.py:

- spark_ec2 L42

https://github.com/apache/spark/blob/6a72a36940311fcb3429bd34c8818bc7d513115c/ec2/spark_ec2.py#L42



- spark_ec2 L566

https://github.com/apache/spark/blob/6a72a36940311fcb3429bd34c8818bc7d513115c/ec2/spark_ec2.py#L566




Why does the spark-ec2 script depend on stuff in the Mesos repo? Should
they be moved to the Spark repo?

Nick
​






-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: spark-ec2 depends on stuff in the Mesos repo

2014-09-03 Thread Matthew Farrellee
oh, i see pwendell did a patch to the release branch to make the 
release version == --spark-version default


best,


matt

On 09/03/2014 01:30 PM, Shivaram Venkataraman wrote:

Actually the circular dependency doesn't depend on the spark-ec2 scripts
-- The scripts contain download links to many Spark versions and you can
configure which one should be used.

Shivaram


On Wed, Sep 3, 2014 at 10:22 AM, Matthew Farrellee m...@redhat.com
mailto:m...@redhat.com wrote:

that's not a bad idea. it would also break the circular dep in
versions that results in spark X's ec2 script installing spark X-1
by default.

best,


matt


On 09/03/2014 01:17 PM, Shivaram Venkataraman wrote:

The spark-ec2 repository isn't a part of Mesos. Back in the
days, Spark
used to be hosted in the Mesos github organization as well and
so we put
scripts that were used by Spark under the same organization.

FWIW I don't think these scripts belong in the Spark repository.
They are
helper scripts that setup EC2 clusters with different components
like HDFS,
Spark, Tachyon etc. Also one of the motivations for creating this
repository was the ability to change these scripts without
requiring a new
Spark release or a new AMI etc.

We can move the repository to a different github organization
like AMPLab
if that'll make sense.

Thanks
Shivaram


On Wed, Sep 3, 2014 at 10:06 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Spawned by this discussion
https://github.com/apache/spark/pull/1120#issuecomment-54305831.

See these 2 lines in spark_ec2.py:

 - spark_ec2 L42
   https://github.com/apache/spark/blob/6a72a36940311fcb3429bd34c8818bc7d513115c/ec2/spark_ec2.py#L42

 - spark_ec2 L566
   https://github.com/apache/spark/blob/6a72a36940311fcb3429bd34c8818bc7d513115c/ec2/spark_ec2.py#L566


Why does the spark-ec2 script depend on stuff in the Mesos
repo? Should
they be moved to the Spark repo?

Nick







-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Ask something about spark

2014-09-03 Thread Matthew Farrellee

reynold,

would you folks be willing to put some creative commons license 
information on the site and its content?


best,


matt

On 09/02/2014 06:32 PM, Reynold Xin wrote:

I think in general that is fine. It would be great if your slides come with
proper attribution.


On Tue, Sep 2, 2014 at 3:31 PM, Sanghoon Lee phoenixl...@gmail.com wrote:


Hi, I am phoenixlee, a Spark programmer in Korea.

I have a good opportunity this time to teach Spark to college students and
office workers. This course will be run with the support of the government.
Can I use the material (pictures, samples, etc.) from the Spark homepage for
this course? Of course, I will include an acknowledgement and the webpage URL.
It would be a good opportunity, since there are still no Spark teaching
materials, education, or community in Korea.

Thanks.






-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Ask something about spark

2014-09-03 Thread Matthew Farrellee

CC or Apache, either way it'd be helpful to have the license listed in the footer of the pages

best,


matt

On 09/03/2014 02:23 PM, Reynold Xin wrote:

I am not sure if I can just go ahead and update the website with a
creative common license.

IIRC, ASF websites are also Apache 2.0 license. Might need somebody from
legal to chime in.


On Wed, Sep 3, 2014 at 11:15 AM, Matthew Farrellee m...@redhat.com wrote:

reynold,

would you folks be willing to put some creative commons license
information on the site and its content?

best,


matt


On 09/02/2014 06:32 PM, Reynold Xin wrote:

I think in general that is fine. It would be great if your
slides come with
proper attribution.


On Tue, Sep 2, 2014 at 3:31 PM, Sanghoon Lee phoenixl...@gmail.com wrote:

Hi, I am phoenixlee, a Spark programmer in Korea.

I have a good opportunity this time to teach Spark to college
students and office workers. This course will be run with the
support of the government. Can I use the material (pictures,
samples, etc.) from the Spark homepage for this course? Of
course, I will include an acknowledgement and the webpage URL.
It would be a good opportunity, since there are still no Spark
teaching materials, education, or community in Korea.

Thanks.







-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Commented] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator

2014-09-02 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118566#comment-14118566
 ] 

Matthew Farrellee commented on SPARK-3181:
--

please excuse my changes to this issue; i'm not planning to work on it, but i
can't seem to remove myself as the assignee.

 Add Robust Regression Algorithm with Huber Estimator
 

 Key: SPARK-3181
 URL: https://issues.apache.org/jira/browse/SPARK-3181
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Fan Jiang
Assignee: Matthew Farrellee
Priority: Critical
  Labels: features
 Fix For: 1.1.1, 1.2.0

   Original Estimate: 0h
  Remaining Estimate: 0h

 Linear least squares estimates assume the errors are normally distributed and
 can behave badly when the errors are heavy-tailed. In practice we encounter
 various types of data, so we need to include Robust Regression to employ a
 fitting criterion that is less vulnerable than least squares.
 In 1973, Huber introduced M-estimation for regression which stands for 
 maximum likelihood type. The method is resistant to outliers in the 
 response variable and has been widely used.
 The new feature for MLlib will contain 3 new files
 /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala
 /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala
 /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala
 and one new class HuberRobustGradient in 
 /main/scala/org/apache/spark/mllib/optimization/Gradient.scala
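
For reference, a minimal NumPy sketch of the Huber loss and its gradient described
above (added here for illustration only, not part of the original proposal or the
proposed MLlib code; delta defaults to the common 1.345 tuning constant):

import numpy as np

def huber_loss(residual, delta=1.345):
    # Quadratic for small residuals, linear for large ones, so outliers
    # contribute less than under least squares.
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

def huber_gradient(residual, delta=1.345):
    # Derivative of the Huber loss with respect to the residual.
    return np.where(np.abs(residual) <= delta, residual, delta * np.sign(residual))

# Example: a large outlier (10.0) is penalized linearly, not quadratically.
print(huber_loss(np.array([0.5, 10.0])))   # -> [0.125, ~12.546]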



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [PySpark] large # of partitions causes OOM

2014-09-02 Thread Matthew Farrellee

On 08/29/2014 06:05 PM, Nick Chammas wrote:

Here’s a repro for PySpark:

a = sc.parallelize(["Nick", "John", "Bob"])
a = a.repartition(24000)
a.keyBy(lambda x: len(x)).reduceByKey(lambda x, y: x + y).take(1)

When I try this on an EC2 cluster with 1.1.0-rc2 and Python 2.7, this is
what I get:

a = sc.parallelize(["Nick", "John", "Bob"])
a = a.repartition(24000)
a.keyBy(lambda x: len(x)).reduceByKey(lambda x, y: x + y).take(1)

14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(0, ip-10-138-29-167.ec2.internal,46252,0) with no recent heart beats: 175143ms exceeds 45000ms
14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(10, ip-10-138-18-106.ec2.internal,33711,0) with no recent heart beats: 175359ms exceeds 45000ms
14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(19, ip-10-139-36-207.ec2.internal,52208,0) with no recent heart beats: 173061ms exceeds 45000ms
14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(5, ip-10-73-142-70.ec2.internal,56162,0) with no recent heart beats: 176816ms exceeds 45000ms
14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, ip-10-236-145-200.ec2.internal,40959,0) with no recent heart beats: 182241ms exceeds 45000ms
14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(4, ip-10-139-1-195.ec2.internal,49221,0) with no recent heart beats: 178406ms exceeds 45000ms
14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver thread-3
java.lang.OutOfMemoryError: Java heap space
    at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
    at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
    at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162)
    at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
    at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514)
    at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Exception in thread Result resolver thread-3
14/08/29 21:56:26 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(ip-10-73-142-223.ec2.internal,54014)
java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
    at org.apache.spark.network.SendingConnection.read(Connection.scala:390)
    at org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:199)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
java.lang.OutOfMemoryError: Java heap space
    at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
    at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
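
Not part of the original thread, but a minimal workaround sketch for readers hitting
the same wall: the driver appears to run out of heap while deserializing results from
24,000 tiny tasks, so coalescing back to a smaller partition count before the action
(and, if needed, raising driver memory via spark-submit --driver-memory) usually avoids
it. The coalesce target of 240 and the app name are arbitrary illustrative values.

from pyspark import SparkContext

# Assumes driver memory has been raised externally, e.g. spark-submit --driver-memory 4g
sc = SparkContext(appName="partition-oom-workaround")

a = sc.parallelize(["Nick", "John", "Bob"])
a = a.repartition(24000)

# Coalesce back down so the later stages (and the driver's bookkeeping of their
# results) deal with 240 tasks instead of 24,000.
result = (a.coalesce(240)
           .keyBy(lambda x: len(x))
           .reduceByKey(lambda x, y: x + y)
           .take(1))
print(result)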
 
