[jira] [Created] (SPARK-1678) Compression loses repeated values.

2014-04-30 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-1678:
---

 Summary: Compression loses repeated values.
 Key: SPARK-1678
 URL: https://issues.apache.org/jira/browse/SPARK-1678
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.0.0


Here's a test case:

{code}
  test("all the same strings") {
    sparkContext.parallelize(1 to 1000).map(_ =>
      StringData("test")).registerAsTable("test1000")
    assert(sql("SELECT * FROM test1000").count() === 1000)
    cacheTable("test1000")
    assert(sql("SELECT * FROM test1000").count() === 1000)
  }
{code}

First assert passes, second one fails.
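
A minimal standalone sketch (not the actual Spark SQL columnar codec) of how a run-length 
style decoder that drops run counts would reproduce the symptom above, where 1000 identical 
rows collapse after caching:

{code}
// Standalone sketch: RLE over a Seq, with a deliberately buggy decode that ignores run lengths.
object RleSketch {
  // Encode consecutive duplicates as (value, runLength) pairs.
  def encode[T](xs: Seq[T]): Seq[(T, Int)] =
    xs.foldLeft(List.empty[(T, Int)]) {
      case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest
      case (acc, x)                      => (x, 1) :: acc
    }.reverse

  // Buggy decode: emits each run's value once and drops the count.
  def decodeBuggy[T](runs: Seq[(T, Int)]): Seq[T] = runs.map(_._1)

  // Correct decode: expands every run back to its original length.
  def decodeCorrect[T](runs: Seq[(T, Int)]): Seq[T] =
    runs.flatMap { case (v, n) => Seq.fill(n)(v) }

  def main(args: Array[String]): Unit = {
    val data = Seq.fill(1000)("test")
    val runs = encode(data)
    println(decodeBuggy(runs).size)   // 1    -- the failing second assert
    println(decodeCorrect(runs).size) // 1000 -- expected
  }
}
{code}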



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1004) PySpark on YARN

2014-04-30 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1004.


Resolution: Fixed

Issue resolved by pull request 30
[https://github.com/apache/spark/pull/30]

 PySpark on YARN
 ---

 Key: SPARK-1004
 URL: https://issues.apache.org/jira/browse/SPARK-1004
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Reporter: Josh Rosen
Assignee: Sandy Ryza
Priority: Blocker
 Fix For: 1.0.0


 This is for tracking progress on supporting YARN in PySpark.
 We might be able to use {{yarn-client}} mode 
 (https://spark.incubator.apache.org/docs/latest/running-on-yarn.html#launch-spark-application-with-yarn-client-mode).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1626) Update Spark YARN docs to use spark-submit

2014-04-30 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1626.


Resolution: Duplicate

 Update Spark YARN docs to use spark-submit
 --

 Key: SPARK-1626
 URL: https://issues.apache.org/jira/browse/SPARK-1626
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Reporter: Patrick Wendell
Assignee: Sandy Ryza
Priority: Blocker
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1492) running-on-yarn doc should use spark-submit script for examples

2014-04-30 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1492:
---

Priority: Blocker  (was: Major)

 running-on-yarn doc should use spark-submit script for examples
 ---

 Key: SPARK-1492
 URL: https://issues.apache.org/jira/browse/SPARK-1492
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.0.0
Reporter: Thomas Graves
Assignee: Sandy Ryza
Priority: Blocker

 The spark-class script puts out lots of warnings telling users to use the 
 spark-submit script with new options.  We should update the 
 running-on-yarn.md docs to have examples using the spark-submit script rather 
 than spark-class. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1466) Pyspark doesn't check if gateway process launches correctly

2014-04-30 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1466:
---

Fix Version/s: (was: 1.0.0)
   1.0.1

 Pyspark doesn't check if gateway process launches correctly
 ---

 Key: SPARK-1466
 URL: https://issues.apache.org/jira/browse/SPARK-1466
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.9.0, 0.9.1
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
Priority: Blocker
 Fix For: 1.0.1


 If the gateway process fails to start correctly (e.g., because JAVA_HOME 
 isn't set correctly, there's no Spark jar, etc.), right now pyspark fails 
 because of a very difficult-to-understand error, where we try to parse stdout 
 to get the port where Spark started and there's nothing there.  We should 
 properly catch the error, print it to the user, and exit.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-922) Update Spark AMI to Python 2.7

2014-04-30 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-922:
--

Fix Version/s: (was: 1.0.0)
   1.1.0

 Update Spark AMI to Python 2.7
 --

 Key: SPARK-922
 URL: https://issues.apache.org/jira/browse/SPARK-922
 Project: Spark
  Issue Type: Task
  Components: EC2, PySpark
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Josh Rosen
Priority: Blocker
 Fix For: 1.1.0


 Many Python libraries only support Python 2.7+, so we should make Python 2.7 
 the default Python on the Spark AMIs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-922) Update Spark AMI to Python 2.7

2014-04-30 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-922:
--

Priority: Major  (was: Blocker)

 Update Spark AMI to Python 2.7
 --

 Key: SPARK-922
 URL: https://issues.apache.org/jira/browse/SPARK-922
 Project: Spark
  Issue Type: Task
  Components: EC2, PySpark
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Josh Rosen
 Fix For: 1.1.0


 Many Python libraries only support Python 2.7+, so we should make Python 2.7 
 the default Python on the Spark AMIs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-922) Update Spark AMI to Python 2.7

2014-04-30 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985219#comment-13985219
 ] 

Patrick Wendell commented on SPARK-922:
---

This is no longer a blocker now that we've downgraded the Python dependency, 
but it would still be nice to have.

 Update Spark AMI to Python 2.7
 --

 Key: SPARK-922
 URL: https://issues.apache.org/jira/browse/SPARK-922
 Project: Spark
  Issue Type: Task
  Components: EC2, PySpark
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Josh Rosen
 Fix For: 1.1.0


 Many Python libraries only support Python 2.7+, so we should make Python 2.7 
 the default Python on the Spark AMIs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1682) LogisticRegressionWithSGD should support svmlight data and gradient descent w/o sampling

2014-04-30 Thread Dong Wang (JIRA)
Dong Wang created SPARK-1682:


 Summary: LogisticRegressionWithSGD should support svmlight data 
and gradient descent w/o sampling
 Key: SPARK-1682
 URL: https://issues.apache.org/jira/browse/SPARK-1682
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Dong Wang
 Fix For: 1.0.0


The LogisticRegressionWithSGD example does not expose the following capabilities 
that already exist inside MLlib:
  * reading svmlight data
  * regularization with l1 and l2
  * adding an intercept
  * writing the model to a file
  * reading a model and generating predictions

The GradientDescent optimizer does sampling before each gradient step. When the 
input data is already shuffled beforehand, it is possible to scan the data and 
take a gradient step for each data instance. This could potentially be more 
efficient.
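
A sketch of the kind of end-to-end usage this asks for (not an existing MLlib example; 
the input path and parameter values are illustrative), assuming MLUtils.loadLibSVMFile 
for the svmlight-format input:

{code}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.util.MLUtils

object LogisticRegressionSvmlightSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "LogisticRegressionSvmlightSketch")
    // svmlight/libsvm text format: "<label> <index>:<value> <index>:<value> ..."
    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt").cache()
    val model = LogisticRegressionWithSGD.train(data, numIterations = 100)
    // Score the training set and report accuracy.
    val accuracy = data
      .map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0)
      .mean()
    println(s"training accuracy: $accuracy")
    sc.stop()
  }
}
{code}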



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1644) The org.datanucleus:* should not be packaged into spark-assembly-*.jar

2014-04-30 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-1644:
---

Summary: The org.datanucleus:*  should not be packaged into 
spark-assembly-*.jar  (was: hql("CREATE TABLE IF NOT EXISTS src (key INT, 
value STRING)") throws an exception)

 The org.datanucleus:*  should not be packaged into spark-assembly-*.jar
 ---

 Key: SPARK-1644
 URL: https://issues.apache.org/jira/browse/SPARK-1644
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Guoqiang Li 
Assignee: Guoqiang Li
 Attachments: spark.log


 cat conf/hive-site.xml
 {code:xml}
 <configuration>
   <property>
     <name>javax.jdo.option.ConnectionURL</name>
     <value>jdbc:postgresql://bj-java-hugedata1:7432/hive</value>
   </property>
   <property>
     <name>javax.jdo.option.ConnectionDriverName</name>
     <value>org.postgresql.Driver</value>
   </property>
   <property>
     <name>javax.jdo.option.ConnectionUserName</name>
     <value>hive</value>
   </property>
   <property>
     <name>javax.jdo.option.ConnectionPassword</name>
     <value>passwd</value>
   </property>
   <property>
     <name>hive.metastore.local</name>
     <value>false</value>
   </property>
   <property>
     <name>hive.metastore.warehouse.dir</name>
     <value>hdfs://host:8020/user/hive/warehouse</value>
   </property>
 </configuration>
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1629) Spark should inline use of commons-lang `SystemUtils.IS_OS_WINDOWS`

2014-04-30 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985352#comment-13985352
 ] 

Guoqiang Li commented on SPARK-1629:


[The PR 569|https://github.com/apache/spark/pull/569]

 Spark should inline use of commons-lang `SystemUtils.IS_OS_WINDOWS` 
 

 Key: SPARK-1629
 URL: https://issues.apache.org/jira/browse/SPARK-1629
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Guoqiang Li 
Assignee: Guoqiang Li
Priority: Minor

 Right now we use this but don't depend on it explicitly (which is wrong). We 
 should probably just inline this function and remove the need to add a 
 dependency.
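
One way to inline the check without the commons-lang dependency (a sketch only; the actual 
patch may differ) is to read the same "os.name" system property that 
SystemUtils.IS_OS_WINDOWS is derived from:

{code}
object OsCheck {
  // Equivalent of commons-lang's SystemUtils.IS_OS_WINDOWS, with no extra dependency.
  val isWindows: Boolean =
    Option(System.getProperty("os.name")).exists(_.startsWith("Windows"))
}
{code}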



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1629) Spark should inline use of commons-lang `SystemUtils.IS_OS_WINDOWS`

2014-04-30 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1629.


   Resolution: Fixed
Fix Version/s: 1.0.0

Thanks for this fix

 Spark should inline use of commons-lang `SystemUtils.IS_OS_WINDOWS` 
 

 Key: SPARK-1629
 URL: https://issues.apache.org/jira/browse/SPARK-1629
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Guoqiang Li 
Assignee: Guoqiang Li
Priority: Minor
 Fix For: 1.0.0


 Right now we use this but don't depend on it explicitly (which is wrong). We 
 should probably just inline this function and remove the need to add a 
 dependency.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1683) Display filesystem read statistics with each task

2014-04-30 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-1683:
--

 Summary: Display filesystem read statistics with each task
 Key: SPARK-1683
 URL: https://issues.apache.org/jira/browse/SPARK-1683
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Kay Ousterhout
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1492) running-on-yarn doc should use spark-submit script for examples

2014-04-30 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985750#comment-13985750
 ] 

Sandy Ryza commented on SPARK-1492:
---

https://github.com/apache/spark/pull/601

 running-on-yarn doc should use spark-submit script for examples
 ---

 Key: SPARK-1492
 URL: https://issues.apache.org/jira/browse/SPARK-1492
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.0.0
Reporter: Thomas Graves
Assignee: Sandy Ryza
Priority: Blocker

 The spark-class script puts out lots of warnings telling users to use the 
 spark-submit script with new options.  We should update the 
 running-on-yarn.md docs to have examples using the spark-submit script rather 
 than spark-class. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1684) Merge script should standardize SPARK-XXX prefix

2014-04-30 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-1684:
--

 Summary: Merge script should standardize SPARK-XXX prefix
 Key: SPARK-1684
 URL: https://issues.apache.org/jira/browse/SPARK-1684
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Minor
 Fix For: 1.1.0


If users write "[SPARK-XXX] Issue" or "SPARK-XXX. Issue" or "SPARK XXX: Issue" 
we should convert it to "SPARK XXX: Issue"
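
A sketch of the normalization step only (the actual merge script is a Python script under 
dev/, and the helper below is illustrative, not its implementation); it targets the 
"SPARK-XXX: Issue" form that the description is later corrected to:

{code}
object TitleNormalizer {
  // Matches leading variants such as "[SPARK-123]", "SPARK-123.", "SPARK 123:".
  private val Prefix = """(?i)^\[?spark[- ]*(\d+)\]?[:.]?\s*(.*)""".r

  def normalize(title: String): String = title.trim match {
    case Prefix(num, rest) => s"SPARK-$num: $rest"
    case other             => other
  }
}

// normalize("[SPARK-1684] Merge script ...") == "SPARK-1684: Merge script ..."
// normalize("SPARK-1684. Merge script ...")  == "SPARK-1684: Merge script ..."
// normalize("SPARK 1684: Merge script ...")  == "SPARK-1684: Merge script ..."
{code}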



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1684) Merge script should standardize SPARK-XXX prefix

2014-04-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985770#comment-13985770
 ] 

Sean Owen commented on SPARK-1684:
--

(Can I be pedantic and suggest standardizing to "SPARK-XXX"? This is how issues 
are reported in other projects, like HIVE-123 and MAPREDUCE-234.)

 Merge script should standardize SPARK-XXX prefix
 

 Key: SPARK-1684
 URL: https://issues.apache.org/jira/browse/SPARK-1684
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Minor
 Fix For: 1.1.0


 If users write "[SPARK-XXX] Issue" or "SPARK-XXX. Issue" or "SPARK XXX: 
 Issue" we should convert it to "SPARK XXX: Issue"



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1683) Display filesystem read statistics with each task

2014-04-30 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985785#comment-13985785
 ] 

Kay Ousterhout commented on SPARK-1683:
---

I've already done this and will submit a PR this weekend

 Display filesystem read statistics with each task
 -

 Key: SPARK-1683
 URL: https://issues.apache.org/jira/browse/SPARK-1683
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Kay Ousterhout
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1684) Merge script should standardize SPARK-XXX prefix

2014-04-30 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1684:
---

Description: If users write "[SPARK-XXX] Issue" or "SPARK-XXX. Issue" or 
"SPARK XXX: Issue" we should convert it to "SPARK-XXX: Issue"  (was: If users 
write "[SPARK-XXX] Issue" or "SPARK-XXX. Issue" or "SPARK XXX: Issue" we should 
convert it to "SPARK XXX: Issue")

 Merge script should standardize SPARK-XXX prefix
 

 Key: SPARK-1684
 URL: https://issues.apache.org/jira/browse/SPARK-1684
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Minor
 Fix For: 1.1.0


 If users write "[SPARK-XXX] Issue" or "SPARK-XXX. Issue" or "SPARK XXX: 
 Issue" we should convert it to "SPARK-XXX: Issue"



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1684) Merge script should standardize SPARK-XXX prefix

2014-04-30 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985796#comment-13985796
 ] 

Patrick Wendell commented on SPARK-1684:


Just a typo in the description! "SPARK-XXX" is the correct format, a la:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

 Merge script should standardize SPARK-XXX prefix
 

 Key: SPARK-1684
 URL: https://issues.apache.org/jira/browse/SPARK-1684
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Minor
 Fix For: 1.1.0


 If users write "[SPARK-XXX] Issue" or "SPARK-XXX. Issue" or "SPARK XXX: 
 Issue" we should convert it to "SPARK-XXX: Issue"



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1685) retryTimer not canceled on actor restart in Worker and AppClient

2014-04-30 Thread Mark Hamstra (JIRA)
Mark Hamstra created SPARK-1685:
---

 Summary: retryTimer not canceled on actor restart in Worker and 
AppClient
 Key: SPARK-1685
 URL: https://issues.apache.org/jira/browse/SPARK-1685
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Mark Hamstra
Assignee: Mark Hamstra


Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster 
when those Actors start.  The attempt at registration is accomplished by 
starting a retryTimer via the Akka scheduler that will use the registered 
timeout interval and retry number to make repeated attempts to register with 
all known Masters before giving up and either marking as dead or calling 
System.exit.

The receive methods of these actors can, however, throw exceptions, which will 
lead to the actors restarting, registerWithMaster being called again on 
restart, and another retryTimer being scheduled without canceling the already 
running retryTimer.  Assuming that all of the rest of the restart logic is 
correct for these actors (which I don't believe is actually a given), having 
multiple retryTimers running presents at least a condition in which the 
restarted actor will not be able to make the full number of retry attempts 
before an earlier retryTimer takes the give up action.

Canceling the retryTimer in the actor's postStop hook should suffice. 
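
A minimal Akka sketch of the proposed fix (names are illustrative, not the actual 
Worker/AppClient code): hold on to the scheduled Cancellable and cancel it in postStop, 
which also runs during the default restart sequence, so a restarted actor cannot leave a 
stale retryTimer behind:

{code}
import akka.actor.{Actor, Cancellable}
import scala.concurrent.duration._

class RegistrationActor extends Actor {
  import context.dispatcher
  private var retryTimer: Option[Cancellable] = None

  override def preStart(): Unit = registerWithMaster()

  // postStop runs on shutdown and as part of the default restart sequence, so the
  // old timer is cancelled before the restarted instance schedules a new one.
  override def postStop(): Unit = {
    retryTimer.foreach(_.cancel())
    retryTimer = None
  }

  private def registerWithMaster(): Unit = {
    retryTimer = Some(
      context.system.scheduler.schedule(0.seconds, 30.seconds, self, "retry-registration"))
  }

  def receive: Receive = {
    case "retry-registration" =>
      // try all known Masters; mark dead / System.exit after the retry limit
  }
}
{code}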



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-823) spark.default.parallelism's default is inconsistent across scheduler backends

2014-04-30 Thread Diana Carroll (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985880#comment-13985880
 ] 

Diana Carroll commented on SPARK-823:
-

Yes, please clarify the documentation; I just ran into this.  The Configuration 
guide (http://spark.apache.org/docs/latest/configuration.html) says the default 
is 8.

In testing this on standalone Spark, there actually is no default value for the 
variable:
sc.getConf.contains("spark.default.parallelism")
res1: Boolean = false

It looks like if the variable is not set, then the default behavior is decided 
in code, e.g. Partitioner.scala:
{code}
if (rdd.context.conf.contains("spark.default.parallelism")) {
  new HashPartitioner(rdd.context.defaultParallelism)
} else {
  new HashPartitioner(bySize.head.partitions.size)
}
{code}
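
For reference, the property only shows up in the conf once it is set explicitly (these are 
real API calls; the value 8 is just the old documented default, used here for illustration):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("parallelism-check")
  .set("spark.default.parallelism", "8")

val sc = new SparkContext(conf)
sc.getConf.contains("spark.default.parallelism")  // true, because it was set above
sc.defaultParallelism                             // 8
{code}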

 spark.default.parallelism's default is inconsistent across scheduler backends
 -

 Key: SPARK-823
 URL: https://issues.apache.org/jira/browse/SPARK-823
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Spark Core
Affects Versions: 0.8.0, 0.7.3
Reporter: Josh Rosen
Priority: Minor

 The [0.7.3 configuration 
 guide|http://spark-project.org/docs/latest/configuration.html] says that 
 {{spark.default.parallelism}}'s default is 8, but the default is actually 
 max(totalCoreCount, 2) for the standalone scheduler backend, 8 for the Mesos 
 scheduler, and {{threads}} for the local scheduler:
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala#L157
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/mesos/MesosSchedulerBackend.scala#L317
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/local/LocalScheduler.scala#L150
 Should this be clarified in the documentation?  Should the Mesos scheduler 
 backend's default be revised?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-823) spark.default.parallelism's default is inconsistent across scheduler backends

2014-04-30 Thread Diana Carroll (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985934#comment-13985934
 ] 

Diana Carroll commented on SPARK-823:
-

Okay, this is definitely more than a documentation bug, because PySpark and 
Scala work differently if spark.default.parallelism isn't set by the user.  I'm 
testing using wordcount.

Pyspark: reduceByKey will use the value of sc.defaultParallelism.  That value 
is set to the number of threads when running locally.  On my Spark Standalone 
cluster which has a single node with a single core, the value is 2.  If I set 
spark.default.parallelism, it will set sc.defaultParallelism to that value and 
use that.

Scala: reduceByKey will use the number of partitions in my file/map stage and 
ignore the value of sc.defaultParallelism.  sc.defaultParallelism is set by the 
same logic as PySpark (number of threads for local, 2 for my microcluster); it 
is just ignored.

I'm not sure which approach is correct.  Scala works as described here: 
http://spark.apache.org/docs/latest/tuning.html
{quote}
Spark automatically sets the number of “map” tasks to run on each file 
according to its size (though you can control it through optional parameters to 
SparkContext.textFile, etc), and for distributed “reduce” operations, such as 
groupByKey and reduceByKey, it uses the largest parent RDD’s number of 
partitions. You can pass the level of parallelism as a second argument (see the 
spark.PairRDDFunctions documentation), or set the config property 
spark.default.parallelism to change the default. In general, we recommend 2-3 
tasks per CPU core in your cluster.
{quote}
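
The quoted behaviour can also be made explicit instead of relying on either default, since 
reduceByKey accepts the number of partitions as a second argument (sc is assumed to be an 
existing SparkContext, and the path below is illustrative):

{code}
val counts = sc.textFile("hdfs:///path/to/input")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 8)   // 8 reduce partitions, regardless of the default
{code}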



 spark.default.parallelism's default is inconsistent across scheduler backends
 -

 Key: SPARK-823
 URL: https://issues.apache.org/jira/browse/SPARK-823
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Spark Core
Affects Versions: 0.8.0, 0.7.3
Reporter: Josh Rosen
Priority: Minor

 The [0.7.3 configuration 
 guide|http://spark-project.org/docs/latest/configuration.html] says that 
 {{spark.default.parallelism}}'s default is 8, but the default is actually 
 max(totalCoreCount, 2) for the standalone scheduler backend, 8 for the Mesos 
 scheduler, and {{threads}} for the local scheduler:
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala#L157
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/mesos/MesosSchedulerBackend.scala#L317
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/local/LocalScheduler.scala#L150
 Should this be clarified in the documentation?  Should the Mesos scheduler 
 backend's default be revised?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-823) spark.default.parallelism's default is inconsistent across scheduler backends

2014-04-30 Thread Diana Carroll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Diana Carroll updated SPARK-823:


Component/s: PySpark

 spark.default.parallelism's default is inconsistent across scheduler backends
 -

 Key: SPARK-823
 URL: https://issues.apache.org/jira/browse/SPARK-823
 Project: Spark
  Issue Type: Bug
  Components: Documentation, PySpark, Spark Core
Affects Versions: 0.8.0, 0.7.3, 0.9.1
Reporter: Josh Rosen
Priority: Minor

 The [0.7.3 configuration 
 guide|http://spark-project.org/docs/latest/configuration.html] says that 
 {{spark.default.parallelism}}'s default is 8, but the default is actually 
 max(totalCoreCount, 2) for the standalone scheduler backend, 8 for the Mesos 
 scheduler, and {{threads}} for the local scheduler:
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala#L157
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/mesos/MesosSchedulerBackend.scala#L317
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/local/LocalScheduler.scala#L150
 Should this be clarified in the documentation?  Should the Mesos scheduler 
 backend's default be revised?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-823) spark.default.parallelism's default is inconsistent across scheduler backends

2014-04-30 Thread Diana Carroll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Diana Carroll updated SPARK-823:


Affects Version/s: 0.9.1

 spark.default.parallelism's default is inconsistent across scheduler backends
 -

 Key: SPARK-823
 URL: https://issues.apache.org/jira/browse/SPARK-823
 Project: Spark
  Issue Type: Bug
  Components: Documentation, PySpark, Spark Core
Affects Versions: 0.8.0, 0.7.3, 0.9.1
Reporter: Josh Rosen
Priority: Minor

 The [0.7.3 configuration 
 guide|http://spark-project.org/docs/latest/configuration.html] says that 
 {{spark.default.parallelism}}'s default is 8, but the default is actually 
 max(totalCoreCount, 2) for the standalone scheduler backend, 8 for the Mesos 
 scheduler, and {{threads}} for the local scheduler:
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala#L157
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/mesos/MesosSchedulerBackend.scala#L317
 https://github.com/mesos/spark/blob/v0.7.3/core/src/main/scala/spark/scheduler/local/LocalScheduler.scala#L150
 Should this be clarified in the documentation?  Should the Mesos scheduler 
 backend's default be revised?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1685) retryTimer not canceled on actor restart in Worker and AppClient

2014-04-30 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-1685:


Description: 
Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster 
when those Actors start.  The attempt at registration is accomplished by 
starting a retryTimer via the Akka scheduler that will use the registered 
timeout interval and retry number to make repeated attempts to register with 
all known Masters before giving up and either marking as dead or calling 
System.exit.

The receive methods of these actors can, however, throw exceptions, which will 
lead to the actor restarting, registerWithMaster being called again on restart, 
and another retryTimer being scheduled without canceling the already running 
retryTimer.  Assuming that all of the rest of the restart logic is correct for 
these actors (which I don't believe is actually a given), having multiple 
retryTimers running presents at least a condition in which the restarted actor 
will not be able to make the full number of retry attempts before an earlier 
retryTimer takes the give up action.

Canceling the retryTimer in the actor's postStop hook should suffice. 

  was:
Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster 
when those Actors start.  The attempt at registration is accomplished by 
starting a retryTimer via the Akka scheduler that will use the registered 
timeout interval and retry number to make repeated attempts to register with 
all known Masters before giving up and either marking as dead or calling 
System.exit.

The receive methods of these actors can, however, throw exceptions, which will 
lead to the actors restarting, registerWithMaster being called again on 
restart, and another retryTimer being scheduled without canceling the already 
running retryTimer.  Assuming that all of the rest of the restart logic is 
correct for these actors (which I don't believe is actually a given), having 
multiple retryTimers running presents at least a condition in which the 
restarted actor will not be able to make the full number of retry attempts 
before an earlier retryTimer takes the give up action.

Canceling the retryTimer in the actor's postStop hook should suffice. 


 retryTimer not canceled on actor restart in Worker and AppClient
 

 Key: SPARK-1685
 URL: https://issues.apache.org/jira/browse/SPARK-1685
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Mark Hamstra
Assignee: Mark Hamstra

 Both deploy.worker.Worker and deploy.client.AppClient try to 
 registerWithMaster when those Actors start.  The attempt at registration is 
 accomplished by starting a retryTimer via the Akka scheduler that will use 
 the registered timeout interval and retry number to make repeated attempts to 
 register with all known Masters before giving up and either marking as dead 
 or calling System.exit.
 The receive methods of these actors can, however, throw exceptions, which 
 will lead to the actor restarting, registerWithMaster being called again on 
 restart, and another retryTimer being scheduled without canceling the already 
 running retryTimer.  Assuming that all of the rest of the restart logic is 
 correct for these actors (which I don't believe is actually a given), having 
 multiple retryTimers running presents at least a condition in which the 
 restarted actor will not be able to make the full number of retry attempts 
 before an earlier retryTimer takes the give up action.
 Canceling the retryTimer in the actor's postStop hook should suffice. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1685) retryTimer not canceled on actor restart in Worker and AppClient

2014-04-30 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-1685:


Description: 
Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster 
when those Actors start.  The attempt at registration is accomplished by 
starting a retryTimer via the Akka scheduler that will use the registered 
timeout interval and retry number to make repeated attempts to register with 
all known Masters before giving up and either marking as dead or calling 
System.exit.

The receive methods of these actors can, however, throw exceptions, which will 
lead to the actor restarting, registerWithMaster being called again on restart, 
and another retryTimer being scheduled without canceling the already running 
retryTimer.  Assuming that all of the rest of the restart logic is correct for 
these actors (which I don't believe is actually a given), having multiple 
retryTimers running presents at least a condition in which the restarted actor 
may not be able to make the full number of retry attempts before an earlier 
retryTimer takes the give up action.

Canceling the retryTimer in the actor's postStop hook should suffice. 

  was:
Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster 
when those Actors start.  The attempt at registration is accomplished by 
starting a retryTimer via the Akka scheduler that will use the registered 
timeout interval and retry number to make repeated attempts to register with 
all known Masters before giving up and either marking as dead or calling 
System.exit.

The receive methods of these actors can, however, throw exceptions, which will 
lead to the actor restarting, registerWithMaster being called again on restart, 
and another retryTimer being scheduled without canceling the already running 
retryTimer.  Assuming that all of the rest of the restart logic is correct for 
these actors (which I don't believe is actually a given), having multiple 
retryTimers running presents at least a condition in which the restarted actor 
will not be able to make the full number of retry attempts before an earlier 
retryTimer takes the give up action.

Canceling the retryTimer in the actor's postStop hook should suffice. 


 retryTimer not canceled on actor restart in Worker and AppClient
 

 Key: SPARK-1685
 URL: https://issues.apache.org/jira/browse/SPARK-1685
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Mark Hamstra
Assignee: Mark Hamstra

 Both deploy.worker.Worker and deploy.client.AppClient try to 
 registerWithMaster when those Actors start.  The attempt at registration is 
 accomplished by starting a retryTimer via the Akka scheduler that will use 
 the registered timeout interval and retry number to make repeated attempts to 
 register with all known Masters before giving up and either marking as dead 
 or calling System.exit.
 The receive methods of these actors can, however, throw exceptions, which 
 will lead to the actor restarting, registerWithMaster being called again on 
 restart, and another retryTimer being scheduled without canceling the already 
 running retryTimer.  Assuming that all of the rest of the restart logic is 
 correct for these actors (which I don't believe is actually a given), having 
 multiple retryTimers running presents at least a condition in which the 
 restarted actor may not be able to make the full number of retry attempts 
 before an earlier retryTimer takes the give up action.
 Canceling the retryTimer in the actor's postStop hook should suffice. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Reopened] (SPARK-1620) Uncaught exception from Akka scheduler

2014-04-30 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra reopened SPARK-1620:
-


On further investigation, it looks like there really is a problem with 
exceptions thrown by scheduled functions not being caught by any uncaught 
exception handler.
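
A sketch of one mitigation (not the actual fix; the helper name is hypothetical): wrap 
whatever runs on the Akka scheduler so an exception is at least logged, and can be escalated 
to an uncaught-exception handler, rather than vanishing on the scheduler's own thread:

{code}
import akka.actor.ActorSystem
import scala.concurrent.duration._

object SafeSchedule {
  def scheduleLogged(system: ActorSystem, interval: FiniteDuration)(body: => Unit): Unit = {
    import system.dispatcher
    system.scheduler.schedule(interval, interval) {
      try body
      catch {
        case t: Throwable =>
          // Without this, the exception never reaches the thread that called schedule().
          System.err.println(s"Uncaught exception in scheduled task: $t")
      }
    }
  }
}
{code}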

 Uncaught exception from Akka scheduler
 --

 Key: SPARK-1620
 URL: https://issues.apache.org/jira/browse/SPARK-1620
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0, 1.0.0
Reporter: Mark Hamstra
Priority: Blocker

 I've been looking at this one in the context of a BlockManagerMaster that 
 OOMs and doesn't respond to heartBeat(), but I suspect that there may be 
 problems elsewhere where we use Akka's scheduler.
 The basic nature of the problem is that we are expecting exceptions thrown 
 from a scheduled function to be caught in the thread where 
 _ActorSystem_.scheduler.schedule() or scheduleOnce() has been called.  In 
 fact, the scheduled function runs on its own thread, so any exceptions that 
 it throws are not caught in the thread that called schedule() -- e.g., 
 unanswered BlockManager heartBeats (scheduled in BlockManager#initialize) 
 that end up throwing exceptions in BlockManagerMaster#askDriverWithReply do 
 not cause those exceptions to be handled by the Executor thread's 
 UncaughtExceptionHandler. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1686) Master switches thread when ElectedLeader

2014-04-30 Thread Mark Hamstra (JIRA)
Mark Hamstra created SPARK-1686:
---

 Summary: Master switches thread when ElectedLeader
 Key: SPARK-1686
 URL: https://issues.apache.org/jira/browse/SPARK-1686
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0, 1.0.0
Reporter: Mark Hamstra


In deploy.master.Master, the completeRecovery method is the last thing to be 
called when a standalone Master is recovering from failure.  It is responsible 
for resetting some state, relaunching drivers, and eventually resuming its 
scheduling duties.

There are currently four places in Master.scala where completeRecovery is 
called.  Three of them are from within the actor's receive method, and aren't 
problems.  The last starts from within receive when the ElectedLeader message 
is received, but the actual completeRecovery() call is made from the Akka 
scheduler.  That means that it will execute on a different scheduler thread, 
and Master itself will end up running (i.e., schedule() ) from that Akka 
scheduler thread.  Among other things, that means that uncaught exception 
handling will be different -- https://issues.apache.org/jira/browse/SPARK-1620 
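
A sketch of the usual Akka remedy (names are hypothetical, not Master.scala itself): have 
the scheduler callback send a message back to the actor, so completeRecovery() runs inside 
receive on the actor's own thread and keeps the normal uncaught-exception handling:

{code}
import akka.actor.Actor
import scala.concurrent.duration._

case object ElectedLeader
case object CompleteRecovery

class MasterSketch extends Actor {
  import context.dispatcher

  def receive: Receive = {
    case ElectedLeader =>
      // Delay recovery, but hop back onto the actor's thread via a message
      // instead of calling completeRecovery() from the scheduler's thread.
      context.system.scheduler.scheduleOnce(60.seconds, self, CompleteRecovery)
    case CompleteRecovery =>
      completeRecovery()
  }

  private def completeRecovery(): Unit = {
    // reset state, relaunch drivers, resume schedule()
  }
}
{code}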



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1687) Support NamedTuples in RDDs

2014-04-30 Thread Pat McDonough (JIRA)
Pat McDonough created SPARK-1687:


 Summary: Support NamedTuples in RDDs
 Key: SPARK-1687
 URL: https://issues.apache.org/jira/browse/SPARK-1687
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.0.0
 Environment: Spark version 1.0.0-SNAPSHOT
Python 2.7.5
Reporter: Pat McDonough


Add Support for NamedTuples in RDDs. Some sample code is below, followed by the 
current error that comes from it.

Based on a quick conversation with [~ahirreddy], 
[Dill|https://github.com/uqfoundation/dill] might be a good solution here.

{code}
In [26]: from collections import namedtuple
...
In [33]: Person = namedtuple('Person', 'id firstName lastName')

In [34]: jon = Person(1, "Jon", "Doe")

In [35]: jane = Person(2, "Jane", "Doe")

In [36]: theDoes = sc.parallelize((jon, jane))

In [37]: theDoes.collect()
Out[37]: 
[Person(id=1, firstName='Jon', lastName='Doe'),
 Person(id=2, firstName='Jane', lastName='Doe')]

In [38]: theDoes.count()
PySpark worker failed with exception:
PySpark worker failed with exception:
Traceback (most recent call last):
  File "/Users/pat/Projects/spark/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 283, in func
    def func(s, iterator): return f(iterator)
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in <lambda>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in <genexpr>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/pat/Projects/spark/python/pyspark/serializers.py", line 129, in load_stream
    yield self._read_with_length(stream)
  File "/Users/pat/Projects/spark/python/pyspark/serializers.py", line 146, in _read_with_length
    return self.loads(obj)
AttributeError: 'module' object has no attribute 'Person'

Traceback (most recent call last):
  File "/Users/pat/Projects/spark/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 283, in func
    def func(s, iterator): return f(iterator)
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in <lambda>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in <genexpr>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/pat/Projects/spark/python/pyspark/serializers.py", line 129, in load_stream
    yield self._read_with_length(stream)
  File "/Users/pat/Projects/spark/python/pyspark/serializers.py", line 146, in _read_with_length
    return self.loads(obj)
AttributeError: 'module' object has no attribute 'Person'

14/04/30 14:43:53 ERROR Executor: Exception in task ID 23
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/pat/Projects/spark/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 283, in func
    def func(s, iterator): return f(iterator)
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in <lambda>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in <genexpr>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/pat/Projects/spark/python/pyspark/serializers.py", line 129, in load_stream
    yield self._read_with_length(stream)
  File "/Users/pat/Projects/spark/python/pyspark/serializers.py", line 146, in _read_with_length
    return self.loads(obj)
AttributeError: 'module' object has no attribute 'Person'

        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:190)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:214)
        at 

[jira] [Commented] (SPARK-1361) DAGScheduler throws NullPointerException occasionally

2014-04-30 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986174#comment-13986174
 ] 

Kousuke Saruta commented on SPARK-1361:
---

Hi [~scwf], are there any updates on this issue?
Do you mind if I take over addressing it?

 DAGScheduler throws NullPointerException occasionally
 -

 Key: SPARK-1361
 URL: https://issues.apache.org/jira/browse/SPARK-1361
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: wangfei
 Fix For: 1.0.0


 DAGScheduler throws this NullPointerException below occasionally when running 
 spark apps.
 java.lang.NullPointerException
 at 
 org.apache.spark.scheduler.DAGScheduler.executorAdded(DAGScheduler.scala:186)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.executorAdded(TaskSchedulerImpl.scala:409)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$1.apply(TaskSchedulerImpl.scala:210)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$1.apply(TaskSchedulerImpl.scala:206)
 at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:206)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:130)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrEls



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1620) Uncaught exception from Akka scheduler

2014-04-30 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986180#comment-13986180
 ] 

Mark Hamstra commented on SPARK-1620:
-

And one last instance: In scheduler.TaskSchedulerImpl#start(), 
checkSpeculatableTasks() is scheduled to be called every SPECULATION_INTERVAL.  
If checkSpeculatableTasks() throws an exception, that exception will not be 
caught and no more invocations of checkSpeculatableTasks() will occur. 

 Uncaught exception from Akka scheduler
 --

 Key: SPARK-1620
 URL: https://issues.apache.org/jira/browse/SPARK-1620
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0, 1.0.0
Reporter: Mark Hamstra
Priority: Blocker

 I've been looking at this one in the context of a BlockManagerMaster that 
 OOMs and doesn't respond to heartBeat(), but I suspect that there may be 
 problems elsewhere where we use Akka's scheduler.
 The basic nature of the problem is that we are expecting exceptions thrown 
 from a scheduled function to be caught in the thread where 
 _ActorSystem_.scheduler.schedule() or scheduleOnce() has been called.  In 
 fact, the scheduled function runs on its own thread, so any exceptions that 
 it throws are not caught in the thread that called schedule() -- e.g., 
 unanswered BlockManager heartBeats (scheduled in BlockManager#initialize) 
 that end up throwing exceptions in BlockManagerMaster#askDriverWithReply do 
 not cause those exceptions to be handled by the Executor thread's 
 UncaughtExceptionHandler. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-04-30 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986220#comment-13986220
 ] 

Tathagata Das commented on SPARK-1645:
--

This makes sense from the integration point of view. Though I wonder, from the 
POV of Flume's deployment configuration, does it make things more complex? For 
example, if someone already has a Flume system set up, in the current situation 
the configuration change to add a new sink seems standard and easy. However, in 
the proposed model, since Flume's data-pushing node has to run a sink, how much 
more complicated does this configuration process get?

 Improve Spark Streaming compatibility with Flume
 

 Key: SPARK-1645
 URL: https://issues.apache.org/jira/browse/SPARK-1645
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Hari Shreedharan

 Currently the following issues affect Spark Streaming and Flume compatibility:
 * If a Spark worker goes down, it needs to be restarted on the same node, 
 else Flume cannot send data to it. We can fix this by adding a Flume receiver 
 that polls Flume, and a Flume sink that supports this.
 * Receiver sends acks to Flume before the driver knows about the data. The 
 new receiver should also handle this case.
 * Data loss when driver goes down - This is true for any streaming ingest, 
 not just Flume. I will file a separate jira for this and we should work on it 
 there. This is a longer term project and requires considerable development 
 work.
 I intend to start working on these soon. Any input is appreciated. (It'd be 
 great if someone can add me as a contributor on jira, so I can assign the 
 jira to myself).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1688) PySpark throws unhelpful exception when pyspark cannot be loaded

2014-04-30 Thread Andrew Or (JIRA)
Andrew Or created SPARK-1688:


 Summary: PySpark throws unhelpful exception when pyspark cannot be 
loaded
 Key: SPARK-1688
 URL: https://issues.apache.org/jira/browse/SPARK-1688
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core
Affects Versions: 0.9.1
Reporter: Andrew Or
Assignee: Andrew Or
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1689) AppClient does not respond correctly to RemoveApplication

2014-04-30 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-1689:
-

 Summary: AppClient does not respond correctly to RemoveApplication
 Key: SPARK-1689
 URL: https://issues.apache.org/jira/browse/SPARK-1689
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0, 1.0.0
Reporter: Aaron Davidson
Assignee: Aaron Davidson


When the Master removes an application (usually due to too many executor 
failures), it means no future executors will be assigned to that app. 
Currently, the AppClient just marks the application as disconnected, which is 
intended as a transient state during a period of reconnection. Thus, 
RemoveApplication just causes the application to enter a state where it has no 
executors and it doesn't die.
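
A hedged sketch of the missing behaviour (message and listener names are illustrative, not 
the actual AppClient protocol): on an application-removed notification, treat the state as 
terminal, mark the app dead, and stop, rather than only flagging it as disconnected:

{code}
import akka.actor.Actor

case class ApplicationRemoved(reason: String)

trait ClientListener {
  def disconnected(): Unit
  def dead(reason: String): Unit
}

class AppClientSketch(listener: ClientListener) extends Actor {
  def receive: Receive = {
    case ApplicationRemoved(reason) =>
      // Terminal: no more executors will ever be assigned, so this is not a
      // transient disconnect.
      listener.dead(reason)
      context.stop(self)
  }
}
{code}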



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1667) re-fetch fails occasionally

2014-04-30 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-1667:
--

Summary: re-fetch fails occasionally  (was: Should re-fetch when 
intermediate data for shuffle is lost)

 re-fetch fails occasionally
 ---

 Key: SPARK-1667
 URL: https://issues.apache.org/jira/browse/SPARK-1667
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.0.0
Reporter: Kousuke Saruta

 I hit a case where re-fetch wouldn't occur although it should.
 When intermediate data (the physical file of intermediate data on the local 
 file system) which is used for shuffle is lost from an Executor, 
 FileNotFoundException was thrown and re-fetch wouldn't occur.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1690) RDD.saveAsTextFile throws scala.MatchError if RDD contains empty elements

2014-04-30 Thread Glenn K. Lockwood (JIRA)
Glenn K. Lockwood created SPARK-1690:


 Summary: RDD.saveAsTextFile throws scala.MatchError if RDD 
contains empty elements
 Key: SPARK-1690
 URL: https://issues.apache.org/jira/browse/SPARK-1690
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.9.0
 Environment: Linux/CentOS6, Spark 0.9.1, standalone mode against HDFS 
from Hadoop 1.2.1
Reporter: Glenn K. Lockwood
Priority: Minor


The following pyspark code fails with a scala.MatchError exception if 
sample.txt contains any empty lines:

file = sc.textFile('hdfs://gcn-3-45.ibnet0:54310/user/glock/sample.txt')
file.saveAsTextFile('hdfs://gcn-3-45.ibnet0:54310/user/glock/sample.out')

The resulting stack trace:

14/04/30 17:02:46 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
14/04/30 17:02:46 WARN scheduler.TaskSetManager: Loss was due to 
scala.MatchError
scala.MatchError: 0 (of class java.lang.Integer)
at 
org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:129)
at 
org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:119)
at 
org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:112)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeToFile$1(PairRDDFunctions.scala:732)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:741)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:741)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at 
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)

This can be reproduced with a sample.txt containing


foo

bar


and disappears if sample.txt is


foo
bar




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-04-30 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986275#comment-13986275
 ] 

Hari Shreedharan commented on SPARK-1645:
-

Yes, it is better to add new methods rather than reusing the old ones and 
confusing existing users.

In fact, I think we should add a new receiver for the time being and only 
deprecate the old one initially. We can remove the old one in a later release.

 Improve Spark Streaming compatibility with Flume
 

 Key: SPARK-1645
 URL: https://issues.apache.org/jira/browse/SPARK-1645
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Hari Shreedharan

 Currently the following issues affect Spark Streaming and Flume compatibility:
 * If a Spark worker goes down, it needs to be restarted on the same node, 
 else Flume cannot send data to it. We can fix this by adding a Flume receiver 
 that polls Flume, and a Flume sink that supports this.
 * Receiver sends acks to Flume before the driver knows about the data. The 
 new receiver should also handle this case.
 * Data loss when driver goes down - This is true for any streaming ingest, 
 not just Flume. I will file a separate jira for this and we should work on it 
 there. This is a longer term project and requires considerable development 
 work.
 I intend to start working on these soon. Any input is appreciated. (It'd be 
 great if someone can add me as a contributor on jira, so I can assign the 
 jira to myself).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume

2014-04-30 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986312#comment-13986312
 ] 

Tathagata Das commented on SPARK-1645:
--

I agree. New receiver for the new API is the best way. 

 Improve Spark Streaming compatibility with Flume
 

 Key: SPARK-1645
 URL: https://issues.apache.org/jira/browse/SPARK-1645
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Hari Shreedharan

 Currently the following issues affect Spark Streaming and Flume compatibility:
 * If a Spark worker goes down, it needs to be restarted on the same node, 
 else Flume cannot send data to it. We can fix this by adding a Flume receiver 
 that polls Flume, and a Flume sink that supports this.
 * Receiver sends acks to Flume before the driver knows about the data. The 
 new receiver should also handle this case.
 * Data loss when driver goes down - This is true for any streaming ingest, 
 not just Flume. I will file a separate jira for this and we should work on it 
 there. This is a longer term project and requires considerable development 
 work.
 I intend to start working on these soon. Any input is appreciated. (It'd be 
 great if someone can add me as a contributor on jira, so I can assign the 
 jira to myself).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1678) Compression loses repeated values.

2014-04-30 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986348#comment-13986348
 ] 

Cheng Lian commented on SPARK-1678:
---

Pull request: https://github.com/apache/spark/pull/608

 Compression loses repeated values.
 --

 Key: SPARK-1678
 URL: https://issues.apache.org/jira/browse/SPARK-1678
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.0.0


 Here's a test case:
 {code}
   test("all the same strings") {
     sparkContext.parallelize(1 to 1000).map(_ =>
       StringData("test")).registerAsTable("test1000")
     assert(sql("SELECT * FROM test1000").count() === 1000)
     cacheTable("test1000")
     assert(sql("SELECT * FROM test1000").count() === 1000)
   }
 {code}
 First assert passes, second one fails.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1679) In-Memory compression needs to be configurable.

2014-04-30 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986347#comment-13986347
 ] 

Cheng Lian commented on SPARK-1679:
---

Pull request: https://github.com/apache/spark/pull/608

 In-Memory compression needs to be configurable.
 ---

 Key: SPARK-1679
 URL: https://issues.apache.org/jira/browse/SPARK-1679
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.0.0


 Since we are still finding bugs in the compression code I think we should 
 make it configurable in SparkConf and turn it off by default for the 1.0 
 release.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1691) Support quoted arguments inside of spark-submit

2014-04-30 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1691:
---

Description: 
Currently due to the way we send arguments on to spark-class, it doesn't work 
with quoted strings. For instance:

{code}
./bin/spark-submit --name "My app" --spark-driver-extraJavaOptions "-Dfoo=x 
-Dbar=y"
{code}

  was:Currently due to the way we send arguments on to spark-class, it doesn't 
work with quoted strings.


 Support quoted arguments inside of spark-submit
 ---

 Key: SPARK-1691
 URL: https://issues.apache.org/jira/browse/SPARK-1691
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Patrick Wendell
 Fix For: 1.0.0


 Currently due to the way we send arguments on to spark-class, it doesn't work 
 with quoted strings. For instance:
 {code}
 ./bin/spark-submit --name "My app" --spark-driver-extraJavaOptions "-Dfoo=x 
 -Dbar=y"
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1691) Support quoted arguments inside of spark-submit

2014-04-30 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-1691:
--

 Summary: Support quoted arguments inside of spark-submit
 Key: SPARK-1691
 URL: https://issues.apache.org/jira/browse/SPARK-1691
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Patrick Wendell
 Fix For: 1.0.0


Currently due to the way we send arguments on to spark-class, it doesn't work 
with quoted strings.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-975) Spark Replay Debugger

2014-04-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-975:
-

Attachment: RDD DAG.png

Drawback of GraphViz

 Spark Replay Debugger
 -

 Key: SPARK-975
 URL: https://issues.apache.org/jira/browse/SPARK-975
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Cheng Lian
  Labels: arthur, debugger
 Attachments: RDD DAG.png


 The Spark debugger was first mentioned as {{rddbg}} in the [RDD technical 
 report|http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf].
 [Arthur|https://github.com/mesos/spark/tree/arthur], authored by [Ankur 
 Dave|https://github.com/ankurdave], is an old implementation of the Spark 
 debugger, which demonstrated both the elegance and power behind the RDD 
 abstraction.  Unfortunately, the corresponding GitHub branch was not merged 
 into the master branch, and its development stopped 2 years ago.  For more information 
 about Arthur, please refer to [the Spark Debugger Wiki 
 page|https://github.com/mesos/spark/wiki/Spark-Debugger] in the old GitHub 
 repository.
 As a useful tool for Spark application debugging and analysis, it would be 
 nice to have a complete Spark debugger.  In 
 [PR-224|https://github.com/apache/incubator-spark/pull/224], I propose a new 
 implementation of the Spark debugger, the Spark Replay Debugger (SRD).
 [PR-224|https://github.com/apache/incubator-spark/pull/224] is only a preview 
 for discussion.  In the current version, I only implemented features that can 
 illustrate the basic mechanisms.  There are still features that appeared in Arthur 
 but are missing in SRD, such as checksum-based nondeterminism detection and 
 single-task debugging with a conventional debugger (like {{jdb}}).  However, these 
 features can be easily built upon current SRD framework.  To minimize code 
 review effort, I didn't include them into the current version intentionally.
 Attached is the visualization of the MLlib ALS application (with 1 iteration) 
 generated by SRD.  For more information, please refer to [the SRD overview 
 document|http://spark-replay-debugger-overview.readthedocs.org/en/latest/].



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1692) Enable external sorting in Spark SQL aggregates

2014-04-30 Thread Sudip Roy (JIRA)
Sudip Roy created SPARK-1692:


 Summary: Enable external sorting in Spark SQL aggregates
 Key: SPARK-1692
 URL: https://issues.apache.org/jira/browse/SPARK-1692
 Project: Spark
  Issue Type: Improvement
Reporter: Sudip Roy
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.2#6252)