[jira] [Updated] (SPARK-11638) Run Spark on Mesos with bridge networking

2016-02-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-11638:
--
Summary: Run Spark on Mesos with bridge networking  (was: Apache Spark in 
Docker with Bridge networking / run Spark on Mesos, in Docker with Bridge 
networking)

> Run Spark on Mesos with bridge networking
> -
>
> Key: SPARK-11638
> URL: https://issues.apache.org/jira/browse/SPARK-11638
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: Radoslaw Gruchalski
> Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, 
> 1.5.2.patch, 1.6.0.patch, 2.3.11.patch, 2.3.4.patch
>
>
> h4. Summary
> Provides {{spark.driver.advertisedPort}}, 
> {{spark.fileserver.advertisedPort}}, {{spark.broadcast.advertisedPort}} and 
> {{spark.replClassServer.advertisedPort}} settings to enable running Spark on 
> Mesos, in Docker with bridge networking. Also provides patches for Akka Remote 
> so that the Spark driver can advertise itself using an alternative host and port.
> With these settings, it is possible to run the Spark Master in a Docker container 
> and have the executors running on Mesos talk back correctly to that Master.
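> A minimal sketch of how the proposed settings might be supplied (the 
> {{advertisedPort}} properties are the ones proposed in this ticket, not existing 
> Spark settings; all concrete port values are hypothetical, with 31xxx standing in 
> for host ports assigned by Mesos):
> {code}
> import org.apache.spark.SparkConf
>
> // Bind ports inside the container vs. ports advertised on the Agent host.
> val conf = new SparkConf()
>   .set("spark.driver.port", "7077")
>   .set("spark.driver.advertisedPort", "31000")
>   .set("spark.fileserver.port", "6677")
>   .set("spark.fileserver.advertisedPort", "31001")
>   .set("spark.broadcast.port", "6688")
>   .set("spark.broadcast.advertisedPort", "31002")
>   .set("spark.replClassServer.port", "23456")
>   .set("spark.replClassServer.advertisedPort", "31003")
> {code}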
> The problem is discussed on the Mesos mailing list here: 
> https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E
> h4. Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door
> In order for the framework to receive offers in the bridged container, Mesos 
> in the container has to register for offers using the IP address of the 
> Agent. Offers are sent by the Mesos Master to the Docker container running on a 
> different host, an Agent. Prior to Mesos 0.24.0, {{libprocess}} would advertise 
> itself using the IP address of the container, something like {{172.x.x.x}}. 
> Obviously, the Mesos Master can't reach that address; it belongs to a different 
> host, a different machine. Mesos 0.24.0 introduced two new properties for 
> {{libprocess}} - {{LIBPROCESS_ADVERTISE_IP}} and {{LIBPROCESS_ADVERTISE_PORT}}. 
> These allow the container to register for offers using the Agent's address. This 
> was provided mainly for running Mesos in Docker on Mesos.
> h4. Spark - how does the above relate and what is being addressed here?
> Similar to Mesos, out of the box Spark does not allow advertising its 
> services on ports different from the ones it binds to. Consider the following 
> scenario:
> Spark is running inside a Docker container on Mesos, in bridge networking 
> mode. Assume port {{}} for {{spark.driver.port}}, {{6677}} for 
> {{spark.fileserver.port}}, {{6688}} for {{spark.broadcast.port}} and 
> {{23456}} for {{spark.replClassServer.port}}. If such a task is posted to 
> Marathon, Mesos will assign 4 ports in the {{31000-32000}} range mapped to the 
> container ports. Starting the executors from such a container results in the 
> executors not being able to communicate back to the Spark Master.
> This happens because of 2 things:
> The Spark driver is effectively an {{akka-remote}} system with {{akka.tcp}} 
> transport. {{akka-remote}} prior to version {{2.4}} can't advertise a port 
> different from the one it is bound to. The settings discussed are here: 
> https://github.com/akka/akka/blob/f8c1671903923837f22d0726a955e0893add5e9f/akka-remote/src/main/resources/reference.conf#L345-L376.
>  These do not exist in Akka {{2.3.x}}. The Spark driver will always advertise 
> port {{}} as this is the one {{akka-remote}} is bound to.
> Any URIs the executors use to contact the Spark Master are prepared by the Spark 
> Master and handed over to the executors. These always contain the port number the 
> Master itself uses for the service. The services are:
> - {{spark.broadcast.port}}
> - {{spark.fileserver.port}}
> - {{spark.replClassServer.port}}
> All of the above ports default to {{0}} (random assignment) but can be specified 
> using Spark configuration ( {{-Dspark...port}} ). However, they are limited in 
> the same way as {{spark.driver.port}}; in the above example, an executor should 
> not contact the file server on port {{6677}} but rather on the respective 31xxx 
> port assigned by Mesos.
> Spark currently does not allow any of that.
> h4. Taking on the problem, step 1: Spark Driver
> As mentioned above, the Spark Driver is based on {{akka-remote}}. In order to 
> take on the problem, the {{akka.remote.netty.tcp.bind-hostname}} and 
> {{akka.remote.netty.tcp.bind-port}} settings are a must. Spark does not compile 
> with Akka 2.4.x yet.
> What we want is a backport of the mentioned {{akka-remote}} settings to the 
> {{2.3.x}} versions. These patches are attached to this ticket - the 
> {{2.3.4.patch}} and {{2.3.11.patch}} files provide patches for 

[jira] [Resolved] (SPARK-12790) Remove HistoryServer old multiple files format

2016-02-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-12790.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Remove HistoryServer old multiple files format
> --
>
> Key: SPARK-12790
> URL: https://issues.apache.org/jira/browse/SPARK-12790
> Project: Spark
>  Issue Type: Sub-task
>  Components: Deploy
>Reporter: Andrew Or
>Assignee: Felix Cheung
> Fix For: 2.0.0
>
>
> HistoryServer has 2 formats. The old one makes a directory and puts multiple 
> files in there (APPLICATION_COMPLETE, EVENT_LOG1 etc.). The new one has just 
> 1 file called local_2593759238651.log or something.
> It's been a nightmare to maintain both code paths. We should just remove the 
> old legacy format (which has been out of use for many versions now) when we 
> still have the chance.






[jira] [Updated] (SPARK-12637) Print stage info of finished stages properly

2016-02-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12637:
--
Assignee: Navis

> Print stage info of finished stages properly
> 
>
> Key: SPARK-12637
> URL: https://issues.apache.org/jira/browse/SPARK-12637
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Navis
>Assignee: Navis
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Currently it prints hashcode of stage info, which seemed not that useful.
> {noformat}
> INFO scheduler.StatsReportListener: Finished stage: 
> org.apache.spark.scheduler.StageInfo@2eb47d79
> {noformat}
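> A possible shape of the fix (sketch only; {{stageId}}, {{name}} and {{numTasks}} 
> are existing {{StageInfo}} fields, but the exact wording of the eventual log line 
> is not claimed here):
> {code}
> import org.apache.spark.scheduler.StageInfo
>
> // Log something human-readable instead of the object's hashcode.
> def describe(info: StageInfo): String =
>   s"Finished stage: ${info.stageId} (${info.name}), tasks: ${info.numTasks}"
> {code}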






[jira] [Resolved] (SPARK-12637) Print stage info of finished stages properly

2016-02-01 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-12637.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Print stage info of finished stages properly
> 
>
> Key: SPARK-12637
> URL: https://issues.apache.org/jira/browse/SPARK-12637
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Navis
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Currently it prints hashcode of stage info, which seemed not that useful.
> {noformat}
> INFO scheduler.StatsReportListener: Finished stage: 
> org.apache.spark.scheduler.StageInfo@2eb47d79
> {noformat}






[jira] [Resolved] (SPARK-13088) DAG viz does not work with latest version of chrome

2016-01-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-13088.
---
   Resolution: Fixed
Fix Version/s: 2.0.0
   1.6.1
   1.5.3
   1.4.2

> DAG viz does not work with latest version of chrome
> ---
>
> Key: SPARK-13088
> URL: https://issues.apache.org/jira/browse/SPARK-13088
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.4.2, 1.5.3, 1.6.1, 2.0.0
>
> Attachments: Screen Shot 2016-01-29 at 10.54.14 AM.png
>
>
> See screenshot. This is because dagre-d3.js is using a function that chrome 
> no longer supports:
> {code}
> Uncaught TypeError: elem.getTransformToElement is not a function
> {code}
> We need to upgrade it.






[jira] [Resolved] (SPARK-13096) Make AccumulatorSuite#verifyPeakExecutionMemorySet less flaky

2016-01-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-13096.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Make AccumulatorSuite#verifyPeakExecutionMemorySet less flaky
> -
>
> Key: SPARK-13096
> URL: https://issues.apache.org/jira/browse/SPARK-13096
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> We have a method in AccumulatorSuite called verifyPeakExecutionMemorySet. 
> This is used in a variety of other test suites, including (but not limited 
> to):
> - ExternalAppendOnlyMapSuite
> - ExternalSorterSuite
> - sql.SQLQuerySuite
> Lately it's been flaky ever since https://github.com/apache/spark/pull/10835 
> was merged. Note: this was an existing problem even before that patch, but it 
> was uncovered there because previously we never failed the test even if an 
> assertion failed!






[jira] [Updated] (SPARK-13096) Make AccumulatorSuite#verifyPeakExecutionMemorySet less flaky

2016-01-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-13096:
--
Affects Version/s: 1.6.0
 Target Version/s: 2.0.0

> Make AccumulatorSuite#verifyPeakExecutionMemorySet less flaky
> -
>
> Key: SPARK-13096
> URL: https://issues.apache.org/jira/browse/SPARK-13096
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> We have a method in AccumulatorSuite called verifyPeakExecutionMemorySet. 
> This is used in a variety of other test suites, including (but not limited 
> to):
> - ExternalAppendOnlyMapSuite
> - ExternalSorterSuite
> - sql.SQLQuerySuite
> Lately it's been flaky ever since https://github.com/apache/spark/pull/10835 
> was merged. Note: this was an existing problem even before that patch, but it 
> was uncovered there because previously we never failed the test even if an 
> assertion failed!






[jira] [Created] (SPARK-13096) Make AccumulatorSuite#verifyPeakExecutionMemorySet less flaky

2016-01-29 Thread Andrew Or (JIRA)
Andrew Or created SPARK-13096:
-

 Summary: Make AccumulatorSuite#verifyPeakExecutionMemorySet less 
flaky
 Key: SPARK-13096
 URL: https://issues.apache.org/jira/browse/SPARK-13096
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Andrew Or
Assignee: Andrew Or


We have a method in AccumulatorSuite called verifyPeakExecutionMemorySet. This 
is used in a variety of other test suites, including (but not limited to):

- ExternalAppendOnlyMapSuite
- ExternalSorterSuite
- sql.SQLQuerySuite

Lately it's been flaky ever since https://github.com/apache/spark/pull/10835 
was merged. Note: this was an existing problem even before that patch, but it 
was uncovered there because previously we never failed the test even if an 
assertion failed!






[jira] [Resolved] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics

2016-01-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10620.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Look into whether accumulator mechanism can replace TaskMetrics
> ---
>
> Key: SPARK-10620
> URL: https://issues.apache.org/jira/browse/SPARK-10620
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Andrew Or
> Fix For: 2.0.0
>
> Attachments: accums-and-task-metrics.pdf
>
>
> This task is simply to explore whether the internal representation used by 
> TaskMetrics could be performed by using accumulators rather than having two 
> separate mechanisms. Note that we need to continue to preserve the existing 
> "Task Metric" data structures that are exposed to users through event logs 
> etc. The question is can we use a single internal codepath and perhaps make 
> this easier to extend in the future.
> I think a full exploration would answer the following questions:
> - How do the semantics of accumulators on stage retries differ from aggregate 
> TaskMetrics for a stage? Could we implement clearer retry semantics for 
> internal accumulators to allow them to be the same - for instance, zeroing 
> accumulator values if a stage is retried (see discussion here: SPARK-10042).
> - Are there metrics that do not fit well into the accumulator model, or would 
> be difficult to update as an accumulator.
> - If we expose metrics through accumulators in the future rather than 
> continuing to add fields to TaskMetrics, what is the best way to coerce 
> compatibility?
> - Are there any other considerations?
> - Is it worth it to do this, or is the consolidation too complicated to 
> justify?






[jira] [Created] (SPARK-13088) DAG viz does not work with latest version of chrome

2016-01-29 Thread Andrew Or (JIRA)
Andrew Or created SPARK-13088:
-

 Summary: DAG viz does not work with latest version of chrome
 Key: SPARK-13088
 URL: https://issues.apache.org/jira/browse/SPARK-13088
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.6.0, 1.5.0, 1.4.0, 2.0.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Blocker


See screenshot. This is because dagre-d3.js is using a function that chrome no 
longer supports:
{code}
Uncaught TypeError: elem.getTransformToElement is not a function
{code}
We need to upgrade it.






[jira] [Updated] (SPARK-13088) DAG viz does not work with latest version of chrome

2016-01-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-13088:
--
Attachment: Screen Shot 2016-01-29 at 10.54.14 AM.png

> DAG viz does not work with latest version of chrome
> ---
>
> Key: SPARK-13088
> URL: https://issues.apache.org/jira/browse/SPARK-13088
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
> Attachments: Screen Shot 2016-01-29 at 10.54.14 AM.png
>
>
> See screenshot. This is because dagre-d3.js is using a function that chrome 
> no longer supports:
> {code}
> Uncaught TypeError: elem.getTransformToElement is not a function
> {code}
> We need to upgrade it.






[jira] [Created] (SPARK-13065) streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()

2016-01-28 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-13065:
---

 Summary: streaming-twitter pass twitter4j.FilterQuery argument to 
TwitterUtils.createStream()
 Key: SPARK-13065
 URL: https://issues.apache.org/jira/browse/SPARK-13065
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.6.0
 Environment: all
Reporter: Andrew Davidson
Priority: Minor


The Twitter stream API is very powerful and provides a lot of support for 
twitter.com-side filtering of status objects. Whenever possible we want to let 
Twitter do as much work as possible for us.

Currently the Spark Twitter API only allows you to configure a small subset of the 
possible filters:

String[] filters = {"tag1", "tag2"};
JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, twitterAuth, filters);

The current implementation does:

private[streaming]
class TwitterReceiver(
twitterAuth: Authorization,
filters: Seq[String],
storageLevel: StorageLevel
  ) extends Receiver[Status](storageLevel) with Logging {

. . .


  val query = new FilterQuery
  if (filters.size > 0) {
query.track(filters.mkString(","))
newTwitterStream.filter(query)
  } else {
newTwitterStream.sample()
  }

...

Rather than constructing the FilterQuery object inside TwitterReceiver.onStart(), 
we should be able to pass a FilterQuery object in.
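A sketch of the richer twitter-side filter this would enable (FilterQuery and its 
fluent setters are existing twitter4j API; the ability to hand the query to 
TwitterUtils.createStream is the proposed enhancement, and the ids are made up):

{code}
import twitter4j.FilterQuery

val query = new FilterQuery()
  .track("tag1", "tag2")    // the same track terms the current String[] filters cover
  .language("en")           // extra server-side filtering not reachable today
  .follow(12345L, 67890L)   // hypothetical account ids

// Proposed (not yet existing) overload that would accept the query directly:
// val tweets = TwitterUtils.createStream(ssc, Some(twitterAuth), query)
{code}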

Looks like an easy fix. See the source code links below.

kind regards

Andy

https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L60

https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89






[jira] [Updated] (SPARK-13055) SQLHistoryListener throws ClassCastException

2016-01-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-13055:
--
Component/s: SQL

> SQLHistoryListener throws ClassCastException
> 
>
> Key: SPARK-13055
> URL: https://issues.apache.org/jira/browse/SPARK-13055
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> {code}
> 16/01/27 18:46:28 ERROR ReplayListenerBus: Listener SQLHistoryListener threw 
> an exception
> java.lang.ClassCastException: java.lang.Integer cannot be cast to 
> java.lang.Long
> at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:110)
> at 
> org.apache.spark.sql.execution.ui.SQLHistoryListener$$anonfun$onTaskEnd$1$$anonfun$5.apply(SQLListener.scala:334)
> at 
> org.apache.spark.sql.execution.ui.SQLHistoryListener$$anonfun$onTaskEnd$1$$anonfun$5.apply(SQLListener.scala:334)
> at scala.Option.map(Option.scala:145)
> at 
> org.apache.spark.sql.execution.ui.SQLHistoryListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:334)
> at 
> org.apache.spark.sql.execution.ui.SQLHistoryListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:332)
> {code}
> SQLHistoryListener listens on SparkListenerTaskEnd events, which contain 
> non-SQL accumulators as well. We try to cast all accumulators we encounter to 
> Long, resulting in an error like this one.
> Note: this was a problem even before internal accumulators were introduced. 
> If  the task used a user accumulator of a type other than Long, we would 
> still see this.
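> A defensive conversion along these lines would avoid the blind cast (a sketch 
> only, not the actual fix): treat only genuinely numeric accumulator updates as 
> SQL metric values and skip everything else.
> {code}
> // Hypothetical helper: convert an accumulator update to Long only when it is numeric.
> def toLongMetric(update: Any): Option[Long] = update match {
>   case l: Long => Some(l)
>   case i: Int  => Some(i.toLong)
>   case _       => None  // non-SQL / non-numeric accumulators are skipped
> }
> {code}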






[jira] [Updated] (SPARK-13071) Coalescing HadoopRDD overwrites existing input metrics

2016-01-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-13071:
--
Description: 
Currently if we do the following in Hadoop 2.5+:
{code}
sc.textFile(..., 4).coalesce(2).collect()
{code}

We'll get incorrect `InputMetrics#bytesRead`. This is because HadoopRDD's 
compute method overwrites any existing value of `bytesRead`, and in the case of 
coalesce we'll end up calling `compute` multiple times in the same task.

This is why InputOutputMetricsSuite is failing in master, but only for 
hadoop2.6/2.7: 
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/67/

This was caused by https://github.com/apache/spark/pull/10835.

  was:
Currently if we do the following in Hadoop 2.5+:
{code}
sc.textFile(..., 4).coalesce(2).collect()
{code}

We'll get incorrect `InputMetrics#bytesRead`. This is because HadoopRDD's 
compute method overwrites any existing value of `bytesRead`, and in the case of 
coalesce we'll end up calling `compute` multiple times in the same task.

This is why InputOutputMetricsSuite is failing in master, but only for 
hadoop2.6/2.7: 
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/67/


> Coalescing HadoopRDD overwrites existing input metrics
> --
>
> Key: SPARK-13071
> URL: https://issues.apache.org/jira/browse/SPARK-13071
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> Currently if we do the following in Hadoop 2.5+:
> {code}
> sc.textFile(..., 4).coalesce(2).collect()
> {code}
> We'll get incorrect `InputMetrics#bytesRead`. This is because HadoopRDD's 
> compute method overwrites any existing value of `bytesRead`, and in the case 
> of coalesce we'll end up calling `compute` multiple times in the same task.
> This is why InputOutputMetricsSuite is failing in master, but only for 
> hadoop2.6/2.7: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/67/
> This was caused by https://github.com/apache/spark/pull/10835.
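> A small listener makes the undercount visible (a sketch assuming a spark-shell 
> session with {{sc}} and an existing input file; {{addSparkListener}}, 
> {{SparkListenerTaskEnd}} and {{inputMetrics.bytesRead}} are existing APIs):
> {code}
> import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
>
> var totalBytesRead = 0L
> sc.addSparkListener(new SparkListener {
>   override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
>     // taskMetrics can be null for some failed tasks, hence the Option wrapper
>     Option(taskEnd.taskMetrics).foreach { m =>
>       totalBytesRead += m.inputMetrics.bytesRead
>     }
>   }
> })
> sc.textFile("/tmp/input.txt", 4).coalesce(2).collect()  // hypothetical input path
> // With the overwrite described above, totalBytesRead reflects only the last
> // compute() call of each task instead of everything the task actually read.
> {code}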






[jira] [Created] (SPARK-13071) Coalescing HadoopRDD overwrites existing input metrics

2016-01-28 Thread Andrew Or (JIRA)
Andrew Or created SPARK-13071:
-

 Summary: Coalescing HadoopRDD overwrites existing input metrics
 Key: SPARK-13071
 URL: https://issues.apache.org/jira/browse/SPARK-13071
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Andrew Or
Assignee: Andrew Or


Currently if we do the following in Hadoop 2.5+:
{code}
sc.textFile(..., 4).coalesce(2).collect()
{code}

We'll get incorrect `InputMetrics#bytesRead`. This is because HadoopRDD's 
compute method overwrites any existing value of `bytesRead`, and in the case of 
coalesce we'll end up calling `compute` multiple times in the same task.

This is why InputOutputMetricsSuite is failing in master, but only for 
hadoop2.6/2.7: 
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/67/






[jira] [Created] (SPARK-13054) Always post TaskEnd event for tasks in cancelled stages

2016-01-27 Thread Andrew Or (JIRA)
Andrew Or created SPARK-13054:
-

 Summary: Always post TaskEnd event for tasks in cancelled stages
 Key: SPARK-13054
 URL: https://issues.apache.org/jira/browse/SPARK-13054
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.0.0
Reporter: Andrew Or
Assignee: Andrew Or


{code}
// The success case is dealt with separately below.
// TODO: Why post it only for failed tasks in cancelled stages? Clarify semantics here.
if (event.reason != Success) {
  val attemptId = task.stageAttemptId
  listenerBus.post(SparkListenerTaskEnd(
    stageId, attemptId, taskType, event.reason, event.taskInfo, taskMetrics))
}
{code}

Today we only post task end events for canceled stages if the task failed. 
There is no reason why we shouldn't just post it for all the tasks, including 
the ones that succeeded. If we do that we will be able to simplify another 
branch in the DAGScheduler, which needs a lot of simplification.
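A sketch of the proposed behaviour is simply to drop the condition and post the 
event for every completed task (the success path elsewhere would then need to stop 
posting its own event, which is exactly the semantics to clarify):

{code}
// Post a TaskEnd event regardless of whether the task succeeded or failed.
val attemptId = task.stageAttemptId
listenerBus.post(SparkListenerTaskEnd(
  stageId, attemptId, taskType, event.reason, event.taskInfo, taskMetrics))
{code}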






[jira] [Created] (SPARK-13053) Rectify ignored tests in InternalAccumulatorSuite

2016-01-27 Thread Andrew Or (JIRA)
Andrew Or created SPARK-13053:
-

 Summary: Rectify ignored tests in InternalAccumulatorSuite
 Key: SPARK-13053
 URL: https://issues.apache.org/jira/browse/SPARK-13053
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Tests
Reporter: Andrew Or
Assignee: Andrew Or









[jira] [Updated] (SPARK-13053) Rectify ignored tests in InternalAccumulatorSuite

2016-01-27 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-13053:
--
Target Version/s: 2.0.0

> Rectify ignored tests in InternalAccumulatorSuite
> -
>
> Key: SPARK-13053
> URL: https://issues.apache.org/jira/browse/SPARK-13053
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> {code}
> // TODO: these two tests are incorrect; they don't actually trigger stage 
> retries.
> ignore("internal accumulators in fully resubmitted stages") {
>   testInternalAccumulatorsWithFailedTasks((i: Int) => true) // fail all tasks
> }
> ignore("internal accumulators in partially resubmitted stages") {
>   testInternalAccumulatorsWithFailedTasks((i: Int) => i % 2 == 0) // fail a 
> subset
> }
> {code}






[jira] [Updated] (SPARK-13053) Rectify ignored tests in InternalAccumulatorSuite

2016-01-27 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-13053:
--
Description: 
{code}
// TODO: these two tests are incorrect; they don't actually trigger stage retries.
ignore("internal accumulators in fully resubmitted stages") {
  testInternalAccumulatorsWithFailedTasks((i: Int) => true) // fail all tasks
}

ignore("internal accumulators in partially resubmitted stages") {
  testInternalAccumulatorsWithFailedTasks((i: Int) => i % 2 == 0) // fail a subset
}
{code}

> Rectify ignored tests in InternalAccumulatorSuite
> -
>
> Key: SPARK-13053
> URL: https://issues.apache.org/jira/browse/SPARK-13053
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> {code}
> // TODO: these two tests are incorrect; they don't actually trigger stage 
> retries.
> ignore("internal accumulators in fully resubmitted stages") {
>   testInternalAccumulatorsWithFailedTasks((i: Int) => true) // fail all tasks
> }
> ignore("internal accumulators in partially resubmitted stages") {
>   testInternalAccumulatorsWithFailedTasks((i: Int) => i % 2 == 0) // fail a 
> subset
> }
> {code}






[jira] [Created] (SPARK-13051) Do not maintain global singleton map for accumulators

2016-01-27 Thread Andrew Or (JIRA)
Andrew Or created SPARK-13051:
-

 Summary: Do not maintain global singleton map for accumulators
 Key: SPARK-13051
 URL: https://issues.apache.org/jira/browse/SPARK-13051
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Andrew Or


Right now we register accumulators through `Accumulators.register`, and then 
read these accumulators in DAGScheduler through `Accumulators.get`. This design 
is very awkward and makes it hard to associate a list of accumulators with a 
particular SparkContext. Global singleton is not a good pattern in general.
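One possible direction (an illustrative sketch only, not a committed design) is to 
hang a small registry off each SparkContext instead of the global object:

{code}
import scala.collection.concurrent.TrieMap

// Hypothetical per-context registry; the real accumulator type is elided as AnyRef.
class AccumulatorRegistry {
  private val accums = TrieMap.empty[Long, AnyRef]
  def register(id: Long, acc: AnyRef): Unit = accums.update(id, acc)
  def get(id: Long): Option[AnyRef] = accums.get(id)
  def clear(): Unit = accums.clear()
}
{code}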






[jira] [Updated] (SPARK-13053) Rectify ignored tests in InternalAccumulatorSuite

2016-01-27 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-13053:
--
Affects Version/s: 2.0.0

> Rectify ignored tests in InternalAccumulatorSuite
> -
>
> Key: SPARK-13053
> URL: https://issues.apache.org/jira/browse/SPARK-13053
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> {code}
> // TODO: these two tests are incorrect; they don't actually trigger stage 
> retries.
> ignore("internal accumulators in fully resubmitted stages") {
>   testInternalAccumulatorsWithFailedTasks((i: Int) => true) // fail all tasks
> }
> ignore("internal accumulators in partially resubmitted stages") {
>   testInternalAccumulatorsWithFailedTasks((i: Int) => i % 2 == 0) // fail a 
> subset
> }
> {code}






[jira] [Created] (SPARK-13055) SQLHistoryListener throws ClassCastException

2016-01-27 Thread Andrew Or (JIRA)
Andrew Or created SPARK-13055:
-

 Summary: SQLHistoryListener throws ClassCastException
 Key: SPARK-13055
 URL: https://issues.apache.org/jira/browse/SPARK-13055
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.0
Reporter: Andrew Or
Assignee: Andrew Or


{code}
16/01/27 18:46:28 ERROR ReplayListenerBus: Listener SQLHistoryListener threw an 
exception
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:110)
at 
org.apache.spark.sql.execution.ui.SQLHistoryListener$$anonfun$onTaskEnd$1$$anonfun$5.apply(SQLListener.scala:334)
at 
org.apache.spark.sql.execution.ui.SQLHistoryListener$$anonfun$onTaskEnd$1$$anonfun$5.apply(SQLListener.scala:334)
at scala.Option.map(Option.scala:145)
at 
org.apache.spark.sql.execution.ui.SQLHistoryListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:334)
at 
org.apache.spark.sql.execution.ui.SQLHistoryListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:332)
{code}

SQLHistoryListener listens on SparkListenerTaskEnd events, which contain 
non-SQL accumulators as well. We try to cast all accumulators we encounter to 
Long, resulting in an error like this one.

Note: this was a problem even before internal accumulators were introduced. If  
the task used a user accumulator of a type other than Long, we would still see 
this.






[jira] [Created] (SPARK-13009) spark-streaming-twitter_2.10 does not make it possible to access the raw twitter json

2016-01-26 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-13009:
---

 Summary: spark-streaming-twitter_2.10 does not make it possible to 
access the raw twitter json
 Key: SPARK-13009
 URL: https://issues.apache.org/jira/browse/SPARK-13009
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.6.0
Reporter: Andrew Davidson
Priority: Blocker


The streaming-twitter package makes it easy for Java programmers to work with 
Twitter. The implementation returns the raw Twitter data, in JSON format, as a 
twitter4j StatusJSONImpl object:

JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, twitterAuth);

The Status class is different from the raw JSON, i.e. serializing the Status 
object will not reproduce the original JSON. I have downstream systems that can 
only process raw tweets, not twitter4j Status objects.

Here is my bug/RFE request made to Twitter4J. They asked that I create a Spark 
tracking issue.


On Thursday, January 21, 2016 at 6:27:25 PM UTC, Andy Davidson wrote:
Hi All

Quick problem summary:

My system uses the Status objects to do some analysis, however I need to store 
the raw JSON. There are other systems that process that data that are not 
written in Java.
Currently we are serializing the Status object. That JSON is going to break 
downstream systems.
I am using the Apache Spark Streaming spark-streaming-twitter_2.10  
http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources

Request For Enhancement:
I imagine easy access to the raw JSON is a common requirement. Would it be 
possible to add a member function getRawJson() to StatusJSONImpl? By default 
the returned value would be null unless jsonStoreEnabled=True is set in the 
config.


Alternative implementations:

It should be possible to modify spark-streaming-twitter_2.10 to provide this 
support. The solution is not very clean:

- It would require Apache Spark to define its own Status POJO. The current 
StatusJSONImpl class is marked final.
- The wrapper is not going to work nicely with existing code.
- spark-streaming-twitter_2.10 does not expose all of the Twitter streaming API, 
so many developers are writing their own implementations of 
org.apache.spark.streaming.twitter.TwitterInputDStream. This makes maintenance 
difficult. It's not easy to know when the Spark implementation for Twitter has 
changed.

Code listing for 
spark-1.6.0/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala:

private[streaming]
class TwitterReceiver(
twitterAuth: Authorization,
filters: Seq[String],
storageLevel: StorageLevel
  ) extends Receiver[Status](storageLevel) with Logging {

  @volatile private var twitterStream: TwitterStream = _
  @volatile private var stopped = false

  def onStart() {
try {
  val newTwitterStream = new TwitterStreamFactory().getInstance(twitterAuth)
  newTwitterStream.addListener(new StatusListener {
def onStatus(status: Status): Unit = {
  store(status)
}
Ref: https://forum.processing.org/one/topic/saving-json-data-from-twitter4j.html

What do people think?

Kind regards

Andy

From:  on behalf of Igor Brigadir 

Reply-To: 
Date: Tuesday, January 19, 2016 at 5:55 AM
To: Twitter4J 
Subject: Re: [Twitter4J] trouble writing unit test

Main issue is that the Json object is in the wrong json format.

eg: "createdAt": 1449775664000 should be "created_at": "Thu Dec 10 19:27:44 
+ 2015", ...

It looks like the json you have was serialized from a java Status object, which 
makes the json objects different from what you get from the API. 
TwitterObjectFactory expects json from Twitter (I haven't had any problems using 
TwitterObjectFactory instead of the deprecated DataObjectFactory).

You could "fix" it by matching the keys & values you have with the correct, 
twitter API json - it should look like the example here: 
https://dev.twitter.com/rest/reference/get/statuses/show/%3Aid

But it might be easier to download the tweets again, this time using 
TwitterObjectFactory.getRawJSON(status) to get the original json from the 
Twitter API, and save that for later. (You must have jsonStoreEnabled=True in 
your config, and call getRawJSON in the same thread as .showStatus() or 
lookup() or whatever you're using to load tweets.)
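For reference, the workaround described above looks roughly like this on the 
receiving side (TwitterObjectFactory.getRawJSON is existing twitter4j API; it 
returns null unless jsonStoreEnabled=true is configured and it is called on the 
same thread that received the status):

{code}
import twitter4j.{Status, TwitterObjectFactory}

// Capture the raw JSON at the moment the status arrives, before it crosses threads.
def rawJsonOf(status: Status): Option[String] =
  Option(TwitterObjectFactory.getRawJSON(status))
{code}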










[jira] [Commented] (SPARK-12911) Cacheing a dataframe causes array comparisons to fail (in filter / where) after 1.6

2016-01-20 Thread Andrew Ray (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15109269#comment-15109269
 ] 

Andrew Ray commented on SPARK-12911:


In the current master this happens even without caching. The generated code 
calls UnsafeArrayData.equals for each row with an argument of type 
GenericArrayData (the filter value) which returns false. Presumably the 
generated code needs to convert the filter value to an UnsafeArrayData but I'm 
at a loss for where to do so.

[~sdicocco] This is a bug, not something a developer should have to worry about.

> Cacheing a dataframe causes array comparisons to fail (in filter / where) 
> after 1.6
> ---
>
> Key: SPARK-12911
> URL: https://issues.apache.org/jira/browse/SPARK-12911
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: OSX 10.11.1, Scala 2.11.7, Spark 1.6.0
>Reporter: Jesse English
>
> When doing a *where* operation on a dataframe and testing for equality on an 
> array type, after 1.6 no valid comparisons are made if the dataframe has been 
> cached.  If it has not been cached, the results are as expected.
> This appears to be related to the underlying unsafe array data types.
> {code:title=test.scala|borderStyle=solid}
> test("test array comparison") {
> val vectors: Vector[Row] =  Vector(
>   Row.fromTuple("id_1" -> Array(0L, 2L)),
>   Row.fromTuple("id_2" -> Array(0L, 5L)),
>   Row.fromTuple("id_3" -> Array(0L, 9L)),
>   Row.fromTuple("id_4" -> Array(1L, 0L)),
>   Row.fromTuple("id_5" -> Array(1L, 8L)),
>   Row.fromTuple("id_6" -> Array(2L, 4L)),
>   Row.fromTuple("id_7" -> Array(5L, 6L)),
>   Row.fromTuple("id_8" -> Array(6L, 2L)),
>   Row.fromTuple("id_9" -> Array(7L, 0L))
> )
> val data: RDD[Row] = sc.parallelize(vectors, 3)
> val schema = StructType(
>   StructField("id", StringType, false) ::
> StructField("point", DataTypes.createArrayType(LongType, false), 
> false) ::
> Nil
> )
> val sqlContext = new SQLContext(sc)
> val dataframe = sqlContext.createDataFrame(data, schema)
> val targetPoint:Array[Long] = Array(0L,9L)
> //Cacheing is the trigger to cause the error (no cacheing causes no error)
> dataframe.cache()
> //This is the line where it fails
> //java.util.NoSuchElementException: next on empty iterator
> //However we know that there is a valid match
> val targetRow = dataframe.where(dataframe("point") === 
> array(targetPoint.map(value => lit(value)): _*)).first()
> assert(targetRow != null)
>   }
> {code}






[jira] [Updated] (SPARK-12790) Remove HistoryServer old multiple files format

2016-01-19 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12790:
--
Assignee: Felix Cheung

> Remove HistoryServer old multiple files format
> --
>
> Key: SPARK-12790
> URL: https://issues.apache.org/jira/browse/SPARK-12790
> Project: Spark
>  Issue Type: Sub-task
>  Components: Deploy
>Reporter: Andrew Or
>Assignee: Felix Cheung
>
> HistoryServer has 2 formats. The old one makes a directory and puts multiple 
> files in there (APPLICATION_COMPLETE, EVENT_LOG1 etc.). The new one has just 
> 1 file called local_2593759238651.log or something.
> It's been a nightmare to maintain both code paths. We should just remove the 
> old legacy format (which has been out of use for many versions now) when we 
> still have the chance.






[jira] [Commented] (SPARK-12790) Remove HistoryServer old multiple files format

2016-01-19 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107167#comment-15107167
 ] 

Andrew Or commented on SPARK-12790:
---

This also involves updating all the tests that rely on the old format. If you look 
under core/src/test/resources, there are a bunch of those.

> Remove HistoryServer old multiple files format
> --
>
> Key: SPARK-12790
> URL: https://issues.apache.org/jira/browse/SPARK-12790
> Project: Spark
>  Issue Type: Sub-task
>  Components: Deploy
>Reporter: Andrew Or
>
> HistoryServer has 2 formats. The old one makes a directory and puts multiple 
> files in there (APPLICATION_COMPLETE, EVENT_LOG1 etc.). The new one has just 
> 1 file called local_2593759238651.log or something.
> It's been a nightmare to maintain both code paths. We should just remove the 
> old legacy format (which has been out of use for many versions now) when we 
> still have the chance.






[jira] [Commented] (SPARK-12790) Remove HistoryServer old multiple files format

2016-01-19 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107169#comment-15107169
 ] 

Andrew Or commented on SPARK-12790:
---

I've assigned this to you.

> Remove HistoryServer old multiple files format
> --
>
> Key: SPARK-12790
> URL: https://issues.apache.org/jira/browse/SPARK-12790
> Project: Spark
>  Issue Type: Sub-task
>  Components: Deploy
>Reporter: Andrew Or
>Assignee: Felix Cheung
>
> HistoryServer has 2 formats. The old one makes a directory and puts multiple 
> files in there (APPLICATION_COMPLETE, EVENT_LOG1 etc.). The new one has just 
> 1 file called local_2593759238651.log or something.
> It's been a nightmare to maintain both code paths. We should just remove the 
> old legacy format (which has been out of use for many versions now) when we 
> still have the chance.






[jira] [Resolved] (SPARK-12887) Do not expose var's in TaskMetrics

2016-01-19 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-12887.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Do not expose var's in TaskMetrics
> --
>
> Key: SPARK-12887
> URL: https://issues.apache.org/jira/browse/SPARK-12887
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> TaskMetrics has a bunch of var's, some are fully public, some are 
> private[spark]. This is bad coding style that makes it easy to accidentally 
> overwrite previously set metrics. This has happened a few times in the past 
> and caused bugs that were difficult to debug.
> Instead, we should have get-or-create semantics, which are more readily 
> understandable. This makes sense in the case of TaskMetrics because these are 
> just aggregated metrics that we want to collect throughout the task, so it 
> doesn't matter *who*'s incrementing them.
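> The get-or-create style referred to above could look roughly like this 
> (an illustrative sketch, not the actual TaskMetrics code):
> {code}
> class InputMetricsSketch { var bytesRead: Long = 0L }
>
> class TaskMetricsSketch {
>   private var _inputMetrics: Option[InputMetricsSketch] = None
>
>   // Callers ask for the metrics object; it is created once and then reused,
>   // so nobody can accidentally overwrite a previously set instance.
>   def registerInputMetrics(): InputMetricsSketch = _inputMetrics match {
>     case Some(m) => m
>     case None =>
>       val m = new InputMetricsSketch
>       _inputMetrics = Some(m)
>       m
>   }
> }
> {code}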






[jira] [Commented] (SPARK-12485) Rename "dynamic allocation" to "elastic scaling"

2016-01-19 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107151#comment-15107151
 ] 

Andrew Or commented on SPARK-12485:
---

[~srowen] to answer your question no I don't feel super strongly about changing 
it. Naming is difficult in general and I think both "dynamic allocation" and 
"elastic scaling" do mean roughly the same thing. It's just that I slightly 
prefer the latter (or something shorter) after giving a few talks on this topic 
and chatting with a few people about it in real life. I'm also totally cool 
with closing this as a Won't Fix if you or [~markhamstra] prefer.

> Rename "dynamic allocation" to "elastic scaling"
> 
>
> Key: SPARK-12485
> URL: https://issues.apache.org/jira/browse/SPARK-12485
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> Fewer syllables, sounds more natural.






[jira] [Updated] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics

2016-01-19 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10620:
--
Attachment: AccumulatorsandTaskMetricsinSpark2.0.pdf

> Look into whether accumulator mechanism can replace TaskMetrics
> ---
>
> Key: SPARK-10620
> URL: https://issues.apache.org/jira/browse/SPARK-10620
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Andrew Or
> Attachments: AccumulatorsandTaskMetricsinSpark2.0.pdf
>
>
> This task is simply to explore whether the internal representation used by 
> TaskMetrics could be performed by using accumulators rather than having two 
> separate mechanisms. Note that we need to continue to preserve the existing 
> "Task Metric" data structures that are exposed to users through event logs 
> etc. The question is can we use a single internal codepath and perhaps make 
> this easier to extend in the future.
> I think a full exploration would answer the following questions:
> - How do the semantics of accumulators on stage retries differ from aggregate 
> TaskMetrics for a stage? Could we implement clearer retry semantics for 
> internal accumulators to allow them to be the same - for instance, zeroing 
> accumulator values if a stage is retried (see discussion here: SPARK-10042).
> - Are there metrics that do not fit well into the accumulator model, or would 
> be difficult to update as an accumulator.
> - If we expose metrics through accumulators in the future rather than 
> continuing to add fields to TaskMetrics, what is the best way to coerce 
> compatibility?
> - Are there any other considerations?
> - Is it worth it to do this, or is the consolidation too complicated to 
> justify?






[jira] [Updated] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics

2016-01-19 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10620:
--
Attachment: accums-and-task-metrics.pdf

> Look into whether accumulator mechanism can replace TaskMetrics
> ---
>
> Key: SPARK-10620
> URL: https://issues.apache.org/jira/browse/SPARK-10620
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Andrew Or
> Attachments: accums-and-task-metrics.pdf
>
>
> This task is simply to explore whether the internal representation used by 
> TaskMetrics could be performed by using accumulators rather than having two 
> separate mechanisms. Note that we need to continue to preserve the existing 
> "Task Metric" data structures that are exposed to users through event logs 
> etc. The question is can we use a single internal codepath and perhaps make 
> this easier to extend in the future.
> I think a full exploration would answer the following questions:
> - How do the semantics of accumulators on stage retries differ from aggregate 
> TaskMetrics for a stage? Could we implement clearer retry semantics for 
> internal accumulators to allow them to be the same - for instance, zeroing 
> accumulator values if a stage is retried (see discussion here: SPARK-10042).
> - Are there metrics that do not fit well into the accumulator model, or would 
> be difficult to update as an accumulator.
> - If we expose metrics through accumulators in the future rather than 
> continuing to add fields to TaskMetrics, what is the best way to coerce 
> compatibility?
> - Are there any other considerations?
> - Is it worth it to do this, or is the consolidation too complicated to 
> justify?






[jira] [Updated] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics

2016-01-19 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10620:
--
Attachment: (was: AccumulatorsandTaskMetricsinSpark2.0.pdf)

> Look into whether accumulator mechanism can replace TaskMetrics
> ---
>
> Key: SPARK-10620
> URL: https://issues.apache.org/jira/browse/SPARK-10620
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Andrew Or
>
> This task is simply to explore whether the internal representation used by 
> TaskMetrics could be performed by using accumulators rather than having two 
> separate mechanisms. Note that we need to continue to preserve the existing 
> "Task Metric" data structures that are exposed to users through event logs 
> etc. The question is can we use a single internal codepath and perhaps make 
> this easier to extend in the future.
> I think a full exploration would answer the following questions:
> - How do the semantics of accumulators on stage retries differ from aggregate 
> TaskMetrics for a stage? Could we implement clearer retry semantics for 
> internal accumulators to allow them to be the same - for instance, zeroing 
> accumulator values if a stage is retried (see discussion here: SPARK-10042).
> - Are there metrics that do not fit well into the accumulator model, or would 
> be difficult to update as an accumulator.
> - If we expose metrics through accumulators in the future rather than 
> continuing to add fields to TaskMetrics, what is the best way to coerce 
> compatibility?
> - Are there any other considerations?
> - Is it worth it to do this, or is the consolidation too complicated to 
> justify?






[jira] [Created] (SPARK-12895) Implement TaskMetrics using accumulators

2016-01-18 Thread Andrew Or (JIRA)
Andrew Or created SPARK-12895:
-

 Summary: Implement TaskMetrics using accumulators
 Key: SPARK-12895
 URL: https://issues.apache.org/jira/browse/SPARK-12895
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Andrew Or
Assignee: Andrew Or


We need to first do this before we can avoid sending TaskMetrics from the 
executors to the driver. After we do this, we can send only accumulator updates 
instead of both that AND TaskMetrics.

But first, we need to express everything in TaskMetrics as accumulators.






[jira] [Updated] (SPARK-12895) Implement TaskMetrics using accumulators

2016-01-18 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12895:
--
Description: 
We need to first do this before we can avoid sending TaskMetrics from the 
executors to the driver. After we do this, we can send only accumulator updates 
instead of both that AND TaskMetrics.

By the end of this issue TaskMetrics will be a wrapper of accumulators. It will 
be only syntactic sugar for setting these accumulators.

But first, we need to express everything in TaskMetrics as accumulators.

  was:
We need to first do this before we can avoid sending TaskMetrics from the 
executors to the driver. After we do this, we can send only accumulator updates 
instead of both that AND TaskMetrics.

But first, we need to express everything in TaskMetrics as accumulators.


> Implement TaskMetrics using accumulators
> 
>
> Key: SPARK-12895
> URL: https://issues.apache.org/jira/browse/SPARK-12895
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> We need to first do this before we can avoid sending TaskMetrics from the 
> executors to the driver. After we do this, we can send only accumulator 
> updates instead of both that AND TaskMetrics.
> By the end of this issue TaskMetrics will be a wrapper of accumulators. It 
> will be only syntactic sugar for setting these accumulators.
> But first, we need to express everything in TaskMetrics as accumulators.






[jira] [Updated] (SPARK-12896) Send only accumulator updates, not TaskMetrics, to the driver

2016-01-18 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12896:
--
Description: 
Currently in Spark there are two different things that the executors send to 
the driver, accumulators and TaskMetrics. It would allow us to clean up a lot 
of things if we send ONLY accumulator updates instead of both that and 
TaskMetrics. Eventually we will be able to make TaskMetrics private and maybe 
even remove it.

This blocks on SPARK-12895.

  was:This blocks on


> Send only accumulator updates, not TaskMetrics, to the driver
> -
>
> Key: SPARK-12896
> URL: https://issues.apache.org/jira/browse/SPARK-12896
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> Currently in Spark there are two different things that the executors send to 
> the driver, accumulators and TaskMetrics. It would allow us to clean up a lot 
> of things if we send ONLY accumulator updates instead of both that and 
> TaskMetrics. Eventually we will be able to make TaskMetrics private and maybe 
> even remove it.
> This blocks on SPARK-12895.






[jira] [Created] (SPARK-12896) Send only accumulator updates, not TaskMetrics, to the driver

2016-01-18 Thread Andrew Or (JIRA)
Andrew Or created SPARK-12896:
-

 Summary: Send only accumulator updates, not TaskMetrics, to the 
driver
 Key: SPARK-12896
 URL: https://issues.apache.org/jira/browse/SPARK-12896
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Andrew Or
Assignee: Andrew Or


This blocks on






[jira] [Created] (SPARK-12885) Rename 3 fields in ShuffleWriteMetrics

2016-01-18 Thread Andrew Or (JIRA)
Andrew Or created SPARK-12885:
-

 Summary: Rename 3 fields in ShuffleWriteMetrics
 Key: SPARK-12885
 URL: https://issues.apache.org/jira/browse/SPARK-12885
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Minor


Today we have:

{code}
inputMetrics.recordsRead
outputMetrics.bytesWritten
shuffleReadMetrics.localBlocksFetched
...
shuffleWriteMetrics.shuffleRecordsWritten
shuffleWriteMetrics.shuffleBytesWritten
shuffleWriteMetrics.shuffleWriteTime
{code}

The shuffle write ones are kind of redundant. We can drop the shuffle part in 
the method names.
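A sketch of what the rename could look like; since the receiver is already a
ShuffleWriteMetrics, the prefix carries no information. The trait name and the
exact final method names are illustrative:

{code}
trait ShuffleWriteMetricsSketch {
  def recordsWritten: Long   // was: shuffleRecordsWritten
  def bytesWritten: Long     // was: shuffleBytesWritten
  def writeTime: Long        // was: shuffleWriteTime
}
{code}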



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10985) Avoid passing evicted blocks throughout BlockManager / CacheManager

2016-01-18 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10985.
---
  Resolution: Fixed
Assignee: Josh Rosen
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> Avoid passing evicted blocks throughout BlockManager / CacheManager
> ---
>
> Key: SPARK-10985
> URL: https://issues.apache.org/jira/browse/SPARK-10985
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Reporter: Andrew Or
>Assignee: Josh Rosen
>Priority: Minor
> Fix For: 2.0.0
>
>
> This is a minor refactoring task.
> Currently when we attempt to put a block in, we get back an array buffer of 
> blocks that are dropped in the process. We do this to propagate these blocks 
> back to our TaskContext, which will add them to its TaskMetrics so we can see 
> them in the SparkUI storage tab properly.
> Now that we have TaskContext.get, we can just use that to propagate this 
> information. This simplifies a lot of the signatures and gets rid of weird 
> return types like the following everywhere:
> {code}
> ArrayBuffer[(BlockId, BlockStatus)]
> {code}
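A hedged sketch of the direction of that refactoring. The types and the method
below are simplified stand-ins, not the real BlockManager API; the point is
only that TaskContext.get() lets code running inside a task record dropped
blocks directly instead of returning them up through every signature:

{code}
import org.apache.spark.TaskContext

object DroppedBlockSketch {
  // Simplified stand-ins for the real block types.
  case class BlockId(name: String)
  case class BlockStatus(memSize: Long, diskSize: Long)

  // Instead of returning ArrayBuffer[(BlockId, BlockStatus)] from every
  // put-like method, look up the running task and attach the information
  // there (shown here only as a lookup plus a log line).
  def recordDroppedBlock(id: BlockId, status: BlockStatus): Unit = {
    val ctx = TaskContext.get()   // null when not running inside a task
    if (ctx != null) {
      println(s"task ${ctx.taskAttemptId()} dropped $id with status $status")
    }
  }
}
{code}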



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics

2016-01-18 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10620:
--
Comment: was deleted

(was: User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/10811)

> Look into whether accumulator mechanism can replace TaskMetrics
> ---
>
> Key: SPARK-10620
> URL: https://issues.apache.org/jira/browse/SPARK-10620
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Patrick Wendell
>
> This task is simply to explore whether the internal representation used by 
> TaskMetrics could be performed by using accumulators rather than having two 
> separate mechanisms. Note that we need to continue to preserve the existing 
> "Task Metric" data structures that are exposed to users through event logs 
> etc. The question is can we use a single internal codepath and perhaps make 
> this easier to extend in the future.
> I think a full exploration would answer the following questions:
> - How do the semantics of accumulators on stage retries differ from aggregate 
> TaskMetrics for a stage? Could we implement clearer retry semantics for 
> internal accumulators to allow them to be the same - for instance, zeroing 
> accumulator values if a stage is retried (see discussion here: SPARK-10042).
> - Are there metrics that do not fit well into the accumulator model, or would 
> be difficult to update as an accumulator.
> - If we expose metrics through accumulators in the future rather than 
> continuing to add fields to TaskMetrics, what is the best way to coerce 
> compatibility?
> - Are there any other considerations?
> - Is it worth it to do this, or is the consolidation too complicated to 
> justify?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics

2016-01-18 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10620:
--
Comment: was deleted

(was: User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/10810)

> Look into whether accumulator mechanism can replace TaskMetrics
> ---
>
> Key: SPARK-10620
> URL: https://issues.apache.org/jira/browse/SPARK-10620
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Patrick Wendell
>
> This task is simply to explore whether the internal representation used by 
> TaskMetrics could be performed by using accumulators rather than having two 
> separate mechanisms. Note that we need to continue to preserve the existing 
> "Task Metric" data structures that are exposed to users through event logs 
> etc. The question is can we use a single internal codepath and perhaps make 
> this easier to extend in the future.
> I think a full exploration would answer the following questions:
> - How do the semantics of accumulators on stage retries differ from aggregate 
> TaskMetrics for a stage? Could we implement clearer retry semantics for 
> internal accumulators to allow them to be the same - for instance, zeroing 
> accumulator values if a stage is retried (see discussion here: SPARK-10042).
> - Are there metrics that do not fit well into the accumulator model, or would 
> be difficult to update as an accumulator.
> - If we expose metrics through accumulators in the future rather than 
> continuing to add fields to TaskMetrics, what is the best way to coerce 
> compatibility?
> - Are there any other considerations?
> - Is it worth it to do this, or is the consolidation too complicated to 
> justify?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12887) Do not expose var's in TaskMetrics

2016-01-18 Thread Andrew Or (JIRA)
Andrew Or created SPARK-12887:
-

 Summary: Do not expose var's in TaskMetrics
 Key: SPARK-12887
 URL: https://issues.apache.org/jira/browse/SPARK-12887
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Andrew Or
Assignee: Andrew Or


TaskMetrics has a bunch of vars; some are fully public, others are 
private[spark]. This is bad coding style: it makes it easy to accidentally 
overwrite previously set metrics, which has happened a few times in the past 
and caused bugs that were difficult to debug.

Instead, we should have get-or-create semantics, which are more readily 
understandable. This makes sense in the case of TaskMetrics because these are 
just aggregated metrics that we want to collect throughout the task, so it 
doesn't matter *who* is incrementing them.
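A minimal Scala sketch of the get-or-create idea (class and field names are
illustrative): whichever code path touches input metrics first creates the
object, and every later caller gets the same instance back instead of
overwriting it:

{code}
class InputMetricsSketch {
  var recordsRead: Long = 0L
}

class TaskMetricsSketch {
  private var _inputMetrics: Option[InputMetricsSketch] = None

  // Get-or-create: never replaces an existing metrics object.
  def registerInputMetrics(): InputMetricsSketch = synchronized {
    _inputMetrics.getOrElse {
      val m = new InputMetricsSketch
      _inputMetrics = Some(m)
      m
    }
  }
}
{code}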



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12884) Move *Metrics and *Accum* classes to their own files

2016-01-18 Thread Andrew Or (JIRA)
Andrew Or created SPARK-12884:
-

 Summary: Move *Metrics and *Accum* classes to their own files
 Key: SPARK-12884
 URL: https://issues.apache.org/jira/browse/SPARK-12884
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Minor


TaskMetrics.scala and Accumulators.scala are big files. The former contains 8 
classes and objects, and the latter contains 6.

When working through this code I found it difficult to navigate these files 
because of all the unrelated things in them. We should move most of the 
classes to their own files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12174) Slow test: BlockManagerSuite."SPARK-9591: getRemoteBytes from another location when Exception throw"

2016-01-14 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-12174.
---
  Resolution: Fixed
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> Slow test: BlockManagerSuite."SPARK-9591: getRemoteBytes from another 
> location when Exception throw"
> 
>
> Key: SPARK-12174
> URL: https://issues.apache.org/jira/browse/SPARK-12174
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> BlockManagerSuite's "SPARK-9591: getRemoteBytes from another location when 
> Exception throw" test seems to take roughly 45 seconds to run on my laptop, 
> which seems way too long. We should see whether we can change timeout 
> configurations to reduce this test's time.
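Illustrative only: the kind of per-suite timeout overrides the description has
in mind. Which keys and values would actually shorten this particular test
would need to be verified against what BlockManagerSuite uses:

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.network.timeout", "10s")   // instead of the 120s default
  .set("spark.rpc.askTimeout", "5s")
{code}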



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12790) Remove HistoryServer old multiple files format

2016-01-12 Thread Andrew Or (JIRA)
Andrew Or created SPARK-12790:
-

 Summary: Remove HistoryServer old multiple files format
 Key: SPARK-12790
 URL: https://issues.apache.org/jira/browse/SPARK-12790
 Project: Spark
  Issue Type: Sub-task
  Components: Deploy
Reporter: Andrew Or


HistoryServer has two formats. The old one makes a directory and puts multiple 
files in there (APPLICATION_COMPLETE, EVENT_LOG1, etc.). The new one is just a 
single file called something like local_2593759238651.log.

It's been a nightmare to maintain both code paths. We should just remove the 
legacy format (which has been out of use for many versions now) while we still 
have the chance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12627) Remove SparkContext preferred locations constructor

2016-01-05 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15084438#comment-15084438
 ] 

Andrew Or commented on SPARK-12627:
---

Resolved through https://github.com/apache/spark/pull/10569/files

> Remove SparkContext preferred locations constructor
> ---
>
> Key: SPARK-12627
> URL: https://issues.apache.org/jira/browse/SPARK-12627
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> In the code:
> {code}
> logWarning("Passing in preferred locations has no effect at all, see 
> SPARK-8949")
> {code}
> We kept it there only for backward compatibility. Now is the time to remove 
> it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12627) Remove SparkContext preferred locations constructor

2016-01-05 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-12627.
---
   Resolution: Fixed
 Assignee: Reynold Xin  (was: Andrew Or)
Fix Version/s: 2.0.0

> Remove SparkContext preferred locations constructor
> ---
>
> Key: SPARK-12627
> URL: https://issues.apache.org/jira/browse/SPARK-12627
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 2.0.0
>
>
> In the code:
> {code}
> logWarning("Passing in preferred locations has no effect at all, see 
> SPARK-8949")
> {code}
> We kept it there only for backward compatibility. Now is the time to remove 
> it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12627) Remove SparkContext preferred locations constructor

2016-01-04 Thread Andrew Or (JIRA)
Andrew Or created SPARK-12627:
-

 Summary: Remove SparkContext preferred locations constructor
 Key: SPARK-12627
 URL: https://issues.apache.org/jira/browse/SPARK-12627
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Andrew Or
Priority: Blocker


In the code:
{code}
logWarning("Passing in preferred locations has no effect at all, see 
SPARK-8949")
{code}

We kept it there only for backward compatibility. Now is the time to remove it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12627) Remove SparkContext preferred locations constructor

2016-01-04 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reassigned SPARK-12627:
-

Assignee: Andrew Or

> Remove SparkContext preferred locations constructor
> ---
>
> Key: SPARK-12627
> URL: https://issues.apache.org/jira/browse/SPARK-12627
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> In the code:
> {code}
> logWarning("Passing in preferred locations has no effect at all, see 
> SPARK-8949")
> {code}
> We kept it there only for backward compatibility. Now is the time to remove 
> it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12628) SparkUI: weird formatting on additional metrics tooltip

2016-01-04 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12628:
--
Attachment: Screen Shot 2016-01-04 at 1.18.16 PM.png

> SparkUI: weird formatting on additional metrics tooltip
> ---
>
> Key: SPARK-12628
> URL: https://issues.apache.org/jira/browse/SPARK-12628
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Andrew Or
> Attachments: Screen Shot 2016-01-04 at 1.18.16 PM.png
>
>
> See the attached screenshot. I'm on Chrome 47.0.2526, OS X 10.9.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12628) SparkUI: weird formatting on additional metrics tooltip

2016-01-04 Thread Andrew Or (JIRA)
Andrew Or created SPARK-12628:
-

 Summary: SparkUI: weird formatting on additional metrics tooltip
 Key: SPARK-12628
 URL: https://issues.apache.org/jira/browse/SPARK-12628
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: Andrew Or


See the attached screenshot. I'm on Chrome 47.0.2526, OS X 10.9.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12606) Scala/Java compatibility issue Re: how to extend java transformer from Scala UnaryTransformer ?

2016-01-02 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-12606:
---

 Summary: Scala/Java compatibility issue Re: how to extend java 
transformer from Scala UnaryTransformer ?
 Key: SPARK-12606
 URL: https://issues.apache.org/jira/browse/SPARK-12606
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.5.2
 Environment: Java 8, Mac OS, Spark-1.5.2
Reporter: Andrew Davidson



Hi Andy,

I suspect that you hit the Scala/Java compatibility issue; I can also reproduce 
it, so could you file a JIRA to track it?

Yanbo

2016-01-02 3:38 GMT+08:00 Andy Davidson :
I am trying to write a trivial transformer to use in my pipeline. I am using 
Java and Spark 1.5.2. It was suggested that I use the Tokenizer.scala class as 
an example. This should be very easy; however, I do not understand Scala and am 
having trouble debugging the following exception.

Any help would be greatly appreciated.

Happy New Year

Andy

java.lang.IllegalArgumentException: requirement failed: Param null__inputCol 
does not belong to Stemmer_2f3aa96d-7919-4eaa-ad54-f7c620b92d1c.
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.ml.param.Params$class.shouldOwn(params.scala:557)
at org.apache.spark.ml.param.Params$class.set(params.scala:436)
at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
at org.apache.spark.ml.param.Params$class.set(params.scala:422)
at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
at 
org.apache.spark.ml.UnaryTransformer.setInputCol(Transformer.scala:83)
at com.pws.xxx.ml.StemmerTest.test(StemmerTest.java:30)



public class StemmerTest extends AbstractSparkTest {
    @Test
    public void test() {
        Stemmer stemmer = new Stemmer()
                              .setInputCol("raw")       // line 30
                              .setOutputCol("filtered");
    }
}

/**
 * @see spark-1.5.1/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
 * @see https://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
 * @see http://www.tonytruong.net/movie-rating-prediction-with-apache-spark-and-hortonworks/
 *
 * @author andrewdavidson
 */
public class Stemmer extends UnaryTransformer<List<String>, List<String>, Stemmer>
        implements Serializable {
    static Logger logger = LoggerFactory.getLogger(Stemmer.class);
    private static final long serialVersionUID = 1L;
    private static final ArrayType inputType =
            DataTypes.createArrayType(DataTypes.StringType, true);
    private final String uid = Stemmer.class.getSimpleName() + "_"
            + UUID.randomUUID().toString();

    @Override
    public String uid() {
        return uid;
    }

    /*
      override protected def validateInputType(inputType: DataType): Unit = {
        require(inputType == StringType, s"Input type must be string type but got $inputType.")
      }
     */
    @Override
    public void validateInputType(DataType inputTypeArg) {
        String msg = "inputType must be " + inputType.simpleString()
                + " but got " + inputTypeArg.simpleString();
        assert (inputType.equals(inputTypeArg)) : msg;
    }

    @Override
    public Function1<List<String>, List<String>> createTransformFunc() {
        // http://stackoverflow.com/questions/6545066/using-scala-from-java-passing-functions-as-parameters
        Function1<List<String>, List<String>> f =
                new AbstractFunction1<List<String>, List<String>>() {
                    public List<String> apply(List<String> words) {
                        for (String word : words) {
                            logger.error("AEDWIP input word: {}", word);
                        }
                        return words;
                    }
                };

        return f;
    }

    @Override
    public DataType outputDataType() {
        return DataTypes.createArrayType(DataTypes.StringType, true);
    }
}
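For comparison, a minimal Scala sketch of the same transformer, assuming the
Spark 1.5.x ML API. In Scala the uid is supplied through Identifiable.randomUID
before the Param objects are wired up, which appears to be the part the Java
subclass above struggles with (the "Param null__inputCol does not belong to
..." error):

{code}
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

class StemmerSketch(override val uid: String)
  extends UnaryTransformer[Seq[String], Seq[String], StemmerSketch] {

  def this() = this(Identifiable.randomUID("stemmer"))

  // Identity function for illustration; a real stemmer would transform words.
  override protected def createTransformFunc: Seq[String] => Seq[String] =
    words => words

  override protected def validateInputType(inputType: DataType): Unit =
    require(inputType == ArrayType(StringType, true),
      s"Input type must be ArrayType(StringType) but got $inputType.")

  override protected def outputDataType: DataType = ArrayType(StringType, true)

  override def copy(extra: ParamMap): StemmerSketch = defaultCopy(extra)
}
{code}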




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12414) Remove closure serializer

2015-12-30 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075814#comment-15075814
 ] 

Andrew Or edited comment on SPARK-12414 at 12/31/15 7:51 AM:
-

It's also for code cleanup. Right now SparkEnv has a "closure serializer" and a 
"serializer", which is kind of confusing. We should just use Java serializer 
since it's worked for such a long time. I don't know much about Kryo 3.0 but 
I'm not sure if upgrading would be sufficient.


was (Author: andrewor14):
It's also for code cleanup. Right now SparkEnv has a "closure serializer" and a 
"serializer", which is kind of confusing. We should just use Java serializer 
since it's worked for such a long time. Not sure about Kryo 3.0 but I'm not 
sure if upgrading would be sufficient.

> Remove closure serializer
> -
>
> Key: SPARK-12414
> URL: https://issues.apache.org/jira/browse/SPARK-12414
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> There is a config `spark.closure.serializer` that accepts exactly one value: 
> the java serializer. This is because there are currently bugs in the Kryo 
> serializer that make it not a viable candidate. This was uncovered by an 
> unsuccessful attempt to make it work: SPARK-7708.
> My high level point is that the Java serializer has worked well for at least 
> 6 Spark versions now, and it is an incredibly complicated task to get other 
> serializers (not just Kryo) to work with Spark's closures. IMO the effort is 
> not worth it and we should just remove this documentation and all the code 
> associated with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12414) Remove closure serializer

2015-12-30 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075814#comment-15075814
 ] 

Andrew Or commented on SPARK-12414:
---

It's also for code cleanup. Right now SparkEnv has a "closure serializer" and a 
"serializer", which is kind of confusing. We should just use Java serializer 
since it's worked for such a long time. Not sure about Kryo 3.0 but I'm not 
sure if upgrading would be sufficient.

> Remove closure serializer
> -
>
> Key: SPARK-12414
> URL: https://issues.apache.org/jira/browse/SPARK-12414
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> There is a config `spark.closure.serializer` that accepts exactly one value: 
> the java serializer. This is because there are currently bugs in the Kryo 
> serializer that make it not a viable candidate. This was uncovered by an 
> unsuccessful attempt to make it work: SPARK-7708.
> My high level point is that the Java serializer has worked well for at least 
> 6 Spark versions now, and it is an incredibly complicated task to get other 
> serializers (not just Kryo) to work with Spark's closures. IMO the effort is 
> not worth it and we should just remove this documentation and all the code 
> associated with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12582) IndexShuffleBlockResolverSuite fails in windows

2015-12-30 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reassigned SPARK-12582:
-

Assignee: Andrew Or

> IndexShuffleBlockResolverSuite fails in windows
> ---
>
> Key: SPARK-12582
> URL: https://issues.apache.org/jira/browse/SPARK-12582
> Project: Spark
>  Issue Type: Bug
>  Components: Tests, Windows
>Reporter: yucai
>Assignee: Andrew Or
>
> IndexShuffleBlockResolverSuite fails on my Windows development machine.
> {code}
> [info] IndexShuffleBlockResolverSuite:
> [info] - commit shuffle files multiple times *** FAILED *** (388 milliseconds)
> [info]   Array(10, 0, 20) equaled Array(10, 0, 20) 
> (IndexShuffleBlockResolverSuite.scala:108)
> [info]   org.scalatest.exceptions.TestFailedException:
> .
> .
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.shuffle.sort.IndexShuffleB
> lockResolverSuite *** ABORTED *** (2 seconds, 234 milliseconds)
> [info]   java.io.IOException: Failed to delete: 
> C:\Users\yyu29\Documents\codes.next\spark\target\tmp\spark-0e81a15a-e712
> -4b1c-a089-f421db149e65
> [info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:940)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 60)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> {code}
> The root cause is that when "afterEach" tries to clean up data, some files 
> are still open. For example:
> {code}
> // The dataFile should be the previous one
> val in = new FileInputStream(dataFile)
> val firstByte = new Array[Byte](1)
> in.read(firstByte)
> assert(firstByte(0) === 0)
> {code}
> Lack of "in.close()". 
> In Linux, it is not a problem, you can still delete a file even it is open, 
> but this does not work in windows, which will report "resource is busy".
> Another issue is this IndexShuffleBlockResolverSuite.scala is a scala file 
> but it is placed in "test/java".
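A small sketch of the cleanup the description calls for, assuming the stream is
opened directly on the suite's data file (the path below is a hypothetical
stand-in): closing the stream in a finally block releases the handle so the
temporary directory can be deleted on Windows as well:

{code}
import java.io.{File, FileInputStream}

val dataFile = new File("target/tmp/shuffle_0_0_0.data")  // hypothetical path
val in = new FileInputStream(dataFile)
try {
  val firstByte = new Array[Byte](1)
  in.read(firstByte)
  assert(firstByte(0) == 0)
} finally {
  in.close()   // without this, Windows refuses to delete the file in afterEach
}
{code}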



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12582) IndexShuffleBlockResolverSuite fails in windows

2015-12-30 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12582:
--
Component/s: Tests, Windows  (was: Shuffle)

> IndexShuffleBlockResolverSuite fails in windows
> ---
>
> Key: SPARK-12582
> URL: https://issues.apache.org/jira/browse/SPARK-12582
> Project: Spark
>  Issue Type: Bug
>  Components: Tests, Windows
>Reporter: yucai
>
> IndexShuffleBlockResolverSuite fails on my Windows development machine.
> {code}
> [info] IndexShuffleBlockResolverSuite:
> [info] - commit shuffle files multiple times *** FAILED *** (388 milliseconds)
> [info]   Array(10, 0, 20) equaled Array(10, 0, 20) 
> (IndexShuffleBlockResolverSuite.scala:108)
> [info]   org.scalatest.exceptions.TestFailedException:
> .
> .
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.shuffle.sort.IndexShuffleB
> lockResolverSuite *** ABORTED *** (2 seconds, 234 milliseconds)
> [info]   java.io.IOException: Failed to delete: 
> C:\Users\yyu29\Documents\codes.next\spark\target\tmp\spark-0e81a15a-e712
> -4b1c-a089-f421db149e65
> [info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:940)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 60)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> {code}
> The root cause is that when "afterEach" tries to clean up data, some files 
> are still open. For example:
> {code}
> // The dataFile should be the previous one
> val in = new FileInputStream(dataFile)
> val firstByte = new Array[Byte](1)
> in.read(firstByte)
> assert(firstByte(0) === 0)
> {code}
> Lack of "in.close()". 
> In Linux, it is not a problem, you can still delete a file even it is open, 
> but this does not work in windows, which will report "resource is busy".
> Another issue is this IndexShuffleBlockResolverSuite.scala is a scala file 
> but it is placed in "test/java".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12582) IndexShuffleBlockResolverSuite fails in windows

2015-12-30 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12582:
--
Assignee: (was: Apache Spark)
Target Version/s: 1.6.1, 2.0.0

> IndexShuffleBlockResolverSuite fails in windows
> ---
>
> Key: SPARK-12582
> URL: https://issues.apache.org/jira/browse/SPARK-12582
> Project: Spark
>  Issue Type: Bug
>  Components: Tests, Windows
>Reporter: yucai
>
> IndexShuffleBlockResolverSuite fails on my Windows development machine.
> {code}
> [info] IndexShuffleBlockResolverSuite:
> [info] - commit shuffle files multiple times *** FAILED *** (388 milliseconds)
> [info]   Array(10, 0, 20) equaled Array(10, 0, 20) 
> (IndexShuffleBlockResolverSuite.scala:108)
> [info]   org.scalatest.exceptions.TestFailedException:
> .
> .
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.shuffle.sort.IndexShuffleB
> lockResolverSuite *** ABORTED *** (2 seconds, 234 milliseconds)
> [info]   java.io.IOException: Failed to delete: 
> C:\Users\yyu29\Documents\codes.next\spark\target\tmp\spark-0e81a15a-e712
> -4b1c-a089-f421db149e65
> [info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:940)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 60)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> {code}
> The root cause is that when "afterEach" tries to clean up data, some files 
> are still open. For example:
> {code}
> // The dataFile should be the previous one
> val in = new FileInputStream(dataFile)
> val firstByte = new Array[Byte](1)
> in.read(firstByte)
> assert(firstByte(0) === 0)
> {code}
> Lack of "in.close()". 
> In Linux, it is not a problem, you can still delete a file even it is open, 
> but this does not work in windows, which will report "resource is busy".
> Another issue is this IndexShuffleBlockResolverSuite.scala is a scala file 
> but it is placed in "test/java".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12582) IndexShuffleBlockResolverSuite fails in windows

2015-12-30 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12582:
--
Assignee: yucai  (was: Andrew Or)

> IndexShuffleBlockResolverSuite fails in windows
> ---
>
> Key: SPARK-12582
> URL: https://issues.apache.org/jira/browse/SPARK-12582
> Project: Spark
>  Issue Type: Bug
>  Components: Tests, Windows
>Reporter: yucai
>Assignee: yucai
>
> IndexShuffleBlockResolverSuite fails on my Windows development machine.
> {code}
> [info] IndexShuffleBlockResolverSuite:
> [info] - commit shuffle files multiple times *** FAILED *** (388 milliseconds)
> [info]   Array(10, 0, 20) equaled Array(10, 0, 20) 
> (IndexShuffleBlockResolverSuite.scala:108)
> [info]   org.scalatest.exceptions.TestFailedException:
> .
> .
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.shuffle.sort.IndexShuffleB
> lockResolverSuite *** ABORTED *** (2 seconds, 234 milliseconds)
> [info]   java.io.IOException: Failed to delete: 
> C:\Users\yyu29\Documents\codes.next\spark\target\tmp\spark-0e81a15a-e712
> -4b1c-a089-f421db149e65
> [info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:940)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 60)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> {code}
> The root cause is that when "afterEach" tries to clean up data, some files 
> are still open. For example:
> {code}
> // The dataFile should be the previous one
> val in = new FileInputStream(dataFile)
> val firstByte = new Array[Byte](1)
> in.read(firstByte)
> assert(firstByte(0) === 0)
> {code}
> Lack of "in.close()". 
> In Linux, it is not a problem, you can still delete a file even it is open, 
> but this does not work in windows, which will report "resource is busy".
> Another issue is this IndexShuffleBlockResolverSuite.scala is a scala file 
> but it is placed in "test/java".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12554) Standalone mode may hang if max cores is not a multiple of executor cores

2015-12-30 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12554:
--
Priority: Minor  (was: Major)

> Standalone mode may hang if max cores is not a multiple of executor cores
> -
>
> Key: SPARK-12554
> URL: https://issues.apache.org/jira/browse/SPARK-12554
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Scheduler
>Affects Versions: 1.5.2
>Reporter: Lijie Xu
>Priority: Minor
>
> In scheduleExecutorsOnWorker() in Master.scala,
> {{val keepScheduling = coresToAssign >= minCoresPerExecutor}} should be 
> changed to {{val keepScheduling = coresToAssign > 0}}
> Case 1: 
> Suppose that an app's requested cores is 10 (i.e., {{spark.cores.max = 10}}) 
> and app.coresPerExecutor is 4 (i.e., {{spark.executor.cores = 4}}). 
> After allocating two executors (each has 4 cores) to this app, the 
> {{app.coresToAssign = 2}} and {{minCoresPerExecutor = coresPerExecutor = 4}}, 
> so {{keepScheduling = false}} and no extra executor will be allocated to this 
> app. If {{spark.scheduler.minRegisteredResourcesRatio}} is set to a large 
> number (e.g., > 0.8 in this case), the app will hang and never finish.
> Case 2: if a small app's coresPerExecutor is larger than its requested cores 
> (e.g., {{spark.cores.max = 10}}, {{spark.executor.cores = 16}}), {{val 
> keepScheduling = coresToAssign >= minCoresPerExecutor}} is always FALSE. As a 
> result, this app will never get an executor to run.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12554) Standalone mode may hang if max cores is not a multiple of executor cores

2015-12-30 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12554:
--
Summary: Standalone mode may hang if max cores is not a multiple of 
executor cores  (was: Standalone app scheduler will hang when app.coreToAssign 
< minCoresPerExecutor)

> Standalone mode may hang if max cores is not a multiple of executor cores
> -
>
> Key: SPARK-12554
> URL: https://issues.apache.org/jira/browse/SPARK-12554
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Scheduler
>Affects Versions: 1.5.2
>Reporter: Lijie Xu
>
> In scheduleExecutorsOnWorker() in Master.scala,
> {{val keepScheduling = coresToAssign >= minCoresPerExecutor}} should be 
> changed to {{val keepScheduling = coresToAssign > 0}}
> Case 1: 
> Suppose that an app's requested cores is 10 (i.e., {{spark.cores.max = 10}}) 
> and app.coresPerExecutor is 4 (i.e., {{spark.executor.cores = 4}}). 
> After allocating two executors (each has 4 cores) to this app, the 
> {{app.coresToAssign = 2}} and {{minCoresPerExecutor = coresPerExecutor = 4}}, 
> so {{keepScheduling = false}} and no extra executor will be allocated to this 
> app. If {{spark.scheduler.minRegisteredResourcesRatio}} is set to a large 
> number (e.g., > 0.8 in this case), the app will hang and never finish.
> Case 2: if a small app's coresPerExecutor is larger than its requested cores 
> (e.g., {{spark.cores.max = 10}}, {{spark.executor.cores = 16}}), {{val 
> keepScheduling = coresToAssign >= minCoresPerExecutor}} is always FALSE. As a 
> result, this app will never get an executor to run.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12554) Standalone app scheduler will hang when app.coreToAssign < minCoresPerExecutor

2015-12-30 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075353#comment-15075353
 ] 

Andrew Or commented on SPARK-12554:
---

[~jerrylead] The change you are proposing alters the semantics incorrectly.

The right fix here is to adjust the wait behavior. Right now it doesn't take 
into account the fact that executors can have a fixed number of cores, such 
that we never reach `spark.cores.max`. All we need to do is use the right max 
when waiting for resources; e.g. in your example, instead of waiting for all 10 
cores, we wait for the nearest multiple of 4, i.e. 8.

Your case 2 is not a bug at all. The user chose settings that are impossible to 
fulfill. Although there's nothing to fix, we could throw an exception to fail 
the application quickly; but no one really runs into this, so it's probably not 
worth doing.
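A rough sketch of the adjustment described above (names are illustrative, not
the actual Master/scheduler code): when waiting for registered resources, wait
for the largest multiple of the per-executor core count that fits under
spark.cores.max:

{code}
// e.g. maxCores = 10, coresPerExecutor = Some(4)  =>  wait for 8 cores, not 10
def effectiveMaxCores(maxCores: Int, coresPerExecutor: Option[Int]): Int =
  coresPerExecutor match {
    case Some(c) if c > 0 => (maxCores / c) * c
    case _                => maxCores
  }
{code}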

> Standalone app scheduler will hang when app.coreToAssign < minCoresPerExecutor
> --
>
> Key: SPARK-12554
> URL: https://issues.apache.org/jira/browse/SPARK-12554
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Scheduler
>Affects Versions: 1.5.2
>Reporter: Lijie Xu
>
> In scheduleExecutorsOnWorker() in Master.scala,
> {{val keepScheduling = coresToAssign >= minCoresPerExecutor}} should be 
> changed to {{val keepScheduling = coresToAssign > 0}}
> Case 1: 
> Suppose that an app's requested cores is 10 (i.e., {{spark.cores.max = 10}}) 
> and app.coresPerExecutor is 4 (i.e., {{spark.executor.cores = 4}}). 
> After allocating two executors (each has 4 cores) to this app, the 
> {{app.coresToAssign = 2}} and {{minCoresPerExecutor = coresPerExecutor = 4}}, 
> so {{keepScheduling = false}} and no extra executor will be allocated to this 
> app. If {{spark.scheduler.minRegisteredResourcesRatio}} is set to a large 
> number (e.g., > 0.8 in this case), the app will hang and never finish.
> Case 2: if a small app's coresPerExecutor is larger than its requested cores 
> (e.g., {{spark.cores.max = 10}}, {{spark.executor.cores = 16}}), {{val 
> keepScheduling = coresToAssign >= minCoresPerExecutor}} is always FALSE. As a 
> result, this app will never get an executor to run.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12554) Standalone app scheduler will hang when app.coreToAssign < minCoresPerExecutor

2015-12-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12554:
--
Priority: Major  (was: Critical)

> Standalone app scheduler will hang when app.coreToAssign < minCoresPerExecutor
> --
>
> Key: SPARK-12554
> URL: https://issues.apache.org/jira/browse/SPARK-12554
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Scheduler
>Affects Versions: 1.5.2
>Reporter: Lijie Xu
>
> In scheduleExecutorsOnWorker() in Master.scala,
> *val keepScheduling = coresToAssign >= minCoresPerExecutor* should be changed 
> to *val keepScheduling = coresToAssign > 0*
> Suppose that an app's requested cores is 10 (i.e., spark.cores.max = 10) and 
> app.coresPerExecutor is 4 (i.e., spark.executor.cores = 4). 
> After allocating two executors (each has 4 cores) to this app, the 
> *app.coresToAssign = 2* and *minCoresPerExecutor = coresPerExecutor = 4*, so 
> *keepScheduling = false* and no extra executor will be allocated to this app. 
> If *spark.scheduler.minRegisteredResourcesRatio* is set to a large number 
> (e.g., > 0.8 in this case), the app will hang and never finish.
> Another case: if a small app's coresPerExecutor is larger than its requested 
> cores (e.g., spark.cores.max = 10, spark.executor.cores = 16), *val 
> keepScheduling = coresToAssign >= minCoresPerExecutor* is always FALSE. As a 
> result, this app will never get an executor to run.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12559) Standalone cluster mode doesn't work with --packages

2015-12-29 Thread Andrew Or (JIRA)
Andrew Or created SPARK-12559:
-

 Summary: Standalone cluster mode doesn't work with --packages
 Key: SPARK-12559
 URL: https://issues.apache.org/jira/browse/SPARK-12559
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 1.3.0
Reporter: Andrew Or


From the mailing list:
{quote}
Another problem I ran into that you also might is that --packages doesn't
work with --deploy-mode cluster.  It downloads the packages to a temporary
location on the node running spark-submit, then passes those paths to the
node that is running the Driver, but since that isn't the same machine, it
can't find anything and fails.  The driver process *should* be the one
doing the downloading, but it isn't. I ended up having to create a fat JAR
with all of the dependencies to get around that one.
{quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12554) Standalone app scheduler will hang when app.coreToAssign < minCoresPerExecutor

2015-12-29 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15074189#comment-15074189
 ] 

Andrew Or commented on SPARK-12554:
---

I've downgraded the issue.

I also think the behavior is by design. The semantics of `spark.executor.cores` 
is that each executor has exactly N cores. Your cluster "hanging" means it 
doesn't have enough resources to launch the expected number of executors, so 
you should provision more resources.

> Standalone app scheduler will hang when app.coreToAssign < minCoresPerExecutor
> --
>
> Key: SPARK-12554
> URL: https://issues.apache.org/jira/browse/SPARK-12554
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Scheduler
>Affects Versions: 1.5.2
>Reporter: Lijie Xu
>
> In scheduleExecutorsOnWorker() in Master.scala,
> *val keepScheduling = coresToAssign >= minCoresPerExecutor* should be changed 
> to *val keepScheduling = coresToAssign > 0*
> Suppose that an app's requested cores is 10 (i.e., spark.cores.max = 10) and 
> app.coresPerExecutor is 4 (i.e., spark.executor.cores = 4). 
> After allocating two executors (each has 4 cores) to this app, the 
> *app.coresToAssign = 2* and *minCoresPerExecutor = coresPerExecutor = 4*, so 
> *keepScheduling = false* and no extra executor will be allocated to this app. 
> If *spark.scheduler.minRegisteredResourcesRatio* is set to a large number 
> (e.g., > 0.8 in this case), the app will hang and never finish.
> Another case: if a small app's coresPerExecutor is larger than its requested 
> cores (e.g., spark.cores.max = 10, spark.executor.cores = 16), *val 
> keepScheduling = coresToAssign >= minCoresPerExecutor* is always FALSE. As a 
> result, this app will never get an executor to run.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12559) Standalone cluster mode doesn't work with --packages

2015-12-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12559:
--
Description: 
From the mailing list:
{quote}
Another problem I ran into that you also might is that --packages doesn't
work with --deploy-mode cluster.  It downloads the packages to a temporary
location on the node running spark-submit, then passes those paths to the
node that is running the Driver, but since that isn't the same machine, it
can't find anything and fails.  The driver process *should* be the one
doing the downloading, but it isn't. I ended up having to create a fat JAR
with all of the dependencies to get around that one.
{quote}

The problem is that we currently don't upload jars to the cluster. It seems 
that to fix this we should either (1) upload the jars, or (2) just run the 
packages code on the driver side. I slightly prefer (2) because it's simpler.

  was:
From the mailing list:
{quote}
Another problem I ran into that you also might is that --packages doesn't
work with --deploy-mode cluster.  It downloads the packages to a temporary
location on the node running spark-submit, then passes those paths to the
node that is running the Driver, but since that isn't the same machine, it
can't find anything and fails.  The driver process *should* be the one
doing the downloading, but it isn't. I ended up having to create a fat JAR
with all of the dependencies to get around that one.
{quote}


> Standalone cluster mode doesn't work with --packages
> 
>
> Key: SPARK-12559
> URL: https://issues.apache.org/jira/browse/SPARK-12559
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.0
>Reporter: Andrew Or
>
> From the mailing list:
> {quote}
> Another problem I ran into that you also might is that --packages doesn't
> work with --deploy-mode cluster.  It downloads the packages to a temporary
> location on the node running spark-submit, then passes those paths to the
> node that is running the Driver, but since that isn't the same machine, it
> can't find anything and fails.  The driver process *should* be the one
> doing the downloading, but it isn't. I ended up having to create a fat JAR
> with all of the dependencies to get around that one.
> {quote}
> The problem is that we currently don't upload jars to the cluster. It seems 
> that to fix this we should either (1) upload the jars, or (2) just run the 
> packages code on the driver side. I slightly prefer (2) because it's simpler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12565) Document standalone mode jar distribution more clearly

2015-12-29 Thread Andrew Or (JIRA)
Andrew Or created SPARK-12565:
-

 Summary: Document standalone mode jar distribution more clearly
 Key: SPARK-12565
 URL: https://issues.apache.org/jira/browse/SPARK-12565
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.0.0
Reporter: Andrew Or


There's a lot of confusion on the mailing list about how jars are distributed. 
The existing docs are partially responsible for this:

{quote}
If your application is launched through Spark submit, then the application jar 
is automatically distributed to all worker nodes.
{quote}

This makes it seem like even in cluster mode the jars are uploaded from the 
client (the machine running Spark submit) to the cluster, which is not true.

We should clarify the documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12415) Do not use closure serializer to serialize task result

2015-12-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-12415.
-
Resolution: Won't Fix

> Do not use closure serializer to serialize task result
> --
>
> Key: SPARK-12415
> URL: https://issues.apache.org/jira/browse/SPARK-12415
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>
> As the name suggests, closure serializer is for closures. We should be able 
> to use the generic serializer for task results. If we want to do this we need 
> to register `org.apache.spark.scheduler.TaskResult` if we use Kryo.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12415) Do not use closure serializer to serialize task result

2015-12-29 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15074201#comment-15074201
 ] 

Andrew Or commented on SPARK-12415:
---

Closed as Won't Fix. We can't use Kryo to serialize task result because there 
are all these dependent classes in TaskMetrics. If the user adds one but 
forgets to register it with Kryo then suddenly all apps with Kryo will fail.

This issue will be superseded by SPARK-12414.
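For reference, a hedged sketch of what per-class Kryo registration looks like
with the public SparkConf API (MyTaskOutput is a hypothetical user class). The
point above is that every type reachable from TaskMetrics would need to be
covered by registrations like this, which is fragile:

{code}
import org.apache.spark.SparkConf

case class MyTaskOutput(id: Int, value: Double)   // hypothetical user class

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyTaskOutput]))
{code}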

> Do not use closure serializer to serialize task result
> --
>
> Key: SPARK-12415
> URL: https://issues.apache.org/jira/browse/SPARK-12415
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>
> As the name suggests, closure serializer is for closures. We should be able 
> to use the generic serializer for task results. If we want to do this we need 
> to register `org.apache.spark.scheduler.TaskResult` if we use Kryo.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12554) Standalone app scheduler will hang when app.coreToAssign < minCoresPerExecutor

2015-12-29 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15074198#comment-15074198
 ] 

Andrew Or commented on SPARK-12554:
---

If anything I would say it's the calculation of 
{{spark.scheduler.maxRegisteredResourcesWaitingTime}} that's a problem. It 
shouldn't use {{spark.cores.max}} as the threshold max; it should round 
{{spark.cores.max}} to the nearest multiple of {{spark.executor.cores}} if the 
latter is set.

> Standalone app scheduler will hang when app.coreToAssign < minCoresPerExecutor
> --
>
> Key: SPARK-12554
> URL: https://issues.apache.org/jira/browse/SPARK-12554
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Scheduler
>Affects Versions: 1.5.2
>Reporter: Lijie Xu
>
> In scheduleExecutorsOnWorker() in Master.scala,
> *val keepScheduling = coresToAssign >= minCoresPerExecutor* should be changed 
> to *val keepScheduling = coresToAssign > 0*
> Suppose that an app's requested cores is 10 (i.e., spark.cores.max = 10) and 
> app.coresPerExecutor is 4 (i.e., spark.executor.cores = 4). 
> After allocating two executors (each has 4 cores) to this app, the 
> *app.coresToAssign = 2* and *minCoresPerExecutor = coresPerExecutor = 4*, so 
> *keepScheduling = false* and no extra executor will be allocated to this app. 
> If *spark.scheduler.minRegisteredResourcesRatio* is set to a large number 
> (e.g., > 0.8 in this case), the app will hang and never finish.
> Another case: if a small app's coresPerExecutor is larger than its requested 
> cores (e.g., spark.cores.max = 10, spark.executor.cores = 16), *val 
> keepScheduling = coresToAssign >= minCoresPerExecutor* is always FALSE. As a 
> result, this app will never get an executor to run.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12561) Remove JobLogger

2015-12-29 Thread Andrew Or (JIRA)
Andrew Or created SPARK-12561:
-

 Summary: Remove JobLogger
 Key: SPARK-12561
 URL: https://issues.apache.org/jira/browse/SPARK-12561
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Andrew Or
Assignee: Andrew Or


It was research code and has been deprecated since 1.0.0. No one really uses it 
since they can just use event logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12411) Reconsider executor heartbeats rpc timeout

2015-12-23 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12411:
--
Target Version/s: 1.6.1, 2.0.0  (was: 2.0.0)

> Reconsider executor heartbeats rpc timeout
> --
>
> Key: SPARK-12411
> URL: https://issues.apache.org/jira/browse/SPARK-12411
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Nong Li
>Assignee: Nong Li
> Fix For: 1.6.1, 2.0.0
>
>
> Currently, the timeout for deciding that an executor has failed is the same as 
> the sender's timeout ("spark.network.timeout"), which defaults to 120s. This 
> means that if there is a network issue, the executor only gets one try to 
> heartbeat, which probably makes failure detection flaky.
> The executor has a config that controls how often to heartbeat 
> (spark.executor.heartbeatInterval), which defaults to 10s. This combination of 
> configs doesn't seem to make sense; the heartbeat RPC timeout should probably 
> be less than or equal to the heartbeatInterval.
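Illustrative configuration only, to make the mismatch concrete: with the
defaults below, a single heartbeat RPC may block for the full network timeout
even though a fresh heartbeat is due every 10 seconds; the suggestion is to cap
the heartbeat RPC timeout at (or below) the heartbeat interval:

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.network.timeout", "120s")             // default sender-side timeout
  .set("spark.executor.heartbeatInterval", "10s")   // default heartbeat period
{code}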



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12411) Reconsider executor heartbeats rpc timeout

2015-12-23 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12411:
--
Fix Version/s: 1.6.1

> Reconsider executor heartbeats rpc timeout
> --
>
> Key: SPARK-12411
> URL: https://issues.apache.org/jira/browse/SPARK-12411
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Nong Li
>Assignee: Nong Li
> Fix For: 1.6.1, 2.0.0
>
>
> Currently, the timeout for deciding that an executor has failed is the same as 
> the sender's timeout ("spark.network.timeout"), which defaults to 120s. This 
> means that if there is a network issue, the executor only gets one try to 
> heartbeat, which probably makes failure detection flaky.
> The executor has a config that controls how often to heartbeat 
> (spark.executor.heartbeatInterval), which defaults to 10s. This combination of 
> configs doesn't seem to make sense; the heartbeat RPC timeout should probably 
> be less than or equal to the heartbeatInterval.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12483) Data Frame as() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-12483:
---

 Summary: Data Frame as() does not work in Java
 Key: SPARK-12483
 URL: https://issues.apache.org/jira/browse/SPARK-12483
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2
 Environment: Mac El Cap 10.11.2
Java 8
Reporter: Andrew Davidson


The following unit test demonstrates a bug in as(): the column name for aliasDF 
is not changed.

    @Test
    public void bugDataFrameAsTest() {
        DataFrame df = createData();
        df.printSchema();
        df.show();

        DataFrame aliasDF = df.select("id").as("UUID");
        aliasDF.printSchema();
        aliasDF.show();
    }

    DataFrame createData() {
        Features f1 = new Features(1, category1);
        Features f2 = new Features(2, category2);
        ArrayList<Features> data = new ArrayList<Features>(2);
        data.add(f1);
        data.add(f2);
        // JavaRDD<Features> rdd = javaSparkContext.parallelize(Arrays.asList(f1, f2));
        JavaRDD<Features> rdd = javaSparkContext.parallelize(data);
        DataFrame df = sqlContext.createDataFrame(rdd, Features.class);
        return df;
    }

This is the output I got (Spark log messages omitted):

root
 |-- id: integer (nullable = false)
 |-- labelStr: string (nullable = true)

+---+------------+
| id|    labelStr|
+---+------------+
|  1|       noise|
|  2|questionable|
+---+------------+

root
 |-- id: integer (nullable = false)

+---+
| id|
+---+
|  1|
|  2|
+---+
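
For comparison, renaming the column explicitly does change the schema. The sketch below is not part of the original report; it assumes the same test class and the df built by createData() above.

{code}
// Sketch only: two ways to rename the "id" column that do update the schema.
DataFrame renameWorkaround(DataFrame df) {
    DataFrame renamed = df.withColumnRenamed("id", "UUID");
    renamed.printSchema();   // schema now shows UUID instead of id

    DataFrame aliased = df.select(df.col("id").as("UUID"));
    aliased.printSchema();   // same renamed schema
    return renamed;
}
{code}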




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068696#comment-15068696
 ] 

Andrew Davidson commented on SPARK-12484:
-

What I am really trying to do is rewrite the following Python code in Java. 
Ideally I would implement this as an MLlib Transformer; however, that does not 
seem possible at this point in time using the Java API.

Kind regards

Andy

def convertMultinomialLabelToBinary(dataFrame):
    newColName = "binomialLabel"
    binomial = udf(lambda labelStr: labelStr if (labelStr == "noise") else "signal",
                   StringType())
    ret = dataFrame.withColumn(newColName, binomial(dataFrame["label"]))
    return ret
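
For reference, a rough Java sketch of the same transformation using a registered SQL UDF and callUDF(); this is not the attached UDFTest.java, and the UDF name and column names here are illustrative.

{code}
// Rough sketch only: a Java equivalent of the Python function above.
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

public class BinomialLabel {
    public static DataFrame convertMultinomialLabelToBinary(SQLContext sqlContext,
                                                            DataFrame dataFrame) {
        // register the lambda as a SQL UDF returning a string
        sqlContext.udf().register("binomial",
            (UDF1<String, String>) labelStr -> "noise".equals(labelStr) ? labelStr : "signal",
            DataTypes.StringType);
        // apply it to the "label" column and add the result as a new column
        return dataFrame.withColumn("binomialLabel", callUDF("binomial", col("label")));
    }
}
{code}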

> DataFrame withColumn() does not work in Java
> 
>
> Key: SPARK-12484
> URL: https://issues.apache.org/jira/browse/SPARK-12484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: mac El Cap. 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: UDFTest.java
>
>
> DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises
>  org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing 
> from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS 
> transformedByUDF#3];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
> at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
> at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
> at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
> at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12485) Rename "dynamic allocation" to "elastic scaling"

2015-12-22 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12485:
--
Target Version/s: 2.0.0

> Rename "dynamic allocation" to "elastic scaling"
> 
>
> Key: SPARK-12485
> URL: https://issues.apache.org/jira/browse/SPARK-12485
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> Fewer syllables, sounds more natural.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12483) Data Frame as() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson updated SPARK-12483:

Attachment: SPARK_12483_unitTest.java

added a unit test file

> Data Frame as() does not work in Java
> -
>
> Key: SPARK-12483
> URL: https://issues.apache.org/jira/browse/SPARK-12483
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: Mac El Cap 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: SPARK_12483_unitTest.java
>
>
> Following unit test demonstrates a bug in as(). The column name for aliasDF 
> was not changed
>@Test
> public void bugDataFrameAsTest() {
> DataFrame df = createData();
> df.printSchema();
> df.show();
> 
>  DataFrame aliasDF = df.select("id").as("UUID");
>  aliasDF.printSchema();
>  aliasDF.show();
> }
> DataFrame createData() {
> Features f1 = new Features(1, category1);
> Features f2 = new Features(2, category2);
> ArrayList data = new ArrayList(2);
> data.add(f1);
> data.add(f2);
> //JavaRDD rdd = 
> javaSparkContext.parallelize(Arrays.asList(f1, f2));
> JavaRDD rdd = javaSparkContext.parallelize(data);
> DataFrame df = sqlContext.createDataFrame(rdd, Features.class);
> return df;
> }
> This is the output I got (without the spark log msgs)
> root
>  |-- id: integer (nullable = false)
>  |-- labelStr: string (nullable = true)
> +---++
> | id|labelStr|
> +---++
> |  1|   noise|
> |  2|questionable|
> +---++
> root
>  |-- id: integer (nullable = false)
> +---+
> | id|
> +---+
> |  1|
> |  2|
> +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12484) DataFrame withColumn() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson updated SPARK-12484:

Attachment: UDFTest.java

Add a unit test file

> DataFrame withColumn() does not work in Java
> 
>
> Key: SPARK-12484
> URL: https://issues.apache.org/jira/browse/SPARK-12484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: mac El Cap. 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: UDFTest.java
>
>
> DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises
>  org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing 
> from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS 
> transformedByUDF#3];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
> at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
> at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
> at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
> at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068698#comment-15068698
 ] 

Andrew Davidson commented on SPARK-12484:
-

related issue: https://issues.apache.org/jira/browse/SPARK-12483

> DataFrame withColumn() does not work in Java
> 
>
> Key: SPARK-12484
> URL: https://issues.apache.org/jira/browse/SPARK-12484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: mac El Cap. 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: UDFTest.java
>
>
> DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises
>  org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing 
> from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS 
> transformedByUDF#3];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
> at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
> at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
> at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
> at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12483) Data Frame as() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069027#comment-15069027
 ] 

Andrew Davidson commented on SPARK-12483:
-

Hi Xiao

thanks for looking at the issue

is there a way to change a column name? If you do a select() using a data 
frame, the column name is really strange

see attachment for https://issues.apache.org/jira/browse/SPARK-12484

 // get column from data frame call df.withColumnName
Column newCol = udfDF.col("_c0"); 

renaming data frame columns is very common in R

Kind regards

Andy

> Data Frame as() does not work in Java
> -
>
> Key: SPARK-12483
> URL: https://issues.apache.org/jira/browse/SPARK-12483
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: Mac El Cap 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: SPARK_12483_unitTest.java
>
>
> Following unit test demonstrates a bug in as(). The column name for aliasDF 
> was not changed
>@Test
> public void bugDataFrameAsTest() {
> DataFrame df = createData();
> df.printSchema();
> df.show();
> 
>  DataFrame aliasDF = df.select("id").as("UUID");
>  aliasDF.printSchema();
>  aliasDF.show();
> }
> DataFrame createData() {
> Features f1 = new Features(1, category1);
> Features f2 = new Features(2, category2);
> ArrayList data = new ArrayList(2);
> data.add(f1);
> data.add(f2);
> //JavaRDD rdd = 
> javaSparkContext.parallelize(Arrays.asList(f1, f2));
> JavaRDD rdd = javaSparkContext.parallelize(data);
> DataFrame df = sqlContext.createDataFrame(rdd, Features.class);
> return df;
> }
> This is the output I got (without the spark log msgs)
> root
>  |-- id: integer (nullable = false)
>  |-- labelStr: string (nullable = true)
> +---++
> | id|labelStr|
> +---++
> |  1|   noise|
> |  2|questionable|
> +---++
> root
>  |-- id: integer (nullable = false)
> +---+
> | id|
> +---+
> |  1|
> |  2|
> +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069061#comment-15069061
 ] 

Andrew Davidson commented on SPARK-12484:
-

Hi Xiao

thanks for looking into this

Andy

> DataFrame withColumn() does not work in Java
> 
>
> Key: SPARK-12484
> URL: https://issues.apache.org/jira/browse/SPARK-12484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: mac El Cap. 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: UDFTest.java
>
>
> DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises
>  org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing 
> from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS 
> transformedByUDF#3];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
> at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
> at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
> at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
> at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12483) Data Frame as() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson updated SPARK-12483:

Attachment: SPARK_12483_unitTest.java

add a unit test file 

> Data Frame as() does not work in Java
> -
>
> Key: SPARK-12483
> URL: https://issues.apache.org/jira/browse/SPARK-12483
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: Mac El Cap 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: SPARK_12483_unitTest.java
>
>
> Following unit test demonstrates a bug in as(). The column name for aliasDF 
> was not changed
>@Test
> public void bugDataFrameAsTest() {
> DataFrame df = createData();
> df.printSchema();
> df.show();
> 
>  DataFrame aliasDF = df.select("id").as("UUID");
>  aliasDF.printSchema();
>  aliasDF.show();
> }
> DataFrame createData() {
> Features f1 = new Features(1, category1);
> Features f2 = new Features(2, category2);
> ArrayList data = new ArrayList(2);
> data.add(f1);
> data.add(f2);
> //JavaRDD rdd = 
> javaSparkContext.parallelize(Arrays.asList(f1, f2));
> JavaRDD rdd = javaSparkContext.parallelize(data);
> DataFrame df = sqlContext.createDataFrame(rdd, Features.class);
> return df;
> }
> This is the output I got (without the spark log msgs)
> root
>  |-- id: integer (nullable = false)
>  |-- labelStr: string (nullable = true)
> +---++
> | id|labelStr|
> +---++
> |  1|   noise|
> |  2|questionable|
> +---++
> root
>  |-- id: integer (nullable = false)
> +---+
> | id|
> +---+
> |  1|
> |  2|
> +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12484) DataFrame withColumn() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-12484:
---

 Summary: DataFrame withColumn() does not work in Java
 Key: SPARK-12484
 URL: https://issues.apache.org/jira/browse/SPARK-12484
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2
 Environment: mac El Cap. 10.11.2
Java 8
Reporter: Andrew Davidson


DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises

 org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing 
from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS 
transformedByUDF#3];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
at 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068684#comment-15068684
 ] 

Andrew Davidson commented on SPARK-12484:
-

you can find some more background in the email thread 'should I file a bug? 
Re: trouble implementing Transformer and calling DataFrame.withColumn()' 



> DataFrame withColumn() does not work in Java
> 
>
> Key: SPARK-12484
> URL: https://issues.apache.org/jira/browse/SPARK-12484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: mac El Cap. 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: UDFTest.java
>
>
> DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises
>  org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing 
> from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS 
> transformedByUDF#3];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
> at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
> at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
> at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
> at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-12483) Data Frame as() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson updated SPARK-12483:

Comment: was deleted

(was: add a unit test file )

> Data Frame as() does not work in Java
> -
>
> Key: SPARK-12483
> URL: https://issues.apache.org/jira/browse/SPARK-12483
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: Mac El Cap 10.11.2
> Java 8
>Reporter: Andrew Davidson
>
> Following unit test demonstrates a bug in as(). The column name for aliasDF 
> was not changed
>@Test
> public void bugDataFrameAsTest() {
> DataFrame df = createData();
> df.printSchema();
> df.show();
> 
>  DataFrame aliasDF = df.select("id").as("UUID");
>  aliasDF.printSchema();
>  aliasDF.show();
> }
> DataFrame createData() {
> Features f1 = new Features(1, category1);
> Features f2 = new Features(2, category2);
> ArrayList data = new ArrayList(2);
> data.add(f1);
> data.add(f2);
> //JavaRDD rdd = 
> javaSparkContext.parallelize(Arrays.asList(f1, f2));
> JavaRDD rdd = javaSparkContext.parallelize(data);
> DataFrame df = sqlContext.createDataFrame(rdd, Features.class);
> return df;
> }
> This is the output I got (without the spark log msgs)
> root
>  |-- id: integer (nullable = false)
>  |-- labelStr: string (nullable = true)
> +---++
> | id|labelStr|
> +---++
> |  1|   noise|
> |  2|questionable|
> +---++
> root
>  |-- id: integer (nullable = false)
> +---+
> | id|
> +---+
> |  1|
> |  2|
> +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12483) Data Frame as() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson updated SPARK-12483:

Attachment: (was: SPARK_12483_unitTest.java)

> Data Frame as() does not work in Java
> -
>
> Key: SPARK-12483
> URL: https://issues.apache.org/jira/browse/SPARK-12483
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: Mac El Cap 10.11.2
> Java 8
>Reporter: Andrew Davidson
>
> Following unit test demonstrates a bug in as(). The column name for aliasDF 
> was not changed
>@Test
> public void bugDataFrameAsTest() {
> DataFrame df = createData();
> df.printSchema();
> df.show();
> 
>  DataFrame aliasDF = df.select("id").as("UUID");
>  aliasDF.printSchema();
>  aliasDF.show();
> }
> DataFrame createData() {
> Features f1 = new Features(1, category1);
> Features f2 = new Features(2, category2);
> ArrayList data = new ArrayList(2);
> data.add(f1);
> data.add(f2);
> //JavaRDD rdd = 
> javaSparkContext.parallelize(Arrays.asList(f1, f2));
> JavaRDD rdd = javaSparkContext.parallelize(data);
> DataFrame df = sqlContext.createDataFrame(rdd, Features.class);
> return df;
> }
> This is the output I got (without the spark log msgs)
> root
>  |-- id: integer (nullable = false)
>  |-- labelStr: string (nullable = true)
> +---++
> | id|labelStr|
> +---++
> |  1|   noise|
> |  2|questionable|
> +---++
> root
>  |-- id: integer (nullable = false)
> +---+
> | id|
> +---+
> |  1|
> |  2|
> +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12485) Rename "dynamic allocation" to "elastic scaling"

2015-12-22 Thread Andrew Or (JIRA)
Andrew Or created SPARK-12485:
-

 Summary: Rename "dynamic allocation" to "elastic scaling"
 Key: SPARK-12485
 URL: https://issues.apache.org/jira/browse/SPARK-12485
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Andrew Or
Assignee: Andrew Or


Fewer syllables, sounds more natural.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12414) Remove closure serializer

2015-12-22 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reassigned SPARK-12414:
-

Assignee: Andrew Or

> Remove closure serializer
> -
>
> Key: SPARK-12414
> URL: https://issues.apache.org/jira/browse/SPARK-12414
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> There is a config `spark.closure.serializer` that accepts exactly one value: 
> the java serializer. This is because there are currently bugs in the Kryo 
> serializer that make it not a viable candidate. This was uncovered by an 
> unsuccessful attempt to make it work: SPARK-7708.
> My high level point is that the Java serializer has worked well for at least 
> 6 Spark versions now, and it is an incredibly complicated task to get other 
> serializers (not just Kryo) to work with Spark's closures. IMO the effort is 
> not worth it and we should just remove this documentation and all the code 
> associated with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12464) Remove spark.deploy.mesos.zookeeper.url and use spark.deploy.zookeeper.url

2015-12-21 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067062#comment-15067062
 ] 

Andrew Or commented on SPARK-12464:
---

By the way for future reference you probably don't need a separate issue for 
each config. Just have an issue that says
`Remove spark.deploy.mesos.* and use spark.deploy.* instead`. Since you already 
opened these we can just keep them.

> Remove spark.deploy.mesos.zookeeper.url and use spark.deploy.zookeeper.url
> --
>
> Key: SPARK-12464
> URL: https://issues.apache.org/jira/browse/SPARK-12464
> Project: Spark
>  Issue Type: Task
>  Components: Mesos
>Reporter: Timothy Chen
>Assignee: Apache Spark
>
> Remove spark.deploy.mesos.zookeeper.url and use existing configuration 
> spark.deploy.zookeeper.url for Mesos cluster mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12466) Harmless Master NPE in tests

2015-12-21 Thread Andrew Or (JIRA)
Andrew Or created SPARK-12466:
-

 Summary: Harmless Master NPE in tests
 Key: SPARK-12466
 URL: https://issues.apache.org/jira/browse/SPARK-12466
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Tests
Affects Versions: 1.6.0
Reporter: Andrew Or
Assignee: Andrew Or
 Fix For: 1.6.1, 2.0.0


{code}
[info] ReplayListenerSuite:
[info] - Simple replay (58 milliseconds)
java.lang.NullPointerException
at 
org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982)
at 
org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980)
at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117)
at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at 
com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
at 
scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.complete(Promise.scala:55)
at 
scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:23)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[info] - End-to-end replay (10 seconds, 755 milliseconds)
{code}
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull

caused by https://github.com/apache/spark/pull/10284

Thanks to [~ted_yu] for reporting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5882) Add a test for GraphLoader.edgeListFile

2015-12-21 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-5882.
--
  Resolution: Fixed
Assignee: Takeshi Yamamuro
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> Add a test for GraphLoader.edgeListFile
> ---
>
> Key: SPARK-5882
> URL: https://issues.apache.org/jira/browse/SPARK-5882
> Project: Spark
>  Issue Type: Test
>  Components: GraphX
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Trivial
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12414) Remove closure serializer

2015-12-21 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12414:
--
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-11806

> Remove closure serializer
> -
>
> Key: SPARK-12414
> URL: https://issues.apache.org/jira/browse/SPARK-12414
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>
> There is a config `spark.closure.serializer` that accepts exactly one value: 
> the java serializer. This is because there are currently bugs in the Kryo 
> serializer that make it not a viable candidate. This was uncovered by an 
> unsuccessful attempt to make it work: SPARK-7708.
> My high level point is that the Java serializer has worked well for at least 
> 6 Spark versions now, and it is an incredibly complicated task to get other 
> serializers (not just Kryo) to work with Spark's closures. IMO the effort is 
> not worth it and we should just remove this documentation and all the code 
> associated with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12473) Reuse serializer instances for performance

2015-12-21 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12473:
--
Description: 
After commit de02782, the performance of page rank regressed from 242s to 260s, 
about 7%. Although it's currently only 7%, we will likely register more classes 
in the future, so this will only increase.

The commit added 26 types to register every time we create a Kryo serializer 
instance. I ran a small microbenchmark to prove that this is noticeably 
expensive:

{code}
import org.apache.spark.serializer._
import org.apache.spark.SparkConf

def makeMany(num: Int): Long = {
  val start = System.currentTimeMillis
  (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() }
  System.currentTimeMillis - start
}

// before commit de02782, averaged over multiple runs
makeMany(5000) == 1500

// after commit de02782, averaged over multiple runs
makeMany(5000) == 2750
{code}

Since we create multiple serializer instances per partition, this means a 
5000-partition stage will unconditionally see an increase of > 1s for the 
stage. In page rank, we may run many such stages.

We should explore the alternative of reusing thread-local serializer instances, 
which would lead to much fewer calls to `kryo.register`.

  was:
After commit de02782 of page rank regressed from 242s to 260s, about 7%. 
Although currently it's only 7%, we will likely register more classes in the 
future so we should do this the right way.

The commit added 26 types to register every time we create a Kryo serializer 
instance. I ran a small microbenchmark to prove that this is noticeably 
expensive:

{code}
import org.apache.spark.serializer._
import org.apache.spark.SparkConf

def makeMany(num: Int): Long = {
  val start = System.currentTimeMillis
  (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() }
  System.currentTimeMillis - start
}

// before commit de02782, averaged over multiple runs
makeMany(5000) == 1500

// after commit de02782, averaged over multiple runs
makeMany(5000) == 2750
{code}

Since we create multiple serializer instances per partition, this means a 
5000-partition stage will unconditionally see an increase of > 1s for the 
stage. In page rank, we may run many such stages.

We should explore the alternative of reusing thread-local serializer instances, 
which would lead to much fewer calls to `kryo.register`.


> Reuse serializer instances for performance
> --
>
> Key: SPARK-12473
> URL: https://issues.apache.org/jira/browse/SPARK-12473
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> After commit de02782 of page rank regressed from 242s to 260s, about 7%. 
> Although currently it's only 7%, we will likely register more classes in the 
> future so this will only increase.
> The commit added 26 types to register every time we create a Kryo serializer 
> instance. I ran a small microbenchmark to prove that this is noticeably 
> expensive:
> {code}
> import org.apache.spark.serializer._
> import org.apache.spark.SparkConf
> def makeMany(num: Int): Long = {
>   val start = System.currentTimeMillis
>   (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() }
>   System.currentTimeMillis - start
> }
> // before commit de02782, averaged over multiple runs
> makeMany(5000) == 1500
> // after commit de02782, averaged over multiple runs
> makeMany(5000) == 2750
> {code}
> Since we create multiple serializer instances per partition, this means a 
> 5000-partition stage will unconditionally see an increase of > 1s for the 
> stage. In page rank, we may run many such stages.
> We should explore the alternative of reusing thread-local serializer 
> instances, which would lead to much fewer calls to `kryo.register`.
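
As a concrete illustration of the thread-local reuse idea proposed above, here is a rough sketch; it is not the eventual patch, and the class and field names are made up.

{code}
// Rough sketch only: each thread builds one Kryo-backed SerializerInstance and
// reuses it, so the added class registrations run once per thread instead of
// once per serializer instance created.
import org.apache.spark.SparkConf;
import org.apache.spark.serializer.KryoSerializer;
import org.apache.spark.serializer.SerializerInstance;

public class CachedKryo {
    private static final SparkConf CONF = new SparkConf();

    private static final ThreadLocal<SerializerInstance> CACHED =
        new ThreadLocal<SerializerInstance>() {
            @Override
            protected SerializerInstance initialValue() {
                return new KryoSerializer(CONF).newInstance();
            }
        };

    public static SerializerInstance get() {
        return CACHED.get();
    }
}
{code}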



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12339) NullPointerException on stage kill from web UI

2015-12-21 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067154#comment-15067154
 ] 

Andrew Or commented on SPARK-12339:
---

I've updated the affected version to 2.0 since SPARK-11206 was merged only 
there. Please let me know if this is not the case.

> NullPointerException on stage kill from web UI
> --
>
> Key: SPARK-12339
> URL: https://issues.apache.org/jira/browse/SPARK-12339
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Assignee: Alex Bozarth
> Fix For: 2.0.0
>
>
> The following message is in the logs after killing a stage:
> {code}
> scala> INFO Executor: Executor killed task 1.0 in stage 7.0 (TID 33)
> INFO Executor: Executor killed task 0.0 in stage 7.0 (TID 32)
> WARN TaskSetManager: Lost task 1.0 in stage 7.0 (TID 33, localhost): 
> TaskKilled (killed intentionally)
> WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 32, localhost): 
> TaskKilled (killed intentionally)
> INFO TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, 
> from pool
> ERROR LiveListenerBus: Listener SQLListener threw an exception
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
>   at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1169)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
> ERROR LiveListenerBus: Listener SQLListener threw an exception
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
>   at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1169)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
> {code}
> To reproduce, start a job and kill the stage from web UI, e.g.:
> {code}
> val rdd = sc.parallelize(0 to 9, 2)
> rdd.mapPartitionsWithIndex { case (n, it) => Thread.sleep(10 * 1000); it 
> }.count
> {code}
> Go to web UI and in Stages tab click "kill" for the stage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12339) NullPointerException on stage kill from web UI

2015-12-21 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12339:
--
Affects Version/s: (was: 1.6.0)
   2.0.0

> NullPointerException on stage kill from web UI
> --
>
> Key: SPARK-12339
> URL: https://issues.apache.org/jira/browse/SPARK-12339
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Assignee: Alex Bozarth
> Fix For: 2.0.0
>
>
> The following message is in the logs after killing a stage:
> {code}
> scala> INFO Executor: Executor killed task 1.0 in stage 7.0 (TID 33)
> INFO Executor: Executor killed task 0.0 in stage 7.0 (TID 32)
> WARN TaskSetManager: Lost task 1.0 in stage 7.0 (TID 33, localhost): 
> TaskKilled (killed intentionally)
> WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 32, localhost): 
> TaskKilled (killed intentionally)
> INFO TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, 
> from pool
> ERROR LiveListenerBus: Listener SQLListener threw an exception
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
>   at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1169)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
> ERROR LiveListenerBus: Listener SQLListener threw an exception
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
>   at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1169)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
> {code}
> To reproduce, start a job and kill the stage from web UI, e.g.:
> {code}
> val rdd = sc.parallelize(0 to 9, 2)
> rdd.mapPartitionsWithIndex { case (n, it) => Thread.sleep(10 * 1000); it 
> }.count
> {code}
> Go to web UI and in Stages tab click "kill" for the stage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12392) Optimize a location order of broadcast blocks by considering preferred local hosts

2015-12-21 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-12392.
---
  Resolution: Fixed
Assignee: Takeshi Yamamuro
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> Optimize a location order of broadcast blocks by considering preferred local 
> hosts
> --
>
> Key: SPARK-12392
> URL: https://issues.apache.org/jira/browse/SPARK-12392
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
> Fix For: 2.0.0
>
>
> When multiple workers exist in a host, we can bypass unnecessary remote 
> access for broadcasts; block managers fetch broadcast blocks from the same 
> host instead of remote hosts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12473) Reuse serializer instances for performance

2015-12-21 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12473:
--
Description: 
After commit de02782 of page rank regressed from 242s to 260s, about 7%. 
Although currently it's only 7%, we will likely register more classes in the 
future so we should do this the right way.

The commit added 26 types to register every time we create a Kryo serializer 
instance. I ran a small microbenchmark to prove that this is noticeably 
expensive:

{code}
import org.apache.spark.serializer._
import org.apache.spark.SparkConf

def makeMany(num: Int): Long = {
  val start = System.currentTimeMillis
  (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() }
  System.currentTimeMillis - start
}

// before commit de02782, averaged over multiple runs
makeMany(5000) == 1500

// after commit de02782, averaged over multiple runs
makeMany(5000) == 2750
{code}

Since we create multiple serializer instances per partition, this means a 
5000-partition stage will unconditionally see an increase of > 1s for the 
stage. In page rank, we may run many such stages.

We should explore the alternative of reusing thread-local serializer instances, 
which would lead to much fewer calls to `kryo.register`.

  was:
After commit de02782 of page rank regressed from 242s to 260s, about 7%.

The commit added 26 types to register every time we create a Kryo serializer 
instance. I ran a small microbenchmark to prove that this is noticeably 
expensive:

{code}
import org.apache.spark.serializer._
import org.apache.spark.SparkConf

def makeMany(num: Int): Long = {
  val start = System.currentTimeMillis
  (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() }
  System.currentTimeMillis - start
}

// before commit de02782, averaged over multiple runs
makeMany(5000) == 1500

// after commit de02782, averaged over multiple runs
makeMany(5000) == 2750
{code}

Since we create multiple serializer instances per partition, this means a 
5000-partition stage will unconditionally see an increase of > 1s for the 
stage. In page rank, we may run many such stages.

We should explore the alternative of reusing thread-local serializer instances, 
which would lead to much fewer calls to `kryo.register`.


> Reuse serializer instances for performance
> --
>
> Key: SPARK-12473
> URL: https://issues.apache.org/jira/browse/SPARK-12473
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> After commit de02782 of page rank regressed from 242s to 260s, about 7%. 
> Although currently it's only 7%, we will likely register more classes in the 
> future so we should do this the right way.
> The commit added 26 types to register every time we create a Kryo serializer 
> instance. I ran a small microbenchmark to prove that this is noticeably 
> expensive:
> {code}
> import org.apache.spark.serializer._
> import org.apache.spark.SparkConf
> def makeMany(num: Int): Long = {
>   val start = System.currentTimeMillis
>   (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() }
>   System.currentTimeMillis - start
> }
> // before commit de02782, averaged over multiple runs
> makeMany(5000) == 1500
> // after commit de02782, averaged over multiple runs
> makeMany(5000) == 2750
> {code}
> Since we create multiple serializer instances per partition, this means a 
> 5000-partition stage will unconditionally see an increase of > 1s for the 
> stage. In page rank, we may run many such stages.
> We should explore the alternative of reusing thread-local serializer 
> instances, which would lead to much fewer calls to `kryo.register`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12466) Harmless Master NPE in tests

2015-12-21 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-12466.
---
Resolution: Fixed

> Harmless Master NPE in tests
> 
>
> Key: SPARK-12466
> URL: https://issues.apache.org/jira/browse/SPARK-12466
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Tests
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 1.6.1, 2.0.0
>
>
> {code}
> [info] ReplayListenerSuite:
> [info] - Simple replay (58 milliseconds)
> java.lang.NullPointerException
>   at 
> org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982)
>   at 
> org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980)
>   at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117)
>   at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>   at 
> com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
>   at 
> scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>   at scala.concurrent.Promise$class.complete(Promise.scala:55)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:23)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> [info] - End-to-end replay (10 seconds, 755 milliseconds)
> {code}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull
> caused by https://github.com/apache/spark/pull/10284
> Thanks to [~ted_yu] for reporting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2331) SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]

2015-12-21 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-2331.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

> SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]
> --
>
> Key: SPARK-2331
> URL: https://issues.apache.org/jira/browse/SPARK-2331
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Ian Hummel
>Assignee: Reynold Xin
>Priority: Minor
> Fix For: 2.0.0
>
>
> The return type for SparkContext.emptyRDD is EmptyRDD[T].
> It should be RDD[T].  That means you have to add extra type annotations on 
> code like the below (which creates a union of RDDs over some subset of paths 
> in a folder)
> {code}
> val rdds = Seq("a", "b", "c").foldLeft[RDD[String]](sc.emptyRDD[String]) { 
> (rdd, path) ⇒
>   rdd.union(sc.textFile(path))
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12339) NullPointerException on stage kill from web UI

2015-12-21 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-12339.
---
  Resolution: Fixed
Assignee: Alex Bozarth  (was: Apache Spark)
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> NullPointerException on stage kill from web UI
> --
>
> Key: SPARK-12339
> URL: https://issues.apache.org/jira/browse/SPARK-12339
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Jacek Laskowski
>Assignee: Alex Bozarth
> Fix For: 2.0.0
>
>
> The following message is in the logs after killing a stage:
> {code}
> scala> INFO Executor: Executor killed task 1.0 in stage 7.0 (TID 33)
> INFO Executor: Executor killed task 0.0 in stage 7.0 (TID 32)
> WARN TaskSetManager: Lost task 1.0 in stage 7.0 (TID 33, localhost): 
> TaskKilled (killed intentionally)
> WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 32, localhost): 
> TaskKilled (killed intentionally)
> INFO TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, 
> from pool
> ERROR LiveListenerBus: Listener SQLListener threw an exception
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
>   at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1169)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
> ERROR LiveListenerBus: Listener SQLListener threw an exception
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
>   at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1169)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
> {code}
> To reproduce, start a job and kill the stage from the web UI, e.g.:
> {code}
> val rdd = sc.parallelize(0 to 9, 2)
> rdd.mapPartitionsWithIndex { case (n, it) => Thread.sleep(10 * 1000); it 
> }.count
> {code}
> Go to the web UI and, in the Stages tab, click "kill" for the stage.
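> The same code path can also be exercised from the shell without clicking 
> through the UI, e.g. via SparkContext.cancelStage (a sketch only, assuming 
> that is the call the kill link ends up making; the stage id below is a guess 
> for a fresh shell):
> {code}
> import scala.concurrent.Future
> import scala.concurrent.ExecutionContext.Implicits.global
> 
> val rdd = sc.parallelize(0 to 9, 2)
> // Run the slow count in the background so the shell stays usable.
> Future { rdd.mapPartitionsWithIndex { case (n, it) => Thread.sleep(10 * 1000); it }.count() }
> // Give the stage a moment to start, then cancel it by id; check the Stages
> // tab (or a SparkListener) for the real id if 0 is not it.
> Thread.sleep(2000)
> sc.cancelStage(0)
> {code}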






[jira] [Updated] (SPARK-12473) Reuse serializer instances for performance

2015-12-21 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12473:
--
Description: 
After commit de02782, page rank regressed from 242s to 260s, a slowdown of about 
7%.

The commit added 26 types that are registered every time we create a Kryo 
serializer instance. I ran a small microbenchmark to prove that this is 
noticeably expensive:

{code}
import org.apache.spark.serializer._
import org.apache.spark.SparkConf

def makeMany(num: Int): Long = {
  val start = System.currentTimeMillis
  (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() }
  System.currentTimeMillis - start
}

// before commit de02782, averaged over multiple runs
makeMany(5000) == 1500

// after commit de02782, averaged over multiple runs
makeMany(5000) == 2750
{code}

Since we create multiple serializer instances per partition, this means a 
5000-partition stage will unconditionally see an increase of > 1s for the 
stage. In page rank, we may run many such stages.

We should explore the alternative of reusing thread-local serializer instances, 
which would lead to far fewer calls to `kryo.register`.

  was:
After commit de02782, page rank regressed from 242s to 260s, a slowdown of about 
7%.

The commit added 26 types that are registered every time we create a Kryo 
serializer instance. I ran a small microbenchmark to prove that this is 
noticeably expensive:

{code}
import org.apache.spark.serializer._
import org.apache.spark.SparkConf

def makeMany(num: Int): Long = {
  val start = System.currentTimeMillis
  (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() }
  System.currentTimeMillis - start
}

// before commit de02782, averaged over multiple runs
makeMany(5000) == 1500

// after commit de02782, averaged over multiple runs
makeMany(5000) == 2750
{code}

Since we create multiple serializer instances per partition, this means a 
5000-partition stage will unconditionally see an increase of > 1s for the 
stage. In page rank, we may run many such stages.


> Reuse serializer instances for performance
> --
>
> Key: SPARK-12473
> URL: https://issues.apache.org/jira/browse/SPARK-12473
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> After commit de02782, page rank regressed from 242s to 260s, a slowdown of about 
> 7%.
> The commit added 26 types that are registered every time we create a Kryo 
> serializer instance. I ran a small microbenchmark to prove that this is 
> noticeably expensive:
> {code}
> import org.apache.spark.serializer._
> import org.apache.spark.SparkConf
> def makeMany(num: Int): Long = {
>   val start = System.currentTimeMillis
>   (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() }
>   System.currentTimeMillis - start
> }
> // before commit de02782, averaged over multiple runs
> makeMany(5000) == 1500
> // after commit de02782, averaged over multiple runs
> makeMany(5000) == 2750
> {code}
> Since we create multiple serializer instances per partition, this means a 
> 5000-partition stage will unconditionally see an increase of > 1s for the 
> stage. In page rank, we may run many such stages.
> We should explore the alternative of reusing thread-local serializer 
> instances, which would lead to far fewer calls to `kryo.register`.
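> A minimal sketch of what such reuse could look like (not the actual patch; the 
> pooling wrapper here is hypothetical, only KryoSerializer and SerializerInstance 
> come from the existing API):
> {code}
> import org.apache.spark.SparkConf
> import org.apache.spark.serializer.{KryoSerializer, SerializerInstance}
> 
> // Hypothetical wrapper: one SerializerInstance per thread, so the costly
> // register calls in newKryo() happen once per thread rather than once per task.
> class ThreadLocalKryoPool(conf: SparkConf) {
>   private val serializer = new KryoSerializer(conf)
>   private val instances = new ThreadLocal[SerializerInstance] {
>     override def initialValue(): SerializerInstance = serializer.newInstance()
>   }
>   def get(): SerializerInstance = instances.get()
> }
> {code}
> Call sites that currently create a fresh instance per batch would then call 
> pool.get() instead.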






[jira] [Created] (SPARK-12473) Reuse serializer instances for performance

2015-12-21 Thread Andrew Or (JIRA)
Andrew Or created SPARK-12473:
-

 Summary: Reuse serializer instances for performance
 Key: SPARK-12473
 URL: https://issues.apache.org/jira/browse/SPARK-12473
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Andrew Or
Assignee: Andrew Or


After commit de02782, page rank regressed from 242s to 260s, a slowdown of about 
7%.

The commit added 26 types that are registered every time we create a Kryo 
serializer instance. I ran a small microbenchmark to prove that this is 
noticeably expensive:

{code}
import org.apache.spark.serializer._
import org.apache.spark.SparkConf

def makeMany(num: Int): Long = {
  val start = System.currentTimeMillis
  (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() }
  System.currentTimeMillis - start
}

// before commit de02782, averaged over multiple runs
makeMany(5000) == 1500

// after commit de02782, averaged over multiple runs
makeMany(5000) == 2750
{code}

Since we create multiple serializer instances per partition, this means a 
5000-partition stage will unconditionally see an increase of > 1s for the 
stage. In page rank, we may run many such stages.





