[jira] [Commented] (SPARK-13305) With SPARK_WORKER_WEBUI_PORT and --webui-port set for start-slave.sh script, --webui-port is used twice
[ https://issues.apache.org/jira/browse/SPARK-13305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145512#comment-15145512 ] Jacek Laskowski commented on SPARK-13305: - I was explicit about the different ways of setting the port of the worker's web UI, but it could be harder to figure out. {{SPARK_WORKER_WEBUI_PORT=1}} could be in {{conf/spark-env.sh}}. The point is that [WorkerArguments|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/WorkerArguments.scala#L48] does the env var lookup anyway, so there is no need to have the env var mapped to {{--webui-port}} inside {{sbin/start-slave.sh}}. I think {{sbin/start-slave.sh}} should *not* set it as {{--webui-port}}, but just {{export}} it (as {{bin/load-spark-env.sh}} does with the other env vars). It should be an easy fix that would make the Spark command easier on the eyes, i.e. without {{--webui-port}} used twice (that looks just...messy). > With SPARK_WORKER_WEBUI_PORT and --webui-port set for start-slave.sh script, > --webui-port is used twice > --- > > Key: SPARK-13305 > URL: https://issues.apache.org/jira/browse/SPARK-13305 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Priority: Minor > > Executing the following command to start a worker: > {code} > SPARK_WORKER_WEBUI_PORT=1 ./sbin/start-slave.sh spark://localhost:7077 > --webui-port 2 > {code} > ends up with the following Spark command (in the log file) -- some characters > cut off to make it relevant: > {code} > Spark Command: [cut] org.apache.spark.deploy.worker.Worker --webui-port 1 > spark://localhost:7077 --webui-port 2 > {code} > Note {{--webui-port}} set twice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
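The resolution order the comment argues for is easy to model. The following Python toy (hypothetical helper name, not Spark code) sketches it under the assumed semantics: the env var supplies a default, an explicit {{--webui-port}} flag overrides it, and the launcher never needs to emit the flag twice:

```python
DEFAULT_WEBUI_PORT = 8081  # stock default for the worker web UI

def resolve_webui_port(argv, env):
    """Toy model of WorkerArguments-style resolution: start from the
    env var (or the default), then let an explicit flag win."""
    port = int(env.get("SPARK_WORKER_WEBUI_PORT", DEFAULT_WEBUI_PORT))
    args = list(argv)
    while args:
        if args.pop(0) == "--webui-port":
            port = int(args.pop(0))
    return port

# Only the env var set: its value beats the default.
assert resolve_webui_port([], {"SPARK_WORKER_WEBUI_PORT": "1"}) == 1
# Both set: the explicit flag wins, with no duplicate flag needed.
assert resolve_webui_port(["--webui-port", "2"],
                          {"SPARK_WORKER_WEBUI_PORT": "1"}) == 2
```

Since the worker already consults the environment, exporting the variable (rather than translating it into a flag) yields the same precedence with a cleaner command line.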
[jira] [Updated] (SPARK-6166) Limit number of in flight outbound requests for shuffle fetch
[ https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-6166: Assignee: Sanket Reddy > Limit number of in flight outbound requests for shuffle fetch > - > > Key: SPARK-6166 > URL: https://issues.apache.org/jira/browse/SPARK-6166 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Mridul Muralidharan >Assignee: Sanket Reddy >Priority: Minor > Fix For: 2.0.0 > > > spark.reducer.maxMbInFlight puts a bound on the in-flight data in terms of > size. > But this is not always sufficient: when the number of hosts in the cluster > increases, this can lead to a very large number of inbound connections to one > or more nodes, causing workers to fail under the load. > I propose we also add a spark.reducer.maxReqsInFlight, which puts a bound on the > number of outstanding outbound requests. > This might still cause hotspots in the cluster, but in our tests this has > significantly reduced the occurrence of worker failures.
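The idea behind the proposed spark.reducer.maxReqsInFlight can be sketched with a counting semaphore. This is an illustrative Python toy, not the actual Netty-based fetcher: a fixed number of slots caps how many outbound fetch requests are outstanding at once.

```python
import threading

class BoundedFetcher:
    """Toy sketch: cap simultaneous outbound fetch requests, analogous
    to a maxReqsInFlight-style setting."""
    def __init__(self, max_reqs_in_flight):
        self._slots = threading.Semaphore(max_reqs_in_flight)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest number of concurrent requests observed

    def fetch(self, block_id):
        self._slots.acquire()          # blocks while too many requests are out
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        try:
            return ("data", block_id)  # stand-in for the network round trip
        finally:
            with self._lock:
                self._active -= 1
            self._slots.release()

fetcher = BoundedFetcher(max_reqs_in_flight=2)
threads = [threading.Thread(target=fetcher.fetch, args=(i,)) for i in range(8)]
for t in threads: t.start()
for t in threads: t.join()
assert fetcher.peak <= 2  # concurrency never exceeded the configured bound
```

A request-count bound complements the byte-size bound: many small requests can overwhelm a node's connection handling even when the total bytes in flight are modest.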
[jira] [Updated] (SPARK-13306) Uncorrelated scalar subquery
[ https://issues.apache.org/jira/browse/SPARK-13306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-13306: --- Component/s: SQL > Uncorrelated scalar subquery > > > Key: SPARK-13306 > URL: https://issues.apache.org/jira/browse/SPARK-13306 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > A scalar subquery is a subquery that generates only a single row and a single > column, and can be used as part of an expression. > An uncorrelated scalar subquery is one that does not reference any table from the > outer query.
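The single-row, single-column contract can be illustrated in a few lines of Python. This is a toy model of the semantics, not Spark's implementation: the subquery's result is checked for shape, and its one value is then usable inside an ordinary expression.

```python
def scalar_subquery(rows):
    """A scalar subquery must yield exactly one row with one column;
    its single value can then be used inside any expression."""
    if len(rows) != 1 or len(rows[0]) != 1:
        raise ValueError("scalar subquery must return a single row and column")
    return rows[0][0]

# Models: SELECT * FROM t WHERE amount > (SELECT avg(amount) FROM t)
amounts = [10, 20, 60]
threshold = scalar_subquery([(sum(amounts) / len(amounts),)])
assert threshold == 30.0
assert [a for a in amounts if a > threshold] == [60]
```

Because the subquery has no reference to the outer query, it can be evaluated once up front and its value substituted into the enclosing expression.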
[jira] [Commented] (SPARK-7367) spark-submit CLI --help -h overrides the application arguments
[ https://issues.apache.org/jira/browse/SPARK-7367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145520#comment-15145520 ] Apache Spark commented on SPARK-7367: - User 'BimalTandel' has created a pull request for this issue: https://github.com/apache/spark/pull/11191 > spark-submit CLI --help -h overrides the application arguments > -- > > Key: SPARK-7367 > URL: https://issues.apache.org/jira/browse/SPARK-7367 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Reporter: Gianmario Spacagna >Priority: Minor > > The spark-submit script will parse the --help argument even if it is provided as an > application argument. > E.g. > spark-submit --master local[*] --driver-memory 4G --class bar.foo.MyClass > /MyLocalJAR.jar --help > or > spark-submit --master local[*] --driver-memory 4G --class bar.foo.MyClass > /MyLocalJAR.jar -h > If my application uses a parsing library, such as Scallop, then it will > never be able to run with --help as an argument. > I think the spark-submit script should only print the help message when it is > provided as the single argument, like this: > spark-submit --help > or it should provide a separator for trailing arguments: > spark-submit --master local[*] --driver-memory 4G --class bar.foo.MyClass > /MyLocalJAR.jar -- --help --arg1 --arg2
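The second suggestion in the report, a {{--}} separator, is the convention many CLIs use. A minimal Python sketch of that parsing (hypothetical helper name, not the spark-submit implementation):

```python
def split_spark_and_app_args(argv):
    """Split at the first '--': everything before it belongs to the
    launcher, everything after it passes untouched to the application."""
    if "--" in argv:
        i = argv.index("--")
        return argv[:i], argv[i + 1:]
    return argv, []

spark_args, app_args = split_spark_and_app_args(
    ["--master", "local[*]", "--class", "bar.foo.MyClass",
     "/MyLocalJAR.jar", "--", "--help", "--arg1"])
assert "--help" in app_args       # reaches the application's own parser
assert "--help" not in spark_args # never interpreted by the launcher
```

With such a separator, an application using Scallop (or any argument-parsing library) could receive its own --help without the launcher intercepting it.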
[jira] [Assigned] (SPARK-7367) spark-submit CLI --help -h overrides the application arguments
[ https://issues.apache.org/jira/browse/SPARK-7367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7367: --- Assignee: Apache Spark
[jira] [Assigned] (SPARK-7367) spark-submit CLI --help -h overrides the application arguments
[ https://issues.apache.org/jira/browse/SPARK-7367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7367: --- Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-13305) With SPARK_WORKER_WEBUI_PORT and --webui-port set for start-slave.sh script, --webui-port is used twice
[ https://issues.apache.org/jira/browse/SPARK-13305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145396#comment-15145396 ] Sean Owen commented on SPARK-13305: --- That looks like what I'd expect it to do. You set the value twice, in two different ways, for some reason, and it's set twice. It's not what you want to do, so how can this be a problem?
[jira] [Updated] (SPARK-13307) TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1
[ https://issues.apache.org/jira/browse/SPARK-13307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-13307: --- Summary: TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1 (was: TPCDS query 66 degraded by 35% in 1.6.0 compared to 1.4.1) > TPCDS query 66 degraded by 30% in 1.6.0 compared to 1.4.1 > - > > Key: SPARK-13307 > URL: https://issues.apache.org/jira/browse/SPARK-13307 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: JESSE CHEN > Labels: spark, sql > > The majority of the TPCDS queries ran faster in 1.6.0 than in 1.4.1, on average > about 9% faster. A few degraded, and one that is definitely not > within the error margin is query 66. > Query 66 in 1.4.1: 699 seconds > Query 66 in 1.6.0: 918 seconds > 30% worse. > Collected the physical plans from both versions -- the drastic difference may come > partially from using Tungsten in 1.6, but is anything else at play here? > Please see the plans here: > https://ibm.box.com/spark-sql-q66-debug-160plan > https://ibm.box.com/spark-sql-q66-debug-141plan
[jira] [Created] (SPARK-13307) TPCDS query 66 degraded by 35% in 1.6.0 compared to 1.4.1
JESSE CHEN created SPARK-13307: -- Summary: TPCDS query 66 degraded by 35% in 1.6.0 compared to 1.4.1 Key: SPARK-13307 URL: https://issues.apache.org/jira/browse/SPARK-13307 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Reporter: JESSE CHEN
[jira] [Commented] (SPARK-5095) Support launching multiple mesos executors in coarse grained mesos mode
[ https://issues.apache.org/jira/browse/SPARK-5095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145211#comment-15145211 ] Apache Spark commented on SPARK-5095: - User 'mgummelt' has created a pull request for this issue: https://github.com/apache/spark/pull/11164 > Support launching multiple mesos executors in coarse grained mesos mode > --- > > Key: SPARK-5095 > URL: https://issues.apache.org/jira/browse/SPARK-5095 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 1.0.0 >Reporter: Timothy Chen >Assignee: Timothy Chen > Fix For: 2.0.0 > > > Currently in coarse-grained mesos mode, it's expected that we only launch one > Mesos executor, which launches one JVM process to run multiple spark > executors. > However, this becomes a problem when the JVM process launched is larger than > an ideal size (30gb is the recommended value from databricks), which causes the GC > problems reported on the mailing list. > We should support launching multiple executors when large enough resources > are available for spark to use, and these resources are still under the > configured limit. > This is also applicable when users want to specify the number of executors to be > launched on each node.
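The sizing logic being proposed can be sketched as a small planner: given an offer and a per-executor size, launch as many fixed-size executors as the offer and the configured limit allow, instead of one oversized JVM. The names and numbers below are illustrative, not Spark's actual scheduler code:

```python
def plan_executors(offered_cpus, offered_mem_gb, cpus_per_executor,
                   mem_per_executor_gb, remaining_quota):
    """Toy planner: the executor count is limited by whichever of CPU,
    memory, or the configured cap runs out first."""
    by_cpu = offered_cpus // cpus_per_executor
    by_mem = offered_mem_gb // mem_per_executor_gb
    return min(by_cpu, by_mem, remaining_quota)

# A 64-core / 240 GB offer with 8-core / 30 GB executors yields
# several modestly sized JVMs rather than one huge one:
assert plan_executors(64, 240, 8, 30, remaining_quota=10) == 8
```

Capping each JVM around the recommended size keeps GC pauses manageable while still consuming the whole offer.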
[jira] [Commented] (SPARK-9763) Minimize exposure of internal SQL classes
[ https://issues.apache.org/jira/browse/SPARK-9763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145229#comment-15145229 ] Reynold Xin commented on SPARK-9763: [~flysjy] is that caused by this ticket? > Minimize exposure of internal SQL classes > - > > Key: SPARK-9763 > URL: https://issues.apache.org/jira/browse/SPARK-9763 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.5.0
[jira] [Commented] (SPARK-12544) Support window functions in SQLContext
[ https://issues.apache.org/jira/browse/SPARK-12544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145442#comment-15145442 ] Davies Liu commented on SPARK-12544: [~hvanhovell] Do window functions still require HiveContext? If not, we should update the docs/comments for window functions. > Support window functions in SQLContext > -- > > Key: SPARK-12544 > URL: https://issues.apache.org/jira/browse/SPARK-12544 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Herman van Hovell > Fix For: 2.0.0
[jira] [Commented] (SPARK-7367) spark-submit CLI --help -h overrides the application arguments
[ https://issues.apache.org/jira/browse/SPARK-7367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145490#comment-15145490 ] bimal tandel commented on SPARK-7367: - I had the same problem today and wrote a patch for it. I am creating a pull request. Based on my analysis, there is an unintended consequence of printing help only when --help is the only argument passed: spark-submit --verbose --help won't print help anymore. Instead it prints this: spark-submit --verbose --help Error: Must specify a primary resource (JAR or Python or R file) Run with --help for usage help or --verbose for debug output
[jira] [Commented] (SPARK-12544) Support window functions in SQLContext
[ https://issues.apache.org/jira/browse/SPARK-12544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145581#comment-15145581 ] Davies Liu commented on SPARK-12544: We are retiring HiveContext in 2.0; we may update the docs together.
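For reference, the kind of window function at stake can be modeled in pure Python: a ROW_NUMBER-style ranking within each partition, where every row keeps its identity and gains an ordinal. This is illustrative only, not the SQLContext implementation:

```python
from itertools import groupby

def row_number_over(rows, partition_key, order_key):
    """Pure-Python model of ROW_NUMBER() OVER (PARTITION BY partition_key
    ORDER BY order_key DESC): number rows within each partition."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[partition_key], -r[order_key]))
    for _, group in groupby(ordered, key=lambda r: r[partition_key]):
        for n, row in enumerate(group, start=1):
            out.append(dict(row, row_number=n))
    return out

sales = [{"dept": "a", "amount": 5}, {"dept": "a", "amount": 9},
         {"dept": "b", "amount": 3}]
ranked = row_number_over(sales, "dept", "amount")
assert [r["row_number"] for r in ranked] == [1, 2, 1]
```

Unlike ordinary aggregation, the window form returns one output row per input row, which is what makes it awkward to emulate without engine support.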
[jira] [Commented] (SPARK-7367) spark-submit CLI --help -h overrides the application arguments
[ https://issues.apache.org/jira/browse/SPARK-7367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145651#comment-15145651 ] Marcelo Vanzin commented on SPARK-7367: --- I think this bug has been fixed since 1.4. Help in app args: {noformat} $ spark-submit --master local --class MyClass /my.jar --help Exception in thread "main" scala.MatchError: List(--help) (of class scala.collection.immutable.$colon$colon) {noformat} Help in spark args: {noformat} $ spark-submit --master local --class MyClass --help /my.jar Usage: spark-submit [options] [app arguments] ... {noformat}
[jira] [Commented] (SPARK-13257) Refine naive Bayes example code
[ https://issues.apache.org/jira/browse/SPARK-13257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145720#comment-15145720 ] Apache Spark commented on SPARK-13257: -- User 'movelikeriver' has created a pull request for this issue: https://github.com/apache/spark/pull/11192 > Refine naive Bayes example code > --- > > Key: SPARK-13257 > URL: https://issues.apache.org/jira/browse/SPARK-13257 > Project: Spark > Issue Type: Improvement > Components: Examples >Reporter: Lenjoy Lin >Priority: Minor > Original Estimate: 12h > Remaining Estimate: 12h > > 1. Add code to check the model after loading it. > 2. It would be nice if the usage command line could be added to the comments.
[jira] [Commented] (SPARK-12154) Upgrade to Jersey 2
[ https://issues.apache.org/jira/browse/SPARK-12154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145752#comment-15145752 ] Matt Cheah commented on SPARK-12154: Sorry this had to be pushed back - but I'll work on it in the coming week. > Upgrade to Jersey 2 > --- > > Key: SPARK-12154 > URL: https://issues.apache.org/jira/browse/SPARK-12154 > Project: Spark > Issue Type: Sub-task > Components: Build, Spark Core >Affects Versions: 1.5.2 >Reporter: Matt Cheah > > Fairly self-explanatory: Jersey 1 is a bit old and could use an upgrade. > Library conflicts for Jersey are difficult to work around - see the discussion on > SPARK-11081. It's easier to upgrade Jersey entirely, but we should target > Spark 2.0 since this may be a breaking change for users who were using Jersey 1 in > their Spark jobs.
[jira] [Updated] (SPARK-10086) Flaky StreamingKMeans test in PySpark
[ https://issues.apache.org/jira/browse/SPARK-10086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-10086: - Attachment: flakyRepro.py Simple script with similar operations to this StreamingKMeans test, used to reproduce the issue. > Flaky StreamingKMeans test in PySpark > - > > Key: SPARK-10086 > URL: https://issues.apache.org/jira/browse/SPARK-10086 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark, Streaming, Tests >Affects Versions: 1.5.0 >Reporter: Joseph K. Bradley >Priority: Critical > Attachments: flakyRepro.py > > > Here's a report on investigating test failures in StreamingKMeans in PySpark. > (See Jenkins links below.) > It is a StreamingKMeans test which trains on a DStream with 2 batches and > then tests on those same 2 batches. It fails here: > [https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144] > I recreated the same test, with variants training on: (1) the original 2 > batches, (2) just the first batch, (3) just the second batch, and (4) neither > batch. Here is code which avoids Streaming altogether to identify what > batches were processed. > {code} > from pyspark.mllib.clustering import StreamingKMeans, StreamingKMeansModel > batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]] > batches = [sc.parallelize(batch) for batch in batches] > stkm = StreamingKMeans(decayFactor=0.0, k=2) > stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0]) > # Train > def update(rdd): > stkm._model.update(rdd, stkm._decayFactor, stkm._timeUnit) > # Remove one or both of these lines to test skipping batches.
> update(batches[0]) > update(batches[1]) > # Test > def predict(rdd): > return stkm._model.predict(rdd) > predict(batches[0]).collect() > predict(batches[1]).collect() > {code} > *Results*: > {code} > ### EXPECTED > [0, 1, 1] > > [1, 0, 1] > ### Skip batch 0 > [1, 0, 0] > [0, 1, 0] > ### Skip batch 1 > [0, 1, 1] > [1, 0, 1] > ### Skip both batches (This is what we see in the test > failures.) > [0, 1, 1] > [0, 0, 0] > {code} > Skipping both batches reproduces the failure. There is no randomness in the > StreamingKMeans algorithm (since initial centers are fixed, not randomized). > CC: [~tdas] [~freeman-lab] [~mengxr] > Failure message: > {code} > == > FAIL: test_trainOn_predictOn (__main__.StreamingKMeansTest) > Test that prediction happens on the updated model. > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", > line 1147, in test_trainOn_predictOn > self._eventually(condition, catch_assertions=True) > File > "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", > line 123, in _eventually > raise lastValue > File > "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", > line 114, in _eventually > lastValue = condition() > File > "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests.py", > line 1144, in condition > self.assertEqual(predict_results, [[0, 1, 1], [1, 0, 1]]) > AssertionError: Lists differ: [[0, 1, 1], [0, 0, 0]] != [[0, 1, 1], [1, 0, 1]] > First differing element 1: > [0, 0, 0] > [1, 0, 1] > - [[0, 1, 1], [0, 0, 0]] > ? > + [[0, 1, 1], [1, 0, 1]] > ? +++ ^ > -- > Ran 62 tests in 164.188s > {code}
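For context, the `_eventually` helper that appears in the traceback retries a condition until a timeout expires, optionally re-raising the last AssertionError. A standalone sketch of that behavior (simplified and assumed from the traceback, not the actual test utility):

```python
import time

def eventually(condition, timeout=30.0, interval=0.1, catch_assertions=False):
    """Retry `condition` until it returns without raising or the timeout
    expires; with catch_assertions=True, re-raise the last AssertionError
    instead of a bare timeout."""
    deadline = time.time() + timeout
    last_error = None
    while time.time() < deadline:
        try:
            return condition()
        except AssertionError as e:
            if not catch_assertions:
                raise
            last_error = e
        time.sleep(interval)
    raise last_error or TimeoutError("condition never became true")

attempts = []
def flaky():
    attempts.append(1)
    assert len(attempts) >= 3, "not ready yet"
    return "ok"

assert eventually(flaky, timeout=5.0, interval=0.01,
                  catch_assertions=True) == "ok"
assert len(attempts) >= 3  # it really did retry
```

Such a helper masks transient timing issues, which is why the failure above only shows up when the model genuinely never trained on either batch.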
[jira] [Comment Edited] (SPARK-10086) Flaky StreamingKMeans test in PySpark
[ https://issues.apache.org/jira/browse/SPARK-10086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145628#comment-15145628 ] Bryan Cutler edited comment on SPARK-10086 at 2/13/16 12:44 AM: Simple script [^flakyRepro.py] with similar operations to this StreamingKMeans test, used to reproduce the issue was (Author: bryanc): Simple script with similar operations to this StreamingKMeans test, used to reproduce the issue
[jira] [Resolved] (SPARK-13293) Generate code for Expand
[ https://issues.apache.org/jira/browse/SPARK-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13293. - Resolution: Fixed Fix Version/s: 2.0.0 > Generate code for Expand > > > Key: SPARK-13293 > URL: https://issues.apache.org/jira/browse/SPARK-13293 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0
[jira] [Assigned] (SPARK-13308) ManagedBuffers passed to OneToOneStreamManager need to be freed in non-error cases
[ https://issues.apache.org/jira/browse/SPARK-13308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13308: Assignee: Apache Spark (was: Josh Rosen) > ManagedBuffers passed to OneToOneStreamManager need to be freed in non-error > cases > -- > > Key: SPARK-13308 > URL: https://issues.apache.org/jira/browse/SPARK-13308 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Josh Rosen >Assignee: Apache Spark > > Spark's OneToOneStreamManager does not free ManagedBuffers that are passed to > it except in certain error cases. Instead, ManagedBuffers should be freed > once messages created from them are consumed and destroyed by lower layers of > the Netty networking code.
[jira] [Commented] (SPARK-13308) ManagedBuffers passed to OneToOneStreamManager need to be freed in non-error cases
[ https://issues.apache.org/jira/browse/SPARK-13308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145727#comment-15145727 ] Apache Spark commented on SPARK-13308: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/11193
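The fix being described amounts to reference-counting the buffer through the success path, not just the error path. A toy Python model of that lifecycle (illustrative only, not the actual ManagedBuffer API):

```python
class ToyBuffer:
    """Toy reference-counted buffer: the stream manager's reference and
    the outbound message's reference are released independently, and the
    memory is freed only when both are gone."""
    def __init__(self, data):
        self.data = data
        self.ref_count = 1
        self.freed = False

    def retain(self):
        self.ref_count += 1
        return self

    def release(self):
        self.ref_count -= 1
        if self.ref_count == 0:
            self.freed = True
            self.data = None  # the backing memory is returned here

buf = ToyBuffer(b"chunk")
msg = buf.retain()   # network layer takes a reference for the message
buf.release()        # stream manager drops its own reference
assert not buf.freed # the in-flight message still holds the buffer
msg.release()        # message consumed by the lower networking layer
assert buf.freed     # now the memory is actually freed
```

Freeing on consumption rather than only on error closes the leak in the common, successful transfer path.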
[jira] [Assigned] (SPARK-13308) ManagedBuffers passed to OneToOneStreamManager need to be freed in non-error cases
[ https://issues.apache.org/jira/browse/SPARK-13308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13308: Assignee: Josh Rosen (was: Apache Spark) > ManagedBuffers passed to OneToOneStreamManager need to be freed in non-error > cases > -- > > Key: SPARK-13308 > URL: https://issues.apache.org/jira/browse/SPARK-13308 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > > Spark's OneToOneStreamManager does not free ManagedBuffers that are passed to > it except in certain error cases. Instead, ManagedBuffers should be freed > once messages created from them are consumed and destroyed by lower layers of > the Netty networking code.
[jira] [Commented] (SPARK-12154) Upgrade to Jersey 2
[ https://issues.apache.org/jira/browse/SPARK-12154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145750#comment-15145750 ] Milad Khajavi commented on SPARK-12154: --- Hmm, good point about changing the pom version and checking the tests. How much time do we have? I think I can work on it in the following week. -- Milād Khājavi http://blog.khajavi.ir Having the source means you can do it yourself. I tried to change the world, but I couldn’t find the source code. > Upgrade to Jersey 2 > --- > > Key: SPARK-12154 > URL: https://issues.apache.org/jira/browse/SPARK-12154 > Project: Spark > Issue Type: Sub-task > Components: Build, Spark Core >Affects Versions: 1.5.2 >Reporter: Matt Cheah > > Fairly self-explanatory, Jersey 1 is a bit old and could use an upgrade. > Library conflicts for Jersey are difficult to work around - see discussion on > SPARK-11081. It's easier to upgrade Jersey entirely, but we should target > Spark 2.0 since this may be a break for users who were using Jersey 1 in > their Spark jobs.
[jira] [Commented] (SPARK-7367) spark-submit CLI --help -h overrides the application arguments
[ https://issues.apache.org/jira/browse/SPARK-7367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145766#comment-15145766 ] bimal tandel commented on SPARK-7367: - I can't reproduce this on the latest release. This JIRA can be closed. > spark-submit CLI --help -h overrides the application arguments > -- > > Key: SPARK-7367 > URL: https://issues.apache.org/jira/browse/SPARK-7367 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Reporter: Gianmario Spacagna >Priority: Minor > > The spark-submit script will parse the --help argument even if it is provided as > an application argument. > E.g. > spark-submit --master local[*] --driver-memory 4G --class bar.foo.MyClass > /MyLocalJAR.jar --help > or > spark-submit --master local[*] --driver-memory 4G --class bar.foo.MyClass > /MyLocalJAR.jar -h > If my application uses a parsing library, such as Scallop, then it will > never be able to run with --help as an argument. > I think the spark-submit script should only print the help message when it is > provided as a single argument, like this: > spark-submit --help > or it should provide a separator for trailing arguments: > spark-submit --master local[*] --driver-memory 4G --class bar.foo.MyClass > /MyLocalJAR.jar -- --help --arg1 --arg2
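The "stop parsing at the application JAR" behavior the reporter asks for can be sketched as follows. This is a hypothetical illustration, not Spark's actual SparkSubmitArguments parser, and the option list is an illustrative subset: launcher options are consumed only up to the first positional argument, and everything after it (including --help or -h) is passed through to the application untouched.

```python
# Illustrative subset of launcher options that take a value.
OPTS_WITH_VALUE = {"--master", "--driver-memory", "--class"}

def split_launcher_args(argv):
    """Split argv into (launcher options, app jar, app arguments).

    Options are consumed only up to the first positional argument
    (the application JAR); everything after it, including flags like
    --help, belongs to the application.
    """
    launcher, i = [], 0
    while i < len(argv):
        arg = argv[i]
        if arg in OPTS_WITH_VALUE:
            launcher.extend(argv[i:i + 2])  # option plus its value
            i += 2
        elif arg.startswith("-"):
            launcher.append(arg)
            i += 1
        else:
            # First positional argument: the application JAR.
            # Stop parsing; the rest is for the application.
            return launcher, argv[i], argv[i + 1:]
    return launcher, None, []

launcher, jar, app = split_launcher_args(
    ["--master", "local[*]", "--driver-memory", "4G",
     "--class", "bar.foo.MyClass", "/MyLocalJAR.jar", "--help", "--arg1"])
```

With this rule, the example from the report yields the launcher options, the JAR path, and ["--help", "--arg1"] as application arguments, so a library like Scallop would see --help normally.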
[jira] [Commented] (SPARK-13297) [SQL] Backticks cannot be escaped in column names
[ https://issues.apache.org/jira/browse/SPARK-13297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145566#comment-15145566 ] Xiu (Joe) Guo commented on SPARK-13297: --- Looks like this problem is fixed in the current [master branch|https://github.com/apache/spark/tree/42d656814f756599a2bc426f0e1f32bd4cc4470f]:
{code}
scala> val columnName = "col`s"
columnName: String = col`s

scala> val rows = List(Row("foo"), Row("bar"))
rows: List[org.apache.spark.sql.Row] = List([foo], [bar])

scala> val schema = StructType(Seq(StructField(columnName, StringType)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(col`s,StringType,true))

scala> val rdd = sc.parallelize(rows)
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[0] at parallelize at :28

scala> val df = sqlContext.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [col`s: string]

scala> val selectingColumnName = "`" + columnName.replace("`", "``") + "`"
selectingColumnName: String = `col``s`

scala> selectingColumnName
res0: String = `col``s`

scala> val selectedDf = df.selectExpr(selectingColumnName)
selectedDf: org.apache.spark.sql.DataFrame = [col`s: string]

scala> selectedDf.show
+-----+
|col`s|
+-----+
|  foo|
|  bar|
+-----+
{code}
> [SQL] Backticks cannot be escaped in column names > - > > Key: SPARK-13297 > URL: https://issues.apache.org/jira/browse/SPARK-13297 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Grzegorz Chilkiewicz >Priority: Minor > > We want to use backticks to escape spaces & minus signs in column names. > Are we unable to escape backticks when a column name is surrounded by > backticks? > It is not documented in: > http://spark.apache.org/docs/latest/sql-programming-guide.html > In MySQL there is a way: double the backticks, but this trick doesn't work in > Spark-SQL. > Am I correct or just missing something? Is there a way to escape backticks > inside a column name when it is surrounded by backticks? > Code to reproduce the problem: > https://github.com/grzegorz-chilkiewicz/SparkSqlEscapeBacktick
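The escaping rule the Scala session above demonstrates (double any embedded backtick, then wrap the whole name in backticks) can be captured in a small helper. This is a sketch of that rule only, not an official Spark API; `quote_column` is a hypothetical name.

```python
def quote_column(name):
    """Quote a column name for use in Spark SQL expressions:
    double any embedded backticks, then wrap the result in
    backticks, mirroring columnName.replace("`", "``") above."""
    return "`" + name.replace("`", "``") + "`"

quoted = quote_column("col`s")  # the column name from the report
```

The quoted form can then be passed to something like `df.selectExpr(quoted)`, as in the Scala session.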
[jira] [Commented] (SPARK-10086) Flaky StreamingKMeans test in PySpark
[ https://issues.apache.org/jira/browse/SPARK-10086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145625#comment-15145625 ] Bryan Cutler commented on SPARK-10086: -- I was able to track down the cause of these failures, so here is an update on what I found. The test {{StreamingKMeansTest.test_trainOn_predictOn}} has two {{DStream.foreachRDD}} output operations: one in the call to {{StreamingKMeans.trainOn}}, and one with {{collect}} whose parent {{DStream}} is a {{PythonTransformedDStream}} returned from {{StreamingKMeans.predictOn}}, so two jobs are generated for each batch. When the {{DStream}} jobs are generated, there is nothing to compute for the first job, which updates the model. For the second job, {{PythonTransformedDStream.compute}} is called, which performs a {{PythonTransformFunction}} callback that creates a {{PythonRDD}} and serializes the mapped predict function to a command containing the current model. Next, the two jobs are scheduled in order: first the model update, then collecting the predicted result. At this point there is a race condition between completing the model update and generating the next set of jobs, which runs in a different thread. If the update is delayed long enough, the next set of jobs will be generated and the old model will be serialized into the {{PythonRDD}} command again. Finally, predict will run against this old model, causing the test failure. To sum it up, the underlying issue is that a func can be serialized with a value before the job that updates this value has run. This doesn't appear to be an issue in the Scala code, since the closure cleaner runs just before the job is executed and picks up the updated values. So far, the best solution I can think of would be to somehow delay the serialization of the model until it is needed, but I believe this would involve some big changes in {{PythonRDD}}, as would any other solution I could think of. Is this something that would be worth doing, or might there be an easier fix that I am not seeing? It's not just a {{StreamingKMeans}} issue; it would affect any PySpark streaming application with a similar structure. I am attaching some simplified code used to reproduce the issue. I also have a similar Scala version that produces the expected results.
> Flaky StreamingKMeans test in PySpark > - > > Key: SPARK-10086 > URL: https://issues.apache.org/jira/browse/SPARK-10086 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark, Streaming, Tests >Affects Versions: 1.5.0 >Reporter: Joseph K. Bradley >Priority: Critical > > Here's a report on investigating test failures in StreamingKMeans in PySpark. > (See Jenkins links below.) > It is a StreamingKMeans test which trains on a DStream with 2 batches and > then tests on those same 2 batches. It fails here: > [https://github.com/apache/spark/blob/1968276af0f681fe51328b7dd795bd21724a5441/python/pyspark/mllib/tests.py#L1144] > I recreated the same test, with variants training on: (1) the original 2 > batches, (2) just the first batch, (3) just the second batch, and (4) neither > batch. Here is code which avoids Streaming altogether to identify which > batches were processed.
> {code}
> from pyspark.mllib.clustering import StreamingKMeans, StreamingKMeansModel
> batches = [[[-0.5], [0.6], [0.8]], [[0.2], [-0.1], [0.3]]]
> batches = [sc.parallelize(batch) for batch in batches]
> stkm = StreamingKMeans(decayFactor=0.0, k=2)
> stkm.setInitialCenters([[0.0], [1.0]], [1.0, 1.0])
> # Train
> def update(rdd):
>     stkm._model.update(rdd, stkm._decayFactor, stkm._timeUnit)
> # Remove one or both of these lines to test skipping batches.
> update(batches[0])
> update(batches[1])
> # Test
> def predict(rdd):
>     return stkm._model.predict(rdd)
> predict(batches[0]).collect()
> predict(batches[1]).collect()
> {code}
> *Results*:
> {code}
> ### EXPECTED
> [0, 1, 1]
> [1, 0, 1]
> ### Skip batch 0
> [1, 0, 0]
> [0, 1, 0]
> ### Skip batch 1
> [0, 1, 1]
> [1, 0, 1]
> ### Skip both batches (This is what we see in the test failures.)
> [0, 1, 1]
> [0, 0, 0]
> {code}
> Skipping both batches reproduces the failure. There is no randomness in the > StreamingKMeans algorithm (since initial centers are fixed, not randomized). > CC: [~tdas] [~freeman-lab] [~mengxr] > Failure message: > {code} > == > FAIL: test_trainOn_predictOn (__main__.StreamingKMeansTest) > Test that prediction happens on the updated model. >
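The root cause Bryan describes, a function serialized with a value before the job that updates that value has run, can be illustrated outside Spark with plain pickling. This is a minimal standalone sketch (the `Model` class is hypothetical, standing in for the StreamingKMeans model baked into the PythonRDD command):

```python
import pickle

class Model:
    """Stand-in for the streaming model captured by the predict command."""
    def __init__(self, centers):
        self.centers = centers

model = Model([0.0, 1.0])

# Job generation: the predict command is serialized NOW, capturing the
# model's current state, the way PythonTransformedDStream.compute bakes
# the model into the PythonRDD command.
command = pickle.dumps(model)

# The update job runs afterwards and changes the live model...
model.centers = [10.0, 20.0]

# ...but the already-serialized command still holds the old state,
# so "predict" would run against the stale model.
stale = pickle.loads(command)
```

Here `stale.centers` is still `[0.0, 1.0]` even though the live model was updated, which is exactly the race the flaky test hits when job generation wins the race against the model update.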
[jira] [Created] (SPARK-13308) ManagedBuffers passed to OneToOneStreamManager need to be freed in non-error cases
Josh Rosen created SPARK-13308: -- Summary: ManagedBuffers passed to OneToOneStreamManager need to be freed in non-error cases Key: SPARK-13308 URL: https://issues.apache.org/jira/browse/SPARK-13308 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen Spark's OneToOneStreamManager does not free ManagedBuffers that are passed to it except in certain error cases. Instead, ManagedBuffers should be freed once messages created from them are consumed and destroyed by lower layers of the Netty networking code.
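The free-on-consume pattern the issue calls for, releasing each buffer once the message built from it has been written out rather than only in error paths, can be sketched as follows. This is a toy illustration in Python, not Spark's actual Java OneToOneStreamManager; all class and method names here are hypothetical.

```python
class ManagedBuffer:
    """Toy stand-in for Spark's ManagedBuffer (illustrative only)."""
    def __init__(self, data):
        self.data = data
        self.released = False

    def release(self):
        self.released = True

class OneToOneStream:
    """Sketch of free-on-consume: a buffer is released as soon as the
    lower layer reports that the message built from it was sent, not
    only on error or stream teardown."""
    def __init__(self, buffers):
        self.buffers = list(buffers)
        self.cursor = 0

    def next_chunk(self):
        buf = self.buffers[self.cursor]
        self.cursor += 1
        return buf

    def on_chunk_sent(self, buf):
        buf.release()  # freed in the success path, not just on error

stream = OneToOneStream([ManagedBuffer(b"a"), ManagedBuffer(b"b")])
while stream.cursor < len(stream.buffers):
    chunk = stream.next_chunk()
    # ... the network write of this chunk would happen here ...
    stream.on_chunk_sent(chunk)
```

After the loop, every buffer has been released even though no error occurred, which is the behavior the issue asks for.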
[jira] [Updated] (SPARK-12154) Upgrade to Jersey 2
[ https://issues.apache.org/jira/browse/SPARK-12154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-12154: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-11806 > Upgrade to Jersey 2 > --- > > Key: SPARK-12154 > URL: https://issues.apache.org/jira/browse/SPARK-12154 > Project: Spark > Issue Type: Sub-task > Components: Build, Spark Core >Affects Versions: 1.5.2 >Reporter: Matt Cheah > > Fairly self-explanatory, Jersey 1 is a bit old and could use an upgrade. > Library conflicts for Jersey are difficult to work around - see discussion on > SPARK-11081. It's easier to upgrade Jersey entirely, but we should target > Spark 2.0 since this may be a break for users who were using Jersey 1 in > their Spark jobs.
[jira] [Commented] (SPARK-12154) Upgrade to Jersey 2
[ https://issues.apache.org/jira/browse/SPARK-12154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145746#comment-15145746 ] Andrew Ash commented on SPARK-12154: [~khajavi] would you please give it a go? [~mcheah] must have been busy over the past couple of weeks. I think the way to get started would be to change jersey.version from 1.9 to the new version in the main {{/pom.xml}} and work through the resulting compile/test failures. These sections of the official Jersey documentation might be useful as you get started: https://jersey.java.net/documentation/latest/migration.html#mig-1.x https://jersey.java.net/nonav/documentation/2.0/migration.html > Upgrade to Jersey 2 > --- > > Key: SPARK-12154 > URL: https://issues.apache.org/jira/browse/SPARK-12154 > Project: Spark > Issue Type: Sub-task > Components: Build, Spark Core >Affects Versions: 1.5.2 >Reporter: Matt Cheah > > Fairly self-explanatory, Jersey 1 is a bit old and could use an upgrade. > Library conflicts for Jersey are difficult to work around - see discussion on > SPARK-11081. It's easier to upgrade Jersey entirely, but we should target > Spark 2.0 since this may be a break for users who were using Jersey 1 in > their Spark jobs.
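The first step Andrew describes, bumping {{jersey.version}} in the root {{/pom.xml}}, would look roughly like this. The exact property name and the 2.x version number are illustrative assumptions; check the current pom and the latest Jersey 2.x release before applying.

```xml
<!-- root /pom.xml, inside the <properties> section (illustrative) -->
<properties>
  <jersey.version>2.22.2</jersey.version>
</properties>
```

From there the Maven build (`mvn -DskipTests package`) would surface the compile failures from the Jersey 1.x to 2.x API changes, which the linked migration guides cover.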