[jira] [Updated] (SPARK-4584) 2x Performance regression for Spark-on-YARN
[ https://issues.apache.org/jira/browse/SPARK-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4584: -- Priority: Blocker (was: Major) > 2x Performance regression for Spark-on-YARN > --- > > Key: SPARK-4584 > URL: https://issues.apache.org/jira/browse/SPARK-4584 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0 >Reporter: Nishkam Ravi >Priority: Blocker > > Significant performance regression observed for Spark-on-YARN (up to 2x) after > the 1.2 rebase. The offending commit is: 70e824f750aa8ed446eec104ba158b0503ba58a9 > from Oct 7th. The problem can be reproduced with JavaWordCount against a large > enough input dataset in YARN cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4596) Refactorize Normalizer to make code cleaner
[ https://issues.apache.org/jira/browse/SPARK-4596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224160#comment-14224160 ] Apache Spark commented on SPARK-4596: - User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/3446 > Refactorize Normalizer to make code cleaner > --- > > Key: SPARK-4596 > URL: https://issues.apache.org/jira/browse/SPARK-4596 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: DB Tsai > > In this refactoring, the performance is slightly increased by removing the > overhead from breeze vector. The bottleneck is still in breeze norm, which is > implemented via activeIterator. This inefficiency of breeze norm will be > addressed in the next PR. At least, this PR makes the code more consistent in the > codebase.
[jira] [Updated] (SPARK-4595) Spark MetricsServlet is not worked because of initialization ordering
[ https://issues.apache.org/jira/browse/SPARK-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4595: -- Priority: Blocker (was: Major) > Spark MetricsServlet is not worked because of initialization ordering > - > > Key: SPARK-4595 > URL: https://issues.apache.org/jira/browse/SPARK-4595 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Saisai Shao >Priority: Blocker > > The web UI is initialized before the MetricsSystem is started; at that time the > MetricsServlet is not yet created, which makes it fail to > register into the web UI. > Instead, the MetricsServlet handler should be added to the web UI after > the MetricsSystem is started.
[jira] [Commented] (SPARK-4595) Spark MetricsServlet is not worked because of initialization ordering
[ https://issues.apache.org/jira/browse/SPARK-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224159#comment-14224159 ] Josh Rosen commented on SPARK-4595: --- Marking this as blocker until we triage tomorrow since it seems like this might be a 1.2.0 regression. [~jerryshao], do you know if this bug is new in 1.2.0? > Spark MetricsServlet is not worked because of initialization ordering > - > > Key: SPARK-4595 > URL: https://issues.apache.org/jira/browse/SPARK-4595 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Saisai Shao >Priority: Blocker > > The web UI is initialized before the MetricsSystem is started; at that time the > MetricsServlet is not yet created, which makes it fail to > register into the web UI. > Instead, the MetricsServlet handler should be added to the web UI after > the MetricsSystem is started.
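The fix direction described above — attach the servlet only after the metrics system has started — can be illustrated with a minimal, language-neutral sketch. The class and handler names below are hypothetical stand-ins, not Spark's actual API:

```python
class MetricsSystem:
    """Minimal stand-in for a metrics system: the servlet handler
    only exists after start() has run."""
    def __init__(self):
        self.servlet_handler = None

    def start(self):
        self.servlet_handler = "/metrics/json"


class WebUI:
    """Minimal stand-in for the web UI: silently skips missing handlers."""
    def __init__(self):
        self.handlers = []

    def attach(self, handler):
        if handler is not None:
            self.handlers.append(handler)


# Buggy ordering: the UI attaches the handler before the metrics system
# starts, so there is nothing to attach and /metrics/json never registers.
ms, ui = MetricsSystem(), WebUI()
ui.attach(ms.servlet_handler)   # handler is still None
ms.start()
assert ms.servlet_handler is not None and not ui.handlers

# Fixed ordering: start the metrics system first, then attach its handler.
ms, ui = MetricsSystem(), WebUI()
ms.start()
ui.attach(ms.servlet_handler)
```

The sketch shows why the bug is silent: the attach step does not fail, it just has nothing to register, so the servlet simply never appears in the UI.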
[jira] [Created] (SPARK-4596) Refactorize Normalizer to make code cleaner
DB Tsai created SPARK-4596: -- Summary: Refactorize Normalizer to make code cleaner Key: SPARK-4596 URL: https://issues.apache.org/jira/browse/SPARK-4596 Project: Spark Issue Type: Improvement Components: MLlib Reporter: DB Tsai In this refactoring, the performance is slightly increased by removing the overhead from breeze vector. The bottleneck is still in breeze norm, which is implemented via activeIterator. This inefficiency of breeze norm will be addressed in the next PR. At least, this PR makes the code more consistent in the codebase.
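The description attributes the remaining cost to breeze's norm traversing values through activeIterator, which implies per-element (index, value) handling rather than a tight loop over the raw array. A hedged Python sketch of the direct-loop alternative the follow-up presumably moves toward (illustrative only, not the actual Scala patch):

```python
import math

def l2_norm_direct(values):
    # Tight loop over the raw value array: no per-element
    # (index, value) pair allocation of the kind an
    # activeIterator-style traversal implies.
    total = 0.0
    for v in values:
        total += v * v
    return math.sqrt(total)
```

For example, `l2_norm_direct([3.0, 4.0])` returns `5.0`.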
[jira] [Commented] (SPARK-4595) Spark MetricsServlet is not worked because of initialization ordering
[ https://issues.apache.org/jira/browse/SPARK-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224146#comment-14224146 ] Apache Spark commented on SPARK-4595: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/3444 > Spark MetricsServlet is not worked because of initialization ordering > - > > Key: SPARK-4595 > URL: https://issues.apache.org/jira/browse/SPARK-4595 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Saisai Shao > > The web UI is initialized before the MetricsSystem is started; at that time the > MetricsServlet is not yet created, which makes it fail to > register into the web UI. > Instead, the MetricsServlet handler should be added to the web UI after > the MetricsSystem is started.
[jira] [Updated] (SPARK-4595) Spark MetricsServlet is not worked because of initialization ordering
[ https://issues.apache.org/jira/browse/SPARK-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-4595: --- Summary: Spark MetricsServlet is not worked because of initialization ordering (was: Spark MetricsServlet is not enabled because of initialization ordering) > Spark MetricsServlet is not worked because of initialization ordering > - > > Key: SPARK-4595 > URL: https://issues.apache.org/jira/browse/SPARK-4595 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Saisai Shao > > The web UI is initialized before the MetricsSystem is started; at that time the > MetricsServlet is not yet created, which makes it fail to > register into the web UI. > Instead, the MetricsServlet handler should be added to the web UI after > the MetricsSystem is started.
[jira] [Created] (SPARK-4595) Spark MetricsServlet is not enabled because of initialization ordering
Saisai Shao created SPARK-4595: -- Summary: Spark MetricsServlet is not enabled because of initialization ordering Key: SPARK-4595 URL: https://issues.apache.org/jira/browse/SPARK-4595 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Saisai Shao The web UI is initialized before the MetricsSystem is started; at that time the MetricsServlet is not yet created, which makes it fail to register into the web UI. Instead, the MetricsServlet handler should be added to the web UI after the MetricsSystem is started.
[jira] [Created] (SPARK-4594) Improvement the broadcast for HiveConf
Leo created SPARK-4594: -- Summary: Improvement the broadcast for HiveConf Key: SPARK-4594 URL: https://issues.apache.org/jira/browse/SPARK-4594 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Leo Priority: Minor Every time we need to get a table from Hive, HadoopTableReader broadcasts the HiveConf to the cluster. Actually, in one application the HiveConf is a single instance, so I think we can keep it in HiveContext and reuse it for every query. Although it is just 50 KB, this is useful for JDBC users and streaming+SQL apps.
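The proposal — broadcast the application-wide conf once and reuse the handle across queries — is essentially memoization of a broadcast variable. A hedged Python sketch with a stubbed broadcast function; `ConfBroadcastCache` and `fake_broadcast` are hypothetical names standing in for the real SparkContext.broadcast machinery:

```python
class ConfBroadcastCache:
    """Hypothetical sketch: broadcast the effectively immutable conf
    once and hand back the same handle for every subsequent table read,
    instead of re-broadcasting per query."""
    def __init__(self, broadcast_fn, conf):
        self._broadcast_fn = broadcast_fn
        self._conf = conf
        self._handle = None

    def get(self):
        if self._handle is None:          # first query pays the cost...
            self._handle = self._broadcast_fn(self._conf)
        return self._handle               # ...later queries reuse the handle


calls = []
def fake_broadcast(value):
    """Stub for SparkContext.broadcast that records each invocation."""
    calls.append(value)
    return ("broadcast", value)

cache = ConfBroadcastCache(fake_broadcast, {"hive.metastore.uris": "..."})
first, second = cache.get(), cache.get()
```

After two `get()` calls, `fake_broadcast` has run exactly once and both callers hold the same handle, which is the saving the issue describes.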
[jira] [Commented] (SPARK-4593) sum(1/0) would produce a very large number
[ https://issues.apache.org/jira/browse/SPARK-4593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224129#comment-14224129 ] Apache Spark commented on SPARK-4593: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/3443 > sum(1/0) would produce a very large number > -- > > Key: SPARK-4593 > URL: https://issues.apache.org/jira/browse/SPARK-4593 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Adrian Wang >Priority: Minor > > SELECT max(1/0) FROM src would get a very large number.
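For context, the "very large number" matches IEEE-754 double semantics, where x/0.0 evaluates to infinity rather than raising an error, and infinity then propagates through aggregates; presumably the linked PR makes such rows evaluate to NULL instead (Hive's behavior). A small Python illustration (Python's own integer `1/0` raises, so the infinity is constructed explicitly):

```python
# IEEE-754 doubles raise no error on x/0.0; the result is +/-infinity.
inf = float("inf")

# Once a single row evaluates to infinity, min/max/sum style aggregates
# all propagate it instead of failing.
column = [1.0, 2.0, inf]
assert max(column) == inf
assert sum(column) == inf
```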
[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224132#comment-14224132 ] Aaron Staple commented on SPARK-1503: - [~mengxr] [~rezazadeh] Ok, thanks for the heads up. Let me know if there’s anything about the spec that should be handled differently. I covered most of the mathematics informally (the details are already covered formally in the references). And in addition, the proposal describes a method of implementing TFOCS functionality distributively but does not investigate existing distributed optimization systems. > Implement Nesterov's accelerated first-order method > --- > > Key: SPARK-1503 > URL: https://issues.apache.org/jira/browse/SPARK-1503 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Aaron Staple > > Nesterov's accelerated first-order method is a drop-in replacement for > steepest descent but it converges much faster. We should implement this > method and compare its performance with existing algorithms, including SGD > and L-BFGS. > TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's > method and its variants on composite objectives.
[jira] [Created] (SPARK-4593) sum(1/0) would produce a very large number
Adrian Wang created SPARK-4593: -- Summary: sum(1/0) would produce a very large number Key: SPARK-4593 URL: https://issues.apache.org/jira/browse/SPARK-4593 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Priority: Minor SELECT max(1/0) FROM src would get a very large number.
[jira] [Commented] (SPARK-911) Support map pruning on sorted (K, V) RDD's
[ https://issues.apache.org/jira/browse/SPARK-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224123#comment-14224123 ] Apache Spark commented on SPARK-911: User 'aaronjosephs' has created a pull request for this issue: https://github.com/apache/spark/pull/1381 > Support map pruning on sorted (K, V) RDD's > -- > > Key: SPARK-911 > URL: https://issues.apache.org/jira/browse/SPARK-911 > Project: Spark > Issue Type: Bug >Reporter: Patrick Wendell > > If someone has sorted a (K, V) rdd, we should offer them a way to filter a > range of the partitions that employs map pruning. This would be simple using > a small range index within the rdd itself. A good example is I sort my > dataset by time and then I want to serve queries that are restricted to a > certain time range.
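The "small range index" idea can be sketched as follows: for a sorted RDD, keep each partition's (min, max) key bounds and scan only the partitions whose bounds overlap the query range. The helper below is a hypothetical illustration, not Spark's actual API:

```python
def overlapping_partitions(bounds, lo, hi):
    """bounds: per-partition (min_key, max_key) pairs of a sorted RDD.
    Returns indices of partitions that can contain keys in [lo, hi];
    all other partitions can be pruned from the scan entirely."""
    return [i for i, (mn, mx) in enumerate(bounds) if mx >= lo and mn <= hi]


# Dataset sorted by timestamp, split into three partitions:
bounds = [(0, 9), (10, 19), (20, 29)]
assert overlapping_partitions(bounds, 5, 12) == [0, 1]   # middle of the range
assert overlapping_partitions(bounds, 25, 99) == [2]     # only the tail
```

Spark's `PartitionPruningRDD` already provides the partition-level filtering primitive that such a feature could build on.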
[jira] [Commented] (SPARK-4590) Early investigation of parameter server
[ https://issues.apache.org/jira/browse/SPARK-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224117#comment-14224117 ] Reza Zadeh commented on SPARK-4590: --- Some starting points: - http://stanford.edu/~rezab/papers/factorbird.pdf - http://parameterserver.org/ More detailed comparisons coming. > Early investigation of parameter server > --- > > Key: SPARK-4590 > URL: https://issues.apache.org/jira/browse/SPARK-4590 > Project: Spark > Issue Type: Brainstorming > Components: ML, MLlib >Reporter: Xiangrui Meng >Assignee: Reza Zadeh > > In the current implementation of GLM solvers, we save intermediate models > on the driver node and update them through broadcast and aggregation. Even with > torrent broadcast and tree aggregation added in 1.1, it is hard to go beyond > ~10 million features. This JIRA is for investigating the parameter server > approach, including algorithm, infrastructure, and dependencies.
[jira] [Updated] (SPARK-4592) "Worker registration failed: Duplicate worker ID" error during Master failover
[ https://issues.apache.org/jira/browse/SPARK-4592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4592: -- Attachment: log.txt > "Worker registration failed: Duplicate worker ID" error during Master failover > -- > > Key: SPARK-4592 > URL: https://issues.apache.org/jira/browse/SPARK-4592 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 1.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > Attachments: log.txt > > > When running Spark Standalone in high-availability mode, we sometimes see > "Worker registration failed: Duplicate worker ID" errors which prevent > workers from reconnecting to the new active master. I've attached full logs > from a reproduction in my integration tests suite (which runs something > similar to Spark's FaultToleranceTest). Here's the relevant excerpt from a > worker log during a failed run of the "rolling outage" test, which creates a > multi-master cluster then repeatedly kills the active master, waits for > workers to reconnect to a new active master, then kills that master, and so > on. > {code} > 14/11/23 02:23:02 INFO WorkerWebUI: Started WorkerWebUI at > http://172.17.0.90:8081 > 14/11/23 02:23:02 INFO Worker: Connecting to master > spark://172.17.0.86:7077... > 14/11/23 02:23:02 INFO Worker: Connecting to master > spark://172.17.0.87:7077... > 14/11/23 02:23:02 INFO Worker: Connecting to master > spark://172.17.0.88:7077... 
> 14/11/23 02:23:02 INFO Worker: Successfully registered with master > spark://172.17.0.86:7077 > 14/11/23 02:23:03 INFO Worker: Asked to launch executor > app-20141123022303-/1 for spark-integration-tests > 14/11/23 02:23:03 INFO ExecutorRunner: Launch command: "java" "-cp" > "::/opt/sparkconf:/opt/spark/assembly/target/scala-2.10/spark-assembly-1.2.1-SNAPSHOT-hadoop1.0.4.jar" > "-XX:MaxPermSize=128m" "-Dspark.driver.port=51271" "-Xms512M" "-Xmx512M" > "org.apache.spark.executor.CoarseGrainedExecutorBackend" > "akka.tcp://sparkdri...@joshs-mbp.att.net:51271/user/CoarseGrainedScheduler" > "1" "172.17.0.90" "8" "app-20141123022303-" > "akka.tcp://sparkWorker@172.17.0.90:/user/Worker" > 14/11/23 02:23:14 INFO Worker: Disassociated > [akka.tcp://sparkWorker@172.17.0.90:] -> > [akka.tcp://sparkMaster@172.17.0.86:7077] Disassociated ! > 14/11/23 02:23:14 ERROR Worker: Connection to master failed! Waiting for > master to reconnect... > 14/11/23 02:23:14 INFO Worker: Connecting to master > spark://172.17.0.86:7077... > 14/11/23 02:23:14 INFO Worker: Connecting to master > spark://172.17.0.87:7077... > 14/11/23 02:23:14 INFO Worker: Connecting to master > spark://172.17.0.88:7077... > 14/11/23 02:23:14 WARN ReliableDeliverySupervisor: Association with remote > system [akka.tcp://sparkMaster@172.17.0.86:7077] has failed, address is now > gated for [5000] ms. Reason is: [Disassociated]. > 14/11/23 02:23:14 INFO Worker: Disassociated > [akka.tcp://sparkWorker@172.17.0.90:] -> > [akka.tcp://sparkMaster@172.17.0.86:7077] Disassociated ! > 14/11/23 02:23:14 ERROR Worker: Connection to master failed! Waiting for > master to reconnect... > 14/11/23 02:23:14 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef: > Message [org.apache.spark.deploy.DeployMessages$RegisterWorker] from > Actor[akka://sparkWorker/user/Worker#-1246122173] to > Actor[akka://sparkWorker/deadLetters] was not delivered. [1] dead letters > encountered. 
This logging can be turned off or adjusted with configuration > settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. > 14/11/23 02:23:14 INFO Worker: Not spawning another attempt to register with > the master, since there is an attempt scheduled already. > 14/11/23 02:23:14 INFO LocalActorRef: Message > [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from > Actor[akka://sparkWorker/deadLetters] to > Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40172.17.0.86%3A7077-2#343365613] > was not delivered. [2] dead letters encountered. This logging can be turned > off or adjusted with configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'. > 14/11/23 02:23:25 INFO Worker: Retrying connection to master (attempt # 1) > 14/11/23 02:23:25 INFO Worker: Connecting to master > spark://172.17.0.86:7077... > 14/11/23 02:23:25 INFO Worker: Connecting to master > spark://172.17.0.87:7077... > 14/11/23 02:23:25 INFO Worker: Connecting to master > spark://172.17.0.88:7077... > 14/11/23 02:23:36 INFO Worker: Retrying connection to master (attempt # 2) > 14/11/23 02:23:36 INFO Worker: Connecting to master > spark://172.17.0.86:7077... > 14/11
[jira] [Commented] (SPARK-4592) "Worker registration failed: Duplicate worker ID" error during Master failover
[ https://issues.apache.org/jira/browse/SPARK-4592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224108#comment-14224108 ] Josh Rosen commented on SPARK-4592: --- It looks like this is a bug that was introduced in [the patch|https://github.com/apache/spark/pull/2828] for SPARK-3736. Prior to that patch, a worker that became disassociated from a master would wait for a master to initiate a reconnection. This behavior caused problems when a master failed by stopping and restarting, since the restarted master would never know to initiate a reconnection. That patch addressed this issue by having workers attempt to reconnect to the master if they think they've become disconnected. When reviewing that patch, I wrote a [long comment|https://github.com/apache/spark/pull/2828#issuecomment-59602394] that explains it in much more detail; that's a good reference for understanding its motivation. This change introduced a problem, though: there's now a race-condition during multi-master failover. In the old multi-master code, a worker never initiates a reconnection attempt to a master; instead, it reconnects after the new active / primary master tells the worker that it's the new master. With the addition of worker-initiated reconnect, there's a race-condition where the worker detects that it's become disconnected, goes down the list of known masters and tries to connect to each of them, successfully connects to the new primary master, then receives a {{ChangedMaster}} event and attempts to connect to the new primary master _even though it's already connected_, causing a duplicate worker registration. There are a number of ways that we might fix this, but we have to be careful because it seems likely that the worker-initiated reconnect could have introduced other problems: - What happens if a worker sends a reconnection attempt to a live master which is not the new primary? 
Will that non-primary master reject or redirect those registrations, or will it register the workers and cause a split-brain scenario to occur? - The Worker is implemented as an actor and thus does not have synchronization of its internal state since it assumes message-at-a-time processing, but the asynchronous re-registration timer thread may violate this assumption because it directly calls internal worker methods instead of sending messages to the worker's own mailbox. One simple fix might be to have the worker never initiate reconnection attempts when running in a multi-master environment. I still need to think through whether this will cause new problems similar to SPARK-3736. I don't think it will be a problem because that patch was motivated by cases where the master forgot who the worker was and couldn't initiate a reconnect. If the list of registered workers is stored durably in ZooKeeper such that a worker is never told that it has registered until a record of its registration has become durable, then I think this is fine: if a live master thinks that a worker has disconnected, then it will initiate a reconnection; when a new master fails over, it will reconnect all workers based on the list from ZooKeeper. > "Worker registration failed: Duplicate worker ID" error during Master failover > -- > > Key: SPARK-4592 > URL: https://issues.apache.org/jira/browse/SPARK-4592 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 1.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > > When running Spark Standalone in high-availability mode, we sometimes see > "Worker registration failed: Duplicate worker ID" errors which prevent > workers from reconnecting to the new active master. I've attached full logs > from a reproduction in my integration tests suite (which runs something > similar to Spark's FaultToleranceTest). 
Here's the relevant excerpt from a > worker log during a failed run of the "rolling outage" test, which creates a > multi-master cluster then repeatedly kills the active master, waits for > workers to reconnect to a new active master, then kills that master, and so > on. > {code} > 14/11/23 02:23:02 INFO WorkerWebUI: Started WorkerWebUI at > http://172.17.0.90:8081 > 14/11/23 02:23:02 INFO Worker: Connecting to master > spark://172.17.0.86:7077... > 14/11/23 02:23:02 INFO Worker: Connecting to master > spark://172.17.0.87:7077... > 14/11/23 02:23:02 INFO Worker: Connecting to master > spark://172.17.0.88:7077... > 14/11/23 02:23:02 INFO Worker: Successfully registered with master > spark://172.17.0.86:7077 > 14/11/23 02:23:03 INFO Worker: Asked to launch executor > app-20141123022303-/1 for spark-integration-tests > 14/11/23 02:23:03 INFO ExecutorRunner: Lau
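The duplicate-registration race described in the comment above boils down to the same worker ID being registered twice with the new master: once via the worker-initiated reconnect and once via the MasterChanged-driven re-registration. A toy sketch of that outcome (hypothetical worker ID; the real handshake is asynchronous Akka messaging, not synchronous calls):

```python
class Master:
    """Toy stand-in for the standalone Master's worker registry."""
    def __init__(self):
        self.workers = set()

    def register(self, worker_id):
        if worker_id in self.workers:
            return "Worker registration failed: Duplicate worker ID"
        self.workers.add(worker_id)
        return "registered"


new_master = Master()
worker_id = "worker-20141123022302-172.17.0.90-8888"  # hypothetical ID

# 1. The worker-initiated reconnect reaches the new active master first...
first_attempt = new_master.register(worker_id)
# 2. ...then the MasterChanged-triggered registration arrives for the
#    same worker, and the master rejects it as a duplicate.
second_attempt = new_master.register(worker_id)
```

The proposed fixes amount to ensuring only one of the two registration paths fires, or making registration idempotent for an already-connected worker.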
[jira] [Created] (SPARK-4592) "Worker registration failed: Duplicate worker ID" error during Master failover
Josh Rosen created SPARK-4592: - Summary: "Worker registration failed: Duplicate worker ID" error during Master failover Key: SPARK-4592 URL: https://issues.apache.org/jira/browse/SPARK-4592 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.2.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker When running Spark Standalone in high-availability mode, we sometimes see "Worker registration failed: Duplicate worker ID" errors which prevent workers from reconnecting to the new active master. I've attached full logs from a reproduction in my integration tests suite (which runs something similar to Spark's FaultToleranceTest). Here's the relevant excerpt from a worker log during a failed run of the "rolling outage" test, which creates a multi-master cluster then repeatedly kills the active master, waits for workers to reconnect to a new active master, then kills that master, and so on. {code} 14/11/23 02:23:02 INFO WorkerWebUI: Started WorkerWebUI at http://172.17.0.90:8081 14/11/23 02:23:02 INFO Worker: Connecting to master spark://172.17.0.86:7077... 14/11/23 02:23:02 INFO Worker: Connecting to master spark://172.17.0.87:7077... 14/11/23 02:23:02 INFO Worker: Connecting to master spark://172.17.0.88:7077... 
14/11/23 02:23:02 INFO Worker: Successfully registered with master spark://172.17.0.86:7077 14/11/23 02:23:03 INFO Worker: Asked to launch executor app-20141123022303-/1 for spark-integration-tests 14/11/23 02:23:03 INFO ExecutorRunner: Launch command: "java" "-cp" "::/opt/sparkconf:/opt/spark/assembly/target/scala-2.10/spark-assembly-1.2.1-SNAPSHOT-hadoop1.0.4.jar" "-XX:MaxPermSize=128m" "-Dspark.driver.port=51271" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://sparkdri...@joshs-mbp.att.net:51271/user/CoarseGrainedScheduler" "1" "172.17.0.90" "8" "app-20141123022303-" "akka.tcp://sparkWorker@172.17.0.90:/user/Worker" 14/11/23 02:23:14 INFO Worker: Disassociated [akka.tcp://sparkWorker@172.17.0.90:] -> [akka.tcp://sparkMaster@172.17.0.86:7077] Disassociated ! 14/11/23 02:23:14 ERROR Worker: Connection to master failed! Waiting for master to reconnect... 14/11/23 02:23:14 INFO Worker: Connecting to master spark://172.17.0.86:7077... 14/11/23 02:23:14 INFO Worker: Connecting to master spark://172.17.0.87:7077... 14/11/23 02:23:14 INFO Worker: Connecting to master spark://172.17.0.88:7077... 14/11/23 02:23:14 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster@172.17.0.86:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 14/11/23 02:23:14 INFO Worker: Disassociated [akka.tcp://sparkWorker@172.17.0.90:] -> [akka.tcp://sparkMaster@172.17.0.86:7077] Disassociated ! 14/11/23 02:23:14 ERROR Worker: Connection to master failed! Waiting for master to reconnect... 14/11/23 02:23:14 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef: Message [org.apache.spark.deploy.DeployMessages$RegisterWorker] from Actor[akka://sparkWorker/user/Worker#-1246122173] to Actor[akka://sparkWorker/deadLetters] was not delivered. [1] dead letters encountered. 
This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 14/11/23 02:23:14 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already. 14/11/23 02:23:14 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40172.17.0.86%3A7077-2#343365613] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 14/11/23 02:23:25 INFO Worker: Retrying connection to master (attempt # 1) 14/11/23 02:23:25 INFO Worker: Connecting to master spark://172.17.0.86:7077... 14/11/23 02:23:25 INFO Worker: Connecting to master spark://172.17.0.87:7077... 14/11/23 02:23:25 INFO Worker: Connecting to master spark://172.17.0.88:7077... 14/11/23 02:23:36 INFO Worker: Retrying connection to master (attempt # 2) 14/11/23 02:23:36 INFO Worker: Connecting to master spark://172.17.0.86:7077... 14/11/23 02:23:36 INFO Worker: Connecting to master spark://172.17.0.87:7077... 14/11/23 02:23:36 INFO Worker: Connecting to master spark://172.17.0.88:7077... 14/11/23 02:23:42 INFO Worker: Master has changed, new master is at spark://172.17.0.87:7077 14/11/23 02:23:47 INFO Worker: Retrying connection to master (attempt # 3) 14/11/23 02:23:47 INFO Worker: Connecting to master spark://172.17.0.86:7077... 14/11/2
[jira] [Commented] (SPARK-3588) Gaussian Mixture Model clustering
[ https://issues.apache.org/jira/browse/SPARK-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224091#comment-14224091 ] Meethu Mathew commented on SPARK-3588: -- [~mengxr] We have completed the pyspark implementation which is available at https://github.com/FlytxtRnD/GMM. We are in the process of porting the code to Scala and were planning to create a PR once the coding and test cases are completed. By "merging" do you mean to merge the tickets or the implementations? Kindly explain how the merge would be done. Will our work be a duplicate effort if we continue with our scala implementation? Could you please suggest the next course of action? > Gaussian Mixture Model clustering > - > > Key: SPARK-3588 > URL: https://issues.apache.org/jira/browse/SPARK-3588 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Meethu Mathew >Assignee: Meethu Mathew > Attachments: GMMSpark.py > > > Gaussian Mixture Models (GMM) is a popular technique for soft clustering. GMM > models the entire data set as a finite mixture of Gaussian distributions, each > parameterized by a mean vector µ, a covariance matrix ∑ and a mixture weight > π. In this technique, the probability of each point belonging to each cluster is > computed along with the cluster statistics. > We have come up with an initial distributed implementation of GMM in pyspark > where the parameters are estimated using the Expectation-Maximization > algorithm. Our current implementation considers a diagonal covariance matrix for > each component. > We did an initial benchmark study on a 2 node Spark standalone cluster setup > where each node config is 8 cores, 8 GB RAM; the Spark version used is 1.0.0. > We also evaluated the Python version of k-means available in Spark on the same > datasets. > Below are the results from this benchmark study.
The reported stats are averages from 10 runs. Tests were done on multiple datasets with varying numbers of features and instances.
> || Instances || Dimensions || GMM: avg time per iteration || GMM: time for 100 iterations || Kmeans (Python): avg time per iteration || Kmeans (Python): time for 100 iterations ||
> | 0.7 million | 13 | 7s | 12min | 13s | 26min |
> | 1.8 million | 11 | 17s | 29min | 33s | 53min |
> | 10 million | 16 | 1.6min | 2.7hr | 1.2min | 2hr |
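Since the described implementation uses a diagonal covariance matrix, each component's density factorizes across dimensions, which is what keeps the per-point E-step cheap. A hedged pure-Python sketch of that computation (illustrative only, not the attached GMMSpark.py):

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of x under a multivariate Gaussian whose covariance is
    diagonal: the density factorizes into one 1-D Gaussian per dimension."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def responsibilities(x, weights, means, variances):
    """E-step for one point: posterior probability of each component."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    mx = max(logs)                                # log-sum-exp for stability
    probs = [math.exp(l - mx) for l in logs]
    z = sum(probs)
    return [p / z for p in probs]


# One point, two components centered at 0 and 5 with unit variances:
r = responsibilities([0.2], [0.5, 0.5], [[0.0], [5.0]], [[1.0], [1.0]])
```

The responsibilities sum to 1, and the point near 0 is assigned almost entirely to the first component; the M-step then re-estimates weights, means, and per-dimension variances from these soft assignments.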
[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224088#comment-14224088 ] Xiangrui Meng commented on SPARK-1503: -- [~staple] Thanks for working on the design doc! [~rezazadeh] will make a pass. > Implement Nesterov's accelerated first-order method > --- > > Key: SPARK-1503 > URL: https://issues.apache.org/jira/browse/SPARK-1503 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Aaron Staple > > Nesterov's accelerated first-order method is a drop-in replacement for > steepest descent but it converges much faster. We should implement this > method and compare its performance with existing algorithms, including SGD > and L-BFGS. > TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's > method and its variants on composite objectives.
[jira] [Created] (SPARK-4591) Add algorithm/model wrappers in spark.ml to adapt the new API
Xiangrui Meng created SPARK-4591: Summary: Add algorithm/model wrappers in spark.ml to adapt the new API Key: SPARK-4591 URL: https://issues.apache.org/jira/browse/SPARK-4591 Project: Spark Issue Type: Umbrella Reporter: Xiangrui Meng This is an umbrella JIRA for porting spark.mllib implementations to adapt the new API defined under spark.ml. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4590) Early investigation of parameter server
Xiangrui Meng created SPARK-4590: Summary: Early investigation of parameter server Key: SPARK-4590 URL: https://issues.apache.org/jira/browse/SPARK-4590 Project: Spark Issue Type: Brainstorming Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Reza Zadeh In the current implementation of GLM solvers, we save intermediate models on the driver node and update them through broadcast and aggregation. Even with torrent broadcast and tree aggregation added in 1.1, it is hard to go beyond ~10 million features. This JIRA is for investigating the parameter server approach, including algorithms, infrastructure, and dependencies.
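The driver-centric pattern described above can be modeled in a few lines. This is a deliberately simplified stand-in (a least-squares gradient as the GLM objective, Python lists as "partitions"), meant only to show why the full weight vector must be broadcast and aggregated every iteration, which is the bottleneck the parameter server approach would remove:

```python
import numpy as np

def distributed_gd_step(partitions, w, lr=0.1):
    """One iteration of the broadcast/aggregate pattern (sketch).

    partitions: list of (X, y) blocks standing in for worker data.
    """
    # "Broadcast": every partition reads the same full weight vector w.
    partials = [X.T @ (X @ w - y) for X, y in partitions]  # map side
    grad = np.sum(partials, axis=0)                        # tree-aggregate
    n = sum(len(y) for _, y in partitions)
    return w - lr * grad / n                               # update on driver
```

With ~10^7+ features, both the broadcast of w and the aggregation of same-sized partial gradients go through the driver each iteration; a parameter server shards w so workers pull/push only the slices they touch.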
[jira] [Created] (SPARK-4589) ML add-ons to SchemaRDD
Xiangrui Meng created SPARK-4589: Summary: ML add-ons to SchemaRDD Key: SPARK-4589 URL: https://issues.apache.org/jira/browse/SPARK-4589 Project: Spark Issue Type: New Feature Components: ML, MLlib, SQL Reporter: Xiangrui Meng One piece of feedback we received on the Pipeline API (SPARK-3530) concerns the boilerplate code in the implementation. We can add more Scala DSL to simplify the code for the operations we need in ML. Those operations could live under spark.ml via implicits, or be added to SchemaRDD directly if they are also useful for general purposes.
[jira] [Created] (SPARK-4588) Add API for feature attributes
Xiangrui Meng created SPARK-4588: Summary: Add API for feature attributes Key: SPARK-4588 URL: https://issues.apache.org/jira/browse/SPARK-4588 Project: Spark Issue Type: Sub-task Components: ML, MLlib Reporter: Xiangrui Meng Feature attributes, e.g., continuous/categorical, feature names, feature dimension, number of categories, number of nonzeros (support) could be useful for ML algorithms. In SPARK-3569, we added metadata to schema, which can be used to store feature attributes along with the dataset. We need to provide a wrapper over the Metadata class for ML usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3588) Gaussian Mixture Model clustering
[ https://issues.apache.org/jira/browse/SPARK-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224042#comment-14224042 ] Xiangrui Meng commented on SPARK-3588: -- [~MeethuMathew] Just want to check with you whether you are working on the Scala implementation. [~tgaloppo] sent out a PR in SPARK-4156 . If you haven't spent much time on the Scala implementation, I'd like to invite you to review that PR, or we can think of a way to merge both implementations. Does it sound good to you? > Gaussian Mixture Model clustering > - > > Key: SPARK-3588 > URL: https://issues.apache.org/jira/browse/SPARK-3588 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Meethu Mathew >Assignee: Meethu Mathew > Attachments: GMMSpark.py > > > Gaussian Mixture Models (GMM) is a popular technique for soft clustering. GMM > models the entire data set as a finite mixture of Gaussian distributions,each > parameterized by a mean vector µ ,a covariance matrix ∑ and a mixture weight > π. In this technique, probability of each point to belong to each cluster is > computed along with the cluster statistics. > We have come up with an initial distributed implementation of GMM in pyspark > where the parameters are estimated using the Expectation-Maximization > algorithm.Our current implementation considers diagonal covariance matrix for > each component. > We did an initial benchmark study on a 2 node Spark standalone cluster setup > where each node config is 8 Cores,8 GB RAM, the spark version used is 1.0.0. > We also evaluated python version of k-means available in spark on the same > datasets. > Below are the results from this benchmark study. The reported stats are > average from 10 runs.Tests were done on multiple datasets with varying number > of features and instances. 
> || Dataset > || Gaussian > mixture model || > Kmeans(Python) || > > |Instances|Dimensions |Avg time per iteration|Time for 100 iterations |Avg > time per iteration |Time for 100 iterations | > |0.7million| 13 > | > 7s > | > 12min > | > 13s > | 26min > | > |1.8million| 11 > | > 17s > | > 29min > | > 33s > | 53min > | > |10million| 16 > | > 1.6min > | 2.7hr > | > 1.2min | > 2hr > | -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3717: - Target Version/s: 1.3.0 > DecisionTree, RandomForest: Partition by feature > > > Key: SPARK-3717 > URL: https://issues.apache.org/jira/browse/SPARK-3717 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > > h1. Summary > Currently, data are partitioned by row/instance for DecisionTree and > RandomForest. This JIRA argues for partitioning by feature for training deep > trees. This is especially relevant for random forests, which are often > trained to be deeper than single decision trees. > h1. Details > Dataset dimensions and the depth of the tree to be trained are the main > problem parameters determining whether it is better to partition features or > instances. For random forests (training many deep trees), partitioning > features could be much better. > Notation: > * P = # workers > * N = # instances > * M = # features > * D = depth of tree > h2. Partitioning Features > Algorithm sketch: > * Each worker stores: > ** a subset of columns (i.e., a subset of features). If a worker stores > feature j, then the worker stores the feature value for all instances (i.e., > the whole column). > ** all labels > * Train one level at a time. > * Invariants: > ** Each worker stores a mapping: instance → node in current level > * On each iteration: > ** Each worker: For each node in level, compute (best feature to split, info > gain). > ** Reduce (P x M) values to M values to find best split for each node. > ** Workers who have features used in best splits communicate left/right for > relevant instances. Gather total of N bits to master, then broadcast. > * Total communication: > ** Depth D iterations > ** On each iteration, reduce to M values (~8 bytes each), broadcast N values > (1 bit each). > ** Estimate: D * (M * 8 + N) > h2. 
Partitioning Instances > Algorithm sketch: > * Train one group of nodes at a time. > * Invariants: > * Each worker stores a mapping: instance → node > * On each iteration: > ** Each worker: For each instance, add to aggregate statistics. > ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) > *** (“# classes” is for classification. 3 for regression) > ** Reduce aggregate. > ** Master chooses best split for each node in group and broadcasts. > * Local training: Once all instances for a node fit on one machine, it can be > best to shuffle data and training subtrees locally. This can mean shuffling > the entire dataset for each tree trained. > * Summing over all iterations, reduce to total of: > ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) > ** Estimate: 2^D * M * B * C * 8 > h2. Comparing Partitioning Methods > Partitioning features cost < partitioning instances cost when: > * D * (M * 8 + N) < 2^D * M * B * C * 8 > * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the > right hand side) > * N < [ 2^D * M * B * C * 8 ] / D > Example: many instances: > * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = > 5) > * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 > * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
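The cost comparison above is easy to check numerically. A small sketch reproducing the ticket's arithmetic (2 million instances, 3500 features, 100 bins, 5 classes, depth 5, i.e. 6 levels); function names are illustrative:

```python
def feature_partition_cost(levels, n_instances, n_features):
    # Per level: reduce M ~8-byte values, broadcast N single-bit values.
    # Total over D levels: D * (M * 8 + N)
    return levels * (n_features * 8 + n_instances)

def instance_partition_cost(depth, n_features, n_bins, n_classes):
    # ~2^D nodes in the tree, each aggregating M * B * C ~8-byte values:
    # 2^D * M * B * C * 8
    return (2 ** depth) * n_features * n_bins * n_classes * 8

# Ticket's "many instances" example:
feat = feature_partition_cost(6, 2_000_000, 3500)        # ~1.2 * 10^7
inst = instance_partition_cost(5, 3500, 100, 5)          # ~4.5 * 10^8
```

With these numbers, partitioning by feature communicates roughly 35x less than partitioning by instance, which is the argument for feature partitioning on deep trees.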
[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3717: - Assignee: Joseph K. Bradley > DecisionTree, RandomForest: Partition by feature > > > Key: SPARK-3717 > URL: https://issues.apache.org/jira/browse/SPARK-3717 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley
[jira] [Created] (SPARK-4587) Model export/import
Xiangrui Meng created SPARK-4587: Summary: Model export/import Key: SPARK-4587 URL: https://issues.apache.org/jira/browse/SPARK-4587 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Xiangrui Meng Priority: Critical This is an umbrella JIRA for one of the most requested features on the user mailing list. Model export/import can be done via Java serialization, but that doesn't work for models stored distributedly, e.g., ALS and LDA. Ideally, we should provide save/load methods for every model. PMML is an option, but it has its limitations. There are a couple of things we need to discuss: 1) data format, 2) how to preserve partitioning, 3) data compatibility between versions and language APIs, etc.
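One way to address point 2 (preserving partitioning) is the layout many systems use: one file per partition plus a small metadata file recording the partition count and format version. This is a hypothetical plain-Python sketch of that idea, not a proposal for the actual Spark API; all names here are illustrative:

```python
import json
import os

def save_partitioned_model(factors, path, version="1.3.0"):
    """Write each partition of a distributed model (e.g. ALS factor blocks)
    to its own file, plus a metadata file so load() can restore the layout."""
    os.makedirs(path, exist_ok=True)
    for i, part in enumerate(factors):                 # one file per partition
        with open(os.path.join(path, f"part-{i:05d}.json"), "w") as f:
            json.dump(part, f)
    meta = {"version": version, "numPartitions": len(factors)}
    with open(os.path.join(path, "metadata.json"), "w") as f:
        json.dump(meta, f)

def load_partitioned_model(path):
    """Read metadata first, then each partition file in order."""
    with open(os.path.join(path, "metadata.json")) as f:
        meta = json.load(f)
    parts = []
    for i in range(meta["numPartitions"]):
        with open(os.path.join(path, f"part-{i:05d}.json")) as f:
            parts.append(json.load(f))
    return meta, parts
```

The stored version field is what would let a newer release detect and migrate an older on-disk format (discussion point 3).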
[jira] [Updated] (SPARK-1406) PMML model evaluation support via MLib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1406: - Assignee: Vincenzo Selvaggio > PMML model evaluation support via MLib > -- > > Key: SPARK-1406 > URL: https://issues.apache.org/jira/browse/SPARK-1406 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Thomas Darimont >Assignee: Vincenzo Selvaggio > Attachments: MyJPMMLEval.java, SPARK-1406.pdf, kmeans.xml > > > It would be useful if spark would provide support the evaluation of PMML > models (http://www.dmg.org/v4-2/GeneralStructure.html). > This would allow to use analytical models that were created with a > statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLib) which > would perform the actual model evaluation for a given input tuple. The PMML > model would then just contain the "parameterization" of an analytical model. > Other projects like JPMML-Evaluator do a similar thing. > https://github.com/jpmml/jpmml/tree/master/pmml-evaluator -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4586) Python API for ML Pipeline
Xiangrui Meng created SPARK-4586: Summary: Python API for ML Pipeline Key: SPARK-4586 URL: https://issues.apache.org/jira/browse/SPARK-4586 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Add Python API to the newly added ML pipeline and parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4570) Add broadcast join to left semi join
[ https://issues.apache.org/jira/browse/SPARK-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224025#comment-14224025 ] Apache Spark commented on SPARK-4570: - User 'wangxiaojing' has created a pull request for this issue: https://github.com/apache/spark/pull/3442 > Add broadcast join to left semi join > - > > Key: SPARK-4570 > URL: https://issues.apache.org/jira/browse/SPARK-4570 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: XiaoJing wang >Priority: Minor > Fix For: 1.1.0 > > > For now, Spark uses a broadcast join instead of a hash join to optimize {{inner > join}} when the size of one side's data does not reach the > {{AUTO_BROADCASTJOIN_THRESHOLD}}. > However, Spark SQL still performs shuffle operations on each child relation > while executing a {{left semi join}}, which is well suited to optimization with a > broadcast join. > We are planning to create a {{BroadcastLeftSemiJoinHash}} to implement the > broadcast join for {{left semi join}}.
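The optimization is simple to state: a left semi join only needs the small side's join keys, so broadcasting those keys as a hash set lets each left-side partition filter locally with no shuffle. A pure-Python model of the idea (not Spark code; names are illustrative):

```python
def broadcast_left_semi_join(left, right, key_left, key_right):
    """Left semi join via a broadcast hash set of the right side's keys.

    Keeps each left row at most once, with left columns only, which is
    exactly the left-semi-join contract.
    """
    keys = {key_right(r) for r in right}   # built once; "broadcast" to workers
    return [row for row in left if key_left(row) in keys]
```

In Spark terms, `keys` plays the role of the broadcast variable, and the list comprehension is the per-partition filter that replaces the shuffle of both relations.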
[jira] [Updated] (SPARK-4251) Add Restricted Boltzmann machine(RBM) algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4251: - Target Version/s: (was: 1.3.0) > Add Restricted Boltzmann machine(RBM) algorithm to MLlib > > > Key: SPARK-4251 > URL: https://issues.apache.org/jira/browse/SPARK-4251 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering
[ https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4156: - Priority: Major (was: Minor) > Add expectation maximization for Gaussian mixture models to MLLib clustering > > > Key: SPARK-4156 > URL: https://issues.apache.org/jira/browse/SPARK-4156 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Travis Galoppo >Assignee: Travis Galoppo > > As an additional clustering algorithm, implement expectation maximization for > Gaussian mixture models -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3188) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)
[ https://issues.apache.org/jira/browse/SPARK-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3188: - Assignee: Fan Jiang > Add Robust Regression Algorithm with Tukey bisquare weight function > (Biweight Estimates) > -- > > Key: SPARK-3188 > URL: https://issues.apache.org/jira/browse/SPARK-3188 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Fan Jiang >Assignee: Fan Jiang >Priority: Minor > Labels: features > Original Estimate: 0h > Remaining Estimate: 0h > > Linear least square estimates assume the error has normal distribution and > can behave badly when the errors are heavy-tailed. In practical we get > various types of data. We need to include Robust Regression to employ a > fitting criterion that is not as vulnerable as least square. > The Tukey bisquare weight function, also referred to as the biweight > function, produces an M-estimator that is more resistant to regression > outliers than the Huber M-estimator (Andersen 2008: 19). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
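The Tukey bisquare weight function mentioned above has a standard closed form: w(u) = (1 - (u/c)^2)^2 for |u| <= c and 0 otherwise, where c is a tuning constant (c = 4.685 is the usual choice for ~95% efficiency under normal errors). A sketch, with an illustrative function name:

```python
def tukey_bisquare_weight(u, c=4.685):
    """Tukey's bisquare (biweight) function.

    Residuals with |u| > c receive zero weight, which is what makes the
    resulting M-estimator resistant to outliers: extreme points are
    dropped from the fit entirely rather than merely down-weighted,
    unlike the Huber weight, which never reaches zero.
    """
    t = u / c
    return (1.0 - t * t) ** 2 if abs(u) <= c else 0.0
```

An IRLS (iteratively reweighted least squares) solver would recompute these weights from the scaled residuals at each iteration.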
[jira] [Updated] (SPARK-4494) IDFModel.transform() add support for single vector
[ https://issues.apache.org/jira/browse/SPARK-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4494: - Priority: Minor (was: Major) > IDFModel.transform() add support for single vector > -- > > Key: SPARK-4494 > URL: https://issues.apache.org/jira/browse/SPARK-4494 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.1.1, 1.2.0 >Reporter: Jean-Philippe Quemener >Priority: Minor > > For now, when using the TF-IDF implementation of MLlib, you have no way to > map your data back onto e.g. labels or IDs other than a hackish approach > with zipping: {quote} 1. Persist input RDD. 2. Transform it to just > vectors and apply IDFModel. 3. Zip with original RDD. 4. Transform label and > new vector to LabeledPoint.{quote} > Source: [http://stackoverflow.com/questions/26897908/spark-mllib-tfidf-implementation-for-logisticregression] > Since in production a lot of users want to map their data back to some > identifier, it would be a good improvement to allow using a single vector > with IDFModel.transform().
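The requested capability is conceptually small: once the IDF weights are fitted, transforming a single vector is just element-wise scaling, so no zip with the original RDD is needed. A plain-Python sketch using the smoothed formula log((n + 1) / (df + 1)) as an assumption; MLlib's exact formula and these function names may differ:

```python
import math

def idf_weights(docs_tf):
    """Fit IDF weights from a list of term-frequency vectors."""
    n = len(docs_tf)
    dim = len(docs_tf[0])
    # Document frequency: number of docs with a nonzero count per term.
    df = [sum(1 for doc in docs_tf if doc[j] > 0) for j in range(dim)]
    return [math.log((n + 1.0) / (df_j + 1.0)) for df_j in df]

def transform_single(idf, tf_vector):
    """Transform one vector: element-wise TF * IDF. No zipping required."""
    return [w * x for w, x in zip(idf, tf_vector)]
```

With a `transform` overload like this, a user can keep (id, tf_vector) pairs and map `transform_single` over the values directly.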
[jira] [Updated] (SPARK-4582) Add getVectors to Word2VecModel
[ https://issues.apache.org/jira/browse/SPARK-4582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4582: - Assignee: Tobias Kässmann > Add getVectors to Word2VecModel > --- > > Key: SPARK-4582 > URL: https://issues.apache.org/jira/browse/SPARK-4582 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Tobias Kässmann >Priority: Minor > Fix For: 1.2.0 > > > Add getVectors to Word2VecModel for further processing. PR for branch-1.2: > https://github.com/apache/spark/pull/3309 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4582) Add getVectors to Word2VecModel
[ https://issues.apache.org/jira/browse/SPARK-4582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4582. -- Resolution: Fixed Issue resolved by pull request 3437 [https://github.com/apache/spark/pull/3437] > Add getVectors to Word2VecModel > --- > > Key: SPARK-4582 > URL: https://issues.apache.org/jira/browse/SPARK-4582 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng >Priority: Minor > Fix For: 1.2.0 > > > Add getVectors to Word2VecModel for further processing. PR for branch-1.2: > https://github.com/apache/spark/pull/3309 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3575) Hive Schema is ignored when using convertMetastoreParquet
[ https://issues.apache.org/jira/browse/SPARK-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224011#comment-14224011 ] Apache Spark commented on SPARK-3575: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/3441 > Hive Schema is ignored when using convertMetastoreParquet > - > > Key: SPARK-3575 > URL: https://issues.apache.org/jira/browse/SPARK-3575 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Cheng Lian >Priority: Critical > > This can cause problems when for example one of the columns is defined as > TINYINT. A class cast exception will be thrown since the parquet table scan > produces INTs while the rest of the execution is expecting bytes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1476) 2GB limit in spark for blocks
[ https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1476: --- Target Version/s: (was: 1.2.0) > 2GB limit in spark for blocks > - > > Key: SPARK-1476 > URL: https://issues.apache.org/jira/browse/SPARK-1476 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Environment: all >Reporter: Mridul Muralidharan >Assignee: Mridul Muralidharan >Priority: Critical > Attachments: 2g_fix_proposal.pdf > > > The underlying abstraction for blocks in Spark is a ByteBuffer, which limits > the size of a block to 2GB. > This has implications not just for managed blocks in use, but also for shuffle > blocks (memory-mapped blocks are limited to 2GB, even though the API allows > for long), ser/deser via byte-array-backed output streams (SPARK-1391), etc. > This is a severe limitation for the use of Spark on non-trivial > datasets.
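The 2GB cap comes from Java's ByteBuffer being indexed by a 32-bit int. The usual workaround (one option among those a fix proposal might take) is to expose a large block as a sequence of sub-2GB chunks, each small enough to back with a single buffer or mmap region. A sketch of the chunking arithmetic:

```python
INT_MAX = 2**31 - 1  # a Java ByteBuffer is indexed by int, hence the 2GB cap

def chunk_ranges(total_size, chunk_size=INT_MAX):
    """Split a large block into (offset, length) chunks, each <= chunk_size."""
    ranges = []
    offset = 0
    while offset < total_size:
        length = min(chunk_size, total_size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges
```

For example, a 5GB block needs three chunks; reads and memory-mapping then operate per chunk instead of on one oversized buffer.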
[jira] [Updated] (SPARK-4525) MesosSchedulerBackend.resourceOffers cannot decline unused offers from acceptedOffers
[ https://issues.apache.org/jira/browse/SPARK-4525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4525: --- Fix Version/s: 1.2.0 > MesosSchedulerBackend.resourceOffers cannot decline unused offers from > acceptedOffers > - > > Key: SPARK-4525 > URL: https://issues.apache.org/jira/browse/SPARK-4525 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.2.0, 1.3.0 >Reporter: Jongyoul Lee >Assignee: Jongyoul Lee >Priority: Blocker > Fix For: 1.2.0 > > > After the resourceOffers function was refactored in SPARK-2269, it no longer > declines unused offers from the accepted offers. Previously, when > driver.launchTasks was called with an empty task list, driver.launchTasks > declined the offer (declineOffer(offer.id)). > {quote} > Invoking this function with an empty collection of tasks declines offers in > their entirety (see SchedulerDriver.declineOffer(OfferID, Filters)). > - > http://mesos.apache.org/api/latest/java/org/apache/mesos/MesosSchedulerDriver.html#launchTasks(OfferID,%20java.util.Collection,%20Filters) > {quote} > In branch-1.1, resourceOffers called launchTasks for all offered > offers, so the driver declined unused resources. In current master, however, > offers are first divided into accepted and declined offers by their resources; > declinedOffers are declined explicitly, and offers with tasks from > acceptedOffers are launched by driver.launchTasks, but offers from > acceptedOffers that end up without tasks are neither launched with an empty > task list nor declined explicitly. Thus, the Mesos master judges that those > offers are in use by the TaskScheduler and that no resources remain.
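The invariant the fix restores is that every offer receives exactly one response: launch with tasks, or decline. A simplified stand-in for the control flow (not the Mesos API; names are illustrative):

```python
def respond_to_offers(offers, tasks_by_offer):
    """Partition offers into launched and declined, so none leak.

    Offers with assigned tasks are launched; every other offer must be
    declined (or, equivalently per the Mesos docs quoted above, launched
    with an empty task collection, which Mesos treats as a decline).
    """
    launched, declined = [], []
    for offer in offers:
        tasks = tasks_by_offer.get(offer, [])
        if tasks:
            launched.append((offer, tasks))
        else:
            declined.append(offer)  # the bug: these previously got no response
    return launched, declined
```

The bug described above is the missing `declined.append` branch for accepted-but-unused offers: without it, the Mesos master considers those resources still held by the framework.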
[jira] [Commented] (SPARK-4395) Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour
[ https://issues.apache.org/jira/browse/SPARK-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223992#comment-14223992 ] Cheng Lian commented on SPARK-4395: --- [~davies] Sure, I'll take a look. > Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour > -- > > Key: SPARK-4395 > URL: https://issues.apache.org/jira/browse/SPARK-4395 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.2.0 > Environment: version 1.2.0-SNAPSHOT >Reporter: Sameer Farooqui > > When I run this command it hangs for one to many hours and then finally > returns with successful results: > >>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect() > Note, the lab environment below is still active, so let me know if you'd like > to just access it directly. > +++ My Environment +++ > - 1-node cluster in Amazon > - RedHat 6.5 64-bit > - java version "1.7.0_67" > - SBT version: sbt-0.13.5 > - Scala version: scala-2.11.2 > Ran: > sudo yum -y update > git clone https://github.com/apache/spark > sudo sbt assembly > +++ Data file used +++ > http://blueplastic.com/databricks/movielens/ratings.dat > {code} > >>> import re > >>> import string > >>> from pyspark.sql import SQLContext, Row > >>> sqlContext = SQLContext(sc) > >>> RATINGS_PATTERN = '^(\d+)::(\d+)::(\d+)::(\d+)' > >>> > >>> def parse_ratings_line(line): > ... match = re.search(RATINGS_PATTERN, line) > ... if match is None: > ... # Optionally, you can change this to just ignore if each line of > data is not critical. > ... raise Error("Invalid logline: %s" % logline) > ... return Row( > ... UserID= int(match.group(1)), > ... MovieID = int(match.group(2)), > ... Rating= int(match.group(3)), > ... Timestamp = int(match.group(4))) > ... > >>> ratings_base_RDD = > >>> (sc.textFile("file:///home/ec2-user/movielens/ratings.dat") > ...# Call the parse_apace_log_line function on each line. 
> ....map(parse_ratings_line) > ...# Caches the objects in memory since they will be queried > multiple times. > ....cache()) > >>> ratings_base_RDD.count() > 1000209 > >>> ratings_base_RDD.first() > Row(MovieID=1193, Rating=5, Timestamp=978300760, UserID=1) > >>> schemaRatings = sqlContext.inferSchema(ratings_base_RDD) > >>> schemaRatings.registerTempTable("RatingsTable") > >>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect() > {code} > (Now the Python shell hangs...) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4525) MesosSchedulerBackend.resourceOffers cannot decline unused offers from acceptedOffers
[ https://issues.apache.org/jira/browse/SPARK-4525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4525. Resolution: Fixed > MesosSchedulerBackend.resourceOffers cannot decline unused offers from > acceptedOffers > - > > Key: SPARK-4525 > URL: https://issues.apache.org/jira/browse/SPARK-4525 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.2.0, 1.3.0 >Reporter: Jongyoul Lee >Assignee: Jongyoul Lee >Priority: Blocker
[jira] [Commented] (SPARK-4258) NPE with new Parquet Filters
[ https://issues.apache.org/jira/browse/SPARK-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223981#comment-14223981 ] Apache Spark commented on SPARK-4258: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/3440
> NPE with new Parquet Filters
>
> Key: SPARK-4258
> URL: https://issues.apache.org/jira/browse/SPARK-4258
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Michael Armbrust
> Assignee: Cheng Lian
> Priority: Critical
>
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 4 times, most recent failure: Lost task 0.3 in stage 21.0 (TID 160, ip-10-0-247-144.us-west-2.compute.internal): java.lang.NullPointerException:
> parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:206)
> parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:162)
> parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:100)
> parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47)
> parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
> parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:210)
> parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47)
> parquet.filter2.predicate.Operators$Or.accept(Operators.java:302)
> parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:201)
> parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47)
> parquet.filter2.predicate.Operators$And.accept(Operators.java:290)
> parquet.filter2.statisticslevel.StatisticsFilter.canDrop(StatisticsFilter.java:52)
> parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:46)
> parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
> {code}
> This occurs when reading Parquet data encoded with the older version of the library, for TPC-DS query 34. Will work on coming up with a smaller reproduction.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
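The stack trace shows row-group filtering failing inside a statistics comparison. A likely shape of the problem, sketched in plain Python (hypothetical helper, not the Parquet or Spark API): files written by older library versions may lack min/max statistics, and comparing a predicate value against a missing statistic blindly is the analogue of the NullPointerException above, while the defensive fix is to keep the row group when statistics are absent.

```python
# Hypothetical sketch (not Parquet's actual code): row-group pruning for an
# equality predicate, with a guard for missing (None) statistics.

def can_drop_row_group(stats_min, stats_max, predicate_value):
    """Return True if the row group provably cannot contain the value."""
    if stats_min is None or stats_max is None:
        # Statistics unavailable (e.g. older writer): conservatively read it.
        return False
    # Safe to skip only when the value falls outside [min, max].
    return predicate_value < stats_min or predicate_value > stats_max

assert can_drop_row_group(10, 20, 5) is True       # outside range: skip
assert can_drop_row_group(10, 20, 15) is False     # may be present: read
assert can_drop_row_group(None, None, 15) is False # no stats: must read
```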
[jira] [Commented] (SPARK-4539) History Server counts "incomplete" applications against the "retainedApplications" total, fails to show eligible "completed" applications
[ https://issues.apache.org/jira/browse/SPARK-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223979#comment-14223979 ] Masayoshi TSUZUKI commented on SPARK-4539: -- I assume you mean the parameter "spark.history.retainedApplications". It does not limit the number of apps listed on the HistoryServer UI; rather, it is the number of cached application detail pages, which are shown when we click a listed "App ID" link. Even when spark.history.retainedApplications is set to 2, we can see more than 10 apps listed. By the way, I think your test procedure doesn't work properly. As you know, just copying an existing application directory doesn't work, because both copies then have the same application ID in EVENT_LOG_1, so it needs to be modified. If there are two app directories with the same application ID, HistoryServer skips listing them. Also, HistoryServer reads only directories whose modification time is later than the last time the log directory was loaded. So please try: * updating the modification time of the directory after you modify EVENT_LOG_1, and * making sure you are not looking at a cached page in the browser. It works for me. And of course, restarting HistoryServer is also a good way to get all apps listed. > History Server counts "incomplete" applications against the "retainedApplications" total, fails to show eligible "completed" applications > - > > Key: SPARK-4539 > URL: https://issues.apache.org/jira/browse/SPARK-4539 > Project: Spark > Issue Type: Bug >Affects Versions: 1.2.0 >Reporter: Ryan Williams > > I have observed the history server returning 0 or 1 applications from a directory that contains many complete and incomplete applications (the latter being application directories that are missing the {{APPLICATION_COMPLETE}} file).
> Without having dug too much, my theory is that HistoryServer is seeing the "incomplete" directories and counting them against the {{retainedApplications}} maximum but not displaying them. > One supporting anecdote for this is that I loaded HS against a directory that had one complete application and nothing else, and HS worked as expected (I saw the one application in the web UI). > I then copied ~100 other application directories in, the majority of which were "incomplete" (in particular, most of the ones that had the earliest timestamps), and still only saw the one original completed application via the web UI. > Finally, I restarted the same server with {{retainedApplications}} set to 1000 (instead of 50; the directory at this point had ~10 completed applications and 90 incomplete ones), and saw exactly the completed applications, leading me to believe that they were being "boxed out" of the maximum-50-retained-applications iteration of the history server. > Silently failing on "incomplete" directories while still docking the count, if that is indeed what is happening, is a pretty confusing failure mode.
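The reporter's theory can be sketched in plain Python (directory layout assumed from the report; this is not HistoryServer's actual code): if the retention cap is applied before filtering on the APPLICATION_COMPLETE marker, newer incomplete directories crowd completed applications out of the listing.

```python
# Conceptual sketch of the suspected bug versus the intended behaviour.

def list_apps_buggy(app_dirs, retained):
    # Cap is applied first, so incomplete dirs consume retention slots.
    newest = sorted(app_dirs, key=lambda d: d["mtime"], reverse=True)[:retained]
    return [d["id"] for d in newest if d["complete"]]

def list_apps_fixed(app_dirs, retained):
    # Filter on completeness first, then apply the cap.
    completed = sorted((d for d in app_dirs if d["complete"]),
                       key=lambda d: d["mtime"], reverse=True)
    return [d["id"] for d in completed[:retained]]

dirs = [{"id": "app-old", "mtime": 1, "complete": True}] + [
    {"id": "incomplete-%d" % i, "mtime": 2 + i, "complete": False}
    for i in range(5)
]
print(list_apps_buggy(dirs, retained=3))  # [] - the completed app is boxed out
print(list_apps_fixed(dirs, retained=3))  # ['app-old']
```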
[jira] [Commented] (SPARK-4583) GradientBoostedTrees error logging should use loss being minimized
[ https://issues.apache.org/jira/browse/SPARK-4583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223950#comment-14223950 ] Apache Spark commented on SPARK-4583: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/3439 > GradientBoostedTrees error logging should use loss being minimized > -- > > Key: SPARK-4583 > URL: https://issues.apache.org/jira/browse/SPARK-4583 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > Currently, the LogLoss used by GradientBoostedTrees has 2 issues: > * the gradient (and therefore loss) does not match that used by Friedman > (1999) > * the error computation uses 0/1 accuracy, not log loss
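The distinction between the two error measures can be illustrated in plain Python (the exact formula is hedged; one form commonly attributed to Friedman 1999, for labels in {-1, +1} and margin F(x), is 2*log(1 + exp(-2*y*F)), while 0/1 error only checks sign agreement):

```python
# Illustrative sketch, not MLlib's implementation.
import math

def log_loss(y, margin):
    # y in {-1, +1}; one common gradient-boosting log-loss form.
    return 2.0 * math.log1p(math.exp(-2.0 * y * margin))

def zero_one_error(y, margin):
    return 0.0 if (margin >= 0) == (y > 0) else 1.0

# A confidently-wrong prediction and a barely-wrong one get the same 0/1
# error, but very different log loss - and log loss is what training
# actually minimizes, so it is the right thing to log.
print(zero_one_error(+1, -0.01), zero_one_error(+1, -5.0))  # 1.0 1.0
print(round(log_loss(+1, -0.01), 3), round(log_loss(+1, -5.0), 3))
```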
[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223922#comment-14223922 ] Pedro Rodriguez edited comment on SPARK-1405 at 11/25/14 2:18 AM: -- Finished an initial implementation of an LDA data generator. I have done some initial testing and it seems reasonable, but just initial testing at the moment. Will be looking at metrics other than "it looks good" to make sure that the data being generated is correct. Implementation: https://github.com/EntilZha/spark/blob/LDA/mllib/src/main/scala/org/apache/spark/mllib/util/LDADataGenerator.scala was (Author: pedrorodriguez): Finished an initial implementation of an LDA data generator. I have done some initial testing and it seems reasonable, but just initial testing at the moment. Will be looking at metrics other than "it looks good" to make sure that the data being generated looks reasonable. Implementation: https://github.com/EntilZha/spark/blob/LDA/mllib/src/main/scala/org/apache/spark/mllib/util/LDADataGenerator.scala > parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib > - > > Key: SPARK-1405 > URL: https://issues.apache.org/jira/browse/SPARK-1405 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xusen Yin >Assignee: Guoqiang Li >Priority: Critical > Labels: features > Attachments: performance_comparison.png > > Original Estimate: 336h > Remaining Estimate: 336h > > Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. > Unlike current machine learning algorithms in MLlib, which use optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. > In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core.
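The LDA generative process that such a data generator samples from can be sketched in plain Python (this is the textbook process, not the linked generator's actual code; all names here are stand-ins):

```python
# Hedged sketch of the LDA generative model: draw a per-document topic
# mixture from a Dirichlet, then for each word position draw a topic and
# then a word from that topic's word distribution.
import random

def sample_dirichlet(alpha):
    # Normalized Gamma draws are a standard way to sample a Dirichlet.
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, doc_len):
    theta = sample_dirichlet(alpha)          # topic mixture for this document
    doc = []
    for _ in range(doc_len):
        z = random.choices(range(len(theta)), weights=theta)[0]   # topic
        w = random.choices(range(len(topic_word[z])),
                           weights=topic_word[z])[0]              # word id
        doc.append(w)
    return doc

# Two topics over a 4-word vocabulary: topic 0 favours words 0-1, topic 1 words 2-3.
topic_word = [[0.45, 0.45, 0.05, 0.05], [0.05, 0.05, 0.45, 0.45]]
doc = generate_document(topic_word, alpha=[0.5, 0.5], doc_len=20)
print(len(doc))  # 20
```

Checking an inferred model against the known topic_word and alpha used for generation is one of the "metrics other than it looks good" the comment alludes to.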
[jira] [Resolved] (SPARK-4266) Avoid expensive JavaScript for StagePages with huge numbers of tasks
[ https://issues.apache.org/jira/browse/SPARK-4266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4266. Resolution: Fixed Fix Version/s: 1.2.0 [~kayousterhout] I'm resolving this because I saw you merged it. > Avoid expensive JavaScript for StagePages with huge numbers of tasks > > > Key: SPARK-4266 > URL: https://issues.apache.org/jira/browse/SPARK-4266 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.0 >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout >Priority: Blocker > Fix For: 1.2.0 > > > Some of the new JavaScript added to handle hiding metrics significantly slows the page load for stages with a lot of tasks (e.g., for a job with 10K tasks, it took over a minute for the page to finish loading in Chrome on my laptop). > There are at least two issues here: > (1) The new table-striping JavaScript is much slower than the old CSS. The fancier JavaScript is only needed for the stage summary table, so we should change the task table back to using CSS so that it doesn't slow the page load for jobs with lots of tasks. > (2) The JavaScript associated with hiding metrics is expensive when jobs have lots of tasks, I think because the jQuery selectors have to traverse a much larger DOM. ID selectors are much more efficient, so we should consider switching to these, and/or avoiding this code in additional-metrics.js: > $("input:checkbox:not(:checked)").each(function() { > var column = "table ." + $(this).attr("name"); > $(column).hide(); > }); > by initially hiding the data when we generate the page in the render function instead, which should be easy to do.
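The render-time alternative proposed in point (2) can be sketched generically (hypothetical helper names, not Spark's actual StagePage code): emit the hidden state in the generated HTML itself, so no client-side DOM traversal is needed after load.

```python
# Illustrative sketch: decide visibility while rendering, instead of hiding
# columns with a jQuery class selector after the page loads.

def render_cell(value, column_class, hidden_columns):
    style = ' style="display:none"' if column_class in hidden_columns else ""
    return '<td class="%s"%s>%s</td>' % (column_class, style, value)

# Columns unchecked by default are hidden server-side; the checkbox handler
# then only needs to toggle them on demand.
cell = render_cell("3.2 s", "scheduler_delay", hidden_columns={"scheduler_delay"})
print(cell)  # <td class="scheduler_delay" style="display:none">3.2 s</td>
```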
[jira] [Created] (SPARK-4585) Spark dynamic scaling executors use upper limit value as default.
Chengxiang Li created SPARK-4585: Summary: Spark dynamic scaling executors use upper limit value as default. Key: SPARK-4585 URL: https://issues.apache.org/jira/browse/SPARK-4585 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.1.0 Reporter: Chengxiang Li With SPARK-3174, one can configure a minimum and maximum number of executors for a Spark application on Yarn. However, the application always starts with the maximum. It seems more reasonable, at least for Hive on Spark, to start from the minimum and scale up as needed up to the maximum.
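The behaviour being requested can be sketched as a simple backlog-driven policy (this is an invented illustration, not Spark's actual dynamic allocation implementation): start at the configured minimum and grow toward the maximum only while there is a pending-task backlog.

```python
# Hypothetical scale-up policy sketch: start at min, ramp toward max.

def next_executor_target(current, pending_tasks, min_execs, max_execs):
    """One step of a backlog-driven scale-up policy."""
    if pending_tasks == 0:
        return current  # no backlog: hold (scale-down handled elsewhere)
    # Exponential ramp-up, capped at the configured maximum.
    return min(max_execs, max(min_execs, current * 2))

target = 2  # start from the configured minimum, not the maximum
for _ in range(3):
    target = next_executor_target(target, pending_tasks=10,
                                  min_execs=2, max_execs=20)
print(target)  # 2 -> 4 -> 8 -> 16, never past 20
```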
[jira] [Issue Comment Deleted] (SPARK-4584) 2x Performance regression for Spark-on-YARN
[ https://issues.apache.org/jira/browse/SPARK-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishkam Ravi updated SPARK-4584: Comment: was deleted (was: In YARN cluster mode.) > 2x Performance regression for Spark-on-YARN > --- > > Key: SPARK-4584 > URL: https://issues.apache.org/jira/browse/SPARK-4584 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0 >Reporter: Nishkam Ravi > > Significant performance regression observed for Spark-on-YARN (up to 2x) after the 1.2 rebase. The offending commit is: 70e824f750aa8ed446eec104ba158b0503ba58a9 from Oct 7th. The problem can be reproduced with JavaWordCount against a large enough input dataset.
[jira] [Updated] (SPARK-4584) 2x Performance regression for Spark-on-YARN
[ https://issues.apache.org/jira/browse/SPARK-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishkam Ravi updated SPARK-4584: Description: Significant performance regression observed for Spark-on-YARN (up to 2x) after the 1.2 rebase. The offending commit is: 70e824f750aa8ed446eec104ba158b0503ba58a9 from Oct 7th. The problem can be reproduced with JavaWordCount against a large enough input dataset in YARN cluster mode. (was: Significant performance regression observed for Spark-on-YARN (up to 2x) after the 1.2 rebase. The offending commit is: 70e824f750aa8ed446eec104ba158b0503ba58a9 from Oct 7th. The problem can be reproduced with JavaWordCount against a large enough input dataset.) > 2x Performance regression for Spark-on-YARN > --- > > Key: SPARK-4584 > URL: https://issues.apache.org/jira/browse/SPARK-4584 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0 >Reporter: Nishkam Ravi > > Significant performance regression observed for Spark-on-YARN (up to 2x) after the 1.2 rebase. The offending commit is: 70e824f750aa8ed446eec104ba158b0503ba58a9 from Oct 7th. The problem can be reproduced with JavaWordCount against a large enough input dataset in YARN cluster mode.
[jira] [Commented] (SPARK-4584) 2x Performance regression for Spark-on-YARN
[ https://issues.apache.org/jira/browse/SPARK-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223930#comment-14223930 ] Nishkam Ravi commented on SPARK-4584: - In YARN cluster mode. > 2x Performance regression for Spark-on-YARN > --- > > Key: SPARK-4584 > URL: https://issues.apache.org/jira/browse/SPARK-4584 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0 >Reporter: Nishkam Ravi > > Significant performance regression observed for Spark-on-YARN (up to 2x) after the 1.2 rebase. The offending commit is: 70e824f750aa8ed446eec104ba158b0503ba58a9 from Oct 7th. The problem can be reproduced with JavaWordCount against a large enough input dataset.
[jira] [Created] (SPARK-4584) 2x Performance regression for Spark-on-YARN
Nishkam Ravi created SPARK-4584: --- Summary: 2x Performance regression for Spark-on-YARN Key: SPARK-4584 URL: https://issues.apache.org/jira/browse/SPARK-4584 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Nishkam Ravi Significant performance regression observed for Spark-on-YARN (up to 2x) after the 1.2 rebase. The offending commit is: 70e824f750aa8ed446eec104ba158b0503ba58a9 from Oct 7th. The problem can be reproduced with JavaWordCount against a large enough input dataset.
[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223925#comment-14223925 ] Apache Spark commented on SPARK-2926: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/3438 > Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle > -- > > Key: SPARK-2926 > URL: https://issues.apache.org/jira/browse/SPARK-2926 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 1.1.0 >Reporter: Saisai Shao >Assignee: Saisai Shao > Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test > Report(contd).pdf, Spark Shuffle Test Report.pdf > > > Currently, Spark already integrates sort-based shuffle write, which greatly improves I/O performance and reduces memory consumption when the number of reducers is very large. On the reducer side, however, it still uses the hash-based shuffle reader, which ignores the ordering of map output data in some situations. > Here we propose an MR-style, sort-merge-like shuffle reader for sort-based shuffle, to further improve its performance. > Work-in-progress code and a performance test report will be posted later, once some unit test bugs are fixed. > Any comments would be greatly appreciated. > Thanks a lot.
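The core idea of an MR-style shuffle read can be sketched in plain Python (conceptual only, not Spark's shuffle code): each map output is already sorted by key under sort-based shuffle, so the reduce side can stream a k-way merge instead of hash-aggregating everything in memory.

```python
# heapq.merge is the stdlib analogue of the merge-sort shuffle read.
import heapq
from itertools import groupby

map_output_1 = [("a", 1), ("c", 3)]
map_output_2 = [("a", 2), ("b", 5)]
map_output_3 = [("b", 1), ("c", 4)]

merged = heapq.merge(map_output_1, map_output_2, map_output_3,
                     key=lambda kv: kv[0])

# Equal keys arrive consecutively, so they can be aggregated while
# streaming: memory use is bounded by one key group, not the partition.
result = {k: sum(v for _, v in grp)
          for k, grp in groupby(merged, key=lambda kv: kv[0])}
print(result)  # {'a': 3, 'b': 6, 'c': 7}
```

This preserves the ordering of map output data that the hash-based reader throws away, which is exactly what the proposal exploits.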
[jira] [Updated] (SPARK-4583) GradientBoostedTrees error logging should use loss being minimized
[ https://issues.apache.org/jira/browse/SPARK-4583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4583: - Assignee: Joseph K. Bradley > GradientBoostedTrees error logging should use loss being minimized > -- > > Key: SPARK-4583 > URL: https://issues.apache.org/jira/browse/SPARK-4583 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > Currently, the LogLoss used by GradientBoostedTrees has 2 issues: > * the gradient (and therefore loss) does not match that used by Friedman > (1999) > * the error computation uses 0/1 accuracy, not log loss
[jira] [Updated] (SPARK-4525) MesosSchedulerBackend.resourceOffers cannot decline unused offers from acceptedOffers
[ https://issues.apache.org/jira/browse/SPARK-4525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4525: --- Target Version/s: 1.2.0 (was: 1.2.0, 1.3.0) > MesosSchedulerBackend.resourceOffers cannot decline unused offers from > acceptedOffers > - > > Key: SPARK-4525 > URL: https://issues.apache.org/jira/browse/SPARK-4525 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.2.0, 1.3.0 >Reporter: Jongyoul Lee >Assignee: Jongyoul Lee >Priority: Blocker > > After the resourceOffers function was refactored in SPARK-2269, it no longer declines unused offers from the accepted offers. In branch-1.1, resourceOffers called launchTasks for every offered offer, so the driver declined unused resources: calling driver.launchTasks with an empty task collection declines the offer (declineOffer(offer.id)). > {quote} > Invoking this function with an empty collection of tasks declines offers in their entirety (see SchedulerDriver.declineOffer(OfferID, Filters)). > - > http://mesos.apache.org/api/latest/java/org/apache/mesos/MesosSchedulerDriver.html#launchTasks(OfferID,%20java.util.Collection,%20Filters) > {quote} > In the current master, however, offers are first divided into accepted and declined offers by their resources. The declined offers are declined explicitly, and offers with tasks from acceptedOffers are launched by driver.launchTasks, but offers from acceptedOffers that received no tasks are neither launched with an empty task list nor declined explicitly. Thus, the Mesos master judges those offers as used by the TaskScheduler and reports that there are no resources remaining.
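The fix described above can be sketched as pseudocode (plain Python with a recording stub; the decline semantics come from the Mesos driver API quoted above, but the scheduler code here is an invented illustration, not Spark's MesosSchedulerBackend):

```python
# Every offer must either be launched with tasks or declined explicitly;
# offers that silently fall through are counted as in-use by the master.
from collections import namedtuple

Offer = namedtuple("Offer", "id")

class RecordingDriver:
    """Stub standing in for the Mesos SchedulerDriver."""
    def __init__(self):
        self.launched, self.declined = [], []
    def launch_tasks(self, offer_id, tasks):
        self.launched.append(offer_id)
    def decline_offer(self, offer_id):
        self.declined.append(offer_id)

def resource_offers(driver, offers, tasks_for):
    for offer in offers:
        tasks = tasks_for(offer)
        if tasks:
            driver.launch_tasks(offer.id, tasks)
        else:
            # The missing branch in the buggy version: without this,
            # the master never returns these resources to the pool.
            driver.decline_offer(offer.id)

driver = RecordingDriver()
resource_offers(driver, [Offer("o1"), Offer("o2")],
                tasks_for=lambda o: ["task"] if o.id == "o1" else [])
print(driver.launched, driver.declined)  # ['o1'] ['o2']
```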
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223922#comment-14223922 ] Pedro Rodriguez commented on SPARK-1405: Finished an initial implementation of an LDA data generator. I have done some initial testing and it seems reasonable, but just initial testing at the moment. Will be looking at metrics other than "it looks good" to make sure that the data being generated looks reasonable. Implementation: https://github.com/EntilZha/spark/blob/LDA/mllib/src/main/scala/org/apache/spark/mllib/util/LDADataGenerator.scala > parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib > - > > Key: SPARK-1405 > URL: https://issues.apache.org/jira/browse/SPARK-1405 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xusen Yin >Assignee: Guoqiang Li >Priority: Critical > Labels: features > Attachments: performance_comparison.png > > Original Estimate: 336h > Remaining Estimate: 336h > > Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. > Unlike current machine learning algorithms in MLlib, which use optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. > In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core.
[jira] [Created] (SPARK-4583) GradientBoostedTrees error logging should use loss being minimized
Joseph K. Bradley created SPARK-4583: Summary: GradientBoostedTrees error logging should use loss being minimized Key: SPARK-4583 URL: https://issues.apache.org/jira/browse/SPARK-4583 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor Currently, the LogLoss used by GradientBoostedTrees has 2 issues: * the gradient (and therefore loss) does not match that used by Friedman (1999) * the error computation uses 0/1 accuracy, not log loss
[jira] [Commented] (SPARK-4577) Python example of LBFGS for MLlib guide
[ https://issues.apache.org/jira/browse/SPARK-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223909#comment-14223909 ] Davies Liu commented on SPARK-4577: --- The Scala example for L-BFGS is actually about optimization, but we do not have a Python API for it, so I would like to close this. > Python example of LBFGS for MLlib guide > --- > > Key: SPARK-4577 > URL: https://issues.apache.org/jira/browse/SPARK-4577 > Project: Spark > Issue Type: Improvement > Components: Documentation, MLlib >Reporter: Davies Liu >Priority: Minor > > The MLlib guide should have a Python example of L-BFGS.
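For readers wanting a Python L-BFGS example despite the missing MLlib binding, here is a hedged sketch using SciPy's L-BFGS-B (this is not MLlib's API; it only shows the optimization pattern the Scala example demonstrates, on a tiny hand-made dataset):

```python
# Fit logistic regression by minimizing the log loss with L-BFGS-B.
import math
from scipy.optimize import minimize

data = [((1.0, 2.0), 1), ((2.0, 1.0), 1), ((-1.0, -2.0), 0), ((-2.0, -1.0), 0)]

def loss_and_grad(w):
    loss, grad = 0.0, [0.0, 0.0]
    for x, y in data:
        z = w[0] * x[0] + w[1] * x[1]
        # Numerically stable log loss: log(1 + e^z) - y*z
        loss += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - y * z
        p = 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))
        for j in range(2):
            grad[j] += (p - y) * x[j]
    return loss, grad

# jac=True tells SciPy the objective returns (loss, gradient) together.
result = minimize(loss_and_grad, x0=[0.0, 0.0], jac=True, method="L-BFGS-B")
print(round(result.fun, 4), [round(v, 2) for v in result.x])
```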
[jira] [Closed] (SPARK-4577) Python example of LBFGS for MLlib guide
[ https://issues.apache.org/jira/browse/SPARK-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu closed SPARK-4577. - Resolution: Won't Fix Target Version/s: (was: 1.2.0) > Python example of LBFGS for MLlib guide > --- > > Key: SPARK-4577 > URL: https://issues.apache.org/jira/browse/SPARK-4577 > Project: Spark > Issue Type: Improvement > Components: Documentation, MLlib >Reporter: Davies Liu >Priority: Minor > > The MLlib guide should have a Python example of L-BFGS.
[jira] [Updated] (SPARK-4409) Additional (but limited) Linear Algebra Utils
[ https://issues.apache.org/jira/browse/SPARK-4409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4409: - Assignee: Burak Yavuz > Additional (but limited) Linear Algebra Utils > - > > Key: SPARK-4409 > URL: https://issues.apache.org/jira/browse/SPARK-4409 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Minor > > This ticket is to discuss the addition of a very limited number of local matrix manipulation and generation methods that would be helpful in the further development of algorithms on top of BlockMatrix (SPARK-3974), such as Randomized SVD, and Multi Model Training (SPARK-1486). > The proposed methods for addition are: > For `Matrix` > - map: maps the values in the matrix with a given function. Produces a new matrix. > - update: the values in the matrix are updated with a given function. Occurs in place. > Factory methods for `DenseMatrix`: > - *zeros: Generate a matrix consisting of zeros > - *ones: Generate a matrix consisting of ones > - *eye: Generate an identity matrix > - *rand: Generate a matrix consisting of i.i.d. uniform random numbers > - *randn: Generate a matrix consisting of i.i.d. gaussian random numbers > - *diag: Generate a diagonal matrix from a supplied vector > *These methods already exist in the factory methods for `Matrices`, however for cases where we require a `DenseMatrix`, you constantly have to add `.asInstanceOf[DenseMatrix]` everywhere, which makes the code "dirtier". I propose moving these functions to factory methods for `DenseMatrix`, where the output will be a `DenseMatrix`, and the factory methods for `Matrices` will call these functions directly and output a generic `Matrix`. > Factory methods for `SparseMatrix`: > - speye: Identity matrix in sparse format. Saves a ton of memory when dimensions are large, especially in Multi Model Training, where each row requires being multiplied by a scalar. > - sprand: Generate a sparse matrix with a given density consisting of i.i.d. uniform random numbers. > - sprandn: Generate a sparse matrix with a given density consisting of i.i.d. gaussian random numbers. > - diag: Generate a diagonal matrix from a supplied vector, but is memory efficient, because it just stores the diagonal. Again, very helpful in Multi Model Training. > Factory methods for `Matrices`: > - Include all the factory methods given above, but return a generic `Matrix` rather than `SparseMatrix` or `DenseMatrix`. > - horzCat: Horizontally concatenate matrices to form one larger matrix. Very useful in both Multi Model Training, and for the repartitioning of BlockMatrix. > - vertCat: Vertically concatenate matrices to form one larger matrix. Very useful for the repartitioning of BlockMatrix.
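A rough Python analogue of the proposed sparse factory methods (SciPy, not MLlib's Scala API) shows where the memory savings for speye/sprand come from: only the nonzeros are stored.

```python
# SciPy stand-ins for the proposed speye/sprand factory methods.
import scipy.sparse as sp

speye = sp.eye(1000, format="csr")            # 1000x1000 sparse identity
sprand = sp.random(1000, 1000, density=0.01,  # ~1% nonzero entries
                   format="csr")

print(speye.nnz)   # 1000 stored values, vs 1,000,000 in a dense identity
print(sprand.nnz)  # roughly density * rows * cols
```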
[jira] [Updated] (SPARK-4409) Additional (but limited) Linear Algebra Utils
[ https://issues.apache.org/jira/browse/SPARK-4409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4409: - Priority: Major (was: Minor) > Additional (but limited) Linear Algebra Utils > - > > Key: SPARK-4409 > URL: https://issues.apache.org/jira/browse/SPARK-4409 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Burak Yavuz >Assignee: Burak Yavuz > > This ticket is to discuss the addition of a very limited number of local matrix manipulation and generation methods that would be helpful in the further development of algorithms on top of BlockMatrix (SPARK-3974), such as Randomized SVD, and Multi Model Training (SPARK-1486). > The proposed methods for addition are: > For `Matrix` > - map: maps the values in the matrix with a given function. Produces a new matrix. > - update: the values in the matrix are updated with a given function. Occurs in place. > Factory methods for `DenseMatrix`: > - *zeros: Generate a matrix consisting of zeros > - *ones: Generate a matrix consisting of ones > - *eye: Generate an identity matrix > - *rand: Generate a matrix consisting of i.i.d. uniform random numbers > - *randn: Generate a matrix consisting of i.i.d. gaussian random numbers > - *diag: Generate a diagonal matrix from a supplied vector > *These methods already exist in the factory methods for `Matrices`, however for cases where we require a `DenseMatrix`, you constantly have to add `.asInstanceOf[DenseMatrix]` everywhere, which makes the code "dirtier". I propose moving these functions to factory methods for `DenseMatrix`, where the output will be a `DenseMatrix`, and the factory methods for `Matrices` will call these functions directly and output a generic `Matrix`. > Factory methods for `SparseMatrix`: > - speye: Identity matrix in sparse format. Saves a ton of memory when dimensions are large, especially in Multi Model Training, where each row requires being multiplied by a scalar. > - sprand: Generate a sparse matrix with a given density consisting of i.i.d. uniform random numbers. > - sprandn: Generate a sparse matrix with a given density consisting of i.i.d. gaussian random numbers. > - diag: Generate a diagonal matrix from a supplied vector, but is memory efficient, because it just stores the diagonal. Again, very helpful in Multi Model Training. > Factory methods for `Matrices`: > - Include all the factory methods given above, but return a generic `Matrix` rather than `SparseMatrix` or `DenseMatrix`. > - horzCat: Horizontally concatenate matrices to form one larger matrix. Very useful in both Multi Model Training, and for the repartitioning of BlockMatrix. > - vertCat: Vertically concatenate matrices to form one larger matrix. Very useful for the repartitioning of BlockMatrix.
[jira] [Updated] (SPARK-4494) IDFModel.transform() add support for single vector
[ https://issues.apache.org/jira/browse/SPARK-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4494: - Target Version/s: 1.3.0 (was: 1.1.1) > IDFModel.transform() add support for single vector > -- > > Key: SPARK-4494 > URL: https://issues.apache.org/jira/browse/SPARK-4494 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.1.1, 1.2.0 >Reporter: Jean-Philippe Quemener > > Currently, when using the TF-IDF implementation of MLlib, the only way to map your data back onto e.g. labels or IDs is a hackish zipping workaround: {quote} 1. Persist the input RDD. 2. Transform it to just vectors and apply IDFModel. 3. Zip with the original RDD. 4. Transform the label and new vector into a LabeledPoint.{quote} > Source:[http://stackoverflow.com/questions/26897908/spark-mllib-tfidf-implementation-for-logisticregression] > Since in production a lot of users want to map their data back to some identifier, it would be a good improvement to allow IDFModel.transform() to accept a single vector.
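The zip workaround described in the issue can be sketched in plain Python (the real pattern uses RDDs and MLlib's IDFModel; the names and weights here are stand-ins). A transform() that accepted a single vector would make steps 2 to 4 unnecessary, since each labeled record could be transformed in place.

```python
# Sketch of the zip workaround: strip labels, transform, zip back.
labeled = [("doc-1", [1.0, 0.0, 2.0]), ("doc-2", [0.0, 3.0, 1.0])]

idf_weights = [0.5, 2.0, 1.0]  # assume an already-fitted IDF model

def idf_transform(vec):
    # TF-IDF = term frequency scaled by the per-term IDF weight.
    return [tf * w for tf, w in zip(vec, idf_weights)]

vectors = [v for _, v in labeled]          # step 2: vectors only
tfidf = [idf_transform(v) for v in vectors]
relabeled = list(zip((label for label, _ in labeled), tfidf))  # step 3
print(relabeled)  # [('doc-1', [0.5, 0.0, 2.0]), ('doc-2', [0.0, 6.0, 1.0])]
```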
[jira] [Updated] (SPARK-4494) IDFModel.transform() add support for single vector
[ https://issues.apache.org/jira/browse/SPARK-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4494: - Affects Version/s: (was: 1.1.0) 1.1.1 > IDFModel.transform() add support for single vector > -- > > Key: SPARK-4494 > URL: https://issues.apache.org/jira/browse/SPARK-4494 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.1.1, 1.2.0 >Reporter: Jean-Philippe Quemener > > Currently, when using the TF-IDF implementation of MLlib, the only way to map your data back onto e.g. labels or IDs is a hackish zipping workaround: {quote} 1. Persist the input RDD. 2. Transform it to just vectors and apply IDFModel. 3. Zip with the original RDD. 4. Transform the label and new vector into a LabeledPoint.{quote} > Source:[http://stackoverflow.com/questions/26897908/spark-mllib-tfidf-implementation-for-logisticregression] > Since in production a lot of users want to map their data back to some identifier, it would be a good improvement to allow IDFModel.transform() to accept a single vector.
[jira] [Updated] (SPARK-4494) IDFModel.transform() add support for single vector
[ https://issues.apache.org/jira/browse/SPARK-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4494: - Affects Version/s: 1.2.0 > IDFModel.transform() add support for single vector > -- > > Key: SPARK-4494 > URL: https://issues.apache.org/jira/browse/SPARK-4494 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.1.1, 1.2.0 >Reporter: Jean-Philippe Quemener > > Currently, when using the TF-IDF implementation of MLlib, there is no way to > map your data back onto e.g. labels or ids other than a hackish workaround > with zipping: {quote} 1. Persist the input RDD. 2. Transform it to just > vectors and apply IDFModel. 3. Zip with the original RDD. 4. Transform the label and > new vector to a LabeledPoint.{quote} > Source:[http://stackoverflow.com/questions/26897908/spark-mllib-tfidf-implementation-for-logisticregression] > Since in production a lot of users want to map their data back to some > identifier, it would be a good improvement to allow using a single vector with > IDFModel.transform() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
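The requested single-vector transform amounts to applying precomputed IDF weights to one term-frequency vector, so the caller can keep a label or id next to each vector without the persist/zip workaround described above. A pure-Python sketch of that idea (an illustration, not the MLlib implementation; the smoothing formula used here is an assumption):

```python
import math

def fit_idf(docs):
    """Compute IDF weights from term-frequency vectors (one list per doc)."""
    n_docs = len(docs)
    n_terms = len(docs[0])
    df = [sum(1 for d in docs if d[t] > 0) for t in range(n_terms)]
    # Smoothed IDF, log((n+1)/(df+1)); the exact formula is assumed here.
    return [math.log((n_docs + 1.0) / (df_t + 1.0)) for df_t in df]

def transform_single(idf, tf_vector):
    """The requested API shape: transform ONE vector, so the caller keeps its
    label/id alongside instead of zipping RDDs back together afterwards."""
    return [tf * w for tf, w in zip(tf_vector, idf)]

docs = [[3, 0, 1], [2, 0, 0], [0, 1, 1]]
idf = fit_idf(docs)
# Labels stay attached to their vectors throughout -- no zip step needed.
labeled = [(label, transform_single(idf, tf))
           for label, tf in [("a", docs[0]), ("b", docs[1]), ("c", docs[2])]]
```

With a per-vector `transform`, the four-step zip dance collapses into a single map over `(label, vector)` pairs.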
[jira] [Updated] (SPARK-4510) Add k-medoids Partitioning Around Medoids (PAM) algorithm
[ https://issues.apache.org/jira/browse/SPARK-4510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4510: - Assignee: Fan Jiang > Add k-medoids Partitioning Around Medoids (PAM) algorithm > - > > Key: SPARK-4510 > URL: https://issues.apache.org/jira/browse/SPARK-4510 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Fan Jiang >Assignee: Fan Jiang > Labels: features > Original Estimate: 0h > Remaining Estimate: 0h > > PAM (k-medoids) is more robust to noise and outliers than k-means > because it minimizes a sum of pairwise dissimilarities instead of a sum of > squared Euclidean distances. A medoid can be defined as the object of a > cluster whose average dissimilarity to all the objects in the cluster is > minimal, i.e. it is the most centrally located point in the cluster. > The most common realisation of k-medoids clustering is the Partitioning Around > Medoids (PAM) algorithm, which works as follows: > 1. Initialize: randomly select (without replacement) k of the n data points as > the medoids. > 2. Associate each data point with the closest medoid. ("closest" here is defined > using any valid distance metric, most commonly Euclidean distance, Manhattan > distance or Minkowski distance.) > 3. For each medoid m and each non-medoid data point o, > swap m and o and compute the total cost of the configuration. > 4. Select the configuration with the lowest cost. > 5. Repeat steps 2 to 4 until there is no change in the medoids. 
> The new feature for MLlib will contain 5 new files > /main/scala/org/apache/spark/mllib/clustering/PAM.scala > /main/scala/org/apache/spark/mllib/clustering/PAMModel.scala > /main/scala/org/apache/spark/mllib/clustering/LocalPAM.scala > /test/scala/org/apache/spark/mllib/clustering/PAMSuite.scala > /main/scala/org/apache/spark/examples/mllib/KMedoids.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
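The steps above translate almost directly into code. A minimal pure-Python sketch of PAM with Euclidean distance (an illustration of the algorithm, not the proposed MLlib files):

```python
import random

def dist(a, b):
    # Euclidean distance; per the description, any valid metric works here.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def total_cost(points, medoids):
    # Cost = sum of each point's dissimilarity to its closest medoid.
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)  # step 1: random init without replacement
    improved = True
    while improved:                  # repeat until no swap lowers the cost
        improved = False
        for i in range(k):
            for o in points:
                if o in medoids:
                    continue
                # Swap medoid i with non-medoid o and compare total cost.
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                if total_cost(points, candidate) < total_cost(points, medoids):
                    medoids = candidate
                    improved = True
    return medoids

points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
medoids = pam(points, k=2)
```

Each accepted swap strictly decreases the total cost, so the loop terminates; with two well-separated groups, one medoid ends up in each.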
[jira] [Updated] (SPARK-4530) GradientDescent get a wrong gradient value according to the gradient formula, which is caused by the miniBatchSize parameter.
[ https://issues.apache.org/jira/browse/SPARK-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4530: - Target Version/s: 1.0.2, 1.2.0, 1.1.2 > GradientDescent get a wrong gradient value according to the gradient formula, > which is caused by the miniBatchSize parameter. > - > > Key: SPARK-4530 > URL: https://issues.apache.org/jira/browse/SPARK-4530 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.0, 1.1.0, 1.2.0 >Reporter: Guoqiang Li >Assignee: Guoqiang Li >Priority: Minor > > This bug is caused by {{RDD.sample}}: > the number of elements {{RDD.sample}} returns is not fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4530) GradientDescent get a wrong gradient value according to the gradient formula, which is caused by the miniBatchSize parameter.
[ https://issues.apache.org/jira/browse/SPARK-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4530: - Priority: Minor (was: Major) > GradientDescent get a wrong gradient value according to the gradient formula, > which is caused by the miniBatchSize parameter. > - > > Key: SPARK-4530 > URL: https://issues.apache.org/jira/browse/SPARK-4530 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.0, 1.1.0, 1.2.0 >Reporter: Guoqiang Li >Assignee: Guoqiang Li >Priority: Minor > > This bug is caused by {{RDD.sample}}: > the number of elements {{RDD.sample}} returns is not fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4530) GradientDescent get a wrong gradient value according to the gradient formula, which is caused by the miniBatchSize parameter.
[ https://issues.apache.org/jira/browse/SPARK-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4530: - Assignee: Guoqiang Li > GradientDescent get a wrong gradient value according to the gradient formula, > which is caused by the miniBatchSize parameter. > - > > Key: SPARK-4530 > URL: https://issues.apache.org/jira/browse/SPARK-4530 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.0, 1.1.0, 1.2.0 >Reporter: Guoqiang Li >Assignee: Guoqiang Li > > This bug is caused by {{RDD.sample}}: > the number of elements {{RDD.sample}} returns is not fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
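The root cause is easy to reproduce outside Spark: per-element Bernoulli sampling (the strategy behind {{RDD.sample}} without replacement) returns a count that varies around `fraction * n` rather than being exact, so a gradient averaged over the *expected* mini-batch size is biased. A pure-Python sketch of the same behavior (an illustration, not Spark's sampler):

```python
import random

def bernoulli_sample(data, fraction, seed):
    # Each element is kept independently with probability `fraction`,
    # so the returned size is a random variable, not exactly fraction * n.
    rng = random.Random(seed)
    return [x for x in data if rng.random() < fraction]

data = list(range(10000))
sizes = {len(bernoulli_sample(data, 0.1, seed)) for seed in range(20)}
# A mini-batch gradient divided by the *expected* size (fraction * n) is
# therefore off whenever the *actual* sample size differs from it.
```

Across seeds the sample sizes cluster around 1000 but are rarely exactly 1000, which is the discrepancy the issue describes.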
[jira] [Updated] (SPARK-4581) Refactorize StandardScaler to improve the transformation performance
[ https://issues.apache.org/jira/browse/SPARK-4581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4581: - Target Version/s: 1.2.0 Assignee: DB Tsai > Refactorize StandardScaler to improve the transformation performance > > > Key: SPARK-4581 > URL: https://issues.apache.org/jira/browse/SPARK-4581 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: DB Tsai >Assignee: DB Tsai > > The following optimizations are done to improve the StandardScaler model > transformation performance. > 1) Convert the Breeze dense vector to a primitive vector to reduce the overhead. > 2) Since the mean can potentially be a sparse vector, we explicitly convert it to > a dense primitive vector. > 3) Hold a local reference to the `shift` and `factor` arrays so the JVM can locate the > values with a single operation. > 4) In the pattern-matching part, we use the mllib SparseVector/DenseVector > instead of breeze's vectors to make the codebase cleaner. > Benchmark with mnist8m dataset: > Before, > DenseVector withMean and withStd: 50.97secs > DenseVector withMean and withoutStd: 42.11secs > DenseVector withoutMean and withStd: 8.75secs > SparseVector withoutMean and withStd: 5.437secs > With this PR, > DenseVector withMean and withStd: 5.76secs > DenseVector withMean and withoutStd: 5.28secs > DenseVector withoutMean and withStd: 5.30secs > SparseVector withoutMean and withStd: 1.27secs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
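Optimizations 1-3 boil down to: densify once up front, and hoist the `shift` and `factor` lookups out of the per-element loop. A pure-Python sketch of that transformation shape (illustrative only; the function and parameter names are assumptions, not the MLlib code):

```python
def make_standardizer(mean, std, with_mean=True, with_std=True):
    # Precompute dense primitive arrays once; `mean` may arrive sparse.
    shift = [m if with_mean else 0.0 for m in mean]
    factor = [(1.0 / s if s != 0.0 else 0.0) if with_std else 1.0
              for s in std]

    def transform(values):
        # `shift` and `factor` are closure locals resolved once, mirroring
        # the "local reference" optimization in the description: no repeated
        # field lookups inside the hot loop.
        return [(v - sh) * f for v, sh, f in zip(values, shift, factor)]

    return transform

scale = make_standardizer(mean=[1.0, 2.0], std=[2.0, 4.0])
result = scale([3.0, 6.0])
```

The design choice is the same in both worlds: pay the conversion and lookup costs once at model-construction time so the per-row transform touches only flat arrays.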
[jira] [Updated] (SPARK-4547) OOM when making bins in BinaryClassificationMetrics
[ https://issues.apache.org/jira/browse/SPARK-4547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4547: - Target Version/s: 1.3.0 > OOM when making bins in BinaryClassificationMetrics > --- > > Key: SPARK-4547 > URL: https://issues.apache.org/jira/browse/SPARK-4547 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Sean Owen >Priority: Minor > > Also following up on > http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3CCAMAsSdK4s4TNkf3_ecLC6yD-pLpys_PpT3WB7Tp6=yoxuxf...@mail.gmail.com%3E > -- this one I intend to make a PR for a bit later. The conversation was > basically: > {quote} > Recently I was using BinaryClassificationMetrics to build an AUC curve for a > classifier over a reasonably large number of points (~12M). The scores were > all probabilities, so tended to be almost entirely unique. > The computation does some operations by key, and this ran out of memory. It's > something you can solve with more than the default amount of memory, but in > this case, it seemed not useful to create an AUC curve with such fine-grained > resolution. > I ended up just binning the scores so there were ~1000 unique values > and then it was fine. > {quote} > and: > {quote} > Yes, if there are many distinct values, we need binning to compute the AUC > curve. Usually the scores are not evenly distributed, so we cannot simply > truncate the digits. Estimating the quantiles for binning is necessary, > similar to RangePartitioner: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L104 > Limiting the number of bins is definitely useful. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4547) OOM when making bins in BinaryClassificationMetrics
[ https://issues.apache.org/jira/browse/SPARK-4547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4547: - Assignee: Sean Owen > OOM when making bins in BinaryClassificationMetrics > --- > > Key: SPARK-4547 > URL: https://issues.apache.org/jira/browse/SPARK-4547 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > Also following up on > http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3CCAMAsSdK4s4TNkf3_ecLC6yD-pLpys_PpT3WB7Tp6=yoxuxf...@mail.gmail.com%3E > -- this one I intend to make a PR for a bit later. The conversation was > basically: > {quote} > Recently I was using BinaryClassificationMetrics to build an AUC curve for a > classifier over a reasonably large number of points (~12M). The scores were > all probabilities, so tended to be almost entirely unique. > The computation does some operations by key, and this ran out of memory. It's > something you can solve with more than the default amount of memory, but in > this case, it seemed not useful to create an AUC curve with such fine-grained > resolution. > I ended up just binning the scores so there were ~1000 unique values > and then it was fine. > {quote} > and: > {quote} > Yes, if there are many distinct values, we need binning to compute the AUC > curve. Usually the scores are not evenly distributed, so we cannot simply > truncate the digits. Estimating the quantiles for binning is necessary, > similar to RangePartitioner: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L104 > Limiting the number of bins is definitely useful. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
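The quantile binning discussed above can be sketched directly: quantize each score to one of roughly 1000 representative values before the by-key aggregation, so the number of distinct keys, and hence memory, stays bounded. A pure-Python sketch (an illustration, not the eventual MLlib API):

```python
import bisect
import random

def bin_scores(scores, num_bins=1000):
    """Quantile binning: map each score to the lower edge of its bin, so at
    most ~num_bins distinct values remain for the by-key aggregation."""
    ranked = sorted(scores)
    n = len(ranked)
    # Bin edges taken at evenly spaced ranks -- quantiles, not fixed-width
    # bins, because the scores are usually not evenly distributed.
    edges = sorted(set(ranked[(i * n) // num_bins] for i in range(num_bins)))
    return [edges[max(0, bisect.bisect_right(edges, s) - 1)] for s in scores]

rng = random.Random(0)
scores = [rng.random() for _ in range(100000)]  # ~12M in the report; smaller here
binned = bin_scores(scores)
```

Quantile edges (rather than digit truncation) keep every bin roughly equally populated even when scores pile up near 0 or 1, which is the same reason RangePartitioner samples quantiles.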
[jira] [Updated] (SPARK-4577) Python example of LBFGS for MLlib guide
[ https://issues.apache.org/jira/browse/SPARK-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4577: - Issue Type: Improvement (was: Bug) > Python example of LBFGS for MLlib guide > --- > > Key: SPARK-4577 > URL: https://issues.apache.org/jira/browse/SPARK-4577 > Project: Spark > Issue Type: Improvement > Components: Documentation, MLlib >Reporter: Davies Liu >Priority: Minor > > The MLlib guide should have a Python example of L-BFGS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4577) Python example of LBFGS for MLlib guide
[ https://issues.apache.org/jira/browse/SPARK-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4577: - Target Version/s: 1.2.0 > Python example of LBFGS for MLlib guide > --- > > Key: SPARK-4577 > URL: https://issues.apache.org/jira/browse/SPARK-4577 > Project: Spark > Issue Type: Bug > Components: Documentation, MLlib >Reporter: Davies Liu >Priority: Minor > > The MLlib guide should have a Python example of L-BFGS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3080: - Affects Version/s: 1.2.0 > ArrayIndexOutOfBoundsException in ALS for Large datasets > > > Key: SPARK-3080 > URL: https://issues.apache.org/jira/browse/SPARK-3080 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.1.0, 1.2.0 >Reporter: Burak Yavuz >Assignee: Xiangrui Meng > > The stack trace is below: > {quote} > java.lang.ArrayIndexOutOfBoundsException: 2716 > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) > scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > > org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) > > org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) > > org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > > 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > {quote} > This happened after the dataset was sub-sampled. > Dataset properties: ~12B ratings > Setup: 55 r3.8xlarge ec2 instances -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4517) Improve memory efficiency for python broadcast
[ https://issues.apache.org/jira/browse/SPARK-4517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4517: -- Component/s: PySpark > Improve memory efficiency for python broadcast > -- > > Key: SPARK-4517 > URL: https://issues.apache.org/jira/browse/SPARK-4517 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 1.2.0 > > > Currently, the Python broadcast (TorrentBroadcast) has multiple copies: > 1) 1 copy in the Python driver > 2) 1 copy on the driver's disk (serialized and compressed) > 3) 2 copies in the JVM driver (one deserialized, one serialized and > compressed) > 4) 2 copies in each executor (one deserialized, one serialized and > compressed) > 5) one copy in each Python worker. > Some of these differ in HTTPBroadcast: > 3) one copy in driver memory, one copy on disk (serialized and compressed) > 4) one copy in executor memory > If the Python broadcast is 4G, it needs 12G in the driver and 8+4x G in each > executor (x is the number of Python workers, usually the number of CPUs). > The Python broadcast is already serialized and compressed in Python; it > should not be serialized and compressed again in the JVM. Also, the JVM does not need > to know its content, so the data can stay outside the JVM. > So we should have a dedicated broadcast implementation for Python that stores > the serialized and compressed data on disk, transfers it to executors in a p2p > way (similar to TorrentBroadcast), and sends it to the Python workers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4517) Improve memory efficiency for python broadcast
[ https://issues.apache.org/jira/browse/SPARK-4517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4517. --- Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Davies Liu This was fixed in https://github.com/apache/spark/pull/3417 > Improve memory efficiency for python broadcast > -- > > Key: SPARK-4517 > URL: https://issues.apache.org/jira/browse/SPARK-4517 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 1.2.0 > > > Currently, the Python broadcast (TorrentBroadcast) has multiple copies: > 1) 1 copy in the Python driver > 2) 1 copy on the driver's disk (serialized and compressed) > 3) 2 copies in the JVM driver (one deserialized, one serialized and > compressed) > 4) 2 copies in each executor (one deserialized, one serialized and > compressed) > 5) one copy in each Python worker. > Some of these differ in HTTPBroadcast: > 3) one copy in driver memory, one copy on disk (serialized and compressed) > 4) one copy in executor memory > If the Python broadcast is 4G, it needs 12G in the driver and 8+4x G in each > executor (x is the number of Python workers, usually the number of CPUs). > The Python broadcast is already serialized and compressed in Python; it > should not be serialized and compressed again in the JVM. Also, the JVM does not need > to know its content, so the data can stay outside the JVM. > So we should have a dedicated broadcast implementation for Python that stores > the serialized and compressed data on disk, transfers it to executors in a p2p > way (similar to TorrentBroadcast), and sends it to the Python workers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
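The key observation in the issue, that the payload is already serialized and compressed on the Python side and the JVM should therefore treat it as opaque bytes, can be sketched with stdlib pieces (illustrative only; PySpark's actual codec and transport differ):

```python
import pickle
import zlib

def python_side_pack(obj):
    # Serialize + compress exactly once, in Python.
    return zlib.compress(pickle.dumps(obj))

def jvm_side_store(blob):
    # The JVM side (sketched here as a plain dict standing in for disk) only
    # moves opaque bytes around; it never deserializes or recompresses them,
    # so no second copy of the deserialized object exists there.
    disk = {}
    disk["broadcast_0"] = blob
    return disk

def worker_side_unpack(blob):
    # Only the Python worker ever decompresses and deserializes.
    return pickle.loads(zlib.decompress(blob))

payload = {"weights": list(range(1000))}
blob = python_side_pack(payload)
restored = worker_side_unpack(jvm_side_store(blob)["broadcast_0"])
```

This mirrors the proposed design: one compressed byte blob per broadcast, spilled to disk and forwarded p2p, instead of the 4+ decompressed/recompressed copies the old path kept alive.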
[jira] [Updated] (SPARK-2206) Automatically infer the number of classification classes in multiclass classification
[ https://issues.apache.org/jira/browse/SPARK-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2206: - Target Version/s: 1.3.0 (was: 1.2.0) > Automatically infer the number of classification classes in multiclass > classification > - > > Key: SPARK-2206 > URL: https://issues.apache.org/jira/browse/SPARK-2206 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Manish Amde >Assignee: Manish Amde > > Currently, the user needs to specify the numClassesForClassification > parameter explicitly during multiclass classification for decision trees. > This feature will automatically infer this information (and possibly class > histograms) from the training data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3080: - Target Version/s: 1.3.0 (was: 1.2.0) > ArrayIndexOutOfBoundsException in ALS for Large datasets > > > Key: SPARK-3080 > URL: https://issues.apache.org/jira/browse/SPARK-3080 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.1.0, 1.2.0 >Reporter: Burak Yavuz >Assignee: Xiangrui Meng > > The stack trace is below: > {quote} > java.lang.ArrayIndexOutOfBoundsException: 2716 > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) > scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > > org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) > > org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) > > org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > > 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > {quote} > This happened after the dataset was sub-sampled. > Dataset properties: ~12B ratings > Setup: 55 r3.8xlarge ec2 instances -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4548) Python broadcast perf regression from Spark 1.1
[ https://issues.apache.org/jira/browse/SPARK-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4548. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3417 [https://github.com/apache/spark/pull/3417] > Python broadcast perf regression from Spark 1.1 > --- > > Key: SPARK-4548 > URL: https://issues.apache.org/jira/browse/SPARK-4548 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.2.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.2.0 > > > Python broadcast in 1.2 is much slower than 1.1: > In spark-perf tests: > name 1.1 1.2 speedup > python-broadcast-w-set 3.63 16.68 -78.23% -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-927) PySpark sample() doesn't work if numpy is installed on master but not on workers
[ https://issues.apache.org/jira/browse/SPARK-927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-927: - Affects Version/s: 0.9.1 > PySpark sample() doesn't work if numpy is installed on master but not on > workers > > > Key: SPARK-927 > URL: https://issues.apache.org/jira/browse/SPARK-927 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 0.8.0, 0.9.1, 1.0.2, 1.1.2 >Reporter: Josh Rosen >Assignee: Matthew Farrellee >Priority: Minor > > PySpark's sample() method crashes with ImportErrors on the workers if numpy > is installed on the driver machine but not on the workers. I'm not sure > what's the best way to fix this. A general mechanism for automatically > shipping libraries from the master to the workers would address this, but > that could be complicated to implement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-927) PySpark sample() doesn't work if numpy is installed on master but not on workers
[ https://issues.apache.org/jira/browse/SPARK-927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-927: - Affects Version/s: 1.1.2 1.0.2 > PySpark sample() doesn't work if numpy is installed on master but not on > workers > > > Key: SPARK-927 > URL: https://issues.apache.org/jira/browse/SPARK-927 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 0.8.0, 0.9.1, 1.0.2, 1.1.2 >Reporter: Josh Rosen >Assignee: Matthew Farrellee >Priority: Minor > > PySpark's sample() method crashes with ImportErrors on the workers if numpy > is installed on the driver machine but not on the workers. I'm not sure > what's the best way to fix this. A general mechanism for automatically > shipping libraries from the master to the workers would address this, but > that could be complicated to implement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
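One defensive pattern for this class of problem is to probe for the optional dependency at import time and fall back to a stdlib implementation, so driver and workers behave consistently even when only the driver has numpy. A sketch of that pattern (an illustration, not necessarily what PySpark ended up doing; `poisson_draw` is a hypothetical helper):

```python
import math
import random

try:
    import numpy  # may exist on the driver but not on the workers

    def poisson_draw(mean, seed):
        rng = numpy.random.RandomState(seed)
        return int(rng.poisson(mean))

except ImportError:
    def poisson_draw(mean, seed):
        # Stdlib fallback (Knuth's algorithm), so a worker without numpy
        # can still sample; slower, but dependency-free.
        rng = random.Random(seed)
        threshold = math.exp(-mean)
        k, p = 0, 1.0
        while True:
            p *= rng.random()
            if p <= threshold:
                return k
            k += 1

draw = poisson_draw(3.0, seed=42)
```

Whichever branch is taken, callers see the same function, which is what prevents the "works on the driver, ImportError on the worker" split the issue describes.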
[jira] [Commented] (SPARK-4565) Add docs about advanced spark application development
[ https://issues.apache.org/jira/browse/SPARK-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223878#comment-14223878 ] Joseph E. Gonzalez commented on SPARK-4565: --- Hmm, I wonder if it would make more sense to have an "Advanced Programming Guide" since a lot of these ideas (and those we might add for GraphX) are around how to use the Spark APIs most efficiently rather than how to configure the cluster and would be a distraction in the standard programming guide. > Add docs about advanced spark application development > - > > Key: SPARK-4565 > URL: https://issues.apache.org/jira/browse/SPARK-4565 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Evan Sparks >Priority: Minor > > [~shivaram], [~jegonzal] and I have been working on a brief document based on > our experiences writing high performance spark applications - MLlib, GraphX, > pipelines, ml-matrix, etc. > It currently exists here - > https://docs.google.com/document/d/1gEIawzRsOwksV_bq4je3ofnd-7Xu-u409mdW-RXTDnQ/edit?usp=sharing > Would it make sense to add these tips and tricks to the Spark Wiki? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4582) Add getVectors to Word2VecModel
[ https://issues.apache.org/jira/browse/SPARK-4582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4582: - Fix Version/s: 1.2.0 > Add getVectors to Word2VecModel > --- > > Key: SPARK-4582 > URL: https://issues.apache.org/jira/browse/SPARK-4582 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng >Priority: Minor > Fix For: 1.2.0 > > > Add getVectors to Word2VecModel for further processing. PR for branch-1.2: > https://github.com/apache/spark/pull/3309 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4565) Add docs about advanced spark application development
[ https://issues.apache.org/jira/browse/SPARK-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223864#comment-14223864 ] Evan Sparks commented on SPARK-4565: [~pwendell] suggested that we add this to the spark tuning guide and programming guide. What do you think? > Add docs about advanced spark application development > - > > Key: SPARK-4565 > URL: https://issues.apache.org/jira/browse/SPARK-4565 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Evan Sparks >Priority: Minor > > [~shivaram], [~jegonzal] and I have been working on a brief document based on > our experiences writing high performance spark applications - MLlib, GraphX, > pipelines, ml-matrix, etc. > It currently exists here - > https://docs.google.com/document/d/1gEIawzRsOwksV_bq4je3ofnd-7Xu-u409mdW-RXTDnQ/edit?usp=sharing > Would it make sense to add these tips and tricks to the Spark Wiki? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4582) Add getVectors to Word2VecModel
[ https://issues.apache.org/jira/browse/SPARK-4582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223861#comment-14223861 ] Apache Spark commented on SPARK-4582: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/3437 > Add getVectors to Word2VecModel > --- > > Key: SPARK-4582 > URL: https://issues.apache.org/jira/browse/SPARK-4582 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng >Priority: Minor > > Add getVectors to Word2VecModel for further processing. PR for branch-1.2: > https://github.com/apache/spark/pull/3309 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4565) Add docs about advanced spark application development
[ https://issues.apache.org/jira/browse/SPARK-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223856#comment-14223856 ] Joseph E. Gonzalez commented on SPARK-4565: --- Yes! However, we might want to organize it a bit more? > Add docs about advanced spark application development > - > > Key: SPARK-4565 > URL: https://issues.apache.org/jira/browse/SPARK-4565 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Evan Sparks >Priority: Minor > > [~shivaram], [~jegonzal] and I have been working on a brief document based on > our experiences writing high performance spark applications - MLlib, GraphX, > pipelines, ml-matrix, etc. > It currently exists here - > https://docs.google.com/document/d/1gEIawzRsOwksV_bq4je3ofnd-7Xu-u409mdW-RXTDnQ/edit?usp=sharing > Would it make sense to add these tips and tricks to the Spark Wiki? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4395) Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour
[ https://issues.apache.org/jira/browse/SPARK-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223853#comment-14223853 ] Davies Liu commented on SPARK-4395: --- [~lian cheng] Could you help to investigate the cache issue here? > Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour > -- > > Key: SPARK-4395 > URL: https://issues.apache.org/jira/browse/SPARK-4395 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.2.0 > Environment: version 1.2.0-SNAPSHOT >Reporter: Sameer Farooqui > > When I run this command it hangs for one to many hours and then finally > returns with successful results: > >>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect() > Note, the lab environment below is still active, so let me know if you'd like > to just access it directly. > +++ My Environment +++ > - 1-node cluster in Amazon > - RedHat 6.5 64-bit > - java version "1.7.0_67" > - SBT version: sbt-0.13.5 > - Scala version: scala-2.11.2 > Ran: > sudo yum -y update > git clone https://github.com/apache/spark > sudo sbt assembly > +++ Data file used +++ > http://blueplastic.com/databricks/movielens/ratings.dat > {code} > >>> import re > >>> import string > >>> from pyspark.sql import SQLContext, Row > >>> sqlContext = SQLContext(sc) > >>> RATINGS_PATTERN = '^(\d+)::(\d+)::(\d+)::(\d+)' > >>> > >>> def parse_ratings_line(line): > ... match = re.search(RATINGS_PATTERN, line) > ... if match is None: > ... # Optionally, you can change this to just ignore if each line of > data is not critical. > ... raise ValueError("Invalid ratings line: %s" % line) > ... return Row( > ... UserID= int(match.group(1)), > ... MovieID = int(match.group(2)), > ... Rating= int(match.group(3)), > ... Timestamp = int(match.group(4))) > ... > >>> ratings_base_RDD = > >>> (sc.textFile("file:///home/ec2-user/movielens/ratings.dat") > ...# Call the parse_ratings_line function on each line.
> ....map(parse_ratings_line) > ...# Caches the objects in memory since they will be queried > multiple times. > ....cache()) > >>> ratings_base_RDD.count() > 1000209 > >>> ratings_base_RDD.first() > Row(MovieID=1193, Rating=5, Timestamp=978300760, UserID=1) > >>> schemaRatings = sqlContext.inferSchema(ratings_base_RDD) > >>> schemaRatings.registerTempTable("RatingsTable") > >>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect() > {code} > (Now the Python shell hangs...) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
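The {code} transcript above only runs inside a PySpark shell. A self-contained sketch of just the line parser is below; it returns a plain dict in place of pyspark.sql.Row (an assumption for illustration, so the sketch runs without Spark), and uses the same regex as the report:

```python
import re

# Pattern from the transcript above: UserID::MovieID::Rating::Timestamp
RATINGS_PATTERN = r'^(\d+)::(\d+)::(\d+)::(\d+)'

def parse_ratings_line(line):
    """Parse one MovieLens ratings line into a dict of ints.

    A dict stands in for pyspark.sql.Row here so this runs without Spark.
    """
    match = re.search(RATINGS_PATTERN, line)
    if match is None:
        # Reject malformed lines explicitly rather than silently skipping them.
        raise ValueError("Invalid ratings line: %s" % line)
    return {
        'UserID': int(match.group(1)),
        'MovieID': int(match.group(2)),
        'Rating': int(match.group(3)),
        'Timestamp': int(match.group(4)),
    }
```

In the actual job this function would be passed to `.map(...)` over the text file, as in the transcript.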
[jira] [Created] (SPARK-4582) Add getVectors to Word2VecModel
Xiangrui Meng created SPARK-4582: Summary: Add getVectors to Word2VecModel Key: SPARK-4582 URL: https://issues.apache.org/jira/browse/SPARK-4582 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Priority: Minor Add getVectors to Word2VecModel for further processing. PR for branch-1.2: https://github.com/apache/spark/pull/3309 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4395) Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour
[ https://issues.apache.org/jira/browse/SPARK-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223851#comment-14223851 ] Sameer Farooqui commented on SPARK-4395: Hi Davies and Michael, I can confirm that this works if I move the .cache() to AFTER the inferSchema as Davies suggested. But if the cache comes first, then the hang occurs. The workaround is fine by me for now, although other people could also run into this if they're not aware of this JIRA... > Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour > -- > > Key: SPARK-4395 > URL: https://issues.apache.org/jira/browse/SPARK-4395 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.2.0 > Environment: version 1.2.0-SNAPSHOT >Reporter: Sameer Farooqui > > When I run this command it hangs for one to many hours and then finally > returns with successful results: > >>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect() > Note, the lab environment below is still active, so let me know if you'd like > to just access it directly. > +++ My Environment +++ > - 1-node cluster in Amazon > - RedHat 6.5 64-bit > - java version "1.7.0_67" > - SBT version: sbt-0.13.5 > - Scala version: scala-2.11.2 > Ran: > sudo yum -y update > git clone https://github.com/apache/spark > sudo sbt assembly > +++ Data file used +++ > http://blueplastic.com/databricks/movielens/ratings.dat > {code} > >>> import re > >>> import string > >>> from pyspark.sql import SQLContext, Row > >>> sqlContext = SQLContext(sc) > >>> RATINGS_PATTERN = '^(\d+)::(\d+)::(\d+)::(\d+)' > >>> > >>> def parse_ratings_line(line): > ... match = re.search(RATINGS_PATTERN, line) > ... if match is None: > ... # Optionally, you can change this to just ignore if each line of > data is not critical. > ... raise ValueError("Invalid ratings line: %s" % line) > ... return Row( > ... UserID= int(match.group(1)), > ... MovieID = int(match.group(2)), > ...
Rating= int(match.group(3)), > ... Timestamp = int(match.group(4))) > ... > >>> ratings_base_RDD = > >>> (sc.textFile("file:///home/ec2-user/movielens/ratings.dat") > ...# Call the parse_ratings_line function on each line. > ....map(parse_ratings_line) > ...# Caches the objects in memory since they will be queried > multiple times. > ....cache()) > >>> ratings_base_RDD.count() > 1000209 > >>> ratings_base_RDD.first() > Row(MovieID=1193, Rating=5, Timestamp=978300760, UserID=1) > >>> schemaRatings = sqlContext.inferSchema(ratings_base_RDD) > >>> schemaRatings.registerTempTable("RatingsTable") > >>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect() > {code} > (Now the Python shell hangs...) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4578) Row.asDict() should keep the type of values
[ https://issues.apache.org/jira/browse/SPARK-4578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4578. Resolution: Fixed Fix Version/s: 1.2.0 Thanks Davies, I've resolved this. > Row.asDict() should keep the type of values > --- > > Key: SPARK-4578 > URL: https://issues.apache.org/jira/browse/SPARK-4578 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.2.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.2.0 > > > Currently, a nested Row is returned as a tuple; it should be a Row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
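The behavior SPARK-4578 asks for can be illustrated with a tiny Row stand-in; this is not PySpark's implementation, and the class below is invented for illustration (it mimics pyspark.sql.Row's sorting of keyword-argument field names):

```python
class Row(tuple):
    """Minimal stand-in for pyspark.sql.Row, only to illustrate the fix."""
    def __new__(cls, **kwargs):
        names = sorted(kwargs)  # Row sorts field names given as kwargs
        row = tuple.__new__(cls, (kwargs[n] for n in names))
        row.__fields__ = names
        return row

    def asDict(self):
        # The point of the fix: keep values as-is, so a nested Row stays
        # a Row rather than degrading to a bare tuple.
        return dict(zip(self.__fields__, self))

inner = Row(a=1, b=2)
outer = Row(x=inner, y=3)
d = outer.asDict()
```

With the fix, `d['x']` is still a `Row` (and can itself be converted via `asDict()`), instead of an anonymous tuple that has lost its field names.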
[jira] [Updated] (SPARK-4580) Document random forests and boosting in programming guide
[ https://issues.apache.org/jira/browse/SPARK-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4580: - Assignee: Joseph K. Bradley > Document random forests and boosting in programming guide > - > > Key: SPARK-4580 > URL: https://issues.apache.org/jira/browse/SPARK-4580 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > New items in Spark 1.2 require documentation updates, especially in the > programming guide: > * RandomForest > * GradientBoostedTrees -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4562) GLM testing time regressions from Spark 1.1
[ https://issues.apache.org/jira/browse/SPARK-4562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4562. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3420 [https://github.com/apache/spark/pull/3420] > GLM testing time regressions from Spark 1.1 > --- > > Key: SPARK-4562 > URL: https://issues.apache.org/jira/browse/SPARK-4562 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 1.2.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.2.0 > > > There is a performance regression in the GLM tests, introduced by a > serialization change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4525) MesosSchedulerBackend.resourceOffers cannot decline unused offers from acceptedOffers
[ https://issues.apache.org/jira/browse/SPARK-4525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223841#comment-14223841 ] Apache Spark commented on SPARK-4525: - User 'pwendell' has created a pull request for this issue: https://github.com/apache/spark/pull/3436 > MesosSchedulerBackend.resourceOffers cannot decline unused offers from > acceptedOffers > - > > Key: SPARK-4525 > URL: https://issues.apache.org/jira/browse/SPARK-4525 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.2.0, 1.3.0 >Reporter: Jongyoul Lee >Assignee: Jongyoul Lee >Priority: Blocker > > After the resourceOffers function was refactored in SPARK-2269, it no longer > declines unused offers among the accepted offers. Previously this happened > implicitly: when driver.launchTasks was called with an empty task collection, > the driver declined the offer via declineOffer(offer.id). > {quote} > Invoking this function with an empty collection of tasks declines offers in > their entirety (see SchedulerDriver.declineOffer(OfferID, Filters)). > - > http://mesos.apache.org/api/latest/java/org/apache/mesos/MesosSchedulerDriver.html#launchTasks(OfferID,%20java.util.Collection,%20Filters) > {quote} > In branch-1.1, resourceOffers called launchTasks for every offer it received, > so the driver declined unused resources. In current master, offers are first > split into accepted and declined offers based on their resources: > declinedOffers are declined explicitly, and offers with tasks from > acceptedOffers are launched by driver.launchTasks, but offers from > acceptedOffers that end up without tasks are neither launched with an empty > task list nor declined explicitly. As a result, the Mesos master considers > those offers still in use by the TaskScheduler and sees no remaining > resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
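The invariant the bug violates — every offer must end up either launched or declined — can be sketched in miniature. The stub driver and function names below are invented for illustration (the real code is Scala against org.apache.mesos.MesosSchedulerDriver):

```python
class StubDriver:
    """Hypothetical stand-in for the Mesos scheduler driver."""
    def __init__(self):
        self.launched = []   # offer ids launched with at least one task
        self.declined = []   # offer ids declined

    def launchTasks(self, offer_id, tasks):
        # Per the Mesos javadoc quoted above, launching an empty task
        # collection declines the offer in its entirety.
        (self.launched if tasks else self.declined).append(offer_id)

    def declineOffer(self, offer_id):
        self.declined.append(offer_id)

def resource_offers(driver, offer_ids, tasks_by_offer):
    """Launch tasks where we have them; decline every other offer explicitly."""
    for offer_id in offer_ids:
        tasks = tasks_by_offer.get(offer_id, [])
        if tasks:
            driver.launchTasks(offer_id, tasks)
        else:
            # The bug: accepted offers that ended up with no tasks were
            # neither launched nor declined, so Mesos kept their
            # resources reserved. Decline them explicitly.
            driver.declineOffer(offer_id)

driver = StubDriver()
resource_offers(driver, ["o1", "o2", "o3"], {"o1": ["task-a"]})
```

After the call, every offer is accounted for: "o1" is launched and "o2"/"o3" are declined, so the Mesos master can re-offer those resources.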
[jira] [Updated] (SPARK-4180) SparkContext constructor should throw exception if another SparkContext is already running
[ https://issues.apache.org/jira/browse/SPARK-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4180: - Assignee: Josh Rosen > SparkContext constructor should throw exception if another SparkContext is > already running > -- > > Key: SPARK-4180 > URL: https://issues.apache.org/jira/browse/SPARK-4180 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > Fix For: 1.2.0 > > > Spark does not currently support multiple concurrently-running SparkContexts > in the same JVM (see SPARK-2243). Therefore, SparkContext's constructor > should throw an exception if there is an active SparkContext that has not > been shut down via {{stop()}}. > PySpark already does this, but the Scala SparkContext should do the same > thing. The current behavior with multiple active contexts is unspecified / > not understood and it may be the source of confusing errors (see the user > error report in SPARK-4080, for example). > This should be pretty easy to add: just add a {{activeSparkContext}} field to > the SparkContext companion object and {{synchronize}} on it in the > constructor and {{stop()}} methods; see PySpark's {{context.py}} file for an > example of this approach. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
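The guard proposed above — a companion-object field plus synchronization in the constructor and stop() — has the following shape in plain Python. The class and message are hypothetical, not Spark's actual code:

```python
import threading

class SingleActiveContext:
    """Sketch of the SPARK-4180 guard: at most one active context per
    process; all names here are illustrative, not Spark's."""
    _lock = threading.Lock()      # plays the role of synchronize {...}
    _active = None                # plays the role of activeSparkContext

    def __init__(self):
        with SingleActiveContext._lock:
            if SingleActiveContext._active is not None:
                # Fail fast instead of entering unspecified behavior.
                raise ValueError(
                    "Another context is already running in this process; "
                    "call stop() on it before constructing a new one.")
            SingleActiveContext._active = self

    def stop(self):
        with SingleActiveContext._lock:
            if SingleActiveContext._active is self:
                SingleActiveContext._active = None
```

Constructing a second instance while the first is active raises immediately; after `stop()`, construction succeeds again, mirroring the PySpark behavior the issue cites.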
[jira] [Updated] (SPARK-4578) Row.asDict() should keep the type of values
[ https://issues.apache.org/jira/browse/SPARK-4578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4578: - Assignee: Davies Liu > Row.asDict() should keep the type of values > --- > > Key: SPARK-4578 > URL: https://issues.apache.org/jira/browse/SPARK-4578 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.2.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > > Currently, a nested Row is returned as a tuple; it should be a Row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4581) Refactorize StandardScaler to improve the transformation performance
[ https://issues.apache.org/jira/browse/SPARK-4581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223792#comment-14223792 ] Apache Spark commented on SPARK-4581: - User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/3435 > Refactorize StandardScaler to improve the transformation performance > > > Key: SPARK-4581 > URL: https://issues.apache.org/jira/browse/SPARK-4581 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: DB Tsai > > The following optimizations are done to improve the StandardScaler model > transformation performance. > 1) Convert the Breeze dense vector to a primitive vector to reduce the overhead. > 2) Since the mean can potentially be a sparse vector, we explicitly convert it > to a dense primitive vector. > 3) Keep local references to the `shift` and `factor` arrays so the JVM can > locate the values with a single operation. > 4) In the pattern matching part, we use the mllib SparseVector/DenseVector > instead of breeze's vectors to make the codebase cleaner. > Benchmark with the mnist8m dataset: > Before, > DenseVector withMean and withStd: 50.97secs > DenseVector withMean and withoutStd: 42.11secs > DenseVector withoutMean and withStd: 8.75secs > SparseVector withoutMean and withStd: 5.437secs > With this PR, > DenseVector withMean and withStd: 5.76secs > DenseVector withMean and withoutStd: 5.28secs > DenseVector withoutMean and withStd: 5.30secs > SparseVector withoutMean and withStd: 1.27secs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4581) Refactorize StandardScaler to improve the transformation performance
DB Tsai created SPARK-4581: -- Summary: Refactorize StandardScaler to improve the transformation performance Key: SPARK-4581 URL: https://issues.apache.org/jira/browse/SPARK-4581 Project: Spark Issue Type: Improvement Components: MLlib Reporter: DB Tsai The following optimizations are done to improve the StandardScaler model transformation performance. 1) Convert the Breeze dense vector to a primitive vector to reduce the overhead. 2) Since the mean can potentially be a sparse vector, we explicitly convert it to a dense primitive vector. 3) Keep local references to the `shift` and `factor` arrays so the JVM can locate the values with a single operation. 4) In the pattern matching part, we use the mllib SparseVector/DenseVector instead of breeze's vectors to make the codebase cleaner. Benchmark with the mnist8m dataset: Before, DenseVector withMean and withStd: 50.97secs DenseVector withMean and withoutStd: 42.11secs DenseVector withoutMean and withStd: 8.75secs SparseVector withoutMean and withStd: 5.437secs With this PR, DenseVector withMean and withStd: 5.76secs DenseVector withMean and withoutStd: 5.28secs DenseVector withoutMean and withStd: 5.30secs SparseVector withoutMean and withStd: 1.27secs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
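The transformation being optimized is ordinary standardization, (x - mean) / std. A plain-Python sketch of its shape — densify the mean once up front, precompute the scale factors, and hoist the arrays into locals before the hot loop, echoing optimizations 2) and 3) — is below; this is illustrative only, not MLlib's Scala code:

```python
def make_standardizer(mean, std, with_mean=True, with_std=True):
    """Build a standardization function; mean/std are per-feature lists."""
    # Densify once up front (the real fix converts a possibly-sparse
    # Breeze mean vector to a dense primitive array).
    shift = list(mean)
    # Precompute 1/std, guarding against zero-variance features.
    factor = [1.0 / s if s != 0.0 else 0.0 for s in std]

    def transform(values):
        # Hoist the arrays into locals, mirroring optimization 3):
        # one local lookup per access instead of a field dereference.
        sh, fa = shift, factor
        out = list(values)
        for i in range(len(out)):
            if with_mean:
                out[i] -= sh[i]
            if with_std:
                out[i] *= fa[i]
        return out

    return transform

scale = make_standardizer(mean=[1.0, 2.0], std=[2.0, 4.0])
```

Multiplying by a precomputed reciprocal instead of dividing per element is the same trade the benchmark above rewards: pay the setup cost once, keep the per-row loop cheap.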
[jira] [Commented] (SPARK-4578) Row.asDict() should keep the type of values
[ https://issues.apache.org/jira/browse/SPARK-4578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223742#comment-14223742 ] Apache Spark commented on SPARK-4578: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3434 > Row.asDict() should keep the type of values > --- > > Key: SPARK-4578 > URL: https://issues.apache.org/jira/browse/SPARK-4578 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.2.0 >Reporter: Davies Liu >Priority: Blocker > > Currently, a nested Row is returned as a tuple; it should be a Row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4447) Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha
[ https://issues.apache.org/jira/browse/SPARK-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza reassigned SPARK-4447: - Assignee: Sandy Ryza (was: Patrick Wendell) > Remove layers of abstraction in YARN code no longer needed after dropping > yarn-alpha > > > Key: SPARK-4447 > URL: https://issues.apache.org/jira/browse/SPARK-4447 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.3.0 >Reporter: Sandy Ryza >Assignee: Sandy Ryza > > For example, YarnRMClient and YarnRMClientImpl can be merged > YarnAllocator and YarnAllocationHandler can be merged -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4447) Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha
[ https://issues.apache.org/jira/browse/SPARK-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza reassigned SPARK-4447: - Assignee: Patrick Wendell > Remove layers of abstraction in YARN code no longer needed after dropping > yarn-alpha > > > Key: SPARK-4447 > URL: https://issues.apache.org/jira/browse/SPARK-4447 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.3.0 >Reporter: Sandy Ryza >Assignee: Patrick Wendell > > For example, YarnRMClient and YarnRMClientImpl can be merged > YarnAllocator and YarnAllocationHandler can be merged -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4196) Streaming + checkpointing + saveAsNewAPIHadoopFiles = NotSerializableException for Hadoop Configuration
[ https://issues.apache.org/jira/browse/SPARK-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4196: --- Assignee: Tathagata Das (was: Patrick Wendell) > Streaming + checkpointing + saveAsNewAPIHadoopFiles = > NotSerializableException for Hadoop Configuration > --- > > Key: SPARK-4196 > URL: https://issues.apache.org/jira/browse/SPARK-4196 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Sean Owen >Assignee: Tathagata Das > > I am reasonably sure there is some issue here in Streaming and that I'm not > missing something basic, but not 100%. I went ahead and posted it as a JIRA > to track, since it's come up a few times before without resolution, and right > now I can't get checkpointing to work at all. > When Spark Streaming checkpointing is enabled, I see a > NotSerializableException thrown for a Hadoop Configuration object, and it > seems like it is not one from my user code. > Before I post my particular instance see > http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3c1408135046777-12202.p...@n3.nabble.com%3E > for another occurrence. > I was also on customer site last week debugging an identical issue with > checkpointing in a Scala-based program and they also could not enable > checkpointing without hitting exactly this error. 
> The essence of my code is: > {code} > final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf); > JavaStreamingContextFactory streamingContextFactory = new > JavaStreamingContextFactory() { > @Override > public JavaStreamingContext create() { > return new JavaStreamingContext(sparkContext, new > Duration(batchDurationMS)); > } > }; > streamingContext = JavaStreamingContext.getOrCreate( > checkpointDirString, sparkContext.hadoopConfiguration(), > streamingContextFactory, false); > streamingContext.checkpoint(checkpointDirString); > {code} > It yields: > {code} > 2014-10-31 14:29:00,211 ERROR OneForOneStrategy:66 > org.apache.hadoop.conf.Configuration > - field (class > "org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9", > name: "conf$2", type: "class org.apache.hadoop.conf.Configuration") > - object (class > "org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9", > ) > - field (class "org.apache.spark.streaming.dstream.ForEachDStream", > name: "org$apache$spark$streaming$dstream$ForEachDStream$$foreachFunc", > type: "interface scala.Function2") > - object (class "org.apache.spark.streaming.dstream.ForEachDStream", > org.apache.spark.streaming.dstream.ForEachDStream@cb8016a) > ... > {code} > This looks like it's due to PairRDDFunctions, as this saveFunc seems > to be org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9 > : > {code} > def saveAsNewAPIHadoopFiles( > prefix: String, > suffix: String, > keyClass: Class[_], > valueClass: Class[_], > outputFormatClass: Class[_ <: NewOutputFormat[_, _]], > conf: Configuration = new Configuration > ) { > val saveFunc = (rdd: RDD[(K, V)], time: Time) => { > val file = rddToFileName(prefix, suffix, time) > rdd.saveAsNewAPIHadoopFile(file, keyClass, valueClass, > outputFormatClass, conf) > } > self.foreachRDD(saveFunc) > } > {code} > Is that not a problem? but then I don't know how it would ever work in Spark. 
> But then again I don't see why this is an issue, and why it occurs only when > checkpointing is enabled. > Long-shot, but I wonder if it is related to closure issues like > https://issues.apache.org/jira/browse/SPARK-1866 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
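The failure mode in the stack trace — a closure dragging a non-serializable Hadoop Configuration along when the DStream graph is serialized for checkpointing — can be reproduced in miniature with Python's pickle. The classes below are invented for illustration, with a thread lock standing in for the unserializable Configuration:

```python
import pickle
import threading

class SaveJob:
    """Serializable work description: carries only plain data."""
    def __init__(self, prefix, suffix):
        self.prefix = prefix
        self.suffix = suffix

class BadSaveJob(SaveJob):
    """Also drags along an unserializable handle, like the captured conf."""
    def __init__(self, prefix, suffix):
        super().__init__(prefix, suffix)
        self.conf = threading.Lock()  # a lock cannot be pickled

def can_serialize(obj):
    """True if obj survives pickling, the analogue of Java serialization."""
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False
```

The plain-data job serializes; the one holding a live handle does not — which is why the error only surfaces when checkpointing forces the whole graph, closures included, through serialization.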