[jira] [Updated] (SPARK-4584) 2x Performance regression for Spark-on-YARN

2014-11-24 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4584:
--
Priority: Blocker  (was: Major)

> 2x Performance regression for Spark-on-YARN
> ---
>
> Key: SPARK-4584
> URL: https://issues.apache.org/jira/browse/SPARK-4584
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Nishkam Ravi
>Priority: Blocker
>
> Significant performance regression (up to 2x) observed for Spark-on-YARN after 
> the 1.2 rebase. The offending commit is 70e824f750aa8ed446eec104ba158b0503ba58a9 
> from Oct 7th. The problem can be reproduced with JavaWordCount against a large 
> enough input dataset in YARN cluster mode.






[jira] [Commented] (SPARK-4596) Refactorize Normalizer to make code cleaner

2014-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224160#comment-14224160
 ] 

Apache Spark commented on SPARK-4596:
-

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/3446

> Refactorize Normalizer to make code cleaner
> ---
>
> Key: SPARK-4596
> URL: https://issues.apache.org/jira/browse/SPARK-4596
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: DB Tsai
>
> In this refactoring, performance is slightly improved by removing the overhead 
> of Breeze vectors. The bottleneck is still in the Breeze norm, which is 
> implemented via activeIterator. That inefficiency will be addressed in a 
> follow-up PR. At minimum, this PR makes the code more consistent across the 
> codebase.
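For illustration, here is a minimal standalone sketch of the point above: computing the L2 norm with a tight loop over a vector's raw values avoids the per-element iterator overhead attributed to Breeze's activeIterator. The object and method names below are made up for illustration and are not the actual MLlib patch.

{code}
// Illustrative sketch only (not the MLlib change): compute an L2 norm directly
// over the vector's backing array instead of going through a generic iterator.
object NormSketch {
  // Dense case: a tight while-loop over the raw values.
  def l2NormDense(values: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < values.length) {
      sum += values(i) * values(i)
      i += 1
    }
    math.sqrt(sum)
  }

  // Sparse case: only the stored (non-zero) values contribute to the L2 norm,
  // so the same loop over the stored values array applies.
  def l2NormSparse(storedValues: Array[Double]): Double = l2NormDense(storedValues)

  def main(args: Array[String]): Unit = {
    println(l2NormDense(Array(3.0, 4.0)))  // 5.0
  }
}
{code}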






[jira] [Updated] (SPARK-4595) Spark MetricsServlet is not worked because of initialization ordering

2014-11-24 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4595:
--
Priority: Blocker  (was: Major)

> Spark MetricsServlet is not worked because of initialization ordering
> -
>
> Key: SPARK-4595
> URL: https://issues.apache.org/jira/browse/SPARK-4595
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Saisai Shao
>Priority: Blocker
>
> The web UI is initialized before the MetricsSystem is started; at that point 
> the MetricsServlet does not exist yet, so it fails to register with the web 
> UI. 
> Instead, the MetricsServlet handlers should be added to the web UI after the 
> MetricsSystem is started.






[jira] [Commented] (SPARK-4595) Spark MetricsServlet is not worked because of initialization ordering

2014-11-24 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224159#comment-14224159
 ] 

Josh Rosen commented on SPARK-4595:
---

Marking this as a blocker until we triage tomorrow, since it seems like this might 
be a 1.2.0 regression.  [~jerryshao], do you know if this bug is new in 1.2.0?

> Spark MetricsServlet is not worked because of initialization ordering
> -
>
> Key: SPARK-4595
> URL: https://issues.apache.org/jira/browse/SPARK-4595
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Saisai Shao
>Priority: Blocker
>
> The web UI is initialized before the MetricsSystem is started; at that point 
> the MetricsServlet does not exist yet, so it fails to register with the web 
> UI. 
> Instead, the MetricsServlet handlers should be added to the web UI after the 
> MetricsSystem is started.






[jira] [Created] (SPARK-4596) Refactorize Normalizer to make code cleaner

2014-11-24 Thread DB Tsai (JIRA)
DB Tsai created SPARK-4596:
--

 Summary: Refactorize Normalizer to make code cleaner
 Key: SPARK-4596
 URL: https://issues.apache.org/jira/browse/SPARK-4596
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: DB Tsai


In this refactoring, performance is slightly improved by removing the overhead 
of Breeze vectors. The bottleneck is still in the Breeze norm, which is 
implemented via activeIterator. That inefficiency will be addressed in a 
follow-up PR. At minimum, this PR makes the code more consistent across the 
codebase.






[jira] [Commented] (SPARK-4595) Spark MetricsServlet is not worked because of initialization ordering

2014-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224146#comment-14224146
 ] 

Apache Spark commented on SPARK-4595:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/3444

> Spark MetricsServlet is not worked because of initialization ordering
> -
>
> Key: SPARK-4595
> URL: https://issues.apache.org/jira/browse/SPARK-4595
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Saisai Shao
>
> The web UI is initialized before the MetricsSystem is started; at that point 
> the MetricsServlet does not exist yet, so it fails to register with the web 
> UI. 
> Instead, the MetricsServlet handlers should be added to the web UI after the 
> MetricsSystem is started.






[jira] [Updated] (SPARK-4595) Spark MetricsServlet is not worked because of initialization ordering

2014-11-24 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-4595:
---
Summary: Spark MetricsServlet is not worked because of initialization 
ordering  (was: Spark MetricsServlet is not enabled because of initialization 
ordering)

> Spark MetricsServlet is not worked because of initialization ordering
> -
>
> Key: SPARK-4595
> URL: https://issues.apache.org/jira/browse/SPARK-4595
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Saisai Shao
>
> The web UI is initialized before the MetricsSystem is started; at that point 
> the MetricsServlet does not exist yet, so it fails to register with the web 
> UI. 
> Instead, the MetricsServlet handlers should be added to the web UI after the 
> MetricsSystem is started.






[jira] [Created] (SPARK-4595) Spark MetricsServlet is not enabled because of initialization ordering

2014-11-24 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-4595:
--

 Summary: Spark MetricsServlet is not enabled because of 
initialization ordering
 Key: SPARK-4595
 URL: https://issues.apache.org/jira/browse/SPARK-4595
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Saisai Shao


The web UI is initialized before the MetricsSystem is started; at that point 
the MetricsServlet does not exist yet, so it fails to register with the web 
UI. 

Instead, the MetricsServlet handlers should be added to the web UI after the 
MetricsSystem is started.
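For illustration, a minimal sketch of the ordering fix described above, using hypothetical stub types rather than Spark's actual WebUI/MetricsSystem classes: start the metrics system first, then attach its servlet handlers to the already-initialized web UI.

{code}
// Hypothetical, simplified stand-ins to show the ordering fix; not Spark code.
class StubHandler(val path: String)

class StubMetricsSystem {
  private var started = false
  def start(): Unit = { started = true }
  // Servlet handlers only exist once the metrics system has started.
  def getServletHandlers: Array[StubHandler] =
    if (started) Array(new StubHandler("/metrics/json")) else Array.empty
}

class StubWebUI {
  private var handlers = List.empty[StubHandler]
  def attachHandler(h: StubHandler): Unit = { handlers ::= h }
  def attachedPaths: List[String] = handlers.map(_.path)
}

object MetricsOrderingSketch {
  def main(args: Array[String]): Unit = {
    val ui = new StubWebUI                 // web UI is initialized first, as today
    val metricsSystem = new StubMetricsSystem

    metricsSystem.start()                  // start metrics before wiring the servlet
    metricsSystem.getServletHandlers.foreach(ui.attachHandler)

    println(ui.attachedPaths)              // List(/metrics/json)
  }
}
{code}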






[jira] [Created] (SPARK-4594) Improvement the broadcast for HiveConf

2014-11-24 Thread Leo (JIRA)
Leo created SPARK-4594:
--

 Summary: Improvement the broadcast for HiveConf
 Key: SPARK-4594
 URL: https://issues.apache.org/jira/browse/SPARK-4594
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Leo
Priority: Minor


Every time we need to read a table from Hive, HadoopTableReader broadcasts the 
HiveConf to the cluster. Actually, within one application the HiveConf is a 
single object, so I think we can keep it in HiveContext and reuse it for every 
query. Although it is only about 50 KB, this is useful for JDBC users and 
streaming + SQL applications.
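The reuse idea can be sketched generically as follows. This is only an illustration of caching one broadcast per application instead of re-broadcasting per table read; the ConfHolder class is hypothetical and java.util.Properties stands in for HiveConf.

{code}
import java.util.Properties

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast

// Hypothetical sketch: broadcast the configuration lazily, once per application,
// and hand the same broadcast to every subsequent "query".
class ConfHolder(sc: SparkContext, conf: Properties) {
  private var cached: Broadcast[Properties] = _
  def broadcastConf: Broadcast[Properties] = synchronized {
    if (cached == null) cached = sc.broadcast(conf)
    cached
  }
}

object HiveConfReuseSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("conf-reuse").setMaster("local[2]"))
    val props = new Properties()
    props.setProperty("hive.metastore.uris", "thrift://example:9083")  // made-up value
    val holder = new ConfHolder(sc, props)

    // Both "queries" reuse the same broadcast instead of creating a new one.
    val b1 = holder.broadcastConf
    val b2 = holder.broadcastConf
    println(b1 eq b2)  // true
    sc.stop()
  }
}
{code}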






[jira] [Commented] (SPARK-4593) sum(1/0) would produce a very large number

2014-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224129#comment-14224129
 ] 

Apache Spark commented on SPARK-4593:
-

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/3443

> sum(1/0) would produce a very large number
> --
>
> Key: SPARK-4593
> URL: https://issues.apache.org/jira/browse/SPARK-4593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Adrian Wang
>Priority: Minor
>
> SELECT max(1/0) FROM src returns a very large number.






[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method

2014-11-24 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224132#comment-14224132
 ] 

Aaron Staple commented on SPARK-1503:
-

[~mengxr] [~rezazadeh] Ok, thanks for the heads up. Let me know if there's 
anything about the spec that should be handled differently. I covered most of 
the mathematics informally (the details are already covered formally in the 
references). In addition, the proposal describes a method of implementing 
TFOCS functionality in a distributed fashion, but it does not investigate 
existing distributed optimization systems.

> Implement Nesterov's accelerated first-order method
> ---
>
> Key: SPARK-1503
> URL: https://issues.apache.org/jira/browse/SPARK-1503
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Aaron Staple
>
> Nesterov's accelerated first-order method is a drop-in replacement for 
> steepest descent but it converges much faster. We should implement this 
> method and compare its performance with existing algorithms, including SGD 
> and L-BFGS.
> TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
> method and its variants on composite objectives.
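For readers unfamiliar with the method, here is a tiny single-machine sketch of the accelerated gradient update on a simple quadratic. It only shows the look-ahead/momentum step that distinguishes the method from plain gradient descent; the objective, step size, and iteration count are made up for illustration and have nothing to do with the TFOCS design discussed here.

{code}
// Minimize f(x) = 0.5 * ||x - target||^2 with Nesterov's accelerated gradient.
// The gradient at y is (y - target); t is a conservative fixed step size.
object NesterovSketch {
  def main(args: Array[String]): Unit = {
    val target = Array(3.0, -2.0)
    val t = 0.1
    var x = Array(0.0, 0.0)      // current iterate
    var xPrev = x.clone()        // previous iterate, used for the momentum term

    for (k <- 1 to 200) {
      val momentum = (k - 1).toDouble / (k + 2)
      // Look-ahead point y = x + momentum * (x - xPrev), then a gradient step from y.
      val y = x.indices.map(i => x(i) + momentum * (x(i) - xPrev(i))).toArray
      val next = y.indices.map(i => y(i) - t * (y(i) - target(i))).toArray
      xPrev = x
      x = next
    }
    println(x.mkString(", "))    // approaches 3.0, -2.0
  }
}
{code}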






[jira] [Created] (SPARK-4593) sum(1/0) would produce a very large number

2014-11-24 Thread Adrian Wang (JIRA)
Adrian Wang created SPARK-4593:
--

 Summary: sum(1/0) would produce a very large number
 Key: SPARK-4593
 URL: https://issues.apache.org/jira/browse/SPARK-4593
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang
Priority: Minor


SELECT max(1/0) FROM src returns a very large number.






[jira] [Commented] (SPARK-911) Support map pruning on sorted (K, V) RDD's

2014-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224123#comment-14224123
 ] 

Apache Spark commented on SPARK-911:


User 'aaronjosephs' has created a pull request for this issue:
https://github.com/apache/spark/pull/1381

> Support map pruning on sorted (K, V) RDD's
> --
>
> Key: SPARK-911
> URL: https://issues.apache.org/jira/browse/SPARK-911
> Project: Spark
>  Issue Type: Bug
>Reporter: Patrick Wendell
>
> If someone has sorted a (K, V) RDD, we should offer a way to filter a range of 
> the partitions that employs map pruning. This would be simple to implement 
> using a small range index within the RDD itself. A good example: I sort my 
> dataset by time and then want to serve queries that are restricted to a 
> certain time range.
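One possible way to express the pruning described above, sketched with Spark's PartitionPruningRDD developer API; this is only an illustration of the idea, not the approach taken by the linked pull request, and the data and range are made up.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.PartitionPruningRDD

// After sortByKey, each partition holds a contiguous key range, so a small
// (partition -> min/max key) index lets a range query skip whole partitions.
object RangePruneSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("range-prune").setMaster("local[2]"))
    val sorted = sc.parallelize(1 to 1000000, 8).map(i => (i.toLong, s"v$i")).sortByKey()

    // Build the range index with one small job: (partitionId, minKey, maxKey).
    val index = sorted.mapPartitionsWithIndex { (pid, iter) =>
      if (iter.hasNext) {
        var min = iter.next()._1
        var max = min
        while (iter.hasNext) {
          val k = iter.next()._1
          if (k < min) min = k
          if (k > max) max = k
        }
        Iterator((pid, min, max))
      } else Iterator.empty
    }.collect()

    // Serve a range query [lo, hi] by touching only the overlapping partitions.
    val (lo, hi) = (250000L, 260000L)
    val keep = index.collect { case (pid, min, max) if max >= lo && min <= hi => pid }.toSet
    val pruned = PartitionPruningRDD.create(sorted, keep.contains)
    println(pruned.filter { case (k, _) => k >= lo && k <= hi }.count())  // 10001
    sc.stop()
  }
}
{code}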






[jira] [Commented] (SPARK-4590) Early investigation of parameter server

2014-11-24 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224117#comment-14224117
 ] 

Reza Zadeh commented on SPARK-4590:
---

Some starting points: 
- http://stanford.edu/~rezab/papers/factorbird.pdf
- http://parameterserver.org/

More detailed comparisons coming.


> Early investigation of parameter server
> ---
>
> Key: SPARK-4590
> URL: https://issues.apache.org/jira/browse/SPARK-4590
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Reza Zadeh
>
> In the current implementation of GLM solvers, we save intermediate models on 
> the driver node and update them through broadcast and aggregation. Even with 
> torrent broadcast and tree aggregation added in 1.1, it is hard to go beyond 
> ~10 million features. This JIRA is for investigating the parameter server 
> approach, including algorithm, infrastructure, and dependencies.
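For context, the broadcast-and-aggregate pattern the issue refers to looks roughly like the sketch below. Plain {{aggregate}} is used here instead of tree aggregation, the gradient is for a toy least-squares objective, and all names and data are made up; this is not MLlib's actual solver code.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Rough sketch of the driver-centric update loop: broadcast the current weights,
// aggregate a gradient over the data, update the weights on the driver.
object BroadcastAggregateSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("glm-sketch").setMaster("local[2]"))
    val d = 3
    // Toy data: (label, features) generated from the fixed weight vector (1, 2, 3).
    val data = sc.parallelize(0 until 1000, 4).map { i =>
      val x = Array(1.0, (i % 10).toDouble, (i % 7).toDouble)
      (1.0 * x(0) + 2.0 * x(1) + 3.0 * x(2), x)
    }.cache()

    var weights = new Array[Double](d)
    for (iter <- 1 to 100) {
      val bcWeights = sc.broadcast(weights)
      // Sum of per-example least-squares gradients: (w.x - y) * x
      val grad = data.aggregate(new Array[Double](d))(
        (g, p) => {
          val (y, x) = p
          val err = (0 until d).map(j => bcWeights.value(j) * x(j)).sum - y
          (0 until d).foreach(j => g(j) += err * x(j))
          g
        },
        (g1, g2) => { (0 until d).foreach(j => g1(j) += g2(j)); g1 })
      (0 until d).foreach(j => weights(j) -= 0.01 * grad(j) / 1000.0)
      bcWeights.unpersist()
    }
    println(weights.mkString(", "))  // drifts toward 1.0, 2.0, 3.0
    sc.stop()
  }
}
{code}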






[jira] [Updated] (SPARK-4592) "Worker registration failed: Duplicate worker ID" error during Master failover

2014-11-24 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4592:
--
Attachment: log.txt

> "Worker registration failed: Duplicate worker ID" error during Master failover
> --
>
> Key: SPARK-4592
> URL: https://issues.apache.org/jira/browse/SPARK-4592
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.2.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
> Attachments: log.txt
>
>
> When running Spark Standalone in high-availability mode, we sometimes see 
> "Worker registration failed: Duplicate worker ID" errors which prevent 
> workers from reconnecting to the new active master.  I've attached full logs 
> from a reproduction in my integration test suite (which runs something 
> similar to Spark's FaultToleranceTest).  Here's the relevant excerpt from a 
> worker log during a failed run of the "rolling outage" test, which creates a 
> multi-master cluster then repeatedly kills the active master, waits for 
> workers to reconnect to a new active master, then kills that master, and so 
> on.
> {code}
> 14/11/23 02:23:02 INFO WorkerWebUI: Started WorkerWebUI at 
> http://172.17.0.90:8081
> 14/11/23 02:23:02 INFO Worker: Connecting to master 
> spark://172.17.0.86:7077...
> 14/11/23 02:23:02 INFO Worker: Connecting to master 
> spark://172.17.0.87:7077...
> 14/11/23 02:23:02 INFO Worker: Connecting to master 
> spark://172.17.0.88:7077...
> 14/11/23 02:23:02 INFO Worker: Successfully registered with master 
> spark://172.17.0.86:7077
> 14/11/23 02:23:03 INFO Worker: Asked to launch executor 
> app-20141123022303-/1 for spark-integration-tests
> 14/11/23 02:23:03 INFO ExecutorRunner: Launch command: "java" "-cp" 
> "::/opt/sparkconf:/opt/spark/assembly/target/scala-2.10/spark-assembly-1.2.1-SNAPSHOT-hadoop1.0.4.jar"
>  "-XX:MaxPermSize=128m" "-Dspark.driver.port=51271" "-Xms512M" "-Xmx512M" 
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" 
> "akka.tcp://sparkdri...@joshs-mbp.att.net:51271/user/CoarseGrainedScheduler" 
> "1" "172.17.0.90" "8" "app-20141123022303-" 
> "akka.tcp://sparkWorker@172.17.0.90:/user/Worker"
> 14/11/23 02:23:14 INFO Worker: Disassociated 
> [akka.tcp://sparkWorker@172.17.0.90:] -> 
> [akka.tcp://sparkMaster@172.17.0.86:7077] Disassociated !
> 14/11/23 02:23:14 ERROR Worker: Connection to master failed! Waiting for 
> master to reconnect...
> 14/11/23 02:23:14 INFO Worker: Connecting to master 
> spark://172.17.0.86:7077...
> 14/11/23 02:23:14 INFO Worker: Connecting to master 
> spark://172.17.0.87:7077...
> 14/11/23 02:23:14 INFO Worker: Connecting to master 
> spark://172.17.0.88:7077...
> 14/11/23 02:23:14 WARN ReliableDeliverySupervisor: Association with remote 
> system [akka.tcp://sparkMaster@172.17.0.86:7077] has failed, address is now 
> gated for [5000] ms. Reason is: [Disassociated].
> 14/11/23 02:23:14 INFO Worker: Disassociated 
> [akka.tcp://sparkWorker@172.17.0.90:] -> 
> [akka.tcp://sparkMaster@172.17.0.86:7077] Disassociated !
> 14/11/23 02:23:14 ERROR Worker: Connection to master failed! Waiting for 
> master to reconnect...
> 14/11/23 02:23:14 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef: 
> Message [org.apache.spark.deploy.DeployMessages$RegisterWorker] from 
> Actor[akka://sparkWorker/user/Worker#-1246122173] to 
> Actor[akka://sparkWorker/deadLetters] was not delivered. [1] dead letters 
> encountered. This logging can be turned off or adjusted with configuration 
> settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/11/23 02:23:14 INFO Worker: Not spawning another attempt to register with 
> the master, since there is an attempt scheduled already.
> 14/11/23 02:23:14 INFO LocalActorRef: Message 
> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from 
> Actor[akka://sparkWorker/deadLetters] to 
> Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40172.17.0.86%3A7077-2#343365613]
>  was not delivered. [2] dead letters encountered. This logging can be turned 
> off or adjusted with configuration settings 'akka.log-dead-letters' and 
> 'akka.log-dead-letters-during-shutdown'.
> 14/11/23 02:23:25 INFO Worker: Retrying connection to master (attempt # 1)
> 14/11/23 02:23:25 INFO Worker: Connecting to master 
> spark://172.17.0.86:7077...
> 14/11/23 02:23:25 INFO Worker: Connecting to master 
> spark://172.17.0.87:7077...
> 14/11/23 02:23:25 INFO Worker: Connecting to master 
> spark://172.17.0.88:7077...
> 14/11/23 02:23:36 INFO Worker: Retrying connection to master (attempt # 2)
> 14/11/23 02:23:36 INFO Worker: Connecting to master 
> spark://172.17.0.86:7077...
> 14/11

[jira] [Commented] (SPARK-4592) "Worker registration failed: Duplicate worker ID" error during Master failover

2014-11-24 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224108#comment-14224108
 ] 

Josh Rosen commented on SPARK-4592:
---

It looks like this is a bug that was introduced in [the 
patch|https://github.com/apache/spark/pull/2828] for SPARK-3736.  Prior to that 
patch, a worker that became disassociated from a master would wait for a master 
to initiate a reconnection.  This behavior caused problems when a master failed 
by stopping and restarting, since the restarted master would never know to 
initiate a reconnection.  That patch addressed this issue by having workers 
attempt to reconnect to the master if they think they've become disconnected.  
When reviewing that patch, I wrote a [long 
comment|https://github.com/apache/spark/pull/2828#issuecomment-59602394] that 
explains it in much more detail; that's a good reference for understanding its 
motivation.

This change introduced a problem, though: there's now a race condition during 
multi-master failover.  In the old multi-master code, a worker never initiates 
a reconnection attempt to a master; instead, it reconnects after the new active 
/ primary master tells the worker that it's the new master.  With the addition 
of worker-initiated reconnect, there's a race condition where the worker 
detects that it's become disconnected, goes down the list of known masters and 
tries to connect to each of them, successfully connects to the new primary 
master, then receives a {{ChangedMaster}} event and attempts to connect to the 
new primary master _even though it's already connected_, causing a duplicate 
worker registration.

There are a number of ways that we might fix this, but we have to be careful 
because it seems likely that the worker-initiated reconnect could have 
introduced other problems:

- What happens if a worker sends a reconnection attempt to a live master which 
is not the new primary?  Will that non-primary master reject or redirect those 
registrations, or will it register the workers and cause a split-brain scenario 
to occur?
- The Worker is implemented as an actor and thus does not have synchronization 
of its internal state since it assumes message-at-a-time processing, but the 
asynchronous re-registration timer thread may violate this assumption because 
it directly calls internal worker methods instead of sending messages to the 
worker's own mailbox.

One simple fix might be to have the worker never initiate reconnection attempts 
when running in a multi-master environment.  I still need to think through 
whether this will cause new problems similar to SPARK-3736.  I don't think it 
will be a problem because that patch was motivated by cases where the master 
forgot who the worker was and couldn't initiate a reconnect.  If the list of 
registered workers is stored durably in ZooKeeper such that a worker is never 
told that it has registered until a record of its registration has become 
durable, then I think this is fine: if a live master thinks that a worker has 
disconnected, then it will initiate a reconnection; when a new master fails 
over, it will reconnect all workers based on the list from ZooKeeper. 
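Purely as an illustration of the kind of guard discussed above (this is not Spark's Worker code; every name below is made up), the duplicate registration goes away if registration is idempotent and the worker does not race the failover in multi-master mode:

{code}
// Hypothetical sketch: ignore a "master changed" notification for the master we
// are already registered with, and skip self-initiated reconnects when several
// masters are configured (let the new active master announce itself instead).
object WorkerGuardSketch {
  def main(args: Array[String]): Unit = {
    val masters = Seq("spark://m1:7077", "spark://m2:7077")
    var activeMaster: Option[String] = None

    def register(url: String): Unit = { activeMaster = Some(url); println(s"registered with $url") }

    def onMasterChanged(newUrl: String): Unit =
      if (activeMaster.exists(_ == newUrl)) println(s"already registered with $newUrl, ignoring")
      else register(newUrl)

    def onDisconnected(): Unit =
      if (masters.size > 1) println("multi-master mode: wait for the new master to announce itself")
      else masters.headOption.foreach(register)

    register(masters.head)      // initial registration
    onDisconnected()            // old master died; do not race the failover
    onMasterChanged(masters(1)) // new active master announces itself once
    onMasterChanged(masters(1)) // a duplicate notification is ignored
  }
}
{code}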

> "Worker registration failed: Duplicate worker ID" error during Master failover
> --
>
> Key: SPARK-4592
> URL: https://issues.apache.org/jira/browse/SPARK-4592
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.2.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> When running Spark Standalone in high-availability mode, we sometimes see 
> "Worker registration failed: Duplicate worker ID" errors which prevent 
> workers from reconnecting to the new active master.  I've attached full logs 
> from a reproduction in my integration test suite (which runs something 
> similar to Spark's FaultToleranceTest).  Here's the relevant excerpt from a 
> worker log during a failed run of the "rolling outage" test, which creates a 
> multi-master cluster then repeatedly kills the active master, waits for 
> workers to reconnect to a new active master, then kills that master, and so 
> on.
> {code}
> 14/11/23 02:23:02 INFO WorkerWebUI: Started WorkerWebUI at 
> http://172.17.0.90:8081
> 14/11/23 02:23:02 INFO Worker: Connecting to master 
> spark://172.17.0.86:7077...
> 14/11/23 02:23:02 INFO Worker: Connecting to master 
> spark://172.17.0.87:7077...
> 14/11/23 02:23:02 INFO Worker: Connecting to master 
> spark://172.17.0.88:7077...
> 14/11/23 02:23:02 INFO Worker: Successfully registered with master 
> spark://172.17.0.86:7077
> 14/11/23 02:23:03 INFO Worker: Asked to launch executor 
> app-20141123022303-/1 for spark-integration-tests
> 14/11/23 02:23:03 INFO ExecutorRunner: Lau

[jira] [Created] (SPARK-4592) "Worker registration failed: Duplicate worker ID" error during Master failover

2014-11-24 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-4592:
-

 Summary: "Worker registration failed: Duplicate worker ID" error 
during Master failover
 Key: SPARK-4592
 URL: https://issues.apache.org/jira/browse/SPARK-4592
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Spark Core
Affects Versions: 1.2.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Blocker


When running Spark Standalone in high-availability mode, we sometimes see 
"Worker registration failed: Duplicate worker ID" errors which prevent workers 
from reconnecting to the new active master.  I've attached full logs from a 
reproduction in my integration test suite (which runs something similar to 
Spark's FaultToleranceTest).  Here's the relevant excerpt from a worker log 
during a failed run of the "rolling outage" test, which creates a multi-master 
cluster then repeatedly kills the active master, waits for workers to reconnect 
to a new active master, then kills that master, and so on.

{code}
14/11/23 02:23:02 INFO WorkerWebUI: Started WorkerWebUI at 
http://172.17.0.90:8081
14/11/23 02:23:02 INFO Worker: Connecting to master spark://172.17.0.86:7077...
14/11/23 02:23:02 INFO Worker: Connecting to master spark://172.17.0.87:7077...
14/11/23 02:23:02 INFO Worker: Connecting to master spark://172.17.0.88:7077...
14/11/23 02:23:02 INFO Worker: Successfully registered with master 
spark://172.17.0.86:7077
14/11/23 02:23:03 INFO Worker: Asked to launch executor 
app-20141123022303-/1 for spark-integration-tests
14/11/23 02:23:03 INFO ExecutorRunner: Launch command: "java" "-cp" 
"::/opt/sparkconf:/opt/spark/assembly/target/scala-2.10/spark-assembly-1.2.1-SNAPSHOT-hadoop1.0.4.jar"
 "-XX:MaxPermSize=128m" "-Dspark.driver.port=51271" "-Xms512M" "-Xmx512M" 
"org.apache.spark.executor.CoarseGrainedExecutorBackend" 
"akka.tcp://sparkdri...@joshs-mbp.att.net:51271/user/CoarseGrainedScheduler" 
"1" "172.17.0.90" "8" "app-20141123022303-" 
"akka.tcp://sparkWorker@172.17.0.90:/user/Worker"
14/11/23 02:23:14 INFO Worker: Disassociated 
[akka.tcp://sparkWorker@172.17.0.90:] -> 
[akka.tcp://sparkMaster@172.17.0.86:7077] Disassociated !
14/11/23 02:23:14 ERROR Worker: Connection to master failed! Waiting for master 
to reconnect...
14/11/23 02:23:14 INFO Worker: Connecting to master spark://172.17.0.86:7077...
14/11/23 02:23:14 INFO Worker: Connecting to master spark://172.17.0.87:7077...
14/11/23 02:23:14 INFO Worker: Connecting to master spark://172.17.0.88:7077...
14/11/23 02:23:14 WARN ReliableDeliverySupervisor: Association with remote 
system [akka.tcp://sparkMaster@172.17.0.86:7077] has failed, address is now 
gated for [5000] ms. Reason is: [Disassociated].
14/11/23 02:23:14 INFO Worker: Disassociated 
[akka.tcp://sparkWorker@172.17.0.90:] -> 
[akka.tcp://sparkMaster@172.17.0.86:7077] Disassociated !
14/11/23 02:23:14 ERROR Worker: Connection to master failed! Waiting for master 
to reconnect...
14/11/23 02:23:14 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef: Message 
[org.apache.spark.deploy.DeployMessages$RegisterWorker] from 
Actor[akka://sparkWorker/user/Worker#-1246122173] to 
Actor[akka://sparkWorker/deadLetters] was not delivered. [1] dead letters 
encountered. This logging can be turned off or adjusted with configuration 
settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/11/23 02:23:14 INFO Worker: Not spawning another attempt to register with 
the master, since there is an attempt scheduled already.
14/11/23 02:23:14 INFO LocalActorRef: Message 
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from 
Actor[akka://sparkWorker/deadLetters] to 
Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40172.17.0.86%3A7077-2#343365613]
 was not delivered. [2] dead letters encountered. This logging can be turned 
off or adjusted with configuration settings 'akka.log-dead-letters' and 
'akka.log-dead-letters-during-shutdown'.
14/11/23 02:23:25 INFO Worker: Retrying connection to master (attempt # 1)
14/11/23 02:23:25 INFO Worker: Connecting to master spark://172.17.0.86:7077...
14/11/23 02:23:25 INFO Worker: Connecting to master spark://172.17.0.87:7077...
14/11/23 02:23:25 INFO Worker: Connecting to master spark://172.17.0.88:7077...
14/11/23 02:23:36 INFO Worker: Retrying connection to master (attempt # 2)
14/11/23 02:23:36 INFO Worker: Connecting to master spark://172.17.0.86:7077...
14/11/23 02:23:36 INFO Worker: Connecting to master spark://172.17.0.87:7077...
14/11/23 02:23:36 INFO Worker: Connecting to master spark://172.17.0.88:7077...
14/11/23 02:23:42 INFO Worker: Master has changed, new master is at 
spark://172.17.0.87:7077
14/11/23 02:23:47 INFO Worker: Retrying connection to master (attempt # 3)
14/11/23 02:23:47 INFO Worker: Connecting to master spark://172.17.0.86:7077...
14/11/2

[jira] [Commented] (SPARK-3588) Gaussian Mixture Model clustering

2014-11-24 Thread Meethu Mathew (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224091#comment-14224091
 ] 

Meethu Mathew commented on SPARK-3588:
--

[~mengxr] We have completed the PySpark implementation, which is available at 
https://github.com/FlytxtRnD/GMM. We are in the process of porting the code to 
Scala and were planning to create a PR once the coding and test cases are 
completed.
By "merging" do you mean merging the tickets or the implementations? Kindly 
explain how the merge would be done.
Will our work be a duplicate effort if we continue with our Scala 
implementation?
Could you please suggest the next course of action?

> Gaussian Mixture Model clustering
> -
>
> Key: SPARK-3588
> URL: https://issues.apache.org/jira/browse/SPARK-3588
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Meethu Mathew
>Assignee: Meethu Mathew
> Attachments: GMMSpark.py
>
>
> Gaussian Mixture Models (GMM) are a popular technique for soft clustering. A 
> GMM models the entire data set as a finite mixture of Gaussian distributions, 
> each parameterized by a mean vector µ, a covariance matrix ∑, and a mixture 
> weight π. In this technique, the probability of each point belonging to each 
> cluster is computed along with the cluster statistics.
> We have come up with an initial distributed implementation of GMM in PySpark 
> where the parameters are estimated using the Expectation-Maximization 
> algorithm. Our current implementation assumes a diagonal covariance matrix for 
> each component.
> We did an initial benchmark study on a 2-node Spark standalone cluster, where 
> each node has 8 cores and 8 GB RAM; the Spark version used is 1.0.0. We also 
> evaluated the Python version of k-means available in Spark on the same 
> datasets.
> Below are the results from this benchmark study. The reported stats are 
> averages from 10 runs. Tests were done on multiple datasets with varying 
> numbers of features and instances.
> || Instances || Dimensions || GMM: Avg time per iteration || GMM: Time for 100 iterations || Kmeans(Python): Avg time per iteration || Kmeans(Python): Time for 100 iterations ||
> | 0.7 million | 13 | 7s | 12min | 13s | 26min |
> | 1.8 million | 11 | 17s | 29min | 33s | 53min |
> | 10 million | 16 | 1.6min | 2.7hr | 1.2min | 2hr |
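To make the E-step described in the issue above concrete, here is a minimal single-point sketch of computing component responsibilities under the diagonal-covariance assumption; all numbers and names are made up and this is unrelated to the linked PySpark or Scala implementations.

{code}
import scala.math.{Pi, exp}

// For one point x, compute the responsibility of each mixture component
// (weight_k * N(x; mean_k, diag(variance_k)), normalized over k).
object GmmEStepSketch {
  def diagGaussianPdf(x: Array[Double], mean: Array[Double], variance: Array[Double]): Double = {
    var logP = 0.0
    var i = 0
    while (i < x.length) {
      val d = x(i) - mean(i)
      logP += -0.5 * (math.log(2 * Pi * variance(i)) + d * d / variance(i))
      i += 1
    }
    exp(logP)
  }

  def responsibilities(x: Array[Double], weights: Array[Double],
                       means: Array[Array[Double]], variances: Array[Array[Double]]): Array[Double] = {
    val unnorm = weights.indices.map(k => weights(k) * diagGaussianPdf(x, means(k), variances(k))).toArray
    val total = unnorm.sum
    unnorm.map(_ / total)
  }

  def main(args: Array[String]): Unit = {
    val x = Array(1.0, 2.0)
    val weights = Array(0.6, 0.4)
    val means = Array(Array(0.0, 0.0), Array(3.0, 3.0))
    val variances = Array(Array(1.0, 1.0), Array(2.0, 2.0))
    println(responsibilities(x, weights, means, variances).mkString(", "))
  }
}
{code}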






[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method

2014-11-24 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224088#comment-14224088
 ] 

Xiangrui Meng commented on SPARK-1503:
--

[~staple] Thanks for working on the design doc! [~rezazadeh] will make a pass.

> Implement Nesterov's accelerated first-order method
> ---
>
> Key: SPARK-1503
> URL: https://issues.apache.org/jira/browse/SPARK-1503
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Aaron Staple
>
> Nesterov's accelerated first-order method is a drop-in replacement for 
> steepest descent but it converges much faster. We should implement this 
> method and compare its performance with existing algorithms, including SGD 
> and L-BFGS.
> TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
> method and its variants on composite objectives.






[jira] [Created] (SPARK-4591) Add algorithm/model wrappers in spark.ml to adapt the new API

2014-11-24 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-4591:


 Summary: Add algorithm/model wrappers in spark.ml to adapt the new 
API
 Key: SPARK-4591
 URL: https://issues.apache.org/jira/browse/SPARK-4591
 Project: Spark
  Issue Type: Umbrella
Reporter: Xiangrui Meng


This is an umbrella JIRA for porting spark.mllib implementations to adapt the 
new API defined under spark.ml.






[jira] [Created] (SPARK-4590) Early investigation of parameter server

2014-11-24 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-4590:


 Summary: Early investigation of parameter server
 Key: SPARK-4590
 URL: https://issues.apache.org/jira/browse/SPARK-4590
 Project: Spark
  Issue Type: Brainstorming
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Reza Zadeh


In the current implementation of GLM solvers, we save intermediate models on 
the driver node and update them through broadcast and aggregation. Even with 
torrent broadcast and tree aggregation added in 1.1, it is hard to go beyond 
~10 million features. This JIRA is for investigating the parameter server 
approach, including algorithm, infrastructure, and dependencies.






[jira] [Created] (SPARK-4589) ML add-ons to SchemaRDD

2014-11-24 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-4589:


 Summary: ML add-ons to SchemaRDD
 Key: SPARK-4589
 URL: https://issues.apache.org/jira/browse/SPARK-4589
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib, SQL
Reporter: Xiangrui Meng


One piece of feedback we received on the Pipeline API (SPARK-3530) is about the 
boilerplate code in the implementation. We can add more Scala DSL to simplify 
the code for the operations we need in ML. Those operations could live under 
spark.ml via implicits, or be added to SchemaRDD directly if they are also 
useful for general purposes.
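The "Scala DSL via implicits" idea can be illustrated with the standard enrichment pattern. The types and the column method below are stand-ins (a plain Seq of rows rather than SchemaRDD), chosen purely to show the mechanism, not a proposed API.

{code}
// Stand-in illustration of the implicit "add-on" pattern mentioned above.
object ImplicitAddOnSketch {
  type Row = Map[String, Any]

  implicit class MlRowOps(val rows: Seq[Row]) extends AnyVal {
    // A hypothetical ML-flavored helper that would otherwise be boilerplate.
    def column(name: String): Seq[Any] = rows.map(_(name))
  }

  def main(args: Array[String]): Unit = {
    val rows: Seq[Row] = Seq(Map("label" -> 1.0, "text" -> "a"), Map("label" -> 0.0, "text" -> "b"))
    println(rows.column("label"))  // List(1.0, 0.0)
  }
}
{code}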






[jira] [Created] (SPARK-4588) Add API for feature attributes

2014-11-24 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-4588:


 Summary: Add API for feature attributes
 Key: SPARK-4588
 URL: https://issues.apache.org/jira/browse/SPARK-4588
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib
Reporter: Xiangrui Meng


Feature attributes, e.g., continuous/categorical, feature names, feature 
dimension, number of categories, and number of nonzeros (support), could be 
useful for ML algorithms.

In SPARK-3569, we added metadata to schema, which can be used to store feature 
attributes along with the dataset. We need to provide a wrapper over the 
Metadata class for ML usage.






[jira] [Commented] (SPARK-3588) Gaussian Mixture Model clustering

2014-11-24 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224042#comment-14224042
 ] 

Xiangrui Meng commented on SPARK-3588:
--

[~MeethuMathew] Just want to check with you whether you are working on the 
Scala implementation. [~tgaloppo] sent out a PR in SPARK-4156 . If you haven't 
spent much time on the Scala implementation, I'd like to invite you to review 
that PR, or we can think of a way to merge both implementations. Does it sound 
good to you?

> Gaussian Mixture Model clustering
> -
>
> Key: SPARK-3588
> URL: https://issues.apache.org/jira/browse/SPARK-3588
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Meethu Mathew
>Assignee: Meethu Mathew
> Attachments: GMMSpark.py
>
>
> Gaussian Mixture Models (GMM) are a popular technique for soft clustering. A 
> GMM models the entire data set as a finite mixture of Gaussian distributions, 
> each parameterized by a mean vector µ, a covariance matrix ∑, and a mixture 
> weight π. In this technique, the probability of each point belonging to each 
> cluster is computed along with the cluster statistics.
> We have come up with an initial distributed implementation of GMM in PySpark 
> where the parameters are estimated using the Expectation-Maximization 
> algorithm. Our current implementation assumes a diagonal covariance matrix for 
> each component.
> We did an initial benchmark study on a 2-node Spark standalone cluster, where 
> each node has 8 cores and 8 GB RAM; the Spark version used is 1.0.0. We also 
> evaluated the Python version of k-means available in Spark on the same 
> datasets.
> Below are the results from this benchmark study. The reported stats are 
> averages from 10 runs. Tests were done on multiple datasets with varying 
> numbers of features and instances.
> || Instances || Dimensions || GMM: Avg time per iteration || GMM: Time for 100 iterations || Kmeans(Python): Avg time per iteration || Kmeans(Python): Time for 100 iterations ||
> | 0.7 million | 13 | 7s | 12min | 13s | 26min |
> | 1.8 million | 11 | 17s | 29min | 33s | 53min |
> | 10 million | 16 | 1.6min | 2.7hr | 1.2min | 2hr |






[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3717:
-
Target Version/s: 1.3.0

> DecisionTree, RandomForest: Partition by feature
> 
>
> Key: SPARK-3717
> URL: https://issues.apache.org/jira/browse/SPARK-3717
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> h1. Summary
> Currently, data are partitioned by row/instance for DecisionTree and 
> RandomForest.  This JIRA argues for partitioning by feature for training deep 
> trees.  This is especially relevant for random forests, which are often 
> trained to be deeper than single decision trees.
> h1. Details
> Dataset dimensions and the depth of the tree to be trained are the main 
> problem parameters determining whether it is better to partition features or 
> instances.  For random forests (training many deep trees), partitioning 
> features could be much better.
> Notation:
> * P = # workers
> * N = # instances
> * M = # features
> * D = depth of tree
> h2. Partitioning Features
> Algorithm sketch:
> * Each worker stores:
> ** a subset of columns (i.e., a subset of features).  If a worker stores 
> feature j, then the worker stores the feature value for all instances (i.e., 
> the whole column).
> ** all labels
> * Train one level at a time.
> * Invariants:
> ** Each worker stores a mapping: instance → node in current level
> * On each iteration:
> ** Each worker: For each node in level, compute (best feature to split, info 
> gain).
> ** Reduce (P x M) values to M values to find best split for each node.
> ** Workers who have features used in best splits communicate left/right for 
> relevant instances.  Gather total of N bits to master, then broadcast.
> * Total communication:
> ** Depth D iterations
> ** On each iteration, reduce to M values (~8 bytes each), broadcast N values 
> (1 bit each).
> ** Estimate: D * (M * 8 + N)
> h2. Partitioning Instances
> Algorithm sketch:
> * Train one group of nodes at a time.
> * Invariants:
> ** Each worker stores a mapping: instance → node
> * On each iteration:
> ** Each worker: For each instance, add to aggregate statistics.
> ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes)
> *** (“# classes” is for classification.  3 for regression)
> ** Reduce aggregate.
> ** Master chooses best split for each node in group and broadcasts.
> * Local training: Once all instances for a node fit on one machine, it can be 
> best to shuffle data and training subtrees locally.  This can mean shuffling 
> the entire dataset for each tree trained.
> * Summing over all iterations, reduce to total of:
> ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each)
> ** Estimate: 2^D * M * B * C * 8
> h2. Comparing Partitioning Methods
> Partitioning features cost < partitioning instances cost when:
> * D * (M * 8 + N) < 2^D * M * B * C * 8
> * D * N < 2^D * M * B * C * 8  (assuming D * M * 8 is small compared to the 
> right hand side)
> * N < [ 2^D * M * B * C * 8 ] / D
> Example: many instances:
> * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 
> 5)
> * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7
> * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8
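The closing comparison above can be checked with a few lines of arithmetic, plugging in the same example numbers; this snippet only reproduces the estimates already stated in the issue.

{code}
// Plug the example numbers above into the two communication estimates.
object PartitionCostSketch {
  def main(args: Array[String]): Unit = {
    val D = 5                 // tree depth (6 levels trained)
    val levels = D + 1
    val N = 2000000.0         // instances
    val M = 3500.0            // features
    val B = 100.0             // bins
    val C = 5.0               // classes

    val featureCost = levels * (M * 8 + N)             // one reduce + broadcast per level
    val instanceCost = math.pow(2, D) * M * B * C * 8  // 2^D * M * B * C * 8

    println(f"partition by feature:  $featureCost%.3g")    // ~1.2e7
    println(f"partition by instance: $instanceCost%.3g")   // ~4.5e8
  }
}
{code}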






[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3717:
-
Assignee: Joseph K. Bradley

> DecisionTree, RandomForest: Partition by feature
> 
>
> Key: SPARK-3717
> URL: https://issues.apache.org/jira/browse/SPARK-3717
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> h1. Summary
> Currently, data are partitioned by row/instance for DecisionTree and 
> RandomForest.  This JIRA argues for partitioning by feature for training deep 
> trees.  This is especially relevant for random forests, which are often 
> trained to be deeper than single decision trees.
> h1. Details
> Dataset dimensions and the depth of the tree to be trained are the main 
> problem parameters determining whether it is better to partition features or 
> instances.  For random forests (training many deep trees), partitioning 
> features could be much better.
> Notation:
> * P = # workers
> * N = # instances
> * M = # features
> * D = depth of tree
> h2. Partitioning Features
> Algorithm sketch:
> * Each worker stores:
> ** a subset of columns (i.e., a subset of features).  If a worker stores 
> feature j, then the worker stores the feature value for all instances (i.e., 
> the whole column).
> ** all labels
> * Train one level at a time.
> * Invariants:
> ** Each worker stores a mapping: instance → node in current level
> * On each iteration:
> ** Each worker: For each node in level, compute (best feature to split, info 
> gain).
> ** Reduce (P x M) values to M values to find best split for each node.
> ** Workers who have features used in best splits communicate left/right for 
> relevant instances.  Gather total of N bits to master, then broadcast.
> * Total communication:
> ** Depth D iterations
> ** On each iteration, reduce to M values (~8 bytes each), broadcast N values 
> (1 bit each).
> ** Estimate: D * (M * 8 + N)
> h2. Partitioning Instances
> Algorithm sketch:
> * Train one group of nodes at a time.
> * Invariants:
> ** Each worker stores a mapping: instance → node
> * On each iteration:
> ** Each worker: For each instance, add to aggregate statistics.
> ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes)
> *** (“# classes” is for classification.  3 for regression)
> ** Reduce aggregate.
> ** Master chooses best split for each node in group and broadcasts.
> * Local training: Once all instances for a node fit on one machine, it can be 
> best to shuffle data and training subtrees locally.  This can mean shuffling 
> the entire dataset for each tree trained.
> * Summing over all iterations, reduce to total of:
> ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each)
> ** Estimate: 2^D * M * B * C * 8
> h2. Comparing Partitioning Methods
> Partitioning features cost < partitioning instances cost when:
> * D * (M * 8 + N) < 2^D * M * B * C * 8
> * D * N < 2^D * M * B * C * 8  (assuming D * M * 8 is small compared to the 
> right hand side)
> * N < [ 2^D * M * B * C * 8 ] / D
> Example: many instances:
> * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 
> 5)
> * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7
> * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8






[jira] [Created] (SPARK-4587) Model export/import

2014-11-24 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-4587:


 Summary: Model export/import
 Key: SPARK-4587
 URL: https://issues.apache.org/jira/browse/SPARK-4587
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Xiangrui Meng
Priority: Critical


This is an umbrella JIRA for one of the most requested features on the user 
mailing list. Model export/import can be done via Java serialization, but that 
doesn't work for models stored in a distributed fashion, e.g., ALS and LDA. 
Ideally, we should provide save/load methods for every model. PMML is an option 
but it has its limitations. There are a couple of things we need to discuss: 
1) data format, 2) how to preserve partitioning, 3) data compatibility between 
versions and language APIs, etc.
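As a point of reference for the "stored in a distributed fashion" case, one low-tech option is to persist a model's RDD-backed parameters as object files and reload them. The sketch below only illustrates that option (it is not a proposed API), and the (id -> factor array) layout is a made-up stand-in for something like ALS factors.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Save the RDD-backed parameters as partitioned object files and read them back.
object ModelIoSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("model-io").setMaster("local[2]"))
    val path = "/tmp/model-io-sketch"   // illustrative path; must not already exist

    val userFactors = sc.parallelize(0 until 100, 4).map(id => (id, Array.fill(8)(id * 0.01)))
    userFactors.saveAsObjectFile(path)  // save the (partitioned) parameters

    val reloaded = sc.objectFile[(Int, Array[Double])](path)
    println(reloaded.count())           // 100
    sc.stop()
  }
}
{code}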






[jira] [Updated] (SPARK-1406) PMML model evaluation support via MLib

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1406:
-
Assignee: Vincenzo Selvaggio

> PMML model evaluation support via MLib
> --
>
> Key: SPARK-1406
> URL: https://issues.apache.org/jira/browse/SPARK-1406
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Thomas Darimont
>Assignee: Vincenzo Selvaggio
> Attachments: MyJPMMLEval.java, SPARK-1406.pdf, kmeans.xml
>
>
> It would be useful if Spark provided support for the evaluation of PMML 
> models (http://www.dmg.org/v4-2/GeneralStructure.html).
> This would make it possible to use analytical models that were created with a 
> statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLlib), which 
> would perform the actual model evaluation for a given input tuple. The PMML 
> model would then just contain the "parameterization" of an analytical model.
> Other projects like JPMML-Evaluator do a similar thing:
> https://github.com/jpmml/jpmml/tree/master/pmml-evaluator






[jira] [Created] (SPARK-4586) Python API for ML Pipeline

2014-11-24 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-4586:


 Summary: Python API for ML Pipeline
 Key: SPARK-4586
 URL: https://issues.apache.org/jira/browse/SPARK-4586
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical


Add Python API to the newly added ML pipeline and parameters.






[jira] [Commented] (SPARK-4570) Add broadcast join to left semi join

2014-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224025#comment-14224025
 ] 

Apache Spark commented on SPARK-4570:
-

User 'wangxiaojing' has created a pull request for this issue:
https://github.com/apache/spark/pull/3442

> Add broadcast  join to left semi join
> -
>
> Key: SPARK-4570
> URL: https://issues.apache.org/jira/browse/SPARK-4570
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: XiaoJing wang
>Priority: Minor
> Fix For: 1.1.0
>
>
> For now, Spark uses a broadcast join instead of a hash join to optimize {{inner 
> join}} when the size of one side's data is below the 
> {{AUTO_BROADCASTJOIN_THRESHOLD}}.
> However, Spark SQL still performs shuffle operations on each child relation 
> while executing a {{left semi join}}, even though it is well suited to 
> optimization with a broadcast join.
> We are planning to create a {{BroadcastLeftSemiJoinHash}} to implement the 
> broadcast join for {{left semi join}}.
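The idea behind a broadcast left semi join can be sketched at the RDD level as follows; this is only an illustration of the semantics (keep left rows whose key appears on the right, without shuffling either side), not the planned BroadcastLeftSemiJoinHash operator.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Collect and broadcast the (small) right side's keys, then filter the left side.
object BroadcastSemiJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("semi-join").setMaster("local[2]"))
    val left = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c"), (4, "d")))
    val right = sc.parallelize(Seq((2, "x"), (4, "y"), (4, "z")))

    val rightKeys = sc.broadcast(right.map(_._1).distinct().collect().toSet)
    val semiJoined = left.filter { case (k, _) => rightKeys.value.contains(k) }

    semiJoined.collect().foreach(println)  // (2,b) and (4,d)
    sc.stop()
  }
}
{code}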






[jira] [Updated] (SPARK-4251) Add Restricted Boltzmann machine(RBM) algorithm to MLlib

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4251:
-
Target Version/s:   (was: 1.3.0)

> Add Restricted Boltzmann machine(RBM) algorithm to MLlib
> 
>
> Key: SPARK-4251
> URL: https://issues.apache.org/jira/browse/SPARK-4251
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>







[jira] [Updated] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4156:
-
Priority: Major  (was: Minor)

> Add expectation maximization for Gaussian mixture models to MLLib clustering
> 
>
> Key: SPARK-4156
> URL: https://issues.apache.org/jira/browse/SPARK-4156
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Travis Galoppo
>Assignee: Travis Galoppo
>
> As an additional clustering algorithm, implement expectation maximization for 
> Gaussian mixture models






[jira] [Updated] (SPARK-3188) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3188:
-
Assignee: Fan Jiang

> Add Robust Regression Algorithm with Tukey bisquare weight  function 
> (Biweight Estimates) 
> --
>
> Key: SPARK-3188
> URL: https://issues.apache.org/jira/browse/SPARK-3188
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Fan Jiang
>Assignee: Fan Jiang
>Priority: Minor
>  Labels: features
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Linear least squares estimates assume the errors are normally distributed and 
> can behave badly when the errors are heavy-tailed. In practice we get 
> various types of data, so we need to include robust regression to employ a 
> fitting criterion that is not as vulnerable as least squares.
> The Tukey bisquare weight function, also referred to as the biweight 
> function, produces an M-estimator that is more resistant to regression 
> outliers than the Huber M-estimator (Andersen 2008: 19).
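For reference, the weight function itself is short. The tuning constant 4.685 is the conventional choice for roughly 95% efficiency under normal errors; that constant and the standardization caveat are assumptions of this sketch, not stated in the issue.

{code}
// Tukey's bisquare (biweight) weight: residuals beyond the tuning constant c get
// zero weight, so gross outliers stop influencing the fit. In practice the
// residual is first standardized by a robust scale estimate (e.g., the MAD).
object BisquareSketch {
  def bisquareWeight(residual: Double, c: Double = 4.685): Double = {
    val u = residual / c
    if (math.abs(u) >= 1.0) 0.0
    else {
      val t = 1.0 - u * u
      t * t
    }
  }

  def main(args: Array[String]): Unit = {
    Seq(0.0, 1.0, 3.0, 5.0, 10.0).foreach { r =>
      println(f"residual $r%5.1f -> weight ${bisquareWeight(r)}%.4f")
    }
  }
}
{code}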






[jira] [Updated] (SPARK-4494) IDFModel.transform() add support for single vector

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4494:
-
Priority: Minor  (was: Major)

> IDFModel.transform() add support for single vector
> --
>
> Key: SPARK-4494
> URL: https://issues.apache.org/jira/browse/SPARK-4494
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.1.1, 1.2.0
>Reporter: Jean-Philippe Quemener
>Priority: Minor
>
> For now, when using the TF-IDF implementation of MLlib, the only way to map 
> your data back onto labels or ids is a hackish workaround with zipping: 
> {quote} 1. Persist the input RDD. 2. Transform it to just vectors and apply 
> the IDFModel. 3. Zip with the original RDD. 4. Transform the label and the 
> new vector into a LabeledPoint.{quote}
> Source: [http://stackoverflow.com/questions/26897908/spark-mllib-tfidf-implementation-for-logisticregression]
> Since in production a lot of users want to map their data back to some 
> identifier, it would be a good improvement to allow using a single vector 
> with IDFModel.transform().
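The zip workaround quoted above looks roughly like the sketch below with MLlib's HashingTF/IDF. It is only an illustration with made-up data, and it relies on the transformed RDD preserving the partitioning and per-partition ordering of the input so that zip lines up.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint

// Persist the input, run TF-IDF on the feature column only, zip back with labels.
object TfIdfZipSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tfidf-zip").setMaster("local[2]"))
    val docs = sc.parallelize(Seq(
      (1.0, Seq("spark", "mllib", "tfidf")),
      (0.0, Seq("hello", "world"))
    )).cache()                                          // 1. persist the input RDD

    val tf = new HashingTF().transform(docs.map(_._2))  // 2. vectors only
    tf.cache()
    val tfidf = new IDF().fit(tf).transform(tf)         //    apply the IDFModel
    val labeled = docs.map(_._1).zip(tfidf)             // 3. zip with the original labels
      .map { case (label, vector) => LabeledPoint(label, vector) } // 4. LabeledPoint

    labeled.collect().foreach(println)
    sc.stop()
  }
}
{code}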






[jira] [Updated] (SPARK-4582) Add getVectors to Word2VecModel

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4582:
-
Assignee: Tobias Kässmann

> Add getVectors to Word2VecModel
> ---
>
> Key: SPARK-4582
> URL: https://issues.apache.org/jira/browse/SPARK-4582
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Tobias Kässmann
>Priority: Minor
> Fix For: 1.2.0
>
>
> Add getVectors to Word2VecModel for further processing. PR for branch-1.2:
> https://github.com/apache/spark/pull/3309






[jira] [Resolved] (SPARK-4582) Add getVectors to Word2VecModel

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-4582.
--
Resolution: Fixed

Issue resolved by pull request 3437
[https://github.com/apache/spark/pull/3437]

> Add getVectors to Word2VecModel
> ---
>
> Key: SPARK-4582
> URL: https://issues.apache.org/jira/browse/SPARK-4582
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Priority: Minor
> Fix For: 1.2.0
>
>
> Add getVectors to Word2VecModel for further processing. PR for branch-1.2:
> https://github.com/apache/spark/pull/3309



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3575) Hive Schema is ignored when using convertMetastoreParquet

2014-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224011#comment-14224011
 ] 

Apache Spark commented on SPARK-3575:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/3441

> Hive Schema is ignored when using convertMetastoreParquet
> -
>
> Key: SPARK-3575
> URL: https://issues.apache.org/jira/browse/SPARK-3575
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cheng Lian
>Priority: Critical
>
> This can cause problems when for example one of the columns is defined as 
> TINYINT.  A class cast exception will be thrown since the parquet table scan 
> produces INTs while the rest of the execution is expecting bytes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1476) 2GB limit in spark for blocks

2014-11-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1476:
---
Target Version/s:   (was: 1.2.0)

> 2GB limit in spark for blocks
> -
>
> Key: SPARK-1476
> URL: https://issues.apache.org/jira/browse/SPARK-1476
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
> Environment: all
>Reporter: Mridul Muralidharan
>Assignee: Mridul Muralidharan
>Priority: Critical
> Attachments: 2g_fix_proposal.pdf
>
>
> The underlying abstraction for blocks in Spark is a ByteBuffer, which limits 
> the size of a block to 2GB.
> This has implications not just for managed blocks in use, but also for shuffle 
> blocks (memory-mapped blocks are limited to 2GB, even though the API allows a 
> long), ser/deser via byte-array-backed output streams (SPARK-1391), etc.
> This is a severe limitation when Spark is used on non-trivial datasets.
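As a small illustration of where the 2GB ceiling comes from (a sketch, not 
Spark code): buffer capacities are Int-addressed, so sizes above 
Integer.MAX_VALUE cannot be represented even where the surrounding API takes a 
long.

{code}
import java.io.RandomAccessFile
import java.nio.channels.FileChannel

object TwoGBLimit {
  def main(args: Array[String]): Unit = {
    val limit = Int.MaxValue.toLong  // 2^31 - 1, roughly 2GB
    // ByteBuffer.allocate takes an Int, so a >2GB buffer is unrepresentable.
    // Memory mapping has the same cap: FileChannel.map takes a long size but
    // throws IllegalArgumentException when size > Integer.MAX_VALUE.
    val file = new RandomAccessFile("/tmp/dummy", "rw")
    try {
      file.getChannel.map(FileChannel.MapMode.READ_WRITE, 0, limit + 1)
    } catch {
      case e: IllegalArgumentException =>
        println(s"cannot map ${limit + 1} bytes: ${e.getMessage}")
    } finally {
      file.close()
    }
  }
}
{code}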



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4525) MesosSchedulerBackend.resourceOffers cannot decline unused offers from acceptedOffers

2014-11-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4525:
---
Fix Version/s: 1.2.0

> MesosSchedulerBackend.resourceOffers cannot decline unused offers from 
> acceptedOffers
> -
>
> Key: SPARK-4525
> URL: https://issues.apache.org/jira/browse/SPARK-4525
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Jongyoul Lee
>Assignee: Jongyoul Lee
>Priority: Blocker
> Fix For: 1.2.0
>
>
> After the resourceOffers function was refactored in SPARK-2269, it no longer 
> declines unused offers among the accepted offers. Declining happens only when 
> driver.launchTasks is invoked with an empty task collection, which then calls 
> declineOffer(offer.id). 
> {quote}
> Invoking this function with an empty collection of tasks declines offers in 
> their entirety (see SchedulerDriver.declineOffer(OfferID, Filters)).
> - 
> http://mesos.apache.org/api/latest/java/org/apache/mesos/MesosSchedulerDriver.html#launchTasks(OfferID,%20java.util.Collection,%20Filters)
> {quote}
> In branch-1.1, resourceOffers called launchTasks for every offered offer, so 
> the driver declined unused resources. In current master, offers are first 
> split into accepted and declined offers by their resources; the declined 
> offers are declined explicitly, and offers with tasks from acceptedOffers are 
> launched by driver.launchTasks, but offers from acceptedOffers that received 
> no tasks are neither launched with an empty task list nor declined explicitly. 
> Thus, the Mesos master judges those offers as still in use by the 
> TaskScheduler and sees no resources remaining.
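A rough Scala sketch of the kind of fix described (illustrative only; the 
method and variable names are assumptions, not the actual patch): any offer 
that received no tasks should be declined explicitly after launching.

{code}
import scala.collection.JavaConverters._
import org.apache.mesos.{Protos, SchedulerDriver}

// offers: all offers received in resourceOffers; tasksPerOffer: the tasks
// built for each offer id (possibly empty). Names here are hypothetical.
def launchAndDecline(
    driver: SchedulerDriver,
    offers: Seq[Protos.Offer],
    tasksPerOffer: Map[Protos.OfferID, Seq[Protos.TaskInfo]]): Unit = {
  offers.foreach { offer =>
    val tasks = tasksPerOffer.getOrElse(offer.getId, Seq.empty)
    if (tasks.nonEmpty) {
      driver.launchTasks(Seq(offer.getId).asJava, tasks.asJava)
    } else {
      // Offers that got no tasks must be declined, otherwise the Mesos
      // master keeps treating their resources as in use by this framework.
      driver.declineOffer(offer.getId)
    }
  }
}
{code}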



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4395) Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour

2014-11-24 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223992#comment-14223992
 ] 

Cheng Lian commented on SPARK-4395:
---

[~davies] Sure, I'll take a look.

> Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour
> --
>
> Key: SPARK-4395
> URL: https://issues.apache.org/jira/browse/SPARK-4395
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0
> Environment: version 1.2.0-SNAPSHOT
>Reporter: Sameer Farooqui
>
> When I run this command it hangs for one to many hours and then finally 
> returns with successful results:
> >>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
> Note, the lab environment below is still active, so let me know if you'd like 
> to just access it directly.
> +++ My Environment +++
> - 1-node cluster in Amazon
> - RedHat 6.5 64-bit
> - java version "1.7.0_67"
> - SBT version: sbt-0.13.5
> - Scala version: scala-2.11.2
> Ran: 
> sudo yum -y update
> git clone https://github.com/apache/spark
> sudo sbt assembly
> +++ Data file used +++
> http://blueplastic.com/databricks/movielens/ratings.dat
> {code}
> >>> import re
> >>> import string
> >>> from pyspark.sql import SQLContext, Row
> >>> sqlContext = SQLContext(sc)
> >>> RATINGS_PATTERN = '^(\d+)::(\d+)::(\d+)::(\d+)'
> >>>
> >>> def parse_ratings_line(line):
> ... match = re.search(RATINGS_PATTERN, line)
> ... if match is None:
> ... # Optionally, you can change this to just ignore if each line of data is not critical.
> ... raise Error("Invalid logline: %s" % logline)
> ... return Row(
> ... UserID= int(match.group(1)),
> ... MovieID   = int(match.group(2)),
> ... Rating= int(match.group(3)),
> ... Timestamp = int(match.group(4)))
> ...
> >>> ratings_base_RDD = 
> >>> (sc.textFile("file:///home/ec2-user/movielens/ratings.dat")
> ...# Call the parse_ratings_line function on each line.
> ....map(parse_ratings_line)
> ...# Caches the objects in memory since they will be queried multiple times.
> ....cache())
> >>> ratings_base_RDD.count()
> 1000209
> >>> ratings_base_RDD.first()
> Row(MovieID=1193, Rating=5, Timestamp=978300760, UserID=1)
> >>> schemaRatings = sqlContext.inferSchema(ratings_base_RDD)
> >>> schemaRatings.registerTempTable("RatingsTable")
> >>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
> {code}
> (Now the Python shell hangs...)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4525) MesosSchedulerBackend.resourceOffers cannot decline unused offers from acceptedOffers

2014-11-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4525.

Resolution: Fixed

> MesosSchedulerBackend.resourceOffers cannot decline unused offers from 
> acceptedOffers
> -
>
> Key: SPARK-4525
> URL: https://issues.apache.org/jira/browse/SPARK-4525
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Jongyoul Lee
>Assignee: Jongyoul Lee
>Priority: Blocker
>
> After the resourceOffers function was refactored in SPARK-2269, it no longer 
> declines unused offers among the accepted offers. Declining happens only when 
> driver.launchTasks is invoked with an empty task collection, which then calls 
> declineOffer(offer.id). 
> {quote}
> Invoking this function with an empty collection of tasks declines offers in 
> their entirety (see SchedulerDriver.declineOffer(OfferID, Filters)).
> - 
> http://mesos.apache.org/api/latest/java/org/apache/mesos/MesosSchedulerDriver.html#launchTasks(OfferID,%20java.util.Collection,%20Filters)
> {quote}
> In branch-1.1, resourceOffers called launchTasks for every offered offer, so 
> the driver declined unused resources. In current master, offers are first 
> split into accepted and declined offers by their resources; the declined 
> offers are declined explicitly, and offers with tasks from acceptedOffers are 
> launched by driver.launchTasks, but offers from acceptedOffers that received 
> no tasks are neither launched with an empty task list nor declined explicitly. 
> Thus, the Mesos master judges those offers as still in use by the 
> TaskScheduler and sees no resources remaining.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4258) NPE with new Parquet Filters

2014-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223981#comment-14223981
 ] 

Apache Spark commented on SPARK-4258:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/3440

> NPE with new Parquet Filters
> 
>
> Key: SPARK-4258
> URL: https://issues.apache.org/jira/browse/SPARK-4258
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cheng Lian
>Priority: Critical
>
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 21.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 21.0 (TID 160, ip-10-0-247-144.us-west-2.compute.internal): 
> java.lang.NullPointerException: 
> parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:206)
> parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:162)
> 
> parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:100)
> 
> parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47)
> parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
> 
> parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:210)
> 
> parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47)
> parquet.filter2.predicate.Operators$Or.accept(Operators.java:302)
> 
> parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:201)
> 
> parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47)
> parquet.filter2.predicate.Operators$And.accept(Operators.java:290)
> 
> parquet.filter2.statisticslevel.StatisticsFilter.canDrop(StatisticsFilter.java:52)
> parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:46)
> parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
> 
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
> 
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
> 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> 
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
> {code}
> This occurs when reading parquet data encoded with the older version of the 
> library for TPC-DS query 34.  Will work on coming up with a smaller 
> reproduction



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4539) History Server counts "incomplete" applications against the "retainedApplications" total, fails to show eligible "completed" applications

2014-11-24 Thread Masayoshi TSUZUKI (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223979#comment-14223979
 ] 

Masayoshi TSUZUKI commented on SPARK-4539:
--

I assume you mean the parameter "spark.history.retainedApplications".
It does not limit the number of apps listed on the HistoryServer UI; it is the 
number of cached application detail pages, i.e. what is shown when we click a 
listed "App ID" link.
Even when spark.history.retainedApplications is set to 2, we can see more than 
10 apps listed.

By the way, I think your procedure doesn't work properly.
As you know, just copying an existing application directory doesn't work 
because both copies then have the same application id in EVENT_LOG_1, so the 
copy needs to be modified.
If there are 2 app directories with the same application id, HistoryServer 
skips listing them.
Also, HistoryServer only re-reads a directory whose modification time is later 
than the last time the log directory was loaded.
So please try:
  * updating the modification time of the directory after you modify 
EVENT_LOG_1.
  * making sure you are not looking at a cached page in your browser.

It works for me.
And of course, restarting the HistoryServer is also a good way to get all apps 
listed.


> History Server counts "incomplete" applications against the 
> "retainedApplications" total, fails to show eligible "completed" applications
> -
>
> Key: SPARK-4539
> URL: https://issues.apache.org/jira/browse/SPARK-4539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Ryan Williams
>
> I have observed the history server to return 0 or 1 applications from a 
> directory that contains many complete and incomplete applications (the latter 
> being application directories that are missing the {{APPLICATION_COMPLETE}} 
> file).
> Without having dug too much, my theory is that HistoryServer is seeing the 
> "incomplete" directories and counting them against the 
> {{retainedApplications}} maximum but not displaying them.
> One supporting anecdote for this is that I loaded HS against a directory that 
> had one complete application and nothing else, and HS worked as expected (I 
> saw the one application in the web UI).
> I then copied ~100 other application directories in, the majority of which 
> were "incomplete" (in particular, most of the ones that had the earliest 
> timestamps), and still only saw the one original completed application via 
> the web UI.
> Finally, I restarted the same server with the {{retainedApplications}} set to 
> 1000 (instead of 50; the directory at this point had ~10 completed 
> applications and 90 incomplete ones), and saw all/exactly the completed 
> applications, leading me to believe that they were being "boxed out" of the 
> maximum-50-retained-applications iteration of the history server.
> Silently failing on "incomplete" directories while still docking the count, 
> if that is indeed what is happening, is a pretty confusing failure mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4583) GradientBoostedTrees error logging should use loss being minimized

2014-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223950#comment-14223950
 ] 

Apache Spark commented on SPARK-4583:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/3439

> GradientBoostedTrees error logging should use loss being minimized
> --
>
> Key: SPARK-4583
> URL: https://issues.apache.org/jira/browse/SPARK-4583
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> Currently, the LogLoss used by GradientBoostedTrees has 2 issues:
> * the gradient (and therefore loss) does not match that used by Friedman 
> (1999)
> * the error computation uses 0/1 accuracy, not log loss
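For concreteness, a sketch of the loss and gradient in the Friedman (1999) 
formulation (labels y in {-1, +1}, F the current prediction); the constant 
factor of 2 is one common convention and is an assumption here:

{code}
// Log loss and its gradient with respect to F:
//   loss(y, F)     = 2 * log(1 + exp(-2 * y * F))
//   dloss/dF(y, F) = -4 * y / (1 + exp(2 * y * F))
def logLoss(y: Double, f: Double): Double =
  2.0 * math.log1p(math.exp(-2.0 * y * f))
def logLossGradient(y: Double, f: Double): Double =
  -4.0 * y / (1.0 + math.exp(2.0 * y * f))

// The reported training error should use this loss, not 0/1 accuracy:
def meanLogLoss(labelsAndPredictions: Seq[(Double, Double)]): Double =
  labelsAndPredictions.map { case (y, f) => logLoss(y, f) }.sum /
    labelsAndPredictions.size
{code}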



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-24 Thread Pedro Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223922#comment-14223922
 ] 

Pedro Rodriguez edited comment on SPARK-1405 at 11/25/14 2:18 AM:
--

Finished an initial implementation of an LDA data generator. I have done some 
initial testing and it seems reasonable, but just initial testing at the 
moment. Will be looking at metrics other than "it looks good" to make sure that 
the data being generated is correct.

Implementation: 
https://github.com/EntilZha/spark/blob/LDA/mllib/src/main/scala/org/apache/spark/mllib/util/LDADataGenerator.scala


was (Author: pedrorodriguez):
Finished an initial implementation of an LDA data generator. I have done some 
initial testing and it seems reasonable, but just initial testing at the 
moment. Will be looking at metrics other than "it looks good" to make sure that 
the data being generated looks reasonable.

Implementation: 
https://github.com/EntilZha/spark/blob/LDA/mllib/src/main/scala/org/apache/spark/mllib/util/LDADataGenerator.scala

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Guoqiang Li
>Priority: Critical
>  Labels: features
> Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms in 
> MLlib, which use optimization algorithms such as gradient descent, LDA uses 
> expectation algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (solved yet), a word segmentation step (imported from 
> Lucene), and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4266) Avoid expensive JavaScript for StagePages with huge numbers of tasks

2014-11-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4266.

   Resolution: Fixed
Fix Version/s: 1.2.0

[~kayousterhout] I'm resolving this because I saw you merged it.

> Avoid expensive JavaScript for StagePages with huge numbers of tasks
> 
>
> Key: SPARK-4266
> URL: https://issues.apache.org/jira/browse/SPARK-4266
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.0
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Blocker
> Fix For: 1.2.0
>
>
> Some of the new javascript added to handle hiding metrics significantly slows 
> the page load for stages with a lot of tasks (e.g., for a job with 10K tasks, 
> it took over a minute for the page to finish loading in Chrome on my laptop). 
>  There are at least two issues here:
> (1) The new table striping JavaScript is much slower than the old CSS.  The 
> fancier JavaScript is only needed for the stage summary table, so we should 
> change the task table back to using CSS so that it doesn't slow the page load 
> for jobs with lots of tasks.
> (2) The JavaScript associated with hiding metrics is expensive when jobs have 
> lots of tasks, I think because the jQuery selectors have to traverse a much 
> larger DOM.   The ID selectors are much more efficient, so we should consider 
> switching to these, and/or avoiding this code in additional-metrics.js:
> $("input:checkbox:not(:checked)").each(function() {
> var column = "table ." + $(this).attr("name");
> $(column).hide();
> });
> by initially hiding the data when we generate the page in the render function 
> instead, which should be easy to do.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4585) Spark dynamic scaling executors use upper limit value as default.

2014-11-24 Thread Chengxiang Li (JIRA)
Chengxiang Li created SPARK-4585:


 Summary: Spark dynamic scaling executors use upper limit value as 
default.
 Key: SPARK-4585
 URL: https://issues.apache.org/jira/browse/SPARK-4585
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.1.0
Reporter: Chengxiang Li


With SPARK-3174, one can configure a minimum and maximum number of executors 
for a Spark application on Yarn. However, the application always starts with 
the maximum. It seems more reasonable, at least for Hive on Spark, to start 
from the minimum and scale up as needed up to the maximum.
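For context, a sketch of the relevant configuration (property names as 
introduced by SPARK-3174; whether the initial executor count should default to 
the minimum is exactly what this issue proposes):

{code}
import org.apache.spark.SparkConf

// Dynamic executor allocation on YARN: today the application starts at the
// upper bound; this issue proposes starting at (or near) the lower bound.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "50")
{code}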



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-4584) 2x Performance regression for Spark-on-YARN

2014-11-24 Thread Nishkam Ravi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishkam Ravi updated SPARK-4584:

Comment: was deleted

(was: In YARN cluster mode.)

> 2x Performance regression for Spark-on-YARN
> ---
>
> Key: SPARK-4584
> URL: https://issues.apache.org/jira/browse/SPARK-4584
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Nishkam Ravi
>
> Significant performance regression observed for Spark-on-YARN (upto 2x) after 
> 1.2 rebase. The offending commit is: 70e824f750aa8ed446eec104ba158b0503ba58a9 
>  from Oct 7th. Problem can be reproduced with JavaWordCount against a large 
> enough input dataset.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4584) 2x Performance regression for Spark-on-YARN

2014-11-24 Thread Nishkam Ravi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishkam Ravi updated SPARK-4584:

Description: Significant performance regression observed for Spark-on-YARN 
(upto 2x) after 1.2 rebase. The offending commit is: 
70e824f750aa8ed446eec104ba158b0503ba58a9  from Oct 7th. Problem can be 
reproduced with JavaWordCount against a large enough input dataset in YARN 
cluster mode.  (was: Significant performance regression observed for 
Spark-on-YARN (upto 2x) after 1.2 rebase. The offending commit is: 
70e824f750aa8ed446eec104ba158b0503ba58a9  from Oct 7th. Problem can be 
reproduced with JavaWordCount against a large enough input dataset.)

> 2x Performance regression for Spark-on-YARN
> ---
>
> Key: SPARK-4584
> URL: https://issues.apache.org/jira/browse/SPARK-4584
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Nishkam Ravi
>
> Significant performance regression observed for Spark-on-YARN (upto 2x) after 
> 1.2 rebase. The offending commit is: 70e824f750aa8ed446eec104ba158b0503ba58a9 
>  from Oct 7th. Problem can be reproduced with JavaWordCount against a large 
> enough input dataset in YARN cluster mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4584) 2x Performance regression for Spark-on-YARN

2014-11-24 Thread Nishkam Ravi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223930#comment-14223930
 ] 

Nishkam Ravi commented on SPARK-4584:
-

In YARN cluster mode.

> 2x Performance regression for Spark-on-YARN
> ---
>
> Key: SPARK-4584
> URL: https://issues.apache.org/jira/browse/SPARK-4584
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Nishkam Ravi
>
> Significant performance regression observed for Spark-on-YARN (upto 2x) after 
> 1.2 rebase. The offending commit is: 70e824f750aa8ed446eec104ba158b0503ba58a9 
>  from Oct 7th. Problem can be reproduced with JavaWordCount against a large 
> enough input dataset.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4584) 2x Performance regression for Spark-on-YARN

2014-11-24 Thread Nishkam Ravi (JIRA)
Nishkam Ravi created SPARK-4584:
---

 Summary: 2x Performance regression for Spark-on-YARN
 Key: SPARK-4584
 URL: https://issues.apache.org/jira/browse/SPARK-4584
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Nishkam Ravi


Significant performance regression observed for Spark-on-YARN (upto 2x) after 
1.2 rebase. The offending commit is: 70e824f750aa8ed446eec104ba158b0503ba58a9  
from Oct 7th. Problem can be reproduced with JavaWordCount against a large 
enough input dataset.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2014-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223925#comment-14223925
 ] 

Apache Spark commented on SPARK-2926:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/3438

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> --
>
> Key: SPARK-2926
> URL: https://issues.apache.org/jira/browse/SPARK-2926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test 
> Report(contd).pdf, Spark Shuffle Test Report.pdf
>
>
> Spark has already integrated sort-based shuffle write, which greatly improves 
> the IO performance and reduces the memory consumption when the number of 
> reducers is very large. On the reducer side, however, it still uses the 
> hash-based shuffle reader, which ignores the ordering of map output data in 
> some situations.
> Here we propose an MR-style, sort-merge-like shuffle reader for sort-based 
> shuffle to further improve its performance.
> Work-in-progress code and a performance test report will be posted later, 
> once some unit test bugs are fixed.
> Any comments would be greatly appreciated. 
> Thanks a lot.
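The core idea, sketched generically in Scala (not the proposed patch): each 
map output is already sorted by key under sort-based shuffle, so the reduce 
side can merge the streams with a small heap instead of re-hashing or 
re-sorting everything.

{code}
import scala.collection.mutable

// Merge k iterators that are each already sorted by key, MR-style.
def mergeSorted[K, V](sortedStreams: Seq[Iterator[(K, V)]])
                     (implicit ord: Ordering[K]): Iterator[(K, V)] = {
  // min-heap keyed on K (PriorityQueue is a max-heap, hence .reverse)
  val heads = mutable.PriorityQueue.empty[(K, V, Iterator[(K, V)])](
    Ordering.by[(K, V, Iterator[(K, V)]), K](_._1).reverse)
  sortedStreams.foreach { it =>
    if (it.hasNext) { val (k, v) = it.next(); heads.enqueue((k, v, it)) }
  }
  new Iterator[(K, V)] {
    def hasNext: Boolean = heads.nonEmpty
    def next(): (K, V) = {
      val (k, v, it) = heads.dequeue()
      if (it.hasNext) { val (k2, v2) = it.next(); heads.enqueue((k2, v2, it)) }
      (k, v)
    }
  }
}

// Example: three pre-sorted "map outputs" merged into one sorted stream.
val merged = mergeSorted(Seq(Iterator(1 -> "a", 4 -> "d"), Iterator(2 -> "b"),
  Iterator(3 -> "c")))
println(merged.toList)  // List((1,a), (2,b), (3,c), (4,d))
{code}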



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4583) GradientBoostedTrees error logging should use loss being minimized

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4583:
-
Assignee: Joseph K. Bradley

> GradientBoostedTrees error logging should use loss being minimized
> --
>
> Key: SPARK-4583
> URL: https://issues.apache.org/jira/browse/SPARK-4583
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> Currently, the LogLoss used by GradientBoostedTrees has 2 issues:
> * the gradient (and therefore loss) does not match that used by Friedman 
> (1999)
> * the error computation uses 0/1 accuracy, not log loss



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4525) MesosSchedulerBackend.resourceOffers cannot decline unused offers from acceptedOffers

2014-11-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4525:
---
Target Version/s: 1.2.0  (was: 1.2.0, 1.3.0)

> MesosSchedulerBackend.resourceOffers cannot decline unused offers from 
> acceptedOffers
> -
>
> Key: SPARK-4525
> URL: https://issues.apache.org/jira/browse/SPARK-4525
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Jongyoul Lee
>Assignee: Jongyoul Lee
>Priority: Blocker
>
> After the resourceOffers function was refactored in SPARK-2269, it no longer 
> declines unused offers among the accepted offers. Declining happens only when 
> driver.launchTasks is invoked with an empty task collection, which then calls 
> declineOffer(offer.id). 
> {quote}
> Invoking this function with an empty collection of tasks declines offers in 
> their entirety (see SchedulerDriver.declineOffer(OfferID, Filters)).
> - 
> http://mesos.apache.org/api/latest/java/org/apache/mesos/MesosSchedulerDriver.html#launchTasks(OfferID,%20java.util.Collection,%20Filters)
> {quote}
> In branch-1.1, resourceOffers called launchTasks for every offered offer, so 
> the driver declined unused resources. In current master, offers are first 
> split into accepted and declined offers by their resources; the declined 
> offers are declined explicitly, and offers with tasks from acceptedOffers are 
> launched by driver.launchTasks, but offers from acceptedOffers that received 
> no tasks are neither launched with an empty task list nor declined explicitly. 
> Thus, the Mesos master judges those offers as still in use by the 
> TaskScheduler and sees no resources remaining.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-24 Thread Pedro Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223922#comment-14223922
 ] 

Pedro Rodriguez commented on SPARK-1405:


Finished an initial implementation of an LDA data generator. I have done some 
initial testing and it seems reasonable, but just initial testing at the 
moment. Will be looking at metrics other than "it looks good" to make sure that 
the data being generated looks reasonable.

Implementation: 
https://github.com/EntilZha/spark/blob/LDA/mllib/src/main/scala/org/apache/spark/mllib/util/LDADataGenerator.scala

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Guoqiang Li
>Priority: Critical
>  Labels: features
> Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms in 
> MLlib, which use optimization algorithms such as gradient descent, LDA uses 
> expectation algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (solved yet), a word segmentation step (imported from 
> Lucene), and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4583) GradientBoostedTrees error logging should use loss being minimized

2014-11-24 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-4583:


 Summary: GradientBoostedTrees error logging should use loss being 
minimized
 Key: SPARK-4583
 URL: https://issues.apache.org/jira/browse/SPARK-4583
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Minor


Currently, the LogLoss used by GradientBoostedTrees has 2 issues:
* the gradient (and therefore loss) does not match that used by Friedman (1999)
* the error computation uses 0/1 accuracy, not log loss



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4577) Python example of LBFGS for MLlib guide

2014-11-24 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223909#comment-14223909
 ] 

Davies Liu commented on SPARK-4577:
---

The Scala example about L-BFGS is actually about the optimization API, but we do 
not have a Python API for it, so I would like to close this issue.

> Python example of LBFGS for MLlib guide
> ---
>
> Key: SPARK-4577
> URL: https://issues.apache.org/jira/browse/SPARK-4577
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, MLlib
>Reporter: Davies Liu
>Priority: Minor
>
> The MLlib guide should have a Python example of L-BFGS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4577) Python example of LBFGS for MLlib guide

2014-11-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-4577.
-
  Resolution: Won't Fix
Target Version/s:   (was: 1.2.0)

> Python example of LBFGS for MLlib guide
> ---
>
> Key: SPARK-4577
> URL: https://issues.apache.org/jira/browse/SPARK-4577
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, MLlib
>Reporter: Davies Liu
>Priority: Minor
>
> The MLlib guide should have a Python example of L-BFGS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4409) Additional (but limited) Linear Algebra Utils

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4409:
-
Assignee: Burak Yavuz

> Additional (but limited) Linear Algebra Utils
> -
>
> Key: SPARK-4409
> URL: https://issues.apache.org/jira/browse/SPARK-4409
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Minor
>
> This ticket is to discuss the addition of a very limited number of local 
> matrix manipulation and generation methods that would be helpful in the 
> further development for algorithms on top of BlockMatrix (SPARK-3974), such 
> as Randomized SVD, and Multi Model Training (SPARK-1486).
> The proposed methods for addition are:
> For `Matrix`
>  -  map: maps the values in the matrix with a given function. Produces a new 
> matrix.
>  -  update: the values in the matrix are updated with a given function. 
> Occurs in place.
> Factory methods for `DenseMatrix`:
>  -  *zeros: Generate a matrix consisting of zeros
>  -  *ones: Generate a matrix consisting of ones
>  -  *eye: Generate an identity matrix
>  -  *rand: Generate a matrix consisting of i.i.d. uniform random numbers
>  -  *randn: Generate a matrix consisting of i.i.d. gaussian random numbers
>  -  *diag: Generate a diagonal matrix from a supplied vector
> *These methods already exist in the factory methods for `Matrices`, however 
> for cases where we require a `DenseMatrix`, you constantly have to add 
> `.asInstanceOf[DenseMatrix]` everywhere, which makes the code "dirtier". I 
> propose moving these functions to factory methods for `DenseMatrix` where the 
> output will be a `DenseMatrix` and the factory methods for `Matrices` will 
> call these functions directly and output a generic `Matrix`.
> Factory methods for `SparseMatrix`:
>  -  speye: Identity matrix in sparse format. Saves a ton of memory when 
> dimensions are large, especially in Multi Model Training, where each row 
> requires being multiplied by a scalar.
>  -  sprand: Generate a sparse matrix with a given density consisting of 
> i.i.d. uniform random numbers.
>  -  sprandn: Generate a sparse matrix with a given density consisting of 
> i.i.d. gaussian random numbers.
>  -  diag: Generate a diagonal matrix from a supplied vector, but is memory 
> efficient, because it just stores the diagonal. Again, very helpful in Multi 
> Model Training.
> Factory methods for `Matrices`:
>  -  Include all the factory methods given above, but return a generic 
> `Matrix` rather than `SparseMatrix` or `DenseMatrix`.
>  -  horzCat: Horizontally concatenate matrices to form one larger matrix. 
> Very useful in both Multi Model Training, and for the repartitioning of 
> BlockMatrix.
>  -  vertCat: Vertically concatenate matrices to form one larger matrix. Very 
> useful for the repartitioning of BlockMatrix.
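A sketch of how the proposed factories might be used (the method names are the 
proposals above, not an existing API, so this is shown as comments only):

{code}
// Hypothetical usage of the proposed DenseMatrix / SparseMatrix factories:
//   val I    = DenseMatrix.eye(3)                  // 3x3 identity, dense
//   val R    = DenseMatrix.rand(3, 4)              // i.i.d. uniform entries
//   val spI  = SparseMatrix.speye(100000)          // identity, stored sparsely
//   val D    = SparseMatrix.diag(Vectors.dense(1.0, 2.0, 3.0))
//   val wide = Matrices.horzCat(Seq(I, R))         // concatenate side by side
// Per the description, today the dense variants require casts such as
// Matrices.eye(3).asInstanceOf[DenseMatrix].
{code}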



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4409) Additional (but limited) Linear Algebra Utils

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4409:
-
Priority: Major  (was: Minor)

> Additional (but limited) Linear Algebra Utils
> -
>
> Key: SPARK-4409
> URL: https://issues.apache.org/jira/browse/SPARK-4409
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>
> This ticket is to discuss the addition of a very limited number of local 
> matrix manipulation and generation methods that would be helpful in the 
> further development for algorithms on top of BlockMatrix (SPARK-3974), such 
> as Randomized SVD, and Multi Model Training (SPARK-1486).
> The proposed methods for addition are:
> For `Matrix`
>  -  map: maps the values in the matrix with a given function. Produces a new 
> matrix.
>  -  update: the values in the matrix are updated with a given function. 
> Occurs in place.
> Factory methods for `DenseMatrix`:
>  -  *zeros: Generate a matrix consisting of zeros
>  -  *ones: Generate a matrix consisting of ones
>  -  *eye: Generate an identity matrix
>  -  *rand: Generate a matrix consisting of i.i.d. uniform random numbers
>  -  *randn: Generate a matrix consisting of i.i.d. gaussian random numbers
>  -  *diag: Generate a diagonal matrix from a supplied vector
> *These methods already exist in the factory methods for `Matrices`, however 
> for cases where we require a `DenseMatrix`, you constantly have to add 
> `.asInstanceOf[DenseMatrix]` everywhere, which makes the code "dirtier". I 
> propose moving these functions to factory methods for `DenseMatrix` where the 
> output will be a `DenseMatrix` and the factory methods for `Matrices` will 
> call these functions directly and output a generic `Matrix`.
> Factory methods for `SparseMatrix`:
>  -  speye: Identity matrix in sparse format. Saves a ton of memory when 
> dimensions are large, especially in Multi Model Training, where each row 
> requires being multiplied by a scalar.
>  -  sprand: Generate a sparse matrix with a given density consisting of 
> i.i.d. uniform random numbers.
>  -  sprandn: Generate a sparse matrix with a given density consisting of 
> i.i.d. gaussian random numbers.
>  -  diag: Generate a diagonal matrix from a supplied vector, but is memory 
> efficient, because it just stores the diagonal. Again, very helpful in Multi 
> Model Training.
> Factory methods for `Matrices`:
>  -  Include all the factory methods given above, but return a generic 
> `Matrix` rather than `SparseMatrix` or `DenseMatrix`.
>  -  horzCat: Horizontally concatenate matrices to form one larger matrix. 
> Very useful in both Multi Model Training, and for the repartitioning of 
> BlockMatrix.
>  -  vertCat: Vertically concatenate matrices to form one larger matrix. Very 
> useful for the repartitioning of BlockMatrix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4494) IDFModel.transform() add support for single vector

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4494:
-
Target Version/s: 1.3.0  (was: 1.1.1)

> IDFModel.transform() add support for single vector
> --
>
> Key: SPARK-4494
> URL: https://issues.apache.org/jira/browse/SPARK-4494
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.1.1, 1.2.0
>Reporter: Jean-Philippe Quemener
>
> For now, when using the tf-idf implementation of MLlib, the only way to map 
> your data back onto e.g. labels or ids is a hackish workaround with zipping: 
> {quote} 1. Persist input RDD. 2. Transform it to just vectors and apply 
> IDFModel. 3. Zip with original RDD. 4. Transform label and new vector to 
> LabeledPoint.{quote}
> Source:[http://stackoverflow.com/questions/26897908/spark-mllib-tfidf-implementation-for-logisticregression]
> Since in production a lot of users want to map their data back to some 
> identifier, it would be a good improvement to allow using a single vector on 
> IDFModel.transform()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4494) IDFModel.transform() add support for single vector

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4494:
-
Affects Version/s: (was: 1.1.0)
   1.1.1

> IDFModel.transform() add support for single vector
> --
>
> Key: SPARK-4494
> URL: https://issues.apache.org/jira/browse/SPARK-4494
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.1.1, 1.2.0
>Reporter: Jean-Philippe Quemener
>
> For now, when using the tf-idf implementation of MLlib, the only way to map 
> your data back onto e.g. labels or ids is a hackish workaround with zipping: 
> {quote} 1. Persist input RDD. 2. Transform it to just vectors and apply 
> IDFModel. 3. Zip with original RDD. 4. Transform label and new vector to 
> LabeledPoint.{quote}
> Source:[http://stackoverflow.com/questions/26897908/spark-mllib-tfidf-implementation-for-logisticregression]
> Since in production a lot of users want to map their data back to some 
> identifier, it would be a good improvement to allow using a single vector on 
> IDFModel.transform()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4494) IDFModel.transform() add support for single vector

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4494:
-
Affects Version/s: 1.2.0

> IDFModel.transform() add support for single vector
> --
>
> Key: SPARK-4494
> URL: https://issues.apache.org/jira/browse/SPARK-4494
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.1.1, 1.2.0
>Reporter: Jean-Philippe Quemener
>
> For now, when using the tf-idf implementation of MLlib, the only way to map 
> your data back onto e.g. labels or ids is a hackish workaround with zipping: 
> {quote} 1. Persist input RDD. 2. Transform it to just vectors and apply 
> IDFModel. 3. Zip with original RDD. 4. Transform label and new vector to 
> LabeledPoint.{quote}
> Source:[http://stackoverflow.com/questions/26897908/spark-mllib-tfidf-implementation-for-logisticregression]
> Since in production a lot of users want to map their data back to some 
> identifier, it would be a good improvement to allow using a single vector on 
> IDFModel.transform()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4510) Add k-medoids Partitioning Around Medoids (PAM) algorithm

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4510:
-
Assignee: Fan Jiang

> Add k-medoids Partitioning Around Medoids (PAM) algorithm
> -
>
> Key: SPARK-4510
> URL: https://issues.apache.org/jira/browse/SPARK-4510
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Fan Jiang
>Assignee: Fan Jiang
>  Labels: features
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> PAM (k-medoids) is more robust to noise and outliers as compared to k-means 
> because it minimizes a sum of pairwise dissimilarities instead of a sum of 
> squared Euclidean distances. A medoid can be defined as the object of a 
> cluster whose average dissimilarity to all the objects in the cluster is 
> minimal, i.e. it is the most centrally located point in the cluster.
> The most common realisation of k-medoid clustering is the Partitioning Around 
> Medoids (PAM) algorithm, which is as follows:
> 1. Initialize: randomly select (without replacement) k of the n data points 
> as the medoids.
> 2. Associate each data point to the closest medoid ("closest" here is defined 
> using any valid distance metric, most commonly Euclidean distance, Manhattan 
> distance or Minkowski distance).
> 3. For each medoid m and each non-medoid data point o: swap m and o and 
> compute the total cost of the configuration.
> 4. Select the configuration with the lowest cost.
> 5. Repeat steps 2 to 4 until there is no change in the medoids.
> The new feature for MLlib will contain 5 new files
> /main/scala/org/apache/spark/mllib/clustering/PAM.scala
> /main/scala/org/apache/spark/mllib/clustering/PAMModel.scala
> /main/scala/org/apache/spark/mllib/clustering/LocalPAM.scala
> /test/scala/org/apache/spark/mllib/clustering/PAMSuite.scala
> /main/scala/org/apache/spark/examples/mllib/KMedoids.scala
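A compact local sketch of the PAM loop over an in-memory dataset (illustrative 
only, not the proposed MLlib implementation; a distributed version would 
parallelize the point-to-medoid assignment and cost computation):

{code}
import scala.util.Random

type Point = Array[Double]

def dist(a: Point, b: Point): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

// Total cost = sum of distances from every point to its closest medoid.
def cost(points: Seq[Point], medoids: Seq[Point]): Double =
  points.map(p => medoids.map(dist(p, _)).min).sum

def pam(points: Seq[Point], k: Int, maxIter: Int = 20, seed: Long = 42): Seq[Point] = {
  val rng = new Random(seed)
  var medoids = rng.shuffle(points).take(k)       // step 1: random initial medoids
  var improved = true
  var iter = 0
  while (improved && iter < maxIter) {
    improved = false
    iter += 1
    // steps 2-4: try every (medoid, non-medoid) swap, keep the cheapest configuration
    val swaps = for {
      m <- medoids
      o <- points if !medoids.exists(_ eq o)
    } yield medoids.filterNot(_ eq m) :+ o
    val best = (medoids +: swaps).minBy(cost(points, _))
    if (cost(points, best) < cost(points, medoids)) {
      medoids = best
      improved = true                             // step 5: repeat until stable
    }
  }
  medoids
}

val data = Seq(Array(0.0, 0.0), Array(0.1, 0.2), Array(5.0, 5.0), Array(5.1, 4.9))
pam(data, k = 2).foreach(m => println(m.mkString("(", ", ", ")")))
{code}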



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4530) GradientDescent get a wrong gradient value according to the gradient formula, which is caused by the miniBatchSize parameter.

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4530:
-
Target Version/s: 1.0.2, 1.2.0, 1.1.2

> GradientDescent get a wrong gradient value according to the gradient formula, 
> which is caused by the miniBatchSize parameter.
> -
>
> Key: SPARK-4530
> URL: https://issues.apache.org/jira/browse/SPARK-4530
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0, 1.1.0, 1.2.0
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>Priority: Minor
>
> This bug is caused by {{RDD.sample}}:
> the number of elements {{RDD.sample}} returns is not fixed.
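A quick spark-shell sketch of the underlying behaviour: sampling without 
replacement is per-element Bernoulli, so the returned count fluctuates around 
fraction * n instead of being exact, which can skew a gradient that is 
averaged over the nominal mini-batch size.

{code}
// In spark-shell: the sampled count varies from run to run / seed to seed.
val rdd = sc.parallelize(1 to 100000)
val counts = (1 to 5).map { i =>
  rdd.sample(withReplacement = false, fraction = 0.1, seed = i.toLong).count()
}
println(counts)  // e.g. Vector(10062, 9938, 10091, ...), not exactly 10000
{code}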



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4530) GradientDescent get a wrong gradient value according to the gradient formula, which is caused by the miniBatchSize parameter.

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4530:
-
Priority: Minor  (was: Major)

> GradientDescent get a wrong gradient value according to the gradient formula, 
> which is caused by the miniBatchSize parameter.
> -
>
> Key: SPARK-4530
> URL: https://issues.apache.org/jira/browse/SPARK-4530
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0, 1.1.0, 1.2.0
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>Priority: Minor
>
> This bug is caused by {{RDD.sample}}:
> the number of elements {{RDD.sample}} returns is not fixed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4530) GradientDescent get a wrong gradient value according to the gradient formula, which is caused by the miniBatchSize parameter.

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4530:
-
Assignee: Guoqiang Li

> GradientDescent get a wrong gradient value according to the gradient formula, 
> which is caused by the miniBatchSize parameter.
> -
>
> Key: SPARK-4530
> URL: https://issues.apache.org/jira/browse/SPARK-4530
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0, 1.1.0, 1.2.0
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>
> This bug is caused by {{RDD.sample}}:
> the number of elements {{RDD.sample}} returns is not fixed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4581) Refactorize StandardScaler to improve the transformation performance

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4581:
-
Target Version/s: 1.2.0
Assignee: DB Tsai

> Refactorize StandardScaler to improve the transformation performance
> 
>
> Key: SPARK-4581
> URL: https://issues.apache.org/jira/browse/SPARK-4581
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: DB Tsai
>Assignee: DB Tsai
>
> The following optimizations are done to improve the StandardScaler model 
> transformation performance.
> 1) Convert the Breeze dense vector to a primitive vector to reduce overhead.
> 2) Since the mean can potentially be a sparse vector, we explicitly convert 
> it to a dense primitive vector.
> 3) Keep a local reference to the `shift` and `factor` arrays so the JVM can 
> locate the values with a single operation.
> 4) In the pattern matching part, we use the MLlib SparseVector/DenseVector 
> instead of Breeze's vector to make the codebase cleaner. 
> Benchmark with the mnist8m dataset:
> Before,
> DenseVector withMean and withStd: 50.97secs
> DenseVector withMean and withoutStd: 42.11secs
> DenseVector withoutMean and withStd: 8.75secs
> SparseVector withoutMean and withStd: 5.437secs
> With this PR,
> DenseVector withMean and withStd: 5.76secs
> DenseVector withMean and withoutStd: 5.28secs
> DenseVector withoutMean and withStd: 5.30secs
> SparseVector withoutMean and withStd: 1.27secs
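For illustration, a sketch of the pattern described in 3) and 4) (variable 
names are assumptions, not the actual patch): hoist the arrays into local vals 
and match on the MLlib vector types directly.

{code}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}

// shift/factor are precomputed per-feature arrays; holding them in local vals
// lets the JIT reach them without repeated field dereferences in the hot loop.
def scale(vector: Vector, shift: Array[Double], factor: Array[Double]): Vector = {
  val localShift = shift
  val localFactor = factor
  vector match {
    case dv: DenseVector =>
      val values = dv.values.clone()
      var i = 0
      while (i < values.length) {
        values(i) = (values(i) - localShift(i)) * localFactor(i)
        i += 1
      }
      Vectors.dense(values)
    case sv: SparseVector =>
      // Without mean subtraction, sparsity is preserved: scale nonzeros only.
      val values = sv.values.clone()
      var i = 0
      while (i < values.length) {
        values(i) = values(i) * localFactor(sv.indices(i))
        i += 1
      }
      Vectors.sparse(sv.size, sv.indices, values)
  }
}
{code}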



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4547) OOM when making bins in BinaryClassificationMetrics

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4547:
-
Target Version/s: 1.3.0

> OOM when making bins in BinaryClassificationMetrics
> ---
>
> Key: SPARK-4547
> URL: https://issues.apache.org/jira/browse/SPARK-4547
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Sean Owen
>Priority: Minor
>
> Also following up on 
> http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3CCAMAsSdK4s4TNkf3_ecLC6yD-pLpys_PpT3WB7Tp6=yoxuxf...@mail.gmail.com%3E
>  -- this one I intend to make a PR for a bit later. The conversation was 
> basically:
> {quote}
> Recently I was using BinaryClassificationMetrics to build an AUC curve for a 
> classifier over a reasonably large number of points (~12M). The scores were 
> all probabilities, so tended to be almost entirely unique.
> The computation does some operations by key, and this ran out of memory. It's 
> something you can solve with more than the default amount of memory, but in 
> this case, it seemed not very useful to create an AUC curve with such fine-grained 
> resolution.
> I ended up just binning the scores so there were ~1000 unique values
> and then it was fine.
> {quote}
> and:
> {quote}
> Yes, if there are many distinct values, we need binning to compute the AUC 
> curve. Usually, the scores are not evenly distributed, so we cannot simply 
> truncate the digits. Estimating the quantiles for binning is necessary, 
> similar to RangePartitioner:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L104
> Limiting the number of bins is definitely useful.
> {quote}
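A sketch of the workaround mentioned above (the crude variant of simply 
reducing the number of distinct score values; quantile-aware binning, as 
suggested, would handle skewed score distributions better):

{code}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.rdd.RDD

// scoreAndLabels: (probability score, 0/1 label). Rounding scores to ~1000
// distinct values keeps the per-key work in BinaryClassificationMetrics small.
def aucWithBinnedScores(scoreAndLabels: RDD[(Double, Double)],
                        bins: Int = 1000): Double = {
  val binned = scoreAndLabels.map { case (score, label) =>
    (math.round(score * bins).toDouble / bins, label)
  }
  new BinaryClassificationMetrics(binned).areaUnderROC()
}
{code}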



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4547) OOM when making bins in BinaryClassificationMetrics

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4547:
-
Assignee: Sean Owen

> OOM when making bins in BinaryClassificationMetrics
> ---
>
> Key: SPARK-4547
> URL: https://issues.apache.org/jira/browse/SPARK-4547
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> Also following up on 
> http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3CCAMAsSdK4s4TNkf3_ecLC6yD-pLpys_PpT3WB7Tp6=yoxuxf...@mail.gmail.com%3E
>  -- this one I intend to make a PR for a bit later. The conversation was 
> basically:
> {quote}
> Recently I was using BinaryClassificationMetrics to build an AUC curve for a 
> classifier over a reasonably large number of points (~12M). The scores were 
> all probabilities, so tended to be almost entirely unique.
> The computation does some operations by key, and this ran out of memory. It's 
> something you can solve with more than the default amount of memory, but in 
> this case, it seemed unuseful to create an AUC curve with such fine-grained 
> resolution.
> I ended up just binning the scores so there were ~1000 unique values
> and then it was fine.
> {quote}
> and:
> {quote}
> Yes, if there are many distinct values, we need binning to compute the AUC 
> curve. Usually the scores are not evenly distributed, so we cannot simply 
> truncate the digits. Estimating the quantiles for binning is necessary, 
> similar to RangePartitioner:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L104
> Limiting the number of bins is definitely useful.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4577) Python example of LBFGS for MLlib guide

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4577:
-
Issue Type: Improvement  (was: Bug)

> Python example of LBFGS for MLlib guide
> ---
>
> Key: SPARK-4577
> URL: https://issues.apache.org/jira/browse/SPARK-4577
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, MLlib
>Reporter: Davies Liu
>Priority: Minor
>
> The MLlib guide should have a Python example of L-BFGS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4577) Python example of LBFGS for MLlib guide

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4577:
-
Target Version/s: 1.2.0

> Python example of LBFGS for MLlib guide
> ---
>
> Key: SPARK-4577
> URL: https://issues.apache.org/jira/browse/SPARK-4577
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, MLlib
>Reporter: Davies Liu
>Priority: Minor
>
> The MLlib guide should have a Python example of L-BFGS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3080:
-
Affects Version/s: 1.2.0

> ArrayIndexOutOfBoundsException in ALS for Large datasets
> 
>
> Key: SPARK-3080
> URL: https://issues.apache.org/jira/browse/SPARK-3080
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Burak Yavuz
>Assignee: Xiangrui Meng
>
> The stack trace is below:
> {quote}
> java.lang.ArrayIndexOutOfBoundsException: 2716
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> 
> org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
> 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> {quote}
> This happened after the dataset was sub-sampled. 
> Dataset properties: ~12B ratings
> Setup: 55 r3.8xlarge ec2 instances



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4517) Improve memory efficiency for python broadcast

2014-11-24 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4517:
--
Component/s: PySpark

> Improve memory efficiency for python broadcast
> --
>
> Key: SPARK-4517
> URL: https://issues.apache.org/jira/browse/SPARK-4517
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.2.0
>
>
> Currently, the Python broadcast (TorrentBroadcast) will have multiple copies 
> in :
> 1) 1 copy in python driver
> 2) 1 copy in disks of driver (serialized and compressed)
> 3) 2 copies in JVM driver (one is unserialized, one is serialized and 
> compressed)
> 4) 2 copies in executor (one is unserialized, one is serialized and 
> compressed)
> 5) one copy in each python worker.
> Some of them are different in HTTPBroadcast:
> 3)  one copy in memory of driver, one copy in disk (serialized and compressed)
> 4) one copy in memory of executor
> If the Python broadcast is 4G, then it needs 12G in the driver and 8+4x G in 
> each executor (x is the number of Python workers, usually the number of CPUs).
> The Python broadcast is already serialized and compressed in Python, so it 
> should not be serialized and compressed again in the JVM. Also, the JVM does 
> not need to know its content, so it can be kept outside the JVM.
> So we should have a specialized broadcast implementation for Python: it stores 
> the serialized and compressed data on disk, transfers it to executors in a p2p 
> way (similar to TorrentBroadcast), and sends it to the Python workers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4517) Improve memory efficiency for python broadcast

2014-11-24 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4517.
---
   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Davies Liu

This was fixed in https://github.com/apache/spark/pull/3417

> Improve memory efficiency for python broadcast
> --
>
> Key: SPARK-4517
> URL: https://issues.apache.org/jira/browse/SPARK-4517
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.2.0
>
>
> Currently, the Python broadcast (TorrentBroadcast) will have multiple copies 
> in :
> 1) 1 copy in python driver
> 2) 1 copy in disks of driver (serialized and compressed)
> 3) 2 copies in JVM driver (one is unserialized, one is serialized and 
> compressed)
> 4) 2 copies in executor (one is unserialized, one is serialized and 
> compressed)
> 5) one copy in each python worker.
> Some of them are different in HTTPBroadcast:
> 3)  one copy in memory of driver, one copy in disk (serialized and compressed)
> 4) one copy in memory of executor
> If the Python broadcast is 4G, then it needs 12G in the driver and 8+4x G in 
> each executor (x is the number of Python workers, usually the number of CPUs).
> The Python broadcast is already serialized and compressed in Python, so it 
> should not be serialized and compressed again in the JVM. Also, the JVM does 
> not need to know its content, so it can be kept outside the JVM.
> So we should have a specialized broadcast implementation for Python: it stores 
> the serialized and compressed data on disk, transfers it to executors in a p2p 
> way (similar to TorrentBroadcast), and sends it to the Python workers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2206) Automatically infer the number of classification classes in multiclass classification

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2206:
-
Target Version/s: 1.3.0  (was: 1.2.0)

> Automatically infer the number of classification classes in multiclass 
> classification
> -
>
> Key: SPARK-2206
> URL: https://issues.apache.org/jira/browse/SPARK-2206
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Manish Amde
>Assignee: Manish Amde
>
> Currently, the user needs to specify the numClassesForClassification 
> parameter explicitly during multiclass classification for decision trees. 
> This feature will automatically infer this information (and possibly class 
> histograms) from the training data.
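
A minimal sketch of the inference described above (assuming labels are 0-based class indices; the strategy-level wiring in the decision tree API is omitted):

{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Sketch: infer the number of classes (and the class histogram) from the
// training data instead of requiring numClassesForClassification explicitly.
// Assumes labels are 0, 1, ..., k-1.
def inferNumClasses(data: RDD[LabeledPoint]): Int = {
  val classCounts = data.map(_.label).countByValue()  // label -> count
  classCounts.keys.max.toInt + 1
}
{code}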



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3080:
-
Target Version/s: 1.3.0  (was: 1.2.0)

> ArrayIndexOutOfBoundsException in ALS for Large datasets
> 
>
> Key: SPARK-3080
> URL: https://issues.apache.org/jira/browse/SPARK-3080
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Burak Yavuz
>Assignee: Xiangrui Meng
>
> The stack trace is below:
> {quote}
> java.lang.ArrayIndexOutOfBoundsException: 2716
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> 
> org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
> 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> {quote}
> This happened after the dataset was sub-sampled. 
> Dataset properties: ~12B ratings
> Setup: 55 r3.8xlarge ec2 instances



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4548) Python broadcast perf regression from Spark 1.1

2014-11-24 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4548.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 3417
[https://github.com/apache/spark/pull/3417]

> Python broadcast perf regression from Spark 1.1
> ---
>
> Key: SPARK-4548
> URL: https://issues.apache.org/jira/browse/SPARK-4548
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.2.0
>
>
> Python broadcast in 1.2 is much slower than 1.1: 
> In spark-perf tests:
> name                      1.1      1.2      speedup
> python-broadcast-w-set    3.63     16.68    -78.23%



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-927) PySpark sample() doesn't work if numpy is installed on master but not on workers

2014-11-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-927:
-
Affects Version/s: 0.9.1

> PySpark sample() doesn't work if numpy is installed on master but not on 
> workers
> 
>
> Key: SPARK-927
> URL: https://issues.apache.org/jira/browse/SPARK-927
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.8.0, 0.9.1, 1.0.2, 1.1.2
>Reporter: Josh Rosen
>Assignee: Matthew Farrellee
>Priority: Minor
>
> PySpark's sample() method crashes with ImportErrors on the workers if numpy 
> is installed on the driver machine but not on the workers.  I'm not sure 
> what's the best way to fix this.  A general mechanism for automatically 
> shipping libraries from the master to the workers would address this, but 
> that could be complicated to implement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-927) PySpark sample() doesn't work if numpy is installed on master but not on workers

2014-11-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-927:
-
Affects Version/s: 1.1.2
   1.0.2

> PySpark sample() doesn't work if numpy is installed on master but not on 
> workers
> 
>
> Key: SPARK-927
> URL: https://issues.apache.org/jira/browse/SPARK-927
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.8.0, 0.9.1, 1.0.2, 1.1.2
>Reporter: Josh Rosen
>Assignee: Matthew Farrellee
>Priority: Minor
>
> PySpark's sample() method crashes with ImportErrors on the workers if numpy 
> is installed on the driver machine but not on the workers.  I'm not sure 
> what's the best way to fix this.  A general mechanism for automatically 
> shipping libraries from the master to the workers would address this, but 
> that could be complicated to implement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4565) Add docs about advanced spark application development

2014-11-24 Thread Joseph E. Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223878#comment-14223878
 ] 

Joseph E. Gonzalez commented on SPARK-4565:
---

Hmm, I wonder if it would make more sense to have an "Advanced Programming 
Guide" since a lot of these ideas (and those we might add for GraphX) are 
around how to use the Spark APIs most efficiently rather than how to configure 
the cluster and would be a distraction in the standard programming guide.

> Add docs about advanced spark application development
> -
>
> Key: SPARK-4565
> URL: https://issues.apache.org/jira/browse/SPARK-4565
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Evan Sparks
>Priority: Minor
>
> [~shivaram], [~jegonzal] and I have been working on a brief document based on 
> our experiences writing high performance spark applications - MLlib, GraphX, 
> pipelines, ml-matrix, etc.
> It currently exists here - 
> https://docs.google.com/document/d/1gEIawzRsOwksV_bq4je3ofnd-7Xu-u409mdW-RXTDnQ/edit?usp=sharing
> Would it make sense to add these tips and tricks to the Spark Wiki?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4582) Add getVectors to Word2VecModel

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4582:
-
Fix Version/s: 1.2.0

> Add getVectors to Word2VecModel
> ---
>
> Key: SPARK-4582
> URL: https://issues.apache.org/jira/browse/SPARK-4582
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Priority: Minor
> Fix For: 1.2.0
>
>
> Add getVectors to Word2VecModel for further processing. PR for branch-1.2:
> https://github.com/apache/spark/pull/3309



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4565) Add docs about advanced spark application development

2014-11-24 Thread Evan Sparks (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223864#comment-14223864
 ] 

Evan Sparks commented on SPARK-4565:


[~pwendell] suggested that we add this to the spark tuning guide and 
programming guide. What do you think?

> Add docs about advanced spark application development
> -
>
> Key: SPARK-4565
> URL: https://issues.apache.org/jira/browse/SPARK-4565
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Evan Sparks
>Priority: Minor
>
> [~shivaram], [~jegonzal] and I have been working on a brief document based on 
> our experiences writing high performance spark applications - MLlib, GraphX, 
> pipelines, ml-matrix, etc.
> It currently exists here - 
> https://docs.google.com/document/d/1gEIawzRsOwksV_bq4je3ofnd-7Xu-u409mdW-RXTDnQ/edit?usp=sharing
> Would it make sense to add these tips and tricks to the Spark Wiki?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4582) Add getVectors to Word2VecModel

2014-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223861#comment-14223861
 ] 

Apache Spark commented on SPARK-4582:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/3437

> Add getVectors to Word2VecModel
> ---
>
> Key: SPARK-4582
> URL: https://issues.apache.org/jira/browse/SPARK-4582
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Priority: Minor
>
> Add getVectors to Word2VecModel for further processing. PR for branch-1.2:
> https://github.com/apache/spark/pull/3309



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4565) Add docs about advanced spark application development

2014-11-24 Thread Joseph E. Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223856#comment-14223856
 ] 

Joseph E. Gonzalez commented on SPARK-4565:
---

Yes!  However, we might want to organize it a bit more?  

> Add docs about advanced spark application development
> -
>
> Key: SPARK-4565
> URL: https://issues.apache.org/jira/browse/SPARK-4565
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Evan Sparks
>Priority: Minor
>
> [~shivaram], [~jegonzal] and I have been working on a brief document based on 
> our experiences writing high performance spark applications - MLlib, GraphX, 
> pipelines, ml-matrix, etc.
> It currently exists here - 
> https://docs.google.com/document/d/1gEIawzRsOwksV_bq4je3ofnd-7Xu-u409mdW-RXTDnQ/edit?usp=sharing
> Would it make sense to add these tips and tricks to the Spark Wiki?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4395) Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour

2014-11-24 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223853#comment-14223853
 ] 

Davies Liu commented on SPARK-4395:
---

[~lian cheng] Could you help to investigate the cache issue here?

> Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour
> --
>
> Key: SPARK-4395
> URL: https://issues.apache.org/jira/browse/SPARK-4395
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0
> Environment: version 1.2.0-SNAPSHOT
>Reporter: Sameer Farooqui
>
> When I run this command it hangs for one to many hours and then finally 
> returns with successful results:
> >>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
> Note, the lab environment below is still active, so let me know if you'd like 
> to just access it directly.
> +++ My Environment +++
> - 1-node cluster in Amazon
> - RedHat 6.5 64-bit
> - java version "1.7.0_67"
> - SBT version: sbt-0.13.5
> - Scala version: scala-2.11.2
> Ran: 
> sudo yum -y update
> git clone https://github.com/apache/spark
> sudo sbt assembly
> +++ Data file used +++
> http://blueplastic.com/databricks/movielens/ratings.dat
> {code}
> >>> import re
> >>> import string
> >>> from pyspark.sql import SQLContext, Row
> >>> sqlContext = SQLContext(sc)
> >>> RATINGS_PATTERN = '^(\d+)::(\d+)::(\d+)::(\d+)'
> >>>
> >>> def parse_ratings_line(line):
> ... match = re.search(RATINGS_PATTERN, line)
> ... if match is None:
> ... # Optionally, you can change this to just ignore if each line of 
> data is not critical.
> ... raise Error("Invalid logline: %s" % logline)
> ... return Row(
> ... UserID= int(match.group(1)),
> ... MovieID   = int(match.group(2)),
> ... Rating= int(match.group(3)),
> ... Timestamp = int(match.group(4)))
> ...
> >>> ratings_base_RDD = 
> >>> (sc.textFile("file:///home/ec2-user/movielens/ratings.dat")
> ...# Call the parse_ratings_line function on each line.
> ....map(parse_ratings_line)
> ...# Caches the objects in memory since they will be queried 
> multiple times.
> ....cache())
> >>> ratings_base_RDD.count()
> 1000209
> >>> ratings_base_RDD.first()
> Row(MovieID=1193, Rating=5, Timestamp=978300760, UserID=1)
> >>> schemaRatings = sqlContext.inferSchema(ratings_base_RDD)
> >>> schemaRatings.registerTempTable("RatingsTable")
> >>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
> {code}
> (Now the Python shell hangs...)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4582) Add getVectors to Word2VecModel

2014-11-24 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-4582:


 Summary: Add getVectors to Word2VecModel
 Key: SPARK-4582
 URL: https://issues.apache.org/jira/browse/SPARK-4582
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Priority: Minor


Add getVectors to Word2VecModel for further processing. PR for branch-1.2:

https://github.com/apache/spark/pull/3309
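
Once merged, usage would look roughly like the sketch below (the {{Map[String, Array[Float]]}} return type is my reading of the linked PR, so treat it as an assumption):

{code}
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.rdd.RDD

// Sketch: train a Word2Vec model and pull out the raw word vectors for
// further processing (e.g. exporting them or feeding another algorithm).
def wordVectors(corpus: RDD[Seq[String]]): Map[String, Array[Float]] = {
  val model = new Word2Vec().fit(corpus)
  // getVectors is the accessor proposed in this issue.
  model.getVectors
}
{code}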



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4395) Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour

2014-11-24 Thread Sameer Farooqui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223851#comment-14223851
 ] 

Sameer Farooqui commented on SPARK-4395:


Hi Davies and Michael,

I can confirm that this works if I move the .cache() to AFTER the inferSchema 
as Davies suggested. But if the cache is first, then the hang occurs.

The workaround works for me for now, although other people could also run into 
this if they're not aware of this JIRA...

> Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour
> --
>
> Key: SPARK-4395
> URL: https://issues.apache.org/jira/browse/SPARK-4395
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0
> Environment: version 1.2.0-SNAPSHOT
>Reporter: Sameer Farooqui
>
> When I run this command it hangs for one to many hours and then finally 
> returns with successful results:
> >>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
> Note, the lab environment below is still active, so let me know if you'd like 
> to just access it directly.
> +++ My Environment +++
> - 1-node cluster in Amazon
> - RedHat 6.5 64-bit
> - java version "1.7.0_67"
> - SBT version: sbt-0.13.5
> - Scala version: scala-2.11.2
> Ran: 
> sudo yum -y update
> git clone https://github.com/apache/spark
> sudo sbt assembly
> +++ Data file used +++
> http://blueplastic.com/databricks/movielens/ratings.dat
> {code}
> >>> import re
> >>> import string
> >>> from pyspark.sql import SQLContext, Row
> >>> sqlContext = SQLContext(sc)
> >>> RATINGS_PATTERN = '^(\d+)::(\d+)::(\d+)::(\d+)'
> >>>
> >>> def parse_ratings_line(line):
> ... match = re.search(RATINGS_PATTERN, line)
> ... if match is None:
> ... # Optionally, you can change this to just ignore if each line of 
> data is not critical.
> ... raise Error("Invalid logline: %s" % logline)
> ... return Row(
> ... UserID= int(match.group(1)),
> ... MovieID   = int(match.group(2)),
> ... Rating= int(match.group(3)),
> ... Timestamp = int(match.group(4)))
> ...
> >>> ratings_base_RDD = 
> >>> (sc.textFile("file:///home/ec2-user/movielens/ratings.dat")
> ...# Call the parse_ratings_line function on each line.
> ....map(parse_ratings_line)
> ...# Caches the objects in memory since they will be queried 
> multiple times.
> ....cache())
> >>> ratings_base_RDD.count()
> 1000209
> >>> ratings_base_RDD.first()
> Row(MovieID=1193, Rating=5, Timestamp=978300760, UserID=1)
> >>> schemaRatings = sqlContext.inferSchema(ratings_base_RDD)
> >>> schemaRatings.registerTempTable("RatingsTable")
> >>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
> {code}
> (Now the Python shell hangs...)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4578) Row.asDict() should keep the type of values

2014-11-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4578.

   Resolution: Fixed
Fix Version/s: 1.2.0

Thanks Davies, I've resolved this.

> Row.asDict() should keep the type of values
> ---
>
> Key: SPARK-4578
> URL: https://issues.apache.org/jira/browse/SPARK-4578
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.2.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.2.0
>
>
> Currently, the nested Row is returned as a tuple; it should be a Row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4580) Document random forests and boosting in programming guide

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4580:
-
Assignee: Joseph K. Bradley

> Document random forests and boosting in programming guide
> -
>
> Key: SPARK-4580
> URL: https://issues.apache.org/jira/browse/SPARK-4580
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> New items in Spark 1.2 require documentation updates, especially in the 
> programming guide:
> * RandomForest
> * GradientBoostedTrees



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4562) GLM testing time regressions from Spark 1.1

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-4562.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 3420
[https://github.com/apache/spark/pull/3420]

> GLM testing time regressions from Spark 1.1
> ---
>
> Key: SPARK-4562
> URL: https://issues.apache.org/jira/browse/SPARK-4562
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.2.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.2.0
>
>
> There is a performance regression in the GLM tests, introduced by the 
> serialization change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4525) MesosSchedulerBackend.resourceOffers cannot decline unused offers from acceptedOffers

2014-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223841#comment-14223841
 ] 

Apache Spark commented on SPARK-4525:
-

User 'pwendell' has created a pull request for this issue:
https://github.com/apache/spark/pull/3436

> MesosSchedulerBackend.resourceOffers cannot decline unused offers from 
> acceptedOffers
> -
>
> Key: SPARK-4525
> URL: https://issues.apache.org/jira/browse/SPARK-4525
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Jongyoul Lee
>Assignee: Jongyoul Lee
>Priority: Blocker
>
> After the resourceOffers function was refactored (SPARK-2269), it no longer 
> declines unused offers among the accepted offers. Previously this happened 
> implicitly: calling driver.launchTasks with an empty task collection declines 
> the offer (declineOffer(offer.id)). 
> {quote}
> Invoking this function with an empty collection of tasks declines offers in 
> their entirety (see SchedulerDriver.declineOffer(OfferID, Filters)).
> - 
> http://mesos.apache.org/api/latest/java/org/apache/mesos/MesosSchedulerDriver.html#launchTasks(OfferID,%20java.util.Collection,%20Filters)
> {quote}
> In branch-1.1, resourceOffers calls launchTasks for every offered offer, so the 
> driver declines unused resources. In the current master, offers are first split 
> into accepted and declined offers based on their resources; the declined offers 
> are declined explicitly, and the accepted offers that have tasks are launched 
> via driver.launchTasks. However, accepted offers without any tasks are neither 
> launched with an empty task list nor declined explicitly. As a result, the 
> Mesos master assumes those offers are still in use by the TaskScheduler and 
> that no resources remain.
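
A minimal sketch of the fix direction described above (every accepted offer either launches its tasks or is declined explicitly). The helper name and the `tasksPerOffer` structure are illustrative assumptions, not the actual patch:

{code}
import java.util.Collections
import scala.collection.JavaConverters._
import org.apache.mesos.SchedulerDriver
import org.apache.mesos.Protos.{Offer, OfferID, TaskInfo}

// Sketch: an accepted offer that ends up with no tasks must still be declined,
// otherwise the Mesos master keeps treating its resources as in use.
def launchOrDecline(d: SchedulerDriver,
                    acceptedOffers: Seq[Offer],
                    tasksPerOffer: Map[OfferID, Seq[TaskInfo]]): Unit = {
  acceptedOffers.foreach { offer =>
    val tasks = tasksPerOffer.getOrElse(offer.getId, Seq.empty)
    if (tasks.nonEmpty) {
      d.launchTasks(Collections.singleton(offer.getId), tasks.asJava)
    } else {
      d.declineOffer(offer.getId) // explicitly give the unused resources back
    }
  }
}
{code}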



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4180) SparkContext constructor should throw exception if another SparkContext is already running

2014-11-24 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4180:
-
Assignee: Josh Rosen

> SparkContext constructor should throw exception if another SparkContext is 
> already running
> --
>
> Key: SPARK-4180
> URL: https://issues.apache.org/jira/browse/SPARK-4180
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
> Fix For: 1.2.0
>
>
> Spark does not currently support multiple concurrently-running SparkContexts 
> in the same JVM (see SPARK-2243).  Therefore, SparkContext's constructor 
> should throw an exception if there is an active SparkContext that has not 
> been shut down via {{stop()}}.
> PySpark already does this, but the Scala SparkContext should do the same 
> thing.  The current behavior with multiple active contexts is unspecified / 
> not understood and it may be the source of confusing errors (see the user 
> error report in SPARK-4080, for example).
> This should be pretty easy to add: just add a {{activeSparkContext}} field to 
> the SparkContext companion object and {{synchronize}} on it in the 
> constructor and {{stop()}} methods; see PySpark's {{context.py}} file for an 
> example of this approach.
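
A minimal sketch of the guard described above (the object and field names are illustrative; the real change also has to cover error paths and the existing stop() semantics):

{code}
// Sketch: the SparkContext constructor registers itself and fails fast if
// another context is still active; stop() clears the slot.
object SparkContextGuard {
  private val lock = new Object
  private var activeContext: Option[String] = None // e.g. app name / creation site

  def markCreated(name: String): Unit = lock.synchronized {
    if (activeContext.isDefined) {
      throw new IllegalStateException(
        s"Only one SparkContext may run in this JVM; ${activeContext.get} is still active. " +
          "Call stop() on it before creating a new one.")
    }
    activeContext = Some(name)
  }

  def markStopped(): Unit = lock.synchronized {
    activeContext = None
  }
}
{code}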



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4578) Row.asDict() should keep the type of values

2014-11-24 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4578:
-
Assignee: Davies Liu

> Row.asDict() should keep the type of values
> ---
>
> Key: SPARK-4578
> URL: https://issues.apache.org/jira/browse/SPARK-4578
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.2.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
>
> Currently, the nested Row is returned as a tuple; it should be a Row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4581) Refactorize StandardScaler to improve the transformation performance

2014-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223792#comment-14223792
 ] 

Apache Spark commented on SPARK-4581:
-

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/3435

> Refactorize StandardScaler to improve the transformation performance
> 
>
> Key: SPARK-4581
> URL: https://issues.apache.org/jira/browse/SPARK-4581
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: DB Tsai
>
> The following optimizations are done to improve the StandardScaler model 
> transformation performance.
> 1) Convert the Breeze dense vector to a primitive vector to reduce the overhead.
> 2) Since the mean can potentially be a sparse vector, we explicitly convert it 
> to a dense primitive vector.
> 3) Keep a local reference to the `shift` and `factor` arrays so the JVM can 
> locate the values with one operation.
> 4) In the pattern matching part, we use the MLlib SparseVector/DenseVector 
> instead of Breeze's vectors to make the codebase cleaner. 
> Benchmark with mnist8m dataset:
> Before,
> DenseVector withMean and withStd: 50.97secs
> DenseVector withMean and withoutStd: 42.11secs
> DenseVector withoutMean and withStd: 8.75secs
> SparseVector withoutMean and withStd: 5.437secs
> With this PR,
> DenseVector withMean and withStd: 5.76secs
> DenseVector withMean and withoutStd: 5.28secs
> DenseVector withoutMean and withStd: 5.30secs
> SparseVector withoutMean and withStd: 1.27secs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4581) Refactorize StandardScaler to improve the transformation performance

2014-11-24 Thread DB Tsai (JIRA)
DB Tsai created SPARK-4581:
--

 Summary: Refactorize StandardScaler to improve the transformation 
performance
 Key: SPARK-4581
 URL: https://issues.apache.org/jira/browse/SPARK-4581
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: DB Tsai


The following optimizations are done to improve the StandardScaler model 
transformation performance.

1) Convert the Breeze dense vector to a primitive vector to reduce the overhead.
2) Since the mean can potentially be a sparse vector, we explicitly convert it to 
a dense primitive vector.
3) Keep a local reference to the `shift` and `factor` arrays so the JVM can locate 
the values with one operation.
4) In the pattern matching part, we use the MLlib SparseVector/DenseVector instead 
of Breeze's vectors to make the codebase cleaner. 

Benchmark with mnist8m dataset:

Before,
DenseVector withMean and withStd: 50.97secs
DenseVector withMean and withoutStd: 42.11secs
DenseVector withoutMean and withStd: 8.75secs
SparseVector withoutMean and withStd: 5.437secs

With this PR,
DenseVector withMean and withStd: 5.76secs
DenseVector withMean and withoutStd: 5.28secs
DenseVector withoutMean and withStd: 5.30secs
SparseVector withoutMean and withStd: 1.27secs
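
A minimal sketch of the hot loop after optimizations 1-3 (the actual change lives in StandardScalerModel.transform; the array names mirror the description above, but the details here are assumptions):

{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Sketch: operate on primitive arrays with local references to `shift` and
// `factor` instead of going through Breeze vectors, so the JVM can reach the
// values with a single array access per element.
def transformDense(v: Vector, shift: Array[Double], factor: Array[Double],
                   withMean: Boolean, withStd: Boolean): Vector = {
  val values = v.toArray.clone()   // primitive double[] copy
  val localShift = shift           // local refs avoid repeated field loads
  val localFactor = factor
  var i = 0
  while (i < values.length) {
    var x = values(i)
    if (withMean) x -= localShift(i)
    if (withStd) x *= localFactor(i)
    values(i) = x
    i += 1
  }
  Vectors.dense(values)
}
{code}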



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4578) Row.asDict() should keep the type of values

2014-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223742#comment-14223742
 ] 

Apache Spark commented on SPARK-4578:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/3434

> Row.asDict() should keep the type of values
> ---
>
> Key: SPARK-4578
> URL: https://issues.apache.org/jira/browse/SPARK-4578
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.2.0
>Reporter: Davies Liu
>Priority: Blocker
>
> Currently, the nested Row is returned as a tuple; it should be a Row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4447) Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha

2014-11-24 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza reassigned SPARK-4447:
-

Assignee: Sandy Ryza  (was: Patrick Wendell)

> Remove layers of abstraction in YARN code no longer needed after dropping 
> yarn-alpha
> 
>
> Key: SPARK-4447
> URL: https://issues.apache.org/jira/browse/SPARK-4447
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>
> For example, YarnRMClient and YarnRMClientImpl can be merged
> YarnAllocator and YarnAllocationHandler can be merged



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4447) Remove layers of abstraction in YARN code no longer needed after dropping yarn-alpha

2014-11-24 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza reassigned SPARK-4447:
-

Assignee: Patrick Wendell

> Remove layers of abstraction in YARN code no longer needed after dropping 
> yarn-alpha
> 
>
> Key: SPARK-4447
> URL: https://issues.apache.org/jira/browse/SPARK-4447
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Sandy Ryza
>Assignee: Patrick Wendell
>
> For example, YarnRMClient and YarnRMClientImpl can be merged
> YarnAllocator and YarnAllocationHandler can be merged



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4196) Streaming + checkpointing + saveAsNewAPIHadoopFiles = NotSerializableException for Hadoop Configuration

2014-11-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4196:
---
Assignee: Tathagata Das  (was: Patrick Wendell)

> Streaming + checkpointing + saveAsNewAPIHadoopFiles = 
> NotSerializableException for Hadoop Configuration
> ---
>
> Key: SPARK-4196
> URL: https://issues.apache.org/jira/browse/SPARK-4196
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Sean Owen
>Assignee: Tathagata Das
>
> I am reasonably sure there is some issue here in Streaming and that I'm not 
> missing something basic, but not 100%. I went ahead and posted it as a JIRA 
> to track, since it's come up a few times before without resolution, and right 
> now I can't get checkpointing to work at all.
> When Spark Streaming checkpointing is enabled, I see a 
> NotSerializableException thrown for a Hadoop Configuration object, and it 
> seems like it is not one from my user code.
> Before I post my particular instance see 
> http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3c1408135046777-12202.p...@n3.nabble.com%3E
>  for another occurrence.
> I was also on customer site last week debugging an identical issue with 
> checkpointing in a Scala-based program and they also could not enable 
> checkpointing without hitting exactly this error.
> The essence of my code is:
> {code}
> final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
> JavaStreamingContextFactory streamingContextFactory = new
> JavaStreamingContextFactory() {
>   @Override
>   public JavaStreamingContext create() {
> return new JavaStreamingContext(sparkContext, new
> Duration(batchDurationMS));
>   }
> };
>   streamingContext = JavaStreamingContext.getOrCreate(
>   checkpointDirString, sparkContext.hadoopConfiguration(),
> streamingContextFactory, false);
>   streamingContext.checkpoint(checkpointDirString);
> {code}
> It yields:
> {code}
> 2014-10-31 14:29:00,211 ERROR OneForOneStrategy:66
> org.apache.hadoop.conf.Configuration
> - field (class 
> "org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9",
> name: "conf$2", type: "class org.apache.hadoop.conf.Configuration")
> - object (class
> "org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9",
> )
> - field (class "org.apache.spark.streaming.dstream.ForEachDStream",
> name: "org$apache$spark$streaming$dstream$ForEachDStream$$foreachFunc",
> type: "interface scala.Function2")
> - object (class "org.apache.spark.streaming.dstream.ForEachDStream",
> org.apache.spark.streaming.dstream.ForEachDStream@cb8016a)
> ...
> {code}
> This looks like it's due to PairRDDFunctions, as this saveFunc seems
> to be  org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9
> :
> {code}
> def saveAsNewAPIHadoopFiles(
> prefix: String,
> suffix: String,
> keyClass: Class[_],
> valueClass: Class[_],
> outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
> conf: Configuration = new Configuration
>   ) {
>   val saveFunc = (rdd: RDD[(K, V)], time: Time) => {
> val file = rddToFileName(prefix, suffix, time)
> rdd.saveAsNewAPIHadoopFile(file, keyClass, valueClass,
> outputFormatClass, conf)
>   }
>   self.foreachRDD(saveFunc)
> }
> {code}
> Is that not a problem? But then I don't know how it would ever work in Spark. 
> Then again, I don't see why this is an issue only when checkpointing is 
> enabled.
> Long-shot, but I wonder if it is related to closure issues like 
> https://issues.apache.org/jira/browse/SPARK-1866
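
One workaround sketch (not necessarily the eventual fix): avoid capturing the driver-side Configuration in the checkpointed closure, for example by rebuilding it inside the per-batch function, so nothing non-serializable ends up in the DStream graph. Here `saveAsTextFile` stands in for the Hadoop save, and passing settings as a plain Map is an assumption:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Sketch: instead of closing over an outer Configuration (which gets serialized
// with the checkpointed DStream graph and throws NotSerializableException),
// pass only serializable settings and build the Configuration per batch.
def saveEachBatch(stream: DStream[(String, String)],
                  prefix: String,
                  confSettings: Map[String, String]): Unit = {
  stream.foreachRDD { (rdd: RDD[(String, String)], time: Time) =>
    val conf = new Configuration()                       // built here, not captured
    confSettings.foreach { case (k, v) => conf.set(k, v) }
    // ... use `conf` with rdd.saveAsNewAPIHadoopFile(...) here ...
    rdd.saveAsTextFile(s"$prefix-${time.milliseconds}")  // simplified stand-in
  }
}
{code}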



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


