Re: Spark DataFrame UNPIVOT feature

2018-08-22 Thread Mike Hynes
Hi Reynold/Ivan, People familiar with pandas and R dataframes will likely have used the dataframe "melt" idiom, which is the functionality I believe you are referring to: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html I have had to write this function myself in my own
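For readers unfamiliar with the melt idiom, here is a minimal sketch of one way to express it in Spark today via SQL's stack() function; the DataFrame, column names, and values below are purely illustrative and are not the proposed UNPIVOT API.

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch of a melt-style unpivot using SQL's stack(); the DataFrame
// and column names (id, a, b) are illustrative, not the proposed UNPIVOT API.
val spark = SparkSession.builder.appName("melt-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val wide = Seq((1, 10.0, 20.0), (2, 30.0, 40.0)).toDF("id", "a", "b")

// stack(2, 'a', a, 'b', b) emits one (variable, value) row per listed column
// per input row, which is the long format pandas.melt produces.
val long = wide.selectExpr("id", "stack(2, 'a', a, 'b', b) as (variable, value)")
long.show()
```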

Re: RDD.broadcast

2016-04-28 Thread Mike Hynes
I second the interest in knowing the use case. I can imagine a case where knowledge of the RDD key distribution would help local computations, for relatively few keys, but would be interested to hear your motive. Essentially, are you trying to achieve what would be an all-reduce type operation in
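A rough sketch of the collect-then-broadcast pattern the question seems to be circling (an all-reduce approximation), assuming the per-key summary is small enough to hold on the driver; the data and key names are made up:

```scala
import org.apache.spark.SparkContext

// Sketch only: reduce per key, pull the small summary to the driver, then
// broadcast it so every task can consult the global key distribution locally.
// Assumes the number of distinct keys is small; `sc` is an existing SparkContext.
def normalizeByKeyTotals(sc: SparkContext): Unit = {
  val data = sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("a", 3.0)))

  // The "all-reduce" step: per-key totals gathered on the driver ...
  val keyTotals: Map[String, Double] = data.reduceByKey(_ + _).collectAsMap().toMap

  // ... then shipped back to every executor as a broadcast variable.
  val bcTotals = sc.broadcast(keyTotals)
  val normalized = data.map { case (k, v) => (k, v / bcTotals.value(k)) }
  normalized.collect().foreach(println)
}
```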

Re: executor delay in Spark

2016-04-24 Thread Mike Hynes
provement in > runtime when the partitioning is even (happens when count is moved). > > Any pointers in figuring out this issue is much appreciated. > > Regards, > Raghava. > > > > > On Fri, Apr 22, 2016 at 7:40 PM, Mike Hynes <91m...@gmail.com> wrote: > >

Re: executor delay in Spark

2016-04-22 Thread Mike Hynes
h spark-submit) at a > later stage also. > > Apart from introducing a dummy stage or running it from spark-shell, is > there any other option to fix this? > > Regards, > Raghava. > > > On Mon, Apr 18, 2016 at 12:17 AM, Mike Hynes <91m...@gmail.com> wrote: > >

Re: RDD Partitions not distributed evenly to executors

2016-04-06 Thread Mike Hynes
s, > -Khaled > > > > > On Mon, Apr 4, 2016 at 10:57 PM, Koert Kuipers <ko...@tresata.com> wrote: > >> can you try: >> spark.shuffle.reduceLocality.enabled=false >> >> On Mon, Apr 4, 2016 at 8:17 PM, Mike Hynes <91m...@gmail.com> wrote: >>

Re: RDD Partitions not distributed evenly to executors

2016-04-04 Thread Mike Hynes
f anyone else has any other ideas or experience, please let me know. Mike On 4/4/16, Koert Kuipers <ko...@tresata.com> wrote: > we ran into similar issues and it seems related to the new memory > management. can you try: > spark.memory.useLegacyMode = true > > On Mo
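Both replies quoted in this thread point at configuration workarounds; a small sketch of trying them together on a SparkConf (they could equally be passed as --conf flags to spark-submit). Whether either setting actually helps is exactly what the thread is trying to establish.

```scala
import org.apache.spark.SparkConf

// Sketch of the two workarounds suggested in this thread.
val conf = new SparkConf()
  .setAppName("partition-distribution-test")
  .set("spark.shuffle.reduceLocality.enabled", "false") // disable reduce-task locality preference
  .set("spark.memory.useLegacyMode", "true")            // fall back to pre-1.6 memory management
```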

RDD Partitions not distributed evenly to executors

2016-04-04 Thread Mike Hynes
[ CC'ing dev list since nearly identical questions have occurred in user list recently w/o resolution; cf. http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html

Re: sbt publish-local fails with 2.0.0-SNAPSHOT

2016-02-01 Thread Mike Hynes
just created a JIRA ( > https://issues.apache.org/jira/browse/SPARK-13109) to track this. > > > On Mon, Feb 1, 2016 at 3:01 PM, Mike Hynes <91m...@gmail.com> wrote: > >> Hi devs, >> >> I used to be able to do some local development from the upstream >> master

sbt publish-local fails with 2.0.0-SNAPSHOT

2016-01-31 Thread Mike Hynes
Hi devs, I used to be able to do some local development from the upstream master branch and run the publish-local command in an sbt shell to publish the modified jars to the local ~/.ivy2 repository. I relied on this behaviour, since I could write other local packages that had my local
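For context, a sketch of the downstream workflow being described: after `sbt publish-local` writes the Spark jars to ~/.ivy2, a separate local project can depend on the snapshot version. The build.sbt below is illustrative only; the project name and Scala version are assumptions.

```scala
// build.sbt of a hypothetical downstream project relying on locally published
// 2.0.0-SNAPSHOT jars from ~/.ivy2/local (which sbt resolves by default).
name := "local-spark-experiments"

scalaVersion := "2.11.7"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0-SNAPSHOT"
```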

Re: Gradient Descent with large model size

2015-10-19 Thread Mike Hynes
Hi Alexander, Joseph, Evan, I just wanted to weigh in with an empirical result that we've had on a standalone cluster with 16 nodes and 256 cores. Typically we run optimization tasks with 256 partitions for 1 partition per core, and find that performance worsens with more partitions than physical
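A sketch of the partitioning setup described above (one partition per physical core on the 16-node, 256-core cluster); the RDD and core count are placeholders for your own data and hardware.

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Sketch: repartition the training data to one partition per physical core and
// cache it before handing it to an iterative optimizer. 256 matches the cluster
// described above; adjust to your own total core count.
def prepareTrainingData(training: RDD[LabeledPoint], totalCores: Int = 256): RDD[LabeledPoint] =
  training.repartition(totalCores).cache()
```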

Re: No speedup in MultiLayerPerceptronClassifier with increase in number of cores

2015-10-11 Thread Mike Hynes
Having only 2 workers for 5 machines would be your problem: you probably want 1 worker per physical machine, which entails running the spark-daemon.sh script to start a worker on those machines. The partitioning is agnostic to how many executors are available for running the tasks, so you can't

Re: RDD API patterns

2015-09-26 Thread Mike Hynes
Hello Devs, This email concerns some timing results for a treeAggregate in computing a (stochastic) gradient over an RDD of labelled points, as is currently done in the MLlib optimization routine for SGD. In SGD, the underlying RDD is downsampled by a fraction f \in (0,1], and the subgradients
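A simplified sketch of the computation being timed, assuming the same shape as MLlib's GradientDescent.runMiniBatchSGD: sample a fraction f of the points, then sum per-point gradients with treeAggregate. The squared-loss gradient and the Breeze types here are only illustrative.

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import breeze.linalg.{DenseVector => BDV}

// Sketch: downsample by fraction f, then compute a (sub)gradient sum with
// treeAggregate. `data`, `weights`, `f`, and `seed` are assumed inputs.
def sampledGradient(data: RDD[LabeledPoint], weights: BDV[Double], f: Double, seed: Long): BDV[Double] =
  data.sample(withReplacement = false, fraction = f, seed = seed)
    .treeAggregate(BDV.zeros[Double](weights.length))(
      (grad, point) => {
        val x = BDV(point.features.toArray)
        val err = (weights dot x) - point.label
        grad + x * err             // accumulate the per-point squared-loss gradient
      },
      (g1, g2) => g1 + g2          // merge partial sums up the aggregation tree
    )
```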

treeAggregate timing / SGD performance with miniBatchFraction < 1

2015-09-26 Thread Mike Hynes
for things like > task serialization and other platform overheads. You've got to balance how > much computation you want to do vs. the amount of time you want to spend > waiting for the platform. > > - Evan > > On Sat, Sep 26, 2015 at 9:27 AM, Mike Hynes <91m...@gmail.com> wrote:

Re: treeAggregate timing / SGD performance with miniBatchFraction < 1

2015-09-26 Thread Mike Hynes
a slow link for the > last portion this could really make a difference. > > On Sat, Sep 26, 2015 at 10:20 AM, Mike Hynes <91m...@gmail.com> wrote: > >> Hi Evan, >> >> (I just realized my initial email was a reply to the wrong thread; I'm >> very sorry about

Re: OOM in spark driver

2015-09-02 Thread Mike Hynes
Just a thought; this has worked for me before on a standalone client with a similar OOM error in a driver thread. Try setting: export SPARK_DAEMON_MEMORY=4G # or whatever size you can afford on your machine in your environment/spark-env.sh before running spark-submit. Mike On 9/2/15, ankit tyagi

Re: Broadcast variable of size 1 GB fails with negative memory exception

2015-07-29 Thread Mike Hynes
. imran On Tue, Jul 28, 2015 at 10:56 PM, Mike Hynes 91m...@gmail.com wrote: Hi Imran, Thanks for your reply. I have double-checked the code I ran to generate an nxn matrix and nx1 vector for n = 2^27. There was unfortunately a bug in it, where instead of having typed 134,217,728 for n = 2^27, I

Re: Broadcast variable of size 1 GB fails with negative memory exception

2015-07-28 Thread Mike Hynes
to exhibit the same error. On Tue, Jul 28, 2015 at 12:37 PM, Mike Hynes 91m...@gmail.com wrote: Hello Devs, I am investigating how matrix vector multiplication can scale for an IndexedRowMatrix in mllib.linalg.distributed. Currently, I am broadcasting the vector to be multiplied on the right

Broadcast variable of size 1 GB fails with negative memory exception

2015-07-28 Thread Mike Hynes
Hello Devs, I am investigating how matrix vector multiplication can scale for an IndexedRowMatrix in mllib.linalg.distributed. Currently, I am broadcasting the vector to be multiplied on the right. The IndexedRowMatrix is stored across a cluster with up to 16 nodes, each with 200 GB of memory.
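For concreteness, a condensed sketch of the setup described: broadcast the n x 1 vector (n = 2^27 doubles is roughly 1 GB) and compute one dot product per IndexedRow. The method and variable names are mine, not the actual code under investigation.

```scala
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.rdd.RDD

// Sketch of matrix-vector multiplication with a broadcast right-hand side.
// `matrix` is a hypothetical IndexedRowMatrix; `v` is the local n x 1 vector
// (for n = 2^27 doubles the broadcast payload is about 1 GB, the size at issue).
def multiplyByBroadcast(matrix: IndexedRowMatrix, v: Array[Double]): RDD[(Long, Double)] = {
  val sc = matrix.rows.sparkContext
  val bv = sc.broadcast(v)
  matrix.rows.map { case IndexedRow(i, row) =>
    val x = bv.value
    val a = row.toArray          // densifies sparse rows; fine for a sketch
    var sum = 0.0
    var j = 0
    while (j < a.length) { sum += a(j) * x(j); j += 1 }
    (i, sum)
  }
}
```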

Re: Questions about Fault tolerance of Spark

2015-07-10 Thread MIKE HYNES
Gentle bump on this topic; how to test the fault tolerance and previous benchmark results are both things we are interested in as well. Mike Original message From: 牛兆捷 nzjem...@gmail.com Date: 07-09-2015 04:19 (GMT-05:00) To: dev@spark.apache.org,

Re: Stages with non-arithmetic numbering & Timing metrics in event logs

2015-06-09 Thread Mike Hynes
transfer. It could be that there is no (measurable) wait time b/c the next blocks are fetched before they are needed. Shuffle writes occur in the normal task execution thread, though, so we (try to) measure all of it. On Sun, Jun 7, 2015 at 11:12 PM, Mike Hynes 91m...@gmail.com wrote: Hi

Re: Stages with non-arithmetic numbering & Timing metrics in event logs

2015-06-09 Thread Mike Hynes
Ahhh---forgive my typo: what I mean is, (t2 - t1) = (t_ser + t_deser + t_exec) is satisfied, empirically. On 6/10/15, Mike Hynes 91m...@gmail.com wrote: Hi Imran, Thank you for your email. In examining the condition (t2 - t1) > (t_ser + t_deser + t_exec), I have found it to be true, although I
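For reference, a minimal sketch of the per-task consistency check being discussed, assuming the launch/finish timestamps and the serialization, deserialization, and execution durations (all in ms) have already been pulled from the event logs; the field and type names are mine.

```scala
// Sketch: the task's wall-clock span (finish - launch, i.e. t2 - t1) should be
// accounted for by serialization, deserialization, and execution time, with any
// remainder attributed to scheduler or platform overhead.
final case class TaskTiming(launch: Long, finish: Long,
                            deserMs: Long, execMs: Long, serMs: Long)

def unaccountedOverheadMs(t: TaskTiming): Long =
  (t.finish - t.launch) - (t.deserMs + t.execMs + t.serMs)

// (t2 - t1) >= t_deser + t_exec + t_ser holds when the overhead is non-negative.
def conditionHolds(t: TaskTiming): Boolean = unaccountedOverheadMs(t) >= 0
```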

Stages with non-arithmetic numbering & Timing metrics in event logs

2015-06-07 Thread Mike Hynes
behavior on the driver UI? (the one running on port 4040). If you click on the stage id header you can sort the stages based on IDs. Thanks Best Regards On Fri, Jun 5, 2015 at 10:21 PM, Mike Hynes 91m...@gmail.com wrote: Hi folks, When I look at the output logs for an iterative Spark program

Scheduler question: stages with non-arithmetic numbering

2015-06-05 Thread Mike Hynes
Hi folks, When I look at the output logs for an iterative Spark program, I see that the stage IDs are not arithmetically numbered---that is, there are gaps between stages and I might find log information about Stages 0, 1, 2, 5, but not 3 or 4. As an example, the output from the Spark logs below

Re: Spark config option 'expression language' feedback request

2015-03-31 Thread Mike Hynes
Hi, This is just a thought from my experience setting up Spark to run on a Linux cluster. I found it a bit unusual that some parameters could be specified as command line args to spark-submit, others as env variables, and some in a configuration file. What I ended up doing was writing my own bash

Re: [ERROR] bin/compute-classpath.sh: fails with false positive test for java 1.7 vs 1.6

2015-02-24 Thread Mike Hynes
show? are you sure you don't have JRE 7 but JDK 6 installed? On Tue, Feb 24, 2015 at 11:02 PM, Mike Hynes 91m...@gmail.com wrote: ./bin/compute-classpath.sh fails with error: $ jar -tf assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop1.0.4.jar nonexistent/class/path

[ERROR] bin/compute-classpath.sh: fails with false positive test for java 1.7 vs 1.6

2015-02-24 Thread Mike Hynes
./bin/compute-classpath.sh fails with error: $ jar -tf assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop1.0.4.jar nonexistent/class/path java.util.zip.ZipException: invalid CEN header (bad signature) at java.util.zip.ZipFile.open(Native Method) at