Hi Reynold/Ivan,
People familiar with pandas and R dataframes will likely have used the
dataframe "melt" idiom, which is the functionality I believe you are
referring to:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html
I have had to write this function myself in my own
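For concreteness, a minimal sketch of how the melt idiom can be expressed with the DataFrame API (this is an illustrative helper, not an existing Spark API; it assumes all value columns share a compatible type):

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

// hypothetical helper: un-pivot valueCols into (variable, value) rows, keeping idCols fixed
def melt(df: DataFrame, idCols: Seq[String], valueCols: Seq[String]): DataFrame = {
  // one (variable, value) struct per value column, exploded into rows
  val kv: Column = explode(array(
    valueCols.map(c => struct(lit(c).alias("variable"), col(c).alias("value"))): _*))
  df.select(idCols.map(col) :+ kv.alias("kv"): _*)
    .select(idCols.map(col) :+ col("kv.variable") :+ col("kv.value"): _*)
}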
I second the interest in knowing the use case. I can imagine a case where
knowledge of the RDD key distribution would help local computations for
relatively few keys, but I would be interested to hear your motive.
Essentially, are you trying to achieve what would be an all-reduce type
operation in
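To make the question concrete, here is a rough sketch of the two patterns I have in mind (toy data; it assumes a live SparkContext sc, e.g. in spark-shell):

// (1) inspect the key distribution on the driver; fine when there are few distinct keys
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val keyCounts = pairs.countByKey()            // Map[String, Long]

// (2) an all-reduce-style step: reduce to the driver, then broadcast the result back
val total   = pairs.values.treeReduce(_ + _)
val totalBc = sc.broadcast(total)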
provement in
> runtime when the partitioning is even (happens when count is moved).
>
> Any pointers in figuring out this issue is much appreciated.
>
> Regards,
> Raghava.
>
>
>
>
> On Fri, Apr 22, 2016 at 7:40 PM, Mike Hynes <91m...@gmail.com> wrote:
>
>
h spark-submit) at a
> later stage also.
>
> Apart from introducing a dummy stage or running it from spark-shell, is
> there any other option to fix this?
>
> Regards,
> Raghava.
>
>
> On Mon, Apr 18, 2016 at 12:17 AM, Mike Hynes <91m...@gmail.com> wrote:
>
>
s,
> -Khaled
>
>
>
>
> On Mon, Apr 4, 2016 at 10:57 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> can you try:
>> spark.shuffle.reduceLocality.enabled=false
>>
>> On Mon, Apr 4, 2016 at 8:17 PM, Mike Hynes <91m...@gmail.com> wrote:
>>
f anyone else has any other ideas or experience, please let me know.
Mike
On 4/4/16, Koert Kuipers <ko...@tresata.com> wrote:
> we ran into similar issues and it seems related to the new memory
> management. can you try:
> spark.memory.useLegacyMode = true
>
> On Mo
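For anyone following along, a minimal sketch of how the two settings suggested above can be applied programmatically (whether either helps is workload-dependent; they can equally be passed as --conf flags to spark-submit):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-locality-test")
  .set("spark.shuffle.reduceLocality.enabled", "false")  // suggestion from this thread
  .set("spark.memory.useLegacyMode", "true")             // suggestion from this thread
val sc = new SparkContext(conf)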
[ CC'ing dev list since nearly identical questions have occurred on the
user list recently w/o resolution;
cf.:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html
just created a JIRA (
> https://issues.apache.org/jira/browse/SPARK-13109) to track this.
>
>
> On Mon, Feb 1, 2016 at 3:01 PM, Mike Hynes <91m...@gmail.com> wrote:
>
Hi devs,
I used to be able to do some local development from the upstream
master branch and run the publish-local command in an sbt shell to
publish the modified jars to the local ~/.ivy2 repository.
I relied on this behaviour, since I could write other local packages
that had my local
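For reference, the workflow is roughly the following (the version string below is hypothetical; use whatever version the branch actually builds):

// in an sbt shell in the Spark checkout:
//   > publish-local        (publishLocal in newer sbt) -- installs the jars under ~/.ivy2/local
// then, in the dependent project's build.sbt:
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0-SNAPSHOT"
// sbt's default resolvers include the local Ivy repository, so the locally published jars are picked up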
Hi Alexander, Joseph, Evan,
I just wanted to weigh in with an empirical result that we've had on a
standalone cluster with 16 nodes and 256 cores.
Typically we run optimization tasks with 256 partitions for 1
partition per core, and find that performance worsens with more
partitions than physical
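As a rough sketch of that setup (assumed numbers matching the cluster above, i.e. 256 cores; assumes a live SparkContext sc):

// aim for 1 partition per physical core rather than oversubscribing
val numPartitions = 256
val data = sc.parallelize(1 to (1 << 20)).repartition(numPartitions)
// equivalently, set spark.default.parallelism=256 in the SparkConf or spark-defaults.conf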
Having only 2 workers for 5 machines would be your problem: you
probably want 1 worker per physical machine, which entails running the
spark-daemon.sh script to start a worker on those machines.
The partitioning is agnostic to how many executors are available for
running the tasks, so you can't
Hello Devs,
This email concerns some timing results for a treeAggregate in
computing a (stochastic) gradient over an RDD of labelled points, as
is currently done in the MLlib optimization routine for SGD.
In SGD, the underlying RDD is downsampled by a fraction f \in (0,1],
and the subgradients
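To fix notation, here is a stripped-down sketch of the pattern (toy data and a squared-loss subgradient for illustration; this is not the MLlib code itself, and it assumes a live SparkContext sc):

val d = 3                                            // feature dimension (toy)
val w = Array(0.5, -0.25, 1.0)                       // current weights (toy)
val points = sc.parallelize(Seq(                     // (label, features) pairs
  (1.0, Array(1.0, 2.0, 3.0)),
  (0.0, Array(4.0, 5.0, 6.0)),
  (1.0, Array(7.0, 8.0, 9.0))))
val f = 0.5                                          // miniBatchFraction, f in (0, 1]

val (gradSum, n) = points
  .sample(withReplacement = false, fraction = f, seed = 42L)
  .treeAggregate((new Array[Double](d), 0L))(
    // seqOp: fold one (label, features) record into the running (gradient sum, count)
    { case ((g, cnt), (label, x)) =>
      val err = (0 until d).map(i => w(i) * x(i)).sum - label  // squared-loss residual
      for (i <- 0 until d) g(i) += err * x(i)                  // accumulate subgradient
      (g, cnt + 1)
    },
    // combOp: merge two partial (gradient sum, count) pairs
    { case ((g1, c1), (g2, c2)) =>
      for (i <- 0 until d) g1(i) += g2(i)
      (g1, c1 + c2)
    },
    depth = 2)
val gradient = if (n > 0) gradSum.map(_ / n) else gradSum      // averaged over the mini-batch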
for things like
> task serialization and other platform overheads. You've got to balance how
> much computation you want to do vs. the amount of time you want to spend
> waiting for the platform.
>
> - Evan
>
> On Sat, Sep 26, 2015 at 9:27 AM, Mike Hynes <91m...@gmail.com> wrote:
a slow link for the
> last portion this could really make a difference.
>
> On Sat, Sep 26, 2015 at 10:20 AM, Mike Hynes <91m...@gmail.com> wrote:
>
>> Hi Evan,
>>
>> (I just realized my initial email was a reply to the wrong thread; I'm
>> very sorry about
Just a thought; this has worked for me before on standalone client
with a similar OOM error in a driver thread. Try setting:
export SPARK_DAEMON_MEMORY=4G #or whatever size you can afford on your machine
in your environment/spark-env.sh before running spark-submit.
Mike
On 9/2/15, ankit tyagi
.
imran
On Tue, Jul 28, 2015 at 10:56 PM, Mike Hynes 91m...@gmail.com wrote:
Hi Imran,
Thanks for your reply. I have double-checked the code I ran to
generate an nxn matrix and nx1 vector for n = 2^27. There was
unfortunately a bug in it, where instead of having typed 134,217,728
for n = 2^27, I
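(A one-line guard against that class of typo, for anyone reproducing this:)

val n = 1L << 27            // 134217728, i.e. 2^27, rather than typing the literal by hand
require(n == 134217728L)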
to exhibit the same error.
On Tue, Jul 28, 2015 at 12:37 PM, Mike Hynes 91m...@gmail.com wrote:
Hello Devs,
I am investigating how matrix vector multiplication can scale for an
IndexedRowMatrix in mllib.linalg.distributed.
Currently, I am broadcasting the vector to be multiplied on the right.
The IndexedRowMatrix is stored across a cluster with up to 16 nodes,
each with 200 GB of memory.
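A minimal sketch of the broadcast-based product I mean (toy sizes; assumes a live SparkContext sc, e.g. in spark-shell):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

val A = new IndexedRowMatrix(sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 2.0)),
  IndexedRow(1L, Vectors.dense(3.0, 4.0)))))
val vBc = sc.broadcast(Array(1.0, -1.0))            // the (dense) vector to multiply by
val Av = A.rows.map { row =>                        // one dot product per indexed row
  val x = row.vector.toArray
  (row.index, x.indices.map(i => x(i) * vBc.value(i)).sum)
}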
Gentle bump on this topic; how to test the fault tolerance and previous
benchmark results are both things we are interested in as well.
Mike
-------- Original message --------
From: 牛兆捷 nzjem...@gmail.com
Date: 07-09-2015 04:19 (GMT-05:00)
To: dev@spark.apache.org,
transfer. It could
be that there is no (measurable) wait time b/c the next blocks are fetched
before they are needed. Shuffle writes occur in the normal task execution
thread, though, so we (try to) measure all of it.
On Sun, Jun 7, 2015 at 11:12 PM, Mike Hynes 91m...@gmail.com wrote:
Hi
Ahhh---forgive my typo: what I mean is,
(t2 - t1) = (t_ser + t_deser + t_exec)
is satisfied, empirically.
On 6/10/15, Mike Hynes 91m...@gmail.com wrote:
Hi Imran,
Thank you for your email.
In examining the condition (t2 - t1) >= (t_ser + t_deser + t_exec), I
have found it to be true, although I
behavior on the driver UI? (that running on port
4040), If you click on the stage id header you can sort the stages based
on
IDs.
Thanks
Best Regards
On Fri, Jun 5, 2015 at 10:21 PM, Mike Hynes 91m...@gmail.com wrote:
Hi folks,
When I look at the output logs for an iterative Spark program, I see
that the stage IDs are not consecutively numbered---that is, there
are gaps between stages and I might find log information about Stages
0, 1, 2, 5, but not 3 or 4.
As an example, the output from the Spark logs below
Hi,
This is just a thought from my experience setting up Spark to run on a
linux cluster. I found it a bit unusual that some parameters could be
specified as command line args to spark-submit, others as env variables,
and some in a configuration file. What I ended up doing was writing my own
bash
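For reference, the precedence as documented is: properties set directly on a SparkConf override --conf flags passed to spark-submit, which in turn override entries in spark-defaults.conf. A minimal sketch:

import org.apache.spark.{SparkConf, SparkContext}

// loads any spark.* Java system properties set by spark-submit / the environment
val conf = new SparkConf()
  .setAppName("config-precedence-sketch")
  .set("spark.executor.memory", "4g")   // an explicit set here wins over the flag/file value
val sc = new SparkContext(conf)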
show? are you
sure you don't have JRE 7 but JDK 6 installed?
On Tue, Feb 24, 2015 at 11:02 PM, Mike Hynes 91m...@gmail.com wrote:
./bin/compute-classpath.sh fails with error:
$ jar -tf
assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop1.0.4.jar
nonexistent/class/path
java.util.zip.ZipException: invalid CEN header (bad signature)
at java.util.zip.ZipFile.open(Native Method)
at