I have used Spark for several years and realize from recent chatter on this
list that I don’t really understand how it uses memory.
Specifically, are spark.executor.memory and spark.driver.memory taken from the
JVM heap? When does Spark take memory from the JVM heap and when is it taken
from off-heap?
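A minimal sketch of where those settings land, assuming recent Spark defaults (values below are placeholders, not from this thread):

import org.apache.spark.sql.SparkSession

// spark.executor.memory and spark.driver.memory size the JVM heap of the executor
// and driver processes. Off-heap execution/storage memory is a separate pool and is
// only used when explicitly enabled.
val spark = SparkSession.builder()
  .appName("memory-config-sketch")
  .config("spark.executor.memory", "4g")          // executor JVM heap
  .config("spark.driver.memory", "2g")            // driver JVM heap; only honored if set before the driver JVM starts
  .config("spark.memory.offHeap.enabled", "true") // opt in to off-heap memory
  .config("spark.memory.offHeap.size", "1g")      // off-heap pool, outside the JVM heap
  .getOrCreate()

Anything not covered by these (netty buffers, Python workers, etc.) also lives outside the heap, which is why executor overhead settings exist on cluster managers.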
IntelliJ Scala works well when debugging master=local. Has anyone used it for
remote/cluster debugging? I’ve heard it is possible...
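For what it's worth, one common approach (a sketch, not something confirmed in this thread) is to have the executor JVMs listen for a debugger over JDWP and attach IntelliJ's Remote JVM Debug run configuration to that port; port 5005 is an arbitrary choice:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Executors launched after the context is created pick up extraJavaOptions.
// While debugging it helps to run a single executor so the port doesn't collide.
val conf = new SparkConf()
  .setAppName("remote-debug-sketch")
  .set("spark.executor.extraJavaOptions",
       "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005")

val spark = SparkSession.builder().config(conf).getOrCreate()

For a driver running in cluster mode, the equivalent spark.driver.extraJavaOptions has to be supplied at submit time, since the driver JVM is already running by the time application code executes.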
From: Luiz Camargo
Reply: Luiz Camargo
Date: April 7, 2020 at 10:26:35 AM
To: Dennis Suhari
Cc: yeikel valdes , zahidr1...@gmail.com
, user@spark.apache.org
today's question.
From: Matt Cheah
Reply: Matt Cheah
Date: July 1, 2019 at 5:14:05 PM
To: Pat Ferrel ,
user@spark.apache.org
Subject: Re: k8s orchestrating Spark service
> We’d like to deploy Spark Workers/Executors and Master (whatever master
is easiest to talk about since we really do
run our Driver and Executors considering that the Driver is part of the
Server process?
Maybe we are talking past each other with some mistaken assumptions (on my
part perhaps).
From: Pat Ferrel
Reply: Pat Ferrel
Date: July 1, 2019 at 4:57:20 PM
To: user@spark.apache.org , Matt
Cheah
anyone have something they like?
From: Matt Cheah
Reply: Matt Cheah
Date: July 1, 2019 at 4:45:55 PM
To: Pat Ferrel ,
user@spark.apache.org
Subject: Re: k8s orchestrating Spark service
Sorry, I don’t quite follow – why use the Spark standalone cluster as an
in-between layer when one can just
of services including Spark. The rest work,
we are asking if anyone has seen a good starting point for adding Spark as
a k8s managed service.
From: Matt Cheah
Reply: Matt Cheah
Date: July 1, 2019 at 3:26:20 PM
To: Pat Ferrel ,
user@spark.apache.org
Subject: Re: k8s orchestrating Spark service
We're trying to set up a system that includes Spark. The rest of the
services have good Docker containers and Helm charts to start from.
Spark on the other hand is proving difficult. We forked a container and
have tried to create our own chart but are having several problems with
this.
So back to
It is always dangerous to run a NEWER version of code on an OLDER cluster.
The danger increases with the semver change and this one is not just a
build #. In other words, 2.4 is considered to be a fairly major change from
2.3. Not much else can be said.
From: Nicolas Paris
Reply:
In order to create an application that executes code on Spark we have a
long-lived process. It periodically runs jobs programmatically on a Spark
cluster, meaning it does not use spark-submit. The jobs it executes have
varying memory requirements, so we want to have the Spark Driver run in
the
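A sketch of one way such a long-lived service can launch per-job drivers programmatically; note that SparkLauncher is essentially a programmatic front end to spark-submit, so it may or may not fit the constraint above. Paths, class names, and sizes are placeholders:

import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

// Launch one job with its own driver and memory settings, without shelling out
// to the spark-submit script by hand.
val handle: SparkAppHandle = new SparkLauncher()
  .setAppResource("/path/to/job-assembly.jar")   // placeholder jar
  .setMainClass("com.example.JobMain")            // placeholder main class
  .setMaster("spark://master-address:7077")       // placeholder master URL
  .setDeployMode("cluster")                       // driver runs on the cluster, not inside the service
  .setConf("spark.driver.memory", "4g")           // per-job driver heap
  .setConf("spark.executor.memory", "8g")         // per-job executor heap
  .startApplication()

// The handle reports state transitions (SUBMITTED, RUNNING, FINISHED, ...) so the
// service can track each job it kicks off.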
Streams have no end until watermarked or closed. Joins need bounded
datasets, et voila. Something tells me you should consider the streaming
nature of your data and whether your joins need to use increments/snippets
of infinite streams or to re-join the entire contents of the streams
accumulated
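A sketch of the bounded, incremental flavor of join being described, using Structured Streaming watermarks plus an event-time range condition (source and column names are placeholders, not from this thread):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().appName("stream-join-sketch").getOrCreate()

// Two unbounded sources, each with a watermark so Spark can discard old state.
val impressions = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "impressions")
  .load()
  .selectExpr("CAST(value AS STRING) AS adId", "timestamp AS impressionTime")
  .withWatermark("impressionTime", "2 hours")

val clicks = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "clicks")
  .load()
  .selectExpr("CAST(value AS STRING) AS clickAdId", "timestamp AS clickTime")
  .withWatermark("clickTime", "3 hours")

// The time-range condition joins increments of the two streams instead of their
// entire accumulated history.
val joined = impressions.join(
  clicks,
  expr("clickAdId = adId AND clickTime >= impressionTime AND clickTime <= impressionTime + interval 1 hour"))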
@Riccardo
Spark does not do the DL learning part of the pipeline (afaik) so it is
limited to data ingestion and transforms (ETL). It therefore is optional
and other ETL options might be better for you.
Most of the technologies @Gourav mentions have their own scaling based on
their own compute
Thanks, are you referring to
https://github.com/spark-jobserver/spark-jobserver or the undocumented REST
job server included in Spark?
From: Jason Nerothin
Reply: Jason Nerothin
Date: March 28, 2019 at 2:53:05 PM
To: Pat Ferrel
Cc: Felix Cheung
, Marcelo
Vanzin , user
Subject: Re
;-)
Great idea. Can you suggest a project?
Apache PredictionIO uses spark-submit (very ugly) and Apache Mahout only
launches trivially in test apps since most uses are as a lib.
From: Felix Cheung
Reply: Felix Cheung
Date: March 28, 2019 at 9:42:31 AM
To: Pat Ferrel , Marcelo
Vanzin
Cc
e mode you might be able to
use this:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/Client.scala
Lastly, you can always check where Spark processes run by executing ps on
the machine, i.e. `ps aux | grep java`.
Best,
Jianneng
*From:* Pat Ferrel
*Dat
Reply: Marcelo Vanzin
Date: March 26, 2019 at 1:59:36 PM
To: Pat Ferrel
Cc: user
Subject: Re: spark.submit.deployMode: cluster
If you're not using spark-submit, then that option does nothing.
If by "context creation API" you mean "new SparkContext()" or an
equ
I have a server that starts a Spark job using the context creation API. It
DOES NOT use spark-submit.
I set spark.submit.deployMode = “cluster”
In the GUI I see 2 workers with 2 executors. The link for running
application “name” goes back to my server, the machine that launched the
job.
This is
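A sketch of the setup being described (names are placeholders); the point made elsewhere in the thread is that spark.submit.deployMode set this way is ignored, because only spark-submit honors it:

import org.apache.spark.{SparkConf, SparkContext}

// When the context is created directly like this, the driver is the JVM running
// this code; setting deployMode here does not move the driver onto the cluster.
val conf = new SparkConf()
  .setAppName("my-app")
  .setMaster("spark://master-address:7077")
  .set("spark.submit.deployMode", "cluster") // no effect without spark-submit

val sc = new SparkContext(conf)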
:07 AM
To: Pat Ferrel
Cc: Akhil Das , user
Subject: Re: Where does the Driver run?
Hi Pat,
Indeed, I don't think that it's possible to use cluster mode w/o
spark-submit. All the docs I see appear to always describe needing to use
spark-submit for cluster mode -- it's not even compatible
only guessing at that).
Further, if we don’t use spark-submit we can’t use deployMode = cluster ???
From: Akhil Das
Reply: Akhil Das
Date: March 24, 2019 at 7:45:07 PM
To: Pat Ferrel
Cc: user
Subject: Re: Where does the Driver run?
There's also a driver ui (usually available on port 4040
60g
BTW I would expect this to create one Executor, one Driver, and the Master
on 2 Workers.
From: Andrew Melo
Reply: Andrew Melo
Date: March 24, 2019 at 12:46:35 PM
To: Pat Ferrel
Cc: Akhil Das , user
Subject: Re: Where does the Driver run?
Hi Pat,
On Sun, Mar 24, 2019 at 1:03 PM Pat Ferrel wrote:
> Thanks, I have seen this many times in my research. Paraphrasi
: Akhil Das
Date: March 23, 2019 at 9:26:50 PM
To: Pat Ferrel
Cc: user
Subject: Re: Where does the Driver run?
If you are starting your "my-app" on your local machine, that's where the
driver is running.
Hope this helps.
<https://spark.apache.org/docs/l
I have researched this for a significant amount of time and find answers
that seem to be for a slightly different question than mine.
The Spark 2.3.3 cluster is running fine. I see the GUI on
“http://master-address:8080”, there are 2 idle workers, as configured.
I have a Scala application that
Executor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurren
From: Pat Ferrel
Reply: Pat Ferrel
Date: February 12, 2019 at 5:40:41 PM
To: user@spark.apache.org
Subject: Spark with Kubernetes connecting to pod id, not address
We have a k8s deployment of several services including Apache Spark. All
services seem to be operational. Our application
I have a task that would benefit from more cores but the standalone scheduler
launches it when only a subset are available. I’d rather use all cluster cores
on this task.
Is there a way to tell the scheduler to finish everything before allocating
resources to a task? Like "finish everything
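Not from this thread, but the closest standalone-mode knobs I know of make the application wait until a given fraction of its requested resources have registered before any scheduling starts; this applies at application start, not per task, and the values below are placeholders:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("wait-for-cores-sketch")
  .set("spark.cores.max", "16")                                      // total cores the app may claim
  .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")         // wait until 100% of resources register
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "120s")  // but stop waiting after this long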
I have a large Map that is assembled in the driver and broadcast to each node.
My question is how best to allocate memory for this. The Driver has to have
enough memory for the Maps, but only one copy is serialized to each node. What
type of memory should I size to match the Maps? Is the
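A sketch of the memory accounting with placeholder data: the Map is built on the driver heap, so spark.driver.memory has to cover it (plus the serialized copy made when broadcasting), and each executor then keeps one deserialized copy in its own JVM heap:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch"))

val bigMap: Map[String, String] = Map("k1" -> "v1") // placeholder for the real Map
val bcast = sc.broadcast(bigMap)                    // one copy shipped per executor, not per task

val ids = sc.parallelize(Seq("k1", "k2"))
val resolved = ids.map(id => bcast.value.getOrElse(id, "missing")).collect()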
I’m getting *huge* execution times on a moderate sized dataset during the
RDD.isEmpty. Everything in the calculation is fast except an RDD.isEmpty
calculation. I’m using Spark 1.5.1 and from researching I would expect this
calculation to be linearly proportional to the number of partitions as a
wrote:
Are you sure it's isEmpty? and not an upstream stage? isEmpty is
definitely the action here. It doesn't make sense that take(1) is so
much faster since it's the "same thing".
On Wed, Dec 9, 2015 at 5:11 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> Sure, I thought th
<so...@cloudera.com> wrote:
>
> On Wed, Dec 9, 2015 at 7:49 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>> The “Any” is required by the code it is being passed to, which is the
>> Elasticsearch Spark index writing code. The values are actually RDD[(String,
>> Map[String,
river is trivial. Or maybe adapt
some version of the implementation of take() to be an optimized, smarter
isEmpty(). Neither seemed worth the overhead at the time, but this could be a
case against that, if it turns out somehow to be serialization time.
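A sketch of the kind of "smarter isEmpty" being floated here, which ships one Boolean per partition instead of an element, in case serializing the first element is what's slow (illustrative only, not the RDD API):

import org.apache.spark.rdd.RDD

// Ask each partition only whether it has at least one element; a Boolean per
// partition is tiny to serialize back to the driver.
def probablyCheaperIsEmpty[T](rdd: RDD[T]): Boolean = {
  val nonEmptyFlags =
    rdd.mapPartitions(it => Iterator.single(it.hasNext), preservesPartitioning = true)
  !nonEmptyFlags.collect().exists(identity)
}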
On Wed, Dec 9, 2015 at 5:55 PM, Pat Ferr
<so...@cloudera.com> wrote:
It should at best collect 1 item to the driver. This means evaluating
at least 1 element of 1 partition. I can imagine pathological cases
where that's slow, but, do you have any more info? how slow is slow
and what is slow?
On Wed, Dec 9, 2015 at 4:41 PM, Pat
Our project is having a hard time following what we are supposed to do to
migrate this function from Spark 1.2 to 1.3.
/**
 * Dump matrix as computed Mahout's DRM into specified (HD)FS path
 * @param path
 */
def dfsWrite(path: String) = {
  val ktag = implicitly[ClassTag[K]]
Using Spark streaming to create a large volume of small nano-batch input files,
~4k per file, thousands of ‘part-x’ files. When reading the nano-batch
files and doing a distributed calculation my tasks run only on the machine
where the job was launched. I’m launching in “yarn-client” mode. The
(rowIDColumn) - tokens(columnIDPosition)
}
interactions.cache()
On Apr 23, 2015, at 10:14 AM, Jeetendra Gangele gangele...@gmail.com wrote:
Will you be able to paste code here?
On 23 April 2015 at 22:21, Pat Ferrel p...@occamsmachete.com
wrote:
Using Spark
Argh, I looked and there really isn’t that much data yet. There will be
thousands but starting small.
I bet this is just a total data size not requiring all workers thing—sorry,
nevermind.
On Apr 23, 2015, at 10:30 AM, Pat Ferrel p...@occamsmachete.com wrote:
They are in HDFS so available
, 2015, at 10:23 AM, Sean Owen so...@cloudera.com wrote:
Where are the file splits? meaning is it possible they were also
(only) available on one node and that was also your driver?
On Thu, Apr 23, 2015 at 1:21 PM, Pat Ferrel p...@occamsmachete.com wrote:
Sure
var columns = mc.textFile(source
They are in HDFS so available on all workers
On Apr 23, 2015, at 10:29 AM, Pat Ferrel p...@occamsmachete.com wrote:
Physically? Not sure, they were written using the nano-batch rdds in a
streaming job that is in a separate driver. The job is a Kafka consumer.
Would that effect all derived
Running on Spark 1.1.1, Hadoop 2.4 with YARN, on a dedicated AWS cluster (non-EMR).
Is this in our code or config? I’ve never run into a TaskResultLost, not sure
what can cause that.
TaskResultLost (result lost from block manager)
collect at
/15e72f7bc22337cf6653
Michael
On Sat, Mar 14, 2015 at 10:38 AM, Pat Ferrel p...@occamsmachete.com
wrote:
It’s a long story but there are many dirs with smallish part- files in them
so we create a list of the individual files as input to
sparkContext.textFile
... ::
Nil).flatMap(new ReadLinesSafe(_))
You can also build up the list of files by running a Spark job:
https://gist.github.com/marmbrus/15e72f7bc22337cf6653
Michael
On Sat, Mar 14, 2015 at 10:38 AM, Pat Ferrel p...@occamsmachete.com
many tasks (it's not a JVM per task).
On Sat, Mar 14, 2015 at 12:51 PM, Pat Ferrel p...@occamsmachete.com
wrote:
Any advice on dealing with a large number of separate input files?
On Mar 13, 2015, at 4:06 PM, Pat Ferrel p...@occamsmachete.com
Any advice on dealing with a large number of separate input files?
On Mar 13, 2015, at 4:06 PM, Pat Ferrel p...@occamsmachete.com wrote:
We have many text files that we need to read in parallel. We can create a comma
delimited list of files to pass in to sparkContext.textFile(fileList
We have many text files that we need to read in parallel. We can create a comma
delimited list of files to pass in to sparkContext.textFile(fileList). The list
can get very large (maybe 1) and is all on hdfs.
The question is: what is the most performant way to read them? Should they be
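A sketch of the comma-separated-list approach (paths and the partition count are placeholders): textFile accepts a comma-separated list of paths, so one option is to join the file names into a single string and coalesce afterwards so thousands of tiny part- files don't become thousands of tiny partitions:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("many-files-sketch"))

val fileList: Seq[String] = Seq(
  "hdfs://namenode/data/dir1/part-00000",
  "hdfs://namenode/data/dir1/part-00001") // ... however the full list is assembled

val lines = sc.textFile(fileList.mkString(","))
val compacted = lines.coalesce(32) // placeholder partition count for downstream work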
Sab, not sure what you require for the similarity metric or your use case but
you can also look at spark-rowsimilarity or spark-itemsimilarity (column-wise)
here http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
:
Yes. I ran into this problem with the Mahout snapshot and Spark 1.2.0, not really
trying to figure out why that was a problem, since there were already too many
moving parts in my app. Obviously there is a classpath issue somewhere.
/Erlend
On 27 Feb 2015 22:30, Pat Ferrel p...@occamsmachete.com
where the serializer is and
where the user class is.
At the top you said Pat that you didn't try this, but why not?
On Fri, Feb 27, 2015 at 10:11 PM, Pat Ferrel p...@occamsmachete.com wrote:
I’ll try to find a Jira for it. I hope a fix is in 1.3
On Feb 27, 2015, at 1:59 PM, Pat Ferrel p
25, 2015 at 5:17 PM, Pat Ferrel p...@occamsmachete.com wrote:
The root Spark pom has guava set at a certain version number. It’s very hard
to read the shading xml. Someone suggested that I try using
userClassPathFirst but that sounds too heavy-handed since I don’t really
care which version
I don’t use spark-submit; I have a standalone app.
So I guess you want me to add that key/value to the conf in my code and make
sure it exists on workers.
On Feb 27, 2015, at 1:47 PM, Marcelo Vanzin van...@cloudera.com wrote:
On Fri, Feb 27, 2015 at 1:42 PM, Pat Ferrel p...@occamsmachete.com
erl...@hamnaberg.net wrote:
Hi.
I have had a similar issue. I had to pull the JavaSerializer source into my own
project, just so I got the classloading of this class under control.
This must be a class loader issue with spark.
-E
On Fri, Feb 27, 2015 at 8:52 PM, Pat Ferrel p...@occamsmachete.com
Thanks! That worked.
On Feb 27, 2015, at 1:50 PM, Pat Ferrel p...@occamsmachete.com wrote:
I don’t use spark-submit; I have a standalone app.
So I guess you want me to add that key/value to the conf in my code and make
sure it exists on workers.
On Feb 27, 2015, at 1:47 PM, Marcelo Vanzin van
I’ll try to find a Jira for it. I hope a fix is in 1.3
On Feb 27, 2015, at 1:59 PM, Pat Ferrel p...@occamsmachete.com wrote:
Thanks! That worked.
On Feb 27, 2015, at 1:50 PM, Pat Ferrel p...@occamsmachete.com wrote:
I don’t use spark-submit; I have a standalone app.
So I guess you want me
I changed it in the spark master conf, which is also the only worker. I added a
path to the jar that has guava in it. Still can’t find the class.
Trying Erland’s idea next.
On Feb 27, 2015, at 1:35 PM, Marcelo Vanzin van...@cloudera.com wrote:
On Fri, Feb 27, 2015 at 1:30 PM, Pat Ferrel p
Getting an error that confuses me. Running a largish app on a standalone
cluster on my laptop. The app uses a guava HashBiMap as a broadcast value. With
Spark 1.1.0 I simply registered the class and its serializer with kryo like
this:
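The snippet cuts off before the code; a sketch of what such a Kryo registration typically looks like (not the poster's actual code, and the serializer chosen here is only a stand-in):

import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.serializers.JavaSerializer
import com.google.common.collect.HashBiMap
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Register the Guava class (and a serializer for it) with Kryo.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[HashBiMap[String, String]], new JavaSerializer())
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[MyRegistrator].getName)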
algebra first?
On Jan 17, 2015, at 6:27 PM, Reza Zadeh r...@databricks.com wrote:
We're focused on providing block matrices, which makes transposition simple:
https://issues.apache.org/jira/browse/SPARK-3434
On Sat, Jan 17, 2015 at 3:25 PM, Pat Ferrel
these together.
On Jan 17, 2015, at 9:37 AM, Andrew Musselman andrew.mussel...@gmail.com
wrote:
Excellent, thanks Pat.
On Jan 17, 2015, at 9:27 AM, Pat Ferrel p...@occamsmachete.com
wrote:
Mahout’s Spark implementation of rowsimilarity is in the Scala
rowSimilarities.
On Sat, Jan 17, 2015 at 10:45 AM, Pat Ferrel p...@occamsmachete.com
wrote:
BTW it looks like row and column similarities (cosine based) are coming to
MLlib through DIMSUM. Andrew said rowSimilarity doesn’t seem to be in the
master yet. Does anyone know
Mahout’s Spark implementation of rowsimilarity is in the Scala
SimilarityAnalysis class. It actually does either row or column similarity but
only supports LLR at present. It does [AA’] for columns or [A’A] for rows first
then calculates the distance (LLR) for non-zero elements. This is a major
Actually the spark-itemsimilarity job and related code in the Spark module of
Mahout creates all-pairs similarity too. It’s designed to be used with a search
engine, which provides the query part of the recommender. Integrate the two and
you have a near realtime scalable item-based/cooccurrence
I have a job that searches for input recursively and creates a string of
pathnames to treat as one input.
The files are part-x files and they are fairly small. The job seems to take
a long time to complete considering the size of the total data (150m) and only
runs on the master machine.
I see the default and max cores settings but these seem to control total cores
per cluster.
My cobbled-together home cluster needs the Master to not use all its cores or
it may lock up (it does other things). Is there a way to control max cores used
for a particular cluster machine in
Looks like I can do this by not using start-all.sh but starting each worker
separately passing in a '--cores n' to the master? No config/env way?
On Nov 18, 2014, at 3:14 PM, Pat Ferrel p...@occamsmachete.com wrote:
I see the default and max cores settings but these seem to control total cores
This seems to work only on a ‘worker’ not the master? So I’m back to having no
way to control cores on the master?
On Nov 18, 2014, at 3:24 PM, Pat Ferrel p...@occamsmachete.com wrote:
Looks like I can do this by not using start-all.sh but starting each worker
separately passing in a '--cores
OK, hacking start-slave.sh did it.
On Nov 18, 2014, at 4:12 PM, Pat Ferrel p...@occamsmachete.com wrote:
This seems to work only on a ‘worker’ not the master? So I’m back to having no
way to control cores on the master?
On Nov 18, 2014, at 3:24 PM, Pat Ferrel p...@occamsmachete.com wrote
maven cache is laid out differently but it does work on Linux and BSD/mac.
Still looks like a hack to me.
On Oct 21, 2014, at 1:28 PM, Pat Ferrel p...@occamsmachete.com wrote:
Doesn’t this seem like a dangerous, error-prone hack? It will build different
bits on different machines. It doesn’t
Trying to upgrade from Spark 1.0.1 to 1.1.0. Can’t imagine the upgrade is the
problem but anyway...
I get a NoClassDefFoundError for RandomGenerator when running a driver from the
CLI. But only when using a named master, even a standalone master. If I run
using master = local[4] the job
I’ve read several discussions of the error here and so have wiped all cluster
machines and copied the master’s spark build to the rest of the cluster. I’ve
built my job on the master using the correct Spark version as a dependency and
even build that version of Spark. I still get the
the same version of Apache Spark on each node of the cluster? And I am
not only asking about current project version (1.0.0, 1.1.0 etc.) but also
about package type (hadoop 1.x, hadoop 2.x).
On Fri, Oct 17, 2014 at 12:35 AM, Pat Ferrel p...@occamsmachete.com wrote:
I’ve read several discussions
I’ve created a CLI driver for a Spark version of a Mahout job called item
similarity with several tests that all work fine on local[4] Spark standalone.
The code even reads and writes to clustered HDFS. But switching to clustered
Spark has a problem that seems tied to a broadcast and/or
, 2014 at 9:13 AM, Pat Ferrel pat.fer...@gmail.com wrote:
Doesn’t using an HDFS path pattern then restrict the URI to an HDFS URI? Since
Spark supports several FS schemes, I’m unclear about how much to assume about
using the Hadoop filesystem APIs and conventions. Concretely, if I pass a
pattern
, concatenate their paths like that and
pass the single string to
textFile().
Nick
On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel pat.fer...@gmail.com wrote:
sc.textFile(URI) supports reading multiple files in parallel but only with a
wildcard. I need to walk a dir tree, match a regex
Warning, noob question:
The sc.textFile(URI) method seems to support reading from files in parallel but
you have to supply some wildcard URI, which greatly limits how the storage is
structured. Is there a simple way to pass in a URI list or is it an exercise
left for the student?
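A sketch of one way to build that list by walking the tree with the Hadoop FileSystem API and handing textFile a comma-separated string, which is effectively the "URI list" it accepts (paths and the regex are placeholders):

import org.apache.hadoop.fs.Path
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ArrayBuffer

val sc = new SparkContext(new SparkConf().setAppName("uri-list-sketch"))

// Walk the directory tree recursively and keep the file names that match.
val root = new Path("hdfs://namenode/data")
val fs = root.getFileSystem(sc.hadoopConfiguration)
val iter = fs.listFiles(root, true) // true = recursive

val matching = ArrayBuffer.empty[String]
while (iter.hasNext) {
  val status = iter.next()
  if (status.getPath.getName.matches("part-\\d+")) // placeholder regex
    matching += status.getPath.toString
}

val lines = sc.textFile(matching.mkString(","))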