Hey,
I have quite a few jobs appearing in the web UI with the description
`run at ThreadPoolExecutor.java:1142`.
Are these generated internally by Spark SQL?
There are so many that they cause a `RejectedExecutionException` when the
thread pool runs out of space for them.
I would guess the opposite is true for highly iterative benchmarks (common in
graph processing and data science).
Spark has a pretty large overhead per iteration, and more optimisation and
planning only make this worse.
Sure, people have implemented things like Dijkstra's algorithm in Spark
(a problem
Sorry, that should be shortest path, and diameter of the graph.
I shouldn't write emails before I get my morning coffee...
On 06 Jul 2015, at 09:09, Jan-Paul Bultmann janpaulbultm...@me.com wrote:
> I would guess the opposite is true for highly iterative benchmarks (common in
> graph processing
Hey,
I noticed that my code spends hours in `generateTreeString` even though the
actual DAG/DataFrame execution takes seconds.
I’m running a query that grows exponentially in the number of iterations when
evaluated without caching,
but should be linear when caching previous results.
E.g.
iterations due to the problem :).
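To make that growth concrete, here is a plain-Java toy model (no Spark involved; the names and node counts are purely illustrative) of a query that references its previous step twice per iteration, which is enough to make the unshared logical plan double every round while a materialised variant stays linear:

```java
// Toy model of logical-plan size, NOT Spark code: each iteration builds a new
// plan node over two references to the previous iteration's plan (e.g. a
// self-join), so without materialisation the tree doubles every round.
public class PlanGrowth {

    // Plan-tree node count after n iterations without caching:
    // size(0) = 1, size(i) = 2 * size(i-1) + 1  =>  2^(n+1) - 1
    static long unsharedPlanSize(int n) {
        long size = 1;
        for (int i = 0; i < n; i++) {
            size = 2 * size + 1;
        }
        return size;
    }

    // With each iteration materialised, the next plan is one node over a
    // flat scan of the previous result, so growth is linear.
    static long cachedPlanSize(int n) {
        return n + 1;
    }

    public static void main(String[] args) {
        System.out.println(unsharedPlanSize(10)); // 2047
        System.out.println(cachedPlanSize(10));   // 11
    }
}
```

With plans that large, anything that walks the whole tree per job (such as `generateTreeString`) can easily dominate wall-clock time even when the actual execution is quick.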
As a workaround, you can break the iterations into smaller ones and trigger
them manually in sequence.
You mean `write`-ing them to disk after each iteration?
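A minimal sketch of that write-and-read-back approach (the `step` function, the paths, and the variable names are all hypothetical, and this needs Spark on the classpath, so it is only a sketch, not a runnable program):

```java
// Hypothetical iterative loop: materialise each iteration to Parquet and read
// it back, so the next step's plan starts from a flat scan instead of the
// accumulated lineage of all previous iterations.
DataFrame current = initial;                    // hypothetical starting DataFrame
for (int i = 0; i < numIterations; i++) {
    DataFrame next = step(current);             // hypothetical: one iteration of the query
    String path = "/tmp/iter-" + i;             // hypothetical scratch location
    next.write().parquet(path);                 // the logical plan is cut off here
    current = sqlContext.read().parquet(path);  // now just a Parquet scan
}
```

A similar lineage-truncation effect is available via `RDD.checkpoint`, though a checkpointed RDD is recomputed once more unless it is also cached.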
Thanks :), Jan
-----Original Message-----
From: Jan-Paul Bultmann [mailto:janpaulbultm...@me.com]
Hey,
Is there a way to do a distinct operation on each partition only?
My program generates quite a few duplicate tuples, and it would be nice to
remove some of these as an optimisation
without having to reshuffle the data.
I’ve also noticed that plans generated with a unique transformation have
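One way to get a per-partition distinct without a shuffle is `mapPartitions` with a local set; here is the core logic in plain Java (the Spark wiring in the comment is an assumption about how you would plug it in):

```java
import java.util.*;

// Dedupe within one partition only: no shuffle, but duplicates that live in
// different partitions survive. In Spark this would be plugged in roughly as
//   rdd.mapPartitions(iter -> dedupPartition(iter), true);  // hypothetical wiring
public class PartitionDistinct {

    static <T> Iterator<T> dedupPartition(Iterator<T> partition) {
        Set<T> seen = new LinkedHashSet<>();  // keeps first-seen order
        while (partition.hasNext()) {
            seen.add(partition.next());
        }
        return seen.iterator();
    }

    public static void main(String[] args) {
        List<Integer> out = new ArrayList<>();
        dedupPartition(Arrays.asList(1, 1, 2, 3, 2).iterator()).forEachRemaining(out::add);
        System.out.println(out); // [1, 2, 3]
    }
}
```

Since this holds one partition's distinct values in memory at a time, it only helps when each partition's distinct set fits on the executor.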
Hey,
What is the recommended way to create literal columns in Java?
Scala has the `lit` function from `org.apache.spark.sql.functions`.
Can it be called from Java as well?
Cheers, Jan
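For what it's worth, the Scala `functions` object compiles to a class with static forwarder methods, so `lit` should be directly callable from Java via a static import; a sketch (the DataFrame `df` and the column name are assumptions, and this needs Spark on the classpath):

```java
import static org.apache.spark.sql.functions.lit;  // same object as in Scala
import org.apache.spark.sql.Column;

// `lit` is exposed to Java as a static method on functions:
Column one = lit(1);                                      // literal integer column
DataFrame withConst = df.withColumn("constant", lit(42)); // df is hypothetical
```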
Hey,
Does somebody know what kinds of dependencies the new SQL operators produce?
I’m specifically interested in the relational join operation, as it seems
substantially more optimized.
The old join was narrow on two RDDs with the same partitioner.
Is the relational join narrow as well?
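One empirical way to answer that is to inspect the physical plan: if `explain` shows an `Exchange` (shuffle) above the join inputs, the join is wide on those inputs; if not, it reused the existing partitioning. A sketch (`a` and `b` are hypothetical DataFrames, and this needs Spark on the classpath):

```java
// Look for Exchange operators in the printed physical plan to see whether
// the relational join introduces a shuffle on these particular inputs.
DataFrame joined = a.join(b, a.col("id").equalTo(b.col("id")));  // hypothetical inputs
joined.explain(true);  // prints logical and physical plans to stdout
```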