Jobs with unknown origin.

2015-07-08 Thread Jan-Paul Bultmann
Hey, I have quite a few jobs appearing in the web UI with the description "run at ThreadPoolExecutor.java:1142". Are these generated by Spark SQL internally? There are so many that they cause a RejectedExecutionException when the thread pool runs out of space for them.
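Whatever is submitting these jobs, the exception itself is plain java.util.concurrent behaviour: a ThreadPoolExecutor with a bounded work queue rejects new submissions once pool and queue are both full. A minimal, Spark-independent sketch (class name, pool size, and queue capacity are invented for illustration):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RejectionDemo {
    public static void main(String[] args) {
        // One worker thread and a queue of capacity 1: the third
        // submission has nowhere to go and is rejected.
        ExecutorService pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1));
        try {
            for (int i = 0; i < 3; i++) {
                pool.submit(() -> {
                    try { Thread.sleep(1000); } catch (InterruptedException ignored) { }
                });
            }
        } catch (RejectedExecutionException e) {
            System.out.println("rejected: " + e);
        } finally {
            pool.shutdown();
        }
    }
}
```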

Re: Benchmark results between Flink and Spark

2015-07-06 Thread Jan-Paul Bultmann
I would guess the opposite is true for highly iterative benchmarks (common in graph processing and data science). Spark has a pretty large overhead per iteration; more optimisation and planning only make this worse. Sure, people have implemented things like Dijkstra's algorithm in Spark (a problem
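For context, the per-iteration cost shows up in any driver-side loop: every action schedules a fresh job, and the lineage keeps growing unless it is truncated. A rough sketch of such a loop (class name, data, checkpoint path, and loop bounds are all made up):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterationOverhead {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("iteration-overhead").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.setCheckpointDir("/tmp/spark-checkpoints"); // hypothetical path

        JavaRDD<Integer> data = sc.parallelize(Arrays.asList(1, 2, 3, 4));

        // Each pass through the loop launches a full job: scheduling,
        // task serialization, and result collection are paid every time.
        for (int i = 0; i < 10; i++) {
            data = data.map(x -> x + 1);
            if (i % 5 == 0) {
                data.checkpoint(); // occasionally truncate the growing lineage
            }
            data.count(); // action: one complete job per iteration
        }
        sc.stop();
    }
}
```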

Re: Benchmark results between Flink and Spark

2015-07-06 Thread Jan-Paul Bultmann
Sorry, that should be shortest path, and diameter of the graph. I shouldn't write emails before I get my morning coffee... On 06 Jul 2015, at 09:09, Jan-Paul Bultmann janpaulbultm...@me.com wrote: I would guess the opposite is true for highly iterative benchmarks (common in graph processing

generateTreeString causes huge performance problems on DataFrame persistence

2015-06-17 Thread Jan-Paul Bultmann
Hey, I noticed that my code spends hours in `generateTreeString` even though the actual DAG/DataFrame execution takes seconds. I’m running a query that grows exponentially in the number of iterations when evaluated without caching, but should be linear when caching previous results. E.g.
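To make the failure mode concrete, here is a hedged sketch of how a self-referential loop inflates the logical plan; the seed data and loop bound are invented, but the doubling is the point: after n rounds the plan tree holds on the order of 2^n nodes, so anything that walks or prints the tree blows up even when the data stays small.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ExponentialPlan {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("exponential-plan").master("local[2]").getOrCreate();

        Dataset<Row> df = spark.range(10).toDF("id"); // hypothetical seed data

        // Each round references the previous frame twice, so the logical
        // plan tree doubles per iteration.
        for (int i = 0; i < 5; i++) {
            df = df.union(df).distinct();
        }
        df.explain(); // walking/printing the plan is where the time goes
        spark.stop();
    }
}
```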

Re: generateTreeString causes huge performance problems on dataframe persistence

2015-06-17 Thread Jan-Paul Bultmann
iterations due to the problem :). As a workaround, you can break the iterations into smaller ones and trigger them manually in sequence. You mean `write`ing them to disk after each iteration? Thanks :), Jan -Original Message- From: Jan-Paul Bultmann [mailto:janpaulbultm...@me.com
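A sketch of that workaround, assuming Parquet as the on-disk format and an invented path scheme: writing a frame out and reading it back replaces the accumulated plan with a flat scan, so the next iteration starts fresh.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WriteBarrier {
    // Hypothetical helper: materializes a frame to Parquet and reads it
    // back, so the next iteration builds on a flat plan instead of the
    // whole accumulated tree.
    static Dataset<Row> writeBarrier(SparkSession spark, Dataset<Row> df, String path) {
        df.write().mode("overwrite").parquet(path);
        return spark.read().parquet(path);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("write-barrier").master("local[2]").getOrCreate();

        Dataset<Row> df = spark.range(10).toDF("id");
        for (int i = 0; i < 5; i++) {
            df = df.union(df).distinct();
            df = writeBarrier(spark, df, "/tmp/iter-" + i); // path is made up
        }
        df.show();
        spark.stop();
    }
}
```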

Soft distinct on data frames.

2015-05-28 Thread Jan-Paul Bultmann
Hey, is there a way to do a distinct operation on each partition only? My program generates quite a few duplicate tuples, and it would be nice to remove some of these as an optimisation without having to reshuffle the data. I’ve also noticed that plans generated with a unique transformation have
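One way to get this effect at the RDD level, rather than as a built-in DataFrame operation, is mapPartitions with a per-partition set: duplicates within a partition are dropped, duplicates that span partitions survive, and no shuffle happens. A sketch against the Spark 2.x Java API (in 1.x, FlatMapFunction returned an Iterable instead of an Iterator); data and partition counts are invented:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionLocalDistinct {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("partition-distinct").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> data = sc.parallelize(Arrays.asList(1, 1, 2, 2, 3, 3), 2);

        // Deduplicate inside each partition only: no shuffle is involved.
        JavaRDD<Integer> locallyDistinct = data.mapPartitions(it -> {
            Set<Integer> seen = new HashSet<>();
            it.forEachRemaining(seen::add);
            return seen.iterator();
        });

        System.out.println(locallyDistinct.collect());
        sc.stop();
    }
}
```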

Spark SQL, creating literal columns in Java.

2015-05-05 Thread Jan-Paul Bultmann
Hey, what is the recommended way to create literal columns in Java? Scala has the `lit` function from `org.apache.spark.sql.functions`. Should it be called from Java as well? Cheers, Jan
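For the record, `lit` is a static method on `org.apache.spark.sql.functions`, so it is callable from Java directly; a static import keeps the call site as short as in Scala. A minimal sketch (data and column names invented):

```java
import static org.apache.spark.sql.functions.lit;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LiteralColumn {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("literal-column").master("local[2]").getOrCreate();

        Dataset<Row> df = spark.range(3).toDF("id");

        // `lit` wraps a constant as a Column, usable from Java like any
        // other static method.
        Dataset<Row> withConst = df.withColumn("answer", lit(42));
        withConst.show();
        spark.stop();
    }
}
```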

Spark SQL transformations, narrow vs. wide

2014-04-03 Thread Jan-Paul Bultmann
Hey, Does somebody know the kinds of dependencies that the new SQL operators produce? I’m specifically interested in the relational join operation as it seems substantially more optimized. The old join was narrow on two RDDs with the same partitioner. Is the relational join narrow as well?
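At the RDD level, the narrow case the email refers to can be checked empirically with `toDebugString`: when both sides already share a partitioner, the join itself introduces no extra shuffle. A sketch (data and partition counts invented); whether the relational join preserves this is exactly the open question here.

```java
import java.util.Arrays;

import org.apache.spark.HashPartitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class CoPartitionedJoin {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("copartitioned-join").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        HashPartitioner p = new HashPartitioner(4);

        JavaPairRDD<Integer, String> left = sc.parallelizePairs(
                Arrays.asList(new Tuple2<>(1, "a"), new Tuple2<>(2, "b"))).partitionBy(p);
        JavaPairRDD<Integer, String> right = sc.parallelizePairs(
                Arrays.asList(new Tuple2<>(1, "x"), new Tuple2<>(2, "y"))).partitionBy(p);

        // Both sides share the same partitioner, so the join is narrow:
        // the only shuffles in the lineage are the two partitionBy calls.
        JavaPairRDD<Integer, Tuple2<String, String>> joined = left.join(right);
        System.out.println(joined.toDebugString());
        sc.stop();
    }
}
```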