Jobs with unknown origin.

2015-07-08 Thread Jan-Paul Bultmann
Hey, I have quite a few jobs appearing in the web UI with the description "run at ThreadPoolExecutor.java:1142". Are these generated by Spark SQL internally? There are so many that they cause a RejectedExecutionException when the thread pool runs out of space for them.
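Whatever is submitting these jobs, the exception itself is plain java.util.concurrent behaviour: a ThreadPoolExecutor with a bounded work queue rejects new submissions once pool and queue are both full. A minimal, Spark-independent sketch (class name, pool size, and queue capacity are invented for illustration):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RejectionDemo {
    public static void main(String[] args) {
        // One worker thread and a queue of capacity 1: the third
        // submission has nowhere to go and is rejected.
        ExecutorService pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1));
        try {
            for (int i = 0; i < 3; i++) {
                pool.submit(() -> {
                    try { Thread.sleep(1000); } catch (InterruptedException ignored) { }
                });
            }
        } catch (RejectedExecutionException e) {
            System.out.println("rejected: " + e);
        } finally {
            pool.shutdown();
        }
    }
}
```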

Re: Benchmark results between Flink and Spark

2015-07-06 Thread Jan-Paul Bultmann
I would guess the opposite is true for highly iterative benchmarks (common in graph processing and data science). Spark has a pretty large overhead per iteration; more optimisation and planning only make this worse. Sure, people have implemented things like Dijkstra's algorithm in Spark (a problem
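For context, the per-iteration cost shows up in any driver-side loop: every action schedules a fresh job, and the lineage keeps growing unless it is truncated. A rough sketch of such a loop (class name, data, checkpoint path, and loop bounds are all made up):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterationOverhead {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("iteration-overhead").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.setCheckpointDir("/tmp/spark-checkpoints"); // hypothetical path

        JavaRDD<Integer> data = sc.parallelize(Arrays.asList(1, 2, 3, 4));

        // Each pass through the loop launches a full job: scheduling,
        // task serialization, and result collection are paid every time.
        for (int i = 0; i < 10; i++) {
            data = data.map(x -> x + 1);
            if (i % 5 == 0) {
                data.checkpoint(); // occasionally truncate the growing lineage
            }
            data.count(); // action: one complete job per iteration
        }
        sc.stop();
    }
}
```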

Re: Benchmark results between Flink and Spark

2015-07-06 Thread Jan-Paul Bultmann
Sorry, that should be shortest path, and diameter of the graph. I shouldn't write emails before I get my morning coffee... On 06 Jul 2015, at 09:09, Jan-Paul Bultmann janpaulbultm...@me.com wrote: I would guess the opposite is true for highly iterative benchmarks (common in graph processing

generateTreeString causes huge performance problems on DataFrame persistence

2015-06-17 Thread Jan-Paul Bultmann
Hey, I noticed that my code spends hours in `generateTreeString` even though the actual DAG/DataFrame execution takes seconds. I’m running a query that grows exponentially in the number of iterations when evaluated without caching, but should be linear when caching previous results. E.g.
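To make the failure mode concrete, here is a hedged sketch of how a self-referential loop inflates the logical plan; the seed data and loop bound are invented, but the doubling is the point: after n rounds the plan tree holds on the order of 2^n nodes, so anything that walks or prints the tree blows up even when the data stays small.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ExponentialPlan {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("exponential-plan").master("local[2]").getOrCreate();

        Dataset<Row> df = spark.range(10).toDF("id"); // hypothetical seed data

        // Each round references the previous frame twice, so the logical
        // plan tree doubles per iteration.
        for (int i = 0; i < 5; i++) {
            df = df.union(df).distinct();
        }
        df.explain(); // walking/printing the plan is where the time goes
        spark.stop();
    }
}
```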

Re: generateTreeString causes huge performance problems on dataframe persistence

2015-06-17 Thread Jan-Paul Bultmann
iterations due to the problem :). As a workaround, you can break the iterations into smaller ones and trigger them manually in sequence. You mean `write`ing them to disk after each iteration? Thanks :), Jan -Original Message- From: Jan-Paul Bultmann [mailto:janpaulbultm...@me.com
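A sketch of that workaround, assuming Parquet as the on-disk format and an invented path scheme: writing a frame out and reading it back replaces the accumulated plan with a flat scan, so the next iteration starts fresh.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WriteBarrier {
    // Hypothetical helper: materializes a frame to Parquet and reads it
    // back, so the next iteration builds on a flat plan instead of the
    // whole accumulated tree.
    static Dataset<Row> writeBarrier(SparkSession spark, Dataset<Row> df, String path) {
        df.write().mode("overwrite").parquet(path);
        return spark.read().parquet(path);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("write-barrier").master("local[2]").getOrCreate();

        Dataset<Row> df = spark.range(10).toDF("id");
        for (int i = 0; i < 5; i++) {
            df = df.union(df).distinct();
            df = writeBarrier(spark, df, "/tmp/iter-" + i); // path is made up
        }
        df.show();
        spark.stop();
    }
}
```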

Soft distinct on data frames.

2015-05-28 Thread Jan-Paul Bultmann
Hey, is there a way to do a distinct operation on each partition only? My program generates quite a few duplicate tuples, and it would be nice to remove some of these as an optimisation without having to reshuffle the data. I’ve also noticed that plans generated with a unique transformation have
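One way to get this effect at the RDD level, rather than as a built-in DataFrame operation, is mapPartitions with a per-partition set: duplicates within a partition are dropped, duplicates that span partitions survive, and no shuffle happens. A sketch against the Spark 2.x Java API (in 1.x, FlatMapFunction returned an Iterable instead of an Iterator); data and partition counts are invented:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionLocalDistinct {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("partition-distinct").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> data = sc.parallelize(Arrays.asList(1, 1, 2, 2, 3, 3), 2);

        // Deduplicate inside each partition only: no shuffle is involved.
        JavaRDD<Integer> locallyDistinct = data.mapPartitions(it -> {
            Set<Integer> seen = new HashSet<>();
            it.forEachRemaining(seen::add);
            return seen.iterator();
        });

        System.out.println(locallyDistinct.collect());
        sc.stop();
    }
}
```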

Spark SQL, creating literal columns in Java.

2015-05-05 Thread Jan-Paul Bultmann
Hey, what is the recommended way to create literal columns in Java? Scala has the `lit` function from `org.apache.spark.sql.functions`. Should it be called from Java as well? Cheers, Jan
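For the record, `lit` is a static method on `org.apache.spark.sql.functions`, so it is callable from Java directly; a static import keeps the call site as short as in Scala. A minimal sketch (data and column names invented):

```java
import static org.apache.spark.sql.functions.lit;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LiteralColumn {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("literal-column").master("local[2]").getOrCreate();

        Dataset<Row> df = spark.range(3).toDF("id");

        // `lit` wraps a constant as a Column, usable from Java like any
        // other static method.
        Dataset<Row> withConst = df.withColumn("answer", lit(42));
        withConst.show();
        spark.stop();
    }
}
```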

Spark SQL transformations, narrow vs. wide

2014-04-03 Thread Jan-Paul Bultmann
Hey, Does somebody know the kinds of dependencies that the new SQL operators produce? I’m specifically interested in the relational join operation as it seems substantially more optimized. The old join was narrow on two RDDs with the same partitioner. Is the relational join narrow as well?
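At the RDD level, the narrow case the email refers to can be checked empirically with `toDebugString`: when both sides already share a partitioner, the join itself introduces no extra shuffle. A sketch (data and partition counts invented); whether the relational join preserves this is exactly the open question here.

```java
import java.util.Arrays;

import org.apache.spark.HashPartitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class CoPartitionedJoin {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("copartitioned-join").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        HashPartitioner p = new HashPartitioner(4);

        JavaPairRDD<Integer, String> left = sc.parallelizePairs(
                Arrays.asList(new Tuple2<>(1, "a"), new Tuple2<>(2, "b"))).partitionBy(p);
        JavaPairRDD<Integer, String> right = sc.parallelizePairs(
                Arrays.asList(new Tuple2<>(1, "x"), new Tuple2<>(2, "y"))).partitionBy(p);

        // Both sides share the same partitioner, so the join is narrow:
        // the only shuffles in the lineage are the two partitionBy calls.
        JavaPairRDD<Integer, Tuple2<String, String>> joined = left.join(right);
        System.out.println(joined.toDebugString());
        sc.stop();
    }
}
```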