Re: CBO not working for Parquet Files

2018-09-06 Thread emlyn
I must be missing something, as it seems that partitioned Parquet files would be a common use case, and if this is a bug in Spark I would have expected it to have been picked up sooner. Has anybody managed to get CBO working with partitioned Parquet files? Is this a known issue? Thanks, Emlyn
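For reference, the cost-based optimizer has to be switched on explicitly and only works when table and column statistics are available; a minimal sketch, assuming Spark 2.2+ and a partitioned Parquet table registered in the metastore as "events" (the table and column names are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("cbo-example")
      .config("spark.sql.cbo.enabled", "true")   // turn on the cost-based optimizer
      .enableHiveSupport()                       // statistics are persisted in the metastore
      .getOrCreate()

    // Collect table-level and column-level statistics for the partitioned table.
    spark.sql("ANALYZE TABLE events COMPUTE STATISTICS")
    spark.sql("ANALYZE TABLE events COMPUTE STATISTICS FOR COLUMNS user_id, event_date")

    // Inspect what the optimizer sees; if no sizeInBytes/rowCount show up here,
    // the CBO has no statistics to work with.
    spark.sql("DESCRIBE EXTENDED events").show(truncate = false)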

Re: Concurrent Spark jobs

2016-03-31 Thread emlyn
In case anyone else has the same problem and finds this: in my case it was fixed by increasing spark.sql.broadcastTimeout (I used 9000).
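For anyone hitting the same timeout, a minimal sketch of where that setting goes, assuming the sqlContext available in spark-shell on Spark 1.x (the value is in seconds; 9000 is simply the figure quoted above):

    // Spark 1.x: raise the timeout for broadcast joins (default is 300 seconds).
    sqlContext.setConf("spark.sql.broadcastTimeout", "9000")

    // Spark 2.x+ equivalent:
    // spark.conf.set("spark.sql.broadcastTimeout", "9000")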

Re: Concurrent Spark jobs

2016-01-25 Thread emlyn
Jean wrote: > Have you considered using pools? > http://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools > > I haven't tried that myself, but it looks like the pool setting is applied > per thread, so it should be possible to configure the fair scheduler so > that more than one
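A minimal sketch of the per-thread pool mechanism described in that link (the pool name and job body are hypothetical; pool weights would normally come from a fairscheduler.xml file):

    import org.apache.spark.{SparkConf, SparkContext}

    // Enable the fair scheduler when the context is created.
    val conf = new SparkConf()
      .setAppName("pools-example")
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // The pool is a thread-local property, so each thread can submit its jobs
    // into its own pool and they are scheduled fairly against each other.
    val worker = new Thread(new Runnable {
      def run(): Unit = {
        sc.setLocalProperty("spark.scheduler.pool", "etl_pool")
        sc.parallelize(1 to 1000).count()           // hypothetical job
        sc.setLocalProperty("spark.scheduler.pool", null)  // clear the pool for this thread
      }
    })
    worker.start()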

Re: Concurrent Spark jobs

2016-01-21 Thread emlyn
Thanks for the responses (not sure why they aren't showing up on the list). Michael wrote: > The JDBC wrapper for Redshift should allow you to follow these > instructions. Let me know if you run into any more issues.

Spark 1.6 ignoreNulls in first/last aggregate functions

2016-01-21 Thread emlyn
As I understand it, Spark 1.6 changes the behaviour of the first and last aggregate functions to take nulls into account (where they were ignored in 1.5). From SQL you can use "IGNORE NULLS" to get the old behaviour back. How do I ignore nulls
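For reference, the DataFrame functions later gained an explicit ignoreNulls flag; a minimal sketch, assuming Spark 2.0+ (the DataFrame df and the column names are hypothetical):

    import org.apache.spark.sql.functions.{col, first, last}

    // Reproduce the pre-1.6 behaviour of skipping nulls in the aggregates.
    val agg = df.groupBy("user_id").agg(
      first(col("value"), ignoreNulls = true).as("first_value"),
      last(col("value"), ignoreNulls = true).as("last_value")
    )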

Re: Spark 1.6 ignoreNulls in first/last aggregate functions

2016-01-21 Thread emlyn
It turns out I can't use a user-defined aggregate function, as they are not supported in Window operations. Surely there must be some way to do a last_value with ignoreNulls enabled in Spark 1.6? Any ideas for workarounds?
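One possible workaround sketch, though it is a later-version answer rather than a 1.6 fix: on Spark 2.1+ the built-in last takes an ignoreNulls flag and can be used over a window, giving the last non-null value seen so far per row (the DataFrame and column names are hypothetical):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, last}

    // Carry the last non-null "value" forward within each partition, ordered by "ts".
    val w = Window.partitionBy("user_id")
      .orderBy("ts")
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)

    val filled = df.withColumn("last_value", last(col("value"), ignoreNulls = true).over(w))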

Concurrent Spark jobs

2016-01-19 Thread emlyn
We have a Spark application that runs a number of ETL jobs, writing the outputs to Redshift (using databricks/spark-redshift). This is triggered by calling DataFrame.write.save on the different DataFrames one after another. I noticed that during the Redshift load while the output of one job is
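For context, independent actions submitted from separate threads become independent Spark jobs, so one common pattern is to kick off the saves concurrently rather than one after another; a minimal sketch under those assumptions (the DataFrames, table names, JDBC URL and S3 staging directory are all hypothetical, and whether the loads actually overlap depends on cluster resources):

    import org.apache.spark.sql.DataFrame
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // dfA and dfB are DataFrames already built by earlier ETL steps (hypothetical).
    val outputs: Seq[(DataFrame, String)] = Seq(dfA -> "schema.table_a", dfB -> "schema.table_b")
    val jdbcUrl   = "jdbc:redshift://example:5439/db?user=u&password=p"  // hypothetical
    val s3TempDir = "s3n://bucket/tmp/"                                  // hypothetical

    // Each save is an independent action; launching them from separate threads lets
    // the scheduler overlap the Spark stages of one job with the Redshift load of another.
    val jobs = outputs.map { case (df, table) =>
      Future {
        df.write
          .format("com.databricks.spark.redshift")
          .option("dbtable", table)
          .option("url", jdbcUrl)
          .option("tempdir", s3TempDir)
          .mode("append")
          .save()
      }
    }

    Await.result(Future.sequence(jobs), Duration.Inf)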

Merging compatible schemas on Spark 1.6.0

2016-01-13 Thread emlyn
I have a series of directories on S3 with parquet data, all with compatible (but not identical) schemas. We verify that the schemas stay compatible when they evolve using org.apache.avro.SchemaCompatibility.checkReaderWriterCompatibility. On Spark 1.5, I could read these into a DataFrame with
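For reference, reading several compatible Parquet directories into one DataFrame with schema merging; a minimal sketch assuming Spark 1.x spark-shell, where sqlContext is available (the S3 paths are hypothetical):

    // Union the compatible schemas of all directories into one DataFrame schema.
    val paths = Seq(
      "s3a://bucket/data/2016-01-01",   // hypothetical locations
      "s3a://bucket/data/2016-01-02"
    )

    val df = sqlContext.read
      .option("mergeSchema", "true")    // merge compatible Parquet schemas
      .parquet(paths: _*)

    df.printSchema()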

Re: Cannot start REPL shell since 1.4.0

2015-10-23 Thread Emlyn Corrin
wrote: > do you have JAVA_HOME set to a Java 7 JDK? > > 2015-10-23 7:12 GMT-04:00 emlyn <em...@swiftkey.com>: > >> xjlin0 wrote >> > I cannot enter the REPL shell in 1.4.0/1.4.1/1.5.0/1.5.1 (with pre-built with >> > or without Hadoop, or home compiled with an

Re: Cannot start REPL shell since 1.4.0

2015-10-23 Thread emlyn
xjlin0 wrote: > I cannot enter the REPL shell in 1.4.0/1.4.1/1.5.0/1.5.1 (with pre-built with > or without Hadoop, or home compiled with Ant or Maven). There was no error > message in v1.4.x; the system prompts nothing. On v1.5.x, once I enter > $SPARK_HOME/bin/pyspark or spark-shell, I got > > Error:

Re: Cannot start REPL shell since 1.4.0

2015-10-23 Thread emlyn
emlyn wrote: > > xjlin0 wrote >> I cannot enter the REPL shell in 1.4.0/1.4.1/1.5.0/1.5.1 (with pre-built with >> or without Hadoop, or home compiled with Ant or Maven). There was no >> error message in v1.4.x; the system prompts nothing. On v1.5.x, once I enter >> $SPARK_HOME