Support for arrays parquet vectorized reader

2019-04-16 Thread Mick Davies
…been considered or whether this work is something that could be useful to the wider community. Regards Mick Davies
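For context, a minimal sketch of the case in question, using the current DataFrame API and a made-up path: a Parquet file with an array column read back with the vectorized reader flag set. At the time, the vectorized reader covered only flat primitive columns, so complex types such as arrays fell back to the slower row-based path.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Write a small table with an array column (the path is hypothetical).
    Seq((1, Seq("a", "b")), (2, Seq("c"))).toDF("id", "tags")
      .write.mode("overwrite").parquet("/tmp/arrays.parquet")

    // The vectorized Parquet reader is controlled by this flag; arrays were not
    // handled by it, which is what the proposed work would address.
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
    spark.read.parquet("/tmp/arrays.parquet").show()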

Re: Will higher order functions in spark SQL be pushed upstream?

2018-04-19 Thread Mick Davies
Hi, Regarding higher order functions: > Yes, we intend to contribute this to open source. It doesn't look like this is in 2.3.0, at least I can't find it. Do you know when it might reach open source? Thanks Mick
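For reference, these are the array higher-order functions (transform, filter, aggregate, etc.) that eventually shipped in Spark SQL 2.4. A minimal sketch, assuming an existing SparkSession named spark:

    import spark.implicits._

    val df = Seq((1, Seq(1, 2, 3)), (2, Seq(4, 5))).toDF("id", "xs")

    // transform applies a lambda to every element of the array column.
    df.selectExpr("id", "transform(xs, x -> x * 2) AS doubled").show()

    // filter keeps only the elements that satisfy the predicate.
    df.selectExpr("id", "filter(xs, x -> x > 1) AS filtered").show()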

Unit tests can generate spurious shutdown messages

2015-06-02 Thread Mick Davies
If I write unit tests that indirectly initialize org.apache.spark.util.Utils, for example by using sql types, but produce no logging, I get the following unpleasant stack trace in my test output. This is caused by the Utils class adding a shutdown hook which logs the message logDebug(Shutdown hook
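One possible workaround (a sketch only, assuming log4j 1.x on the test classpath; not the fix discussed in the thread) is to configure log4j explicitly in the test setup, so the message logged from the shutdown hook is suppressed rather than dumped into the test output:

    import org.apache.log4j.{BasicConfigurator, Level, Logger}

    // Give log4j a real appender and quieten the logger the shutdown hook writes
    // to, so test output stays clean even when the test itself logs nothing.
    BasicConfigurator.configure()
    Logger.getLogger("org.apache.spark.util.Utils").setLevel(Level.WARN)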

Re: Optimize encoding/decoding strings when using Parquet

2015-02-13 Thread Mick Davies
I have put in a PR on Parquet to support dictionaries when filters are pushed down, which should reduce binary conversion overhead when Spark pushes down string predicates on columns that are dictionary encoded. https://github.com/apache/incubator-parquet-mr/pull/117 It's blocked at the moment
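From the Spark side, the case this targets is a pushed-down string predicate on a dictionary-encoded column. A minimal sketch (current DataFrame API; the path and column name are hypothetical) with Parquet filter pushdown enabled:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // With pushdown enabled (the default), this equality predicate is handed to
    // the Parquet reader; with dictionary support it can be checked against the
    // small per-chunk dictionary instead of decoding every value in the column.
    spark.conf.set("spark.sql.parquet.filterPushdown", "true")
    val events = spark.read.parquet("/data/events.parquet")
    events.filter(events("country") === "GB").count()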

Re: Caching tables at column level

2015-02-13 Thread Mick Davies
Thanks - we have tried this and it works nicely.

Caching tables at column level

2015-02-01 Thread Mick Davies
I have been working a lot recently with denormalised tables with lots of columns, nearly 600. We are using this form to avoid joins. I have tried to use cache table with this data, but it proves too expensive as it seems to try to cache all the data in the table. For data sets such as the one I
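One way to keep the benefit of caching without paying for all ~600 columns (a sketch; the suggestion actually adopted in the thread is not shown in this excerpt) is to cache a narrow projection, either via CACHE TABLE ... AS SELECT or the DataFrame API. The table and column names below are made up:

    // Cache only the columns a workload actually reads, not the whole wide table.
    spark.sql("CACHE TABLE wide_narrow AS SELECT user_id, ts, amount FROM wide_table")

    // Equivalent with the DataFrame API:
    val narrow = spark.table("wide_table").select("user_id", "ts", "amount")
    narrow.cache()
    narrow.count()  // force materialisation of the cache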

Are there any plans to run Spark on top of Succinct

2015-01-22 Thread Mick Davies
http://succinct.cs.berkeley.edu/wp/wordpress/ Looks like a really interesting piece of work that could dovetail well with Spark. I have been trying recently to optimize some queries I have running on Spark on top of Parquet but the support from Parquet for predicate push down especially for

Re: Optimize encoding/decoding strings when using Parquet

2015-01-19 Thread Mick Davies
Added a JIRA to track: https://issues.apache.org/jira/browse/SPARK-5309

Re: Optimize encoding/decoding strings when using Parquet

2015-01-19 Thread Mick Davies
Here are some timings showing the effect of caching the last Binary-to-String conversion. Query times are reduced significantly and, thanks to the reduction in garbage, the variation in timings drops very significantly. Set of sample queries selecting various columns, applying some filtering and then aggregating, Spark 1.2.0
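A rough sketch of the idea behind these timings (not the actual patch): remember the last Binary seen and its decoded String, so runs of repeated values, common with dictionary or run-length encoded columns, are decoded once instead of once per row. The import assumes the current org.apache.parquet coordinates.

    import org.apache.parquet.io.api.Binary

    // Hypothetical memoising decoder: caching the last conversion avoids
    // re-decoding identical UTF-8 bytes and the garbage that goes with it.
    class CachedStringDecoder {
      private var lastBinary: Binary = _
      private var lastString: String = _

      def decode(b: Binary): String = {
        if (lastBinary == null || !lastBinary.equals(b)) {
          lastBinary = b
          lastString = b.toStringUsingUTF8
        }
        lastString
      }
    }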

Re: Optimize encoding/decoding strings when using Parquet

2015-01-19 Thread Mick Davies
Looking at the Parquet code, it looks like the hooks are already in place to support this. In particular, PrimitiveConverter has the methods hasDictionarySupport and addValueFromDictionary for this purpose. These are not used by CatalystPrimitiveConverter. I think that it would be pretty straightforward to
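A simplified sketch of what using those hooks could look like (not the actual Spark converter): decode each dictionary entry to a String once in setDictionary, then have addValueFromDictionary look the value up by id. Imports assume the current org.apache.parquet coordinates, and the update callback is a made-up stand-in for wherever the row value goes.

    import org.apache.parquet.column.Dictionary
    import org.apache.parquet.io.api.{Binary, PrimitiveConverter}

    // With hasDictionarySupport = true, Parquet calls setDictionary once per
    // column chunk and then addValueFromDictionary with ids, so each distinct
    // string is decoded exactly once.
    class DictionaryAwareStringConverter(update: String => Unit) extends PrimitiveConverter {
      private var decoded: Array[String] = _

      override def hasDictionarySupport: Boolean = true

      override def setDictionary(dictionary: Dictionary): Unit = {
        decoded = Array.tabulate(dictionary.getMaxId + 1) { id =>
          dictionary.decodeToBinary(id).toStringUsingUTF8
        }
      }

      override def addValueFromDictionary(dictionaryId: Int): Unit =
        update(decoded(dictionaryId))

      // Fallback for pages that are not dictionary encoded.
      override def addBinary(value: Binary): Unit =
        update(value.toStringUsingUTF8)
    }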

Optimize encoding/decoding strings when using Parquet

2015-01-16 Thread Mick Davies
Hi, It seems that a reasonably large proportion of query time using Spark SQL is spent decoding Parquet Binary objects to produce Java Strings. Has anyone considered trying to optimize these conversions, as many are duplicated? Details are outlined in the conversation in the user