Support for arrays parquet vectorized reader

2019-04-16 Thread Mick Davies
…been considered or whether this work is something that could be useful to the wider community. Regards Mick Davies
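For context, a minimal sketch of the case in question, using the current DataFrame API and a made-up path: a Parquet file with an array column read back with the vectorized reader flag set. At the time, the vectorized reader covered only flat primitive columns, so complex types such as arrays fell back to the slower row-based path.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Write a small table with an array column (the path is hypothetical).
    Seq((1, Seq("a", "b")), (2, Seq("c"))).toDF("id", "tags")
      .write.mode("overwrite").parquet("/tmp/arrays.parquet")

    // The vectorized Parquet reader is controlled by this flag; arrays were not
    // handled by it, which is what the proposed work would address.
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
    spark.read.parquet("/tmp/arrays.parquet").show()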

Re: Will higher order functions in spark SQL be pushed upstream?

2018-04-19 Thread Mick Davies
Hi, Regarding higher order functions: > Yes, we intend to contribute this to open source. It doesn't look like this is in 2.3.0, at least I can't find it. Do you know when it might reach open source? Thanks Mick
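For reference, these are the array higher-order functions (transform, filter, aggregate, etc.) that eventually shipped in Spark SQL 2.4. A minimal sketch, assuming an existing SparkSession named spark:

    import spark.implicits._

    val df = Seq((1, Seq(1, 2, 3)), (2, Seq(4, 5))).toDF("id", "xs")

    // transform applies a lambda to every element of the array column.
    df.selectExpr("id", "transform(xs, x -> x * 2) AS doubled").show()

    // filter keeps only the elements that satisfy the predicate.
    df.selectExpr("id", "filter(xs, x -> x > 1) AS filtered").show()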

Unit tests can generate spurious shutdown messages

2015-06-02 Thread Mick Davies
If I write unit tests that indirectly initialize org.apache.spark.util.Utils, for example by using sql types, but produce no logging, I get the following unpleasant stack trace in my test output. This is caused by the Utils class adding a shutdown hook which logs the message logDebug(Shutdown hook
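One possible workaround (a sketch only, assuming log4j 1.x on the test classpath; not the fix discussed in the thread) is to configure log4j explicitly in the test setup, so the message logged from the shutdown hook is suppressed rather than dumped into the test output:

    import org.apache.log4j.{BasicConfigurator, Level, Logger}

    // Give log4j a real appender and quieten the logger the shutdown hook writes
    // to, so test output stays clean even when the test itself logs nothing.
    BasicConfigurator.configure()
    Logger.getLogger("org.apache.spark.util.Utils").setLevel(Level.WARN)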

Re: Optimize encoding/decoding strings when using Parquet

2015-02-13 Thread Mick Davies
I have put in a PR on Parquet to support dictionaries when filters are pushed down, which should reduce binary conversion overhead when Spark pushes down string predicates on columns that are dictionary encoded. https://github.com/apache/incubator-parquet-mr/pull/117 It's blocked at the moment
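From the Spark side, the case this targets is a pushed-down string predicate on a dictionary-encoded column. A minimal sketch (current DataFrame API; the path and column name are hypothetical) with Parquet filter pushdown enabled:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // With pushdown enabled (the default), this equality predicate is handed to
    // the Parquet reader; with dictionary support it can be checked against the
    // small per-chunk dictionary instead of decoding every value in the column.
    spark.conf.set("spark.sql.parquet.filterPushdown", "true")
    val events = spark.read.parquet("/data/events.parquet")
    events.filter(events("country") === "GB").count()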

Re: Caching tables at column level

2015-02-13 Thread Mick Davies
Thanks - we have tried this and it works nicely.

Caching tables at column level

2015-02-01 Thread Mick Davies
I have been working a lot recently with denormalised tables with lots of columns, nearly 600. We are using this form to avoid joins. I have tried to use cache table with this data, but it proves too expensive as it seems to try to cache all the data in the table. For data sets such as the one I
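One way to keep the benefit of caching without paying for all ~600 columns (a sketch; the suggestion actually adopted in the thread is not shown in this excerpt) is to cache a narrow projection, either via CACHE TABLE ... AS SELECT or the DataFrame API. The table and column names below are made up:

    // Cache only the columns a workload actually reads, not the whole wide table.
    spark.sql("CACHE TABLE wide_narrow AS SELECT user_id, ts, amount FROM wide_table")

    // Equivalent with the DataFrame API:
    val narrow = spark.table("wide_table").select("user_id", "ts", "amount")
    narrow.cache()
    narrow.count()  // force materialisation of the cache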

Are there any plans to run Spark on top of Succinct

2015-01-22 Thread Mick Davies
http://succinct.cs.berkeley.edu/wp/wordpress/ Looks like a really interesting piece of work that could dovetail well with Spark. I have been trying recently to optimize some queries I have running on Spark on top of Parquet but the support from Parquet for predicate push down especially for

Re: Optimize encoding/decoding strings when using Parquet

2015-01-19 Thread Mick Davies
Added a JIRA to track: https://issues.apache.org/jira/browse/SPARK-5309

Re: Optimize encoding/decoding strings when using Parquet

2015-01-19 Thread Mick Davies
Here are some timings showing the effect of caching the last Binary-to-String conversion. Query times are reduced significantly and, thanks to the reduction in garbage, the variation in timings drops very significantly. Set of sample queries selecting various columns, applying some filtering and then aggregating, Spark 1.2.0
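A rough sketch of the idea behind these timings (not the actual patch): remember the last Binary seen and its decoded String, so runs of repeated values, common with dictionary or run-length encoded columns, are decoded once instead of once per row. The import assumes the current org.apache.parquet coordinates.

    import org.apache.parquet.io.api.Binary

    // Hypothetical memoising decoder: caching the last conversion avoids
    // re-decoding identical UTF-8 bytes and the garbage that goes with it.
    class CachedStringDecoder {
      private var lastBinary: Binary = _
      private var lastString: String = _

      def decode(b: Binary): String = {
        if (lastBinary == null || !lastBinary.equals(b)) {
          lastBinary = b
          lastString = b.toStringUsingUTF8
        }
        lastString
      }
    }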

Re: Optimize encoding/decoding strings when using Parquet

2015-01-19 Thread Mick Davies
Looking at the Parquet code, it looks like the hooks are already in place to support this. In particular, PrimitiveConverter has the methods hasDictionarySupport and addValueFromDictionary for this purpose. These are not used by CatalystPrimitiveConverter. I think that it would be pretty straightforward to
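A simplified sketch of what using those hooks could look like (not the actual Spark converter): decode each dictionary entry to a String once in setDictionary, then have addValueFromDictionary look the value up by id. Imports assume the current org.apache.parquet coordinates, and the update callback is a made-up stand-in for wherever the row value goes.

    import org.apache.parquet.column.Dictionary
    import org.apache.parquet.io.api.{Binary, PrimitiveConverter}

    // With hasDictionarySupport = true, Parquet calls setDictionary once per
    // column chunk and then addValueFromDictionary with ids, so each distinct
    // string is decoded exactly once.
    class DictionaryAwareStringConverter(update: String => Unit) extends PrimitiveConverter {
      private var decoded: Array[String] = _

      override def hasDictionarySupport: Boolean = true

      override def setDictionary(dictionary: Dictionary): Unit = {
        decoded = Array.tabulate(dictionary.getMaxId + 1) { id =>
          dictionary.decodeToBinary(id).toStringUsingUTF8
        }
      }

      override def addValueFromDictionary(dictionaryId: Int): Unit =
        update(decoded(dictionaryId))

      // Fallback for pages that are not dictionary encoded.
      override def addBinary(value: Binary): Unit =
        update(value.toStringUsingUTF8)
    }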

Optimize encoding/decoding strings when using Parquet

2015-01-16 Thread Mick Davies
Hi, It seems that a reasonably large proportion of query time using Spark SQL is spent decoding Parquet Binary objects to produce Java Strings. Has anyone considered trying to optimize these conversions, as many are duplicated? Details are outlined in the conversation in the user