Hi, I'm working with a medical data model that uses arrays of simple types to represent things like the drug exposures and conditions associated with a patient.
Using this model, each patient's data is co-located and is consequently processed more efficiently by Spark. The data is stored in Parquet format.

To improve processing time, we have experimented with adding support for simple arrays to the Parquet vectorized reader. This change gives us significant performance improvements, more than 4x faster for some operations.

I was wondering whether any enhancements like this have been considered, or whether this work is something that could be useful to the wider community.

Regards,
Mick Davies