That makes a lot of sense. Just one question with regard to handling complex types - do you mean maps/arrays/etc. (repetitions in Parquet)? As in, if I created a Parquet table from some JSON files with a rather complex/nested structure, would it fall back to individual copies?
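For example, here's the kind of nested schema I mean (field names invented for illustration), expressed with parquet-mr's schema parser - note the package prefix may be parquet.schema rather than org.apache.parquet.schema depending on the parquet-mr version:

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class NestedSchemaExample {
  public static void main(String[] args) {
    // A repeated group is how Parquet represents arrays/maps coming
    // from JSON; field names here are made up for illustration.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message order {\n"
      + "  required int64 order_id;\n"
      + "  repeated group items {\n"
      + "    required binary sku (UTF8);\n"
      + "    required int32 quantity;\n"
      + "  }\n"
      + "}");
    System.out.println(schema);
  }
}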
I guess either way, you almost want a hybrid approach. If a particular page fully matches the query, you want to batch-read the values into a vector; if a page doesn't match at all, you skip it completely; and if a page partially matches, you copy individual values out. Sounds like a bit of rework, though.
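To make that concrete, here's a rough sketch of the hybrid strategy - all of these types are invented for illustration and are not Drill's or parquet-mr's actual reader API:

import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of a hybrid page-level read strategy. */
public class HybridPageRead {

  /** Stand-in for a Parquet data page plus its min/max statistics. */
  static final class Page {
    final int min, max;   // per-page statistics from the file metadata
    final int[] values;   // decoded column values
    Page(int min, int max, int[] values) {
      this.min = min; this.max = max; this.values = values;
    }
  }

  /** Collects values satisfying lo <= v <= hi from a list of pages. */
  static List<Integer> read(List<Page> pages, int lo, int hi) {
    List<Integer> out = new ArrayList<>();
    for (Page page : pages) {
      if (page.max < lo || page.min > hi) {
        continue;                          // no value can match: skip the page
      }
      if (page.min >= lo && page.max <= hi) {
        for (int v : page.values) {        // every value matches: batch copy
          out.add(v);                      // (a real reader would bulk-copy into a vector)
        }
      } else {
        for (int v : page.values) {        // partial match: copy values individually
          if (v >= lo && v <= hi) out.add(v);
        }
      }
    }
    return out;
  }
}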
On Thu, Jan 8, 2015 at 1:57 PM, Adam Gilmore <a...@pharmadata.net.au> wrote:

> That makes a lot of sense. Just one question with regard to handling complex types - do you mean maps/arrays/etc. (repetitions in Parquet)? As in, if I created a Parquet table from some JSON files with a rather complex/nested structure, would it fall back to individual copies?
>
> Regards,
>
> Adam Gilmore
>
> On Thu, Jan 8, 2015 at 12:05 PM, Jason Altekruse <altekruseja...@gmail.com> wrote:
>
>> The parquet library provides an interface for accessing individual values of each column (as well as a record-assembly interface for populating Java objects). As Parquet is columnar, and the Drill in-memory storage format is also columnar, we get much better read performance on queries where most of the data is needed if we copy long runs of values rather than making a large number of individual copies.
>>
>> This obviously does not give us great performance for point queries, where only a small subset of the data is needed. While these use cases are prevalent and we are hoping to fix this issue soon, when we wrote the original implementation we were interested in stretching the bounds of how fast we could load a volume of data into the engine.
>>
>> The second reader, written to handle complex types, does use the current 'columnar' interface exposed by the parquet library, but it still requires us to make individual copies for each value. Even when we experimented with early versions of the predicate pushdown provided by the parquet codebase, we were unable to match the performance of reading and filtering the data ourselves. This was not fully explored, and a number of enhancements have since been made to the parquet mainline that may give us the performance we are looking for in these cases. We haven't had time to revisit it so far.
>>
>> -Jason Altekruse
>>
>> On Wed, Jan 7, 2015 at 4:04 PM, Adam Gilmore <dragoncu...@gmail.com> wrote:
>>
>>> Out of interest, is there a reason Drill effectively implemented its own Parquet reader rather than using the reading classes from the Parquet project itself? Were there particular performance reasons for this?
>>>
>>> On Thu, Jan 8, 2015 at 2:22 AM, Jason Altekruse <altekruseja...@gmail.com> wrote:
>>>
>>>> Just made one; I put some comments there from the design discussions we have had in the past.
>>>>
>>>> https://issues.apache.org/jira/browse/DRILL-1950
>>>>
>>>> - Jason Altekruse
>>>>
>>>> On Tue, Jan 6, 2015 at 11:04 PM, Adam Gilmore <dragoncu...@gmail.com> wrote:
>>>>
>>>>> Just a quick follow-up on this - is there a JIRA item for implementing push-down predicates for Parquet scans, or do we need to create one?
>>>>>
>>>>> On Tue, Jan 6, 2015 at 1:56 AM, Jason Altekruse <altekruseja...@gmail.com> wrote:
>>>>>
>>>>>> Hi Adam,
>>>>>>
>>>>>> I have a few thoughts that might explain the difference in query times. Drill is able to read a subset of the data from a Parquet file when selecting only a few columns out of a large file: in terms of read performance, Drill will give you faster results if you ask for 3 columns instead of 10. However, we are still working on further optimizing the reader by making use of the statistics contained in the block and page metadata, which will allow us to skip reading subsets of a column, as the Parquet writer can store min/max values for blocks of data.
>>>>>>
>>>>>> If you ran a query that was just summing over a column, the reason it was faster is that it avoided the large number of individual value copies made as we filter out the records that are not needed. That filtering currently takes place in a separate filter operator; it should be pushed down into the read operation to make use of the file metadata and eliminate some of the reads.
>>>>>>
>>>>>> -Jason
>>>>>>
>>>>>> On Mon, Jan 5, 2015 at 8:15 AM, Adam Gilmore <dragoncu...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>>
>>>>>>> I have a question re Parquet. I'm not sure if this is a Drill question or a Parquet one, but I thought I'd start here.
>>>>>>>
>>>>>>> I have a sample dataset of ~100M rows in a Parquet file. It's quick to sum a single column across the whole dataset.
>>>>>>>
>>>>>>> I have a column with approximately 100 unique values (e.g. a customer ID). When I filter on that column by one of those values (reducing the set to ~1M values), the query takes longer.
>>>>>>>
>>>>>>> This doesn't make a lot of sense to me - I would have expected the Parquet format to bring back only the segments that match and to sum only those values. I would expect that to make the query orders of magnitude faster, not slower.
>>>>>>>
>>>>>>> Other columnar formats I've used (e.g. ORCFile, SQL Server Columnstore) have acted this way, so I can't quite understand why Parquet doesn't act the same.
>>>>>>>
>>>>>>> Can anyone suggest what I'm doing wrong?
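For reference, predicate pushdown on the library side looks roughly like this with parquet-mr's filter2 API. This is a hedged sketch, not Drill's implementation: the import prefixes are the modern org.apache.parquet ones (releases contemporary with this thread used the parquet.* prefix), and the file path and column name are made up.

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.filter2.predicate.Operators.IntColumn;
import org.apache.parquet.hadoop.ParquetReader;

public class FilterPushdownExample {
  public static void main(String[] args) throws Exception {
    // Predicate equivalent to "WHERE customer_id = 42".
    IntColumn customerId = FilterApi.intColumn("customer_id");
    FilterPredicate pred = FilterApi.eq(customerId, 42);

    // With the filter attached, the reader can consult row-group (and,
    // in later versions, page-level) min/max statistics and skip chunks
    // that cannot possibly contain a match.
    try (ParquetReader<GenericRecord> reader =
             AvroParquetReader.<GenericRecord>builder(new Path("/data/sales.parquet"))
                 .withFilter(FilterCompat.get(pred))
                 .build()) {
      for (GenericRecord rec = reader.read(); rec != null; rec = reader.read()) {
        // sum the column of interest, etc.
      }
    }
  }
}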