Jacques,

Is this something you think makes sense and could be accommodated?

Regards,
-Stefan

On Fri, Sep 18, 2015 at 12:13 PM, Stefán Baxter <ste...@activitystream.com> wrote:

> Hi,
>
> The short back story is this:
>
> - We are serving multiple tenants with vastly different data volumes
>   and needs
>   - there is no such thing as a fixed-period segment size (to get to an
>     approx. volume per segment)
>
> - We do queries that combine information from historical and fresh
>   (streaming) data (parquet and json/avro respectively) using joins
>   - currently we are using loggers to emit the streaming data but this
>     will be replaced
>
> - The "fresh" data (json/avro) files live in a single directory
>   - 1 file per day
>
> - Fresh data is occasionally transformed from json/avro to parquet
>   - the frequency of this is set on a tenant/volume basis
>
> This is why we need/like to*:
>
> - Use directory structure and file names as flexible chronological
>   partitions (via UDFs)
> - Use parquet partitions for "logical data separation" based on
>   attributes other than time
>
> * Please remember that adding new data to parquet files would
>   eliminate the need for much of this
> ** The same is true if we would move this whole thing to some
>    metadata-driven environment like Hive
>
> The Historical (parquet) directory structure might look something like
> this:
>
> 1. /<tenant>/<source>/streaming/2015/09/10
>    - high volume :: data transformed daily
>
> 2. /<tenant>/<source>/streaming/2015/W10
>    - medium volume :: data transformed weekly
>
> 3. /<tenant>/<source>/streaming/2015/09
>    - low(er) volume :: data transformed monthly
>
> So yes, we think that having the ability to evaluate full paths and file
> names where we can affect the pruning/scanning with appropriate exceptions
> would help us gain some sanity :).
>
> I realize that pruning should preferably be done in the planning phase but
> this would allow for a not-too-messy interception of the scanning process.
>
> Best regards,
> -Stefan
>
>
> On Fri, Sep 18, 2015 at 6:01 AM, Jacques Nadeau <jacq...@dremio.com>
> wrote:
>
>> Can you also provide some examples of what you are trying to accomplish?
>>
>> It seems like you might be saying that you want a virtual attribute for
>> the entire path rather than individual pieces? Also remember that
>> partition pruning can also be done if you're using Parquet files without
>> all the dirN syntax.
>>
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>> On Thu, Sep 17, 2015 at 10:42 AM, Stefán Baxter <
>> ste...@activitystream.com> wrote:
>>
>> > Hi,
>> >
>> > I have been writing a few simple utility functions for Drill and
>> > staring at the cumbersome dirN conditions required to take advantage
>> > of directory pruning.
>> >
>> > Would it be possible to allow UDFs to throw fileOutOfScope and
>> > directoryOutOfScope exceptions that would a) allow me to write a fairly
>> > clever inRange(from, to, dirN...) function and b) allow for
>> > additional pruning during execution?
>> >
>> > Maybe I'm seeing this all wrong but the process of complicating all
>> > queries with a, sometimes quite complicated, dirN tail just seems like
>> > too much redundancy.
>> >
>> > Regards,
>> > -Stefan
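
For illustration only, here is a rough sketch of the date logic an inRange(from, to, dirN...) function would need for the three directory layouts listed in the thread (daily, weekly, monthly). It is plain Python, not a Drill UDF, and it uses no Drill APIs. The names partition_span and in_range are hypothetical, and reading "W10" as an ISO week number is an assumption; the thread does not say which week numbering is used.

```python
import calendar
from datetime import date, timedelta
from typing import Optional, Tuple


def partition_span(year: str, part: str, day: Optional[str] = None) -> Tuple[date, date]:
    """Return the first and last calendar day covered by one directory.

    year -> dir0, e.g. "2015"
    part -> dir1, either a month ("09") or a week ("W10")
    day  -> dir2, only present for the daily layout ("10")
    """
    y = int(year)
    if part.upper().startswith("W"):
        # Assumption: weeks are ISO weeks, Monday through Sunday.
        monday = date.fromisocalendar(y, int(part[1:]), 1)  # Python 3.8+
        return monday, monday + timedelta(days=6)
    month = int(part)
    if day is not None:
        single = date(y, month, int(day))
        return single, single
    last_day = calendar.monthrange(y, month)[1]
    return date(y, month, 1), date(y, month, last_day)


def in_range(start: date, end: date, dir0: str, dir1: str,
             dir2: Optional[str] = None) -> bool:
    """True when the directory's period overlaps the inclusive range [start, end]."""
    first, last = partition_span(dir0, dir1, dir2)
    return first <= end and last >= start
```

A directory for which in_range returns False is one that the proposed directoryOutOfScope signal would let the scan skip. Whether Drill could act on such a signal during execution is exactly the open question in the thread.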