Jacques,

Is this something you think makes sense and could be accommodated?

Regards,
 -Stefan

On Fri, Sep 18, 2015 at 12:13 PM, Stefán Baxter <ste...@activitystream.com>
wrote:

> Hi,
>
> The short back story is this:
>
>    - We are serving multiple tenants with vastly different data volumes
>    and needs
>    - there is no such thing as fixed-period segment sizes (to get to an
>    approximate volume per segment)
>
>    - We run queries that combine information from historical and fresh
>    (streaming) data (parquet and json/avro respectively) using joins (a
>    rough sketch of such a query follows below this list)
>    - currently we are using loggers to emit the streaming data, but this
>    will be replaced
>
>    - The "fresh" data (json/avro)  files live in a single directory
>    - 1 file per day
>
>    - Fresh data is occasionally transformed from json/avro to parquet
>    - the frequency of this is set on a per-tenant/volume basis
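>
>    For illustration, a rough sketch of the kind of query we run today (the
>    paths and column names below are placeholders, not our real schema):
>
>       SELECT h.event_type, COUNT(*) AS events
>       FROM dfs.`/acme/web/streaming/2015/09` h   -- historical (parquet)
>       JOIN dfs.`/acme/web/fresh` f               -- fresh (json/avro)
>         ON h.session_id = f.session_id
>       GROUP BY h.event_type;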
>
> This is why we need/would like to*:
>
>    - Use the directory structure and file names as flexible chronological
>    partitions (via UDFs)
>    - Use parquet partitions for "logical data separation" based on
>    attributes other than time
>
>    * Please remember that being able to add new data to parquet files
>    would eliminate the need for much of this
>    ** The same is true if we moved this whole thing to some metadata-driven
>    environment like Hive
>
> The historical (parquet) directory structure might look something like this
> (the dirN conditions each layout implies are sketched below the list):
>
>    1. /<tenant>/<source>/streaming/2015/09/10
>    - high volume :: data transformed daily
>
>    2. /<tenant>/<source>/streaming/2015/W10
>    - medium volume :: data transformed weekly
>
>    3. /<tenant>/<source>/streaming/2015/09
>    - low(er) volume :: data transformed monthly
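>
>    To make that concrete: asking for the single day/week/month shown above
>    means a different dirN tail for each layout (assuming the workspace root
>    is /<tenant>/<source>/streaming, so that dir0 is the year and so on):
>
>       WHERE dir0 = '2015' AND dir1 = '09' AND dir2 = '10'   -- 1, daily
>       WHERE dir0 = '2015' AND dir1 = 'W10'                  -- 2, weekly
>       WHERE dir0 = '2015' AND dir1 = '09'                   -- 3, monthly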
>
> So yes, we think that having the ability to evaluate full paths and file
> names, and to affect the pruning/scanning with appropriate exceptions,
> would help us regain some sanity :).
>
> I realize that pruning should preferably be done in the planning phase but
> this would allow for a not-too-messy interception of the scanning process.
>
> Best regards,
>  -Stefan
>
>
> On Fri, Sep 18, 2015 at 6:01 AM, Jacques Nadeau <jacq...@dremio.com>
> wrote:
>
>> Can you also provide some examples of what you are trying to accomplish?
>>
>> It seems like you might be saying that you want a virtual attribute for the
>> entire path rather than for individual pieces? Also remember that partition
>> pruning can be done if you're using Parquet files, without all the dirN
>> syntax.
>>
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>> On Thu, Sep 17, 2015 at 10:42 AM, Stefán Baxter <ste...@activitystream.com>
>> wrote:
>>
>> > Hi,
>> >
>> > I have been writing a few simple utility functions for Drill and staring
>> > at the cumbersome dirN conditions required to take advantage of directory
>> > pruning.
>> >
>> > Would it be possible to allow UDFs to throw fileOutOfScope and
>> > directoryOutOfScope exceptions that would allow me to a) write a fairly
>> > clever inRange(from, to, dirN...) function and b) allow for additional
>> > pruning during execution?
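>> >
>> > For illustration, roughly the before/after I have in mind (the path is a
>> > placeholder and inRange does not exist today; it is the function I would
>> > like to write):
>> >
>> >    -- today: a hand-written dirN tail, which gets hairy once a range
>> >    -- crosses month/year boundaries
>> >    SELECT * FROM dfs.`/acme/web/streaming`
>> >    WHERE dir0 = '2015'
>> >      AND (dir1 > '09' OR (dir1 = '09' AND dir2 >= '10'));
>> >
>> >    -- what I would like to write instead:
>> >    SELECT * FROM dfs.`/acme/web/streaming`
>> >    WHERE inRange('2015-09-10', '2015-12-31', dir0, dir1, dir2);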
>> >
>> > Maybe I'm seeing this all wrong, but having to end every query with a
>> > sometimes quite complicated dirN tail just seems like too much
>> > redundancy.
>> >
>> > Regards,
>> >  -Stefan
>> >
>>
>
>
