Hi,

You are right, partition pruning works with simple queries (I will adjust
our queries accordingly):

   - select count(*) from dfs.tmp.`/analytics/processed/test/events` as t
   where dir0 = '2014';

I tried to run the following query:

   - select count(*) from dfs.tmp.`/analytics/processed/test/events` as t
   where dir0 = dirInRange(cast('2015-04-10' as timestamp),
                           cast('2015-07-11' as timestamp),
                           COALESCE(dir0,'-'))
   and dir1 = dirInRange(cast('2015-04-10' as timestamp),
                         cast('2015-07-11' as timestamp),
                         COALESCE(dir0,'-'), COALESCE(dir1,'-'));

This latter query uses a UDF (dirInRange) that operates solely on static +
directory information (see my earlier post on the same subject), but
partition pruning does not work for it.

Using non-static values for directory evaluation/planning does not work,
and that can be quite limiting since periods can, for example, span
multiple years and months; the static filters on dir0/dir1/dir2
combinations then become quite verbose, although they work :) (a sketch
follows below).
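
For reference, a sketch of that verbose-but-working static variant for the
same range, assuming dir0 holds the year and dir1 a zero-padded month (the
t.ts timestamp column is made up for illustration):

   - select count(*) from dfs.tmp.`/analytics/processed/test/events` as t
   where dir0 = '2015' and dir1 in ('04','05','06','07')
   and t.ts >= cast('2015-04-10' as timestamp)
   and t.ts < cast('2015-07-12' as timestamp);

The dir0/dir1 predicates are static, so the planner should be able to prune
on them, while the timestamp filter trims the partial months at both ends.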

I understand the limitation and apologize for the false claim, but it would
be great if partition pruning worked with functions like these.

Best regards,
 -Stefan


On Tue, Aug 4, 2015 at 4:56 AM, rahul challapalli <
challapallira...@gmail.com> wrote:

> As Jinfeng mentioned, directory-based partition pruning should work. You
> might also be interested in DRILL-3333
> <https://issues.apache.org/jira/browse/DRILL-3333>, which allows you to
> auto-partition data when using CTAS.
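>
> A minimal sketch of what that can look like (Drill 1.1+; the target table
> and column names below are made up for illustration):
>
>    create table dfs.tmp.events_by_year partition by (yr)
>    as select extract(year from eventtime) as yr, eventtype, payload
>    from dfs.tmp.`/analytics/processed/test/events`;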
>
> - Rahul
>
> On Mon, Aug 3, 2015 at 5:03 PM, Jinfeng Ni <jinfengn...@gmail.com> wrote:
>
> > For the issues about partition pruning over directories, could you please
> > provide more detailed information? Drill should do partition pruning based
> > on directory. If it does not work the way you want, probably there is a
> > bug in the code. We would appreciate it if you could provide more detail,
> > so that we can reproduce the problem and get it fixed ASAP.
> >
> > Regards,
> >
> > Jinfeng
> >
> >
> > On Mon, Aug 3, 2015 at 4:44 PM, Stefán Baxter <ste...@activitystream.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I have been meaning to write a blog post regarding our Drill experiments
> > > but I thought I might share some thoughts here first.
> > > Hopefully some newbies can benefit from this and perhaps it sheds some
> > > light on what drives newcomers in this community (or at least some of
> > > them).
> > >
> > > A bit of a back story: I work for a small startup that has been
> > > developing its application for *too long*, but we hope to launch our
> > > product in Q1/2016.
> > > We have been using Druid as a real-time analytics store and, even though
> > > Druid does what it does extremely well, we are now forced to replace it
> > > to allow for functionality that we cannot be without.
> > >
> > > We started to evaluate several solutions and we are now left with a
> > > handful that we like. We have not had the time to evaluate all of them
> > > to extremes but we feel comfortable with the general approach we have
> > > taken.
> > >
> > > There were many things for us to consider and I want to run through
> > > them here to seek advice from those here who know better. (I apologize
> > > for the long read)
> > >
> > > *Initial questions we had:*
> > >
> > >    - Do we want to operate our own (distributed) file system or rely on
> > >    something like S3?
> > >    - Are there other options?
> > >
> > >    - Do we favor "bringing the computation to the data" or "bringing
> > >    the data to the computation"?
> > >    - For us, Drill+S3 is a "bring the data to the computation" solution
> > >    while Hadoop+Hive+MapReduce is a "bring the computation to the data"
> > >    solution
> > >
> > >    - How can we effectively merge streaming data with historical data
> > >    and process both at the same time?
> > >    - Something that Druid does very well
> > >
> > >    - What is the best way to merge data from different sources in a
> > >    polyglot data / format / sources environment?
> > >
> > >    - How can we minimize the effect of a large number of small tenants?
> > >    - Warming up tenant-specific indexes in Elasticsearch can, for
> > >    example, be costly (what a great solution though)
> > >
> > >    - How can we evict long-tail data from fast storage and gradually
> > >    move it into deep storage over time?
> > >    - ... while avoiding a lengthy warm-up period
> > >    - ... while using deep storage also as backup, even for fresh data
> > >
> > >    - Can we cache aggregates or would we just be building a cache that
> > >    next-to-nothing would ever hit?
> > >    - aggregation for days, weeks, months with all the most-used spins
> > >    - historical data does not change here so one level of complexity is
> > >    absent
> > >
> > >    - Can we achieve sub-second results dealing with our tenant-specific
> > >    data sets?
> > >    - We will never be FB/Twitter but we still have some sizable chunks
> > >    of data :)
> > >
> > >    - Is there a stack available that already works like this?
> > >    - MapR / BDAS etc. etc.
> > >
> > >    - Can we build this using "pure" Apache components?
> > >    - We would like to have the option of going without
> > >    support/license fees, but we also like the option of professional
> > >    services with SLAs and paid subscriptions
> > >
> > >    - Is this all a premature optimization?
> > >    - aka. "let's just use Elastic Search for now"
> > >
> > >
> > > *Storage format*
> > >
> > > We decided early on that we liked Parquet <https://parquet.apache.org/>
> > > enough to commit to it for our processed/historical data.
> > > We know that ORC <https://orc.apache.org/> is also great, but we believe
> > > that Parquet will, with its support for nested structures, broad appeal
> > > and interoperability, become a de-facto standard before long.
> > > (We really like ORC as well :) )
> > >
> > > Parquet has so many things going for it, so I will jump to what we see
> > > as its current cons:
> > >
> > >    - Parquet does not compress as well as ORC
> > >    - Seems to make up for it in retrieval speed (
> > >    https://developer.ibm.com/hadoop/blog/2014/09/19/big-sql-3-0-file-formats-usage-performance/
> > >    )
> > >
> > >    - Parquet is still missing some things that could make it even
> > >    better for analytics
> > >    - Like: Bloom filters
> > >    <https://issues.apache.org/jira/browse/PARQUET-41>,
> > >    HyperLogLog <https://issues.apache.org/jira/browse/PARQUET-42>
> > >    and indexing/sorting
> > >
> > >    - Hive has some problems dealing with data in nested structures in
> > >    Parquet files (but that may have changed)
> > >
> > >    - It's ineffective to gradually add data to Parquet files (or so we
> > >    believe) so streaming data should be batched into Parquet files (see
> > >    the sketch after this list)
> > >
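> > > The batching we have in mind is essentially a periodic CTAS that rolls
> > > a batch of raw JSON into Parquet; a minimal sketch (the paths and the
> > > table name are made up for illustration):
> > >
> > >    alter session set `store.format` = 'parquet';
> > >    create table dfs.tmp.events_2015_08 as
> > >    select * from dfs.tmp.`/analytics/raw/events/2015-08`;
> > >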
> > > Our streaming data is of mixed schema (partially schemed and partially
> > > schema-less) so using Avro <https://avro.apache.org/> or Protobuf
> > > <https://developers.google.com/protocol-buffers/> is doable but tricky,
> > > while JSON is very doable but freakishly verbose.
> > > Drill will, for example, allow us to join JSON data with Parquet data,
> > > and in our tests we have been merging log information, in JSON format,
> > > with historical/older data stored in Parquet (a sketch follows below).
> > > Querying log files like this is sub-optimal, even though they are stored
> > > on PCIe flash, so we are still looking for viable alternatives.
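> > >
> > > A sketch of such a merge, joining fresh JSON logs onto the Parquet
> > > history (the dfs.logs workspace and the column names are made up for
> > > illustration):
> > >
> > >    select h.tenant, count(*) as events
> > >    from dfs.tmp.`/analytics/processed/test/events` h
> > >    join dfs.logs.`/streaming/2015-08-03.json` l on h.tenant = l.tenant
> > >    group by h.tenant;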
> > >
> > > The worst-case scenario for us is to continue to work with this log
> > > approach, but then using Avro.
> > >
> > > Relevant Drill points:
> > >
> > >    - +  Drill knows all the file formats we like to use (and then some)
> > >    - -   Drill still misses RDBMS access (via JDBC) but that will show
> > >    up soon enough (for us)
> > >
> > >
> > > *Query engine*
> > >
> > > We like SQL.
> > >
> > > It's the key to so many systems, tools and resources that selecting a
> > > proprietary query language is not an option (any more).
> > > The query engines we know (of) in this space are:
> > >
> > >    - Presto <https://prestodb.io/>
> > >    - Apache Drill <https://drill.apache.org/>
> > >    - Apache Spark <http://spark.apache.org/>
> > >
> > > They all have in common the ability to run distributed queries and merge
> > > data from different sources using different formats, which is great!
> > > Spark does, in addition to this, offer a range of capabilities
> > > (ML/Graph/etc.) that are appealing to us.
> > >
> > > We have not yet decided what to do, but Drill is very appealing to us
> > > and we have it running in the "production" system used by our dev
> > > partners.
> > >
> > > Relevant Drill points:
> > >
> > >    - + Drill does SQL well and is getting better at dealing with
> > >    evolving/dirty schema (aka JSON curve balls)
> > >    - + Drill is getting better and better at window functions and
> > >    various aggregations (see the sketch after this list)
> > >    - + Drill can be accessed via JDBC/ODBC and it has a Spark connection
> > >    as well
> > >    - + Drill does User Defined Functions (could be a bit more accessible
> > >    but hey, they are there!)
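> > >
> > > For instance, the window functions that arrived in Drill 1.1 cover the
> > > per-tenant running counts we need; a minimal sketch (the column names
> > > are made up for illustration):
> > >
> > >    select tenant, eventtime,
> > >    count(*) over (partition by tenant order by eventtime) as running
> > >    from dfs.tmp.`/analytics/processed/test/events`;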
> > >
> > > *Storage:*
> > >
> > > It's very appealing to use Hive <https://hive.apache.org/> and
> > > Hadoop/HDFS <https://hadoop.apache.org/> to store our data, and Hive can
> > > point to data in S3 as well as in other storage locations.
> > > Many solutions know how to use the Hive catalog, and some optimization
> > > seems to be available that utilizes it.
> > >
> > > Still, we are not convinced that we want to build a Hadoop
> > > <https://hadoop.apache.org/> cluster (yeah, it's mostly prejudice).
> > > We don't want to use S3 directly or be forced to live in the EC2
> > > ecosystem for it to make any operational sense (yeah, mostly we are
> > > cheap).
> > >
> > > Ideally we would like something like this:
> > >
> > >    - Use S3 (or any other cost effective alternative) as deep storage
> > >
> > >    - Cache new and in-demand data locally to minimize the S3 traffic as
> > >    much as possible (large tiered cache: t1:Flash, t2:SSD and t3:HDD)
> > >
> > >    - We would like to use a file system that plays nice with the Parquet
> > >    file format and does the scans/reads required to work effectively
> > >    with its columnar nature.
> > >    We are still not sure if HDFS is the ideal match for Parquet from
> > >    this perspective, but we think that both Gluster
> > >    <http://www.gluster.org/> and Ceph <http://ceph.com/> could be (even
> > >    though Ceph is a Blob/Object store like S3 underneath)
> > >
> > > Relevant Drill Points (for file system based operations):
> > >
> > >    - - All Drillbits (nodes) require access to the same data for
> > >    distribution/planning to work ("bring the data to the computation").
> > >    A planner that knows about "single-node data" and can plan for it
> > >    would be ideal for alternatives like the one we have in mind
> > >
> > >    - - Partition pruning does not work with directories.
> > >    Partitioning data based on time span into a sub-directory structure
> > >    has no benefits (Drill goes through them all, always)
> > >
> > > We were quite optimistic when we came across Tachyon
> > > <http://tachyon-project.org/> and got it to work with Drill (see earlier
> > > email).
> > > It seemed like the ideal bridge between Drill and S3 that would allow us
> > > to do multi-tiered caching and smart sharing of content between
> > > Drillbits (nodes) without fetching it multiple times from S3.
> > > But Tachyon currently has a single drawback that we cannot accept: it
> > > stores all files in S3 in a proprietary way that does not reflect the
> > > "file system" that it fronts.
> > > Also, accessing Tachyon seems to be an API-only exercise and NFS access,
> > > for example, does not exist. (In that way Tachyon is wicked fast and
> > > slow (as in dim) at the same time.)
> > >
> > > *To oversimplify - the choice is now between the beaten path
> > > (MapR/BDAS/etc.) or ... *
> > >
> > > *A)* the 12" mix - feat. file system only:
> > >
> > >    - Query engine: *Drill*
> > >    - Streaming/log data format: *Avro* / *JSON*
> > >    - Historical Data Format: *Parquet*
> > >    - Extra dimensions / enrichment information: *RDBMS* / *Cassandra* /
> > >    *Solr* / *More*
> > >    - File System: *S3 + Tachyon* / *Gluster* / *Ceph*
> > >
> > > *B)* the 6" mix - feat. Johnny Cash:
> > >
> > >    - Query engine: *Drill / Presto / Spark / Spark + Drill*
> > >    - Streaming/log data format: *Avro* / *JSON*
> > >    - Historical Data Format: *Parquet*
> > >    - Extra dimensions / enrichment information: *RDBMS* / *Cassandra* /
> > >    *Solr* / *More*
> > >    - File System: *Hadoop + Hive*
> > >
> > >
> > > *General Drill observations that are relevant to us (partly redundant):*
> > >
> > >    - The REST API only returns JSON
> > >    - Uncompressed and incorrectly typed (count(*) returns a number in
> > >    quotes)
> > >    - The result-set schema is always known, so returning Avro/Protobuf
> > >    could be beneficial
> > >
> > >    - Schema discovery is still a bit immature / fragile / temperamental
> > >    - This is understandable and will change
> > >
> > >    - Partition pruning does not work for directories
> > >    - See point above (Storage)
> > >
> > >    - Schema information is not available for data in the file system
> > >    - This will likely change (at least the schema is created at query
> > >    time)
> > >
> > >    - Metadata queries are not available (data discovery queries)
> > >    - Discovering fields, cardinality and composition for data that
> > >    meets certain criteria
> > >    - This is essential in building a good UI for tenant-specific data
> > >    (data that varies between tenants); see the sketch below
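> > >
> > > The closest we get today is sampling, e.g. a quick
> > >
> > >    - select * from dfs.tmp.`/analytics/processed/test/events` limit 10;
> > >
> > > which shows the fields Drill discovers but tells us nothing about their
> > > cardinality or composition.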
> > >
> > > I will, in the near future, try to evolve this into a more informative
> > > blog post and share it for those starting down this path.
> > >
> > > There is so much more going on in/around this space (Tez
> > > <https://tez.apache.org/>, Flink <https://flink.apache.org/> ...)
> > >
> > > All input is welcomed.
> > >
> > > Best regards,
> > >  -Stefan Baxter
> > >
> > > PS. I'm a bit dyslexic and sometimes the little bastards sneak through.
> > > Just laugh and shake your head when you come across them :)
> > >
> >
>
