1. Drill connects to JDBC sources using the JDBC Storage plugin and you should probably try that.That said, the JDBC storage plugin, AFAIK, does not do push down of filters to the database. But you could try this to confirm. 2. What you need to look at first is the query profile for these queries that you are running. The operators overview section will tell you where is most of the time being spent. 3. The phsyical plan is basically a text that describes what Drill thinks is the nature of teh underlying data and will create an execution plan based on that. For things, such as network cost, we don't currently have any information which is why those are zero.
Coming back to the details you provided, here is what you should know.. - The pyDrill python driver is a community-provided driver which makes use of REST APIs to do the actual query. REST APIs are inefficient for getching millions of rows because it involces using HTTP requests and numerous layers between your application and Drill. - You could try connecting to Drill with it's JDBC driver by using the JayDeBeApi package in Python ( https://pypi.python.org/pypi/JayDeBeApi/#usage). Disclaimer: I've never tested Drill with JayDeBeApi , but it looked like something more performant than pyDrill. If you can get this to work, we'd love it if you can share your experience (and the steps) with us. ~ Kunal On Fri, Mar 9, 2018 at 11:10 AM, Karan Hegde <[email protected]> wrote: > Hi, > > First of, Drill is an amazing tool and makes my work so much easier when I > have different data types and sources to deal with, its a real life-saver. > So thank you all for creating it! > > I've been testing Drill locally with an MS SQL Server, using the PyDrill > Python driver. My larger queries are taking a long time to run. For e.g. a > query extracting ~1 million rows and 8 columns takes 30-40 minutes to > execute. > > I have a few questions I'm hoping to get some clarity on: > > 1. Is this performance expected or unusual and if so, is it because > Drill is not optimized for RDBMS querying? > > 2. Is there any way for me to speed up queries (apart from running it on > a cluster in distributed mode), for e.g. by specifying the schema in the > query so that Drill doesn't need to spend time in schema discovery? > > 3. When I look the Physical Plan on the Drill interface for the query, I > don't see any values in *cumulative cost *in any of my fragment profiles > *:* > > cumulative cost = {100.0 rows, 100.0 cpu, 0.0 io, 0.0 network, 0.0 > memory} > > Is there a particular reason for this? > > If you've got any other tips/tricks for working with relational databases > in Drill, please do let me know. > > Thank You!! > Karan Hegde >
