Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-20 Thread lucas.g...@gmail.com
Right, that makes sense and I understood that. The thing I'm wondering about (and I think the answer is 'no' at this stage): when the optimizer is running and pushing predicates down, does it take into account indexing and other storage layer strategies in determining which predicates are process

Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-20 Thread Mich Talebzadeh
Here below, Gary:

filtered_df = spark.hiveContext.sql("""
    SELECT *
    FROM df
    WHERE type = 'type'
    AND action = 'action'
    AND audited_changes LIKE '---\ncompany_id:\n- %'
""")
filtered_audits.registerTempTable("filtered_df")

You are using hql to read

Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-19 Thread lucas.g...@gmail.com
OK, so when Spark is forming queries, it's ignorant of the underlying storage layer index. If there is an index on a table, Spark doesn't take that into account when doing the predicate pushdown in optimization. In that case, why does Spark push 2 of my conditions (where fieldx = 'action') to the da
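To illustrate the point above that index use is decided inside the database rather than by the client, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption, since the thread is about MySQL over JDBC), and the table name `audits`, the index name `idx_audits_action`, and the sample rows are hypothetical. Once a predicate reaches the database as part of a WHERE clause, the database's own planner chooses whether an index applies.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the human-readable plan steps the database reports for `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each EXPLAIN QUERY PLAN row is the description text.
    return [row[-1] for row in rows]


# In-memory SQLite database standing in for the remote RDBMS (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audits ("
    "id INTEGER PRIMARY KEY, type TEXT, action TEXT, audited_changes TEXT)"
)
# Hypothetical index on one of the filtered columns.
conn.execute("CREATE INDEX idx_audits_action ON audits (action)")
conn.executemany(
    "INSERT INTO audits (type, action, audited_changes) VALUES (?, ?, ?)",
    [("type", "action", "a"), ("other", "update", "b"), ("type", "delete", "c")],
)

# The same kind of equality filter a client could push down to the database.
indexed_plan = query_plan(conn, "SELECT * FROM audits WHERE action = 'action'")
unindexed_plan = query_plan(conn, "SELECT * FROM audits WHERE type = 'type'")

print("filter on indexed column:  ", indexed_plan)
print("filter on unindexed column:", unindexed_plan)
```

The client sends the same shape of query in both cases. Only the database knows that one filter can be answered through an index and the other needs a table scan.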

Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-19 Thread Mich Talebzadeh
Remember, your indexes are in the RDBMS, in this case MySQL. When you are reading from that table, you have an 'id' column, which I assume is an integer, and you are making parallel threads through the JDBC connection to that table. You can see the threads in MySQL if you query it. You can see multiple threads
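The parallel reads described above can be sketched as follows. This is a simplified approximation, written from memory, of how a partitioned JDBC read splits the range of the 'id' column into one WHERE clause per connection. It is not Spark's actual code, the exact clause text may differ between Spark versions, and the function name `partition_predicates` and the bounds 0 to 100 are hypothetical.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Approximate the per-partition WHERE clauses of a partitioned JDBC read.

    Assumes non-negative integer bounds. Returns one predicate string per
    partition, or [None] when a single query reads the whole table.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if num_partitions == 1:
        return [None]  # no WHERE clause needed for a single partition

    stride = upper_bound // num_partitions - lower_bound // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower limit, the last has no upper limit,
        # so rows outside [lower_bound, upper_bound) are still read.
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        upper = f"{column} < {current}" if i != num_partitions - 1 else None
        if upper is None:
            predicates.append(lower)
        elif lower is None:
            predicates.append(f"{upper} OR {column} IS NULL")
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates


# Each predicate corresponds to one query, i.e. one thread visible in MySQL.
for predicate in partition_predicates("id", 0, 100, 4):
    print(f"SELECT * FROM schema.table WHERE {predicate}")
```

Each generated range filter is an ordinary predicate on 'id', so whether it is served by an index is again up to MySQL.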

Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-19 Thread lucas.g...@gmail.com
If the underlying table(s) have indexes on them, does Spark use those indexes to optimize the query? I.e., if a table in my JDBC data source (MySQL in this case) had several indexes and my query was filtering on one of the fields with an index, would Spark know to push that predicate to the da

Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-19 Thread Mich Talebzadeh
Sorry, what do you mean by "my JDBC table has an index on it"? Where are you reading the data from the table? I assume you are referring to the "id" column on the table that you are reading through the JDBC connection. Then you are creating a temp table called "df". That temp table is created in temporary work

Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-19 Thread lucas.g...@gmail.com
I.e., if my JDBC table has an index on it, will the optimizer consider that when pushing predicates down? I noticed in a query like this:

df = spark.hiveContext.read.jdbc(
    url=jdbc_url,
    table="schema.table",
    column="id",
    lowerBound=lower_bound_id,
    upperBound=upper_bound_id,
    numPartitio