Right, that makes sense and I understood that.
The thing I'm wondering about (and I think the answer is 'no' at this stage): when the optimizer is running and pushing predicates down, does it take into account indexing and other storage-layer strategies in determining which predicates are processed where?
here it is below, Gary
filtered_df = spark.hiveContext.sql("""
    SELECT
        *
    FROM
        df
    WHERE
        type = 'type'
        AND action = 'action'
        AND audited_changes LIKE '---\ncompany_id:\n- %'
""")
filtered_df.registerTempTable("filtered_df")
you are using HQL to read
Ok, so when Spark is forming queries it's ignorant of the underlying storage layer's indexes.
If there is an index on a table, Spark doesn't take that into account when doing the predicate pushdown in optimization. In that case, why does Spark push two of my conditions (where fieldx = 'action') to the database?
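To illustrate the point being discussed: which predicates get pushed depends on whether the data source can translate a given filter into SQL, not on any index. Below is a toy sketch of that translation step; the function name and tuple encoding are made up for illustration (loosely inspired by the filter-compilation idea in Spark's JDBC source, not its actual code):

```python
# Toy sketch: a JDBC-style source renders each filter it understands as a
# SQL fragment; anything it cannot translate stays behind to be evaluated
# by the engine itself. Indexes play no role in this decision.

def compile_filter(f):
    """Return a SQL WHERE fragment for a supported filter, else None."""
    kind = f[0]
    if kind == "EqualTo":            # e.g. action = 'action'
        _, col, val = f
        return f"{col} = '{val}'"
    if kind == "StringStartsWith":   # e.g. a LIKE with only a trailing %
        _, col, prefix = f
        return f"{col} LIKE '{prefix}%'"
    return None                      # not pushable: evaluated in the engine

filters = [
    ("EqualTo", "type", "type"),
    ("EqualTo", "action", "action"),
    ("StringStartsWith", "audited_changes", "---"),
]
pushed = []
for f in filters:
    sql = compile_filter(f)
    if sql is not None:
        pushed.append(sql)

where = " AND ".join(pushed)
print(where)
```

So whether two or three of the conditions end up in the generated query is a question of filter translatability in that Spark version, not of what the database has indexed.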
remember, your indexes are in the RDBMS, in this case MySQL. When you are reading from that table you have an 'id' column, which I assume is an integer, and you are making parallel threads through the JDBC connection to that table. You can see the threads in MySQL if you query it; you can see multiple threads.
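This is the key point of the exchange: the index decision happens inside the database, not in Spark. A standalone sketch using the stdlib sqlite3 module as a stand-in for MySQL (table and index names are invented): whatever predicate arrives via the pushed-down query, it is the database's own planner that chooses the index.

```python
import sqlite3

# In-memory database standing in for the MySQL side of the JDBC connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audits (id INTEGER PRIMARY KEY, action TEXT)")
conn.execute("CREATE INDEX idx_audits_action ON audits (action)")

# The pushed-down predicate reaches the database as plain SQL; asking the
# database for its plan shows *it* decided to use the index, with no input
# from the client that sent the query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM audits WHERE action = 'action'"
).fetchall()
detail = " ".join(row[-1] for row in plan)
print(detail)  # the plan mentions idx_audits_action
```

The same applies to the parallel JDBC reads: each partition's query is planned independently by MySQL, which is also why the threads show up in its process list.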
If the underlying table(s) have indexes on them, does Spark use those indexes to optimize the query?
I.e., if a table in my JDBC data source (MySQL in this case) had several indexes and my query was filtering on one of the fields with an index, would Spark know to push that predicate to the database?
sorry, what do you mean your JDBC table has an index on it? Where are you reading the data from the table?
I assume you are referring to the "id" column on the table that you are reading through the JDBC connection.
Then you are creating a temp table called "df". That temp table is created in temporary workspace.
I.e.: if my JDBC table has an index on it, will the optimizer consider that when pushing predicates down?
I noticed in a query like this:
df = spark.hiveContext.read.jdbc(
    url=jdbc_url,
    table="schema.table",
    column="id",
    lowerBound=lower_bound_id,
    upperBound=upper_bound_id,
    numPartitions=num_partitions,
)
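For context on what that call does with the column/bounds arguments: Spark splits the id range into one WHERE clause per partition, and each clause becomes its own query and connection, which is the source of the multiple threads visible in MySQL. A rough sketch (not Spark's actual code, and ignoring its edge-case handling) of that partitioning logic:

```python
# Rough sketch of how a partitioned JDBC read turns
# column/lowerBound/upperBound/numPartitions into per-partition predicates.

def partition_predicates(column, lower, upper, num_partitions):
    stride = (upper - lower) // num_partitions
    preds = []
    for i in range(num_partitions):
        lo = lower + i * stride
        if i == 0:
            # First partition also sweeps up NULLs and anything below lower.
            preds.append(f"{column} < {lo + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is open-ended so nothing above upper is lost.
            preds.append(f"{column} >= {lo}")
        else:
            preds.append(f"{column} >= {lo} AND {column} < {lo + stride}")
    return preds

for p in partition_predicates("id", 1, 1000, 4):
    print(p)
```

Note the bounds only shape the partitioning; they do not filter rows, and none of this consults the database's indexes.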