Hi,

Apologies if I’ve asked this question before, but I didn’t see it in the list, and I’m certain that my last surviving brain cell has gone on strike over my attempt to reduce my caffeine intake…
Posting this to both user and dev because I think the question/topic spans both camps. Since I’m a relative newbie on Spark, I may be missing something, so apologies up front.

With respect to Spark SQL: pre-2.0.x, were there only hash joins? Post-2.0.x, you have broadcast hash, shuffled hash, and sort-merge joins, right? For the sake of simplicity, let’s forget about cross-product joins.

Has anyone looked at how we could use inverted tables to improve query performance? The issue is that when you have a data sewer (lake), what happens when your use-case query is orthogonal to how your data is stored? You end up with full table scans. By using secondary indexes we can reduce this, albeit at the cost of increasing your storage footprint by the size of the index.

Are there any JIRAs open that discuss this? I’m thinking of two cases:

- Indexes to assist with predicate pushdown (using the index when a field in a WHERE clause is indexed) rather than performing a full table scan.
- Indexes to assist with the actual join when the join column is an indexed column.

In the first case, the inverted table produces a sort-ordered set of row keys that you would then use in the join process, the same as if you had produced the subset based on the filter. (I’ve tacked a rough sketch of what I mean onto the end of this mail.)

To put this in perspective, here’s a dummy use case. CCCis (CCC) is the middleman in the insurance industry. They have a piece of software that sits in the repair shop (e.g. Joe’s Auto Body) and works with multiple insurance carriers. The primary key in their data is going to be Insurance Company | Claim ID, which makes it very easy to find a specific claim for further processing.

Now let’s say I want to do some analysis to determine the average cost of repairing a front-end collision on a Volvo S80. Or break down the number and types of accidents by car manufacturer, model, and color, and then see if there is any correlation between car color and the number and type of accidents. As you can see, all of these queries are orthogonal to my storage, so I need to create secondary indexes to help sift through the data efficiently. Does this make sense?

Please note: I did some work for CCC back in the late ’90s. Any resemblance to their big data efforts is purely coincidental, and you can replace CCC with Allstate, Progressive, State Farm, or some other auto insurance company…

Thx,

-Mike
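P.S. To make the inverted-table idea concrete, here’s a rough sketch of what I have in mind, written against the plain DataFrame API. Everything in it is hypothetical: the paths, the column names (make, model, insurance_co, claim_id) and the choice of Parquet are stand-ins, not anything CCC-specific and not something Spark gives you out of the box. First, building the “index” as an ordinary table keyed by the attributes you want to filter on, carrying only the primary key of the base table:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("inverted-index-sketch").getOrCreate()
import spark.implicits._

// Base table, keyed by (insurance_co, claim_id): great for pulling up one claim,
// useless for "every front-end collision on a Volvo S80". Path is hypothetical.
val claims = spark.read.parquet("/data/claims")

// Inverted table: just the indexed columns plus the primary key, written out
// partitioned by the indexed columns so a lookup only touches a small slice.
// This extra copy is the storage-footprint cost I mentioned above.
val byMakeModel = claims
  .select($"make", $"model", $"insurance_co", $"claim_id")
  .repartition($"make", $"model")
  .sortWithinPartitions($"make", $"model")

byMakeModel.write
  .partitionBy("make", "model")
  .parquet("/data/claims_idx_make_model")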
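And here’s the query side, continuing from the snippet above: hit the small, well-organized index first, then semi-join the resulting keys back to the base table so the scan only touches matching claims instead of the whole lake. That sorted key set is also what I’d want to hand to a join on an indexed column. Again, damage_area and repair_cost are made-up columns for illustration.

import org.apache.spark.sql.functions.avg

// Average cost of repairing a front-end collision on a Volvo S80.
// 1. Hit the index: partition pruning means we only read the (Volvo, S80) slice.
val keys = spark.read.parquet("/data/claims_idx_make_model")
  .where($"make" === "Volvo" && $"model" === "S80")
  .select($"insurance_co", $"claim_id")

// 2. Semi-join the keys back to the base table ('claims' from the previous snippet)
//    rather than scanning everything; a small enough key set gets broadcast.
val volvoS80 = claims.join(keys, Seq("insurance_co", "claim_id"), "left_semi")

volvoS80
  .where($"damage_area" === "front")
  .agg(avg($"repair_cost").as("avg_repair_cost"))
  .show()

Obviously that’s a hand-rolled workaround. What I’m really asking is whether the optimizer could be taught to do this kind of index lookup itself when a WHERE-clause field or a join column is indexed, instead of leaving it to the application.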