Hi,

Apologies if I’ve asked this question before, but I didn’t see it in the list, and I’m certain that my last surviving brain cell has gone on strike over my attempt to reduce my caffeine intake…
Posting this to both user and dev because I think the question/topic spans both camps. Since I’m a relative newbie on Spark, I may be missing something, so apologies up front.

With respect to Spark SQL: pre-2.0.x, were there only hash joins? Post-2.0.x, you have broadcast hash, shuffled hash, and sort-merge joins, right? For the sake of simplicity, let’s forget about cross-product joins.

Has anyone looked at how we could use inverted tables to improve query performance? The issue is that when you have a data sewer (lake), what happens when your use-case query is orthogonal to how your data is stored? You end up with full table scans. By using secondary indexes we can reduce this, albeit at the cost of increasing your storage footprint by the size of the index.

Are there any JIRAs open that discuss this? I’m thinking of two cases:

- Indexes to assist with predicate pushdown (using the index when a field in a WHERE clause is indexed) rather than performing a full table scan.
- Indexes to assist with the actual join when the join column is an indexed column.

In the first case, the inverted table produces a sort-ordered set of row keys that you would then use in the join process, the same as if you had produced the subset based on the filter. (I’ve tacked a rough sketch of what I mean onto the end of this mail.)

To put this in perspective, here’s a dummy use case. CCCis (CCC) is the middleman in the insurance industry. They have a piece of software that sits in the repair shop (e.g. Joe’s Auto Body) and works with multiple insurance carriers. The primary key in their data is going to be Insurance Company | Claim ID, which makes it very easy to find a specific claim for further processing.

Now let’s say I want to do some analysis to determine the average cost of repairing a front-end collision on a Volvo S80. Or break down the number and types of accidents by car manufacturer, model, and color, and then see if there is any correlation between car color and the number and type of accidents. As you can see, all of these queries are orthogonal to my storage, so I need to create secondary indexes to help sift through the data efficiently. Does this make sense?

Please note: I did some work for CCC back in the late ’90s. Any resemblance to their big data efforts is purely coincidental, and you can replace CCC with Allstate, Progressive, State Farm, or some other auto insurance company…

Thx,

-Mike
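P.S. To make the inverted-table idea concrete, here’s a rough sketch of what I have in mind, written against the plain DataFrame API. Everything in it is hypothetical: the paths, the column names (make, model, insurance_co, claim_id) and the choice of Parquet are stand-ins, not anything CCC-specific and not something Spark gives you out of the box. First, building the “index” as an ordinary table keyed by the attributes you want to filter on, carrying only the primary key of the base table:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("inverted-index-sketch").getOrCreate()
import spark.implicits._

// Base table, keyed by (insurance_co, claim_id): great for pulling up one claim,
// useless for "every front-end collision on a Volvo S80". Path is hypothetical.
val claims = spark.read.parquet("/data/claims")

// Inverted table: just the indexed columns plus the primary key, written out
// partitioned by the indexed columns so a lookup only touches a small slice.
// This extra copy is the storage-footprint cost I mentioned above.
val byMakeModel = claims
  .select($"make", $"model", $"insurance_co", $"claim_id")
  .repartition($"make", $"model")
  .sortWithinPartitions($"make", $"model")

byMakeModel.write
  .partitionBy("make", "model")
  .parquet("/data/claims_idx_make_model")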
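And here’s the query side, continuing from the snippet above: hit the small, well-organized index first, then semi-join the resulting keys back to the base table so the scan only touches matching claims instead of the whole lake. That sorted key set is also what I’d want to hand to a join on an indexed column. Again, damage_area and repair_cost are made-up columns for illustration.

import org.apache.spark.sql.functions.avg

// Average cost of repairing a front-end collision on a Volvo S80.
// 1. Hit the index: partition pruning means we only read the (Volvo, S80) slice.
val keys = spark.read.parquet("/data/claims_idx_make_model")
  .where($"make" === "Volvo" && $"model" === "S80")
  .select($"insurance_co", $"claim_id")

// 2. Semi-join the keys back to the base table ('claims' from the previous snippet)
//    rather than scanning everything; a small enough key set gets broadcast.
val volvoS80 = claims.join(keys, Seq("insurance_co", "claim_id"), "left_semi")

volvoS80
  .where($"damage_area" === "front")
  .agg(avg($"repair_cost").as("avg_repair_cost"))
  .show()

Obviously that’s a hand-rolled workaround. What I’m really asking is whether the optimizer could be taught to do this kind of index lookup itself when a WHERE-clause field or a join column is indexed, instead of leaving it to the application.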