I'm not a software engineer by training and I hope that there's an existing
best practice for the problem I'm trying to solve. I'm using Spark 2.4.5,
Hadoop 2.7, Hive 1.2.

I have a large table (terabytes) from an external source (which is beyond
my control) where the data is stored in a key-value format with an array of
values:

| id | val
+----+------------
| k1 | <v1,v2,v3>
| k2 | <v2>
| k3 | <v3,v5>

I want to invert the map so that I have a collection of keys for each value
(let's assume I don't care about uniqueness):

| id | val
+----+---------------
| v1 | <k1>
| v2 | <k1, k2>
| v3 | <k1, k3>
| v5 | <k3>
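
In plain Python terms (ignoring scale for a moment), the transformation I'm
after is just an "explode into (value, key) pairs, then group by value":

```python
from collections import defaultdict

# key -> list of values, as in the input table above
table = {"k1": ["v1", "v2", "v3"], "k2": ["v2"], "k3": ["v3", "v5"]}

# "explode" each row into (value, key) pairs, then group by value
inverted = defaultdict(list)
for key, values in table.items():
    for v in values:
        inverted[v].append(key)

# inverted == {"v1": ["k1"], "v2": ["k1", "k2"],
#              "v3": ["k1", "k3"], "v5": ["k3"]}
```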

It seems like a lot of shuffling is unavoidable here, but I'm not sure what
the best approach is. I've written solutions using the DataFrame API (with
explode(), groupBy() and collect_set()) and with RDDs, but both are always
very expensive.

Is there a best-practice technique for this kind of operation? My leading
thought so far is to restage the data in a partitioned, bucketed flat table
as an intermediate step, but that too is costly in terms of disk space and
transform time.

Thanks,
Patrick
