Dear community, I have a table with inlinks to URLs. For example, many URLs point to http://google.com, while far fewer URLs point to http://somesmallweb.page.
The table has both very wide and very skinny rows; the row widths follow a power-law distribution. I do not know a priori how many columns a row will have, and I can't identify a schema that would give a good partitioning. Currently, I am thinking about introducing splits: the primary key would be (URL, splitnumber), where the number of splits is initially 1 and a hash of the URL modulo the number of splits determines the splitnumber on insert. I would need a separate table to maintain the number of splits per URL, and a spark-cassandra-connector job that counts the columns and increases/doubles the number of splits on demand. This means that I would then have to move some data, e.g. (URL1,0) -> (URL1,1), when the number of splits becomes 2. Would you do the same? Is there a better way? Thanks! Adam
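
For illustration, here is a minimal sketch of the split scheme described above, in plain Python with no Cassandra or Spark calls. It rests on assumptions that are not in the post: the hash is taken over the inlinking URL (the column), since hashing the target URL itself would send all of its columns to the same split; MD5 stands in for whatever stable hash would really be used; and the names `stable_hash`, `partition_key`, and `rows_to_move` are hypothetical helpers.

```python
import hashlib


def stable_hash(value: str) -> int:
    """Deterministic hash (assumption: MD5, truncated to 64 bits).

    Python's built-in hash() is salted per process, so it is unsuitable here.
    """
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def partition_key(target_url: str, inlink_url: str, split_count: int) -> tuple:
    """Return the (URL, splitnumber) key for one inlink of target_url."""
    return (target_url, stable_hash(inlink_url) % split_count)


def rows_to_move(target_url: str, inlink_urls, old_count: int, new_count: int) -> list:
    """List the inlinks whose partition key changes when the split count grows."""
    moves = []
    for inlink in inlink_urls:
        old_key = partition_key(target_url, inlink, old_count)
        new_key = partition_key(target_url, inlink, new_count)
        if old_key != new_key:
            moves.append((inlink, old_key, new_key))
    return moves


# Hypothetical inlinks, for illustration only.
inlinks = [f"http://site{i}.example/page" for i in range(1000)]
moves = rows_to_move("http://google.com", inlinks, old_count=1, new_count=2)
```

With doubling, a column in split s either stays in s or moves to s + old_count, because h mod 2n is either h mod n or h mod n + n. Going from 1 to 2 splits, every move is therefore (URL, 0) -> (URL, 1), matching the example in the post, and roughly half of the columns would be expected to move.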