Dear community,

I have a table with inlinks to URLs, i.e. many URLs point to
http://google.com, fewer URLs point to http://somesmallweb.page.

The table has both very wide and very skinny rows; the distribution follows
a power law. I do not know a priori how many columns a row will have. Also,
I can't identify a schema that would give a good partitioning.

Currently, I am thinking about introducing splits: the pk would be (URL,
splitnumber), where the number of splits is initially 1 and hash(URL) mod
(number of splits) determines the splitnumber on insert. I would need a
separate table to maintain the number of splits, and a
spark-cassandra-connector job that counts the columns and increases/doubles
the number of splits on demand. This means I would then have to move data,
e.g. (URL1,0) -> (URL1,1), when the number of splits becomes 2.
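
To make the idea concrete, here is a minimal Python sketch of the split
assignment. It is only an illustration, not tested against Cassandra. It
assumes the hash is taken over the inlinking (source) URL so that a wide row
is actually spread over several partitions, and it uses MD5 only as a stable
hash. The names split_index, partition_key and rows_to_move are hypothetical.

```python
import hashlib


def split_index(source_url: str, num_splits: int) -> int:
    """Map an inlinking URL to a split in the range [0, num_splits)."""
    # Assumption: a stable hash is needed, so Python's built-in hash() is
    # avoided because it is randomized per process for strings.
    digest = hashlib.md5(source_url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_splits


def partition_key(target_url: str, source_url: str, num_splits: int) -> tuple:
    """Build the (URL, splitnumber) partition key for one inlink."""
    return (target_url, split_index(source_url, num_splits))


def rows_to_move(source_urls, old_splits: int) -> list:
    """Return the inlinks whose split changes when the split count doubles."""
    new_splits = old_splits * 2
    return [
        u for u in source_urls
        if split_index(u, old_splits) != split_index(u, new_splits)
    ]
```

With doubling, an inlink in split i either stays in split i or moves to
split i + old_splits, so only part of the data has to be rewritten on each
increase.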

Would you do the same? Is there a better way?

Thanks!
Adam
