Hello Samza community! I'm currently building a graph-processing POC with Samza as the engine and have run into an interesting problem.
What I'm trying to do is cache a graph in KV storage (RocksDB) using a simple notation:

key: node1_id.node2_id (i.e. node ids separated by a dot)
value: empty string for now

The first operation I tried to implement is getting all adjacent nodes of a particular node. In code:

nodeIterator = store.range(
    String.join(".", nodeId, String.valueOf(Character.MIN_VALUE)),
    String.join(".", nodeId, String.valueOf(Character.MAX_VALUE)));

This worked pretty well locally on a small graph, but I got weird results when I tried to increase the graph size.

Experiment:
Number of edges: 5 billion
Number of nodes: around 90M
Kafka partitions: 20
Kafka brokers: 3, with 4TB of space
YARN workers: 2, with 256GB of RAM and 2TB of disk space each

I pumped this whole graph into Samza and it swallowed it just fine. But when I queried one of the nodes that has a lot of neighbours, it returned only 55% of them:

Number of actual neighbours: 7.7M
Number of returned neighbours: 4.4M

I re-streamed the RocksDB changelog topic and I can see all these edges stored there, but the query still returns only 4.3M nodes.

So, currently I'm trying to figure out the best way to debug this. Here are my questions:

1) Has anyone seen such behaviour before?
2) What is the best way to debug it on a remote machine? Any particular logs to look for? Any RocksDB config params that should be enabled?
3) Is it a good idea to store a graph in such a format?

Thank you,
Alex
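For anyone who wants to reproduce the key scheme and range query locally, here is a minimal, self-contained sketch. It uses a java.util.TreeMap to stand in for the store's sorted key space (an assumption for illustration only: the real store is Samza's KeyValueStore backed by RocksDB, which orders keys by their serialized bytes, so Java String ordering is only an approximation for non-ASCII ids):

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class AdjacencySketch {
    // TreeMap emulates the store's sorted key space; real lookups would go
    // through Samza's KeyValueStore.range(from, to) on RocksDB.
    static final SortedMap<String, String> store = new TreeMap<>();

    static {
        // Edges encoded as "node1_id.node2_id" -> "" (empty value), as described.
        store.put("1.2", "");
        store.put("1.3", "");
        store.put("10.4", ""); // a different node whose id shares the prefix "1"
    }

    // Same range bounds as in the post: nodeId + "." + lowest/highest char.
    // Because '.' (0x2E) sorts below digits (0x30..0x39), a key like "10.4"
    // falls outside the ["1.\u0000", "1.\uFFFF") window, so prefix collisions
    // are avoided for ASCII ids. Note: ids containing '\uFFFF' or characters
    // outside the BMP could still escape the upper bound.
    static SortedMap<String, String> neighbours(String nodeId) {
        String from = String.join(".", nodeId, String.valueOf(Character.MIN_VALUE));
        String to = String.join(".", nodeId, String.valueOf(Character.MAX_VALUE));
        return store.subMap(from, to); // half-open [from, to), like store.range
    }

    public static void main(String[] args) {
        System.out.println(neighbours("1").keySet()); // prints [1.2, 1.3]
    }
}
```

Checking the small-graph behaviour this way at least confirms the bounds logic is sound for ASCII numeric ids; it says nothing about RocksDB iterator behaviour at 5B-edge scale.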