Hello Samza community!

I'm currently building a graph-processing POC with Samza as the engine and
have run into an interesting problem.

What I'm trying to do is cache a graph in a KV store (RocksDB) using a
simple notation:
key: node1_id.node2_id (i.e. the two node ids separated by a dot)
value: empty string for now.
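To make the notation concrete, here is a tiny self-contained illustration
(edgeKey is just a local helper I use for this email, not a Samza API, and
the long ids are only for the example):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the key scheme: one store entry per directed edge,
// key = sourceId + "." + targetId, value = "" for now.
public class EdgeKeyDemo {
    public static String edgeKey(long src, long dst) {
        return src + "." + dst;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        keys.add(edgeKey(42, 17)); // edge 42 -> 17
        keys.add(edgeKey(42, 99)); // edge 42 -> 99
        System.out.println(keys);  // prints [42.17, 42.99]
    }
}
```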

The first operation I tried to implement is getting all adjacent nodes
of a particular node.
In code it is:

KeyValueIterator<String, String> nodeIterator = store.range(
        String.join(".", nodeId, String.valueOf(Character.MIN_VALUE)),
        String.join(".", nodeId, String.valueOf(Character.MAX_VALUE)));

And it worked pretty well locally on a small graph, but I got weird results
when I tried to increase the graph size.
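For reference, here is a self-contained sketch of the scan I'm doing, with
a sorted TreeMap standing in for the RocksDB-backed store (an assumption
for the sake of a runnable example: both give lexicographically ordered
range scans, and subMap below mirrors range(from, to) with an inclusive
lower and exclusive upper bound):

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch: count the adjacent nodes of nodeId via a lexicographic range scan.
// A TreeMap stands in for the Samza KeyValueStore here.
public class AdjacencyScan {

    public static int countNeighbours(NavigableMap<String, String> store, String nodeId) {
        // Same bounds as in the store.range() call above:
        String from = nodeId + "." + Character.MIN_VALUE; // inclusive lower bound
        String to = nodeId + "." + Character.MAX_VALUE;   // upper bound
        return store.subMap(from, true, to, false).size();
    }

    public static void main(String[] args) {
        NavigableMap<String, String> store = new TreeMap<>();
        store.put("a.b", "");  // edge a -> b
        store.put("a.c", "");  // edge a -> c
        store.put("ab.x", ""); // must NOT match node "a"
        store.put("b.a", "");  // edge b -> a
        System.out.println(countNeighbours(store, "a")); // prints 2
    }
}
```

Against the real store I drain the KeyValueIterator the same way (and close
it afterwards, since it holds RocksDB resources).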

Experiment:
Number of edges: 5 billion
Number of nodes: around 90M
Kafka partitions: 20
Kafka brokers: 3, with 4 TB of space.
Yarn workers: 2, with 256 GB of RAM and 2 TB of disk space each.

So, I pumped the whole graph into Samza and it swallowed it just fine. But
when I queried one of the nodes that has a lot of neighbours, only about
55% of them were returned.
Number of actual neighbours: 7.7M
Number of returned neighbours: 4.4M

I re-streamed the RocksDB changelog topic and I can see all these edges
stored there, but the query still returns only 4.3M nodes.

So, currently I'm trying to figure out the best way to debug this. Here are
some questions:
1) Has anyone seen such behaviour before?
2) What is the best way to debug this on a remote machine? Are there any
particular logs to look for, or RocksDB config params that should be enabled?
3) Is it a good idea to store a graph in such a format?

Thank you,
Alex
