Hi,

I don't think enforcing a join strategy by default is the best option. Gelly can't know what the user did before calling its functions or what the data looks like: it may be one of the 1% of graphs where the relation is the other way around, the vertex data set may be very large, or the data sets may already be sorted or partitioned. An alternative would be to overload the Gelly functions that use joins and let users decide whether to pass a hint.
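
To make the overloading idea a bit more concrete, here is a rough, self-contained sketch of the pattern. Everything in it is a simplified stand-in and not the actual Flink/Gelly API: the JoinHint enum only mirrors a few values of Flink's JoinOperatorBase.JoinHint, and JoinResult and joinEdgesWithSources are hypothetical names. The point is only that the no-hint overload delegates to the hinted one with OPTIMIZER_CHOOSES, so existing callers keep the current behaviour.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins only; none of these types are the real Flink/Gelly classes.
class JoinHintOverloadSketch {

    // Mirrors a few values of Flink's JoinOperatorBase.JoinHint (stand-in enum).
    enum JoinHint { OPTIMIZER_CHOOSES, BROADCAST_HASH_FIRST, BROADCAST_HASH_SECOND }

    // Hypothetical result holder: the joined pairs plus the hint that was requested.
    static final class JoinResult {
        final JoinHint hint;
        final List<String> pairs;

        JoinResult(JoinHint hint, List<String> pairs) {
            this.hint = hint;
            this.pairs = pairs;
        }
    }

    // Overload without a hint: keeps today's behaviour by letting the system decide.
    static JoinResult joinEdgesWithSources(List<long[]> edges, Map<Long, String> vertices) {
        return joinEdgesWithSources(edges, vertices, JoinHint.OPTIMIZER_CHOOSES);
    }

    // Overload with a hint: the caller, who knows the data, picks the strategy.
    // In this sketch the hint is only recorded; a real engine would use it to
    // choose the physical join strategy.
    static JoinResult joinEdgesWithSources(List<long[]> edges, Map<Long, String> vertices,
                                           JoinHint hint) {
        List<String> pairs = new ArrayList<>();
        for (long[] edge : edges) {
            // edge[0] is the source id, edge[1] the target id
            String sourceValue = vertices.get(edge[0]);
            if (sourceValue != null) {
                pairs.add(edge[0] + "->" + edge[1] + ":" + sourceValue);
            }
        }
        return new JoinResult(hint, pairs);
    }

    public static void main(String[] args) {
        List<long[]> edges = new ArrayList<>();
        edges.add(new long[] {1L, 2L});
        Map<Long, String> vertices = new HashMap<>();
        vertices.put(1L, "a");

        JoinResult byDefault = joinEdgesWithSources(edges, vertices);
        JoinResult hinted = joinEdgesWithSources(edges, vertices, JoinHint.BROADCAST_HASH_SECOND);
        System.out.println(byDefault.hint + " / " + hinted.hint);
    }
}
```

In Gelly itself, the hinted overload would presumably just forward the hint to DataSet.join(other, hint), as in Andra's snippet quoted below.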

As an example, I am currently benchmarking graphs with up to 700M vertices and 3B edges on a YARN cluster, and at one point in the job I need to join vertices and edges. I tried the broadcast-hash-second hint (broadcasting the vertices), and the job ran significantly slower than when I let the system decide.

Best,
Martin

On 22.08.2015 09:51, Andra Lungu wrote:
Hey everyone,

When coding for my thesis, I observed that half of the current Gelly
functions (the ones that use join operators) fail in a cluster environment
with the following exception:

java.lang.IllegalArgumentException: Too few memory segments provided. Hash Join needs at least 33 memory segments.

This is because, in 99% of the cases, the vertex data set is significantly
smaller than the edge data set. What I did to get rid of the error was the
following:

// needs: import org.apache.flink.api.common.operators.base.JoinOperatorBase;
DataSet<Tuple2<Edge<K, EV>, Vertex<K, VV>>> edgesWithSources = edges
       .join(this.vertices, JoinOperatorBase.JoinHint.BROADCAST_HASH_SECOND)
       .where(0).equalTo(0);

In short, I added join hints. I believe this should also be in Gelly, in
case someone bumps into the same problem somewhere in the future.

What do you think?

