Hi, it's me again. After a day's work I've coded a Giraph solution for my problem at hand. I ran it on a medium-sized dataset and it's notably faster than the other approaches.
However, the goal is to process larger inputs. For example, I have a larger dataset whose result graph is about 400GB when stored as a text file in edge format. I think all the edges that the algorithm creates reside in the cluster's memory. Does that mean that for this big dataset I need a cluster with ~400GB of main memory? Is there any way to output "on the go", so that I don't need to construct the whole graph: an edge would be written to HDFS immediately instead of being created in main memory and then output?

Thanks!

--
JU Han
Software Engineer Intern @ KXEN Inc.
UTC - Université de Technologie de Compiègne
GI06 - Fouille de Données et Décisionnel
+33 0619608888