Re: Using Encoding to reduce GraphX's static graph memory consumption

Adnan Haider Tue, 23 Feb 2016 13:14:02 -0800

Hi
I have created a jira for this issue here.
<https://issues.apache.org/jira/browse/SPARK-13460?jql=project%20%3D%20SPARK%20AND%20created%3E%3D-1w%20ORDER%20BY%20created%20DESC>
As
for the pull request, my implementation is based on removing localSrcIds
and storing an array of offsets into localDstIds. I am running into issues
with this method when testing operations which create partitions from
existing edge partitions. The other method for implementing this would be
to use a hashmap from local id to offset/length pairs. This seems to work
fine but there is more storage overhead associated with this since each
source vertex requires space for three integers.


Thanks, Adnan Haider
B.S Candidate, Computer Science
Illinois Institute of Technology
www.adnanhaider.com

On Mon, Feb 22, 2016 at 8:16 AM, Adnan Haider <[email protected]> wrote:

> Yes, sounds good. I can submit the pull request.
> On 22 Feb 2016 00:35, "Reynold Xin" <[email protected]> wrote:
>
>> + Joey
>>
>> We think this is worth doing. Are you interested in submitting a pull
>> request?
>>
>>
>> On Sat, Feb 20, 2016 at 8:05 PM ahaider3 <[email protected]> wrote:
>>
>>> Hi,
>>> I have been looking through the GraphX source code, dissecting the reason
>>> for its high memory consumption compared to the on-disk size of the
>>> graph. I
>>> have found that there may be room to reduce the memory footprint of the
>>> graph structures. I think the biggest savings can come from the
>>> localSrcIds
>>> and localDstIds in EdgePartitions.
>>>
>>> In particular, instead of storing both a source and destination local ID
>>> for
>>> each edge, we could store only the destination id. For example after
>>> sorting
>>> edges by global source id, we can map each of the source vertices first
>>> to
>>> local values followed by unmapped global destination ids. This would make
>>> localSrcIds sorted starting from 0 to n, where n is the number of
>>> distinct
>>> global source ids. Then instead of actually storing the local source id
>>> for
>>> each edge, we can store an array of size n, with each element storing an
>>> index into localDstIds.  From my understanding, this would also eliminate
>>> the need for storing an index for indexed scanning, since each element in
>>> localSrcIds would be the start of a cluster. From some extensive testing,
>>> this along with some delta encoding strategies on localDstIds and the
>>> mapping structures can reduce memory consumption of the graph by nearly
>>> half.
>>>
>>> However, I am not entirely sure if there is any reason for storing both
>>> localSrcIds and localDstIds for each edge in terms of integration of
>>> future
>>> functionalities, such as graph mutations. I noticed there was another
>>> post
>>> similar to this one as well, but it had not replies.
>>>
>>> The idea is quite similar to  Netflix graph library
>>> <https://github.com/Netflix/netflix-graph>   and would be happy to open
>>> a
>>> jira on this issue with partial improvements. But, I may not be
>>> completely
>>> correct with my thinking!
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/Using-Encoding-to-reduce-GraphX-s-static-graph-memory-consumption-tp16373.html
>>> Sent from the Apache Spark Developers List mailing list archive at
>>> Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>

Re: Using Encoding to reduce GraphX's static graph memory consumption

Reply via email to