Re: [Dbpedia-gsoc] GSoC '15 - Interest in 5.14 (Scalable querying of the live DBpedia data stream)

Pablo Estrada Sun, 22 Mar 2015 01:46:56 -0700

Hi all,
I've done more research, and thought about some possible implementations of
the project. Here are some ideas:

1. I think we can implement a class called "HdtLiveDatasource", which would
share many characteristics with "HdtDatasource", but would have some extra
fields in the "settings" field of the datasource configuration. In
particular, it would have a field such as "update_source" pointing to the
changesets generated by the extraction framework. The "HdtLiveDatasource"
would keep a background thread that retrieves the changesets and applies the
new data.

2. Currently, there is not much locking/threading logic written into the
HdtDatasource class, because HDT is read-only, right? So if we were to
modify the HDT file 'live', we'd need to add locking and all the logic
needed for concurrency. Modifying the file in place would require holding an
exclusive lock for a potentially long time, so this option could suffer from
heavy contention.

3. Instead of rewriting the same file, we could write a new file with the
new data added. That way we would only need a short exclusive lock to
switch from the old file to the new one. This option would have little
contention, but larger storage requirements. I am assuming that the network
is the major bottleneck for the server, so I hope that having one thread
write a new file to disk while other threads read from another file will
not turn the hard drive into a bottleneck.
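
The swap in option 3 can be sketched roughly as follows. This is only an illustration under stated assumptions: plain frozen arrays stand in for HDT files, and the names LiveDatasourceSketch, buildSnapshot, initialTriples and the changeset shape ({ added, removed }) are all hypothetical. Only "update_source" comes from the ideas above, and its value here is a placeholder.

```javascript
// Sketch only. Frozen arrays stand in for HDT files; every name below
// (LiveDatasourceSketch, buildSnapshot, initialTriples) is hypothetical.

const tripleKey = t => `${t.subject} ${t.predicate} ${t.object}`;

// Build the next snapshot off to the side; the old one is never modified.
function buildSnapshot(oldTriples, changeset) {
  const removed = new Set(changeset.removed.map(tripleKey));
  const kept = oldTriples.filter(t => !removed.has(tripleKey(t)));
  return Object.freeze(kept.concat(changeset.added));
}

class LiveDatasourceSketch {
  constructor(settings) {
    // "update_source" is the extra settings field proposed in idea 1.
    this.updateSource = settings.update_source;
    this.current = Object.freeze(settings.initialTriples.slice());
    this.generation = 0;
  }

  // Readers capture the current snapshot once, so a later swap cannot
  // change the data under a query that is already running.
  select(pattern) {
    const snapshot = this.current;
    return snapshot.filter(t =>
      Object.keys(pattern).every(k => t[k] === pattern[k]));
  }

  // The slow part is building the new snapshot. The switch itself is a
  // single reference assignment, which plays the role of the short lock.
  applyChangeset(changeset) {
    const next = buildSnapshot(this.current, changeset);
    this.current = next;
    this.generation += 1;
  }
}

// Made-up example data, not real DBpedia triples.
const ds = new LiveDatasourceSketch({
  update_source: 'changesets/', // placeholder value
  initialTriples: [
    { subject: 'ex:s1', predicate: 'ex:label', object: '"old"' },
    { subject: 'ex:s2', predicate: 'ex:label', object: '"kept"' },
  ],
});
const before = ds.current; // a reader that started before the update
ds.applyChangeset({
  removed: [{ subject: 'ex:s1', predicate: 'ex:label', object: '"old"' }],
  added: [{ subject: 'ex:s1', predicate: 'ex:label', object: '"new"' }],
});
console.log(ds.select({ subject: 'ex:s1' }));
```

Assuming the server runs on a single-threaded Node.js event loop, the assignment to `current` needs no explicit lock. With real HDT files, the expensive step would be generating the new file, which would probably have to run outside the request path (for example in a separate process), and the old file could only be deleted once no reader still holds it.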

What do you think about these options, Ruben?

Pablo

On Sat, Mar 21, 2015 at 12:06 AM Pablo Estrada <polecito...@gmail.com>
wrote:

> Hi guys,
> Thank you for brainstorming with me and giving me some guidance. I'll look
> into the HDT file format, and the logic in Server.js working with the HDT
> file to see what seems to be the most reasonable way of implementing this.
> I like the idea of keeping a cache while regenerating the HDT file
> index/contents; but I'll need to do a bit more research first.
> I'll write back again soon, and maybe start preparing my proposal.
>
> Also, since contributing to the project before the SoC is desirable, is
> there an issue in the bug tracker that you think I should look at to get
> started?
>
> Thanks again,
> Pablo.
>
> On Fri, Mar 20, 2015, 10:42 PM Ruben Verborgh <ruben.verbo...@ugent.be>
> wrote:
>
>> Hi Pablo,
>>
>> > 1. Update the HDT file (According to What is HDT, HDT is a read-only
>> > format, so it might not be feasible).
>>
>> …which is why we have this project :-)
>>
>> Can you find a way to update HDT files?
>> Can you improve HDT so that it allows updates?
>> Could you perhaps make a combination
>> of different HDT files to give the live result?
>>
>> It's definitely feasible—just not within the current limits.
>> The challenge in this project is to find out!
>>
>> > 2. Possibly, keep an in-memory cache of triples, where we would keep
>> > modified triples permanently (This could potentially raise the memory
>> > requirements of the server...)
>>
>> Maybe have a pipeline so that new triples
>> are slowly migrated from cache to disk?
>>
>> > 3. Keep a file on disk, in a format that can be written to/read from
>> > efficiently, and keep information about updated triples here (This seems
>> > like a reasonable option...)
>>
>> Possible too!
>>
>> > If that's the case, then to 'start up' a 'Live' TPFS, we need to know
>> > the time when the HDT file was generated, and then we need to run the
>> > 'triple update' function over all the triples that have been changed since
>> > then. Correct? (This would make startup potentially quite slow, but I guess
>> > that's okay).
>>
>> Sure, but in the meantime, the server should remain active.
>> I.e., the question is: how can we keep a server running,
>> while updating the triples in the meantime/background,
>> while still staying easy on server resources (RAM / CPU).
>>
>> Note that we don't need to find the right answer here and now;
>> thinking about multiple directions is part of the project.
>>
>> Best,
>>
>> Ruben
>
>


_______________________________________________
Dbpedia-gsoc mailing list
Dbpedia-gsoc@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
