Hi Andy,
I have loaded a few N-Triples into TDB offline using tdbloader. Loading as
well as querying is fast, but as soon as I use a regex it becomes very slow,
taking a few minutes: on my 32-bit machine it takes more than 10 minutes
(expected, given the limited memory, ~1.5GB), and on my 64-bit machine (8GB)
it takes around 5 minutes.
The query is pretty exhaustive; correct me if the slowdown is due to the
filter:
SELECT ?abstract
WHERE {
  ?resource <http://www.w3.org/2000/01/rdf-schema#label> ?l .
  FILTER regex(?l, "Futurama", "i") .
  ?resource <http://dbpedia.org/ontology/abstract> ?abstract
}
I have loaded a few abstracts from the DBpedia dump and I am trying to get
the abstracts via the label. This is very slow. If I remove the FILTER and
give the exact label, it is fast (presumably because of TDB's indexing).
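For comparison, the exact-match version that runs fast is along these lines
(whether the label needs a language tag such as @en depends on how the dump
stores it):

  SELECT ?abstract
  WHERE {
    ?resource <http://www.w3.org/2000/01/rdf-schema#label> "Futurama"@en .
    ?resource <http://dbpedia.org/ontology/abstract> ?abstract
  }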
What is the right way to do this kind of regex or free-text search over the
graph? I have seen suggestions to use Lucene and I also saw the LARQ
initiative. Is that the right way to go?
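If I understand the LARQ docs correctly, the usage would be roughly the
following (untested sketch; "model" here stands for the model backed by TDB):

  import com.hp.hpl.jena.query.larq.* ;

  // Build a Lucene index over the string literals in the model,
  // then register it as the default index for queries.
  IndexBuilderString larqBuilder = new IndexBuilderString() ;
  larqBuilder.indexStatements(model.listStatements()) ;
  larqBuilder.closeWriter() ;
  LARQ.setDefaultIndex(larqBuilder.getIndex()) ;

and then the query would use the pf:textMatch property function instead of
FILTER regex:

  PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
  SELECT ?abstract
  WHERE {
    ?l pf:textMatch "Futurama" .
    ?resource <http://www.w3.org/2000/01/rdf-schema#label> ?l .
    ?resource <http://dbpedia.org/ontology/abstract> ?abstract
  }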
Thanks,
Anuj
On Tue, Mar 15, 2011 at 5:09 PM, Andy Seaborne <[email protected]> wrote:
> Just so you know: the TDB bulkloader can load all the data offline; it's
> faster than loading the data online through Fuseki.
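>
> A typical invocation looks something like this (location and file names
> illustrative):
>
>   tdbloader --loc=/path/to/DB data.nt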
>
> Andy
>
>
> On 15/03/11 11:22, Anuj Kumar wrote:
>
>> Hi Andy,
>>
>> Thanks for the info. I have loaded a few GBs using the Fuseki server, but
>> I didn't try RiotReader or the Java APIs for TDB. Will try that.
>> Thanks for the response.
>>
>> Regards,
>> Anuj
>>
>> On Tue, Mar 15, 2011 at 4:12 PM, Andy Seaborne <[email protected]> wrote:
>>
>>> 1/ Have you considered reading the DBpedia data into TDB? This would keep
>>> the triples on-disk (and have cached in-memory versions of a subset).
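>>>
>>> E.g. a minimal sketch (TDBFactory is in com.hp.hpl.jena.tdb; location and
>>> file names illustrative):
>>>
>>>   Dataset ds = TDBFactory.createDataset("/path/to/DB") ;  // on-disk dataset
>>>   Model m = ds.getDefaultModel() ;
>>>   FileManager.get().readModel(m, "dbpedia.nt") ;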
>>>
>>> 2/ A file can be read sequentially by using the parser directly (see
>>> RiotReader and pass in a Sink<Triple> that processes the stream of
>>> triples).
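>>>
>>> Roughly like this (untested sketch; the exact package names may move
>>> between versions):
>>>
>>>   Sink<Triple> sink = new Sink<Triple>() {
>>>       public void send(Triple triple) { /* process one triple */ }
>>>       public void flush() {}
>>>       public void close() {}
>>>   } ;
>>>   RiotReader.parseTriples("dbpedia.nt", sink) ;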
>>>
>>> Andy
>>>
>>>
>>> On 14/03/11 18:42, Anuj Kumar wrote:
>>>
>>>> Hi All,
>>>>
>>>> I am new to Jena and am exploring how to work with a large number of
>>>> N-Triples. The requirement is to read a large number of N-Triples, for
>>>> example an .nt file from a DBpedia dump that may run into GBs. I have to
>>>> read these triples, pick specific ones, and link them to the resources
>>>> of another set of triples. The goal is to link some of the entities
>>>> following the Linked Data concept. Once the mapping is done, I have to
>>>> query the model from that point onwards. I don't want to load both the
>>>> source and target datasets in memory.
>>>>
>>>> To achieve this, I have first created a file model maker and then a
>>>> named model for the specific dataset being mapped. Now I need to read
>>>> the triples and add the mappings to this new model. What is the right
>>>> approach?
>>>>
>>>> One way is to load the model using FileManager, iterate through its
>>>> statements, map them into the named model (i.e. our mapped model), and
>>>> close it at the end, something like the sketch below. This works, but it
>>>> loads all of the triples into memory. Is this the right way to proceed,
>>>> or is there a way to read the model sequentially at mapping time?
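>>>>
>>>> A rough sketch of what I mean (file name illustrative):
>>>>
>>>>   Model src = FileManager.get().loadModel("dbpedia.nt") ;
>>>>   StmtIterator it = src.listStatements() ;
>>>>   while (it.hasNext()) {
>>>>       Statement s = it.nextStatement() ;
>>>>       // pick the statements of interest and add the mapping
>>>>       // to the named (mapped) model here
>>>>   }
>>>>   src.close() ;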
>>>>
>>>> Just trying to understand an efficient way to map a large set of
>>>> N-Triples. Need your suggestions.
>>>>
>>>> Thanks,
>>>> Anuj
>>>>
>>