Hi Andy & Lorenz,

Thanks for the clarification and support.

Best regards,
Gaspar

> On 14 Jul 2022, at 19:36, Andy Seaborne <a...@apache.org> wrote:
> 
> 
> 
> On 07/07/2022 16:19, Lorenz Buehmann wrote:
>> I think we should wait for further input from Andy here, as he's the person 
>> who designed and implemented all the fancy stuff and will surely know the 
>> best advice.
>> @Andy Did you read the whole discussion, and can you confirm that it's 
>> expected behavior that lots of daily updates lead to such a big growth of 
>> the node table files?
> 
> Sorry for the delay.
> 
> There is no Lucene index by default.
> 
> SPO.dat is not node table related - it is the base level of the SPO B+Tree. 
> SPO.idn is the tree above the base level, and SPO.bpt keeps the pointers to 
> the root block and some size information.
> 
> The issue looks to be the large number of small updates. TDB2 uses a 
> copy-on-write MVCC scheme, which means transactions can proceed without 
> needing latches (database locks), but it has the consequence of needing 
> compaction.
> 
> TDB1 with Fuseki is worth a try. It does not use that scheme. It does grow, 
> but much more slowly. It is limited in the size of updates it can handle, but 
> the limit is nowhere near what you describe.
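> 
> As a sketch, and assuming the dataset-creation endpoint you already use 
> accepts "tdb" as the TDB1 database type (please verify against the admin 
> protocol docs for your Fuseki version):
> 
> curl -X POST 'http://localhost:3030/$/datasets?state=active&dbType=tdb&dbName=db_name'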
> 
> Also worth trying is compaction and deletion
> 
> /$/compact/db_name?deleteOld=true
> 
> which will delete the old database after compaction (only the one just 
> compacted; older ones can be deleted manually).
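> 
> A minimal curl sketch of that call (assuming Fuseki on the default 
> localhost:3030; the URL is single-quoted because of the $ and ?):
> 
> curl -X POST 'http://localhost:3030/$/compact/db_name?deleteOld=true'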
> 
>    Andy
> 
>> On 07.07.22 10:53, Bartalus Gáspár wrote:
>>> Hi Lorenz,
>>> 
>>> Would you recommend using TDB1 instead of TDB2 for our use case? What would 
>>> be the differences?
>>> We are using Fuseki 4.5.0, btw.
>>> 
>>> Gaspar
>>> 
>>>> On 6 Jul 2022, at 14:39, Bartalus Gáspár 
>>>> <bartalus.gas...@codespring.ro.INVALID> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> Most of the updates are DELETE/INSERT queries, i.e.:
>>>> 
>>>> DELETE {?s ?p ?oldValue}
>>>> INSERT {?s ?p ?newValue}
>>>> WHERE {
>>>>   OPTIONAL {?s ?p ?oldValue}
>>>>   #derive ?newValue from somewhere
>>>> }
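>>>> 
>>>> As a made-up concrete instance of that shape (the ex: prefix and property 
>>>> names are hypothetical), where each run binds a fresh timestamp literal 
>>>> for ?newValue, presumably adding a new node table entry every time:
>>>> 
>>>> PREFIX ex: <http://example.org/>
>>>> 
>>>> DELETE { ?s ex:lastSeen ?oldValue }
>>>> INSERT { ?s ex:lastSeen ?newValue }
>>>> WHERE {
>>>>   ?s a ex:Sensor .
>>>>   OPTIONAL { ?s ex:lastSeen ?oldValue }
>>>>   BIND(NOW() AS ?newValue)
>>>> }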
>>>> 
>>>> We also have some separate DELETE queries and INSERT queries.
>>>> 
>>>> I’ve tried HTTP POST /$/compact/db_name, and as a result the files go 
>>>> back to normal size. However, as far as I can tell, the old files are 
>>>> also kept. This is the folder structure I see:
>>>> - databases/db_name/Data-0001 - with the old large files
>>>> - databases/db_name/Data-0002 - presumably the result of the compact 
>>>> operation with normal file sizes.
>>>> 
>>>> Is there also some operation (HTTP or CLI) that would keep only one (the 
>>>> latest) data folder, i.e. delete the old files from Data-0001?
>>>> 
>>>> Gaspar
>>>> 
>>>>> On 6 Jul 2022, at 12:52, Lorenz Buehmann 
>>>>> <buehm...@informatik.uni-leipzig.de> wrote:
>>>>> 
>>>>> Ok, interesting
>>>>> 
>>>>> so
>>>>> 
>>>>> we have
>>>>> 
>>>>> - 150k triples, a rather small dataset
>>>>> 
>>>>> - loaded into 10 MB node table files
>>>>> 
>>>>> - 10 updates every 5s
>>>>> 
>>>>> - which makes up to 24 * 60 * 60 / 5 * 10 = 172,800, i.e. roughly 170k 
>>>>> updates per day
>>>>> 
>>>>> - and leads to 10 GB node table files
>>>>> 
>>>>> 
>>>>> Can you share the shape of those update queries?
>>>>> 
>>>>> 
>>>>> After doing a "compact" operation, do the files go back to "normal" 
>>>>> size?
>>>>> 
>>>>> 
>>>>> On 06.07.22 10:36, Bartalus Gáspár wrote:
>>>>>> Hi Lorenz,
>>>>>> 
>>>>>> Thanks for the quick feedback and the clarification on Lucene indexes.
>>>>>> 
>>>>>> Here are my answers to your questions:
>>>>>> - We are uploading 7 ttl files to our dataset; one is larger, ~6 MB, 
>>>>>> and the others are below 200 KB.
>>>>>> - The overall number of triples after data upload is ~150,000.
>>>>>> - We have around 10 SPARQL UPDATE queries that are executed on a regular 
>>>>>> and frequent basis, i.e. every 5 seconds. We also have 5 such queries 
>>>>>> that are executed each minute. Most of the time they produce no changes, 
>>>>>> i.e. the dataset is not altered; when they do, only a couple of triples 
>>>>>> are added to the dataset.
>>>>>> - These *.dat files start at ~10 MB in size, and after a day or so some 
>>>>>> of them grow to ~10 GB.
>>>>>> 
>>>>>> We have ~300 blank nodes, and about half of the triples, so ~75,000, 
>>>>>> have a literal in the object position.
>>>>>> 
>>>>>> Best regards,
>>>>>> Gaspar
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 6 Jul 2022, at 10:55, Lorenz Buehmann 
>>>>>>> <buehm...@informatik.uni-leipzig.de> wrote:
>>>>>>> 
>>>>>>> Hi and welcome Gaspar.
>>>>>>> 
>>>>>>> 
>>>>>>> Those files do contain the node tables.
>>>>>>> 
>>>>>>> A Lucene index is never created by default; it would be stored in 
>>>>>>> Lucene-specific index files.
>>>>>>> 
>>>>>>> 
>>>>>>> Can you give some details about the
>>>>>>> 
>>>>>>> - size of the files
>>>>>>> - the number of triples
>>>>>>> - the number of triples added/removed/changed
>>>>>>> - the frequency of updates
>>>>>>> - how much the files grow
>>>>>>> - what kind of data you insert? Lots of blank nodes? Or literals?
>>>>>>> 
>>>>>>> Also, did you try a compact operation in the meantime?
>>>>>>> 
>>>>>>> Lorenz
>>>>>>> 
>>>>>>> On 06.07.22 09:40, Bartalus Gáspár wrote:
>>>>>>>> Hi Jena support team,
>>>>>>>> 
>>>>>>>> We are experiencing an issue with Jena Fuseki databases. In the 
>>>>>>>> databases folder we see some files called SPO.dat, OSP.dat, etc., and 
>>>>>>>> the size of these files is growing quickly. From our understanding, 
>>>>>>>> these files contain the Lucene indexes. We have two questions:
>>>>>>>> 
>>>>>>>> 1. Why are these files growing rapidly, although the underlying data 
>>>>>>>> (triples) is not being changed, or only slightly changed?
>>>>>>>> 2. Can we easily disable indexing, since we are not using full-text 
>>>>>>>> search in our SPARQL queries?
>>>>>>>> 
>>>>>>>> Our usage of Jena Fuseki:
>>>>>>>> 
>>>>>>>> * Start the server with `fuseki-server --port 3030`
>>>>>>>> * Create databases with HTTP POST to 
>>>>>>>> `/$/datasets?state=active&dbType=tdb2&dbName=db_name`
>>>>>>>> * Upload ttl files with HTTP POST to /db_name/data
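>>>>>>>> 
>>>>>>>> As a curl sketch of the last two steps (the host, port, and file name 
>>>>>>>> data.ttl are placeholders):
>>>>>>>> 
>>>>>>>> curl -X POST 'http://localhost:3030/$/datasets?state=active&dbType=tdb2&dbName=db_name'
>>>>>>>> curl -X POST -H 'Content-Type: text/turtle' --data-binary @data.ttl http://localhost:3030/db_name/data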
>>>>>>>> 
>>>>>>>> Thanks in advance for your feedback, and if you’d require more input 
>>>>>>>> from our side, please let me know.
>>>>>>>> 
>>>>>>>> Best regards,
>>>>>>>> Gaspar Bartalus
>>>>>>>> 
