On Mon, Jan 16, 2023 at 16:01:51 +0000, Andy Seaborne
<[email protected]> wrote:
Dear Andy,
Regarding the exponential increase in time: it occurs when I load the
files one by one, each with its own command line.
For example, I get the following loading times with tdb2.tdbloader (I
have observed the same phenomenon with requests).
Loading the file Uniref_1.nt: 35 min
Then loading the file Uniref_2.nt (into the same graph of the same
dataset as Uniref_1.nt): 237 min
Loading Uniref_1.nt and Uniref_2.nt in the same tdb2.tdbloader
command line: 150 min
Uploading a file into a non-empty dataset/graph takes longer than
uploading that same file together with another one at the same time.
On 16/01/2023 13:16, Steven Blanchard wrote:
Hello,
I would like to upload a very large dataset (UniRef) to a Fuseki
database.
How big (in triples)?
In total, we have ~ 4 000 000 000 triples to insert.
I tried to upload file by file, but the upload time grew exponentially
with each file added.
Code used:
```python
import requests
from requests_toolbelt import MultipartEncoder

# POST one file to the dataset's Graph Store Protocol endpoint.
url: str = f"{jena_url}/{db_name}/data"
multipart_data: MultipartEncoder = MultipartEncoder(
    fields={
        "file": (
            file_name,
            open(path_file, "rb"),
            "text/turtle",
        )
    }
)
response: requests.Response = requests.post(
    url,
    data=multipart_data,
    headers={"Content-Type": multipart_data.content_type},
    cookies=cookies,
)
```
That is a multi-part file upload.
Does the Fuseki log show a single POST?
Does it have a Content-length? (run Fuseki with "-v" to see headers)
Is the Python client taking time to assemble the request?
It's not the root issue but I'd like to understand what various
setups do in practice and what arrives at the server.
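A minimal sketch of that, with an illustrative dataset path and name:
```bash
# Start a standalone Fuseki with verbose logging; request and
# response headers then appear in the server log.
fuseki-server --tdb2 --verbose --loc=fuseki/base/databases/uniref /uniref
```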
For the multipart, I found this conversation:
https://stackoverflow.com/questions/54549464/programmaticaly-upload-dataset-to-fuseki.
Having been quickly blocked by the exponential increase in loading
times, I did not record the timings, so I cannot tell you whether it
was faster than not using it.
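As an aside, multipart is not strictly required here: a plain POST with
the file as the request body and an explicit Content-Type is a possible
alternative. A minimal sketch, reusing the variables from the snippet
above:
```python
import requests

# Plain (non-multipart) POST: stream the file as the request body.
with open(path_file, "rb") as f:
    response: requests.Response = requests.post(
        f"{jena_url}/{db_name}/data",
        data=f,
        headers={"Content-Type": "text/turtle"},
        cookies=cookies,
    )
response.raise_for_status()  # raises if the server rejected the upload
```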
Then I tried to upload with the tdb2.tdbloader command.
By loading all the files in the same command, the upload became very
much faster. Also, tdb2.tdbloader has an option to parallelize the
load.
If you load into a live server, Fuseki does a safe add to the
database within a database transaction that does not consume all the
server hardware resources. If the data is bad (all too often the
case) or the client breaks the connection so that the data is corrupt,
the transaction will abort and the database is left intact in its
original state.
The UniRef database will not be uploaded often and will not be
updated afterwards. But I take good note of your warnings: for the
other (small) databases, the loading times are very acceptable, so I
will use the Fuseki server.
This is to keep a balance between upload, integrity, and responding
to queries.
Load performance is hardware sensitive.
Is this using an SSD? If so, local or remote?
Code used:
```bash
bin/tdb2.tdbloader --loader=parallel \
    --loc fuseki/base/databases/uniref/ data/uniref_*
```
The problem with tdb2.tdbloader is that it does not work over HTTP.
The parallel loader will saturate the I/O at scale. It's greedy!
I/O is the limiting factor at scale.
tdb2.tdbloader (default and parallel) runs without database
transactions. Different parts of the database are in different states
(the "parallel" bit). An aborted load, or programme crash, will
destroy the database. It is best used loading from empty.
When you say it crashes the database, is it just the dataset where the
data is uploaded, or all the datasets on the server?
I create a new dataset for each version of my software, so it is
possible for me to upload the UniProt graph first and then the other
data by requests, to be safe.
I would like to know if it is possible to get the same performance
as tdb2 (loading all files at once, parallelization, ...) by using an
HTTP request?
Not as of today.
Could the functionality be added? Yes - a good case for a Fuseki
Module to make the functionality opt-in because it can take out the
server and break the database.
I'm also open to other suggestions to optimize this file loading.
Loading offline and then putting the new database in place is the way
to exploit faster loading.
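One possible shape for that workflow, with illustrative paths, and
assuming the server can be stopped while the new database directory is
moved into place:
```bash
# 1. Build the database offline; the parallel loader is best
#    started against an empty database.
bin/tdb2.tdbloader --loader=parallel --loc /data/staging/uniref data/uniref_*

# 2. With Fuseki stopped, move the finished database into place and restart.
mv /data/staging/uniref fuseki/base/databases/uniref
```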
What explains this exponential growth of the upload time when adding
data in several steps?
Loading slows down over time.
1. Hardware sizes mean more I/O and a smaller percentage cached as the
loaded data grows.
2. Loading is in effect a sort, so the cost is n·log(n) once the scale
is large enough to see (see the sketch below).
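A back-of-the-envelope sketch of point 2 alone, using a made-up triple
count:
```python
import math

# Rough model for point 2: loading n triples costs about n * log(n),
# because loading is in effect a sort.
def sort_cost(n: float) -> float:
    return n * math.log(n)

n = 1_000_000_000  # illustrative triple count for one file (made up)
first = sort_cost(n)                      # loading into an empty database
second = sort_cost(2 * n) - sort_cost(n)  # marginal cost of an equal second file
print(f"second load / first load ~ {second / first:.2f}")  # ~1.07 for this n
# This factor alone is modest; the larger slowdowns seen in practice
# come mainly from point 1, the I/O and caching effects.
```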
Andy
Thank you for your help,
Steven