Hi Harri,

I guess you are using TDB2 if you have -Xmx2G. TDB1 will use even more heap space.

All those named variables mean the intermediate results are all held onto, for every row. That includes the "no change" case: it looks like REPLACE returns a new string even when nothing changes.

That is at least 8 Gbytes just there, by my rough calculation.
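(Back-of-envelope, assuming Java's 2 bytes per char: 250,000 rows x 8 string bindings (?ocr plus ?ocr1..?ocr7) x ~2,000 chars x 2 bytes ≈ 8 Gbytes, all live at once.)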

Things to try:

1/
Replace the use of named variables with a single nested expression:
REPLACE(REPLACE( .... ))
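Using the replacements from your query, the whole update would look something like this (untested sketch):

insert {
    graph vice:pageocrdata_clean {
      ?page vice:ocrtext ?ocr7 .
    }
  }
  where {
    graph vice:pageocrdata {
      ?page vice:ocrtext ?ocr .
    }
    bind (replace(replace(replace(replace(replace(replace(replace(
            str(?ocr),
            'ſ','s'),
            'uͤ','ü'),
            'aͤ','ä'),
            'oͤ','ö'),
            "[⸗—]\n",''),
            "\n",' '),
            "[ ]+",' ') as ?ocr7)
  }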

2/ (expanding on Marco's email):
If you are using TDB2:

First transaction:
COPY vice:pageocrdata TO vice:pageocrdata_clean
or
insert {
    graph vice:pageocrdata_clean {
      ?page vice:ocrtext ?X .
    }
  }
  where {
    graph vice:pageocrdata {
      ?page vice:ocrtext ?X .
    }
  }

then apply the changes in a second transaction:

WITH vice:pageocrdata_clean
DELETE { ?page vice:ocrtext ?ocr }
INSERT { ?page vice:ocrtext ?ocr7 }
WHERE {
    ?page vice:ocrtext ?ocr .
    BIND(replace(?ocr,'uͤ','ü') AS ?ocr7)
    FILTER (?ocr != ?ocr7)
}

(use the full nested REPLACE chain from 1/ in place of the single replace shown here)

3/
If TDB1 and none of that works, maybe also reduce the amount of work done per transaction.

It so happens that SELECT with LIMIT/OFFSET is predictable for a persistent database, so the update can be run in slices (this is not portable!).

WHERE {
   {
     SELECT ?page ?ocr
     WHERE { graph vice:pageocrdata { ?page vice:ocrtext ?ocr . } }
     OFFSET ... LIMIT ...
   }
   # all the BINDs
}
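For example, one slice of the data (the offset and slice size here are only illustrative; step the OFFSET up until no more rows come back):

insert {
    graph vice:pageocrdata_clean {
      ?page vice:ocrtext ?ocr7 .
    }
  }
  where {
    { select ?page ?ocr
      where { graph vice:pageocrdata { ?page vice:ocrtext ?ocr . } }
      offset 0 limit 50000
    }
    bind (replace(str(?ocr),'ſ','s') as ?ocr1)
    # ... the remaining BINDs as before, ending in ?ocr7
  }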

(or slice by filtering instead: ?ocr starts with "A", then with "B", and so on.)
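Something like (sketch):

    FILTER(STRSTARTS(str(?ocr), "A"))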

    Andy


BTW: with replace(str(?ocr), ...), any URIs will be turned into strings and any language tags will be lost.
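If the OCR literals carry language tags you want to keep, one way to re-attach them (a sketch; ?cleaned stands for the result of the REPLACE chain) is:

    # keep the original language tag, if there was one
    bind (if(lang(?ocr) != "",
             strlang(?cleaned, lang(?ocr)),
             ?cleaned) as ?ocr7)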

On 23/09/2021 16:28, Marco Neumann wrote:
"not to bind" to be read as "just bind once"

On Thu, Sep 23, 2021 at 4:25 PM Marco Neumann <marco.neum...@gmail.com>
wrote:

set -Xmx to 8G and try not to bind the variables, to see if this
alleviates the issue.

On Thu, Sep 23, 2021 at 12:41 PM Harri Kiiskinen <harri.kiiski...@utu.fi>
wrote:

Hi!

I'm trying to run a simple update query that reads strings from one
graph, processes them, and stores to another:


------------------------------------------------------------------------------
   insert {
     graph vice:pageocrdata_clean {
       ?page vice:ocrtext ?ocr7 .
     }
   }
   where {
     graph vice:pageocrdata {
       ?page vice:ocrtext ?ocr .
     }
     bind (replace(str(?ocr),'ſ','s') as ?ocr1)
     bind (replace(?ocr1,'uͤ','ü') as ?ocr2)
     bind (replace(?ocr2,'aͤ','ä') as ?ocr3)
     bind (replace(?ocr3,'oͤ','ö') as ?ocr4)
     bind (replace(?ocr4,"[⸗—]\n",'') as ?ocr5)
     bind (replace(?ocr5,"\n",' ') as ?ocr6)
     bind (replace(?ocr6,"[ ]+",' ') as ?ocr7)
   }

-------------------------------------------------------------------------------
The source graph has some 250,000 triples that match the WHERE pattern.
The strings are one to two thousand characters in length.

I'm running the query using the Fuseki web UI, and it ends each time with
"Bad Request (#400) Java heap space". The Fuseki log does not show any
error except for the Bad Request #400. I'm quite surprised by this problem,
because the update is simple, straightforward data processing, with no
ordering etc.

I started with -Xmx2G, but even increasing the heap to -Xmx12G only
increases the time it takes for Fuseki to return the same error.

Is there something wrong with the SPARQL above? Is there something that
increases the memory use unnecessarily?

Best,

Harri Kiiskinen



--
Marco Neumann
KONA


