Sure, and if it fits your use case, even better.

On Fri, Sep 24, 2021 at 9:41 AM Harri Kiiskinen <harri.kiiski...@utu.fi> wrote:
> Perhaps so; but as a tool, Jena, and SPARQL in general, is very suitable
> for managing and processing data so that the processes can be described
> and repeated. For example, in this case, processing the results of the
> OCR is very quick compared to the actual OCR process, so I prefer to
> store the original results of the OCR somewhere and do the post-processing
> – which may require other stages than just the one presented here –
> later. For any external solution, I'd have to store the original text
> somewhere in any case, and keep track of the file names etc.
>
> In this case, the actual run of the corrected SPARQL took only some tens
> of seconds, which is rather good, especially compared to the amount of
> time it would take to write the necessary scripts and data management
> for making this simple process repeatable with external solutions.
>
> And in fact, if a database cannot be used for managing and processing
> data, I don't know what it should be used for :-)
>
> Harri
>
>
> On 24.9.2021 11.21, Marco Neumann wrote:
> > All that said, I would think you'd be best advised to run this type of
> > operation outside of Jena, during preprocessing, with CLI tools such as
> > grep, sed, awk or ack.
> >
> > On Fri, Sep 24, 2021 at 9:14 AM Harri Kiiskinen <harri.kiiski...@utu.fi>
> > wrote:
> >
> >> Hi all,
> >>
> >> and thanks for the support! I did manage to resolve the problem by
> >> modifying the query; detailed comments below.
> >>
> >> Harri K.
> >>
> >> On 23.9.2021 22.47, Andy Seaborne wrote:
> >>> I guess you are using TDB2 if you have -Xmx2G. TDB1 will use even more
> >>> heap space.
> >>
> >> Yes, TDB2.
> >>
> >>> All those named variables mean that the intermediate results are being
> >>> held onto. That includes the "no change" case. It looks like a REPLACE
> >>> that changes nothing still produces a new string.
> >>
> >> I was afraid this might be the case.
> >>
> >>> There is at least 8 GB just there, by my rough calculation.
> >>
> >> -Xmx12G was not enough, so even more, I guess.
> >>
> >>> Things to try:
> >>>
> >>> 1/
> >>> Replace the use of named variables by a single expression:
> >>> REPLACE(REPLACE( .... ))
> >>
> >> This did the trick. Combining all the replaces into one expression, as
> >> above, was enough to keep the memory use below 7 GB.
> >>
> >> I also tried replacing the BINDs with the Jena-specific LET constructs
> >> (https://jena.apache.org/documentation/query/assignment.html), but that
> >> had no effect – is LET just a pre-SPARQL-1.1 addition that is
> >> practically the same as BIND, or is there a meaningful difference
> >> between the two?
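> >>
> >> For reference, the combined single-expression version looks roughly
> >> like this (a reconstructed sketch based on the BINDs quoted below;
> >> prefixes omitted, innermost replace applied first):
> >>
> >>    insert {
> >>      graph vice:pageocrdata_clean {
> >>        ?page vice:ocrtext ?ocr7 .
> >>      }
> >>    }
> >>    where {
> >>      graph vice:pageocrdata {
> >>        ?page vice:ocrtext ?ocr .
> >>      }
> >>      # one nested expression: no intermediate ?ocr1..?ocr6 values are kept
> >>      bind (replace(replace(replace(replace(replace(replace(replace(
> >>             str(?ocr),'ſ','s'),'uͤ','ü'),'aͤ','ä'),'oͤ','ö'),
> >>             "[⸗—]\n",''),"\n",' '),"[ ]+",' ') as ?ocr7)
> >>    }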
> >>
> >>> 2/ (expanding on Marco's email):
> >>> If you are using TDB2:
> >>>
> >>> First transaction:
> >>>
> >>>    COPY vice:pageocrdata TO vice:pageocrdata_clean
> >>>
> >>> or
> >>>
> >>>    insert {
> >>>      graph vice:pageocrdata_clean {
> >>>        ?page vice:ocrtext ?X .
> >>>      }
> >>>    }
> >>>    where {
> >>>      graph vice:pageocrdata {
> >>>        ?page vice:ocrtext ?X .
> >>>      }
> >>>    }
> >>>
> >>> then apply the changes:
> >>>
> >>>    WITH vice:pageocrdata_clean
> >>>    DELETE { ?page vice:ocrtext ?ocr }
> >>>    INSERT { ?page vice:ocrtext ?ocr7 }
> >>>    WHERE {
> >>>      ?page vice:ocrtext ?ocr .
> >>>      BIND(replace(?ocr,'uͤ','ü') AS ?ocr7)
> >>>      FILTER (?ocr != ?ocr7)
> >>>    }
> >>
> >> Is there a big difference in working within one graph as compared to
> >> intergraph update operations? Just asking because I compartmentalize
> >> my data into different graphs quite a lot, but if it is significantly
> >> more expensive, I may have to rethink some processes, like the one
> >> shown above.
> >>
> >>> 3/
> >>> If TDB1, and none of that works, maybe reduce the internal transaction
> >>> space as well.
> >>>
> >>> It so happens that SELECT with LIMIT/OFFSET is predictable for a
> >>> persistent database (this is not portable!!).
> >>>
> >>>    WHERE {
> >>>      { SELECT ?page ?ocr
> >>>        { graph vice:pageocrdata { ?page vice:ocrtext ?ocr . } }
> >>>        OFFSET ... LIMIT ...
> >>>      }
> >>>      # all the BINDs
> >>>    }
> >>>
> >>> (or filter instead: first where ?ocr starts with "A", then with "B",
> >>> and so on)
> >>>
> >>>       Andy
> >>
> >> Ah, yes, of course, this may come in handy with even larger datasets.
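> >>
> >> If I read it right, one chunk of that would look something like this
> >> (untested sketch; the chunk size of 50000 is arbitrary, and the update
> >> is re-run with OFFSET 0, 50000, 100000, ... until no solutions remain):
> >>
> >>    insert {
> >>      graph vice:pageocrdata_clean {
> >>        ?page vice:ocrtext ?ocr7 .
> >>      }
> >>    }
> >>    where {
> >>      # the sub-select bounds the working set for one run (not portable,
> >>      # but stable for a persistent TDB database, as noted above)
> >>      { select ?page ?ocr
> >>        where { graph vice:pageocrdata { ?page vice:ocrtext ?ocr . } }
> >>        offset 0 limit 50000
> >>      }
> >>      bind (replace(replace(replace(replace(replace(replace(replace(
> >>             str(?ocr),'ſ','s'),'uͤ','ü'),'aͤ','ä'),'oͤ','ö'),
> >>             "[⸗—]\n",''),"\n",' '),"[ ]+",' ') as ?ocr7)
> >>    }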
> >>
> >>> BTW: replace(str(?ocr), ...
> >>> Any URIs will turn into strings, and any language tags will be lost.
> >>
> >> Yes, that is unnecessary.
> >>
> >>> On 23/09/2021 16:28, Marco Neumann wrote:
> >>>> "not to bind" is to be read as "just bind once"
> >>>>
> >>>> On Thu, Sep 23, 2021 at 4:25 PM Marco Neumann <marco.neum...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Set -Xmx to 8G and try not to bind the variables, to see if this
> >>>>> alleviates the issue.
> >>>>>
> >>>>> On Thu, Sep 23, 2021 at 12:41 PM Harri Kiiskinen
> >>>>> <harri.kiiski...@utu.fi> wrote:
> >>>>>
> >>>>>> Hi!
> >>>>>>
> >>>>>> I'm trying to run a simple update query that reads strings from one
> >>>>>> graph, processes them, and stores them in another:
> >>>>>>
> >>>>>> ------------------------------------------------------------------------------
> >>>>>>
> >>>>>> insert {
> >>>>>>   graph vice:pageocrdata_clean {
> >>>>>>     ?page vice:ocrtext ?ocr7 .
> >>>>>>   }
> >>>>>> }
> >>>>>> where {
> >>>>>>   graph vice:pageocrdata {
> >>>>>>     ?page vice:ocrtext ?ocr .
> >>>>>>   }
> >>>>>>   bind (replace(str(?ocr),'ſ','s') as ?ocr1)
> >>>>>>   bind (replace(?ocr1,'uͤ','ü') as ?ocr2)
> >>>>>>   bind (replace(?ocr2,'aͤ','ä') as ?ocr3)
> >>>>>>   bind (replace(?ocr3,'oͤ','ö') as ?ocr4)
> >>>>>>   bind (replace(?ocr4,"[⸗—]\n",'') as ?ocr5)
> >>>>>>   bind (replace(?ocr5,"\n",' ') as ?ocr6)
> >>>>>>   bind (replace(?ocr6,"[ ]+",' ') as ?ocr7)
> >>>>>> }
> >>>>>>
> >>>>>> ------------------------------------------------------------------------------
> >>>>>>
> >>>>>> The source graph has some 250,000 triples that match the WHERE
> >>>>>> criterion. The strings are one to two thousand characters in length.
> >>>>>>
> >>>>>> I'm running the query using the Fuseki web UI, and it ends each time
> >>>>>> with "Bad Request (#400) Java heap space". The Fuseki log does not
> >>>>>> show any error except for the Bad Request #400. I'm quite surprised
> >>>>>> by this problem, because the update is simple and straightforward
> >>>>>> data processing, with no ordering etc.
> >>>>>>
> >>>>>> I started with -Xmx2G, but even increasing the heap to -Xmx12G only
> >>>>>> increases the time it takes for Fuseki to return the same error.
> >>>>>>
> >>>>>> Is there something wrong with the SPARQL above? Is there something
> >>>>>> that increases the memory use unnecessarily?
> >>>>>>
> >>>>>> Best,
> >>>>>>
> >>>>>> Harri Kiiskinen
> >>>>>
> >>>>>
> >>>>> --
> >>>>> ---
> >>>>> Marco Neumann
> >>>>> KONA
> >>>>
> >>
> >> --
> >> Tutkijatohtori / post-doctoral researcher
> >> Viral Culture in the Early Nineteenth-Century Europe (ViCE)
> >> Movie Making Finland: Finnish fiction films as audiovisual big data,
> >> 1907–2017 (MoMaF)
> >> Turun yliopisto / University of Turku
>
> --
> Tutkijatohtori / post-doctoral researcher
> Viral Culture in the Early Nineteenth-Century Europe (ViCE)
> Movie Making Finland: Finnish fiction films as audiovisual big data,
> 1907–2017 (MoMaF)
> Turun yliopisto / University of Turku

--
---
Marco Neumann
KONA