Sure, and if it fits your use case, even better.

On Fri, Sep 24, 2021 at 9:41 AM Harri Kiiskinen <harri.kiiski...@utu.fi>
wrote:

> Perhaps so; but as a tool, Jena, and SPARQL in general, is very suitable
> for managing and processing data so that the processes can be described
> and repeated. For example, in this case, processing the results of the
> OCR is very quick compared to the actual OCR process, so I prefer to
> store the original OCR results somewhere and do the post-processing
> – which may require other stages than just the one presented here –
> later. With any external solution, I'd have to store the original text
> somewhere anyway, and keep track of the file names etc.
>
> In this case, the actual run of the corrected SPARQL took only some tens
> of seconds, which is rather good, especially compared to the amount of
> time it would take to write the necessary scripts and set up the data
> management needed to make this simple process repeatable with external
> solutions.
>
> And in fact, if a database cannot be used for managing and processing
> data, I don't know what it should be used for :-)
>
> Harri
>
>
> On 24.9.2021 11.21, Marco Neumann wrote:
> > All that said, I would think you'd be best advised to run this type of
> > operation outside of Jena, during preprocessing, with CLI tools such as
> > grep, sed, awk or ack.
> >
> > On Fri, Sep 24, 2021 at 9:14 AM Harri Kiiskinen <harri.kiiski...@utu.fi>
> > wrote:
> >
> >> Hi all,
> >>
> >> and thanks for the support! I did manage to resolve the problem by
> >> modifying the query, detailed comments below.
> >>
> >> Harri K.
> >>
> >> On 23.9.2021 22.47, Andy Seaborne wrote:
> >>> I guess you are using TDB2 if you have -Xmx2G. TDB1 will use even more
> >>> heap space.
> >>
> >> Yes, TDB2.
> >>
> >>> All those named variables mean that the intermediate results are being
> >>> held onto. That includes the "no change" case: it looks like REPLACE
> >>> with no change still produces a new string.
> >>
> >> I was afraid this might be the case.
> >>
> >>> There is at least 8 Gbytes just there by my rough calculation.
> >>
> >> -Xmx12G was not enough, so even more, I guess.
> >>
> >>> Things to try:
> >>>
> >>> 1/
> >>> Replace the use of named variables by a single expression
> >>> REPLACE (REPLACE( .... ))
> >>
> >> This did the trick. Combining all the replaces into one expression as
> >> above was enough to keep the memory use below 7 GB.
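> >>
> >> Roughly, the combined query looks like this (just a sketch of the same
> >> replacements as in my original query; the variable name ?clean is
> >> illustrative, not what I actually used):
> >>
> >>    insert {
> >>      graph vice:pageocrdata_clean {
> >>        ?page vice:ocrtext ?clean .
> >>      }
> >>    }
> >>    where {
> >>      graph vice:pageocrdata {
> >>        ?page vice:ocrtext ?ocr .
> >>      }
> >>      # one nested expression, so no intermediate bindings are retained
> >>      bind (replace(replace(replace(replace(replace(replace(replace(
> >>               str(?ocr),'ſ','s'),'uͤ','ü'),'aͤ','ä'),'oͤ','ö'),
> >>               "[⸗—]\n",''),"\n",' '),"[ ]+",' ') as ?clean)
> >>    }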
> >>
> >> I also tried replacing the BINDs with the Jena-specific LET construct
> >> (https://jena.apache.org/documentation/query/assignment.html), but that
> >> had no effect – is LET just a pre-SPARQL-1.1 addition that is
> >> practically the same as BIND, or is there a meaningful difference
> >> between the two?
> >>
> >>> 2/ (expanding on Marco's email):
> >>> If you are using TDB2:
> >>>
> >>> First transaction:
> >>> COPY vice:pageocrdata TO vice:pageocrdata_clean
> >>> or
> >>> insert {
> >>>       graph vice:pageocrdata_clean {
> >>>         ?page vice:ocrtext ?X .
> >>>       }
> >>>     }
> >>>     where {
> >>>       graph vice:pageocrdata {
> >>>         ?page vice:ocrtext ?X .
> >>>       }
> >>>     }
> >>>
> >>> then apply the changes:
> >>>
> >>> WITH vice:pageocrdata_clean
> >>> DELETE { ?page vice:ocrtext ?ocr }
> >>> INSERT { ?page vice:ocrtext ?ocr7 }
> >>> WHERE {
> >>>       ?page vice:ocrtext ?ocr .
> >>>       BIND(replace(?ocr,'uͤ','ü') AS ?ocr7)
> >>>       FILTER (?ocr != ?ocr7)
> >>> }
> >>
> >> Is there a big difference between working within one graph and
> >> inter-graph update operations? Just asking because I compartmentalize
> >> my data into different graphs quite a lot, and if it is significantly
> >> more expensive, I may have to rethink some processes, like the one
> >> shown above.
> >>
> >>> 3/
> >>> If TDB1 and none of that works, maybe reduce the internal transaction
> >>> space as well
> >>>
> >>> It so happens that SELECT LIMIT OFFSET is predictable for a persistent
> >>> database (this is not portable!!).
> >>>
> >>> WHERE {
> >>>      {
> >>>        SELECT ?page ?ocr
> >>>        { graph vice:pageocrdata { ?page vice:ocrtext ?ocr . } }
> >>>        OFFSET ... LIMIT ...
> >>>      }
> >>>      # all the BINDs
> >>> }
> >>>
> >>> (or filter: first on ?ocr starting with "A", then with "B", and so on.)
> >>>
> >>>       Andy
> >>
> >> Ah, yes, of course, this may come in handy with even larger datasets.
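> >>
> >> For example, something like this, run repeatedly with an increasing
> >> OFFSET (just a sketch; the batch size of 50000 is an arbitrary choice,
> >> and the bind is shortened to the first replacement for brevity):
> >>
> >>    insert {
> >>      graph vice:pageocrdata_clean {
> >>        ?page vice:ocrtext ?clean .
> >>      }
> >>    }
> >>    where {
> >>      # process one slice of the source graph per run
> >>      { select ?page ?ocr
> >>        where { graph vice:pageocrdata { ?page vice:ocrtext ?ocr . } }
> >>        offset 0 limit 50000
> >>      }
> >>      bind (replace(str(?ocr),'ſ','s') as ?clean)
> >>    }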
> >>
> >>> BTW : replace(str(?ocr), ...
> >>> Any URIs will turn into strings and any language tags will be lost.
> >>
> >> Yes, that is unnecessary.
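> >>
> >> (And if some of the texts ever carry language tags, dropping str()
> >> should also preserve them, since REPLACE keeps the tag of its first
> >> argument, e.g.:
> >>
> >>    bind (replace(?ocr,'ſ','s') as ?ocr1)
> >>
> >> instead of replace(str(?ocr),'ſ','s').)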
> >>
> >>> On 23/09/2021 16:28, Marco Neumann wrote:
> >>>> "not to bind" to be read as "just bind once"
> >>>>
> >>>> On Thu, Sep 23, 2021 at 4:25 PM Marco Neumann <marco.neum...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> set -Xmx to 8G and try not to bind the variable, to see if this
> >>>>> alleviates the issue.
> >>>>>
> >>>>> On Thu, Sep 23, 2021 at 12:41 PM Harri Kiiskinen
> >>>>> <harri.kiiski...@utu.fi>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi!
> >>>>>>
> >>>>>> I'm trying to run a simple update query that reads strings from one
> >>>>>> graph, processes them, and stores them in another:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> ------------------------------------------------------------------
> >>>>>>
> >>>>>>     insert {
> >>>>>>       graph vice:pageocrdata_clean {
> >>>>>>         ?page vice:ocrtext ?ocr7 .
> >>>>>>       }
> >>>>>>     }
> >>>>>>     where {
> >>>>>>       graph vice:pageocrdata {
> >>>>>>         ?page vice:ocrtext ?ocr .
> >>>>>>       }
> >>>>>>       bind (replace(str(?ocr),'ſ','s') as ?ocr1)
> >>>>>>       bind (replace(?ocr1,'uͤ','ü') as ?ocr2)
> >>>>>>       bind (replace(?ocr2,'aͤ','ä') as ?ocr3)
> >>>>>>       bind (replace(?ocr3,'oͤ','ö') as ?ocr4)
> >>>>>>       bind (replace(?ocr4,"[⸗—]\n",'') as ?ocr5)
> >>>>>>       bind (replace(?ocr5,"\n",' ') as ?ocr6)
> >>>>>>       bind (replace(?ocr6,"[ ]+",' ') as ?ocr7)
> >>>>>>     }
> >>>>>>
> >>>>>>
> >>>>>> ------------------------------------------------------------------
> >>>>>>
> >>>>>> The source graph has some 250,000 triples that match the WHERE
> >>>>>> pattern. The strings are one to two thousand characters in length.
> >>>>>>
> >>>>>> I'm running the query using the Fuseki web UI, and it ends each time
> >>>>>> with "Bad Request (#400) Java heap space". The Fuseki log does not
> >>>>>> show any error except for the Bad Request #400. I'm quite surprised
> >>>>>> by this problem, because the update operation is simple and
> >>>>>> straightforward data processing, with no ordering etc.
> >>>>>>
> >>>>>> I started with -Xmx2G, but even increasing the heap to -Xmx12G only
> >>>>>> increases the time it takes for Fuseki to return the same error.
> >>>>>>
> >>>>>> Is there something wrong with the SPARQL above? Is there something
> >>>>>> that increases the memory use unnecessarily?
> >>>>>>
> >>>>>> Best,
> >>>>>>
> >>>>>> Harri Kiiskinen
> >>>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>>
> >>>>>
> >>>>> ---
> >>>>> Marco Neumann
> >>>>> KONA
> >>>>>
> >>>>>
> >>>>
> >>
> >>
> >> --
> >> Tutkijatohtori / post-doctoral researcher
> >> Viral Culture in the Early Nineteenth-Century Europe (ViCE)
> >> Movie Making Finland: Finnish fiction films as audiovisual big data,
> >> 1907–2017 (MoMaF)
> >> Turun yliopisto / University of Turku
> >>
> >
> >
>
>
> --
> Tutkijatohtori / post-doctoral researcher
> Viral Culture in the Early Nineteenth-Century Europe (ViCE)
> Movie Making Finland: Finnish fiction films as audiovisual big data,
> 1907–2017 (MoMaF)
> Turun yliopisto / University of Turku
>


-- 


---
Marco Neumann
KONA
