Hi Graydon,

Maybe it’s TagSoup that has problems converting some specific HTML
files to XML. Did you try writing the responses to disk and parsing
them in a second step?
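For illustration, here is a minimal sketch of that two-step split. It reuses the variable names from your query ($paths, $id, $pass, $targetBase), which are assumed to be defined as before; in practice you would run the two steps as separate queries:

```xquery
(: Step 1: save every response to disk unparsed. :)
for $remote in $paths
let $target := file:resolve-path(file:name($remote), $targetBase)
return file:write-binary(
  $target,
  http:send-request(
    <http:request method='get'
      override-media-type='application/octet-stream'
      username='{$id}' password='{$pass}'/>,
    $remote
  )[2]
),

(: Step 2: parse the saved files in a separate run. :)
for $file in file:children($targetBase)
return try { html:parse(file:read-binary($file)) }
       catch * { () }
```

That way, if one particular response blows up during parsing, step 1 has already completed and you can identify the offending file on disk.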

If your input data is not confidential, could you possibly provide us
with an example that runs out of the box?

Best,
Christian


> I'm using the BaseX GUI to run the following (minus some identifying values 
> defined previously in the query)
>
> (: for each path, retrieve the document :)
> for $remote in $paths
>   let $name as xs:string := file:name($remote)
>   let $target as xs:string := file:resolve-path($name,$targetBase)
>   let $fetched :=
>     http:send-request(
>       <http:request method='get'
>         override-media-type='application/octet-stream'
>         username='{$id}' password='{$pass}'/>,
>       $remote)[2]
>   let $use as item() := try {
>     html:parse($fetched)
>   } catch * {
>     $fetched
>   }
>   return if ($use instance of document-node())
>      then file:write($target,$use)
>      else file:write-binary($target,$use)
>
> It works, in that I get exactly 100 documents retrieved.  (There are 
> unfortunately 140+ documents in the list.)
>
> However, the query fails with an "out of main memory" error when using a 
> recent 10.0 beta or 9.7 with Xmx set to 2g.  Setting Xmx to 16g with 9.7 
> produces the same "out of memory" error in the same length of time (about 5 
> minutes).
>
> java -version says
> 20:27 test % java -version
> openjdk version "11.0.14.1" 2022-02-08
> OpenJDK Runtime Environment 18.9 (build 11.0.14.1+1)
> OpenJDK 64-Bit Server VM 18.9 (build 11.0.14.1+1, mixed mode, sharing)
>
> It's entirely possible I'm going about fetching files off a web server the 
> wrong way; it's possible there's something there that's rather large, but I 
> doubt it's that large.
>
> What should I be doing instead?
>
> Thanks!
> Graydon
