Hi Christian -

Alas, the data is a client's and confidential.

for $remote in $paths
let $name as xs:string := file:name($remote)
let $target as xs:string := file:resolve-path($name,$targetBase)
let $fetched as item() :=
  http:send-request(<http:request method='get' username='{$id}' password='{$pass}' />,
    $remote)[2]
return
  if ($fetched instance of document-node())
  then file:write($target,$fetched)
  else if ($fetched instance of xs:base64Binary)
  then file:write-binary($target,$fetched)
  else file:write-text($target,$fetched)

works -- the query completes and files are written to disk.

I suspect that the server ignores override-media-type. If I don't check the
returned type and instead write everything out with file:write-binary(), so
that I could pick the HTML files back off the disk, I get an error that the
content wasn't binary; it was xs:untypedAtomic. (Which might imply that
file:write-binary complains that way when fed a document node.)

Which leads to "is there a way to get the type of an item?" I don't think
there is, but it seems like it would be extremely helpful for stuff like this,
where "figure out what the web server feels like doing" is a concern.

Thank you! It was a helpful hint.

-- Graydon

On Fri, Apr 8, 2022 at 9:44 AM Christian Grün <christian.gr...@gmail.com> wrote:

> Hi Graydon,
>
> Maybe it’s TagSoup that has problems converting some specific HTML
> files to XML. Did you try to write the responses to disk and parse
> them in a second step?
>
> If your input data is not confidential, could you possibly provide us
> with an example that runs out of the box?
>
> Best,
> Christian
>
>
> > I'm using the basexgui to run (minus some identifying actual values
> > defined previously in the query)
> >
> > (: for each path, retrieve the document :)
> > for $remote in $paths
> > let $name as xs:string := file:name($remote)
> > let $target as xs:string := file:resolve-path($name,$targetBase)
> > let $fetched :=
> >   http:send-request(<http:request method='get'
> >     override-media-type='application/octet-stream' username='{$id}'
> >     password='{$pass}' />,
> >     $remote)[2]
> > let $use as item() := try {
> >     html:parse($fetched)
> >   } catch * {
> >     $fetched
> >   }
> > return if ($use instance of document-node())
> >   then file:write($target,$use)
> >   else file:write-binary($target,$use)
> >
> > It works, in that I get exactly 100 documents retrieved. (There are
> > unfortunately 140+ documents in the list.)
> >
> > However, the query fails with an "out of main memory" error when using a
> > recent 10.0 beta or 9.7 with Xmx set to 2g. Setting Xmx to 16g with 9.7
> > produces the same "out of memory" error in the same length of time (about
> > 5 minutes).
> >
> > java -version says
> > 20:27 test % java -version
> > openjdk version "11.0.14.1" 2022-02-08
> > OpenJDK Runtime Environment 18.9 (build 11.0.14.1+1)
> > OpenJDK 64-Bit Server VM 18.9 (build 11.0.14.1+1, mixed mode, sharing)
> >
> > It's entirely possible I'm going about fetching files off a web server
> > the wrong way; it's possible there's something there that's rather large,
> > but I doubt it's that large.
> >
> > What should I be doing instead?
> >
> > Thanks!
> > Graydon