Hi Christian -
Alas, the data belongs to a client and is confidential.
for $remote in $paths
let $name as xs:string := file:name($remote)
let $target as xs:string := file:resolve-path($name, $targetBase)
let $fetched as item() :=
  http:send-request(
    (: request element as in the original query quoted below :)
    <http:request method='get' override-media-type='application/octet-stream'
                  username='{$id}' password='{$pass}'/>,
    $remote)[2]
return
  if ($fetched instance of document-node())
  then file:write($target, $fetched)
  else if ($fetched instance of xs:base64Binary)
  then file:write-binary($target, $fetched)
  else file:write-text($target, $fetched)
works -- the query completes and files are written to disk. I suspect
that the server ignores override-media-type.
If I skip the type check and write everything returned out with
file:write-binary(), so that I have something on disk to pick the HTML
files back from, I get an error that the content isn't binary but
xs:untypedAtomic. (Which might imply that file:write-binary complains
that way when fed a document node.)
Which leads to "is there a way to get the type of an item?" I don't think
there is, but it seems like it would be extremely helpful for stuff like
this where "figure out what the web server feels like doing" is a concern.
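Short of a built-in that names the type, a typeswitch can at least report which of a known set of types came back. A minimal sketch, where local:kind is a made-up helper name and the cases are only the types that turned up in this thread:

```xquery
(: Hypothetical helper: returns a label for the first matching type. :)
declare function local:kind($item as item()) as xs:string {
  typeswitch ($item)
    case document-node()   return 'document-node()'
    case element()         return 'element()'
    case xs:base64Binary   return 'xs:base64Binary'
    case xs:untypedAtomic  return 'xs:untypedAtomic'
    case xs:string         return 'xs:string'
    default                return 'something else'
};

(: Label a few sample items; in the query above this would be $fetched. :)
for $item in (<a/>, 'text', xs:base64Binary('AAAA'))
return local:kind($item)
```

This only distinguishes the types listed as cases, so anything unexpected still ends up in the default branch.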
Thank you! It was a helpful hint.
-- Graydon
On Fri, Apr 8, 2022 at 9:44 AM Christian Grün wrote:
> Hi Graydon,
>
> Maybe it’s TagSoup that has problems converting some specific HTML
> files to XML. Did you try to write the responses to disk and parse
> them in a second step?
>
> If your input data is not confidential, could you possibly provide us
> with an example that runs out of the box?
>
> Best,
> Christian
>
>
> > I'm using the basexgui to run the following (minus some identifying actual values
> > defined previously in the query):
> >
> > (: for each path, retrieve the document :)
> > for $remote in $paths
> > let $name as xs:string := file:name($remote)
> > let $target as xs:string := file:resolve-path($name, $targetBase)
> > let $fetched :=
> >   http:send-request(
> >     <http:request method='get' override-media-type='application/octet-stream'
> >                   username='{$id}' password='{$pass}'/>,
> >     $remote)[2]
> > let $use as item() := try {
> >   html:parse($fetched)
> > } catch * {
> >   $fetched
> > }
> > return
> >   if ($use instance of document-node())
> >   then file:write($target, $use)
> >   else file:write-binary($target, $use)
> >
> > It works, in that I get exactly 100 documents retrieved. (There are
> > unfortunately 140+ documents in the list.)
> >
> > However, the query fails with an "out of main memory" error when using a
> > recent 10.0 beta or 9.7 with Xmx set to 2g. Setting Xmx to 16g with 9.7
> > produces the same "out of memory" error in the same length of time (about 5
> > minutes).
> >
> > java -version says
> > 20:27 test % java -version
> > openjdk version "11.0.14.1" 2022-02-08
> > OpenJDK Runtime Environment 18.9 (build 11.0.14.1+1)
> > OpenJDK 64-Bit Server VM 18.9 (build 11.0.14.1+1, mixed mode, sharing)
> >
> > It's entirely possible I'm going about fetching files off a web server
> > the wrong way; it's possible there's something there that's rather large,
> > but I doubt it's that large.
> >
> > What should I be doing instead?
> >
> > Thanks!
> > Graydon
>