Re: [basex-talk] html document retrieval runs out of main memory

2022-04-08 Thread Graydon Saunders
Of course there is a way!

Thank you; that is indeed helpful.

(I continue to be impressed that you can get your brain to hold all of this
at one time.)

-- Graydon

On Fri, Apr 8, 2022 at 6:10 PM Christian Grün wrote:

>> Which leads to "is there a way to get the type of an item?"
>
> You can use inspect:type for that [1].
>
> Hope this helps
> Christian
>
> [1] https://docs.basex.org/wiki/Inspection_Module#inspect:type


Re: [basex-talk] html document retrieval runs out of main memory

2022-04-08 Thread Christian Grün
> Which leads to "is there a way to get the type of an item?"

You can use inspect:type for that [1].

Hope this helps
Christian

[1] https://docs.basex.org/wiki/Inspection_Module#inspect:type
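
A minimal sketch of how inspect:type could be used for this, assuming a BaseX version that ships the function. The sample items are made up for illustration and stand in for the kinds of values an HTTP response body might come back as:

```xquery
(: Illustrative items only; none of them come from the original query. :)
for $item in (
  document { <html/> },          (: a parsed document :)
  xs:base64Binary('aGVsbG8='),   (: raw binary content :)
  xs:untypedAtomic('hello'),     (: an untyped atomic value :)
  'hello'                        (: a plain string :)
)
(: inspect:type returns the type of its argument as a string :)
return inspect:type($item)
```

In the download loop, the same call could be applied to the fetched item to see what the server actually returned before deciding how to write it.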


Re: [basex-talk] html document retrieval runs out of main memory

2022-04-08 Thread Graydon Saunders
Hi Christian -

Alas, the data is a client's and confidential.

for $remote in $paths
  let $name as xs:string := file:name($remote)
  let $target as xs:string := file:resolve-path($name,$targetBase)
  let $fetched as item() :=
    http:send-request(,
      $remote)[2]
  return
    if ($fetched instance of document-node())
    then file:write($target,$fetched)
    else if ($fetched instance of xs:base64Binary)
    then file:write-binary($target,$fetched)
    else file:write-text($target,$fetched)

works -- the query completes and files are written to disk.  I suspect
that the server ignores override-media-type.

If I don't check the returned type and instead write everything returned
out with file:write-binary(), so that I have something on disk to pick the
HTML files back from, I get an error that the content isn't binary, it is
xs:untypedAtomic.  (Which might imply that file:write-binary complains
that way when fed a document node.)
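
The type check in the query above could also be expressed with typeswitch. This is only a sketch: local:save and the sample file names are hypothetical, and the local sample items stand in for HTTP response bodies so that the query runs without a server:

```xquery
(: Hypothetical helper: pick the File Module function by item type. :)
declare function local:save($target as xs:string, $item as item()) {
  typeswitch ($item)
    (: nodes are serialized :)
    case node() return file:write($target, $item)
    (: binary items are written byte for byte :)
    case xs:base64Binary | xs:hexBinary return file:write-binary($target, $item)
    (: everything else, including xs:untypedAtomic, is written as text :)
    default return file:write-text($target, string($item))
};

(: Made-up sample items, keyed by the file name to write. :)
let $samples := map {
  'sample-doc.xml':  document { <html><body>hello</body></html> },
  'sample-bin.dat':  xs:base64Binary('aGVsbG8='),
  'sample-text.txt': xs:untypedAtomic('hello')
}
for $name in map:keys($samples)
let $target := file:resolve-path($name, file:temp-dir())
return local:save($target, $samples($name))
```

The default branch converts the item to a string first, so an xs:untypedAtomic value never reaches file:write-binary.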

Which leads to "is there a way to get the type of an item?"  I don't think
there is, but it seems like it would be extremely helpful for stuff like
this where "figure out what the web server feels like doing" is a concern.

Thank you! It was a helpful hint.

-- Graydon

On Fri, Apr 8, 2022 at 9:44 AM Christian Grün wrote:

> Hi Graydon,
>
> Maybe it’s TagSoup that has problems converting some specific HTML
> files to XML. Did you try to write the responses to disk and parse
> them in a second step?
>
> If your input data is not confidential, could you possibly provide us
> with an example that runs out of the box?
>
> Best,
> Christian
>
>
> > I'm using the basexgui to run (minus some identifying actual values
> > defined previously in the query)
> >
> > (: for each path, retrieve the document :)
> > for $remote in $paths
> >   let $name as xs:string := file:name($remote)
> >   let $target as xs:string := file:resolve-path($name,$targetBase)
> >   let $fetched :=
> >     http:send-request( override-media-type='application/octet-stream' username='{$id}' password='{$pass}' />,
> >       $remote)[2]
> >   let $use as item() := try {
> >     html:parse($fetched)
> >   } catch * {
> >     $fetched
> >   }
> >   return if ($use instance of document-node())
> >     then file:write($target,$use)
> >     else file:write-binary($target,$use)
> >
> > It works, in that I get exactly 100 documents retrieved.  (There are
> > unfortunately 140+ documents in the list.)
> >
> > However, the query fails with an "out of main memory" error when using a
> > recent 10.0 beta or 9.7 with Xmx set to 2g.  Setting Xmx to 16g with 9.7
> > produces the same "out of memory" error in the same length of time (about 5
> > minutes).
> >
> > java -version says
> > 20:27 test % java -version
> > openjdk version "11.0.14.1" 2022-02-08
> > OpenJDK Runtime Environment 18.9 (build 11.0.14.1+1)
> > OpenJDK 64-Bit Server VM 18.9 (build 11.0.14.1+1, mixed mode, sharing)
> >
> > It's entirely possible I'm going about fetching files off a web server
> > the wrong way; it's possible there's something there that's rather large,
> > but I doubt it's that large.
> >
> > What should I be doing instead?
> >
> > Thanks!
> > Graydon
>


Re: [basex-talk] html document retrieval runs out of main memory

2022-04-08 Thread Christian Grün
Hi Graydon,

Maybe it’s TagSoup that has problems converting some specific HTML
files to XML. Did you try to write the responses to disk and parse
them in a second step?

If your input data is not confidential, could you possibly provide us
with an example that runs out of the box?

Best,
Christian
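
A rough sketch of that two-step approach follows. It makes several assumptions: the request element (including method='get') is a guess, because the element was lost from the quoted query; the external variables and their empty defaults are placeholders; and html:parse needs TagSoup on the classpath to do real HTML conversion. The function names local:fetch-raw and local:parse-stored are hypothetical.

```xquery
(: Placeholder inputs; supply real values when running the query. :)
declare variable $paths as xs:string* external := ();
declare variable $targetBase as xs:string external := file:temp-dir();
declare variable $id as xs:string external := '';
declare variable $pass as xs:string external := '';

(: Step 1: store each response body unparsed. :)
declare function local:fetch-raw($remote as xs:string) {
  let $target := file:resolve-path(file:name($remote), $targetBase)
  let $body := http:send-request(
    (: assumed request element; override-media-type asks for a binary body :)
    <http:request method='get'
                  override-media-type='application/octet-stream'
                  username='{$id}' password='{$pass}'/>,
    $remote
  )[2]
  return file:write-binary($target, $body)
};

(: Step 2: parse one stored file; on failure, report it instead of aborting. :)
declare function local:parse-stored($file as xs:string) {
  try {
    file:write($file || '.xml', html:parse(file:read-binary($file)))
  } catch * {
    'could not parse ' || $file || ': ' || $err:description
  }
};

(: Run step 1 here; step 2 is meant to be run as a separate query. :)
for $remote in $paths
return local:fetch-raw($remote)
```

For the second step, local:parse-stored could be applied to each stored file in a later query, for example over the result of file:list or file:children for $targetBase. Parsing one file per call should make it easier to see which document, if any, triggers the memory problem.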


> I'm using the basexgui to run (minus some identifying actual values defined 
> previously in the query)
>
> (: for each path, retrieve the document :)
> for $remote in $paths
>   let $name as xs:string := file:name($remote)
>   let $target as xs:string := file:resolve-path($name,$targetBase)
>   let $fetched :=
>     http:send-request( override-media-type='application/octet-stream' username='{$id}' password='{$pass}' />,
>       $remote)[2]
>   let $use as item() := try {
>     html:parse($fetched)
>   } catch * {
>     $fetched
>   }
>   return if ($use instance of document-node())
>     then file:write($target,$use)
>     else file:write-binary($target,$use)
>
> It works, in that I get exactly 100 documents retrieved.  (There are 
> unfortunately 140+ documents in the list.)
>
> However, the query fails with an "out of main memory" error when using a 
> recent 10.0 beta or 9.7 with Xmx set to 2g.  Setting Xmx to 16g with 9.7 
> produces the same "out of memory" error in the same length of time (about 5 
> minutes).
>
> java -version says
> 20:27 test % java -version
> openjdk version "11.0.14.1" 2022-02-08
> OpenJDK Runtime Environment 18.9 (build 11.0.14.1+1)
> OpenJDK 64-Bit Server VM 18.9 (build 11.0.14.1+1, mixed mode, sharing)
>
> It's entirely possible I'm going about fetching files off a web server the 
> wrong way; it's possible there's something there that's rather large, but I 
> doubt it's that large.
>
> What should I be doing instead?
>
> Thanks!
> Graydon