Charles wrote:
> On Mon, Mar 17, 2008 at 3:20 PM, Micah Cowan <[EMAIL PROTECTED]> wrote:
>>   echo http://something >> links
>>   echo http://anotherthing >> links
>>   echo wget http://something | at 23:30
>>   wget -i links
> 
> Sure, I used to do this. The only problem I have is that all the links
> have to be collected before wget can be started. With a common GUI
> download manager, links can be added at any time, and the download can
> be started as soon as the first link is added.

Is that true? I thought wget actually read the input file in a streaming
fashion.

If not, that would be the preferred change, not the session database.
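
If it does stream, then the "add links any time" behavior is already
available with existing tools. This is only a sketch, and it assumes
wget consumes "-i -" input incrementally, which is exactly the open
question above:

  touch links
  tail -f links | wget -i - &
  echo http://something >> links       # add links whenever you like
  echo http://anotherthing >> links

(tail -f keeps the pipe open, so wget would simply wait for more URLs,
much as a GUI manager's queue does.)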

As the specification stands, this "add a link" functionality is
actually already possible (just append the desired URL to the list
recorded in the wget invocation information). I just wasn't really
planning on creating a tool to do so, as that's not really what it's
for. But there's nothing preventing a user from tacking the URL on
himself with an editor, or writing a tool to do so automatically (or
using existing Unix text tools).
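
For example, if the session file keeps its URL list one per line (the
file name and layout here are purely illustrative):

  # tack another URL onto an in-progress session by hand
  echo 'http://yetanotherthing/' >> session.wget

If the list lived mid-file instead, sed or awk could splice it in just
as easily.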

>>  No, it won't be, and neither will it need to be. The files, even for
>>  large fetches, will almost certainly be quite small (relative to typical
>>  RDBMS application space), and will easily be parsed and the appropriate
>>  internal data structures set up in well under a second for most cases.
>>  However, I think you missed the mention that a binary-format alternative
>>  could be provided (with Wget using timestamping to judge whether it's
>>  out-of-date).
> 
> I agree, the metadata will be small. I'm just thinking that, at the
> frequency I use wget (I mean, I'm used to running it all the time from
> the command line), reading the metadata over and over on each
> invocation is a waste of resources (will wget need to do this?). The
> binary format is a good idea, though.

I don't expect that a single session's database would get frequent
reuse, though. However, it probably _would_ be used repeatedly while
you're working on a specific session; in that case, it's useful to have
the binary format.

One reason I wanted the binary format was that I was envisioning that
someone might want to use some tool to quickly fetch the URL
corresponding to a local file (or vice versa); in that case, rebuilding
the mappings between those two items every time the tool is invoked
would be extremely inefficient. For most other usages, though, the
binary format probably has diminishing returns. It's likely, in fact,
that we wouldn't want to store all the data from the text version in the
binary file; we probably just want to store indexing information and
expensive-to-build structures.
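
To make that concrete: the out-of-date check is just file timestamps,
so such a lookup tool might amount to no more than this sketch (every
file and command name below is hypothetical):

  # rebuild the binary index only when the text session file is newer
  if [ session.wget -nt session.idx ]; then
      wget-session-index session.wget session.idx
  fi
  # then map a local file back to the URL it came from
  wget-session-lookup session.idx somedir/index.html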

> How about using YAML for the text format? It's interoperable (most
> languages have a library to read it), very readable, has a formal syntax
> specification, and there is libyaml to do it in C. The YAML can be
> read into a dictionary and then serialized to create the binary
> format. And being simple and plain text, I believe people can use good
> old unix utilities to parse it ;)

I'll definitely consider it. I hadn't actually heard of it before, but
from what Wikipedia shows, it's a good, clean format. Thanks for the
suggestion! It might be worth considering for configuration file syntax
at some point in the future (it will need an overhaul, due to planned
support for URL-specific configuration settings).

Actually, though, I was considering using something based on HTTP
headers (though I'm not committed to that). But on the surface it looks
like YAML might easily allow that, too.
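
For instance, a flat run of header-style fields already parses as a
YAML mapping (the field names here are just illustrative, and HTTP
corner cases like folded continuation lines would not survive as-is):

  Content-Type: text/html
  Content-Length: 10240
  Last-Modified: Mon, 17 Mar 2008 15:20:00 GMT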

However, it's important to be able to parse the file even if there is
some corruption or malformed information in places, and especially if
it has been truncated (Wget abruptly killed).

One potential problem with YAML is that, from a data integrity
standpoint, it's often useful to explicitly denote both the start and
end of data (as HTTP itself doesn't, without chunked transfer-encodings
or content-lengths, anyway). Otherwise, it's not possible to distinguish
whether we've reached the end of the data or the data's been truncated.
It will be very useful for Wget to detect truncation, because that
would indicate a file it was in the middle of retrieving when it was
killed, and would mark the continuation point. YAML, like Python,
appears to use indentation to indicate nested data (which in general I
like), rather than braces or begin/end keywords.

Still, I imagine the problem is easily fixed by placing some line at the
end of the file to indicate completion.
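
Conveniently, YAML already defines exactly such markers: "---" opens a
document and "..." explicitly closes one. So a completed record might
look like the following (the keys are invented for illustration), and
any file that doesn't end with "..." is known to have been cut short:

  ---
  url: http://something/
  local-file: something/index.html
  status: complete
  ...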

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/