On Mon, Dec 15, 2008 at 09:36:11AM +0000, Robin Taylor wrote:
> I think the problem is that we wrap the data up in formats that make
> extraction difficult and then need to go to great lengths to try and
> extract that data. I don't know of any widely used, reliable methods as
> yet. Better to move towards formats that make extraction easy.

Most common formats other than plain text have some sort of tagging
feature.  In some cases, few know about them so they aren't much
used.  That could be fixed easily.

> Microsoft docx documents look like a step in the right direction to
> me. It's a normal Word document but is stored as xml and hence is
> readable programmatically.

The older Office formats are readable programmatically too.  More
readable, actually, since OOXML is very new, still only partially
documented, and not implemented anywhere, even at Microsoft.  There's
a store for document attributes inside the traditional Office format's
bag.  There's a nice Java library (POI) that can extract them.
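
To make "a store for document attributes" concrete, here is a minimal
sketch for the docx case raised above, since it needs only a zip reader
and an XML parser.  It assumes the core properties sit at the
conventional part name docProps/core.xml (strictly, the location should
be looked up through the package relationships), and the function name
read_core_properties is made up for illustration.  The older binary
formats would need a library such as POI, which is not shown here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the OOXML core-properties part.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_core_properties(docx_file):
    """Return a dict of the core properties found in a .docx package.

    docx_file may be a path or a binary file-like object.
    Assumption: the part lives at docProps/core.xml; if it is missing,
    an empty dict is returned.
    """
    with zipfile.ZipFile(docx_file) as package:
        try:
            xml_bytes = package.read("docProps/core.xml")
        except KeyError:
            return {}

    root = ET.fromstring(xml_bytes)
    props = {}
    # A few Dublin Core elements that Office stores for each document.
    for name in ("title", "creator", "subject", "description"):
        element = root.find("dc:" + name, NS)
        if element is not None and element.text:
            props[name] = element.text
    # Keywords live in the core-properties namespace, not in Dublin Core.
    keywords = root.find("cp:keywords", NS)
    if keywords is not None and keywords.text:
        props["keywords"] = keywords.text
    return props
```

Called on a file path or an open binary stream, this returns whatever
title, creator, subject, description and keywords the author filled in,
which in practice is often very little.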

But then that only works for MS Office documents.  Not for OpenOffice
or Symphony.  Not for Acrobat.  We have tens of thousands of PDFs.  We
have audio and video streams waiting in the wings.

And we still need a system for assigning meanings to the tags.

>                           In addition the author can add their own
> tags, so there is no reason why they should not tag the abstract,
> references, etc. In theory it should be easy to then extract that
> information.

See the subject line.  If everybody makes up his own tags then there
is no standard, and software cannot make use of the tags without being
told, for each individual provider's profile, what to look for and
what they mean.  Bibliographic software like EndNote shows us what we
wind up with: hundreds of format modules to be maintained.  We can do
that but I'd rather have something systematic.  (BTW EndNote or one of
its brethren might be able to serve the original request.)
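
A rough sketch of what "being told, for each individual provider's
profile" amounts to in software.  Everything here is hypothetical: the
provider names, their tag names, and the choice of Dublin Core style
names as the shared vocabulary are only for illustration.

```python
# Hypothetical per-provider profiles: each maps a provider's own tag names
# onto one shared vocabulary (Dublin Core style names, chosen for illustration).
PROFILES = {
    "provider_a": {"Abstract": "description", "Author": "creator", "Refs": "references"},
    "provider_b": {"summary": "description", "byline": "creator"},
}


def normalize(provider, tags):
    """Translate provider-specific tags into the shared vocabulary.

    Returns (known, unknown): the tags the profile covers, renamed, and
    the tags it does not cover, left under their original names.
    Raises KeyError when no profile exists for the provider.
    """
    profile = PROFILES.get(provider)
    if profile is None:
        raise KeyError(f"no profile for provider {provider!r}")
    known, unknown = {}, {}
    for name, value in tags.items():
        if name in profile:
            known[profile[name]] = value
        else:
            unknown[name] = value
    return known, unknown
```

Every new provider means another entry in PROFILES, and every tag a
provider invents later lands in the unknown pile until someone updates
its profile.  That is the maintenance burden a shared standard would
remove.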

If there is no standard now, then maybe it's up to the document
repository community (that's us) to lay the groundwork for some
standardization and champion the idea until it's accepted.

-- 
Mark H. Wood, Lead System Programmer   [email protected]
Friends don't let friends publish revisable-form documents.


_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
