At the end of the 1990s, I used MS-Word forms and macros to let authors
enter metadata together with their articles. Even the references were
structured.
It seemed like a good idea (normalizing upfront).
It ended very badly because:
* MS-Word for Macintosh was not compatible for forms and macros;
* WordPerfect was still popular and was presented as compatible
(which was not true for forms and macros);
The worst case was a reviewer who opened most of the articles in
WordPerfect and saved them after adding comments...
* Asian versions of Word introduced characters that Western versions
did not recognize;
* About a quarter of the authors did not understand the form.
Those (technical?) problems produced a terrible mess that took a very
long time to correct and delayed the publication of the paper.
Efficient cataloguers (possibly helped by a submission form like the
DSpace one, plus a better cataloguing form than the current one) will
always be better than a machine at taming the diversity of authors!
Have a nice day!
Christophe Dupriez
François Parmentier wrote:
During my PhD, this was still a research subject (automatic extraction
of data from the physical structure of a document).
Have a look at http://www.loria.fr/equipes/read/
I don't know whether there have been free or proprietary systems since
then.
When the layout of your documents is regular, a rather simple
process may be useful, but if the layout varies too much, it becomes a
much more complicated task!
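To illustrate the "rather simple process" case, here is a minimal Python
sketch. It is not taken from any existing tool. It assumes a hypothetical
fixed convention in the extracted text: the first non-empty line is the
title, the second is the author list, the abstract follows a line reading
"Abstract" and ends at the first blank line, and the bibliography follows
a line reading "References". The function name extract_metadata and the
sample text are made up for the example.

```python
import re

# Section headings the assumed (hypothetical) layout relies on.
HEADINGS = ("abstract", "references")

SAMPLE = """Automatic Metadata Extraction

Jane Doe, John Smith and Ann Lee

Abstract
We study layout-based extraction.
It works on regular layouts.

1. Introduction
Body text here.

References
[1] A. Author. Some title. 2001.
[2] B. Writer. Another title. 2005.
"""


def extract_metadata(text):
    """Extract title, authors, abstract and references from plain text
    that follows the fixed layout described above."""
    lines = [ln.strip() for ln in text.splitlines()]
    non_empty = [ln for ln in lines if ln]
    if len(non_empty) < 2:
        raise ValueError("layout not recognised: title and author lines expected")

    title, author_line = non_empty[0], non_empty[1]
    # Authors are assumed to be separated by commas or the word "and".
    authors = [a.strip() for a in re.split(r",|\band\b", author_line) if a.strip()]

    sections = {}
    current = None
    for ln in lines:
        heading = ln.lower().rstrip(":")
        if heading in HEADINGS:
            current = heading
            sections[current] = []
        elif current == "abstract" and not ln and sections["abstract"]:
            # The abstract ends at the first blank line after its text.
            current = None
        elif current is not None and ln:
            sections[current].append(ln)

    return {
        "title": title,
        "authors": authors,
        "abstract": " ".join(sections.get("abstract", [])),
        "references": sections.get("references", []),
    }


if __name__ == "__main__":
    print(extract_metadata(SAMPLE))
```

As soon as one author puts the affiliation on the second line, or names
the bibliography "Bibliography" instead of "References", a sketch like
this returns wrong values without complaining, which is the "varies too
much" problem in practice.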
--
François PARMENTIER / INIST-CNRS
On Sun, Dec 14, 2008 at 12:52 AM, Andrew Marlow
marlow.and...@googlemail.com
wrote:
This may seem like a crazy or naive question, but is there any
standard laid down by publishers or societies that authors must
adhere to so that the extraction of metadata from articles can be
easily automated? Having just performed a text extraction on a
non-searchable PDF I see that there is no easy way to get any
metadata out. But if a society had conventions for the layout of
the article, specifying location and format of title, authors,
abstract, bibliography etc, then it might be possible. I have seen
a very regular visual layout in the PDFs from some places. Using
OCR techniques it might be possible to locate blocks of interest.
It might also be possible from a text extraction but that might be
harder since all visual layout information is gone (at least it
was with the tool I used). I wonder if this is being considered by
anyone. I am very new to this area so please excuse me if this
seems like a silly question.
--
Regards,
Andrew M.
--
SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las
Vegas, Nevada.
The future of the web can't happen without you. Join us at MIX09
to help
pave the way to the Next Web now. Learn more and register at
http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/
___
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
Christophe Dupriez
Computer scientist, DESTIN inc. SSEB
Information processing systems development
rue des Palais 44, boîte 1, B-1030 Bruxelles, Belgique
christophe.dupr...@destin.be
Tel: +32/2/216.66.15, Fax: +32/2/242.97.25, Mobile: +32/475.77.62.11
http://www.destin.be