Re: [Dspace-tech] standards to facilitate metadata extraction during text extraction

2008-12-15 Thread François Parmentier
When I did my PhD, this was still a research topic (automatic extraction of
data from the physical structure of a document).
Have a look at http://www.loria.fr/equipes/read/
I don't know whether any free or proprietary systems have appeared since then.

When your documents follow a regular layout, a fairly simple process may be
enough, but if the layout varies too much, it becomes a much more
complicated task!
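
To make the "rather simple process" concrete, here is a minimal sketch of
rule-based extraction for documents that all share one layout. The layout
rules are assumptions made up for illustration, not a convention of any real
publisher: the first non-empty line is the title, the second is the author
list, and the abstract runs from a line starting with "Abstract" to the next
blank line. The function name extract_metadata is hypothetical too.

```python
import re


def extract_metadata(text):
    """Pull title, authors and abstract from text that follows one fixed layout.

    Assumed (hypothetical) layout: line 1 is the title, line 2 is the author
    list, and the abstract starts at a line beginning with "Abstract".
    """
    lines = [line.strip() for line in text.splitlines()]
    non_empty = [line for line in lines if line]
    if len(non_empty) < 2:
        return None  # too little text to match the assumed layout

    title = non_empty[0]
    # Authors are separated by commas or by the word "and".
    authors = [name.strip()
               for name in re.split(r",|\band\b", non_empty[1])
               if name.strip()]

    abstract_parts = []
    in_abstract = False
    for line in lines:
        if not in_abstract:
            match = re.match(r"abstract\b[\s:.\-]*(.*)", line, flags=re.IGNORECASE)
            if match:
                in_abstract = True
                if match.group(1):
                    abstract_parts.append(match.group(1))
        elif line:
            abstract_parts.append(line)
        elif abstract_parts:
            break  # a blank line after some abstract text ends the abstract

    return {"title": title, "authors": authors,
            "abstract": " ".join(abstract_parts)}
```

Rules like these break as soon as a document departs from the assumed layout,
which is exactly where the task becomes much harder.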
--
François PARMENTIER / INIST-CNRS

On Sun, Dec 14, 2008 at 12:52 AM, Andrew Marlow 
marlow.and...@googlemail.com wrote:

 This may seem like a crazy or naive question, but is there any standard
 laid down by publishers or societies that authors must adhere to so that the
 extraction of metadata from articles can be easily automated? Having just
 performed a text extraction on a non-searchable PDF I see that there is no
 easy way to get any metadata out. But if a society had conventions for the
 layout of the article, specifying location and format of title, authors,
 abstract, bibliography etc, then it might be possible. I have seen a very
 regular visual layout in the PDFs from some places. Using OCR techniques it
 might be possible to locate blocks of interest. It might also be possible
 from a text extraction but that might be harder since all visual layout
 information is gone (at least it was with the tool I used). I wonder if this
 is being considered by anyone. I am very new to this area so please excuse
 me if this seems like a silly question.
 --
 Regards,

 Andrew M.


 --
 SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas, Nevada.
 The future of the web can't happen without you.  Join us at MIX09 to help
 pave the way to the Next Web now. Learn more and register at

 http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/
 ___
 DSpace-tech mailing list
 DSpace-tech@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/dspace-tech




Re: [Dspace-tech] standards to facilitate metadata extraction during text extraction

2008-12-15 Thread Christophe Dupriez
At the end of the 1990s, I used MS-Word forms and macros to let authors
enter metadata together with their articles. Even the references were
structured.


It seemed like a good idea (normalizing upfront).

It ended very badly because:
* MS-Word for Macintosh was not compatible for forms and macros;
* WordPerfect was still popular and presented as being compatible
  (which was not true for forms and macros).
  The worst case was one of the reviewers, who opened most of the articles
  in WordPerfect and saved them after adding comments...
* Asian versions of Word introduced characters unknown to Western
  versions;
* About a quarter of the authors did not understand the form.

Those (technical?) problems produced a terrible mess which took very
long to correct and delayed the publication of the paper.


Efficient cataloguers (possibly helped by a submission form like the
DSpace one, plus a better cataloguing form than the current one) will
always be better than a machine at taming the authors' diversity!


Have a nice day!

Christophe Dupriez

  





[Dspace-tech] standards to facilitate metadata extraction during text extraction

2008-12-13 Thread Andrew Marlow
This may seem like a crazy or naive question, but is there any standard laid
down by publishers or societies that authors must adhere to so that the
extraction of metadata from articles can be easily automated? Having just
performed a text extraction on a non-searchable PDF, I see that there is no
easy way to get any metadata out. But if a society had conventions for the
layout of the article, specifying the location and format of the title, authors,
abstract, bibliography, etc., then it might be possible. I have seen a very
regular visual layout in the PDFs from some places. Using OCR techniques it
might be possible to locate blocks of interest. It might also be possible
from a text extraction but that might be harder since all visual layout
information is gone (at least it was with the tool I used). I wonder if this
is being considered by anyone. I am very new to this area so please excuse
me if this seems like a silly question.
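
As a rough illustration of the "locate blocks of interest" idea, here is a
small sketch that guesses the title and author blocks from position and font
size alone. Everything in it is an assumption for the sake of the example:
the TextBlock structure stands in for whatever an OCR or layout-preserving
extraction tool would report, and the two heuristics (largest font on page 1
is the title, the next block below it is the author line) only hold for a
regular layout.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    text: str
    page: int         # 1-based page number
    top: float        # distance from the top of the page, in points
    font_size: float  # dominant font size of the block, in points


def locate_title_and_authors(blocks):
    """Guess the title and author blocks from position and font size alone."""
    # Only the first page matters here; order its blocks from top to bottom.
    first_page = sorted((b for b in blocks if b.page == 1), key=lambda b: b.top)
    if not first_page:
        return None, None
    # Heuristic 1: the title is set in the largest font on the first page.
    title = max(first_page, key=lambda b: b.font_size)
    # Heuristic 2: the author line is the first block below the title.
    below = [b for b in first_page if b.top > title.top]
    authors = below[0] if below else None
    return title, authors
```

With plain text extraction the top and font_size values are not available,
which is why the same guess is harder to make from extracted text alone.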
-- 
Regards,

Andrew M.