Re: [Dspace-tech] standards to facilitate metadata extraction during text extraction

2008-12-15 Thread Christophe Dupriez

At the end of the 1990s, I used MS-Word forms and macros to let authors
enter metadata together with their articles. Even the references were
structured.


It seemed like a good idea (normalizing upfront).

It ended very badly because:
* Macintosh MS-Word was not compatible for forms and macros;
* WordPerfect was still popular and was presented as being compatible
(which was not true for forms and macros);
  The worst was one of the reviewers, who opened most of the articles in
WordPerfect and saved them after adding comments...
* Asian versions of Word introduced characters unknown to Western
versions;
* About a quarter of the authors did not understand the form.

Those (technical?) problems produced a terrible mess which took very
long to correct and delayed the publication of the paper.


Efficient cataloguers (possibly helped by a submission form like the
DSpace one, plus a better cataloguing form than the current one) will
always be better than a machine at taming the authors' "diversity"!


Have a nice day!

Christophe Dupriez

François Parmentier wrote:
During my PhD, this was still a research subject (automatic extraction 
of data from physical structure of a document).

Have a look at http://www.loria.fr/equipes/read/
I don't know whether there have been free or proprietary systems since 
then.


When the layout of your documents is a regular one, some rather simple 
process may be useful, but if it varies too much, it is a much more 
complicated task!

--
François PARMENTIER / INIST-CNRS

On Sun, Dec 14, 2008 at 12:52 AM, Andrew Marlow
<marlow.and...@googlemail.com> wrote:


This may seem like a crazy or naive question, but is there any
standard laid down by publishers or societies that authors must
adhere to so that the extraction of metadata from articles can be
easily automated? Having just performed a text extraction on a
non-searchable PDF I see that there is no easy way to get any
metadata out. But if a society had conventions for the layout of
the article, specifying location and format of title, authors,
abstract, bibliography etc, then it might be possible. I have seen
a very regular visual layout in the PDFs from some places. Using
OCR techniques it might be possible to locate blocks of interest.
It might also be possible from a text extraction but that might be
harder since all visual layout information is gone (at least it
was with the tool I used). I wonder if this is being considered by
anyone. I am very new to this area so please excuse me if this
seems like a silly question.
-- 
Regards,


Andrew M.


--
SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las
Vegas, Nevada.
The future of the web can't happen without you.  Join us at MIX09
to help
pave the way to the Next Web now. Learn more and register at
http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/
___
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net

https://lists.sourceforge.net/lists/listinfo/dspace-tech




Re: [Dspace-tech] standards to facilitate metadata extraction during text extraction

2008-12-15 Thread François Parmentier

During my PhD, this was still a research subject (automatic extraction of
data from physical structure of a document).
Have a look at http://www.loria.fr/equipes/read/
I don't know whether there have been free or proprietary systems since then.

When the layout of your documents is a regular one, some rather simple
process may be useful, but if it varies too much, it is a much more
complicated task!
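
A minimal sketch of such a "rather simple process", assuming the text
extraction keeps line breaks and every article follows one fixed layout:
title on the first non-empty line, a comma-separated author list on the
second, and an abstract that starts after a line reading "Abstract" and
ends at the next blank line. The function name `extract_metadata` and
these layout rules are hypothetical, chosen only for illustration; they
do not come from any publisher's standard or from DSpace.

```python
import re


def extract_metadata(text):
    """Pull title, authors and abstract out of text with a known layout.

    Assumed (hypothetical) layout: line 1 = title, line 2 = authors
    separated by commas, then an "Abstract" line followed by the abstract
    paragraph, which ends at the first blank line.
    """
    lines = [line.strip() for line in text.splitlines()]
    non_empty = [line for line in lines if line]
    if len(non_empty) < 2:
        # The layout does not match; better to hand it to a cataloguer.
        return None

    title = non_empty[0]
    authors = [a.strip() for a in non_empty[1].split(",") if a.strip()]

    abstract_lines = []
    in_abstract = False
    for line in lines:
        if not in_abstract:
            # Look for the marker line that opens the abstract.
            if re.fullmatch(r"abstract[:.]?", line, flags=re.IGNORECASE):
                in_abstract = True
            continue
        if not line:
            if abstract_lines:
                break  # a blank line closes the abstract paragraph
            continue   # skip blank lines right after the marker
        abstract_lines.append(line)

    return {
        "title": title,
        "authors": authors,
        "abstract": " ".join(abstract_lines),
    }
```

As soon as the layout varies (two-column pages, authors spread over
several lines, no "Abstract" marker), rules like these break down, which
is where the task becomes much harder.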
--
François PARMENTIER / INIST-CNRS

On Sun, Dec 14, 2008 at 12:52 AM, Andrew Marlow <
marlow.and...@googlemail.com> wrote:

> This may seem like a crazy or naive question, but is there any standard
> laid down by publishers or societies that authors must adhere to so that the
> extraction of metadata from articles can be easily automated? Having just
> performed a text extraction on a non-searchable PDF I see that there is no
> easy way to get any metadata out. But if a society had conventions for the
> layout of the article, specifying location and format of title, authors,
> abstract, bibliography etc, then it might be possible. I have seen a very
> regular visual layout in the PDFs from some places. Using OCR techniques it
> might be possible to locate blocks of interest. It might also be possible
> from a text extraction but that might be harder since all visual layout
> information is gone (at least it was with the tool I used). I wonder if this
> is being considered by anyone. I am very new to this area so please excuse
> me if this seems like a silly question.
> --
> Regards,
>
> Andrew M.
