[Dspace-tech] standards to facilitate metadata extraction during text extraction

Andrew Marlow Sat, 13 Dec 2008 15:53:06 -0800

This may seem like a crazy or naive question, but is there any standard laid
down by publishers or societies that authors must adhere to so that the
extraction of metadata from articles can be easily automated? Having just
performed a text extraction on a non-searchable PDF I see that there is no
easy way to get any metadata out. But if a society had conventions for the
layout of the article, specifying the location and format of the title, authors,
abstract, bibliography, etc., then it might be possible. I have seen a very
regular visual layout in the PDFs from some places. Using OCR techniques it
might be possible to locate blocks of interest. It might also be possible
from a text extraction but that might be harder since all visual layout
information is gone (at least it was with the tool I used). I wonder if this
is being considered by anyone. I am very new to this area so please excuse
me if this seems like a silly question.
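
To make the idea concrete, here is a minimal sketch in Python of what
extraction from plain text might look like if such a convention existed. The
convention is purely hypothetical (I know of no society that mandates it):
the first non-blank line is the title, the second non-blank line lists the
authors, and the abstract starts at a line beginning with "Abstract" and runs
to the next blank line. The function name guess_metadata is also just made up
for illustration.

```python
import re


def guess_metadata(text):
    """Guess title, authors and abstract from plain extracted text.

    Assumes a HYPOTHETICAL layout convention, not any real standard:
      - first non-blank line  -> title
      - second non-blank line -> authors, separated by commas or "and"
      - a line starting with "Abstract" opens the abstract, which runs
        until the next blank line
    """
    lines = [ln.strip() for ln in text.splitlines()]
    non_blank = [ln for ln in lines if ln]

    title = non_blank[0] if non_blank else None

    authors = []
    if len(non_blank) > 1:
        # Split on commas or the standalone word "and".
        pieces = re.split(r",|\band\b", non_blank[1])
        authors = [p.strip() for p in pieces if p.strip()]

    abstract = None
    for i, ln in enumerate(lines):
        m = re.match(r"abstract\b[:.\s]*(.*)", ln, re.IGNORECASE)
        if not m:
            continue
        # Text on the same line as the "Abstract" heading, if any.
        parts = [m.group(1)] if m.group(1) else []
        for nxt in lines[i + 1:]:
            if not nxt:
                if parts:
                    break      # blank line ends the abstract
                continue       # skip blanks right after the heading
            parts.append(nxt)
        abstract = " ".join(parts)
        break

    return {"title": title, "authors": authors, "abstract": abstract}
```

Calling guess_metadata() on the output of a text extraction returns a dict
with "title", "authors" and "abstract" keys. Without an agreed convention the
guesses will often be wrong, which is really the point of my question.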
-- 
Regards,


Andrew M.


