On Mon, Dec 15, 2008 at 9:14 PM, Mark H. Wood <[email protected]> wrote:

> Most common formats other than plain text have some sort of tagging
> feature.  In some cases, few know about them so they aren't much
> used.  That could be fixed easily.
>
> > Microsoft docx documents look like a step in the right direction
>

> The older Office formats are readable programmatically too.
>
> But then that only works for MS Office documents.  Not for OpenOffice
> or Symphony.  Not for Acrobat.  We have tens of thousands of PDFs.


As the OP I would like to chip in again. I think people have missed the
point of what I was trying to say; I probably wasn't very clear. What I
meant was conventions for the placement/position of the title, authors, ISSN,
abstract, etc. so that they could be easily extracted. My assumption was that
the document is a PDF. Text can easily be extracted from PDFs, but extraction
on its own is not enough. We need to be able to determine what the values are
for title, authors, etc. If these were laid out in a standard way it would
help. That's all I'm saying. This was not supposed to be a discussion about
file formats!
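
As a rough illustration of why a standard layout would help, here is a
minimal Python sketch. It assumes a purely hypothetical convention (first
non-blank line is the title, second is a semicolon-separated author list,
then an "ISSN:" line, then an "Abstract" heading), and it assumes the text of
the first page has already been extracted from the PDF by some other tool.
The function name and the layout are made up for the example; none of this is
an existing DSpace feature or an agreed standard.

```python
import re


def parse_front_matter(text):
    """Pull title, authors, ISSN and abstract out of first-page text.

    Assumes the hypothetical layout convention described above; real
    documents would need a convention that publishers actually follow.
    """
    # Drop blank lines so that position alone identifies title and authors.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    fields = {"title": None, "authors": [], "issn": None, "abstract": None}

    if lines:
        fields["title"] = lines[0]
    if len(lines) > 1:
        fields["authors"] = [a.strip() for a in lines[1].split(";") if a.strip()]

    for i, ln in enumerate(lines[2:], start=2):
        m = re.match(r"ISSN:\s*(\d{4}-\d{3}[\dXx])$", ln)
        if m:
            fields["issn"] = m.group(1)
        elif ln.lower() == "abstract":
            # Everything after the heading is treated as the abstract.
            fields["abstract"] = " ".join(lines[i + 1:])
            break
    return fields


sample = (
    "A Study of Repository Metadata\n"
    "Jane Doe; John Smith\n"
    "ISSN: 1234-5678\n"
    "Abstract\n"
    "We propose a layout.\n"
    "It helps extraction.\n"
)
result = parse_front_matter(sample)
```

With a fixed layout the parsing is trivial; without one, the same fields have
to be guessed from fonts, positions and heuristics, which is the hard part.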

-- 
Regards,

Andrew M.
_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
