On Fri, May 3, 2019 at 2:32 PM Ying Zhou <[email protected]> wrote:

> Dear all,
>
> Sorry if this question doesn’t belong here, but the TeX.SE community hasn’t
> given helpful answers other than recommending de-macro and other scripts
> that often fail.
>
> I’m a beginning data scientist who wants to be able to get software to
> process scholarly papers. While it is possible to extract text and
> structure from DVI, PDF and PS files using machine learning, it
> can never be 100% correct, which is inherent to ML. This is why I’m
> thinking about using the TeX sources of papers themselves. However, custom
> macros in TeX are notoriously hard to completely remove so that the TeX
> files can be standardized without introducing inaccuracies. Is this problem
> possible to solve using LuaTeX, since Lua gives authors more control? Or
> shall I completely forget about standardizing TeX files in any sense and
> focus on better methods to extract information from PDF files?
>
>
1) If by "standardizing TeX files" you mean an ISO standard, then yes, *in
principle* it is possible.

2) A more concrete goal is to use tagged PDF. You can promote custom tags
(read: a de facto standard XML application) for your content.

PS
It's a pity that the TeX community has no access to the PDF 2.0 ISO
standard.

-- 
luigi
