On Fri, May 3, 2019 at 2:32 PM Ying Zhou <[email protected]> wrote:
> Dear all,
>
> Sorry if this question doesn’t belong here, but the TeX.SE community hasn’t
> given helpful answers other than recommending de-macro and other scripts
> that often fail.
>
> I’m a beginning data scientist who wants to be able to get software to
> process scholarly papers. While it is possible to extract text and
> structure from DVI, PDF and PS files using machine learning, the result
> can never be 100% correct, which is inherent to ML. This is why I’m
> thinking about using the TeX sources of the papers themselves. However, custom
> macros in TeX are notoriously hard to remove completely, which makes it
> difficult to standardize the TeX files without introducing inaccuracies.
> Is this problem possible to solve using LuaTeX, since Lua gives authors
> more control? Or shall I completely forget about standardizing TeX files
> in any sense and focus on better methods to extract information from PDF
> files?
>

1) If by "standardizing TeX files" you mean an ISO standard, then yes, *in principle* it is possible.

2) A more concrete goal is to use tagged PDF. You can promote custom tags (read: a de facto standard XML application) for your content.

PS: It's a pity that the TeX community has no access to the PDF 2.0 ISO standard.

--
luigi
