Dear all,

Sorry if this question doesn’t belong here, but the TeX.SE community hasn’t 
given helpful answers other than recommending de-macro and other scripts that 
often fail.

I’m a beginning data scientist who wants software to be able to process 
scholarly papers. It is possible to extract text and structure from DVI, PDF 
and PS files using machine learning, but the result can never be 100% correct, 
which is inherent to ML. This is why I’m thinking about using the TeX sources 
of the papers themselves. However, custom macros in TeX are notoriously hard 
to remove completely, so it is difficult to standardize TeX files without 
introducing inaccuracies. Is this problem solvable with LuaTeX, since Lua 
gives authors more control? Or should I forget about standardizing TeX files 
in any sense and focus on better methods to extract information from PDF 
files?
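
To make the difficulty concrete, here is a minimal Python sketch of the kind 
of purely textual macro removal I have in mind. It is my own illustration 
under simplifying assumptions, not the algorithm de-macro uses; the names 
expand_simple_macros and read_group are hypothetical. It only handles 
\newcommand definitions with brace-delimited mandatory arguments and does a 
single pass, so it does not cover \def, optional arguments, macros defined in 
terms of other macros, catcode changes or conditionals, which is where the 
inaccuracies creep in.

```python
import re

# Simple definitions such as \newcommand{\R}{\mathbb{R}} or
# \newcommand{\norm}[1]{\lVert #1 \rVert}. The body may contain at most
# one level of nested braces (an assumption of this sketch).
DEF_RE = re.compile(
    r"\\newcommand\{\\([A-Za-z]+)\}(?:\[(\d)\])?\{((?:[^{}]|\{[^{}]*\})*)\}"
)
USE_RE = re.compile(r"\\([A-Za-z]+)")


def read_group(text, pos):
    """Return (content, end_index) of the brace group opening at text[pos]."""
    depth = 0
    for i in range(pos, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[pos + 1:i], i + 1
    raise ValueError("unbalanced braces")


def expand_simple_macros(source):
    """Strip simple \\newcommand definitions and expand their uses once."""
    macros = {
        m.group(1): (int(m.group(2) or 0), m.group(3))
        for m in DEF_RE.finditer(source)
    }
    body = DEF_RE.sub("", source)
    out, pos = [], 0
    while True:
        m = USE_RE.search(body, pos)
        if m is None:
            out.append(body[pos:])
            break
        out.append(body[pos:m.start()])
        name = m.group(1)
        if name not in macros:
            # Not a user-defined macro: copy the control word unchanged.
            out.append(m.group(0))
            pos = m.end()
            continue
        nargs, template = macros[name]
        end, args = m.end(), []
        for _ in range(nargs):
            if end >= len(body) or body[end] != "{":
                raise ValueError(f"\\{name}: expected a braced argument")
            arg, end = read_group(body, end)
            args.append(arg)
        # Substitute #1, #2, ... simultaneously with the collected arguments.
        out.append(
            re.sub(r"#(\d)", lambda p: args[int(p.group(1)) - 1], template)
        )
        pos = end
    return "".join(out)


if __name__ == "__main__":
    sample = (
        "\\newcommand{\\R}{\\mathbb{R}}\n"
        "\\newcommand{\\norm}[1]{\\lVert #1 \\rVert}\n"
        "$x \\in \\R$, $\\norm{x}$"
    )
    print(expand_simple_macros(sample).strip())
```

Even on this small scale the approach treats TeX as text rather than running 
its expansion rules, which is why I am asking whether LuaTeX offers a more 
reliable route.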

Sincerely,

Ying Zhou
