Dear Qian, all,

I was privately asked to send over some information on extracting a formula
dataset from arXiv.org, which we have some experience with in Prof.
Michael Kohlhase's KWARC research group.

I gave a talk earlier in the year [1] where I reported that we have so far
managed to convert 960,000 TeX sources from arXiv into HTML5, and have
counted about 350 million formulas [2] in that collection, with
cross-referenced presentation MathML and (experimental) content MathML.
That covers the arXiv documents up to February 2016.

I greatly regret that we do not yet have a properly bundled public
dataset for reuse by the larger scientific community. The primary reason
for that delay has been a very long and painful effort to get official 
assistance from the arXiv team in overcoming the copyright challenges of 
the default arXiv license, as we ideally want to bundle the full documents.

That said, if there is interest in a formula-only dataset, I think that is
something that can safely be extracted and redistributed as a derivative
work, and we have always meant to do so eventually. I should mention that
the original LaTeX sources of the formulas are preserved in the
transformation. One note of caution: a very significant number of formulas,
especially large equations, use non-standard LaTeX commands, so using the
LaTeX syntax in isolation may be very noisy and impractical. Meanwhile, the
presentation MathML trees are of quite high quality, and definitely
machine-readable. The conversion tool we use, LaTeXML, reports a 95% success
rate in formula parsing (but we don't know whether that figure is inflated
by a disproportionate number of trivial formulas such as $x$).
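
To make the formula-only idea a bit more concrete, here is a minimal Python
sketch of how the TeX source could be pulled out of converted HTML5 and how
trivial formulas such as $x$ could be set aside before computing any
statistics. It is not our actual pipeline. It assumes the TeX source of each
formula is carried in the `alttext` attribute of the `<math>` element, and
the names `FormulaCollector`, `extract_formulas` and `is_trivial`, as well as
the one-character triviality threshold, are hypothetical choices made for
illustration only.

```python
from html.parser import HTMLParser


class FormulaCollector(HTMLParser):
    """Collect the TeX source attached to each <math> element."""

    def __init__(self):
        super().__init__()
        self.formulas = []

    def handle_starttag(self, tag, attrs):
        # Assumption: the TeX source is stored in the `alttext` attribute.
        if tag == "math":
            tex = dict(attrs).get("alttext")
            if tex:
                self.formulas.append(tex)


def extract_formulas(html):
    """Return the TeX sources of all formulas found in an HTML string."""
    collector = FormulaCollector()
    collector.feed(html)
    collector.close()
    return collector.formulas


def is_trivial(tex):
    # Hypothetical heuristic: a single symbol such as "x" counts as trivial.
    return len(tex.strip()) <= 1


# Tiny hand-written sample, not taken from the real collection.
sample = (
    '<p>Let <math alttext="x"><mi>x</mi></math> satisfy '
    '<math alttext="\\frac{x}{x}=1"><mfrac><mi>x</mi><mi>x</mi></mfrac>'
    '</math>.</p>'
)
formulas = extract_formulas(sample)
nontrivial = [f for f in formulas if not is_trivial(f)]
```

Splitting the formulas this way would let one report a parsing success rate
for the non-trivial subset separately, which is the number we currently do
not know.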

I'll keep an eye on the ongoing discussion and interest; I mostly just
wanted to answer the request we got to share some of our experience. Sorry
that I can't yet link to a dataset one can easily download and play with.
I'm curious to hear whether the Sage community would find value in such data.

Greetings,
Deyan

[1] http://prodg.org/talks/mnlp_billion_token_corpora
[2] http://prodg.org/talks/mnlp_billion_token_corpora#14

On Saturday, December 10, 2016 at 10:31:20 PM UTC+2, Qian Hong wrote:
>
> Thanks David, 
>
> On Sun, Dec 11, 2016 at 7:02 AM, David Roe <roed...@gmail.com> wrote:
> > The issue is with the simplification of the expression, rather than the 
> > latex function.  The following currently works: 
> > 
> > sage: latex(x.mul(x.power(-1),hold=True)) 
> > \frac{x}{x} 
> > 
> > Is this sufficient for your purposes?  If not, the relevant code is in 
> > sage/symbolic/expression.pyx (it's a big file: search for `hold=False`). 
>
> No, this is not enough; I'm looking for an automatic way
> rather than manually rewriting every expression. In other words, I'm
> looking for something like 100k lines of Sage expression samples, from
> which to automatically generate 100k LaTeX expressions and 100k images.
> Thank you for pointing out the relevant code, I'll start from there!
>
> BTW, does anyone have an idea where to find a large number of Sage
> expression samples?
> The im2latex-100k set is built on arxiv.org papers [1].
> I built another LaTeX formula set based on math.stackexchange.com open
> data [2].
> Is there some place containing a lot of Sage expressions that I can
> extract and reuse rather than constructing them from scratch?
>
> [1] https://github.com/Miffyli/im2latex-dataset/blob/master/latex_urls.txt 
> [2] https://archive.org/details/stackexchange 
>
>
>
> -- 
> Regards, 
> Qian Hong 
>

-- 
You received this message because you are subscribed to the Google Groups 
"sage-devel" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sage-devel+unsubscr...@googlegroups.com.
To post to this group, send email to sage-devel@googlegroups.com.
Visit this group at https://groups.google.com/group/sage-devel.
For more options, visit https://groups.google.com/d/optout.
