Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

Mathieu Stumpf Guntz Tue, 05 Jan 2016 10:34:28 -0800

Great, thank you for the news, and congratulations on this achievement. :)


On 05/01/2016 19:29, Bodhisattwa Mandal wrote:

Hi,

I am happy to report that Shrinivasan has created a Python script to automate the process on Linux systems. The script uploads the PDF files to Google Drive, downloads the OCRed text, and splits and merges the text files so that they match the PDF file. We have just tested the script on small files in Kannada and Bengali Wikisource, and it was successful. We are going to test the script with different types and sizes of files, and in other Indic languages, in the next few days.


The script is in https://github.com/tshrinivasan/OCR4wikisource
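
To illustrate the merge step described above, here is a minimal sketch. It is not taken from the OCR4wikisource repository: the file naming scheme (`page_<n>.txt`), the function names, and the page-marker format are all assumptions made for illustration, and the Google Drive upload and download steps are left out.

```python
import re
from pathlib import Path


def page_number(path: Path) -> int:
    """Return the trailing number of a hypothetical name such as 'page_12.txt'."""
    match = re.search(r"(\d+)$", path.stem)
    if match is None:
        raise ValueError(f"no page number in {path.name}")
    return int(match.group(1))


def merge_ocr_pages(text_dir: Path, output: Path) -> int:
    """Concatenate per-page OCR text files in numeric page order.

    Sorting by the extracted number (not by file name) keeps page 10
    after page 2. Returns the number of pages written.
    """
    pages = sorted(text_dir.glob("*.txt"), key=page_number)
    with output.open("w", encoding="utf-8") as out:
        for page in pages:
            # Assumed marker format so proofreaders can find page boundaries.
            out.write(f"\n----- page {page_number(page)} -----\n")
            out.write(page.read_text(encoding="utf-8").strip() + "\n")
    return len(pages)
```

The output file should live outside `text_dir`, otherwise a later run would pick it up as if it were a page.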

Regards,
Bodhisattwa

On 2 December 2015 at 17:21, Tobias Schönberg <tobias47...@gmail.com> wrote:


    I think it is important for non-technical readers of this list to
    separate the 2 issues in the discussion.

    1) OCR-Integration
    This is something WMF can help with, because they can make the
    connection between an OCR service and Mediawiki easier and
    automate certain steps.

    2) OCR
    WMF is not developing OCR software, and it would probably be a
    bad idea to reinvent the wheel. It would be far better if editors
    reached out to existing OCR software projects. Starting a
    discussion or filing a bug is an important first step in improving
    the situation.
    Tesseract-OCR (https://github.com/tesseract-ocr), for example, is an
    open-source OCR project (no bugs have been filed for, e.g.,
    Bengali). The mailing list
    (https://groups.google.com/forum/#!forum/tesseract-ocr)
    contains discussions about, e.g., Bengali
    (https://groups.google.com/forum/#!searchin/tesseract-ocr/Bengali).
    So I think the situation might not be good, but it is certainly on
    its way to getting better.
    Maybe WMF-India can fund a developer to work on Tesseract-OCR.
    Another idea would be to reach out to local universities. Maybe a
    few informatics students can improve the situation.

    -Tobias


    2015-12-01 19:51 GMT+01:00 ViswaPrabha (വിശ്വപ്രഭ)
    <viswapra...@gmail.com>:

        From the page that Alex linked:
        "On the other hand, using the service for converting document
        formats /is/ SaaSS, because it's something you could have done
        by running a suitable program (free, one hopes) in your own
        computer."

        Hundreds among us have burnt their fingers trying to develop a
        successful 'free' OCR tool for Indic languages, without any
        real luck until now.
        Until such a tool appears on the horizon, it is acceptable to
        use the Google facility.

        Especially so, because we are anyway dealing with 'free' input
        and output material.

        -Viswaprabha



        On 1 December 2015 at 21:49, Bodhisattwa Mandal
        <bodhisattwa.rg...@gmail.com> wrote:

            Hi Alex,

            Of course, building a free OCR tool is the only permanent
            solution, but WMF is not interested in building a new OCR
            tool right now. The language engineering team said at the
            conference that they don't have the infrastructure and
            expertise to build such software. That's why we have to
            rely on Google OCR, knowing very well about its
            profit-making intentions. It's just a temporary solution,
            but right now it's the best available alternative for us.

            Regards
            Bodhisattwa

            On 1 Dec 2015 21:12, "Alex Brollo" <alex.bro...@gmail.com> wrote:

                ... nevertheless, I found this about "SaaSS" very
                interesting:

                https://www.gnu.org/philosophy/who-does-that-server-really-serve.html


                So, building a true, excellent, and independent
                "wikisource multilingual OCR service" would be a
                better solution.

                Alex

                2015-12-01 16:06 GMT+01:00 Bodhisattwa Mandal
                <bodhisattwa.rg...@gmail.com>:

                    Hi Nemo,

                    Thanks for your interest. You can find the list of
                    languages supported by Google OCR at the following link:

                    https://support.google.com/drive/answer/176692?hl=en

                    Regards,
                    Bodhisattwa

                    Thanks for posting about the topic. Which Indic
                    languages are we talking about exactly? Are they
                    included in the recent FineReader versions now
                    used by the Internet Archive?

                    Nemo

                    _______________________________________________
                    Wikisource-l mailing list
                    Wikisource-l@lists.wikimedia.org
                    <mailto:Wikisource-l@lists.wikimedia.org>
                    https://lists.wikimedia.org/mailman/listinfo/wikisource-l





--
Bodhisattwa Mandal
Administrator, Bengali Wikipedia

''Imagine a world in which every single person on the planet is given free access to the sum of all human knowledge.''



_______________________________________________
Wikisource-l mailing list
Wikisource-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l
