I'm mostly interested in differences between crawls with different PDFBox versions.

I already have one change where I wonder whether it will make any difference. The text stripper code has this comparison:

wordSpacing == Float.NaN

However, that is always false, because NaN never compares equal to anything, not even to itself. I wonder what differences will come up when using the correct check, which is

Float.isNaN(wordSpacing)
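
To make the difference concrete, here is a minimal standalone sketch. The class and method names are made up for illustration and are not taken from PDFBox; only the two expressions themselves come from the text stripper discussion above.

```java
// Hypothetical demo class, not PDFBox code: it only contrasts the two checks.
class NaNCheckDemo {

    // Mirrors the existing comparison. Per IEEE 754, NaN is unequal to
    // every value including itself, so this returns false for any input.
    static boolean brokenIsNaN(float wordSpacing) {
        return wordSpacing == Float.NaN;
    }

    // The corrected check: Float.isNaN is the standard library test for NaN.
    static boolean correctIsNaN(float wordSpacing) {
        return Float.isNaN(wordSpacing);
    }

    public static void main(String[] args) {
        float wordSpacing = Float.NaN;
        System.out.println(brokenIsNaN(wordSpacing));  // prints "false"
        System.out.println(correctIsNaN(wordSpacing)); // prints "true"
        System.out.println(correctIsNaN(1.5f));        // prints "false"
    }
}
```

So any branch guarded by the old comparison has never been taken, and with the fix it will start running for documents where the word spacing really is NaN.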

Tilman

On 03.04.2015 at 14:35, [email protected] wrote:
All,
  What do we think?

On Friday, April 3, 2015 at 8:23:11 AM UTC-4, [email protected] wrote:

    CommonCrawl currently has the WET format that extracts plain text
    from web pages.  My guess is that this is text stripping from
    text-y formats.  Let me know if I'm wrong!

    Would there be any interest in adding another format, WETT
    (WET-Tika), or in supplementing the current WET by using Tika to
    extract contents from binary formats too (PDF, MSWord, etc.)?

    Julien Nioche kindly carved out 220 GB for us to experiment with
    on TIKA-1302 <https://issues.apache.org/jira/browse/TIKA-1302> on
    a Rackspace VM.  But I'm wondering now if it would make more
    sense to have CommonCrawl run Tika as part of its regular
    process and make the output available in one of your standard
    formats.

    CommonCrawl consumers would get Tika output, and the Tika dev
    community (including its dependencies, PDFBox, POI, etc.) could
    get the stacktraces to help prioritize bug fixes.

    Cheers,

              Tim



