>These were in the header...I have to step away from the keyboard for now...any ideas?
I confirmed this by flipping btwn 4.0.0 and 4.0.1 in our dependencies
and using our Tika's SNAPSHOT for both. This is not caused by a
different version of Tika.

On Wed, Nov 21, 2018 at 12:53 PM Tim Allison <talli...@apache.org> wrote:
>
> Y, my suspicion holds up. If you look at TOP_10_UNIQUE_TOKEN_DIFFS_A
> in content_diffs_with_exceptions.xlsx, there aren't any unique words
> we were extracting with 4.0.0 that we're not extracting with 4.0.1 in
> the vast majority of ppt files. Note, too, that while the number of
> tokens differs, the number of unique tokens does not...for the
> majority of ppt.
>
> It looks like we have lost some content docx template files, e.g.:
> http://162.242.228.174/docs/commoncrawl2/KQ/KQQ5VZ6BBBRCZPY4GDUIEMVPSGABOMM4
>
> We used to get 17 unique words from this, and we now get just
> 1...we've lost: de: 2 | la: 2 | 03: 1 | 06: 1 | 1: 1 | 16: 1 | 2009: 1
> | 3: 1 | conciencia: 1 | despertar: 1
>
> These were in the header...I have to step away from the keyboard for
> now...any ideas?
> On Wed, Nov 21, 2018 at 12:37 PM Tim Allison <talli...@apache.org> wrote:
> >
> > Reports are available here:
> > http://162.242.228.174/reports/reports_poi_4_0_1-rc1.tgz
> >
> > We have a bunch less content in ppt, but I _think_ this is because at
> > the Tika level we used to duplicate notes content, and we've fixed
> > that bug. So, I think this is an improvement, but I need to check.
> > On Wed, Nov 21, 2018 at 12:05 PM Andreas Beeker <kiwiwi...@apache.org>
> > wrote:
> > >
> > > On 21.11.18 10:47, pj.fanning wrote:
> > > > I found a few missing classes in poi-ooxml-schemas.jar.
> > >
> > > Is this now a "-1", i.e. a must-have otherwise we get a lot of
> > > stackoverflow messages complaining about it
> > >
> > > ... or a "0-", i.e. nice-to-have, but until 4.0.2 is out, the users can
> > > use the full-schema
> > >
> > >
> > > I'm asking about this, as there were already a few changes to the trunk
> > > since I've provided the RC and we might have to do another Tika- / POI-
> > > common crawl run again... which I would like to avoid.
> > >
> > > Andi
> > >

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org
For additional commands, e-mail: dev-h...@poi.apache.org
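
For anyone reading the thread without the reports open: the "unique token diffs" figures quoted above (e.g. "de: 2 | la: 2 | ...") list tokens that appear in one extract but not in the other, with their counts. Below is a minimal sketch of that idea in Python. It is an illustration only, not the code that produced the reports; the function names, the simple `\w+` tokenizer, and the sample strings are all assumptions made for this example, and the real analyzer almost certainly tokenizes differently.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on runs of word characters.
    # The real report pipeline likely uses a different analyzer.
    return re.findall(r"\w+", text.lower())


def unique_token_diffs(text_a, text_b, top_n=10):
    """Return up to top_n (token, count) pairs for tokens found in
    text_a but absent from text_b, most frequent first."""
    counts_a = Counter(tokenize(text_a))
    tokens_b = set(tokenize(text_b))
    only_in_a = [(tok, n) for tok, n in counts_a.items() if tok not in tokens_b]
    # Sort by count descending, then alphabetically for a stable order.
    only_in_a.sort(key=lambda pair: (-pair[1], pair[0]))
    return only_in_a[:top_n]


def format_diffs(diffs):
    # Render in the "token: count | token: count" style used in the thread.
    return " | ".join(f"{tok}: {n}" for tok, n in diffs)


if __name__ == "__main__":
    # Hypothetical extracts: "old" still has header text, "new" lost it.
    old_extract = "El despertar de la conciencia de la 2009 Titulo"
    new_extract = "Titulo"
    print(format_diffs(unique_token_diffs(old_extract, new_extract)))
```

Running the comparison in both directions (A against B, then B against A) is what lets you tell "content was lost" apart from "content merely changed": tokens unique to the old extract with nothing unique on the new side point to a plain loss, as with the header text described above.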