>These were in the header...I have to step away from the keyboard for
now...any ideas?

I confirmed this by flipping between POI 4.0.0 and 4.0.1 in our
dependencies while using the same Tika SNAPSHOT for both.  So this is
not caused by a different version of Tika.
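
For anyone who wants to reproduce the comparison by hand on a single file, here is a minimal sketch of the kind of unique-token diff described in the quoted message below. It is not the code that produced the report; the tokenizer (a plain `\w+` regex), the function names, and the sample strings are all assumptions made for illustration.

```python
import re
from collections import Counter


def token_counts(text):
    """Lowercase the text and count its word/number tokens (assumed tokenizer)."""
    return Counter(re.findall(r"\w+", text.lower()))


def unique_token_diffs(text_a, text_b, top_n=10):
    """Return up to top_n (token, count) pairs that occur in A but never in B."""
    counts_a = token_counts(text_a)
    counts_b = token_counts(text_b)
    only_in_a = Counter(
        {tok: n for tok, n in counts_a.items() if tok not in counts_b}
    )
    return only_in_a.most_common(top_n)


# Hypothetical extracted text: "old" includes header words, "new" has lost them.
old_text = "Despertar de la conciencia de la 2009 body"
new_text = "body"

for token, count in unique_token_diffs(old_text, new_text):
    print(f"{token}: {count}")
```

Run against the text extracted with each POI version, an empty result means no unique words were lost, even if the total token counts differ.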
On Wed, Nov 21, 2018 at 12:53 PM Tim Allison <talli...@apache.org> wrote:
>
> Yes, my suspicion holds up.  If you look at TOP_10_UNIQUE_TOKEN_DIFFS_A
> in content_diffs_with_exceptions.xlsx, there aren't any unique words
> we were extracting with 4.0.0 that we're not extracting with 4.0.1 in
> the vast majority of ppt files.  Note, too, that while the number of
> tokens differs, the number of unique tokens does not...for the
> majority of ppt.
>
> It looks like we have lost some content in docx template files, e.g.:
> http://162.242.228.174/docs/commoncrawl2/KQ/KQQ5VZ6BBBRCZPY4GDUIEMVPSGABOMM4
>
> We used to get 17 unique words from this, and we now get just
> 1...we've lost: de: 2 | la: 2 | 03: 1 | 06: 1 | 1: 1 | 16: 1 | 2009: 1
> | 3: 1 | conciencia: 1 | despertar: 1
>
> These were in the header...I have to step away from the keyboard for
> now...any ideas?
> On Wed, Nov 21, 2018 at 12:37 PM Tim Allison <talli...@apache.org> wrote:
> >
> > Reports are available here:
> > http://162.242.228.174/reports/reports_poi_4_0_1-rc1.tgz
> >
> > We have a bunch less content in ppt, but I _think_ this is because at
> > the Tika level we used to duplicate notes content, and we've fixed
> > that bug.  So, I think this is an improvement, but I need to check.
> > On Wed, Nov 21, 2018 at 12:05 PM Andreas Beeker <kiwiwi...@apache.org> 
> > wrote:
> > >
> > > On 21.11.18 10:47, pj.fanning wrote:
> > > > I found a few missing classes in poi-ooxml-schemas.jar.
> > >
> > > > Is this now a "-1", i.e. a must-have, since otherwise we get a lot of
> > > > Stack Overflow messages complaining about it,
> > > >
> > > > ... or a "0-", i.e. a nice-to-have, since until 4.0.2 is out users can
> > > > use the full schema?
> > >
> > >
> > > > I'm asking because there have already been a few changes to trunk
> > > > since I provided the RC, and we might have to do another Tika/POI
> > > > common crawl run... which I would like to avoid.
> > >
> > > Andi
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org
> > > For additional commands, e-mail: dev-h...@poi.apache.org
> > >
