Let me actually take a look before answering. Sorry!

On Thu, Dec 13, 2018 at 5:30 PM Tim Allison <talli...@apache.org> wrote:
> Thank you for reading the reports!!!
>
> The files are very likely broken. I can take a look. The change was
> probably because of an "upgrade" to junrar. Should I revert to the
> version we used in 1.19.1?
> On Thu, Dec 13, 2018 at 1:34 PM Luís Filipe Nassif <lfcnas...@gmail.com>
> wrote:
> >
> > Hi Tim,
> >
> > Reading your great reports, I also saw some new exceptions with RAR files
> > in the likely broken folder, but it seems Tika was able to extract some
> > text from them before. Do you know if those files are really broken and
> > why Tika extracted text from them before?
> >
> > Thank you,
> > Luis
> >
> > On Thu, Dec 13, 2018 at 13:02, Tim Allison <talli...@apache.org>
> > wrote:
> >
> > > Reports are here:
> > >
> > > http://162.242.228.174/reports/tika_1_20-pre-rc1.zip
> > >
> > > I'm going to revert the mp4 parser, and commit the few dependency
> > > upgrades I ran.
> > >
> > > The _major_ difference in content for ppt is explained by the
> > > duplication of header/footer info. To confirm this, note that the
> > > values for "num_unique_tokens_a" and "num_unique_tokens_b" are
> > > identical for nearly all ppt->ppt, but there are far more tokens in
> > > "num_tokens_a" vs "num_tokens_b".
> > >
> > > I also see that we're losing content in x-java and x-groovy, etc., but
> > > that's because we're now suppressing the style markup that our parser
> > > was (incorrectly, IMHO) inserting -- check the values in
> > > "top_10_unique_token_diffs_a", e.g.: rgb: 15 | color: 14 | font: 9 |
> > > 0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 |
> > > weight: 3 | family: 2
> > >
> > > In short, I think we're good to go. Will roll rc1 later today or
> > > (more likely) tomorrow unless there are objections.
> > > On Mon, Dec 10, 2018 at 9:37 PM Tim Allison <talli...@apache.org> wrote:
> > > >
> > > > Any blockers on 1.20? I'm going to kick off the regression tests
> > > > shortly.
> > > > On Fri, Nov 30, 2018 at 7:39 PM <loo...@gmail.com> wrote:
> > > > >
> > > > > Hi,
> > > > > On Wed, 21 Nov 2018 at 13:00, Tim Allison <talli...@apache.org> wrote:
> > > > >
> > > > > > Dave,
> > > > > > Should I try to get the Docker plugin working again?
> > > > > >
> > > > >
> > > > > That would be great. I think I may have gone down the wrong path
> > > > > building an image at package time, as there doesn't seem to be an
> > > > > easy way to publish it as an Apache labelled org on Dockerhub unless
> > > > > it builds from source.
> > > > >
> > > > > I have some time over the weekend, so could update to where I got to
> > > > > and see what you think.
> > > > >
> > > > > Cheers,
> > > > > Dave
> > > >
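
For anyone following the ppt reasoning quoted above, here is a minimal sketch of why repeated header/footer text raises the total token count while leaving the unique token count unchanged. It is only an illustration: the sample strings, the whitespace tokenizer, and the token_stats helper are all assumptions made up for this example, not the tokenization tika-eval actually uses and not data from the reports.

```python
def token_stats(text):
    """Count total and unique tokens using a naive whitespace split.

    Hypothetical helper for illustration only; real report tooling
    likely tokenizes differently.
    """
    tokens = text.lower().split()
    return {
        "num_tokens": len(tokens),
        "num_unique_tokens": len(set(tokens)),
    }


# Made-up slide text and footer, not taken from any real file.
body = "quarterly results revenue grew"
footer = "acme confidential"

# Extract "a" repeats the footer several times; extract "b" has it once.
extract_a = " ".join([body, footer, footer, footer])
extract_b = " ".join([body, footer])

stats_a = token_stats(extract_a)
stats_b = token_stats(extract_b)

print("a:", stats_a)
print("b:", stats_b)
```

In this toy case the two extracts share the same vocabulary, so the unique counts match, while the extract with the repeated footer has more tokens overall. That is the same pattern described for "num_unique_tokens_a"/"num_unique_tokens_b" versus "num_tokens_a"/"num_tokens_b".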