Re: Tika Server - Getting the log output with MDC to associate the file being parsed
Thanks Tim! That did the trick. I misread what that parameter meant originally.

And double thanks for the /rmeta suggestion. That's a much better fit for what I'm doing!

On Fri, Jun 26, 2020 at 2:23 PM Tim Allison wrote:

> Depends on what you're trying to do. If you want all of the text+metadata
> out of your files including embedded files, I'd use /rmeta
>
> If you start tika-server with -s or --includeStack, /tika and /rmeta will
> return the full stacktrace. I can't remember if /unpack will or not.
>
> If you need the literal bytes from the embedded files, then /unpack is the
> right endpoint.
>
> If /unpack isn't returning the stacktrace when you start the server with
> the -s option, please report it. That endpoint should work like /tika and
> /rmeta with the -s option.
Re: Tika Server - Getting the log output with MDC to associate the file being parsed
Depends on what you're trying to do. If you want all of the text+metadata out of your files including embedded files, I'd use /rmeta

If you start tika-server with -s or --includeStack, /tika and /rmeta will return the full stacktrace. I can't remember if /unpack will or not.

If you need the literal bytes from the embedded files, then /unpack is the right endpoint.

If /unpack isn't returning the stacktrace when you start the server with the -s option, please report it. That endpoint should work like /tika and /rmeta with the -s option.

On Fri, Jun 26, 2020 at 2:30 PM Nicholas DiPiazza <nicholas.dipia...@gmail.com> wrote:

> I am happily using Tika Server to replace some in-memory usage of Apache
> Tika we have been using for years.
>
> I am stuck with one thing I have sent a file to be parsed to the unpack
> endpoint /unpack/all
>
> I get back a zip file with the metadata, and text extracted. Great!
>
> But some docs failed to parse, and I'll need to know why. For example,
> something is encrypted.
>
> But the response comes back 422. What I really need is to get feedback from
> the tika server why it failed. In particular, the error message.
>
> Is there another endpoint I should be using?
>
> -NIcholas DIPiazza
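For anyone following along, here is a minimal Python sketch of the /rmeta approach described in this message. It is only an illustration under stated assumptions, not code from the thread: the server address (localhost on port 9998) and the helper names build_rmeta_request, embedded_exceptions and parse_file are hypothetical, and the exception metadata keys are matched by the "X-TIKA:EXCEPTION:" prefix because the exact key names may differ between Tika versions.

```python
import json
import urllib.error
import urllib.request

# Assumption: tika-server is running locally on its default port.
TIKA_URL = "http://localhost:9998"


def build_rmeta_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request that sends raw file bytes to the /rmeta endpoint."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/rmeta",
        data=data,
        method="PUT",
        headers={"Accept": "application/json"},
    )


def embedded_exceptions(metadata_list):
    """Collect exception-like entries from an /rmeta JSON response.

    Assumption: exception keys start with "X-TIKA:EXCEPTION:". Only the
    prefix is matched, since the full key names vary by Tika version.
    """
    found = []
    for index, metadata in enumerate(metadata_list):
        for key, value in metadata.items():
            if key.startswith("X-TIKA:EXCEPTION:"):
                found.append((index, key, value))
    return found


def parse_file(path: str, base_url: str = TIKA_URL):
    """Return (status, payload) for one file.

    On success the payload is the decoded JSON list of metadata objects.
    On an HTTP error (for example 422) the payload is the response body as
    text, which holds a stack trace only if the server was started with -s.
    """
    with open(path, "rb") as handle:
        request = build_rmeta_request(handle.read(), base_url)
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as error:
        return error.code, error.read().decode("utf-8", errors="replace")
```

Calling parse_file("some.pdf") against a server started with -s should give either the metadata list or the error status together with whatever text the server returned, so the caller can log why a given file failed.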
Tika Server - Getting the log output with MDC to associate the file being parsed
I am happily using Tika Server to replace some in-memory usage of Apache Tika that we have been using for years.

I am stuck on one thing. I have sent a file to be parsed to the unpack endpoint, /unpack/all.

I get back a zip file with the metadata and the extracted text. Great!

But some docs fail to parse, and I need to know why. For example, something is encrypted.

In those cases the response comes back 422. What I really need is feedback from the Tika server about why it failed, in particular the error message.

Is there another endpoint I should be using?

-Nicholas DiPiazza
packaged subsets
All,

I received a request to package some of the bugtracker data for easier downloading. For this request, I've zipped the PDFs and FDFs from the bugtrackers and made those zips available here: https://corpora.tika.apache.org/base/packaged/pdfs/

I don't think we'll be inundated with one-off requests, and I don't think we should be zipping large chunks of govdocs1 or commoncrawl.

Are there any objections? Is there a better way to package data and/or make it available/browsable/navigable/retrievable?

Cheers,
Tim
[jira] [Commented] (TIKA-3113) Currently Tika is detecting a .aux file as text/html
[ https://issues.apache.org/jira/browse/TIKA-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17146370#comment-17146370 ]

Lewis John McGibbney commented on TIKA-3113:

After a wee bit of research, I understand it to be a file created by TeX and LaTeX, which are typesetting systems often used to produce academic papers and other technical documentation. It contains information about a document such as footnotes, bibliography entries, and cross-references.

_More Information_

AUX files are written when a .TEX file is typeset (formatted to an output document) by LaTeX. Since generating a LaTeX document can take multiple passes before it is complete (because of file and citation cross-referencing), the AUX file is used to store information between runs of the LaTeX compilation process. It appears to be a temporary file...

I haven't been able to find a Java parser for this family of data formats, but I did find https://github.com/nzhagen/bibulous

> Currently Tika is detecting a .aux file as text/html
>
> Key: TIKA-3113
> URL: https://issues.apache.org/jira/browse/TIKA-3113
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 1.24
> Reporter: Danny McKinney
> Priority: Minor
> Attachments: TES.PC.00010363.1.aux
>
> While processing files from an Enron test data set a file with extension aux
> was detected to be MediaType of text/html. The file contains elements
> and but is a type of LaTex file I believe. I am attaching
> sample file.[^TES.PC.00010363.1.aux]

--
This message was sent by Atlassian Jira (v8.3.4#803005)
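To make the description of AUX files above a little more concrete, here is a rough Python sketch of a content heuristic for spotting them. It is not how Tika detects file types and is not proposed as a fix for this issue. The function name looks_like_latex_aux, the command list and the 0.5 threshold are all assumptions chosen for illustration; the commands listed are ones LaTeX and BibTeX commonly write into .aux files.

```python
# Commands that LaTeX/BibTeX typically write into .aux files.
# Assumption: this list is illustrative, not exhaustive.
AUX_COMMANDS = (
    "\\relax",
    "\\@writefile",
    "\\newlabel",
    "\\bibcite",
    "\\citation",
    "\\bibdata",
    "\\bibstyle",
    "\\@input",
)


def looks_like_latex_aux(text: str, threshold: float = 0.5) -> bool:
    """Guess whether text is a LaTeX .aux file.

    Returns True when at least `threshold` of the non-empty lines start
    with one of the known aux commands. The threshold is an arbitrary
    assumption and would need tuning against real files.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    hits = sum(1 for line in lines if line.startswith(AUX_COMMANDS))
    return hits / len(lines) >= threshold
```

A line-fraction test is used rather than a single magic string because, as the report notes, an .aux file can carry HTML-like fragments inside its labels, so one stray tag should not decide the outcome on its own.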
[jira] [Commented] (TIKA-3120) Remove whitelist/blacklist terminology
[ https://issues.apache.org/jira/browse/TIKA-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17146315#comment-17146315 ]

Konstantin Gribov commented on TIKA-3120:

[~tallison], I noticed messages from commits@tika.a.o about this and saw that you use an include/skip pair. Did you choose one such pair, or did you just go with context-dependent wording on a case-by-case basis? If the former, it might be a good idea to add recommended words for include/exclude to the wiki for future contributors.

> Remove whitelist/blacklist terminology
>
> Key: TIKA-3120
> URL: https://issues.apache.org/jira/browse/TIKA-3120
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Fix For: 1.25
>
> Looks trivial...
[jira] [Commented] (TIKA-3121) Rename master branch
[ https://issues.apache.org/jira/browse/TIKA-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17146290#comment-17146290 ]

Konstantin Gribov commented on TIKA-3121:

An alternative is to just use branches like main, branch_1x, branch_2x, etc., archive and lock master, and set the new branch as the default HEAD. This way we will have a much smoother transition with a much smaller potential impact.

> Rename master branch
>
> Key: TIKA-3121
> URL: https://issues.apache.org/jira/browse/TIKA-3121
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> I started a discussion on the dev list for this here:
> http://mail-archives.us.apache.org/mod_mbox/tika-dev/202006.mbox/%3CCAC1dCwW9FuK%2BkSzokmweeYwLFiED9g0W-43J1TNhMwnv7rdp8g%40mail.gmail.com%3E
> One committer would prefer that we not make this change, but seems ok with it.
> Recommendations:
> * main
> * trunk
> * development
> * stable
[jira] [Commented] (TIKA-3121) Rename master branch
[ https://issues.apache.org/jira/browse/TIKA-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17146288#comment-17146288 ]

Konstantin Gribov commented on TIKA-3121:

I didn't vote before and am a bit ambivalent about the change. Despite all of Rich's pushing towards renaming, I'm a bit concerned about the real impact on a developer-biased community. To me it looks more like a populist decision, but I may be biased by previous hate storms that used D ideas against anyone who doesn't kneel and plead to be spared despite not being in some minority.

We will have to go through documentation, wiki, CI configuration, etc. to ensure that the new branch name is used, but we can do this only for our own projects. All external developers who include Tika in their build systems or delivery pipelines, write articles/books, and use the master branch would have to do some additional (and sometimes unexpected) work. In an ideal world this would be handled via the usual scripts/configuration maintenance, but a lot of things with low-priority support or without actual maintenance could break.

So, I'm basically -0.5, weakly against, because of the potential impact on downstream users and fellow developers.