Re: Tika Server - Getting the log output with MDC to associate the file being parsed

2020-06-26 Thread Nicholas DiPiazza
Thanks, Tim! That did the trick. I originally misread what that parameter
meant.

And double thanks for pointing me to /rmeta. That's a much better fit for what
I'm doing!

On Fri, Jun 26, 2020 at 2:23 PM Tim Allison wrote:

> Depends on what you're trying to do.  If you want all of the text+metadata
> out of your files including embedded files, I'd use /rmeta
>
> If you start tika-server with  -s or --includeStack, /tika and /rmeta will
> return the full stacktrace.  I can't remember if /unpack will or not.
>
> If you need the literal bytes from the embedded files, then /unpack is the
> right endpoint.
>
> If /unpack isn't returning the stacktrace when you start the server with
> the -s option, please report it.  That endpoint should work like /tika and
> /rmeta with the -s option.
>
> On Fri, Jun 26, 2020 at 2:30 PM Nicholas DiPiazza <
> nicholas.dipia...@gmail.com> wrote:
>
> > I am happily using Tika Server to replace some in-memory usage of Apache
> > Tika we have been using for years.
> >
> > I am stuck on one thing. I have sent a file to be parsed to the unpack
> > endpoint, /unpack/all.
> >
> > I get back a zip file with the metadata and extracted text. Great!
> >
> > But some docs fail to parse, and I need to know why. For example, a file
> > might be encrypted.
> >
> > But the response comes back as 422. What I really need is feedback from
> > the Tika server on why it failed, in particular the error message.
> >
> > Is there another endpoint I should be using?
> >
> > -Nicholas DiPiazza
> >
>


Re: Tika Server - Getting the log output with MDC to associate the file being parsed

2020-06-26 Thread Tim Allison
Depends on what you're trying to do.  If you want all of the text+metadata
out of your files including embedded files, I'd use /rmeta

If you start tika-server with  -s or --includeStack, /tika and /rmeta will
return the full stacktrace.  I can't remember if /unpack will or not.

If you need the literal bytes from the embedded files, then /unpack is the
right endpoint.

If /unpack isn't returning the stacktrace when you start the server with
the -s option, please report it.  That endpoint should work like /tika and
/rmeta with the -s option.

On Fri, Jun 26, 2020 at 2:30 PM Nicholas DiPiazza <
nicholas.dipia...@gmail.com> wrote:

> I am happily using Tika Server to replace some in-memory usage of Apache
> Tika we have been using for years.
>
> I am stuck on one thing. I have sent a file to be parsed to the unpack
> endpoint, /unpack/all.
>
> I get back a zip file with the metadata and extracted text. Great!
>
> But some docs fail to parse, and I need to know why. For example, a file
> might be encrypted.
>
> But the response comes back as 422. What I really need is feedback from
> the Tika server on why it failed, in particular the error message.
>
> Is there another endpoint I should be using?
>
> -Nicholas DiPiazza
>
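
To illustrate the /rmeta suggestion above, here is a minimal Python sketch of
pulling failure details out of an /rmeta response. It assumes that /rmeta
returns a JSON array with one metadata object per document (the container
first, then embedded files) and that parse failures show up under metadata
keys starting with X-TIKA:EXCEPTION:. That key prefix, the localhost:9998
address, and the helper names fetch_rmeta and collect_exceptions are
assumptions for illustration, not something confirmed in this thread. The
sample response is made up so the parsing logic can be tried without a server.

```python
import json
import urllib.request

# Assumption: parse failures are reported under keys with this prefix.
EXCEPTION_PREFIX = "X-TIKA:EXCEPTION:"


def fetch_rmeta(path, base_url="http://localhost:9998"):
    """PUT a file to /rmeta and return the response body as text.

    base_url is a hypothetical local tika-server address.
    """
    with open(path, "rb") as f:
        data = f.read()
    req = urllib.request.Request(
        base_url + "/rmeta",
        data=data,
        method="PUT",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def collect_exceptions(rmeta_body):
    """Return (document_index, key, value) for every exception entry."""
    found = []
    for index, metadata in enumerate(json.loads(rmeta_body)):
        for key, value in metadata.items():
            if key.startswith(EXCEPTION_PREFIX):
                found.append((index, key, value))
    return found


# Made-up response, used only to exercise the parsing logic offline.
sample = json.dumps([
    {"resourceName": "report.pdf", "X-TIKA:content": "..."},
    {
        "resourceName": "locked.docx",
        "X-TIKA:EXCEPTION:embedded_exception":
            "org.apache.tika.exception.EncryptedDocumentException: ...",
    },
])

for index, key, value in collect_exceptions(sample):
    # Print only the first line of each stack trace.
    print(index, key, value.splitlines()[0])
```

Against a running server, collect_exceptions(fetch_rmeta("some-file.docx"))
would be the equivalent call; the index identifies which document in the
returned list failed.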


Tika Server - Getting the log output with MDC to associate the file being parsed

2020-06-26 Thread Nicholas DiPiazza
I am happily using Tika Server to replace some in-memory usage of Apache
Tika we have been using for years.

I am stuck on one thing. I have sent a file to be parsed to the unpack
endpoint, /unpack/all.

I get back a zip file with the metadata and extracted text. Great!

But some docs fail to parse, and I need to know why. For example, a file
might be encrypted.

But the response comes back as 422. What I really need is feedback from
the Tika server on why it failed, in particular the error message.

Is there another endpoint I should be using?

-Nicholas DiPiazza


packaged subsets

2020-06-26 Thread Tim Allison
All,

   I received a request to package some of the bugtracker data for easier
downloading.  For this request, I've zipped the PDFs and FDFs from the
bugtrackers and made those zips available here:
https://corpora.tika.apache.org/base/packaged/pdfs/

   I don't think we'll be inundated with one-off requests, and I don't
think we should be zipping large chunks of govdocs1 or commoncrawl.

   Are there any objections?  Is there a better way to package data and/or
make it available/browsable/navigable/retrievable?

  Cheers,

 Tim


[jira] [Commented] (TIKA-3113) Currently Tika is detecting a .aux file as text/html

2020-06-26 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17146370#comment-17146370
 ] 

Lewis John McGibbney commented on TIKA-3113:


After a wee bit of research, I understand it to be a file created by TeX and
LaTeX, typesetting systems often used to produce academic papers and other
technical documentation. It contains information about a document, such as
footnotes, bibliography entries, and cross-references.

_More Information_
AUX files are written when a .TEX file is typeset (formatted to an output
document) by LaTeX. Since typesetting a LaTeX document can take multiple
passes before the document is complete (because of file and citation
cross-referencing), the AUX file is used to store information between runs of
the LaTeX compilation process.


Appears to be a temporary file... 


I haven't been able to find a Java parser for this family of data formats but I 
did find https://github.com/nzhagen/bibulous
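
As a rough illustration of the description above, a content-based check for
LaTeX .aux files could look like the Python sketch below. This is not Tika's
detector and not a parser. It only counts lines that start with commands LaTeX
commonly writes to .aux files; the command list, the min_hits threshold, and
the function name looks_like_latex_aux are assumptions chosen for the example.
Whether the file attached to this issue would match has not been checked.

```python
# Commands LaTeX commonly writes into .aux files (illustrative, not exhaustive).
AUX_COMMANDS = (
    "\\relax",
    "\\newlabel{",
    "\\citation{",
    "\\bibcite{",
    "\\bibdata{",
    "\\bibstyle{",
    "\\@writefile{",
)


def looks_like_latex_aux(text, min_hits=2):
    """Return True if at least min_hits lines start with a known .aux command."""
    hits = 0
    for line in text.splitlines():
        # str.startswith accepts a tuple of candidate prefixes.
        if line.lstrip().startswith(AUX_COMMANDS):
            hits += 1
            if hits >= min_hits:
                return True
    return False


# Small hand-written samples for demonstration only.
aux_sample = "\\relax\n\\citation{knuth84}\n\\newlabel{sec:intro}{{1}{1}}\n"
html_sample = "<html><body><p>Not LaTeX</p></body></html>"

print(looks_like_latex_aux(aux_sample))   # True
print(looks_like_latex_aux(html_sample))  # False
```

Requiring more than one matching line keeps a stray \relax in an unrelated
text file from triggering a match.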

> Currently Tika is detecting a .aux file as text/html
> 
>
> Key: TIKA-3113
> URL: https://issues.apache.org/jira/browse/TIKA-3113
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 1.24
> Reporter: Danny McKinney
> Priority: Minor
> Attachments: TES.PC.00010363.1.aux
>
>
> While processing files from an Enron test data set, a file with the extension
> .aux was detected as MediaType text/html. The file contains elements
>  and  but is a type of LaTeX file, I believe. I am attaching a
> sample file. [^TES.PC.00010363.1.aux]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-3120) Remove whitelist/blacklist terminology

2020-06-26 Thread Konstantin Gribov (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17146315#comment-17146315
 ] 

Konstantin Gribov commented on TIKA-3120:
-

[~tallison], I noticed messages from commits@tika.a.o about this and saw that
you use the include/skip pair. Did you choose one such pair, or did you go
with context-dependent terms on a case-by-case basis? If the former, it might
be a good idea to add the recommended words for include/exclude to the wiki for
future contributors.

> Remove whitelist/blacklist terminology
> --
>
> Key: TIKA-3120
> URL: https://issues.apache.org/jira/browse/TIKA-3120
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Fix For: 1.25
>
>
> Looks trivial...





[jira] [Commented] (TIKA-3121) Rename master branch

2020-06-26 Thread Konstantin Gribov (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17146290#comment-17146290
 ] 

Konstantin Gribov commented on TIKA-3121:
-

An alternative is to just use branches like main, branch_1x, branch_2x, etc.,
archive and lock master, and set the new branch as the default HEAD. This way
we will have a much smoother transition with a much smaller potential impact.

> Rename master branch
> 
>
> Key: TIKA-3121
> URL: https://issues.apache.org/jira/browse/TIKA-3121
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> I started a discussion on the dev list for this here:
> http://mail-archives.us.apache.org/mod_mbox/tika-dev/202006.mbox/%3CCAC1dCwW9FuK%2BkSzokmweeYwLFiED9g0W-43J1TNhMwnv7rdp8g%40mail.gmail.com%3E
> One committer would prefer that we not make this change, but seems ok with it.
> Recommendations:
> * main
> * trunk
> * development
> * stable





[jira] [Commented] (TIKA-3121) Rename master branch

2020-06-26 Thread Konstantin Gribov (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17146288#comment-17146288
 ] 

Konstantin Gribov commented on TIKA-3121:
-

I didn't vote before, and I'm a bit ambivalent about the change. Despite all of
Rich's pushing towards renaming, I'm a bit concerned about the real impact on a
developer-biased community. To me it looks more like a populist decision, but I
may be biased by previous hate storms that used D&I ideas against anyone who
doesn't kneel and plead to be spared despite not being in some minority.

We will have to go through the documentation, wiki, CI configuration, etc. to
ensure that the new branch name is used, but we can do this only for our own
projects.

All external developers who include Tika in their build systems or delivery
pipelines, write articles/books, and use the master branch would have to do
some additional (and sometimes unexpected) work. In an ideal world it would be
handled via the usual scripts/configuration maintenance, but a lot of things
with low-priority support or without actual maintenance could break.

So, I'm basically -0.5: weakly against, because of the potential impact on
downstream users and fellow developers.

> Rename master branch
> 
>
> Key: TIKA-3121
> URL: https://issues.apache.org/jira/browse/TIKA-3121
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> I started a discussion on the dev list for this here:
> http://mail-archives.us.apache.org/mod_mbox/tika-dev/202006.mbox/%3CCAC1dCwW9FuK%2BkSzokmweeYwLFiED9g0W-43J1TNhMwnv7rdp8g%40mail.gmail.com%3E
> One committer would prefer that we not make this change, but seems ok with it.
> Recommendations:
> * main
> * trunk
> * development
> * stable


