Re: Unexpected behavior when inspecting mp4 files with different ISO

2024-04-27 Thread Nick Burch
On Fri, 26 Apr 2024, Mauler, David wrote: I'm in the process of troubleshooting an issue with certain mp4 video files and tika. After a bunch of digging, it appears to be related to whatever ISO is set for the mp4 file. An mp4 with an ISO of 14496-12:2003 will be detected as video/quicktime

Re: PST file parsing

2023-11-29 Thread Nick Burch
On Wed, 29 Nov 2023, Neha Kamat via user wrote: We are currently using TIKA for parsing/extracting content from pst files. Is there a way we can tell parsing engine to parse as list of emails instead of string of emails? Depends how you're calling Tika? Tika App? Tika Server? Python Wrapper?

Re: Using Tika with another OCR engine

2023-08-08 Thread Nick Burch
On Thu, 3 Aug 2023, Cristian Zamfir wrote: I am interested in trying out Tika with a different OCR engine and wondering how Tesseract is integrated. Largely as "just another parser", but IIRC with a bit of logic to allow the "normal" image parsers to also have a go at the file to grab

Re: TIKA for MIME type detection

2023-07-27 Thread Nick Burch
On Tue, 20 Jun 2023, Neha Kamat via user wrote: I am currently working on an application wherein I would like to whitelist the filetypes supported by TIKA And discard rest of the files to avoid unknown behaviour/memory leaks. I am currently referring to

Re: Run Tika-docker with custom config

2023-04-28 Thread Nick Burch
On Fri, 28 Apr 2023, שי ברק wrote: I don’t know if it’s possible but I’m trying to avoid typing this ‘--config’ when I start the container. I wish to have all of these settings to be written inside the Dockerfile. Since you're doing your own custom docker container, you could override the

Re: Run Tika-docker with custom config

2023-04-28 Thread Nick Burch
On Fri, 28 Apr 2023, שי ברק wrote: Inside the container probably - makes more sense to me In that case, create a custom Docker container that adds in your custom config to your Docker image, as per Konstantin's instructions: https://lists.apache.org/thread/l0od2b6tp6odyd661ftjqmkkf27o6hdl

Re: Tika incorrectly detecting Canon raw image file .cr3 as video/quicktime

2023-03-22 Thread Nick Burch
On Wed, 22 Mar 2023, Tim Allison wrote: Thank you, Richard, for raising this. In looking at these file formats, it looks like crw is based on ciff, cr2 is based on tiff and cr3 is based on quicktime. Always fun when the core of a format (or at least the container) swaps between versions!

Re: Best practice for extracting content and metadata repeatedly

2023-03-06 Thread Nick Burch
On Mon, 6 Mar 2023, Chris Bamford via user wrote: From both performance and thread safety points of view what is the best approach for the use / reuse of the following objects: Tika ParseContext Parser Metadata The Tika object and/or TikaConfig object should only be created once and then

Re: Subset(s) of Tika?

2023-01-05 Thread Nick Burch
On Thu, 5 Jan 2023, Georg.Fischer wrote: The tika.jar has >54 MB, and I suspect that the loading of the big jar (under Windows) is hindering the performance. I should perhaps move to Linux, or try the Tika server. The Tika App jar has always been the "kitchen sink included quickstart" option

Re: Paragraph words getting merged

2022-10-31 Thread Nick Burch
On Sun, 30 Oct 2022, Christian Ribeaud wrote: I am using the default configuration. I think, we could reduce my problem to following code snippet: Is there a reason that you aren't using one of the built-in Tika content handlers? Generally they should be taking care of everything for you with

Re: Custom Parser Plugin for Tika Server

2022-10-26 Thread Nick Burch
On Wed, 26 Oct 2022, Tim Allison wrote: I've been struggling with this too. Outside of Docker, what I've been doing is using a bin/ directory and throwing everything in there and then starting tika-server: java -cp "bin/*" org.apache.tika.server.core.cli.TikaServerCli ... If we moved to that

Re: Validate MIME-type

2022-09-29 Thread Nick Burch
On Thu, 29 Sep 2022, Peter Conrad wrote: thanks. That's definitely an improvement. But I think it's not sufficient. AFAICS your code uses "aliases" as in "if it's type X then it can also be type Y". However there's also cases where a specific instance of type X can also be type Y but not all

Re: Tika documentation?

2022-09-01 Thread Nick Burch
On Thu, 1 Sep 2022, Mark Kerzner SHMsoft, Inc. wrote: Yes, please. If I make some changes, I will start with small ones. I will also verify them with you. Great, thanks in advance for your contributions! Can you please head to https://cwiki.apache.org/confluence/display/tika/ , click Sign Up

Re: Datasets for testing large number of attachments

2022-07-26 Thread Nick Burch
On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote: I am currently trying to validate our Tika setup and was looking for a set of example data I could use If you want a small number of files of lots of different types, the test files in the Tika source tree will work. Main set are in

Re: Custom filter

2022-06-03 Thread Nick Burch
On Fri, 3 Jun 2022, Cihad Guzel wrote: I want to pass the content's words through some filters while parsing in Tika. How can I add custom filtering? Does the content handler work for this? Is there a document about this? A custom content handler is a pretty good way to do that. Tika just

Re: ForkParser issues with 2.3.0

2022-04-26 Thread Nick Burch
On Tue, 26 Apr 2022, Stephen H wrote: On 26/04/2022 12:22, Nick Burch wrote: Are you able to write a short junit unit test case which shows this issue? We have a bunch of small test OOXML and ODF files that could be used I've done this - if I create an issue in Jira with it would that best

Re: ForkParser issues with 2.3.0

2022-04-26 Thread Nick Burch
On Tue, 26 Apr 2022, Stephen H wrote: Second, there seems to be some work missing in the handling of metadata from certain parsers when using ForkParser. For example, for OpenDocument ODP and ODS files and Microsoft Open XML formats, while the document text is returned there is no metadata in

Re: Returning file extension alongside mime-type?

2022-03-11 Thread Nick Burch
On Tue, 8 Mar 2022, Willy T. Koch wrote: That’s fantastic, thank you! Looking forward to testing when the Tika Docker repo is updated with this release. That may take a few weeks, but if you don't mind building Tika from source, you should be able to give it a whirl now. (As far as I'm

Re: Returning file extension alongside mime-type?

2022-03-07 Thread Nick Burch
On Fri, 18 Feb 2022, Willy T. Koch wrote: Den Tor 17 feb 2022, kl. 20:00, skrev Nick Burch: Tika devs - any thoughts on this? It's a pretty small code change (we already have the data on the mime type!), just need feedback on extending the existing API vs adding a new one By also returning

Re: Returning file extension alongside mime-type?

2022-02-24 Thread Nick Burch
On Thu, 24 Feb 2022, Tim Allison wrote: A separate endpoint, then? That would be cleaner. We already have some mime details related endpoints, would be an extension or related endpoint to those, see earlier-thread: https://lists.apache.org/thread/jlym8ypnrj978hmzjgvkc1fpxnc7g51h Nick

Re: Returning file extension alongside mime-type?

2022-02-24 Thread Nick Burch
On Tue, 22 Feb 2022, Tim Allison wrote: I guess the question is how far do we want to bake this in? I could see adding a field for the default extension in the CompositeDetector/DefaultDetector. This would then be triggered on embedded files, too. I can't imagine this would add much cost

Re: Returning file extension alongside mime-type?

2022-02-17 Thread Nick Burch
On Thu, 10 Feb 2022, Nick Burch wrote: On Thu, 10 Feb 2022, Willy T. Koch wrote: …and calling it as a webservice with Postman/curl. Ah, I think we might not be exposing the full details of the mime types via the server, only details of their parsers and the hierarchy, eg http://localhost

Re: Returning file extension alongside mime-type?

2022-02-10 Thread Nick Burch
On Thu, 10 Feb 2022, Willy T. Koch wrote: …and calling it as a webservice with Postman/curl. Ah, I think we might not be exposing the full details of the mime types via the server, only details of their parsers and the hierarchy, eg http://localhost:9998/mime-types#audio/vorbis (We have

Re: Returning file extension alongside mime-type?

2022-02-10 Thread Nick Burch
On Thu, 10 Feb 2022, Willy T. Koch wrote: As for content detection, today the content-type field with mime type is returned. What we would need is a mime-type to file extension lookup and it seems logical that this was also returned by Tika. How are you calling Tika? We already have APIs for

Re: Tika 2.1.0 pdf parser

2021-10-21 Thread Nick Burch
On Thu, 21 Oct 2021, nskarthik wrote: Question : Need to extract Text / images at page level using java. Did not find any example on www or Tika website. For PDF, you should fetch the contents as XHTML rather than plain text. You can then split on the page divs. This isn't available for

Re: Deleted text in Word document

2021-08-27 Thread Nick Burch
On Fri, 27 Aug 2021, Peter Kronenberg wrote: When Tika extracts from a Microsoft Word document, deleted text is extracted, with no indication that it is deleted. In fact, if a word was deleted and replaced by another word, both words just show up side-by-side. Is there a way to get some sort

Re: dcterms:created date changes on RTF documents

2021-07-22 Thread Nick Burch
On Thu, 22 Jul 2021, David Pilato wrote: TL;DR: the created date of the document changes depending on the timezone. That does seem a bug For example: • Asia/Sakhalin gives dcterms:created=2016-07-06T23:38:00Z • Asia/Colombo gives dcterms:created=2016-07-07T05:08:00Z • Europe/Stockholm gives
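To illustrate the timezone dependence reported above: the two complete values quoted are both consistent with one zone-less timestamp, 2016-07-07 10:38, being interpreted in whatever zone the JVM defaults to. That local time is inferred from the quoted numbers, not stated in the thread, and the class and method names are hypothetical. A minimal sketch using only java.time:

```java
import java.time.LocalDateTime;
import java.time.ZoneId;

final class CreatedDateZoneSketch {
    // Assumed zone-less creation time, inferred from the values quoted in the thread.
    static final LocalDateTime LOCAL_CREATED = LocalDateTime.of(2016, 7, 7, 10, 38);

    /** Interprets the zone-less timestamp in the given zone and renders it as a UTC instant. */
    static String toUtc(String zoneId) {
        return LOCAL_CREATED.atZone(ZoneId.of(zoneId)).toInstant().toString();
    }

    public static void main(String[] args) {
        // Same local time, different zones, different UTC instants.
        System.out.println(toUtc("Asia/Sakhalin"));
        System.out.println(toUtc("Asia/Colombo"));
    }
}
```

With an up-to-date timezone database the two calls should give the Asia/Sakhalin and Asia/Colombo values quoted above, which is what you would expect if the date is read without a zone and then converted using the default one.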

Re: logging formatter configuration compatible with StackDriver

2021-06-11 Thread Nick Burch
On Fri, 11 Jun 2021, Cristian Zamfir wrote: I think for most people it would be quite critical to have logs working. Do you happen to know how I can reach out to the person maintaining the docker images https://hub.docker.com/u/dameikle to see if they are available to update the images? Sounds

Re: --header "X-Tika-OCR: false" ; an option to fully disable OCR for each request

2021-06-10 Thread Nick Burch
On Thu, 10 Jun 2021, Cristian Zamfir wrote: Got it, thanks. What are your thoughts on using Tika 2.x while still in beta? Is it likely to be more stable than 1.26? I presume it has passed the same extensive test suite. Usage stability wise, it's as good as 1.x. API stability wise things are

Re: --header "X-Tika-OCR: false" ; an option to fully disable OCR for each request

2021-06-10 Thread Nick Burch
On Thu, 10 Jun 2021, Cristian Zamfir wrote: Thanks Nick. Looks like the option I was looking for is the 3rd one, but the docs say it is only available in Tika 2.x - am I right? I've just done a grep of the codebase, and it isn't in the 1.x branch, only main = 2.x. So, Tika 2.x only Nick

Re: --header "X-Tika-OCR: false" ; an option to fully disable OCR for each request

2021-06-10 Thread Nick Burch
On Thu, 10 Jun 2021, Cristian Zamfir wrote: It would be nice if this was feasible via the headers of each request. I find it more convenient to use if/else in my code than in the yaml files used for k8s configuration. Is there such an option? Three options, see

Re: best practices for avoiding OOM for tika docker

2021-06-02 Thread Nick Burch
On Wed, 2 Jun 2021, Cristian Zamfir wrote: 1. Do you have a recommendation for a stress test that would allow me to easily test OOM behavior? Depends what kind of OOM you're interested in. If you fire a lot of memory-hungry documents at a single server at once, you can trigger an OOM.

Re: best practices for avoiding OOM for tika docker

2021-05-28 Thread Nick Burch
On Thu, 27 May 2021, Cristian Zamfir wrote: I am running some stress tests of the latest tika server docker (not modified in any way, just pulled from the registry) and seeing that after a few hours I see OOM in the logs. The container has a limit of 4GB set in K8S. I am wondering if you have

Re: Tika Docker licence

2021-04-17 Thread Nick Burch
On Sat, 17 Apr 2021, Lewis John McGibbney wrote: Please point me to the code for the ‘ttf-mscorefonts-installer’. The bit of the Tika docker file that pulls them in is: https://github.com/apache/tika-docker/blob/master/full/Dockerfile#L21 I think the EULA (which we auto-accept during

Re: Tika Docker licence

2021-04-16 Thread Nick Burch
On Tue, 13 Apr 2021, Subhajit Das wrote: The Tika Docker image (full) uses ‘ttf-mscorefonts-installer’. The licence used by it is Microsoft licence and doesn’t seem to allow commercial use. Can anyone please confirm if it is ok to use? Or should a customized version be used for production?

RE: UNSUBSCRIBE

2021-04-16 Thread Nick Burch
On Fri, 16 Apr 2021, Maloney, Patrick (ITS) wrote: Thanks, but that info is not in the individual e-mails...I checked for that. Hmm, that might be an issue with your email client. Every list message has this in the headers Mailing-List: contact user-h...@tika.apache.org; run by

Re: UNSUBSCRIBE

2021-04-16 Thread Nick Burch
On Fri, 16 Apr 2021, Maloney, Patrick (ITS) wrote: UNSUBSCRIBE To unsubscribe from the Apache Tika users list, send an email to user-unsubscr...@tika.apache.org and then reply to confirm. This info is also included in every email Nick

RE: Parsing PDF file - setting threshold of unmapped characters

2021-04-14 Thread Nick Burch
On Wed, 14 Apr 2021, Peter Kronenberg wrote: Anyone have any thoughts on this? I think both an absolute and a percentage would be good, but I don't have enough experience to comment on your suggested numbers for those two thresholds, sorry! Your idea on best vs fast touches on much older

Re: TikaServer Header Name is Case-sensitive

2021-03-15 Thread Nick Burch
On Mon, 15 Mar 2021, Subhajit Das wrote: It seems that TikaServer 1.25 header like “X-Tika-PDFOcrStrategy” is case sensitive. Yes. That's because those then get mapped onto underlying Java classes and methods, which are case sensitive According to
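The reply above explains the case sensitivity by the headers being mapped onto Java classes and methods. The following is a rough sketch of why such a mapping is case sensitive. It is not Tika Server's actual code: `PdfConfig`, `apply` and the "strip the prefix, prepend `set`" rule are assumptions made for illustration.

```java
import java.lang.reflect.Method;

final class HeaderToSetterSketch {
    /** Hypothetical stand-in for a parser configuration object. */
    static final class PdfConfig {
        private String ocrStrategy = "auto";
        public void setOcrStrategy(String value) { this.ocrStrategy = value; }
        public String getOcrStrategy() { return ocrStrategy; }
    }

    // Assumed header prefix; the remainder of the header name becomes the setter suffix.
    static final String PREFIX = "X-Tika-PDF";

    /** Returns true if a setter with exactly the derived name exists and was invoked. */
    static boolean apply(PdfConfig config, String headerName, String value) {
        if (!headerName.startsWith(PREFIX)) {
            return false; // the prefix comparison is itself case sensitive
        }
        String setter = "set" + headerName.substring(PREFIX.length());
        try {
            // Java method lookup is case sensitive: "setocrstrategy" is not "setOcrStrategy".
            Method m = PdfConfig.class.getMethod(setter, String.class);
            m.invoke(config, value);
            return true;
        } catch (ReflectiveOperationException e) {
            return false;
        }
    }
}
```

Under these assumptions `X-Tika-PDFOcrStrategy` resolves to `setOcrStrategy`, while `X-Tika-PDFocrstrategy` finds no method and is ignored.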

Re: Microsoft alternate fonts on RHEL

2021-03-06 Thread Nick Burch
On Sat, 6 Mar 2021, Subhajit Das wrote: But, the fonts and packages are not available on RHEL, as those are Debian packages. Please suggest alternate option to setup all supported fonts and packages on RHEL. Without a RHEL support login I can't be sure if these help or not, but I'd suggest

Re: Re-using a TikaStream

2021-03-01 Thread Nick Burch
On Mon, 1 Mar 2021, Tim Allison wrote: detectors should return the stream reset to the beginning. I agree - needs to be ready for the parser to then process Parsers, IIRC, should return the stream fully(?) read but not closed. Not always - if the parser wanted a File then it may not have

RE: Re-using a TikaStream

2021-03-01 Thread Nick Burch
On Fri, 26 Feb 2021, Peter Kronenberg wrote: For most audio files, using the AudioParser, the buffer is still at the beginning. Even though there is no text extraction, I would think that Tika still needs to read through the stream. The MP3Parser consumes the stream, but the MP4Parser does

RE: Re-using a TikaStream

2021-02-23 Thread Nick Burch
On Tue, 23 Feb 2021, Peter Kronenberg wrote: I was re-reading some emails with Nick Burch back around Dec 22-23 and maybe I misunderstood him, but it sounds like he was saying that TikaInputStream was smart enough to automatically spool the stream to disk to allow re-use. If a parser knows

Re: Error calling ImageMagick

2021-02-12 Thread Nick Burch
On Thu, 11 Feb 2021, Tim Allison wrote: I can replicate this on my windows laptop. The weird thing is that the image file is actually there and if I pause the debugger at the point after imagemagick has complained that the file isn't there but before Tika does the clean up, Windows is funny

Re: OCR on PDFs

2020-12-31 Thread Nick Burch
On Thu, 31 Dec 2020, Peter Kronenberg wrote: I've got Tika working with Tesseract on PDF files, but it seems that if I give it a PDF file that has both searchable text and images, the text is OCRed twice. Is this a PDF where some other tool has already done the OCR and stored the text it

Re: Metadata

2020-12-29 Thread Nick Burch
On Mon, 28 Dec 2020, Peter Kronenberg wrote: For the metadata that comes back from a parse (example below), clearly, the fields are dependent on the file type and information available. Are there any 'standard' fields that come back for all/any files? Such as Author, date, x-parsed-by, etc.

RE: Mimetypes

2020-12-23 Thread Nick Burch
On Wed, 23 Dec 2020, Peter Kronenberg wrote: Best is to wrap as a TikaInputStream, detect using all the detectors via DefaultDetector, then parse after that. But sometimes the detect will read the whole file, right? For example, for Word. So is it then making 2 passes? Nope, we stash the

RE: Mimetypes

2020-12-23 Thread Nick Burch
On Wed, 23 Dec 2020, Peter Kronenberg wrote: But yet, if I understand correctly, using a TikaInputStream *will* spool the entire stream to disk so it can read everything, right? If I re-read the stream to parse, is it making 2 passes? TikaInputStream has logic in it to dump the stream to a temp

RE: Mimetypes

2020-12-23 Thread Nick Burch
On Tue, 22 Dec 2020, Peter Kronenberg wrote: Oh, so reading the stream doesn't read the whole file? Not for Detect, no. The assumption is that Detect is normally followed by Parse, so you won't want the Stream consuming, so we do a mark/reset to check the first few kb only I know for
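The mark/reset approach described above can be sketched with plain java.io. This is not Tika's detector code: the 8192-byte limit, the class name and the toy PDF-only check are assumptions chosen for illustration. The point is that the stream is rewound after the peek, so a later parse still starts at byte 0.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class PeekDetectSketch {
    // Assumed peek limit standing in for "the first few kb".
    static final int PEEK_LIMIT = 8192;

    /** Reads up to PEEK_LIMIT bytes, then rewinds the stream to where it started. */
    static byte[] peek(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset; wrap it in a BufferedInputStream");
        }
        try {
            in.mark(PEEK_LIMIT);
            try {
                return in.readNBytes(PEEK_LIMIT);
            } finally {
                in.reset(); // leave the stream unconsumed for the parser
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Toy detector: recognises only the PDF magic, everything else is octet-stream. */
    static String detect(InputStream in) {
        byte[] head = peek(in);
        String prefix = new String(head, 0, Math.min(head.length, 5), StandardCharsets.US_ASCII);
        return prefix.equals("%PDF-") ? "application/pdf" : "application/octet-stream";
    }
}
```

A raw stream that does not support mark would need wrapping first (here, in a `BufferedInputStream`) before being passed to `detect`.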

Re: Mimetypes

2020-12-22 Thread Nick Burch
On Tue, 22 Dec 2020, Peter Kronenberg wrote: I'm trying to detect the mimetype of a file using both Tika.detect(InputStream) and Tika.detect(File) I get 2 different results. I'm testing with a Microsoft Word (.doc) file. The InputStream one is based on just the first few kb of the file.

Re: Extract URLs from a document

2020-11-12 Thread Nick Burch
On Wed, 11 Nov 2020, nensick wrote: I am exploring the available features and I managed also to extract Office macros but I still don't find a way to get the links. Imagine to have a PDF, a DOCX in which you have a "click here" text as a link pointing to a website (let's say example[.]com).

Re: WARNING: org.xerial's sqlite-jdbc is not loaded for 1.2.4

2020-04-22 Thread Nick Burch
On Wed, 22 Apr 2020, Tim Allison wrote: Y. Agreed. Where should we document this? Where would you look for it? The Tika Server and Tika App both get a fair bit of use from non-Java devs Maybe we need a quickstart for non-Java folks section, and probably a python-specific one as we get loads

Re: WARNING: org.xerial's sqlite-jdbc is not loaded for 1.2.4

2020-04-21 Thread Nick Burch
On Mon, 20 Apr 2020, Bradley Beach wrote: I have tried every permutation of adding sqlite-jdbc-3.30.1.jar to my classpath but still get:   java -classpath ".:sqlite-jdbc-3.30.1.jar" -jar tika-server-1.24.jar --host=localhost --port=12345 You can't combine -classpath and -jar, you have to use

Re: Setting PDF2XHTML img src

2020-01-03 Thread Nick Burch
On Fri, 3 Jan 2020, Mike Dalrymple wrote: I've just started using Tika to process PDFs with embedded images. I'm getting fantastic results but I'm having to post-process the generated XHTML to correct the value of the src attribute on the img elements. That is expected. A simple sax handler

Re: Encoding detectors in OSGi (tika-bundle)

2019-11-12 Thread Nick Burch
On Tue, 12 Nov 2019, Katsuya Tomioka wrote: I'm having trouble accessing encoding detectors in OSGi with Tika 1.22. AutoDetectParser returns "Failed to detect the character encoding of a document" for non-Latin text. We are migrating from 1.10, I'm sure many things are different. It seems like

Re: Anyone have a nice Unix service script for running Tika Server?

2019-10-16 Thread Nick Burch
On Wed, 16 Oct 2019, Eric Pugh wrote: I’m looking at running Tika Server mode in a Linux box (and sorry, I don’t know the specific flavour….). Is there a nice service script to deal with bringing Tika back up if the Linux box is restarted? Are you using a systemd-based linux, or a different

Re: Sample Rate / Audio Sample Rate not included in XML output

2018-10-17 Thread Nick Burch
On Wed, 17 Oct 2018, Tim Allison wrote: This is one of the limitations of a streaming write. As I look at the code of the MP3Parser, I _think_ it would be trivial to write the metadata before writing any content, and it wouldn't get in the way of a streaming parse because the parser reads the

Re: Google Takeout GChat messages

2018-09-05 Thread Nick Burch
On Tue, 4 Sep 2018, Tucker Barbour wrote: I've exported a GMail archive in MBOX format using takeout.google.com. The MBOX archive also includes GChat messages. However, the GChat messages do not include a Date header. Instead the date sent is included in what appears to be a non-conforming

Re: Forcing Parser Invocation

2018-04-24 Thread Nick Burch
On Mon, 23 Apr 2018, lewis john mcgibbney wrote: Using the tika-server, I am having issues parsing the attachment ENVI hdr file at [0] with the EnviHeaderParser [1]. Is there any way I can explicitly force execution of the EnviHeaderParser? I think not directly on a per-request basis. All the

Re: Tika Parsers jar?

2018-04-19 Thread Nick Burch
On Thu, 19 Apr 2018, AJ Weber wrote: But I can't find that jar anywhere in any of the download areas.  (I don't know why, but my maven isn't working properly.) You need to use Maven / Gradle / Ivy to fetch it, and everything it depends on Can someone point me to the location of such a jar

Re: Hex of RSS xml file is not recognized as RSS file MIME type

2018-04-19 Thread Nick Burch
On Wed, 18 Apr 2018, Jean-Nicolas Boulay Desjardins wrote: I converted this RSS XML content to hex: Then send it to Tika... Tika returns: text/plain Base 64 encoded XML is no longer valid XML, so this is as expected. Why am I not getting the rss mime type? You need to send Tika the

Re: Subfile Extraction

2018-03-27 Thread Nick Burch
On Sun, 25 Mar 2018, McGreevy, Anthony wrote: I am currently playing with Tika to see how it works with regards to extraction of subfiles. Do you mean files or resources embedded within another file? If so... With the Tika App, you want -z to have these extracted. With the Tika java classes,

Re: Unable to use -classpath

2018-03-05 Thread Nick Burch
On Sat, 3 Mar 2018, Jean-Nicolas Boulay Desjardins wrote: I am using this command: java -classpath /home/$USER/Projects/Lab/tika/classes/ -jar ./tika-app/target/tika-app-1.17.jar Java ignores -classpath if you also specify -jar In /home/$USER/Projects/Lab/tika/classes/ I have:

Re: Malware RTF is not detected as RTF

2018-03-01 Thread Nick Burch
On Thu, 1 Mar 2018, Jim Idle wrote: Malicious RTF files take advantage of the fact that Microsoft do not follow their own RTF spec. Specifically, Word et al only looks for the opening sequence: {\rt Though the spec says it should be: {\rtf1 I don't think that Tika can assume that all RTF

Re: Long time with OCR

2018-02-20 Thread Nick Burch
On Mon, 19 Feb 2018, Mark Kerzner wrote: Is that a good approach? Is the 10 seconds time normal? I am using the latest most powerful Mac and I get similar results on an i7 processor in Ubuntu. Tika uses the open source Tesseract OCR engine. Tesseract is optimised for ease of contributions

Re: Detect JSON / PDF specific mime type

2018-02-05 Thread Nick Burch
On Mon, 5 Feb 2018, Matteo Alessandroni wrote: I'm using Apache Tika to detect a file Mime Type from its base64 representation. Unfortunately I don't have other info about the file (e.g. extension). and it gives me "text/plain" for JSON and PDF files, but I would like to obtain a more

Re: Binary file check

2018-01-21 Thread Nick Burch
On Fri, 19 Jan 2018, Kudrettin Güleryüz wrote: One more thing, regarding application/xml vs text/xml I think I'll skip application/xml for now and just include text/xml Assuming application/xml is compressed XML such as Open office documents and text/xml as uncompressed XML Nope! They're both

Re: Binary file check

2018-01-14 Thread Nick Burch
On Thu, 11 Jan 2018, Kudrettin Güleryüz wrote: I am not an expert on mime types and how they extend. My definition of binary is any file that is not in human readable form. Any other file, I'd like to index. Would that answer your question? Some of us humans here can read a wide range of

Re: Binary file check

2018-01-11 Thread Nick Burch
On Thu, 11 Jan 2018, Kudrettin Güleryüz wrote: Does Tika library provide an efficient binary file check? How do you define "binary"? Only things with a mimetype that starts text/ ? Or do you want to include application/xml files? Or things that extend from XML like DIF and FictionBook? Only

RE: Very slow parsing of a few PDF files

2017-11-21 Thread Nick Burch
On Tue, 21 Nov 2017, Jim Idle wrote: Following up on this, I will try cancelling my thread based tasks after a pre-set time limit. That is only going to work if Tika and the underlying parsers behave correctly with the interrupted exception. Anyone had any success with that? I am mainly

Re: Very slow parsing of a few PDF files

2017-11-06 Thread Nick Burch
On Tue, 7 Nov 2017, Jim Idle wrote: I have a few PDF files that are taking a very long time to parse. Are you sure it's a PDF? The profiler images you've sent are all for Apache POI and seem to show a XLS file being parsed Nick

Re: Using TikaConfig troubles

2017-11-03 Thread Nick Burch
On Fri, 3 Nov 2017, Markus Jelsma wrote: This is how Nutch gets the parser: Parser parser = tikaConfig.getParser(MediaType.parse(mimeType)); When no custom config is specified config is: new TikaConfig(this.getClass().getClassLoader()); When i specify a custom config, it is: tikaConfig = new

Re: Java 9 and JAXB dependency in tika-core

2017-09-14 Thread Nick Burch
On Thu, 14 Sep 2017, Robert Munteanu wrote: One of the issues that came up is that tika-core has a dependency on JAXB [1]. The javax.xml.bind packages are no longer part of the java.se module, and therefore not available by default on the module path. The issue can be triggered with a simple

Re: Detecting .bat and .cmd files

2017-08-23 Thread Nick Burch
On Wed, 23 Aug 2017, epast...@vt.edu wrote: I'm trying to get tika to detect .bat and .cmd files. Both are returning as text/plain. In the xml file, (https://github.com/apache/tika/blob/master/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml) bat falls under

Re: Performance Improvement AutoDetectParser

2017-08-04 Thread Nick Burch
On Fri, 4 Aug 2017, aravinth thangasami wrote: we are using Tika 1.13. 1.15 is out! While instantiating AutoDetectParser we found that the CompositeExternalParser which actually we don't need, takes up more time. It's because of ExifTool & FFmpeg. I tried with removing

Re: Parse file without creating tmp file

2017-07-11 Thread Nick Burch
On Tue, 11 Jul 2017, aravinth thangasami wrote: Recently I have noticed tika creates a tmp file before parsing the stream. Only for certain formats, generally where the underlying parsing library requires a file for random-access I don't have much experience in Tika but I feel it is an

Re: Adding a WARC parser to Tika

2017-07-10 Thread Nick Burch
On Mon, 10 Jul 2017, Allison, Timothy B. wrote: Sorry, I can't tell if this is tongue-in-cheek... No, I do think we should add a WARC parser to Tika Parsers. Once done, I'd suggest we figure out a way for Tika Batch to run over a collection of WARC files just as it does for directories, to

Re: Tika content detection and crawled "remote" content

2017-07-05 Thread Nick Burch
Having taken a "quick" look over lunch at some of the "programming language" ones, and gone down a rabbit hole... I think at least some of them are as described in TIKA-2419, where our change to the HTML magic priority to fix for HTML-containing formats like email had broken some things.

Re: Limit on input PDF file size in Tika?

2017-06-08 Thread Nick Burch
On Thu, 8 Jun 2017, tesm...@gmail.com wrote: Thanks for your reply. I am calling Apache Tika in Java code like this: public String extractPDFText(String faInputFileName) throws IOException,TikaException { //Handler for body text of the PDF article BodyContentHandler handler = new

Re: Limit on input PDF file size in Tika?

2017-06-08 Thread Nick Burch
On Thu, 8 Jun 2017, tesm...@gmail.com wrote: My tika code is not extracting full body text of larger PDF files. Files more than 1 MB in size and around 20 pages are partially extracted. Is there any limit on input PDF file size in tika How are you calling Apache Tika? Direct java calls to

Re: Extracting macros in 1.15

2017-06-03 Thread Nick Burch
On Sat, 3 Jun 2017, Jim Idle wrote: After being baffled why macros no longer show up in 1.15 I found: https://issues.apache.org/jira/browse/TIKA-2302 Can anyone point me to an example of doing this? I am finding bits and pieces but no example of turning macros back on. I basically want all

Re: TIKA for confidental documents

2017-05-13 Thread Nick Burch
On Sat, 13 May 2017, Julian Decker wrote: is there any connection and data transfer to external servers by using the Tika Server or Tika App? None out-of-the-box. If you turn on Translation, or most of the NER / NLP / Object Recognition stuff, Tika will send the relevant things to your

RE: Extract Message-ID in EML file

2017-04-21 Thread Nick Burch
On Fri, 21 Apr 2017, Allison, Timothy B. wrote: Probably? Please open an issue on our JIRA and submit an example file. I think you can often get it from Message:Raw-Header:Message-ID But that isn't ideal. We probably ought to define a proper Message: property for it, and have all the

Re: Fwd: Tika not parsing underlines

2017-01-04 Thread Nick Burch
On Thu, 5 Jan 2017, Kamesh Joshi wrote: I am trying to parse the attached pdf, but it does not give me the places where the underline is present; it just returns plain text. Please help me with how I can also get the underline present in the pdf, or some way to split text based on that. I am using

Re: Mime type matching: tika-mimetypes.xml

2016-11-09 Thread Nick Burch
On Wed, 9 Nov 2016, Chris Bamford wrote: … ... Does offset="0:8192" mean match 'Message-ID:' anywhere in the first 8192 bytes? Yup, that's it. If that is found, and nothing with a priority score of higher than 50 also matches, it'll return that type. If a higher
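A small sketch of the offset-range semantics confirmed above, assuming the range bounds the starting position of the match, inclusive at both ends. This is an illustration only, not the code Tika uses for magic matching, and the class and method names are hypothetical.

```java
final class OffsetRangeMatchSketch {
    /**
     * Returns true if pattern begins at some offset in [from, to] of data.
     * With from=0 and to=8192 this mirrors offset="0:8192" as described above.
     */
    static boolean matchesInRange(byte[] data, byte[] pattern, int from, int to) {
        // The pattern must fit entirely inside the data, so cap the last start offset.
        int lastStart = Math.min(to, data.length - pattern.length);
        for (int start = Math.max(from, 0); start <= lastStart; start++) {
            if (regionEquals(data, start, pattern)) {
                return true;
            }
        }
        return false;
    }

    /** Compares pattern against data beginning at start; the caller guarantees it fits. */
    private static boolean regionEquals(byte[] data, int start, byte[] pattern) {
        for (int i = 0; i < pattern.length; i++) {
            if (data[start + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A message whose `Message-ID:` header follows other headers would match with the range 0:8192, but not with an offset of 0 alone, which is presumably why a range is used for this type.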

Re: Get file metadata without retrieving entire file with Tika Server

2016-10-13 Thread Nick Burch
On Thu, 13 Oct 2016, Mr Havecamp wrote: However, the problem with either option is that we need to retrieve the entire file from storage; this is fine for smaller text files but when handling these larger files, it seems wasteful and time-consuming to download, say, a video file just to

Re: Tika: parsing mixed content e-mails

2016-10-06 Thread Nick Burch
On Thu, 6 Oct 2016, Ingo Siebert wrote: Am 05.10.2016 um 20:04 schrieb Nick Burch: On Wed, 5 Oct 2016, Ingo Siebert wrote: I just used Tika (org.apache.tika:tika-parsers:1.13) to parse an e-mail with multipart/mixed content. How do you want to get the various parts back? All text inlined

Re: Code parser?

2016-09-29 Thread Nick Burch
On Wed, 28 Sep 2016, Mark Kerzner wrote: probably yes, but how do I tell it which parser to use? Today, I just do that String text = tika.parseToString(inputStream, metadata); and it knows the parser. That might be your issue. It's quite hard to identify the language of a piece of source

Re: How to parse PDF files effectively with Tika

2016-09-12 Thread Nick Burch
On Mon, 12 Sep 2016, Sergey Beryozkin wrote: By the way, I've found out AutoDetectParser may not work if the (pdf) stream is an attachment stream which may not support a mark. Simplest would probably be just to wrap it in a TikaInputStream, which would handle any buffering/marking as needed

Re: Problem with detection of RFC822 message

2016-07-28 Thread Nick Burch
On Thu, 28 Jul 2016, Vjeran Marcinko wrote: Just as I resolved the problem with MBOX parser, I noticed that it doesn't correctly detect contained RFC822 messages as message/rfc822, but usually text/html or some variation of it. And question as before, is there some workaround for 1.13 to

Re: No Unicode mapping warnings

2016-07-26 Thread Nick Burch
On Tue, 26 Jul 2016, Oliver Steinau wrote: I'm having problems extracting text from a small (43 KB) PDF file using tika-1.13 -- I get a bunch of warnings like WARN No Unicode mapping for C0104 (38) in font FDLICI+PSOwstswiss WARN No Unicode mapping for C0097 (31) in font FDLICI+PSOwstswiss

Re: Problem with detection of .mbox file

2016-07-25 Thread Nick Burch
On Mon, 25 Jul 2016, Vjeran Marcinko wrote: I first noticed that my .mbox file doesn't get parsed by MBoxParser, and later, after debugging Tika source code, I found what the problem is - default detector doesn't even recognize it as "application/mbox" MIME type, and although file extension is

Re: DATE metadata from email

2016-05-15 Thread Nick Burch
On Sun, 15 May 2016, Philipp Steinkrüger wrote: To begin with, I noticed the following behaviour which might or might not be a bug. I asked this question on stackexchange (https://stackoverflow.com/questions/37226842/tika-metadata-from-email-misses-date

My "What's new with Apache Tika 2.0" talk slides

2016-05-11 Thread Nick Burch
Hi All For those who couldn't make it to Vancouver this week, the slides from my "What's new with Apache Tika 2.0" talk are now available online: http://www.slideshare.net/NickBurch2/apache-tika-whats-new-with-20 The audio was recorded, hopefully that will be available to go with the slides

Re: XML Parser with type recognition

2016-05-11 Thread Nick Burch
On Wed, 11 May 2016, plug...@free.fr wrote: If you can take a look at my little gist example https://gist.github.com/anonymous/3506db4367040ea8f381c5b7b435b3f9 it will be very helpful. The localName parameter is case sensitive. Your sample file starts with Nick

Re: XML Parser with type recognition

2016-05-11 Thread Nick Burch
On Wed, 11 May 2016, plug...@free.fr wrote: Ok if I understand I can create a specific mime type into tika-mimetypes.xml resource file like this: http://www.w3.org/2001/XMLSchema-instance"/> Almost - you can't set that glob as it's already claimed. Otherwise, assuming that is the

Re: XML Parser with type recognition

2016-05-10 Thread Nick Burch
On Tue, 10 May 2016, plug...@free.fr wrote: But now I'm facing of detecting some XML files but only some specifics, I can't detect only "application/xml", I need to detect which type of XML is it (in my case http://www.iab.com/guidelines/digital-video-ad-serving-template-vast-3-0/). But the

Re: disable extraction of images

2016-04-13 Thread Nick Burch
On Wed, 13 Apr 2016, ron.vandenbranden wrote: Is it possible to disable text extraction from images inside a PDF file? I'm testing with the CLI tika app, which has "extractInlineImages" set to false by default, if I'm not mistaken. Yet, the text of the images still is present in the generated

Re: Fwd: How to enable multiple parsers for content type ?

2016-03-23 Thread Nick Burch
On Wed, 23 Mar 2016, Thamme Gowda N. wrote: Question : How to enable multiple parsers for specific mimetypes? I am using tika to parse html pages. My requirement is that both *NamedEntityParser* and *HtmlParser* has to be enabled for specific web related MIME types like *text/html, *

Re: Using tika-app-1.11.jar

2016-02-11 Thread Nick Burch
On Wed, 10 Feb 2016, Steven White wrote: I'm including tika-app-1.11.jar with my application and see that Tika includes "slf4j". The Tika App single jar is intended for standalone use. It's not generally recommended to be included as part of a wider application, as it tends to include
