[jira] [Commented] (TIKA-1334) Add presentation layer for results of each run
[ https://issues.apache.org/jira/browse/TIKA-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994941#comment-15994941 ] Tyler Palsulich commented on TIKA-1334: --- The format should probably be in the form: {noformat} [ { "mime-type": "something", "count": 1234, "version": "a" }, { "mime-type": "something", "count": 4321, "version": "b" }, ... ] {noformat} > Add presentation layer for results of each run > -- > > Key: TIKA-1334 > URL: https://issues.apache.org/jira/browse/TIKA-1334 > Project: Tika > Issue Type: Sub-task > Components: cli, general, server >Reporter: Tim Allison > Attachments: static_stats.zip > > > If I'm doing this, it'll probably be vintage mid-90s html. If someone with > some .js kung-fu wants to take this, please do. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
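The format proposed in the comment above can be sketched as follows. This is a minimal illustration only, not code from the Tika project: the class and method names (`MimeCountReport`, `Entry`, `toJson`) are hypothetical, and the sample values are the placeholders from the comment itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: serialises per-run MIME type counts in the proposed shape.
final class MimeCountReport {

    // One row of the report: a MIME type, how often it was seen, and the run/version label.
    static final class Entry {
        final String mimeType;
        final long count;
        final String version;

        Entry(String mimeType, long count, String version) {
            this.mimeType = mimeType;
            this.count = count;
            this.version = version;
        }
    }

    // Escapes only backslashes and double quotes; enough for MIME types and short labels.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds a compact JSON array with one object per entry.
    static String toJson(List<Entry> entries) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < entries.size(); i++) {
            Entry e = entries.get(i);
            if (i > 0) {
                sb.append(",");
            }
            sb.append("{\"mime-type\":\"").append(escape(e.mimeType))
              .append("\",\"count\":").append(e.count)
              .append(",\"version\":\"").append(escape(e.version))
              .append("\"}");
        }
        return sb.append("]").toString();
    }

    public static void main(String[] args) {
        List<Entry> entries = new ArrayList<>();
        entries.add(new Entry("something", 1234, "a"));
        entries.add(new Entry("something", 4321, "b"));
        System.out.println(toJson(entries));
    }
}
```

A flat array of records like this keeps one row per (MIME type, version) pair, which is easy to feed to a static HTML table or to a JavaScript charting layer.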
[jira] [Commented] (TIKA-1743) NetworkParser can create Unbounded Number of Threads
[ https://issues.apache.org/jira/browse/TIKA-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903878#comment-14903878 ] Tyler Palsulich commented on TIKA-1743: --- [Copied from the list] This sounds like a great idea! We should make the size of the pool configurable with TikaConfig. > NetworkParser can create Unbounded Number of Threads > > > Key: TIKA-1743 > URL: https://issues.apache.org/jira/browse/TIKA-1743 > Project: Tika > Issue Type: Bug >Reporter: Bob Paulin > > The current NetworkParser class creates new instances of the Thread class > which each call to parse. This could create an unbounded number of threads > created by this class. I'd suggest replacing this logic with a > ThreadPoolExecutor and a configurable number of threads. This will help > prevent creating an unbounded number of threads and allow the user to tune > performance to the hardware. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
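The suggestion in this thread (a ThreadPoolExecutor with a configurable number of threads instead of one new Thread per parse call) could look roughly like the sketch below. It is not the NetworkParser implementation: `BoundedParsePool` and `distinctThreadsUsed` are hypothetical names, the pool size is passed in directly where a real change would presumably read it from TikaConfig, and the lambda syntax assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not Tika code: runs tasks on a fixed-size pool.
final class BoundedParsePool implements AutoCloseable {
    private final ThreadPoolExecutor pool;

    // poolSize stands in for a value that would come from configuration.
    BoundedParsePool(int poolSize) {
        if (poolSize < 1) {
            throw new IllegalArgumentException("poolSize must be >= 1");
        }
        // core == max gives a fixed-size pool; extra tasks wait in the queue
        // instead of each getting a thread of their own.
        this.pool = new ThreadPoolExecutor(poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>());
    }

    <T> Future<T> submit(Callable<T> task) {
        return pool.submit(task);
    }

    @Override
    public void close() {
        pool.shutdown();
    }

    // Submits taskCount dummy tasks and reports how many distinct threads ran them.
    static int distinctThreadsUsed(int poolSize, int taskCount) throws Exception {
        Set<String> threadNames = ConcurrentHashMap.newKeySet();
        try (BoundedParsePool parsePool = new BoundedParsePool(poolSize)) {
            List<Future<Boolean>> pending = new ArrayList<>();
            for (int i = 0; i < taskCount; i++) {
                pending.add(parsePool.submit(() -> threadNames.add(Thread.currentThread().getName())));
            }
            for (Future<Boolean> f : pending) {
                f.get(); // wait for each task to finish
            }
        }
        return threadNames.size();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(distinctThreadsUsed(4, 50) + " thread(s) served 50 tasks");
    }
}
```

However many tasks are submitted, the number of worker threads never exceeds the configured size, which is the property the issue asks for.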
[jira] [Commented] (TIKA-1672) Integrate tika-java7 component
[ https://issues.apache.org/jira/browse/TIKA-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14722705#comment-14722705 ] Tyler Palsulich commented on TIKA-1672: --- Hmm. Maybe we should rename the module? Right now, it doesn't make sense to have a java7 component when the entire project depends on Java 7. Integrate tika-java7 component -- Key: TIKA-1672 URL: https://issues.apache.org/jira/browse/TIKA-1672 Project: Tika Issue Type: Improvement Reporter: Tyler Palsulich Fix For: 1.11 Code requiring Java 7 doesn't need to be in a separate module now that TIKA-1536 (upgrade to Java 7) is done. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1362) Add GoogleTranslate implementation of Translation API
[ https://issues.apache.org/jira/browse/TIKA-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623246#comment-14623246 ] Tyler Palsulich commented on TIKA-1362: --- If you have a pressing need for better configuration abilities for the Google Translator, feel free to open up a new issue and upload a patch! :) We'd be happy to help you get started. Check out the [contributing page|https://tika.apache.org/contribute.html] for some general information. Add GoogleTranslate implementation of Translation API - Key: TIKA-1362 URL: https://issues.apache.org/jira/browse/TIKA-1362 Project: Tika Issue Type: Bug Components: translation Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.6 Add an implementation of the Translation API that uses the Google Translate v2 API and Apache CXF: https://www.googleapis.com/language/translate/v2 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1672) Integrate tika-java7 component
Tyler Palsulich created TIKA-1672: - Summary: Integrate tika-java7 component Key: TIKA-1672 URL: https://issues.apache.org/jira/browse/TIKA-1672 Project: Tika Issue Type: Improvement Reporter: Tyler Palsulich Fix For: 1.10 Code requiring Java 7 doesn't need to be in a separate module now that TIKA-1536 (upgrade to Java 7) is done. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1536) Upgrade compiler definition in pom's to Java 7
[ https://issues.apache.org/jira/browse/TIKA-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1536. --- Resolution: Fixed Upgraded in r1688779. Thanks, all. Will open a new issue regarding integrating tika-java7. Upgrade compiler definition in pom's to Java 7 -- Key: TIKA-1536 URL: https://issues.apache.org/jira/browse/TIKA-1536 Project: Tika Issue Type: Improvement Components: packaging Affects Versions: 1.7 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.10 Attachments: TIKA-1536.patch Since we committed TIKA-1423 it would appear through [mailing list|http://www.mail-archive.com/dev%40tika.apache.org/msg11542.html] commentary that there is a willingness to drop support for Java 1.6 in favour of >= Java 1.7. This issue simply addresses this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1536) Upgrade compiler definition in pom's to Java 7
[ https://issues.apache.org/jira/browse/TIKA-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605772#comment-14605772 ] Tyler Palsulich commented on TIKA-1536: --- Yep, see http://apache.markmail.org/thread/7oubuh4hp6rdlbch. Upgrade compiler definition in pom's to Java 7 -- Key: TIKA-1536 URL: https://issues.apache.org/jira/browse/TIKA-1536 Project: Tika Issue Type: Improvement Components: packaging Affects Versions: 1.7 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.10 Attachments: TIKA-1536.patch Since we committed TIKA-1423 it would appear through [mailing list|http://www.mail-archive.com/dev%40tika.apache.org/msg11542.html] commentary that there is a willingness to drop support for Java 1.6 in favour of >= Java 1.7. This issue simply addresses this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-1481) TikaJAXRS get metadata calls give different results
[ https://issues.apache.org/jira/browse/TIKA-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1481. - Resolution: Not A Problem Hi [~arbuzovada]. Sorry for the trouble! Did you make sure to respond to the automated response, confirming your subscription? I'm closing this issue as not a problem. But, don't hesitate to let us know if you have any more issues. TikaJAXRS get metadata calls give different results --- Key: TIKA-1481 URL: https://issues.apache.org/jira/browse/TIKA-1481 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.6 Environment: Windows 8, JDK 1.8 Reporter: Darya Arbuzova Priority: Minor Attachments: sample.csv Hello! I'm trying to use Tika in server mode. I downloaded tika-server-1.6.jar from http://mirror.vorboss.net/apache/tika/. I have tried to get file metadata in 2 different ways (as explained here: http://wiki.apache.org/tika/TikaJAXRS ): {{ curl -T sample.csv http://localhost:9998/meta --header "Content-Type: text/csv"}} {{Content-Encoding,windows-1252}} {{Content-Type,text/plain; charset=windows-1252}} and {{ curl -X PUT -d @sample.csv http://localhost:9998/meta --header "Content-Type: text/csv"}} {{Content-Encoding,ISO-8859-1}} {{Content-Type,text/plain; charset=ISO-8859-1}} How come they give different results in encoding if I call the same {{http://localhost:9998/meta}}? What other differences could appear, and which is the preferable way to get metadata? Many thanks! Best regards, Darya Arbuzova -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-756) XMP output from Tika CLI
[ https://issues.apache.org/jira/browse/TIKA-756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-756. -- Resolution: Fixed Marking this as Fixed, since there are a few more references to tika-parser components (see TikaToXMP). Feel free to reopen if you disagree. XMP output from Tika CLI Key: TIKA-756 URL: https://issues.apache.org/jira/browse/TIKA-756 Project: Tika Issue Type: New Feature Components: cli, metadata Reporter: Jukka Zitting Assignee: Jörg Ehrlich Labels: metadata, xmp Attachments: tika-xmp.patch, tika-xmp_styleAndHeader.patch It would be great if the Tika CLI could output metadata also in the XMP format. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-1429) Unable to View a 9mb file even after setting a large Heap Size of 3GB while TIKA GUI
[ https://issues.apache.org/jira/browse/TIKA-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1429. - Resolution: Not A Problem Closing this as not a problem. The file needs to be kept in memory for the GUI to work. So, the problem should be fixed with a higher limit. Unable to View a 9mb file even after setting a large Heap Size of 3GB while TIKA GUI --- Key: TIKA-1429 URL: https://issues.apache.org/jira/browse/TIKA-1429 Project: Tika Issue Type: Bug Components: gui Affects Versions: 1.6 Environment: Windows 8 Reporter: Gautham Gowrishankar Priority: Minor We seem to have found an issue while running the tika 1.6 jar as a GUI (-g option). It seems to work for smaller .tsv files, but we are running into a GC Overload Exception while running on one of the files in your DataSet. Strangely, it seems to work with the -x option. There might be an issue at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:284). Just bringing it to your notice. Below are the logs.
=
Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOfRange(Unknown Source)
    at java.lang.String.<init>(Unknown Source)
    at java.lang.StringBuilder.toString(Unknown Source)
    at java.lang.StackTraceElement.toString(Unknown Source)
    at java.lang.String.valueOf(Unknown Source)
    at java.lang.StringBuilder.append(Unknown Source)
    at java.lang.Throwable.printStackTrace(Unknown Source)
    at java.lang.Throwable.printStackTrace(Unknown Source)
    at org.apache.tika.gui.TikaGUI.handleError(TikaGUI.java:351)
    at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:284)
    at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:238)
    at javax.swing.AbstractButton.fireActionPerformed(Unknown Source)
    at javax.swing.AbstractButton$Handler.actionPerformed(Unknown Source)
    at javax.swing.DefaultButtonModel.fireActionPerformed(Unknown Source)
    at javax.swing.DefaultButtonModel.setPressed(Unknown Source)
    at javax.swing.AbstractButton.doClick(Unknown Source)
    at javax.swing.plaf.basic.BasicMenuItemUI.doClick(Unknown Source)
    at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(Unknown Source)
    at java.awt.Component.processMouseEvent(Unknown Source)
    at javax.swing.JComponent.processMouseEvent(Unknown Source)
    at java.awt.Component.processEvent(Unknown Source)
    at java.awt.Container.processEvent(Unknown Source)
    at java.awt.Component.dispatchEventImpl(Unknown Source)
    at java.awt.Container.dispatchEventImpl(Unknown Source)
    at java.awt.Component.dispatchEvent(Unknown Source)
    at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
    at java.awt.LightweightDispatcher.processMouseEvent(Unknown Source)
    at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
    at java.awt.Container.dispatchEventImpl(Unknown Source)
    at java.awt.Window.dispatchEventImpl(Unknown Source)
    at java.awt.Component.dispatchEvent(Unknown Source)
    at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
    at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
    at java.awt.EventQueue$4.run(Unknown Source)
    at java.awt.EventQueue$4.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
    at java.awt.EventQueue.dispatchEvent(Unknown Source)
    at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
    at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
    at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
    at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
    at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
    at java.awt.EventDispatchThread.run(Unknown Source)
Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.lang.StringBuilder.toString(Unknown Source)
    at com.sun.java.swing.plaf.windows.TMSchema$Part.getControlName(Unknown Source)
    at com.sun.java.swing.plaf.windows.XPStyle.isSkinDefined(Unknown Source)
    at
[jira] [Commented] (TIKA-1493) Update for JAXRS page with details on passing password
[ https://issues.apache.org/jira/browse/TIKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605292#comment-14605292 ] Tyler Palsulich commented on TIKA-1493: --- Can someone familiar with the latest in passing a password to Tika server update the wiki page? Or, is setting the environment variable enough? Update for JAXRS page with details on passing password -- Key: TIKA-1493 URL: https://issues.apache.org/jira/browse/TIKA-1493 Project: Tika Issue Type: Improvement Components: documentation Reporter: Peter Bowyer Priority: Minor Labels: documentation, newbie I signed up for a wiki account to make the edit, but the page is immutable :( It would be really helpful to put on https://wiki.apache.org/tika/TikaJAXRS information about passing the password for encrypted PDFs into TikaJAXRS. In Changelog.txt I discovered the TIKA_PASSWORD environment variable which has worked for me, and it'd be nice to save others having to hunt around. I'd also like to know if there's a way to pass it in per-request (a HTTP header? Useful when many different passwords) - not found anything in the source code for that though. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-1552) Pdf document parser
[ https://issues.apache.org/jira/browse/TIKA-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1552. - Resolution: Not A Problem Marking this as not a problem, since Adobe Reader also adds white space. Pdf document parser --- Key: TIKA-1552 URL: https://issues.apache.org/jira/browse/TIKA-1552 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.7 Reporter: Konstantin Attachments: 2014_US_Federal_Budget.pdf, issue.jpg Hello, We found that when a pdf document has marked text inside frame (table) then after parsing Tika insert tabs between words. Original text from attached file: Provides $17.7 billion in discretionary funding for the National Aeronautics and Space Parsed text (jira removed tabs, so i will add - symbols instead): •Provides - $17.7 - billion-in-discretionary-funding-for-the-National-Aeronautics-and-Space Please take a look in attached screenshot. On the left side is the parsed text in text editor Thank you. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-1452) parser.parse() throws exception after which the procesed file is not getting renamed/moved/deleted
[ https://issues.apache.org/jira/browse/TIKA-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1452. - Resolution: Not A Problem I'm closing this as not a problem. But, please feel free to reopen if you're still having this issue! parser.parse() throws exception after which the procesed file is not getting renamed/moved/deleted -- Key: TIKA-1452 URL: https://issues.apache.org/jira/browse/TIKA-1452 Project: Tika Issue Type: Bug Components: detector, metadata, parser Affects Versions: 1.6 Environment: jre6 Reporter: Abhishek I am passing a file as input stream to parser.parse() method while using apache tika library to convert file to text.The method throws an exception (displayed below) but the input stream is closed in the finally block successfully. Then while renaming the file, the File.renameTo method from java.io returns false. I am not able to rename/delete/move the file despite successfully closing the inputStream. I am afraid another instance of file is created, while parser.parse() method processess the file, which doesn't get closed till the time exception is throw. Is that possible? If so what should I do to rename or delete the file. 
The Exception thrown while checking the content type is
java.lang.NoClassDefFoundError: Could not initialize class com.adobe.xmp.impl.XMPMetaParser
    at com.adobe.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:160)
    at com.adobe.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:144)
    at com.drew.metadata.xmp.XmpReader.extract(XmpReader.java:106)
    at com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
    at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
    at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
    at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-1439) PDF embeded with document can not parse.
[ https://issues.apache.org/jira/browse/TIKA-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1439. - Resolution: Duplicate PDF embeded with document can not parse. Key: TIKA-1439 URL: https://issues.apache.org/jira/browse/TIKA-1439 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.6 Environment: windows7 Reporter: sunxingzhe Labels: pdfbox Attachments: PDF2XHTML.java_diff.html I insert a Excel file into the pdf file. But can not extracte embedded excel resources. The attachment file PDF2XHTML.java_diff.html is the diff file. Please confirm it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1233) PDFBox can throw StringIndexOutOfBoundsException on some dates
[ https://issues.apache.org/jira/browse/TIKA-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-1233: -- Fix Version/s: (was: 1.6) 1.10 PDFBox can throw StringIndexOutOfBoundsException on some dates -- Key: TIKA-1233 URL: https://issues.apache.org/jira/browse/TIKA-1233 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Reporter: Tim Allison Priority: Trivial Labels: easyfix Fix For: 1.10 PDFBOX's date parser can throw a StringIndexOutOfBoundsException if a date string for parsing is empty or contains only spaces. A few of my test pdfs have this feature. Until PDFBOX-1803 is resolved, we can add an extra catch to prevent this from causing problems in TIKA
{noformat}
@@ -171,6 +171,9 @@
 addMetadata(metadata, TikaCoreProperties.CREATED, info.getCreationDate());
 } catch (IOException e) {
 // Invalid date format, just ignore
+} catch (StringIndexOutOfBoundsException e){
+//remove after PDFBOX-1883 is fixed
+// Invalid date format, just ignore
 }
 try {
 Calendar modified = info.getModificationDate();
@@ -178,6 +181,9 @@
 addMetadata(metadata, TikaCoreProperties.MODIFIED, modified);
 } catch (IOException e) {
 // Invalid date format, just ignore
+} catch (StringIndexOutOfBoundsException e){
+//remove after PDFBOX-1883 is fixed
+// Invalid date format, just ignore
 }
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1585) Create Example Website with Form Submission
[ https://issues.apache.org/jira/browse/TIKA-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1585. --- Resolution: Fixed Good idea, [~lewismc]. I added it to http://people.apache.org/~tpalsulich/tika.html. The server is down right now. If/when another one is started, we'll need to start it with the right CORS argument (http://people.apache.org) and I'll update the page with the right IP address. Create Example Website with Form Submission --- Key: TIKA-1585 URL: https://issues.apache.org/jira/browse/TIKA-1585 Project: Tika Issue Type: New Feature Components: example, server Reporter: Tyler Palsulich Assignee: Tyler Palsulich It would be great to have a website where we can direct people who ask what Tika can do for [filetype] without needing them to actually download Tika. Some initial work to do that is [here|http://tpalsulich.github.io/TikaExamples/]. I'm far from a design guru, but I imagine the site as having a form where you can upload a file at the top, checkboxes for if you want metadata, content, or both, and a submit button. The request should be sent with AJAX and the result should populate a {{div}}. One issue with AJAX requests is that Tika Server doesn't currently allow Cross-Origin-Resource-Sharing (CORS). So, we either need to maintain a slightly updated tika-server, or update the server to allow configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1536) Upgrade compiler definition in pom's to Java 7
[ https://issues.apache.org/jira/browse/TIKA-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605300#comment-14605300 ] Tyler Palsulich commented on TIKA-1536: --- Now that 1.9 is released, are there any blockers for upgrading to Java 1.7? Upgrade compiler definition in pom's to Java 7 -- Key: TIKA-1536 URL: https://issues.apache.org/jira/browse/TIKA-1536 Project: Tika Issue Type: Improvement Components: packaging Affects Versions: 1.7 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.10 Attachments: TIKA-1536.patch Since we committed TIKA-1423 it would appear through [mailing list|http://www.mail-archive.com/dev%40tika.apache.org/msg11542.html] commentary that there is a willingness to drop support for Java 1.6 in favour of >= Java 1.7. This issue simply addresses this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-1199) Tika extracts weird signs instead of text
[ https://issues.apache.org/jira/browse/TIKA-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1199. - Resolution: Not A Problem Tika extracts weird signs instead of text - Key: TIKA-1199 URL: https://issues.apache.org/jira/browse/TIKA-1199 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Environment: MacOSX, Linux Reporter: Marc Teutelink Attachments: gaat fout.pdf, plain_text_tika_output_from_gaat_fout_pdf.txt, structured_text_tika_output_from_gaat_fout_pdf.xml Tika extracts complete bogus text from the attached document. I have attached the .PDF in question and also added the plain and structured text output from Tika. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1630) Mention APK support in List of Supported Formats
[ https://issues.apache.org/jira/browse/TIKA-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1630. --- Resolution: Fixed Fix Version/s: 1.9 Assignee: Tyler Palsulich Bolded the Please note for version 1.9. Hopefully that will help clear things up. [~flowlo], thank you for reporting this! Please let us know if you run into any other issues or have any other suggested improvements. Mention APK support in List of Supported Formats Key: TIKA-1630 URL: https://issues.apache.org/jira/browse/TIKA-1630 Project: Tika Issue Type: Improvement Components: documentation Affects Versions: 1.8 Reporter: Lorenz Leutgeb Assignee: Tyler Palsulich Priority: Trivial Fix For: 1.9 http://tika.apache.org/1.8/formats.html claims to offer a full list of supported formats does not mention support for APK files at all. I trusted that source and only found that tike supports APK files and their respective MIME types from looking at Tikas codebase, which is suboptimal. Please add APK files to that list as appropriate (at least include the MIME type Tika understands). Consider reevaluating the list to find out whether other formats are missing (this is not covered by this ticket). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1652) Tika Server should allow config file override from the command line like Tika App
[ https://issues.apache.org/jira/browse/TIKA-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575986#comment-14575986 ] Tyler Palsulich commented on TIKA-1652: --- I think this is a duplicate of TIKA-1426? Tika Server should allow config file override from the command line like Tika App - Key: TIKA-1652 URL: https://issues.apache.org/jira/browse/TIKA-1652 Project: Tika Issue Type: Bug Components: server Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.9 Tika-app's TikaCLI allows a command line parameter, --config, to override the Tika config at the command line. For whatever reason, Tika-server doesn't; it should, since it causes a different control flow for things to get created. I first saw this when testing the CTAKESParser (TIKA-1645) in Tika-server. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1624) Syntax error in DOAP file release section
[ https://issues.apache.org/jira/browse/TIKA-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553281#comment-14553281 ] Tyler Palsulich commented on TIKA-1624: --- Thanks, Ken. I published the file a few minutes ago. Syntax error in DOAP file release section - Key: TIKA-1624 URL: https://issues.apache.org/jira/browse/TIKA-1624 Project: Tika Issue Type: Bug Environment: http://svn.apache.org/repos/asf/tika/site/src/site/resources/doap.rdf Reporter: Sebb Assignee: Ken Krugler DOAP files can contain details of multiple release Versions, however each must be listed in a separate release section, for example:
<release>
  <Version>
    <name>Apache XYZ</name>
    <created>2015-02-16</created>
    <revision>1.6.2</revision>
  </Version>
</release>
<release>
  <Version>
    <name>Apache XYZ</name>
    <created>2014-09-24</created>
    <revision>1.6.1</revision>
  </Version>
</release>
Please can the project DOAP be corrected accordingly? Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1630) Mention APK support in List of Supported Formats
[ https://issues.apache.org/jira/browse/TIKA-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553272#comment-14553272 ] Tyler Palsulich commented on TIKA-1630: --- That is a very good point. There is a paragraph on the formats page which explains in a little bit more detail: bq. (Please note that Apache Tika is able to detect a much wider range of formats than those listed below, this page only documents those formats from which Tika is able to extract metadata and/or textual content) Would it help if we included a link to the mimetypes file (which has all filetypes Tika can detect)? Mention APK support in List of Supported Formats Key: TIKA-1630 URL: https://issues.apache.org/jira/browse/TIKA-1630 Project: Tika Issue Type: Improvement Components: documentation Affects Versions: 1.8 Reporter: Lorenz Leutgeb Priority: Trivial http://tika.apache.org/1.8/formats.html claims to offer a full list of supported formats does not mention support for APK files at all. I trusted that source and only found that tike supports APK files and their respective MIME types from looking at Tikas codebase, which is suboptimal. Please add APK files to that list as appropriate (at least include the MIME type Tika understands). Consider reevaluating the list to find out whether other formats are missing (this is not covered by this ticket). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1630) Mention APK support in List of Supported Formats
[ https://issues.apache.org/jira/browse/TIKA-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544104#comment-14544104 ] Tyler Palsulich commented on TIKA-1630: --- Hi. Thanks for reporting this! Can you be a little more specific about which file is supported? What in the Tika codebase indicates support for APK formats? Also, just to be clear, are you referring to android application packages? Mention APK support in List of Supported Formats Key: TIKA-1630 URL: https://issues.apache.org/jira/browse/TIKA-1630 Project: Tika Issue Type: Improvement Components: documentation Affects Versions: 1.8 Reporter: Lorenz Leutgeb Priority: Trivial http://tika.apache.org/1.8/formats.html claims to offer a full list of supported formats does not mention support for APK files at all. I trusted that source and only found that tike supports APK files and their respective MIME types from looking at Tikas codebase, which is suboptimal. Please add APK files to that list as appropriate (at least include the MIME type Tika understands). Consider reevaluating the list to find out whether other formats are missing (this is not covered by this ticket). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1624) Syntax error in DOAP file release section
[ https://issues.apache.org/jira/browse/TIKA-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544150#comment-14544150 ] Tyler Palsulich commented on TIKA-1624: --- [~kkrugler], yes. I just updated the release instructions. Syntax error in DOAP file release section - Key: TIKA-1624 URL: https://issues.apache.org/jira/browse/TIKA-1624 Project: Tika Issue Type: Bug Environment: http://svn.apache.org/repos/asf/tika/site/src/site/resources/doap.rdf Reporter: Sebb Assignee: Ken Krugler DOAP files can contain details of multiple release Versions, however each must be listed in a separate release section, for example:
<release>
  <Version>
    <name>Apache XYZ</name>
    <created>2015-02-16</created>
    <revision>1.6.2</revision>
  </Version>
</release>
<release>
  <Version>
    <name>Apache XYZ</name>
    <created>2014-09-24</created>
    <revision>1.6.1</revision>
  </Version>
</release>
Please can the project DOAP be corrected accordingly? Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1585) Create Example Website with Form Submission
[ https://issues.apache.org/jira/browse/TIKA-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14507259#comment-14507259 ] Tyler Palsulich commented on TIKA-1585: --- Is there an Apache hosted location we'd like to stand this up? If not, I'll close this issue off. http://tpalsulich.github.io/TikaExamples/ Create Example Website with Form Submission --- Key: TIKA-1585 URL: https://issues.apache.org/jira/browse/TIKA-1585 Project: Tika Issue Type: New Feature Components: example, server Reporter: Tyler Palsulich Assignee: Tyler Palsulich It would be great to have a website where we can direct people who ask what Tika can do for [filetype] without needing them to actually download Tika. Some initial work to do that is [here|http://tpalsulich.github.io/TikaExamples/]. I'm far from a design guru, but I imagine the site as having a form where you can upload a file at the top, checkboxes for if you want metadata, content, or both, and a submit button. The request should be sent with AJAX and the result should populate a {{div}}. One issue with AJAX requests is that Tika Server doesn't currently allow Cross-Origin-Resource-Sharing (CORS). So, we either need to maintain a slightly updated tika-server, or update the server to allow configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1607) Introduce new HashMapString, Object data structure for persitsence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503778#comment-14503778 ] Tyler Palsulich commented on TIKA-1607: --- Good idea! What if you created a subclass of {{Metadata}} ({{ExtendedMetadata}}?) which supports mapping to a {{List<Map<String, Object>>}}. Then, when populating the metadata with a phone number, you can check if {{metadata instanceof ExtendedMetadata}} and respond accordingly. Any drastic changes would be a good candidate for Tika 2.0. Introduce new HashMap<String, Object> data structure for persitsence of Tika Metadata - Key: TIKA-1607 URL: https://issues.apache.org/jira/browse/TIKA-1607 Project: Tika Issue Type: Improvement Components: core, metadata Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Critical Fix For: 1.9 I am currently working on implementing more comprehensive extraction and enhancement of the Tika support for Phone number extraction and metadata modeling. Right now we utilize the String[] multivalued support available within Tika to persist phone numbers as {code} Metadata: String: String[] Metadata: phonenumbers: number1, number2, number3, ... {code} I would like to propose we extend multi-valued support outside of the String[] paradigm by implementing a more abstract Collection of Objects such that we could consider and implement the phone number use case as follows {code} Metadata: String: List<HashMap<String,String>> {code} Where Object could be a Collection<HashMap<String/Property, String/int/long>> e.g. {code} Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), (LibPN-NumberType: International), (etc: etc)...), (+1292611054: LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) (etc)] {code} There are obvious backwards compatibility issues with this approach... additionally it is a fundamental change to the code Metadata API. I hope that the <String, Object> mapping however is flexible enough to allow me to model Tika Metadata the way I want. Any comments folks? Thanks Lewis -- This message was sent by Atlassian JIRA (v6.3.4#6332)
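The subclass idea suggested in the comment above could look roughly like the following. This is only a sketch under stated assumptions: the nested `Metadata` class is a simplified stand-in written for this example (it is not the real `org.apache.tika.metadata.Metadata`), and `ExtendedMetadata`, `addStructured`, `getStructured` and `recordPhoneNumber` are hypothetical names that do not exist in Tika.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ExtendedMetadataSketch {

    // Simplified stand-in for Tika's Metadata: a name maps to String values only.
    static class Metadata {
        private final Map<String, List<String>> values = new HashMap<>();

        void add(String name, String value) {
            values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }

        String[] getValues(String name) {
            return values.getOrDefault(name, Collections.emptyList()).toArray(new String[0]);
        }
    }

    // Hypothetical subclass: a name can also map to a List<Map<String, Object>>.
    static class ExtendedMetadata extends Metadata {
        private final Map<String, List<Map<String, Object>>> structured = new HashMap<>();

        void addStructured(String name, Map<String, Object> entry) {
            structured.computeIfAbsent(name, k -> new ArrayList<>()).add(entry);
        }

        List<Map<String, Object>> getStructured(String name) {
            return structured.getOrDefault(name, Collections.emptyList());
        }
    }

    // What a parser might do on finding a phone number: always keep the flat
    // String form, and add the richer form only if the caller supports it.
    static void recordPhoneNumber(Metadata metadata, String number, String countryCode) {
        metadata.add("phonenumbers", number);
        if (metadata instanceof ExtendedMetadata) {
            Map<String, Object> details = new LinkedHashMap<>();
            details.put("number", number);
            details.put("LibPN-CountryCode", countryCode);
            ((ExtendedMetadata) metadata).addStructured("phonenumbers", details);
        }
    }

    public static void main(String[] args) {
        ExtendedMetadata metadata = new ExtendedMetadata();
        recordPhoneNumber(metadata, "+162648743476", "US");
        System.out.println(metadata.getValues("phonenumbers").length);
        System.out.println(metadata.getStructured("phonenumbers"));
    }
}
```

Because existing callers that pass a plain `Metadata` still receive the String values, an approach along these lines would sidestep the backwards-compatibility concern raised in the issue description, at the cost of an `instanceof` check in each parser.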
[jira] [Closed] (TIKA-1266) Tika OSGI Bundle needs Bundle-ClassPath to work in Equinox
[ https://issues.apache.org/jira/browse/TIKA-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1266. - Resolution: Not A Problem Thanks, [~bobpaulin]! Tika OSGI Bundle needs Bundle-ClassPath to work in Equinox -- Key: TIKA-1266 URL: https://issues.apache.org/jira/browse/TIKA-1266 Project: Tika Issue Type: Improvement Components: packaging Affects Versions: 1.4, 1.5 Reporter: pm The tika-bundle currently has the Embed-Dependency header filled with embedded dependencies. Embed-Dependency is not defined in the OSGi spec; Bundle-ClassPath is. Please add Bundle-ClassPath with the list of embedded JAR names, prefixed with ".,". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1593) Doco: Broken link to Parser Quick Start Guide
[ https://issues.apache.org/jira/browse/TIKA-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492638#comment-14492638 ] Tyler Palsulich commented on TIKA-1593: --- See https://svn.apache.org/repos/asf/tika/site/src/site/apt/download.apt.vm -- you need the vm extension. Then, you can use {code}${project.parent.version}{code} to get the current version of the project. Then, when we update the site for a new release, you just have to change the version number in the site's pom.xml file. I'll fix this right now. Doco: Broken link to Parser Quick Start Guide --- Key: TIKA-1593 URL: https://issues.apache.org/jira/browse/TIKA-1593 Project: Tika Issue Type: Bug Components: documentation Affects Versions: 1.7 Reporter: Dan Rollo Priority: Minor The Tika web page: https://tika.apache.org/contribute.html, under the Section: New Parsers, Detectors and Mime Types, there is a link with the text: Parser Quick Start Guide. The link URL is: https://tika.apache.org/parser_guide.apt, and does not work. The .apt extension seems odd. I don't know what the link should be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1593) Doco: Broken link to Parser Quick Start Guide
[ https://issues.apache.org/jira/browse/TIKA-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1593. --- Resolution: Fixed Assignee: Tyler Palsulich Fixed in r1673240. Thank you [~bhamail]! Please let us know if you find any more. Doco: Broken link to Parser Quick Start Guide --- Key: TIKA-1593 URL: https://issues.apache.org/jira/browse/TIKA-1593 Project: Tika Issue Type: Bug Components: documentation Affects Versions: 1.7 Reporter: Dan Rollo Assignee: Tyler Palsulich Priority: Minor The Tika web page: https://tika.apache.org/contribute.html, under the Section: New Parsers, Detectors and Mime Types, there is a link with the text: Parser Quick Start Guide. The link URL is: https://tika.apache.org/parser_guide.apt, and does not work. The .apt extension seems odd. I don't know what the link should be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1593) Doco: Broken link to Parser Quick Start Guide
[ https://issues.apache.org/jira/browse/TIKA-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492662#comment-14492662 ] Tyler Palsulich edited comment on TIKA-1593 at 4/13/15 5:02 PM: Fixed in r1673240 and r1673241. Thank you [~bhamail]! Please let us know if you find any more. was (Author: tpalsulich): Fixed in r1673240. Thank you [~bhamail]! Please let us know if you find any more. Doco: Broken link to Parser Quick Start Guide --- Key: TIKA-1593 URL: https://issues.apache.org/jira/browse/TIKA-1593 Project: Tika Issue Type: Bug Components: documentation Affects Versions: 1.7 Reporter: Dan Rollo Assignee: Tyler Palsulich Priority: Minor The Tika web page: https://tika.apache.org/contribute.html, under the Section: New Parsers, Detectors and Mime Types, there is a link with the text: Parser Quick Start Guide. The link URL is: https://tika.apache.org/parser_guide.apt, and does not work. The .apt extension seems odd. I don't know what the link should be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1600) Unable to parse ODT files because of failed to close temporary resources
[ https://issues.apache.org/jira/browse/TIKA-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1600. --- Resolution: Fixed Assignee: Hong-Thai Nguyen Thanks, [~thaichat04]! I just updated it -- reformatted the ODF parsing files (they were all a bit odd with whitespace) and moved the test into the existing test file. Marking this as fixed and will cut a new release shortly. Unable to parse ODT files because of failed to close temporary resources Key: TIKA-1600 URL: https://issues.apache.org/jira/browse/TIKA-1600 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.8 Environment: Windows Reporter: Hong-Thai Nguyen Assignee: Hong-Thai Nguyen Attachments: Manuel_koha.odt Many ODT files are failed to parse causing of this exception. A sample file in attachment {code} Apache Tika was unable to parse the document at C:\Users\hong-thai.nguyen\Downloads\Manuel_koha.odt. The full exception stack trace is included below: org.apache.tika.exception.TikaException: Failed to close temporary resources at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:152) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:127) at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:342) at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:299) at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:256) at javax.swing.AbstractButton.fireActionPerformed(Unknown Source) at javax.swing.AbstractButton$Handler.actionPerformed(Unknown Source) at javax.swing.DefaultButtonModel.fireActionPerformed(Unknown Source) at javax.swing.DefaultButtonModel.setPressed(Unknown Source) at javax.swing.AbstractButton.doClick(Unknown Source) at javax.swing.plaf.basic.BasicMenuItemUI.doClick(Unknown Source) at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(Unknown Source) at java.awt.Component.processMouseEvent(Unknown Source) at javax.swing.JComponent.processMouseEvent(Unknown Source) at 
java.awt.Component.processEvent(Unknown Source) at java.awt.Container.processEvent(Unknown Source) at java.awt.Component.dispatchEventImpl(Unknown Source) at java.awt.Container.dispatchEventImpl(Unknown Source) at java.awt.Component.dispatchEvent(Unknown Source) at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source) at java.awt.LightweightDispatcher.processMouseEvent(Unknown Source) at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source) at java.awt.Container.dispatchEventImpl(Unknown Source) at java.awt.Window.dispatchEventImpl(Unknown Source) at java.awt.Component.dispatchEvent(Unknown Source) at java.awt.EventQueue.dispatchEventImpl(Unknown Source) at java.awt.EventQueue.access$400(Unknown Source) at java.awt.EventQueue$3.run(Unknown Source) at java.awt.EventQueue$3.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source) at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source) at java.awt.EventQueue$4.run(Unknown Source) at java.awt.EventQueue$4.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source) at java.awt.EventQueue.dispatchEvent(Unknown Source) at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source) at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source) at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source) at java.awt.EventDispatchThread.pumpEvents(Unknown Source) at java.awt.EventDispatchThread.pumpEvents(Unknown Source) at java.awt.EventDispatchThread.run(Unknown Source) Caused by: java.io.IOException: Could not delete temporary file C:\Users\HONG-T~1.NGU\AppData\Local\Temp\apache-tika-2891340188156641845.tmp at org.apache.tika.io.TemporaryResources$1.close(TemporaryResources.java:70) at org.apache.tika.io.TemporaryResources.close(TemporaryResources.java:121) at 
org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:150) ... 42 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1600) Unable to parse ODT files because of failed to close temporary resources
[ https://issues.apache.org/jira/browse/TIKA-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-1600: -- Priority: Blocker (was: Major) Unable to parse ODT files because of failed to close temporary resources Key: TIKA-1600 URL: https://issues.apache.org/jira/browse/TIKA-1600 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.8 Environment: Windows Reporter: Hong-Thai Nguyen Assignee: Hong-Thai Nguyen Priority: Blocker Attachments: Manuel_koha.odt Many ODT files are failed to parse causing of this exception. A sample file in attachment {code} Apache Tika was unable to parse the document at C:\Users\hong-thai.nguyen\Downloads\Manuel_koha.odt. The full exception stack trace is included below: org.apache.tika.exception.TikaException: Failed to close temporary resources at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:152) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:127) at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:342) at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:299) at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:256) at javax.swing.AbstractButton.fireActionPerformed(Unknown Source) at javax.swing.AbstractButton$Handler.actionPerformed(Unknown Source) at javax.swing.DefaultButtonModel.fireActionPerformed(Unknown Source) at javax.swing.DefaultButtonModel.setPressed(Unknown Source) at javax.swing.AbstractButton.doClick(Unknown Source) at javax.swing.plaf.basic.BasicMenuItemUI.doClick(Unknown Source) at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(Unknown Source) at java.awt.Component.processMouseEvent(Unknown Source) at javax.swing.JComponent.processMouseEvent(Unknown Source) at java.awt.Component.processEvent(Unknown Source) at java.awt.Container.processEvent(Unknown Source) at java.awt.Component.dispatchEventImpl(Unknown Source) at java.awt.Container.dispatchEventImpl(Unknown Source) at 
java.awt.Component.dispatchEvent(Unknown Source) at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source) at java.awt.LightweightDispatcher.processMouseEvent(Unknown Source) at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source) at java.awt.Container.dispatchEventImpl(Unknown Source) at java.awt.Window.dispatchEventImpl(Unknown Source) at java.awt.Component.dispatchEvent(Unknown Source) at java.awt.EventQueue.dispatchEventImpl(Unknown Source) at java.awt.EventQueue.access$400(Unknown Source) at java.awt.EventQueue$3.run(Unknown Source) at java.awt.EventQueue$3.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source) at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source) at java.awt.EventQueue$4.run(Unknown Source) at java.awt.EventQueue$4.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source) at java.awt.EventQueue.dispatchEvent(Unknown Source) at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source) at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source) at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source) at java.awt.EventDispatchThread.pumpEvents(Unknown Source) at java.awt.EventDispatchThread.pumpEvents(Unknown Source) at java.awt.EventDispatchThread.run(Unknown Source) Caused by: java.io.IOException: Could not delete temporary file C:\Users\HONG-T~1.NGU\AppData\Local\Temp\apache-tika-2891340188156641845.tmp at org.apache.tika.io.TemporaryResources$1.close(TemporaryResources.java:70) at org.apache.tika.io.TemporaryResources.close(TemporaryResources.java:121) at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:150) ... 42 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-1592) It seems dbus and x11 server are invoked, and fails for some reason too
[ https://issues.apache.org/jira/browse/TIKA-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1592. - Resolution: Invalid Closing as Invalid. Feel free to create additional issues if you run into other problems with Tika! Thank you for updating with the solution! I'm glad you found it. :) (I'm also glad this wasn't a Tika issue... Ha.) It seems dbus and x11 server are invoked, and fails for some reason too --- Key: TIKA-1592 URL: https://issues.apache.org/jira/browse/TIKA-1592 Project: Tika Issue Type: Bug Affects Versions: 1.7 Environment: CentOs 6.6, Java 1.7 Reporter: Michael Couck Exception running unit tests: GConf Error: Failed to contact configuration server; some possible causes are that you need to enable TCP/IP networking for ORBit, or you have stale NFS locks due to a system crash. See http://projects.gnome.org/gconf/ for information. (Details - 1: Not running within active session) Is Tika trying to start an x11 server using dbus? Why? This breaks the unit tests, the logging is a gig for each run, and even a 64 core server is 100% cpu during the failure. I am completely confounded. Any ideas? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1592) It seems dbus and x11 server are invoked, and fails for some reason too
[ https://issues.apache.org/jira/browse/TIKA-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393246#comment-14393246 ] Tyler Palsulich commented on TIKA-1592: --- I tried building ikube on a Mac, but I ran into multiple test failures. {code} Tests in error: analyze(ikube.analytics.weka.WekaClassifierIntegration) initializationError(ikube.action.rule.IsRemoteIndexCurrentIntegration) initializationError(ikube.analytics.weka.WekaForecastClassifierIntegration) initializationError(ikube.database.DataBaseIntegration) initializationError(ikube.action.index.handler.database.TableResourceProviderIntegration) initializationError(ikube.web.service.AnalyzerIntegration) initializationError(ikube.analytics.AnalyticsServiceIntegration) initializationError(ikube.scheduling.SnapshotScheduleIntegration) initializationError(ikube.web.service.SearcherJsonIntegration) initializationError(ikube.scheduling.PruneScheduleIntegration) initializationError(ikube.action.index.handler.email.IndexableEmailHandlerIntegration) initializationError(ikube.action.index.handler.strategy.GeospatialEnrichmentStrategyIntegration) initializationError(ikube.action.index.handler.filesystem.IndexableFilesystemHandlerIntegration) initializationError(ikube.web.service.SearcherXmlIntegration) initializationError(ikube.action.ResetIntegration) initializationError(ikube.action.index.handler.internet.SvnHandlerIntegration) initializationError(ikube.toolkit.DatabaseUtilitiesIntegration) initializationError(ikube.action.rule.RulesIntegration) initializationError(ikube.analytics.neuroph.NeurophAnalyzerIntegration) initializationError(ikube.database.EntityIntegration) initializationError(ikube.cluster.hzc.ClusterManagerCacheSearchIntegration) initializationError(ikube.action.index.handler.database.IndexableTableHandlerIntegration) {code} Is Linux required? Can you give some context of how you're using Tika in the failing unit test? 
Tika should not have any (or, really, there is very little) OS specific code. So, it doesn't make sense why something would try to start x11. But, a dependency could definitely be up to something fishy. It seems dbus and x11 server are invoked, and fails for some reason too --- Key: TIKA-1592 URL: https://issues.apache.org/jira/browse/TIKA-1592 Project: Tika Issue Type: Bug Affects Versions: 1.7 Environment: CentOs 6.6, Java 1.7 Reporter: Michael Couck Exception running unit tests: GConf Error: Failed to contact configuration server; some possible causes are that you need to enable TCP/IP networking for ORBit, or you have stale NFS locks due to a system crash. See http://projects.gnome.org/gconf/ for information. (Details - 1: Not running within active session) Is Tika trying to start an x11 server using dbus? Why? This breaks the unit tests, the logging is a gig for each run, and even a 64 core server is 100% cpu during the failure. I am completely confounded. Any ideas? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1592) It seems dbus and x11 server are invoked, and fails for some reason too
[ https://issues.apache.org/jira/browse/TIKA-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393184#comment-14393184 ] Tyler Palsulich commented on TIKA-1592: --- Thanks for reporting this, [~michaelcouck]! Just to be clear, you're building Tika 1.7 from source? Which test case causes this? After a quick {{grep}}, I don't see any gconf or dbus references (don't know why there would be any, off the top of my head...). When you say the logging is a gig, is that what is sent to stdout when doing {{mvn install}}? Or something else? It seems dbus and x11 server are invoked, and fails for some reason too --- Key: TIKA-1592 URL: https://issues.apache.org/jira/browse/TIKA-1592 Project: Tika Issue Type: Bug Affects Versions: 1.7 Environment: CentOs 6.6, Java 1.7 Reporter: Michael Couck Exception running unit tests: GConf Error: Failed to contact configuration server; some possible causes are that you need to enable TCP/IP networking for ORBit, or you have stale NFS locks due to a system crash. See http://projects.gnome.org/gconf/ for information. (Details - 1: Not running within active session) Is Tika trying to start an x11 server using dbus? Why? This breaks the unit tests, the logging is a gig for each run, and even a 64 core server is 100% cpu during the failure. I am completely confounded. Any ideas? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1592) It seems dbus and x11 server are invoked, and fails for some reason too
[ https://issues.apache.org/jira/browse/TIKA-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393184#comment-14393184 ] Tyler Palsulich edited comment on TIKA-1592 at 4/2/15 7:09 PM: --- Thanks for reporting this, [~michaelcouck]! Just to be clear, you're building Tika 1.7 from source? Which test case causes this? -After a quick {{grep}}, I don't see any gconf or dbus references (don't know why there would be any, off the top of my head...)- See {{grep}} output below. When you say the logging is a gig, is that what is sent to stdout when doing {{mvn install}}? Or something else? {code} ➜ trunk grep -Ri dbus . Binary file ./tika-parsers/src/test/resources/test-documents/testTIFF.tif matches Binary file ./tika-parsers/target/test-classes/test-documents/testTIFF.tif matches Binary file ./tika-parsers/target/tika-parsers-1.8-SNAPSHOT-tests.jar matches Binary file ./tika-server/target/tika-server-1.8-SNAPSHOT.jar matches ➜ trunk grep -Ri gconf . Binary file ./tika-app/target/tika-app-1.8-SNAPSHOT.jar matches Binary file ./tika-server/target/tika-server-1.8-SNAPSHOT.jar matches {code} was (Author: tpalsulich): Thanks for reporting this, [~michaelcouck]! Just to be clear, you're building Tika 1.7 from source? Which test case causes this? After a quick {{grep}}, I don't see any gconf or dbus references (don't know why there would be any, off the top of my head...). When you say the logging is a gig, is that what is sent to stdout when doing {{mvn install}}? Or something else? 
It seems dbus and x11 server are invoked, and fails for some reason too --- Key: TIKA-1592 URL: https://issues.apache.org/jira/browse/TIKA-1592 Project: Tika Issue Type: Bug Affects Versions: 1.7 Environment: CentOs 6.6, Java 1.7 Reporter: Michael Couck Exception running unit tests: GConf Error: Failed to contact configuration server; some possible causes are that you need to enable TCP/IP networking for ORBit, or you have stale NFS locks due to a system crash. See http://projects.gnome.org/gconf/ for information. (Details - 1: Not running within active session) Is Tika trying to start an x11 server using dbus? Why? This breaks the unit tests, the logging is a gig for each run, and even a 64 core server is 100% cpu during the failure. I am completely confounded. Any ideas? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1585) Create Example Website with Form Submission
[ https://issues.apache.org/jira/browse/TIKA-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390841#comment-14390841 ] Tyler Palsulich edited comment on TIKA-1585 at 4/1/15 3:51 PM: --- Done. It works. -I'll see if I can shut 9997 down right now.- Port 9997 is now closed. was (Author: tpalsulich): Done. It works. I'll see if I can shut 9997 down right now. Create Example Website with Form Submission --- Key: TIKA-1585 URL: https://issues.apache.org/jira/browse/TIKA-1585 Project: Tika Issue Type: New Feature Components: example, server Reporter: Tyler Palsulich Assignee: Tyler Palsulich It would be great to have a website where we can direct people who ask what Tika can do for [filetype] without needing them to actually download Tika. Some initial work to do that is [here|http://tpalsulich.github.io/TikaExamples/]. I'm far from a design guru, but I imagine the site as having a form where you can upload a file at the top, checkboxes for if you want metadata, content, or both, and a submit button. The request should be sent with AJAX and the result should populate a {{div}}. One issue with AJAX requests is that Tika Server doesn't currently allow Cross-Origin-Resource-Sharing (CORS). So, we either need to maintain a slightly updated tika-server, or update the server to allow configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1558) Create a Parser Blacklist
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-1558: -- Description: As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to disable Parsers without pulling their dependencies out. In some cases (e.g. disable all ExternalParsers), there may not be an easy way to exclude the dependencies via Maven. -So, an initial design would be to include another file like {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a new method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}.- was: As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to disable Parsers without pulling their dependencies out. In some cases (e.g. disable all ExternalParsers), there may not be an easy way to exclude the dependencies via Maven. So, an initial design would be to include another file like {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a new method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}. Create a Parser Blacklist - Key: TIKA-1558 URL: https://issues.apache.org/jira/browse/TIKA-1558 Project: Tika Issue Type: New Feature Reporter: Tyler Palsulich Assignee: Tyler Palsulich Fix For: 1.8 As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to disable Parsers without pulling their dependencies out. In some cases (e.g. disable all ExternalParsers), there may not be an easy way to exclude the dependencies via Maven. -So, an initial design would be to include another file like {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. 
We create a new method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}.- -- This message was sent by Atlassian JIRA (v6.3.4#6332)
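The filtering step in the (since struck-through) initial design, removing every loaded provider that is assignable to a blacklisted class, can be sketched as follows. This is an illustration only, not the code committed to Tika: the `Parser` interface and the parser classes here are small stand-ins defined for the example, and `removeBlacklisted` is a hypothetical helper name.

```java
import java.util.ArrayList;
import java.util.List;

class ParserBlacklistSketch {

    // Stand-ins for the example; these are not the real Tika classes.
    interface Parser {}
    static class TextParser implements Parser {}
    static class ExternalParser implements Parser {}
    static class FfmpegParser extends ExternalParser {}

    // Keep only providers whose class is NOT assignable to any blacklisted class.
    // Because isAssignableFrom also matches subclasses, blacklisting
    // ExternalParser removes FfmpegParser as well.
    static <T> List<T> removeBlacklisted(List<T> providers, List<Class<?>> blacklist) {
        List<T> kept = new ArrayList<>();
        for (T provider : providers) {
            boolean blocked = false;
            for (Class<?> banned : blacklist) {
                if (banned.isAssignableFrom(provider.getClass())) {
                    blocked = true;
                    break;
                }
            }
            if (!blocked) {
                kept.add(provider);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Parser> loaded = new ArrayList<>();
        loaded.add(new TextParser());
        loaded.add(new ExternalParser());
        loaded.add(new FfmpegParser());

        List<Class<?>> blacklist = new ArrayList<>();
        blacklist.add(ExternalParser.class);

        List<Parser> kept = removeBlacklisted(loaded, blacklist);
        System.out.println(kept.size());
    }
}
```

Matching on assignability rather than exact class name is what would make "disable all ExternalParsers" a one-line blacklist entry.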
[jira] [Comment Edited] (TIKA-1558) Create a Parser Blacklist
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1432#comment-1432 ] Tyler Palsulich edited comment on TIKA-1558 at 3/31/15 9:41 PM: -Above strategy added in r1661284. You can now blacklist Parsers by adding names to {{META-INF/services/org.apache.tika.parser.Parser.blacklist}} with the same format as the normal services file. If a class is blacklisted, all of its subclasses are automatically blacklisted.- Edit: Service loading blacklisting disabled in r1670487. Use a custom TikaConfig like [this one|https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/test/resources/org/apache/tika/config/TIKA-1558-blacklistsub.xml] to disable a Parser. Any subclasses of that Parser will also be excluded. was (Author: tpalsulich): Above strategy added in r1661284. You can now blacklist Parsers by adding names to {{META-INF/services/org.apache.tika.parser.Parser.blacklist}} with the same format as the normal services file. If a class is blacklisted, all of its subclasses are automatically blacklisted. Create a Parser Blacklist - Key: TIKA-1558 URL: https://issues.apache.org/jira/browse/TIKA-1558 Project: Tika Issue Type: New Feature Reporter: Tyler Palsulich Assignee: Tyler Palsulich Fix For: 1.8 As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to disable Parsers without pulling their dependencies out. In some cases (e.g. disable all ExternalParsers), there may not be an easy way to exclude the dependencies via Maven. -So, an initial design would be to include another file like {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a new method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}.- -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1587) ForkParser::setJavaCommand should take List<String>
[ https://issues.apache.org/jira/browse/TIKA-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386685#comment-14386685 ] Tyler Palsulich commented on TIKA-1587: --- Thank you for reporting this! It seems like a definite problem. Is there any way you can provide a patch? ForkParser::setJavaCommand should take List<String> --- Key: TIKA-1587 URL: https://issues.apache.org/jira/browse/TIKA-1587 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.7 Reporter: Oleg Oshmyan ForkParser::setJavaCommand currently takes a string and splits it on whitespace. This makes it impossible to use commands with paths that contain spaces. In particular, it makes it impossible to reliably use System.getProperty("java.home") in order to launch the same Java that the current process is running in, because it might contain spaces. If it would just take a List<String> and pass (a clone of) it directly to ProcessBuilder, this wouldn't be a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
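The problem described in this report can be reproduced without Tika at all. The sketch below contrasts splitting a command string on whitespace with passing a list straight to `java.lang.ProcessBuilder`. The helper names `splitOnWhitespace` and `builderFor` are made up for this example and are not ForkParser methods, and the Windows path is a hypothetical installation directory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class JavaCommandSketch {

    // Mimics the behaviour described in the report: one string, split on whitespace.
    static List<String> splitOnWhitespace(String command) {
        return Arrays.asList(command.trim().split("\\s+"));
    }

    // The proposed alternative: take the arguments as a list and hand a copy
    // of it to ProcessBuilder, so no argument is ever re-split.
    static ProcessBuilder builderFor(List<String> command) {
        return new ProcessBuilder(new ArrayList<>(command));
    }

    public static void main(String[] args) {
        // Hypothetical java.home-style path containing a space.
        String java = "C:\\Program Files\\Java\\jre7\\bin\\java";

        // The path is torn into two arguments, so the executable name is wrong.
        List<String> broken = splitOnWhitespace(java + " -Xmx32m");
        System.out.println(broken.size() + " arguments, first = " + broken.get(0));

        // The path survives intact as the first argument.
        List<String> intact = builderFor(Arrays.asList(java, "-Xmx32m")).command();
        System.out.println(intact.size() + " arguments, first = " + intact.get(0));
    }
}
```

The example never starts a process; it only inspects the argument lists, which is enough to show where the path gets broken.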
[jira] [Comment Edited] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386906#comment-14386906 ] Tyler Palsulich edited comment on TIKA-1584 at 3/30/15 4:05 PM: Yup! The 1.8 release process should start this week. Ideally, it will hit the mirrors some time next week. [edit: 1.8, not 1.7!] was (Author: tpalsulich): Yup! The 1.7 release process should start this week. Ideally, it will hit the mirrors some time next week. Tika 1.7 possible regression (nested attachment files not getting parsed) - Key: TIKA-1584 URL: https://issues.apache.org/jira/browse/TIKA-1584 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.7 Reporter: Rob Tulloh Assignee: Tim Allison Priority: Blocker Fix For: 1.8 I tried to send this to the tika user list, but got a qmail failure so I am opening a jira to see if I can get help with this. There appears to be a change in the behavior of tika since 1.5 (the last version we have used). In 1.5, if we pass a file with content type of rfc822 which contains a zip that contains a docx file, the entire content would get recursed and the text returned. In 1.7, tika only unwinds as far as the zip file and ignores the content of the contained docx file. This is causing a regression failure in our search tests because the contents of the docx file are not found when searched for. We are testing with tika-server if this helps. If we ask the meta service to just characterize the test data, it correctly determines the input is of type rfc822. However, on extract, the contents of the attachment are not extracted as expected. 
curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream http://localhost:9998/meta 2>/dev/null | grep Content-Type Content-Type,message/rfc822 curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream http://localhost:9998/tika 2>/dev/null | grep docx sign.docx <--- this is not expected, need contents of this extracted We can easily reproduce this problem with a simple eml file with an attachment. Can someone please comment if this seems like a problem or perhaps we need to change something in our call to get the old behavior? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-1575: -- Fix Version/s: 1.8 Upgrade to PDFBox 1.8.9 when available -- Key: TIKA-1575 URL: https://issues.apache.org/jira/browse/TIKA-1575 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.8 Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json, 10-814_Appendix B_v3.pdf, 524276_719128_diffs.zip, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip, PDFBox_1_8_8Vs1_8_9_20150316.zip, caught_ex_1_8_9.zip, content_diffs_20150316.xlsx, diffs_1_8_9_multithread_vs_single_thread.xlsx, reports_1_8_9_multithread_vs_single.zip The PDFBox community is about to release 1.8.9. Let's use this issue to track discussions before the release and to track Tika's upgrade to PDFBox 1.8.9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1579) Add file type to NetCDFParser
[ https://issues.apache.org/jira/browse/TIKA-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1579. --- Resolution: Fixed Add file type to NetCDFParser - Key: TIKA-1579 URL: https://issues.apache.org/jira/browse/TIKA-1579 Project: Tika Issue Type: Improvement Components: parser Reporter: Ann Burgess Assignee: Ann Burgess Attachments: TIKA-1579.abburgess.190315.patch.txt [~gostep] explains that there are three versions of NetCDF (classic format, 64-bit offset, and netCDF-4/HDF5 format). When opening an existing netCDF file, the netCDF library will transparently detect its format, so we do not need to adjust according to the detected format. That said, it would be good to know the file type, as each can have the .nc extension. This issue will add a patch that adds the file type to the metadata. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
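To illustrate why the three variants can be told apart even though they share the .nc extension, here is a small sketch that looks only at the leading magic bytes. This is not what the attached patch does (the patch is not shown here and presumably asks the netCDF library for the file type); NetcdfFormatSniffer and sniff are hypothetical names used only for this example.

```java
import java.util.Arrays;

public class NetcdfFormatSniffer {

    // HDF5 signature: 0x89 'H' 'D' 'F' '\r' '\n' 0x1a '\n'
    private static final byte[] HDF5_MAGIC = {(byte) 0x89, 'H', 'D', 'F', '\r', '\n', 0x1a, '\n'};

    // Classifies a file by its first bytes. Assumption: the signature sits at offset 0.
    // HDF5 files with a user block carry it at a later offset, and a plain HDF5 file
    // that is not netCDF-4 matches the same signature, so treat the result as a hint.
    static String sniff(byte[] header) {
        if (header.length >= 4 && header[0] == 'C' && header[1] == 'D' && header[2] == 'F') {
            if (header[3] == 1) {
                return "classic";
            }
            if (header[3] == 2) {
                return "64-bit offset";
            }
        }
        if (header.length >= 8 && Arrays.equals(Arrays.copyOf(header, 8), HDF5_MAGIC)) {
            return "netCDF-4/HDF5";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(sniff(new byte[] {'C', 'D', 'F', 1}));
        System.out.println(sniff(new byte[] {'C', 'D', 'F', 2}));
        System.out.println(sniff(HDF5_MAGIC));
    }
}
```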
[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)
[ https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385483#comment-14385483 ] Tyler Palsulich commented on TIKA-1584: --- We now have two major issues which need a quick release. So, I would say go for 1.8. Tim, can you chime in on the current discuss thread? Tika 1.7 possible regression (nested attachment files not getting parsed) - Key: TIKA-1584 URL: https://issues.apache.org/jira/browse/TIKA-1584 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.7 Reporter: Rob Tulloh Assignee: Tim Allison Priority: Blocker I tried to send this to the tika user list, but got a qmail failure so I am opening a jira to see if I can get help with this. There appears to be a change in the behavior of tika since 1.5 (the last version we have used). In 1.5, if we pass a file with content type of rfc822 which contains a zip that contains a docx file, the entire content would get recursed and the text returned. In 1.7, tika only unwinds as far as the zip file and ignores the content of the contained docx file. This is causing a regression failure in our search tests because the contents of the docx file are not found when searched for. We are testing with tika-server if this helps. If we ask the meta service to just characterize the test data, it correctly determines the input is of type rfc822. However, on extract, the contents of the attachment are not extracted as expected.
curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream http://localhost:9998/meta 2>/dev/null | grep Content-Type
Content-Type,message/rfc822
curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream http://localhost:9998/tika 2>/dev/null | grep docx
sign.docx --- this is not expected, need contents of this extracted
We can easily reproduce this problem with a simple eml file with an attachment.
Can someone please comment if this seems like a problem or perhaps we need to change something in our call to get the old behavior? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1585) Create Example Website with Form Submission
Tyler Palsulich created TIKA-1585: - Summary: Create Example Website with Form Submission Key: TIKA-1585 URL: https://issues.apache.org/jira/browse/TIKA-1585 Project: Tika Issue Type: New Feature Components: example, server Reporter: Tyler Palsulich Assignee: Tyler Palsulich It would be great to have a website where we can direct people who ask what Tika can do for [filetype] without needing them to actually download Tika. Some initial work to do that is [here|http://tpalsulich.github.io/TikaExamples/]. I'm far from a design guru, but I imagine the site as having a form where you can upload a file at the top, checkboxes for if you want metadata, content, or both, and a submit button. The request should be sent with AJAX and the result should populate a {{div}}. One issue with AJAX requests is that Tika Server doesn't currently allow Cross-Origin-Resource-Sharing (CORS). So, we either need to maintain a slightly updated tika-server, or update the server to allow configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1526. --- Resolution: Fixed Marking this as Fixed, per the above comments. [~thetaphi] or [~hossman] or anyone else, please reopen this if you find any other cases. Thank you everyone for the help! ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers Key: TIKA-1526 URL: https://issues.apache.org/jira/browse/TIKA-1526 Project: Tika Issue Type: Wish Reporter: Hoss Man The JDK has numerous pain points regarding the Turkish locale, posix_spawn lowercasing being one of them... https://bugs.openjdk.java.net/browse/JDK-8047340 https://bugs.openjdk.java.net/browse/JDK-8055301 As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is enabled and configured by default in Tika, and uses ExternalParser.check to see if tesseract is available -- but because of the JDK bug, this means that Tika fails fast for Turkish users on BSD/UNIX variants (including MacOSX) like so... {noformat} [junit4] Throwable #1: java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
[junit4] at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
[junit4] at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
[junit4] at java.security.AccessController.doPrivileged(Native Method)
[junit4] at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:92)
[junit4] at java.lang.ProcessImpl.start(ProcessImpl.java:130)
[junit4] at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
[junit4] at java.lang.Runtime.exec(Runtime.java:620)
[junit4] at java.lang.Runtime.exec(Runtime.java:485)
[junit4] at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
[junit4] at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
[junit4] at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
[junit4] at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
[junit4] at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
[junit4] at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
[junit4] at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
[junit4] at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
[junit4] at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
[junit4] at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
{noformat}
...unless they go out of their way to whitelist only the parsers they need/want so TesseractOCRParser (and any other ExternalParsers) will never even be check()ed. It would be nice if Tika's ExternalParser class added a similar hack/workaround to what was done in SOLR-6387 to trap these types of errors. In Solr we just propagate a better error explaining why Java hates the Turkish language...
{code}
} catch (Error err) {
  if (err.getMessage() != null && (err.getMessage().contains("posix_spawn") || err.getMessage().contains("UNIXProcess"))) {
    log.warn("Error forking command due to JVM locale bug (see https://issues.apache.org/jira/browse/SOLR-6387): " + err.getMessage());
    return "(error executing: " + cmd + ")";
  }
}
{code}
...but with Tika, it might be better for all ExternalParsers to just opt out as if they don't recognize the filetype when they detect this type of error from the check method (or perhaps it would be better if AutoDetectParser handled this? ... I'm not really sure how it would best fit into Tika's architecture) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
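As a rough sketch of the "opt out" idea described above, assuming Java 9+ (for ProcessBuilder.Redirect.DISCARD): an availability check can treat the locale-related Error the same as "command not installed". SpawnGuard, isLocaleSpawnBug and check are hypothetical names for illustration; this is not the code that was committed to Tika's ExternalParser, and the real check method has a different signature.

```java
import java.io.IOException;

public class SpawnGuard {

    // True only for the JVM Error described in JDK-8047340 / JDK-8055301,
    // recognised by the same message substrings the Solr snippet looks for.
    static boolean isLocaleSpawnBug(Throwable t) {
        String msg = t.getMessage();
        return t instanceof Error
                && msg != null
                && (msg.contains("posix_spawn") || msg.contains("UNIXProcess"));
    }

    // Returns true if the command starts and exits with status 0.
    // A missing command or the locale bug both mean "not available".
    static boolean check(String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            return process.waitFor() == 0;
        } catch (IOException e) {
            return false; // command not found or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (Error err) {
            if (isLocaleSpawnBug(err)) {
                return false; // opt out instead of failing the whole parse
            }
            throw err; // anything else is a genuine problem
        }
    }

    public static void main(String[] args) {
        System.out.println("tesseract available: " + check("tesseract", "--version"));
    }
}
```

Rethrowing every other Error keeps real failures such as OutOfMemoryError visible rather than silently disabling a parser.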
[jira] [Commented] (TIKA-1581) jhighlight license concerns
[ https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385337#comment-14385337 ] Tyler Palsulich commented on TIKA-1581: --- Hi [~kkrugler]. Thanks. The comment is now bq. Tika-parsers component uses CDDL/LGPL dual-licensed dependency: jhighlight (https://github.com/codelibs/jhighlight) If this looks good, I'll start a \[DISCUSS\] thread on the list about a new version. jhighlight license concerns --- Key: TIKA-1581 URL: https://issues.apache.org/jira/browse/TIKA-1581 Project: Tika Issue Type: Bug Affects Versions: 1.7 Reporter: Karl Wright Fix For: 1.8 jhighlight jar is a Tika dependency. The Lucene team discovered that, while it claims to be a CDDL/LGPL dual-license, some of its functionality is LGPL only: {code} Solr's contrib/extraction contains jhighlight-1.0.jar which declares itself as dual CDDL or LGPL license. However, some of its classes are distributed only under LGPL, e.g. com.uwyn.jhighlight.highlighter. CppHighlighter.java GroovyHighlighter.java JavaHighlighter.java XmlHighlighter.java I downloaded the sources from Maven (http://search.maven.org/remotecontent?filepath=com/uwyn/jhighlight/1.0/jhighlight-1.0-sources.jar) to confirm that, and also found this SVN repo: http://svn.rifers.org/jhighlight/tags/release-1.0, though the project's website seems to not exist anymore (https://jhighlight.dev.java.net/). I didn't find any direct usage of it in our code, so I guess it's probably needed by a 3rd party dependency, such as Tika. Therefore if we e.g. omit it, things will compile, but may fail at runtime. {code} Is it possible to remove this dependency for future releases, or allow only optional inclusion of this package? It is of concern to the ManifoldCF project because we distribute a binary package that includes Tika and its required dependencies, which currently includes jHighlight. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1586) Enable CORS on Tika Server
Tyler Palsulich created TIKA-1586: - Summary: Enable CORS on Tika Server Key: TIKA-1586 URL: https://issues.apache.org/jira/browse/TIKA-1586 Project: Tika Issue Type: New Feature Components: server Reporter: Tyler Palsulich Assignee: Tyler Palsulich Tika Server should allow configuration of CORS requests (for uses like TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] from CXF for how to add it. The only change from that site is that we will need to add a {{CrossOriginResourceSharingFilter}} as a provider. Ideally, this is configurable (limit which resources have CORS, and which origins are allowed). But, I'm not thinking of any general methods of how to do that... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
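On the "which origins are allowed" part, the decision a CORS filter makes per request is small enough to sketch in plain Java. This is only an illustration of the standard allow-origin logic, not CXF's CrossOriginResourceSharingFilter and not the change that went into tika-server; CorsSketch and corsHeaders are made-up names, and Java 9+ is assumed for Set.of.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class CorsSketch {

    // Returns the response headers to add for a request carrying the given Origin header.
    // An empty map means the browser will block the cross-origin response.
    static Map<String, String> corsHeaders(String origin, Set<String> allowedOrigins) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (origin == null) {
            return headers; // not a cross-origin request
        }
        if (allowedOrigins.contains("*")) {
            headers.put("Access-Control-Allow-Origin", "*");
        } else if (allowedOrigins.contains(origin)) {
            headers.put("Access-Control-Allow-Origin", origin);
            headers.put("Vary", "Origin"); // the response now depends on the caller's origin
        }
        return headers;
    }

    public static void main(String[] args) {
        Set<String> allowed = Set.of("http://tpalsulich.github.io");
        System.out.println(corsHeaders("http://tpalsulich.github.io", allowed));
        System.out.println(corsHeaders("http://example.org", allowed));
    }
}
```

Limiting which resources get CORS would then come down to where such a check is registered, which is the part left open above.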
[jira] [Resolved] (TIKA-1586) Enable CORS on Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1586. --- Resolution: Fixed Fixed in r1669799. Enable CORS on Tika Server -- Key: TIKA-1586 URL: https://issues.apache.org/jira/browse/TIKA-1586 Project: Tika Issue Type: New Feature Components: server Reporter: Tyler Palsulich Assignee: Tyler Palsulich Tika Server should allow configuration of CORS requests (for uses like TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] from CXF for how to add it. The only change from that site is that we will need to add a {{CrossOriginResourceSharingFilter}} as a provider. Ideally, this is configurable (limit which resources have CORS, and which origins are allowed). But, I'm not thinking of any general methods of how to do that... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1585) Create Example Website with Form Submission
[ https://issues.apache.org/jira/browse/TIKA-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385411#comment-14385411 ] Tyler Palsulich commented on TIKA-1585: --- CORS work is now integrated. [~talli...@mitre.org], can you restart the server on 162.242.228.174:9998 with the --cors http://tpalsulich.github.io option? Then, we can close off the 9997 port (my github.io site is querying 9997, though, so I'll need to update that). Is there an official place we'd like to host the above site? Create Example Website with Form Submission --- Key: TIKA-1585 URL: https://issues.apache.org/jira/browse/TIKA-1585 Project: Tika Issue Type: New Feature Components: example, server Reporter: Tyler Palsulich Assignee: Tyler Palsulich It would be great to have a website where we can direct people who ask what Tika can do for [filetype] without needing them to actually download Tika. Some initial work to do that is [here|http://tpalsulich.github.io/TikaExamples/]. I'm far from a design guru, but I imagine the site as having a form where you can upload a file at the top, checkboxes for if you want metadata, content, or both, and a submit button. The request should be sent with AJAX and the result should populate a {{div}}. One issue with AJAX requests is that Tika Server doesn't currently allow Cross-Origin-Resource-Sharing (CORS). So, we either need to maintain a slightly updated tika-server, or update the server to allow configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1586) Enable CORS on Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385372#comment-14385372 ] Tyler Palsulich commented on TIKA-1586: --- Can someone take a look at the above PR and make sure I'm not doing anything bone-headed? Thanks! Enable CORS on Tika Server -- Key: TIKA-1586 URL: https://issues.apache.org/jira/browse/TIKA-1586 Project: Tika Issue Type: New Feature Components: server Reporter: Tyler Palsulich Assignee: Tyler Palsulich Tika Server should allow configuration of CORS requests (for uses like TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] from CXF for how to add it. The only change from that site is that we will need to add a {{CrossOriginResourceSharingFilter}} as a provider. Ideally, this is configurable (limit which resources have CORS, and which origins are allowed). But, I'm not thinking of any general methods of how to do that... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-1354) ForkParser doesn't work in OSGI container
[ https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1354. - Resolution: Fixed Fix Version/s: 1.7 Marking as Fixed. ForkParser doesn't work in OSGI container - Key: TIKA-1354 URL: https://issues.apache.org/jira/browse/TIKA-1354 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.6 Reporter: Michal Hlavac Fix For: 1.7 I can't find way to run ForkParser in OSGI container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1583) Convert Module Level READMEs to Markdown
Tyler Palsulich created TIKA-1583: - Summary: Convert Module Level READMEs to Markdown Key: TIKA-1583 URL: https://issues.apache.org/jira/browse/TIKA-1583 Project: Tika Issue Type: Improvement Reporter: Tyler Palsulich Assignee: Tyler Palsulich Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1583) Convert Module Level READMEs to Markdown
[ https://issues.apache.org/jira/browse/TIKA-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1583. --- Resolution: Done Done in r1669644 and r1669645. Convert Module Level READMEs to Markdown Key: TIKA-1583 URL: https://issues.apache.org/jira/browse/TIKA-1583 Project: Tika Issue Type: Improvement Reporter: Tyler Palsulich Assignee: Tyler Palsulich Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1273) old tika-server jar artifact contains no manifest so not able to invoke from shell
[ https://issues.apache.org/jira/browse/TIKA-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376796#comment-14376796 ] Tyler Palsulich commented on TIKA-1273: --- {{original-tika-server-1.8-SNAPSHOT.jar}}? I don't see any flags in the tika-server pom.xml. So, I'm not sure where the activation is. old tika-server jar artifact contains no manifest so not able to invoke from shell -- Key: TIKA-1273 URL: https://issues.apache.org/jira/browse/TIKA-1273 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.5 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.8 I've never ever used the old tika-server artifact which is generated when one installs the server module. It needs to contain a manifest otherwise it cannot be invoked from the shell. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1565) image/gif parse error
[ https://issues.apache.org/jira/browse/TIKA-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1565. --- Resolution: Fixed Fix Version/s: (was: 1.7) 1.8 Assignee: Tyler Palsulich Marking as Fixed for 1.8. The file is now parsed without an Exception. Please reopen if you are still running into this issue with Trunk or 1.8 (when it is released some time in the future). image/gif parse error - Key: TIKA-1565 URL: https://issues.apache.org/jira/browse/TIKA-1565 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.7 Environment: win7 x64 jdk1.7 Reporter: lixin Assignee: Tyler Palsulich Fix For: 1.8 Attachments: JNK16-1309-173.mht I am getting an exception parsing the following mht File {code} org.apache.tika.exception.TikaException: image/gif parse error at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:115) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:239) at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102) at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133) at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:239) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:239) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105) at org.apache.tika.example.MyTest.test1(MyTest.java:31) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) 
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) at org.junit.runners.ParentRunner.run(ParentRunner.java:300) at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:675) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192) Caused by: javax.imageio.IIOException: Unexpected block type 1! at com.sun.imageio.plugins.gif.GIFImageReader.readMetadata(Unknown Source) at com.sun.imageio.plugins.gif.GIFImageReader.getWidth(Unknown Source) at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:92) ... 
32 more {code} my test code: {code} AutoDetectParser parser = new AutoDetectParser(); BodyContentHandler handler = new BodyContentHandler(); Metadata metadata = new Metadata(); ParseContext context = new ParseContext(); parser.parse(new FileInputStream(new File(file)), handler, metadata,context); System.out.println(handler.toString()); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-1543) TesseractOCRParser.setTesseractPath() doesn't work on Linux
[ https://issues.apache.org/jira/browse/TIKA-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1543. - Resolution: Fixed This isn't actually a problem. I just tested locally -- it works. We have unit tests for the path, but it's difficult to test that extraction works with a non-standard path, since we don't know what the path is... I think the problem is either:
* The path you set is not the directory that contains the executable, or
* The path doesn't have a tessdata directory inside it.
You can see all of the Tesseract debugging messages by enabling {{debug}} level logging (put a [log4j.properties|https://github.com/apache/tika/blob/10298692cb27d1ad3732589930987e2fe2681ee8/tika-parsers/src/test/resources/log4j.properties] file on your classpath and set the output level to {{debug}}). I'd be happy to help you debug further. TesseractOCRParser.setTesseractPath() doesn't work on Linux --- Key: TIKA-1543 URL: https://issues.apache.org/jira/browse/TIKA-1543 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.7 Reporter: Sean Zhao Fix For: 1.8 Original Estimate: 168h Remaining Estimate: 168h After calling setTesseractPath() to set the Tesseract path to a non-default path, like /root/tesseract, and then calling TesseractOCRParser.parse(), nothing is returned. Not sure if this is related to TIKA-1421. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
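The two suspected causes above are easy to check before handing a directory to setTesseractPath(). A small sketch, assuming the executable is named tesseract (tesseract.exe on Windows) and that tessdata must be a direct child of the same directory, as described in the comment; TesseractPathCheck and problems are hypothetical names, not part of Tika.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TesseractPathCheck {

    // Lists what is missing from a candidate Tesseract directory; empty means it looks usable.
    static List<String> problems(Path dir) {
        List<String> found = new ArrayList<>();
        boolean windows = System.getProperty("os.name", "")
                .toLowerCase(Locale.ROOT).startsWith("windows");
        Path exe = dir.resolve(windows ? "tesseract.exe" : "tesseract");
        if (!Files.isRegularFile(exe) || !Files.isExecutable(exe)) {
            found.add("no executable at " + exe);
        }
        if (!Files.isDirectory(dir.resolve("tessdata"))) {
            found.add("no tessdata directory in " + dir);
        }
        return found;
    }

    public static void main(String[] args) {
        Path dir = Path.of(args.length > 0 ? args[0] : "/root/tesseract");
        List<String> found = problems(dir);
        System.out.println(found.isEmpty() ? dir + " looks usable" : String.join("; ", found));
    }
}
```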
[jira] [Closed] (TIKA-1460) Could not parse predefined CMAP file for 'Adobe-GBK1-UCS2'
[ https://issues.apache.org/jira/browse/TIKA-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1460. - Resolution: Cannot Reproduce Closing as Cannot Reproduce, since it's been a month since my last comment and we don't have the file which reproduces the issue. Please reopen if you're still running into this! Could not parse predefined CMAP file for 'Adobe-GBK1-UCS2' -- Key: TIKA-1460 URL: https://issues.apache.org/jira/browse/TIKA-1460 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.3 Environment: win7,myeclipse8.5 Reporter: onyas Priority: Critical
For some reason, I could not upload the file. Here is the info. I checked all the versions in the directory \org\apache\pdfbox\resources\cmap and have not found the 'Adobe-GBK1-UCS2' file.
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@d640af
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
Caused by: java.lang.IllegalArgumentException: Position 66048 past the end of the file
at org.apache.poi.poifs.nio.FileBackedDataSource.read(FileBackedDataSource.java:50)
at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt(NPOIFSFileSystem.java:420)
at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readBAT(NPOIFSFileSystem.java:397)
at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readCoreContents(NPOIFSFileSystem.java:356)
at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:202)
at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:184)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:156)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
... 21 more
The major code is:
Parser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler(getNum());
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
InputStream stream = null;
StringBuffer content = new StringBuffer();
try {
  stream = new FileInputStream(file);
  if (stream != null) {
    parser.parse(stream, handler, metadata, context);
    content = content.append(handler);
    if (StringUtils.isNotBlank(content.toString())) {
      hasContent = true;
      handler = null;
      metadata = null;
      context = null;
    }
  }
And the exception is thrown at this line: parser.parse(stream, handler, metadata, context); -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1565) image/gif parse error
[ https://issues.apache.org/jira/browse/TIKA-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-1565: -- Description: I am getting an exception parsing the following mht File {code} org.apache.tika.exception.TikaException: image/gif parse error at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:115) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:239) at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102) at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133) at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:239) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:239) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105) at org.apache.tika.example.MyTest.test1(MyTest.java:31) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) 
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) at org.junit.runners.ParentRunner.run(ParentRunner.java:300) at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:675) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192) Caused by: javax.imageio.IIOException: Unexpected block type 1! at com.sun.imageio.plugins.gif.GIFImageReader.readMetadata(Unknown Source) at com.sun.imageio.plugins.gif.GIFImageReader.getWidth(Unknown Source) at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:92) ... 
32 more {code} my test code: {code} AutoDetectParser parser = new AutoDetectParser(); BodyContentHandler handler = new BodyContentHandler(); Metadata metadata = new Metadata(); ParseContext context = new ParseContext(); parser.parse(new FileInputStream(new File(file)), handler, metadata,context); System.out.println(handler.toString()); {code} was: I am getting an exception parsing the following mht File org.apache.tika.exception.TikaException: image/gif parse error at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:115) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:239) at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102) at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133) at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:239) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:239) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105) at org.apache.tika.example.MyTest.test1(MyTest.java:31) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source)
[jira] [Updated] (TIKA-1543) TesseractOCRParser.setTesseractPath() doesn't work on Linux
[ https://issues.apache.org/jira/browse/TIKA-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-1543: -- Fix Version/s: (was: 1.7) 1.8 TesseractOCRParser.setTesseractPath() doesn't work on Linux --- Key: TIKA-1543 URL: https://issues.apache.org/jira/browse/TIKA-1543 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.7 Reporter: Sean Zhao Fix For: 1.8 Original Estimate: 168h Remaining Estimate: 168h After call setTesseractPath() to set the Tesseract path to a not-default path, like /root/tesseract , call the TesseractOCRParser.parse(), nothing will return. Not sure if this is related to TIKA-1421. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1543) TesseractOCRParser.setTesseractPath() doesn't work on Linux
[ https://issues.apache.org/jira/browse/TIKA-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375173#comment-14375173 ] Tyler Palsulich commented on TIKA-1543: --- (I just added the logging in r1668477 a few minutes ago. See [this commit|https://github.com/apache/tika/commit/84825f035069d572f155f86fa4c18d5a79b48028] on GitHub.) TesseractOCRParser.setTesseractPath() doesn't work on Linux --- Key: TIKA-1543 URL: https://issues.apache.org/jira/browse/TIKA-1543 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.7 Reporter: Sean Zhao Fix For: 1.8 Original Estimate: 168h Remaining Estimate: 168h After call setTesseractPath() to set the Tesseract path to a not-default path, like /root/tesseract , call the TesseractOCRParser.parse(), nothing will return. Not sure if this is related to TIKA-1421. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1273) old tika-server jar artifact contains no manifest so not able to invoke from shell
[ https://issues.apache.org/jira/browse/TIKA-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372909#comment-14372909 ] Tyler Palsulich commented on TIKA-1273: --- Where exactly is the old jar? The one I ran above? old tika-server jar artifact contains no manifest so not able to invoke from shell -- Key: TIKA-1273 URL: https://issues.apache.org/jira/browse/TIKA-1273 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.5 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.8 I've never ever used the old tika-server artifact which is generated when one installs the server module. It needs to contain a manifest otherwise it cannot be invoked from the shell. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1344) Ability to generate self-contained HTML with images
[ https://issues.apache.org/jira/browse/TIKA-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372082#comment-14372082 ] Tyler Palsulich commented on TIKA-1344: --- [~gagravarr], can we close this one off? Thank you, [~skibaa]! Ability to generate self-contained HTML with images --- Key: TIKA-1344 URL: https://issues.apache.org/jira/browse/TIKA-1344 Project: Tika Issue Type: Improvement Components: parser Reporter: Andrew Skiba Labels: easyfix, patch Attachments: word.patch Original Estimate: 1h Remaining Estimate: 1h In the current code, the images from Word documents are referenced by embedded:xxx links in the generated HTML. This causes browsers to display an x icon instead of the image. The proposed patch encodes the images as Data URIs if the -Dtika.parsers.urlimages system property is set. http://en.wikipedia.org/wiki/Data_URI_scheme So the default behavior is unchanged, but users of the library can optionally generate self-contained HTML with correct images. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
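For readers unfamiliar with the Data URI idea mentioned in TIKA-1344, here is a minimal sketch of how image bytes could be turned into a self-contained src value. This is not the attached word.patch; the class and method names (DataUriSketch, toDataUri) are hypothetical, and it assumes Java 8 or later for java.util.Base64.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper for illustration; not part of Tika or of word.patch. */
final class DataUriSketch {

    /** Builds a "data:" URI that embeds the given bytes as Base64 text. */
    static String toDataUri(String mimeType, byte[] content) {
        String encoded = Base64.getEncoder().encodeToString(content);
        return "data:" + mimeType + ";base64," + encoded;
    }

    public static void main(String[] args) {
        // Placeholder bytes standing in for a real embedded image.
        byte[] fakeImage = "GIF89a".getBytes(StandardCharsets.US_ASCII);
        String uri = toDataUri("image/gif", fakeImage);
        // The URI replaces an embedded:xxx reference in the generated HTML.
        System.out.println("<img src=\"" + uri + "\"/>");
    }
}
```

The trade-off is size: Base64 inflates the image data by roughly a third, which is presumably one reason the issue proposes making this opt-in.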
[jira] [Commented] (TIKA-1354) ForkParser doesn't work in OSGI container
[ https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372107#comment-14372107 ] Tyler Palsulich commented on TIKA-1354: --- [~chrismattmann] and [~hlavki], are there any other updates needed for this issue? The build failure just got pruned from Jenkins. ForkParser doesn't work in OSGI container - Key: TIKA-1354 URL: https://issues.apache.org/jira/browse/TIKA-1354 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.6 Reporter: Michal Hlavac I can't find a way to run ForkParser in an OSGI container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1356) degraded performance OOXMLParser with WriteOutContentHandler
[ https://issues.apache.org/jira/browse/TIKA-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-1356: -- Description: If use OOXMLParser with WriteOutContentHandler as destination of result, we can recieve degraded performance. Reason of this problem is ignoring SAXException in org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.SheetTextAsHTML.endRow() and others methods of this class. As example: source doc have many empty rows in end of the table(about 100). When WriteOutContentHandler is full WriteLimitReachedException raised lot times. Below is stacktrace of long proccess {code} org.apache.tika.sax.ContentHandlerDecorator.ignorableWhitespace(ContentHandlerDecorator.java:157) org.apache.tika.sax.SafeContentHandler.access$101(SafeContentHandler.java:46) org.apache.tika.sax.SafeContentHandler$2.write(SafeContentHandler.java:94) org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140) org.apache.tika.sax.SafeContentHandler.ignorableWhitespace(SafeContentHandler.java:293) org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:242) org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:275) org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$SheetTextAsHTML.cell(XSSFExcelExtractorDecorator.java:203) org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:295) org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:287) org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source) org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source) org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source) org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source) 
org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) org.apache.xerces.parsers.XMLParser.parse(Unknown Source) org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source) org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source) org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processSheet(XSSFExcelExtractorDecorator.java:164) org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:120) org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:105) org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112) org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82) org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91) org.elasticsearch.index.mapper.attachment.AttachmentMapper$RecursiveMetadataParser.parse(AttachmentMapper.java:104) org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:169) org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:135) org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91) {code} was: If use OOXMLParser with WriteOutContentHandler as destination of result, we can recieve degraded performance. 
Reason of this problem is ignoring SAXException in org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.SheetTextAsHTML.endRow() and others methods of this class. As example: source doc have many empty rows in end of the table(about 100). When WriteOutContentHandler is full WriteLimitReachedException raised lot times. Below is stacktrace of long proccess org.apache.tika.sax.ContentHandlerDecorator.ignorableWhitespace(ContentHandlerDecorator.java:157) org.apache.tika.sax.SafeContentHandler.access$101(SafeContentHandler.java:46) org.apache.tika.sax.SafeContentHandler$2.write(SafeContentHandler.java:94)
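To make the cost described in TIKA-1356 concrete, here is a toy model of the two behaviours: swallowing the limit exception on every row versus letting the first one end the loop. Everything in it (WriteLimitSketch, LimitedSink, LimitReached) is a hypothetical stand-in written for this note, not Tika's WriteOutContentHandler or the actual XSSFExcelExtractorDecorator code.

```java
/** Toy model of the TIKA-1356 report; none of these types are Tika classes. */
final class WriteLimitSketch {

    /** Stand-in for a "write limit reached" signal. */
    static final class LimitReached extends Exception {
        LimitReached() {
            super("write limit reached");
        }
    }

    /** Sink that accepts at most {@code limit} characters and counts rejections. */
    static final class LimitedSink {
        private final int limit;
        private int written;
        int rejections;

        LimitedSink(int limit) {
            this.limit = limit;
        }

        void write(String text) throws LimitReached {
            if (written + text.length() > limit) {
                rejections++;
                throw new LimitReached();
            }
            written += text.length();
        }
    }

    /** Ignores the exception per row, so every later row throws again. */
    static int emitRowsSwallowing(LimitedSink sink, int rows) {
        for (int i = 0; i < rows; i++) {
            try {
                sink.write("row\n");
            } catch (LimitReached ignored) {
                // Swallowed: the loop keeps hammering a sink that is already full.
            }
        }
        return sink.rejections;
    }

    /** Lets the first exception end the loop, so it is raised only once. */
    static int emitRowsPropagating(LimitedSink sink, int rows) {
        try {
            for (int i = 0; i < rows; i++) {
                sink.write("row\n");
            }
        } catch (LimitReached stop) {
            // The limit was hit; stop producing output.
        }
        return sink.rejections;
    }

    public static void main(String[] args) {
        System.out.println("swallowing:  " + emitRowsSwallowing(new LimitedSink(8), 100));
        System.out.println("propagating: " + emitRowsPropagating(new LimitedSink(8), 100));
    }
}
```

Since creating an exception and filling in its stack trace is comparatively expensive, repeating that for every remaining row is a plausible source of the slowdown the reporter describes.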
[jira] [Updated] (TIKA-1358) Add support for newer iWork file formats
[ https://issues.apache.org/jira/browse/TIKA-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-1358: -- Labels: new-parser newbie (was: newbie) Add support for newer iWork file formats Key: TIKA-1358 URL: https://issues.apache.org/jira/browse/TIKA-1358 Project: Tika Issue Type: Wish Components: parser Affects Versions: 1.5 Reporter: Jelle Kastelein Labels: new-parser, newbie Attachments: iwork13-testdocs-zips.zip, iwork13-testfiles-2014-11.zip IWork 2013 uses a revised file format which replaces the xml files that hold the content by .iwa files (a binary format). This file format is becoming increasingly relevant as more and more people are using apple products. However, it does not appear to work with the current IWorkPackageParser (tested with several of the example .pages files one can get from the iCloud). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1367) Tika documentation should list tika-parsers parser dependencies
[ https://issues.apache.org/jira/browse/TIKA-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372123#comment-14372123 ] Tyler Palsulich commented on TIKA-1367: --- This is still worth doing, but it needs to be better than the dependency tree idea I gave above. Still not sure about a good solution. Should this be a page on the website? Tika documentation should list tika-parsers parser dependencies --- Key: TIKA-1367 URL: https://issues.apache.org/jira/browse/TIKA-1367 Project: Tika Issue Type: Improvement Components: documentation Reporter: Sergey Beryozkin Fix For: 1.8 The tika-parsers module has many strong transitive parser dependencies. Maven users of tika-parsers have to exclude all the transitive dependencies manually. Documenting the list of the existing transitive dependencies and keeping the list up to date will help developers exclude the libraries not needed for a given project. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1379) error in Tika().detect for xml files with xades signature
[ https://issues.apache.org/jira/browse/TIKA-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-1379: -- Description: we tried to get the mime type of an xml file with xades signature embedded. the result is text/html and not the expected text/xml or application/xml. here is an example of the xml file: {code} VERBALI ad_cod=D69017 batch_id=0 cds_cod=D69 data_app=2013-09-23 VERBALE Id=1 tipologia=Verbale esame VERB_NUM00094853 0003 2/VERB_NUM DATA_APP2013-09-23/DATA_APP DATA_ESA2013-09-23/DATA_ESA AD_CODD69017/AD_COD ADFILOSOFIA DELLA SCIENZA/AD CDS_CODD69/CDS_COD CDSTEATRO E ARTI VISIVE/CDS TIPO_ESA/TIPO_ESA MAT1233456/MAT NOMEPAOLINO/NOME COGNOMEPAPERINO/COGNOME VOTO23.0/VOTO VOTODECOD23/VOTODECOD CAUSALE/CAUSALE TIPO_MODULO/TIPO_MODULO IMG_PATH/IMG_PATH AA_SES_ID2012/AA_SES_ID AD_CFU6.0/AD_CFU NOTA/NOTA ATENEO9/ATENEO ATENEO_DESجامعة البندقية - TEST/ATENEO_DES TIPO_DOCUMENTOVerbale_3/TIPO_DOCUMENTO TITOLARE_PROCEDIMENTOQUI QUO QUA/TITOLARE_PROCEDIMENTO AD_STU_CODD69017/AD_STU_COD AD_STUFILOSOFIA DELLA SCIENZA/AD_STU CDS_STU_CODD69/CDS_STU_COD CDS_STUTEATRO E ARTI VISIVE/CDS_STU DOCENTEQUI QUO QUA/DOCENTE DATA_DOCUMENTO26-09-2013 09:55:53 CEST(+0200)/DATA_DOCUMENTO SOFTWARE_DI_CREAZIONE NOME3/NOME VERSIONE11.09.03/VERSIONE /SOFTWARE_DI_CREAZIONE /VERBALEds:Signature xmlns:ds=http://www.w3.org/2000/09/xmldsig#; Id=sig08744308748201048377 ds:SignedInfo ds:CanonicalizationMethod Algorithm=http://www.w3.org/2006/12/xml-c14n11;/ds:CanonicalizationMethod ds:SignatureMethod Algorithm=http://www.w3.org/2001/04/xmldsig-more#rsa-sha256;/ds:SignatureMethod ds:Reference URI= ds:Transforms ds:Transform Algorithm=http://www.w3.org/2002/06/xmldsig-filter2; dsig-xpath:XPath xmlns:dsig-xpath=http://www.w3.org/2002/06/xmldsig-filter2; Filter=subtract/descendant::ds:Signature/dsig-xpath:XPath /ds:Transform ds:Transform Algorithm=http://www.w3.org/TR/1999/REC-xslt-19991116; xsl:stylesheet 
xmlns:kion=http://www.kion.it/webesse3/multilingua; xmlns:xsl=http://www.w3.org/1999/XSL/Transform; exclude-result-prefixes=kion version=1.0 kion:ml module=FirmaDigitale target=kion/kion:ml xsl:output method=xml/xsl:output xsl:variable name=mostra_ad_figlie select=1/xsl:variable xsl:variable name=verbale_root select=/VERBALI/VERBALE/xsl:variable xsl:variable name=sostituzione_root select=/VERBALI/VERBALE/SOSTITUZIONE_DOCUMENTO/xsl:variable xsl:variable name=RAGG_ROOT select=/VERBALI/VERBALE/RAGGRUPPAMENTO/xsl:variable xsl:variable name=COMM_ROOT select=/VERBALI/VERBALE/COMMISSIONE/xsl:variable xsl:template match=/ html head meta content=text/html;charset=UTF-8 http-equiv=Content-Type/meta xsl:choose xsl:when test=$sostituzione_root titleDichiarazione conformità Verbale Esame/title /xsl:when xsl:otherwise titleVerbalizzazione esame/title /xsl:otherwise /xsl:choose style type=text/css td {font-family: Arial; font-size:10pt;} div {font-family: Arial; font-size:10pt;} pre {font-family: Arial; font-size:10pt;} /style /head body table xsl:choose xsl:when test=$sostituzione_root trtd align=center colspan=2bigstrongxsl:value-of select=$verbale_root/ATENEO_DES/xsl:value-of/strong/bigbr/br/td/tr trtd align=center colspan=2bigstrongDICHIARAZIONE DI CONFORMITÀ/strong/bigbr/br/td/tr trtd align=left colspan=2strongIl sottoscritto xsl:value-of select=$verbale_root/TITOLARE_PROCEDIMENTO/xsl:value-of, docente di xsl:value-of select=$verbale_root/AD/xsl:value-of/strongbr/br /td /tr tr
[jira] [Commented] (TIKA-1379) error in Tika().detect for xml files with xades signature
[ https://issues.apache.org/jira/browse/TIKA-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372133#comment-14372133 ] Tyler Palsulich commented on TIKA-1379: --- The file is still detected as text/html. Should we update the magic to detect it as xml? error in Tika().detect for xml files with xades signature - Key: TIKA-1379 URL: https://issues.apache.org/jira/browse/TIKA-1379 Project: Tika Issue Type: Bug Components: detector Affects Versions: 1.4 Reporter: Alessandro De Angelis Labels: new-parser Fix For: 1.8 we tried to get the mime type of an xml file with xades signature embedded. the result is text/html and not the expected text/xml or application/xml. here is an example of the xml file: {code} VERBALI ad_cod=D69017 batch_id=0 cds_cod=D69 data_app=2013-09-23 VERBALE Id=1 tipologia=Verbale esame VERB_NUM00094853 0003 2/VERB_NUM DATA_APP2013-09-23/DATA_APP DATA_ESA2013-09-23/DATA_ESA AD_CODD69017/AD_COD ADFILOSOFIA DELLA SCIENZA/AD CDS_CODD69/CDS_COD CDSTEATRO E ARTI VISIVE/CDS TIPO_ESA/TIPO_ESA MAT1233456/MAT NOMEPAOLINO/NOME COGNOMEPAPERINO/COGNOME VOTO23.0/VOTO VOTODECOD23/VOTODECOD CAUSALE/CAUSALE TIPO_MODULO/TIPO_MODULO IMG_PATH/IMG_PATH AA_SES_ID2012/AA_SES_ID AD_CFU6.0/AD_CFU NOTA/NOTA ATENEO9/ATENEO ATENEO_DESجامعة البندقية - TEST/ATENEO_DES TIPO_DOCUMENTOVerbale_3/TIPO_DOCUMENTO TITOLARE_PROCEDIMENTOQUI QUO QUA/TITOLARE_PROCEDIMENTO AD_STU_CODD69017/AD_STU_COD AD_STUFILOSOFIA DELLA SCIENZA/AD_STU CDS_STU_CODD69/CDS_STU_COD CDS_STUTEATRO E ARTI VISIVE/CDS_STU DOCENTEQUI QUO QUA/DOCENTE DATA_DOCUMENTO26-09-2013 09:55:53 CEST(+0200)/DATA_DOCUMENTO SOFTWARE_DI_CREAZIONE NOME3/NOME VERSIONE11.09.03/VERSIONE /SOFTWARE_DI_CREAZIONE /VERBALEds:Signature xmlns:ds=http://www.w3.org/2000/09/xmldsig#; Id=sig08744308748201048377 ds:SignedInfo ds:CanonicalizationMethod Algorithm=http://www.w3.org/2006/12/xml-c14n11;/ds:CanonicalizationMethod ds:SignatureMethod 
Algorithm=http://www.w3.org/2001/04/xmldsig-more#rsa-sha256;/ds:SignatureMethod ds:Reference URI= ds:Transforms ds:Transform Algorithm=http://www.w3.org/2002/06/xmldsig-filter2; dsig-xpath:XPath xmlns:dsig-xpath=http://www.w3.org/2002/06/xmldsig-filter2; Filter=subtract/descendant::ds:Signature/dsig-xpath:XPath /ds:Transform ds:Transform Algorithm=http://www.w3.org/TR/1999/REC-xslt-19991116; xsl:stylesheet xmlns:kion=http://www.kion.it/webesse3/multilingua; xmlns:xsl=http://www.w3.org/1999/XSL/Transform; exclude-result-prefixes=kion version=1.0 kion:ml module=FirmaDigitale target=kion/kion:ml xsl:output method=xml/xsl:output xsl:variable name=mostra_ad_figlie select=1/xsl:variable xsl:variable name=verbale_root select=/VERBALI/VERBALE/xsl:variable xsl:variable name=sostituzione_root select=/VERBALI/VERBALE/SOSTITUZIONE_DOCUMENTO/xsl:variable xsl:variable name=RAGG_ROOT select=/VERBALI/VERBALE/RAGGRUPPAMENTO/xsl:variable xsl:variable name=COMM_ROOT select=/VERBALI/VERBALE/COMMISSIONE/xsl:variable xsl:template match=/ html head meta content=text/html;charset=UTF-8 http-equiv=Content-Type/meta xsl:choose xsl:when test=$sostituzione_root titleDichiarazione conformità Verbale Esame/title /xsl:when xsl:otherwise titleVerbalizzazione esame/title /xsl:otherwise /xsl:choose style type=text/css td {font-family: Arial; font-size:10pt;} div {font-family: Arial; font-size:10pt;} pre {font-family: Arial; font-size:10pt;} /style /head body table xsl:choose xsl:when test=$sostituzione_root trtd align=center colspan=2bigstrongxsl:value-of select=$verbale_root/ATENEO_DES/xsl:value-of/strong/bigbr/br/td/tr
[jira] [Commented] (TIKA-1266) Tika OSGI Bundle needs Bundle-ClassPath to work in Equinox
[ https://issues.apache.org/jira/browse/TIKA-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371882#comment-14371882 ] Tyler Palsulich commented on TIKA-1266: --- After a quick Google, I don't think this is actually a problem? I really don't know, though. Tika OSGI Bundle needs Bundle-ClassPath to work in Equinox -- Key: TIKA-1266 URL: https://issues.apache.org/jira/browse/TIKA-1266 Project: Tika Issue Type: Improvement Components: packaging Affects Versions: 1.4, 1.5 Reporter: pm The tika-bundle currently has the Embed-Dependency header filled with embedded dependencies. Embed-Dependency is not defined in OSGI spec, Bundle-ClassPath is . Please add Bundle-ClassPath with list of embedded JAR names prefixed with ., . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1276) Missing embedded dependencies in tika-bundle
[ https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371890#comment-14371890 ] Tyler Palsulich commented on TIKA-1276: --- Is there anything else keeping this issue open? From the above, I don't think so. Please correct me if I'm wrong. Missing embedded dependencies in tika-bundle Key: TIKA-1276 URL: https://issues.apache.org/jira/browse/TIKA-1276 Project: Tika Issue Type: Bug Components: packaging Affects Versions: 1.5 Environment: OSGI, Apache Felix via Apache Sling Launcher Reporter: Rupert Westenthaler Fix For: 1.8 Attachments: TIKA-1276_20140423_rwesten.diff, TIKA-1276_20140428_2_rwesten.diff, TIKA-1276_20140428_3_rwesten.diff, TIKA-1276_20140428_rwesten.diff While updating from tika 1.2 to 1.5 I that the `org.apache.tika:tika-bundle:1.5` module has some missing dependences. 1. `com.uwyn:jhighlight:1.0` is not embedded Because of that installing the bundle results in the following exception {code} org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement [103.0] osgi.wiring.package; (osgi.wiring.package=com.uwyn.jhighlight.renderer)) org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement [103.0] osgi.wiring.package; (osgi.wiring.package=com.uwyn.jhighlight.renderer) at org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962) at org.apache.felix.framework.Felix.startBundle(Felix.java:2025) at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279) at org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304) at java.lang.Thread.run(Thread.java:744) {code} 2. 
`org.ow2.asm:asm:4.1` is not embedded because `org.apache.tika:tika-core:1.5` uses `org.ow2.asm-debug-all:asm:4.1` and therefore the `Embed-Dependency` directive `asm` does not match any dependency. Because of that one do get the following exception (after fixing (1)) {code} org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement [96.0] osgi.wiring.package; ((osgi.wiring.package=org.objectweb.asm)(version=4.1.0)(!(version=5.0.0 org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement [96.0] osgi.wiring.package; ((osgi.wiring.package=org.objectweb.asm)(version=4.1.0)(!(version=5.0.0))) at org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962) at org.apache.felix.framework.Felix.startBundle(Felix.java:2025) at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279) at org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304) at java.lang.Thread.run(Thread.java:744) {code} There are two possibilities to fix this (a) change the `Embed-Dependency` to `asm-debug-all` or adding a dependency to `org.ow2.asm:asm:4.1` to the tika-bundle pom file. 3. 
`edu.ucar:netcdf:4.2-min` is not embedded Because of that one does get the following exception (after fixing (1) and (2)) {code} org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2)) org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2) at org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962) at org.apache.felix.framework.Felix.startBundle(Felix.java:2025) at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279) at org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304) at java.lang.Thread.run(Thread.java:744) {code} 4. The `com.adobe.xmp:xmpcore:5.1.2` dependency is required at runtime After fixing the above issues the tika-bundle was started successfully. However when extracting EXIG metadata from a jpeg image I got the following exception. {code} java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException at com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112) at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71) at
[jira] [Commented] (TIKA-1579) Add file type to NetCDFParser
[ https://issues.apache.org/jira/browse/TIKA-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371894#comment-14371894 ] Tyler Palsulich commented on TIKA-1579: --- +1, ship it! You don't need a review board for small changes. :) Add file type to NetCDFParser - Key: TIKA-1579 URL: https://issues.apache.org/jira/browse/TIKA-1579 Project: Tika Issue Type: Improvement Components: parser Reporter: Ann Burgess Assignee: Ann Burgess Attachments: TIKA-1579.abburgess.190315.patch.txt [~gostep] explains that there are three versions of NetCDF (classic format, 64-bit offset, and netCDF-4/HDF5 format). When opening an existing netCDF file, the netCDF library will transparently detect its format, so we do not need to adjust according to the detected format. That said, it would be good to know the file type, as each can have the .nc extension. This will add a patch that adds the file type to the metadata. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1578) Add file type description to HDFParsers
[ https://issues.apache.org/jira/browse/TIKA-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371895#comment-14371895 ] Tyler Palsulich commented on TIKA-1578: --- +1! Add file type description to HDFParsers --- Key: TIKA-1578 URL: https://issues.apache.org/jira/browse/TIKA-1578 Project: Tika Issue Type: Improvement Components: parser Reporter: Ann Burgess Assignee: Ann Burgess Attachments: TIKA-1578.abburgess.150319.patch.txt [~gostep] explains that there are three versions of NetCDF (classic format, 64-bit offset, and netCDF-4/HDF5 format). When opening an existing netCDF file, the netCDF library will transparently detect its format, so we do not need to adjust according to the detected format. That said, it would be good to know the file type, as each can have the .nc extension. This will add a patch that adds the file type to the metadata. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1154) Tika hangs on format detection of malformed HTML file.
[ https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1154. --- Resolution: Fixed Marking as Fixed, since the file is detected and parsed without issue. Not sure what was happening before! Thanks! Tika hangs on format detection of malformed HTML file. -- Key: TIKA-1154 URL: https://issues.apache.org/jira/browse/TIKA-1154 Project: Tika Issue Type: Bug Components: mime Affects Versions: 1.4 Reporter: Andrew Jackson Priority: Minor Attachments: tika-breaker.html We are using Tika on large web archives, which also happen to contain some malformed files. In particular, we found a HTML file with binary characters in the DOCTYPE declaration. This hangs Tika, either embedded or from the command line, during format detection. An example file is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1296) Add case insensitive matching for text/html mime type
[ https://issues.apache.org/jira/browse/TIKA-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372038#comment-14372038 ] Tyler Palsulich commented on TIKA-1296: --- The only mimetype definition that uses {{stringignorecase}} is rfc822. Are there any (other than HTML) that could benefit from this? Add case insensitive matching for text/html mime type - Key: TIKA-1296 URL: https://issues.apache.org/jira/browse/TIKA-1296 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.5 Reporter: Phil Lester Assignee: Ken Krugler Currently, tika-mimetypes.xml provides matches in a couple of different letter cases for the text/html mime type (and possibly others) so that varying HTML writing styles are matched. As of version 1.5 of Tika the ability exists to make these case insensitive using the stringignorecase type. This would allow consolidation of some matches and improve detection of poorly-formed HTML that would be rendered by most browsers regardless of case. For example: <match value="&lt;BODY" type="string" offset="0"/> <match value="&lt;body" type="string" offset="0"/> could become: <match value="&lt;BODY" type="stringignorecase" offset="0"/> -- This message was sent by Atlassian JIRA (v6.3.4#6332)
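As an illustration of what a case-insensitive magic match means in TIKA-1296, the sketch below compares a byte prefix at a fixed offset while ignoring ASCII case. It is only a model of the idea; it makes no claim about how Tika implements the stringignorecase type, and the names IgnoreCaseMagicSketch and matchesIgnoreCase are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only; not Tika's magic-matching code. */
final class IgnoreCaseMagicSketch {

    /** Returns true if data holds the ASCII magic at offset, ignoring letter case. */
    static boolean matchesIgnoreCase(byte[] data, int offset, String magic) {
        byte[] expected = magic.getBytes(StandardCharsets.US_ASCII);
        if (offset < 0 || data.length - offset < expected.length) {
            return false; // Not enough bytes left to hold the magic.
        }
        for (int i = 0; i < expected.length; i++) {
            if (toLowerAscii(data[offset + i]) != toLowerAscii(expected[i])) {
                return false;
            }
        }
        return true;
    }

    /** Lower-cases A-Z only; every other byte value is returned unchanged. */
    private static int toLowerAscii(byte b) {
        return (b >= 'A' && b <= 'Z') ? b + ('a' - 'A') : b;
    }

    public static void main(String[] args) {
        byte[] doc = "<BoDy>hello</BoDy>".getBytes(StandardCharsets.US_ASCII);
        // One pattern covers <BODY, <body and mixed-case variants.
        System.out.println(matchesIgnoreCase(doc, 0, "<body"));
    }
}
```

With a comparison like this, the two separate string patterns quoted in the issue collapse into one.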
[jira] [Updated] (TIKA-1307) Jenkins Java7 job requires a profile in order to build 'tika-java7' module.
[ https://issues.apache.org/jira/browse/TIKA-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-1307: -- Labels: build (was: ) Jenkins Java7 job requires a profile in order to build 'tika-java7' module. --- Key: TIKA-1307 URL: https://issues.apache.org/jira/browse/TIKA-1307 Project: Tika Issue Type: Bug Components: packaging Affects Versions: 1.5 Reporter: Lewis John McGibbney Labels: build Fix For: 1.8 N.B. Can someone please create a *build* tag in Admin area? Then assign it to this issue? This issue was flagged up by Hong-Thai during the DISCUSS nightly builds thread recently http://www.mail-archive.com/dev%40tika.apache.org/msg07963.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1307) Jenkins Java7 job requires a profile in order to build 'tika-java7' module.
[ https://issues.apache.org/jira/browse/TIKA-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1307. --- Resolution: Done Marking this as Done, since the Java7 component is now tested. See https://builds.apache.org/view/Tika/job/tika-trunk-jdk1.7/lastStableBuild/org.apache.tika$tika-java7/. Jenkins Java7 job requires a profile in order to build 'tika-java7' module. --- Key: TIKA-1307 URL: https://issues.apache.org/jira/browse/TIKA-1307 Project: Tika Issue Type: Bug Components: packaging Affects Versions: 1.5 Reporter: Lewis John McGibbney Labels: build Fix For: 1.8 N.B. Can someone please create a *build* tag in Admin area? Then assign it to this issue? This issue was flagged up by Hong-Thai during the DISCUSS nightly builds thread recently http://www.mail-archive.com/dev%40tika.apache.org/msg07963.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1308) Support in memory parse mode(don't create temp file): to support run Tika in GAE
[ https://issues.apache.org/jira/browse/TIKA-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372060#comment-14372060 ] Tyler Palsulich commented on TIKA-1308: --- This would be a great feature! The default would need to be disabled, though, since some files are larger than memory. And, as mentioned above, some parsers require writing the output to a file in order to use the external parsing library. Support in memory parse mode(don't create temp file): to support run Tika in GAE Key: TIKA-1308 URL: https://issues.apache.org/jira/browse/TIKA-1308 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: jefferyyuan Labels: gae Fix For: 1.8 I am trying to use Tika in GAE and write a simple servlet to extract meta data info from jpeg: {code} String urlStr = req.getParameter(imageUrl); byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr)); ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData); Metadata metadata = new Metadata(); BodyContentHandler ch = new BodyContentHandler(); AutoDetectParser parser = new AutoDetectParser(); parser.parse(bais, ch, metadata, new ParseContext()); bais.close(); {code} This fails with exception: {code} Caused by: java.lang.SecurityException: Unable to create temporary file at java.io.File.createTempFile(File.java:1986) at org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66) at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533) at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242 {code} Checked the code, in org.apache.tika.parser.jpeg.JpegParser.parse(InputStream, ContentHandler, Metadata, ParseContext), it creates a temp file from the input stream. I can understand why tika create temp file from the stream: so tika can parse it multiple times. 
But as GAE and other cloud servers are getting more popular, is it possible to avoid create temp file: instead we can copy the origin stream to a byteArray stream, so tika can also parse it multiple times. -- This will have a limit on the file size, as tika keeps the whole file in memory, but this can make tika work in GAE and maybe other cloud server. We can add a parameter in parser.parse to indicate whether do in memory parse only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
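The in-memory alternative proposed in TIKA-1308 can be sketched with plain java.io: buffer the source stream once, up to a size cap, then hand out a fresh ByteArrayInputStream for each pass. This is only a sketch of the idea; it does not use or describe Tika's TikaInputStream, and the names InMemoryBufferSketch, readFully and the maxBytes parameter are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Illustrative only; not Tika code. */
final class InMemoryBufferSketch {

    /** Reads the whole stream into memory, failing once maxBytes would be exceeded. */
    static byte[] readFully(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            if (out.size() + n > maxBytes) {
                throw new IOException("Input exceeds in-memory limit of " + maxBytes + " bytes");
            }
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        InputStream source = new ByteArrayInputStream(new byte[] {1, 2, 3, 4});
        byte[] buffered = readFully(source, 1024);

        // Each pass gets its own stream over the same bytes; no temp file is needed.
        InputStream firstPass = new ByteArrayInputStream(buffered);
        InputStream secondPass = new ByteArrayInputStream(buffered);
        System.out.println(firstPass.read() + " " + secondPass.read());
    }
}
```

The size cap reflects the concern raised in the comment above: some files are larger than memory, so an in-memory mode would need to be opt-in and bounded.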
[jira] [Updated] (TIKA-1308) Support in memory parse mode(don't create temp file): to support run Tika in GAE
[ https://issues.apache.org/jira/browse/TIKA-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-1308: -- Description: I am trying to use Tika in GAE and write a simple servlet to extract meta data info from jpeg: {code} String urlStr = req.getParameter(imageUrl); byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr)); ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData); Metadata metadata = new Metadata(); BodyContentHandler ch = new BodyContentHandler(); AutoDetectParser parser = new AutoDetectParser(); parser.parse(bais, ch, metadata, new ParseContext()); bais.close(); {code} This fails with exception: {code} Caused by: java.lang.SecurityException: Unable to create temporary file at java.io.File.createTempFile(File.java:1986) at org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66) at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533) at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242 {code} Checked the code, in org.apache.tika.parser.jpeg.JpegParser.parse(InputStream, ContentHandler, Metadata, ParseContext), it creates a temp file from the input stream. I can understand why tika create temp file from the stream: so tika can parse it multiple times. But as GAE and other cloud servers are getting more popular, is it possible to avoid create temp file: instead we can copy the origin stream to a byteArray stream, so tika can also parse it multiple times. -- This will have a limit on the file size, as tika keeps the whole file in memory, but this can make tika work in GAE and maybe other cloud server. We can add a parameter in parser.parse to indicate whether do in memory parse only. 
Support in memory parse mode(don't create temp file): to support run Tika in GAE Key: TIKA-1308 URL: https://issues.apache.org/jira/browse/TIKA-1308 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: jefferyyuan Labels: gae Fix For: 1.8 I am trying to use Tika in GAE and write a simple servlet to extract meta data info from jpeg: {code} String urlStr = req.getParameter(imageUrl); byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr)); ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData); Metadata metadata = new Metadata(); BodyContentHandler ch = new BodyContentHandler(); AutoDetectParser parser = new AutoDetectParser(); parser.parse(bais, ch, metadata, new ParseContext()); bais.close(); {code} This fails with exception: {code} Caused by: java.lang.SecurityException: Unable to create temporary file at java.io.File.createTempFile(File.java:1986) at org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66) at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533) at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
[jira] [Commented] (TIKA-1314) An inappropriate comment of CharsetDetector.detect()
[ https://issues.apache.org/jira/browse/TIKA-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372066#comment-14372066 ] Tyler Palsulich commented on TIKA-1314: --- This is still an issue in Tika 1.8-SNAPSHOT. See [here|https://github.com/apache/tika/blob/4096059da7f6d50e3d6e018681b8c02a96d3933a/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java#L141-L172]. Any input on whether we should update the comment or throw an Exception? An inappropriate comment of CharsetDetector.detect() Key: TIKA-1314 URL: https://issues.apache.org/jira/browse/TIKA-1314 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Reporter: Yi EungJun Priority: Minor According to the javadoc of CharsetDetector.detect(), it raises an exception if no charset appears to match the data: * Raise an exception if * <ul> * <li>no charsets appear to match the input data.</li> * <li>no input text has been provided</li> * </ul> But it seems to me that in such cases the method returns null but does not raise any exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1114) sgml mime type is not detected when passed in as byte stream
[ https://issues.apache.org/jira/browse/TIKA-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371945#comment-14371945 ] Tyler Palsulich commented on TIKA-1114: --- See http://en.wikipedia.org/wiki/Standard_Generalized_Markup_Language. It seems like there isn't really a dedicated way to know whether a file is SGML or not... sgml mime type is not detected when passed in as byte stream Key: TIKA-1114 URL: https://issues.apache.org/jira/browse/TIKA-1114 Project: Tika Issue Type: Bug Components: mime Reporter: Vikas Garg When passing sgml files as TikaInputStream (created from byte[]) to Detector.detect(), it returns text/plain as the media type and not application/sgml or text/sgml. But when I provide the file name in the metadata, it gives me the correct mime-type, i.e., text/sgml. Is it because Tika is missing a designated parser for sgml files, or am I missing something? I am on Tika-1.3. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1267) Improve Mbox file detection
[ https://issues.apache.org/jira/browse/TIKA-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371957#comment-14371957 ] Tyler Palsulich commented on TIKA-1267: --- The current definition is: {code} <mime-type type="application/mbox"> <sub-class-of type="text/plain"/> <glob pattern="*.mbox"/> </mime-type> {code} I think it would be too general to identify all text files that start with {{From }} as {{application/mbox}}. Improve Mbox file detection --- Key: TIKA-1267 URL: https://issues.apache.org/jira/browse/TIKA-1267 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.5 Reporter: Luis Filipe Nassif Priority: Minor Could we add the code below to the application/mbox mime-type definition: {code} <magic priority="70"> <match value="From " type="string" offset="0"/> </magic> {code} Or is it too common out there? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
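To illustrate the concern raised in TIKA-1267, here is a small sketch of what a string match at offset 0 amounts to. {{PrefixMagic}} is a hypothetical stand-in written against the JDK only, not Tika's detector, and both sample inputs are made up for the example. The point is that an ordinary sentence beginning with "From " satisfies the same test as a real mbox separator line.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: PrefixMagic is a hypothetical stand-in, not Tika's detector.
class PrefixMagic {
    // True when the data starts with the given ASCII magic string at offset 0.
    static boolean startsWith(byte[] data, String magic) {
        byte[] m = magic.getBytes(StandardCharsets.US_ASCII);
        if (data.length < m.length) {
            return false;
        }
        for (int i = 0; i < m.length; i++) {
            if (data[i] != m[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up samples: an mbox-style separator line and a plain sentence.
        byte[] mbox = "From alice@example.org Fri Mar 20 19:19:00 2015\nSubject: hi\n"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] note = "From here on, the road is unpaved.\n"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println("mbox sample matches: " + startsWith(mbox, "From "));
        System.out.println("plain note matches:  " + startsWith(note, "From "));
    }
}
```

A stricter rule would have to look at more of the first line than the five-byte prefix, which is the trade-off the comment above is pointing at.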
[jira] [Resolved] (TIKA-1287) Update NetCDF .jar file on Maven Central
[ https://issues.apache.org/jira/browse/TIKA-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1287. --- Resolution: Fixed Marking this as Fixed, since [~annieburgess] and [~lewismc] have pushed to Central. Update NetCDF .jar file on Maven Central Key: TIKA-1287 URL: https://issues.apache.org/jira/browse/TIKA-1287 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Ann Burgess Labels: jar, maven, netcdf, tika, unit-test, update I am working to update the NetCDFParser file. When using the most recent .jar file available from http://www.unidata.ucar.edu/ at the command line, I receive a note about a deprecated API: javac -classpath ../../../../tika-core/target/tika-core-1.6-SNAPSHOT.jar:../../../../toolsUI-4.3.jar org/apache/tika/parser/netcdf/NetCDFParser.java Note: org/apache/tika/parser/netcdf/NetCDFParser.java uses or overrides a deprecated API. Note: Recompile with -Xlint:deprecation for details. After updating the NetCDFParser file with non-deprecated methods (e.g. changing dimension.getName() to dimension.getFullName()), however, I get failed unit tests in Maven, which I assume is because the Maven Central Repo has the outdated version of the .jar file needed for NetCDF files (http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22edu.ucar%22%20AND%20a%3A%22netcdf%22). Can anyone provide insight into how I get the updated .jar file into the Maven Central Repository? Is there an alternative method to update Tika so I can run my unit tests in Maven? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1325) Move the font metadata definitions to properties
[ https://issues.apache.org/jira/browse/TIKA-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372076#comment-14372076 ] Tyler Palsulich commented on TIKA-1325: --- Does anyone know of an external standard for these metadata keys? Or, can we close this off? Move the font metadata definitions to properties Key: TIKA-1325 URL: https://issues.apache.org/jira/browse/TIKA-1325 Project: Tika Issue Type: Improvement Components: metadata, parser Affects Versions: 1.5, 1.6 Reporter: Nick Burch Attachments: TIKA-1325_TimeZone.patch As noticed while working on TIKA-1182, the AFM font parser has a bunch of hard-coded strings it uses as metadata keys, while the TTF font parser doesn't have many. We should switch these to being proper Properties, with definitions from a well-known standard (+ compatibility fallbacks), and have both use largely the same set. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1154) Tika hangs on format detection of malformed HTML file.
[ https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371900#comment-14371900 ] Tyler Palsulich edited comment on TIKA-1154 at 3/20/15 7:19 PM: Marking as Fixed, since the file is detected and parsed without issue. Not sure what was happening before! Thank you, [~anjackson]! was (Author: tpalsulich): Marking as Fixed, since the file is detected and parsed without issue. Not sure what was happening before! Thanks! Tika hangs on format detection of malformed HTML file. -- Key: TIKA-1154 URL: https://issues.apache.org/jira/browse/TIKA-1154 Project: Tika Issue Type: Bug Components: mime Affects Versions: 1.4 Reporter: Andrew Jackson Priority: Minor Attachments: tika-breaker.html We are using Tika on large web archives, which also happen to contain some malformed files. In particular, we found an HTML file with binary characters in the DOCTYPE declaration. This hangs Tika, either embedded or from the command line, during format detection. An example file is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file
[ https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371939#comment-14371939 ] Tyler Palsulich commented on TIKA-1194: --- Thank you, [~tssk]! Is there any way you can create a patch from {{svn diff}}, instead of (I think) just regular {{diff}}? Then, we can hopefully integrate this into trunk. :) Missing text from MS Word (DOC) file Key: TIKA-1194 URL: https://issues.apache.org/jira/browse/TIKA-1194 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Tomas Safarik Priority: Critical Attachments: OP-06-015.doc, apache-tika-1.5.patch Hello, we noticed that the text extracted from some MS Word DOC files is missing one line (in a table cell) that is present in the original document. - If you add or remove one character anywhere before the problematic line/cell, then the extracted text is correct. If you restore the original text, the problem is back. - If the file is resaved as DOCX, extraction works fine. I will provide a sample document. Please let me know if more information is needed. Regards, Tomas -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1273) old tika-server jar artifact contains no manifest so not able to invoke from shell
[ https://issues.apache.org/jira/browse/TIKA-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1273. --- Resolution: Fixed The server now starts as expected. So, I'm marking this as Fixed. {code} ➜ trunk java -jar tika-server/target/tika-server-1.8-SNAPSHOT.jar Mar 20, 2015 3:53:04 PM org.apache.tika.server.TikaServerCli main INFO: Starting Apache Tika 1.8-SNAPSHOT server Mar 20, 2015 3:53:04 PM org.apache.cxf.endpoint.ServerImpl initDestination INFO: Setting the server's publish address to be http://localhost:9998/ Mar 20, 2015 3:53:04 PM org.slf4j.impl.JCLLoggerAdapter info INFO: jetty-8.y.z-SNAPSHOT Mar 20, 2015 3:53:04 PM org.slf4j.impl.JCLLoggerAdapter info INFO: Started SelectChannelConnector@localhost:9998 Mar 20, 2015 3:53:04 PM org.apache.tika.server.TikaServerCli main INFO: Started {code} old tika-server jar artifact contains no manifest so not able to invoke from shell -- Key: TIKA-1273 URL: https://issues.apache.org/jira/browse/TIKA-1273 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.5 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.8 I've never ever used the old tika-server artifact which is generated when one installs the server module. It needs to contain a manifest otherwise it cannot be invoked from the shell. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1289) Ligatures convert on text extraction
[ https://issues.apache.org/jira/browse/TIKA-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1289. --- Resolution: Fixed Fix Version/s: 1.7 I'm marking this as Fixed, since the first sentence now seems to be valid: {code} <p>Replace this file with prentcsmacro.sty for your meeting, or with entcsmacro.sty for your meeting. Both can be found at the ENTCS Macro Home Page. </p> {code} Ligatures convert on text extraction Key: TIKA-1289 URL: https://issues.apache.org/jira/browse/TIKA-1289 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Environment: win 8, jre 1.5 Reporter: Alex Andrushchak Fix For: 1.7 Attachments: PDF_text that can be copied is over the picture.pdf From reviewing the Tika sources, Tika uses pdfbox to parse pdf files. I found that pdfbox itself uses icu4j to handle ligatures. Unfortunately, when I added the icu4j jar to my classpath nothing changed; ligatures are still not converted. A sample pdf file is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1580) ISA-Tab parsers
[ https://issues.apache.org/jira/browse/TIKA-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-1580: -- Labels: new-parser (was: ) ISA-Tab parsers --- Key: TIKA-1580 URL: https://issues.apache.org/jira/browse/TIKA-1580 Project: Tika Issue Type: New Feature Components: parser Reporter: Giuseppe Totaro Priority: Minor Labels: new-parser Attachments: TIKA-1580.patch We are going to add parsers for ISA-Tab data formats. ISA-Tab files are related to [ISA Tools|http://www.isa-tools.org/], which help to manage an increasingly diverse set of life science, environmental and biomedical experiments that employ one or a combination of technologies. The ISA tools are built upon the _Investigation_, _Study_, and _Assay_ tabular format. Therefore, the ISA-Tab data format includes three types of file: Investigation file ({{i_*.txt}}), Study file ({{s_*.txt}}), Assay file ({{a_*.txt}}). These files are organized in a [top-down hierarchy|http://www.isa-tools.org/format/specification/]: an Investigation file includes one or more Study files, and each Study file includes one or more Assay files. Essentially, the Investigation file contains high-level information about the related studies, so it provides only metadata about ISA-Tab files. More details on the file format specification are [available online|http://isatab.sourceforge.net/docs/ISA-TAB_release-candidate-1_v1.0_24nov08.pdf]. The attached patch provides a preliminary version of the ISA-Tab parsers (there are three parsers, one for each ISA-Tab file type): * {{ISATabInvestigationParser.java}}: parses Investigation files. It extracts only metadata. * {{ISATabStudyParser.java}}: parses Study files. * {{ISATabAssayParser.java}}: parses Assay files. The most important remaining improvements are: * Combine these three parsers in order to parse an ISArchive * Provide a better mapping of both study and assay data to XHTML. 
Currently, {{ISATabStudyParser}} and {{ISATabAssayParser}} provide a naive mapping function relying on [Apache Commons CSV|https://commons.apache.org/proper/commons-csv/]. Thanks for supporting me on this work [~chrismattmann]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1293) Netscape bookmark files are not being detected as HTML
[ https://issues.apache.org/jira/browse/TIKA-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372020#comment-14372020 ] Tyler Palsulich commented on TIKA-1293: --- Looks good to me. Any objections to adding this magic for HTML Netscape bookmark files? Netscape bookmark files are not being detected as HTML -- Key: TIKA-1293 URL: https://issues.apache.org/jira/browse/TIKA-1293 Project: Tika Issue Type: Bug Components: detector, mime Reporter: Phil Lester Attachments: bookmarks.txt We are able to circumvent the HTML file type detection using the standard Netscape bookmark file doctype (<!DOCTYPE NETSCAPE-Bookmark-file-1>) and renaming the file extension to .txt. Standard HTML elements can then be included in the file. Some browsers (such as Firefox) will detect the .txt file as HTML and display it accordingly when downloading. We were able to resolve this by adding a custom mime-type for text/html that included a match pattern for the Netscape doctype: <match value="&lt;!DOCTYPE NETSCAPE-Bookmark-file-1" type="string" offset="0:64"/> -- This message was sent by Atlassian JIRA (v6.3.4#6332)
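The match rule quoted in TIKA-1293 uses an offset range rather than a single offset. Below is a rough sketch of that idea, assuming the range 0:64 means the string may begin at any offset from 0 through 64; that reading is an assumption made for this example. {{RangeMagic}} is a hypothetical JDK-only helper, not Tika's implementation.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: RangeMagic is a hypothetical helper, not Tika's implementation.
class RangeMagic {
    // True when the ASCII magic string begins at some offset in [from, to], inclusive.
    static boolean matches(byte[] data, String magic, int from, int to) {
        byte[] m = magic.getBytes(StandardCharsets.US_ASCII);
        for (int start = Math.max(from, 0); start <= to && start + m.length <= data.length; start++) {
            int i = 0;
            while (i < m.length && data[start + i] == m[i]) {
                i++;
            }
            if (i == m.length) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String doctype = "<!DOCTYPE NETSCAPE-Bookmark-file-1";
        // Made-up sample: a bookmark file with two blank lines before the doctype.
        byte[] bookmarks = ("\n\n" + doctype + ">\n<TITLE>Bookmarks</TITLE>\n")
                .getBytes(StandardCharsets.US_ASCII);
        byte[] plain = "Just some notes.\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println("bookmark file matches: " + matches(bookmarks, doctype, 0, 64));
        System.out.println("plain text matches:    " + matches(plain, doctype, 0, 64));
    }
}
```

Allowing a range tolerates leading whitespace or a byte-order mark before the doctype, at the cost of scanning a few more bytes per file.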
[jira] [Commented] (TIKA-1304) Implement Metadata Property with PropertyType ALT
[ https://issues.apache.org/jira/browse/TIKA-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372029#comment-14372029 ] Tyler Palsulich commented on TIKA-1304: --- Did you ever have an update on this, [~talli...@mitre.org]? Implement Metadata Property with PropertyType ALT - Key: TIKA-1304 URL: https://issues.apache.org/jira/browse/TIKA-1304 Project: Tika Issue Type: Improvement Components: metadata Reporter: Tim Allison Priority: Trivial PropertyType Alt has been available for a while, but it doesn't appear to have been implemented. I'd like to implement it to fix TIKA-1295. If I've missed the implementation or if there is a preferred workaround, please let me know, and I'll close this issue and use that to fix TIKA-1295. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1382) Output document outlinks
[ https://issues.apache.org/jira/browse/TIKA-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372141#comment-14372141 ] Tyler Palsulich commented on TIKA-1382: --- This patch looks good to commit. But, [~chrismattmann], do you have any update? Output document outlinks Key: TIKA-1382 URL: https://issues.apache.org/jira/browse/TIKA-1382 Project: Tika Issue Type: New Feature Components: cli Affects Versions: 1.5 Reporter: Greg Padiasek Assignee: Chris A. Mattmann Attachments: outlinks.patch Would you consider adding CLI options to output document outlinks? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1401) occured infinite loop using tika library
[ https://issues.apache.org/jira/browse/TIKA-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-1401: -- Description: Hi 1. Save the file with the following content as errorfile.xml {code} <?xml version="1.0"?> <!DOCTYPE billion [ <!ELEMENT billion (#PCDATA)> <!ENTITY laugh0
[jira] [Commented] (TIKA-1402) Insert chart content in PPTX, the graph information cannot be extracted
[ https://issues.apache.org/jira/browse/TIKA-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372175#comment-14372175 ] Tyler Palsulich commented on TIKA-1402: --- Still not extracted with Tika 1.8-SNAPSHOT. Insert chart content in PPTX, the graph information cannot be extracted --- Key: TIKA-1402 URL: https://issues.apache.org/jira/browse/TIKA-1402 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Environment: win7 Reporter: Ben Gao Attachments: bug.pptx I inserted a chart in bug.pptx, the chart contains AAA, BBB, CCC, DDD, 1,2,3 and other information, graph information cannot be extracted -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1408) Fix version for tikadotnet to be tracked along with trunk and release version
[ https://issues.apache.org/jira/browse/TIKA-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1408. --- Resolution: Fixed Fixed in r1668165. Will make a note in the release notes now. Fix version for tikadotnet to be tracked along with trunk and release version - Key: TIKA-1408 URL: https://issues.apache.org/jira/browse/TIKA-1408 Project: Tika Issue Type: Bug Components: packaging Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 As reported by [~thaichat04] the tikadotnet versioning doesn't match up with trunk. This is because we aren't releasing this code yet and it's not part of the pom.xml file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-1398) Dependency in tika-parsers 1.5 contains variables
[ https://issues.apache.org/jira/browse/TIKA-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1398. - Resolution: Not a Problem Closing as Not a Problem. If you have a project which fails to build because of this, please reopen! If you'd like an example of using Tika as a Maven dependency in an external project, please see [here|https://github.com/tpalsulich/phone_numbers]. Dependency in tika-parsers 1.5 contains variables - Key: TIKA-1398 URL: https://issues.apache.org/jira/browse/TIKA-1398 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Reporter: Eddie Olsson In tika-parsers, the dependency for tika-core contains two variables that resolve differently depending on the local project in which it is used, thus breaking the dependency. From org/apache/tika/tika-parsers/1.5/tika-parsers-1.5.pom: <dependency> <groupId>${project.groupId}</groupId> <artifactId>tika-core</artifactId> <version>${project.version}</version> </dependency> -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1401) occured infinite loop using tika library
[ https://issues.apache.org/jira/browse/TIKA-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372172#comment-14372172 ] Tyler Palsulich commented on TIKA-1401: --- Still loops infinitely with Tika 1.8-SNAPSHOT. occured infinite loop using tika library Key: TIKA-1401 URL: https://issues.apache.org/jira/browse/TIKA-1401 Project: Tika Issue Type: Bug Components: detector Affects Versions: 1.5 Reporter: Robin.Hwang Hi 1. Save the file with the following content as errorfile.xml {code} <?xml version="1.0"?> <!DOCTYPE billion [ <!ELEMENT billion (#PCDATA)> <!ENTITY laugh0
[jira] [Updated] (TIKA-1405) Uppercase content detected as Estonian
[ https://issues.apache.org/jira/browse/TIKA-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-1405: -- Summary: Uppercase content detected as Estonian (was: German content detected as French) Uppercase content detected as Estonian -- Key: TIKA-1405 URL: https://issues.apache.org/jira/browse/TIKA-1405 Project: Tika Issue Type: Bug Components: languageidentifier Affects Versions: 1.4 Environment: Linux Reporter: Zaheer Beig Labels: newbie Hi, We are using Apache Tika 1.4 for document conversion to text and language detection in one of our projects. We are facing the following issue with language detection: 1. When the text is in all UPPER CASE, even though the language is English, it gets detected as Estonian. Any update on this would be very helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1406) Problems in TXT encoding detection
[ https://issues.apache.org/jira/browse/TIKA-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372183#comment-14372183 ] Tyler Palsulich commented on TIKA-1406: --- Can someone familiar with the UniversalEncodingDetector take a look at this patch? Sorry no one ever got back, [~almson]! Do you have a couple of test files which demonstrate these issues? Problems in TXT encoding detection -- Key: TIKA-1406 URL: https://issues.apache.org/jira/browse/TIKA-1406 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.6, 1.7 Reporter: Aleksandr Dubinsky Attachments: 0001-fix-TXT-encoding-detection.patch Original Estimate: 0h Remaining Estimate: 0h The detection of TXT file encoding often makes mistakes. Two things can improve the detection in many cases: - Increase the lookahead from 16k bytes to a larger number, such as 128k or larger. - Improve on the brain-dead heuristic that if a file doesn't have a \r then it must be ISO 8859-1(5) instead of Windows-1252. (For one, it mis-detects files that don't have any newlines.) A remaining problem that doesn't have an immediate solution is the frequent mis-detection of Windows-1252 as Shift-JIS. A flag to forbid Shift-JIS is desirable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
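A small sketch of the carriage-return heuristic as the TIKA-1406 report describes it, next to one possible alternative. The alternative is an assumption made for this example and is not taken from the attached patch: it relies on bytes 0x80 to 0x9F being C1 control codes in ISO-8859-1 but mostly printable punctuation in Windows-1252. {{Latin1Heuristic}} and both method names are hypothetical.

```java
// Illustrative only: Latin1Heuristic is hypothetical and not Tika's detector.
class Latin1Heuristic {
    // Heuristic as described in the report: a CR anywhere means Windows-1252.
    static String byCarriageReturn(byte[] data) {
        for (byte b : data) {
            if (b == '\r') {
                return "windows-1252";
            }
        }
        return "ISO-8859-1";
    }

    // Assumed alternative: bytes 0x80-0x9F are C1 controls in ISO-8859-1,
    // so their presence in ordinary text suggests Windows-1252 instead.
    static String byC1Range(byte[] data) {
        for (byte b : data) {
            int unsigned = b & 0xFF;
            if (unsigned >= 0x80 && unsigned <= 0x9F) {
                return "windows-1252";
            }
        }
        return "ISO-8859-1";
    }

    public static void main(String[] args) {
        // Made-up sample: "hi" wrapped in Windows-1252 curly quotes (0x93, 0x94), no newline.
        byte[] quoted = {(byte) 0x93, 'h', 'i', (byte) 0x94};
        System.out.println("CR heuristic:       " + byCarriageReturn(quoted));
        System.out.println("C1-range heuristic: " + byC1Range(quoted));
    }
}
```

The sample has no newline at all, which is the case the report singles out as mis-detected.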
[jira] [Resolved] (TIKA-1416) Refactor Translator Exception Handling
[ https://issues.apache.org/jira/browse/TIKA-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1416. --- Resolution: Fixed Fixed in r1668166. {{Translator.translate()}} can now throw a TikaException or an IOException. Refactor Translator Exception Handling -- Key: TIKA-1416 URL: https://issues.apache.org/jira/browse/TIKA-1416 Project: Tika Issue Type: Bug Components: translation Reporter: Tyler Palsulich Fix For: 1.8 `Translator.translate()` currently throws `Exception`. We should make it more specific. The only real limitation comes from MicrosoftTranslator -- the library used throws `Exception`, but that shouldn't mean Tika does too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
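The narrowing described in TIKA-1416 can be sketched as follows. This is an illustration under stated assumptions, not the change committed in r1668166: the nested {{TikaException}} is a local stand-in for Tika's exception class so the example compiles on its own, and {{LooseClient}} is a hypothetical stand-in for a third-party library whose method declares {{throws Exception}}.

```java
import java.io.IOException;

// Illustrative only: a sketch of narrowing "throws Exception", not the committed Tika code.
class TranslateSketch {
    // Local stand-in for Tika's exception type, so the example is self-contained.
    static class TikaException extends Exception {
        TikaException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Hypothetical third-party client whose API declares the broad Exception type.
    interface LooseClient {
        String call(String text) throws Exception;
    }

    // Callers only have to handle two specific checked exception types.
    static String translate(LooseClient client, String text) throws TikaException, IOException {
        try {
            return client.call(text);
        } catch (IOException e) {
            throw e; // I/O failures pass through unchanged
        } catch (Exception e) {
            // Everything else, including runtime exceptions, is wrapped.
            throw new TikaException("Translation failed", e);
        }
    }

    // Helper for the demo: reports which branch was taken.
    static String outcome(LooseClient client, String text) {
        try {
            return "ok: " + translate(client, text);
        } catch (TikaException e) {
            return "TikaException caused by " + e.getCause().getClass().getSimpleName();
        } catch (IOException e) {
            return "IOException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(outcome(t -> t.toUpperCase(), "hello"));
        System.out.println(outcome(t -> { throw new IOException("network down"); }, "hello"));
        System.out.println(outcome(t -> { throw new IllegalStateException("bad key"); }, "hello"));
    }
}
```

Keeping the original exception as the cause preserves the library's stack trace for debugging while the public signature stays specific.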