[jira] [Commented] (TIKA-1361) Update MP4Parser to 1.0.2
[ https://issues.apache.org/jira/browse/TIKA-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073596#comment-14073596 ]

Hudson commented on TIKA-1361:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.6 #114 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/114/])
Update imports following TIKA-1361 changes, to match our current preference for explicit (not wildcard) imports (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1613252)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java
Patch from Matthias Krueger from TIKA-1361 - Upgrade MP4Parser to 1.0.2, add a custom Data Source and use that for explicit temp handling. This closes #14 from Github (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1613249)
* /tika/trunk/tika-parsers/pom.xml
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/mp4/DirectFileReadDataSource.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java

> Update MP4Parser to 1.0.2
> -------------------------
>
> Key: TIKA-1361
> URL: https://issues.apache.org/jira/browse/TIKA-1361
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.5
> Reporter: Matthias Krueger
> Fix For: 1.6
>
> The currently used com.googlecode.mp4parser:isoparser version is 1.0-RC-1.
> According to https://code.google.com/p/mp4parser/#Changes/Releases and
> https://code.google.com/p/mp4parser/source/list there have been quite some
> improvements since then. Before tackling more metadata (such as in TIKA-852)
> we should update to 1.0.2.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1361) Update MP4Parser to 1.0.2
[ https://issues.apache.org/jira/browse/TIKA-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073569#comment-14073569 ]

Hudson commented on TIKA-1361:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #116 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/116/])
Update imports following TIKA-1361 changes, to match our current preference for explicit (not wildcard) imports (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1613252)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java
Patch from Matthias Krueger from TIKA-1361 - Upgrade MP4Parser to 1.0.2, add a custom Data Source and use that for explicit temp handling. This closes #14 from Github (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1613249)
* /tika/trunk/tika-parsers/pom.xml
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/mp4/DirectFileReadDataSource.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java
[jira] [Commented] (TIKA-1269) Self-hosted documentation for the JAX-RS Server
[ https://issues.apache.org/jira/browse/TIKA-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073562#comment-14073562 ]

Lewis John McGibbney commented on TIKA-1269:
--------------------------------------------

Yep I am on it right now. Patch coming up [~gagravarr]

> Self-hosted documentation for the JAX-RS Server
> -----------------------------------------------
>
> Key: TIKA-1269
> URL: https://issues.apache.org/jira/browse/TIKA-1269
> Project: Tika
> Issue Type: Improvement
> Components: server
> Affects Versions: 1.5
> Reporter: Nick Burch
> Fix For: 1.7
> Attachments: TIKA-1269-miredot.patch, enable-enunciate.patch
>
> Currently, if you fire up the JAX-RS Tika Server, and go to the root of the
> server in a web browser, you get an empty page back. You have to know to head
> over to https://wiki.apache.org/tika/TikaJAXRS to find out what the available
> URLs are.
> We should self-host some simple documentation on the server at the root of
> it, so that people can discover what it offers. Ideally, this should be
> largely auto-generated based on the endpoints, so that we don't risk missing
> things when we add new features.
> This will also allow us to potentially offer a sample running version of the
> server for people to discover Tika with.
[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073557#comment-14073557 ]

Tyler Palsulich commented on TIKA-1373:
---------------------------------------

bq. HtmlParser skips tags generated by JHighlight.

Is there a particular reason?

> AutoDetectParser extracts no text when SourceCodeParser is selected
> -------------------------------------------------------------------
>
> Key: TIKA-1373
> URL: https://issues.apache.org/jira/browse/TIKA-1373
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.5
> Reporter: Andrés Aguilar-Umaña
>
> When using the AutoDetectParser in Java code, and the SourceCodeParser is
> selected (i.e. Java files), the handler gets no text.
> I have this test program:
> {code}
> String data = "public class HelloWorld {}";
> ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
> Parser autoDetectParser = new AutoDetectParser();
> BodyContentHandler bch = new BodyContentHandler(50);
> ParseContext parseContext = new ParseContext();
> Metadata metadata = new Metadata();
> metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
> try {
>     autoDetectParser.parse(bais, bch, metadata, parseContext);
> } catch (Exception e) {
>     e.printStackTrace();
> }
> System.out.println("Text extracted: " + bch.toString());
> {code}
> It returns (using the SourceCodeParser):
> {code}
> Text extracted:
> {code}
> But when I use this code:
> {code}
> String data = "public class HelloWorld {}";
> ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
> Parser autoDetectParser = new AutoDetectParser();
> BodyContentHandler bch = new BodyContentHandler(50);
> ParseContext parseContext = new ParseContext();
> Metadata metadata = new Metadata();
> metadata.set(Metadata.CONTENT_TYPE, "text/plain");
> try {
>     autoDetectParser.parse(bais, bch, metadata, parseContext);
> } catch (Exception e) {
>     e.printStackTrace();
> }
> System.out.println("Text extracted: " + bch.toString());
> {code}
> The Text Parser is used and I get:
> {code}
> Text extracted: public class HelloWorld {}
> {code}
> I have also tested this command:
> {code}
> > java -jar tika-app-1.5.jar -t D:\text.java
> (no text)
> {code}
[jira] [Resolved] (TIKA-1361) Update MP4Parser to 1.0.2
[ https://issues.apache.org/jira/browse/TIKA-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-1361.
------------------------------
    Resolution: Fixed
    Fix Version/s: 1.6
[jira] [Commented] (TIKA-1361) Update MP4Parser to 1.0.2
[ https://issues.apache.org/jira/browse/TIKA-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073543#comment-14073543 ]

ASF GitHub Bot commented on TIKA-1361:
--------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/tika/pull/14
[GitHub] tika pull request: TIKA-1361: MP4Parser Update
Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/14 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Created] (TIKA-1376) Improve embedded file name extraction in PDFParser
Tim Allison created TIKA-1376:
---------------------------------

Summary: Improve embedded file name extraction in PDFParser
Key: TIKA-1376
URL: https://issues.apache.org/jira/browse/TIKA-1376
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial
Fix For: 1.6

When we extract embedded files from PDFs, we are currently using the key in the PDEmbeddedFilesNameTreeNode as the file name that we store as the value of Metadata.RESOURCE_NAME_KEY in the embedded document's metadata.

I think we should try to get the file name from PDComplexFileSpecification's getFilename() first. If that is null, then we should fall back to the key value.
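The fallback proposed above can be sketched in plain Java with no PDFBox dependency. This is only an illustration of the ordering, not the actual patch: `EmbeddedNameChooser` and `chooseName` are hypothetical names, and the two string arguments stand in for the result of getFilename() and for the name-tree key.

```java
/** Hypothetical helper (not Tika or PDFBox code): picks the name to record for an embedded file. */
final class EmbeddedNameChooser {

    /**
     * Prefers the file specification's own file name and falls back to the
     * name-tree key only when the specification reports null.
     */
    static String chooseName(String specFilename, String nameTreeKey) {
        return specFilename != null ? specFilename : nameTreeKey;
    }

    public static void main(String[] args) {
        // The specification carries a name, so it wins over the key.
        System.out.println(chooseName("report.docx", "Attachment 1")); // prints "report.docx"
        // No name in the specification, so the key is used.
        System.out.println(chooseName(null, "Attachment 1"));          // prints "Attachment 1"
    }
}
```

Only null triggers the fallback here, matching the wording of the issue; whether an empty file name should also fall back is left open.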
Re: How should video files with audio be handled by parsers?
On Wed, 23 Jul 2014, Ray Gauss wrote:
> 2) There are several PBCore instantiation properties that apply to the
> entire file like duration and tracks that we'd want prefixed with pbcore so
> I think it would be odd to see:
>
>   pbcore:instantiationDuration=00:00:05.20
>   stream[0]/pbcore:essenceTrackType=Video

This structure does have the advantage that any tool can easily see that the second metadata key relates to a sub-stream / sub-track etc, without having to know anything about PBCore. That will make it easier for tools to exclude or handle these differently in a general way.

(I can't think, off the top of my head, of another kind of thing that might need this structure, but I'm reluctant to nail it down to being only for PBCore if that'll cause us issues when we try to support something very similar in future)

Any chance you could get / fake a nearly-full set of metadata keys and values for a media file with (say) 3 streams? We can then generate pbcore prefixed and general prefixed versions, which should hopefully make it easier for other community members to compare and offer their input!

Nick
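To make the point about generic tooling concrete, here is a minimal sketch of how a tool could separate file-level keys from per-stream keys using only the prefix, with no knowledge of PBCore. The `stream[<index>]/` pattern is inferred from the single example key in this thread, and `StreamKeyFilter` and its methods are hypothetical names, not an existing Tika API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: splits metadata keys by the assumed "stream[N]/" prefix. */
final class StreamKeyFilter {

    // Assumed convention, taken from the example key "stream[0]/pbcore:essenceTrackType".
    private static final Pattern STREAM_PREFIX = Pattern.compile("^stream\\[\\d+\\]/");

    /** True when the key carries a stream prefix, whatever namespace follows it. */
    static boolean isStreamLevel(String key) {
        return STREAM_PREFIX.matcher(key).find();
    }

    /** Returns only the keys that apply to the whole file, in their original order. */
    static List<String> fileLevelKeys(List<String> keys) {
        List<String> result = new ArrayList<>();
        for (String key : keys) {
            if (!isStreamLevel(key)) {
                result.add(key);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
                "pbcore:instantiationDuration",
                "stream[0]/pbcore:essenceTrackType");
        System.out.println(fileLevelKeys(keys)); // prints [pbcore:instantiationDuration]
    }
}
```

The filter never inspects the part after the slash, which is the property being argued for: the same check would work for any future vocabulary that uses the same prefix.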
[jira] [Created] (TIKA-1375) Decrease memory consumption when extracting images from PDFs
Tim Allison created TIKA-1375:
---------------------------------

Summary: Decrease memory consumption when extracting images from PDFs
Key: TIKA-1375
URL: https://issues.apache.org/jira/browse/TIKA-1375
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Minor
Fix For: 1.6

This patch applies changes made in PDFBOX-2101 to decrease memory consumption during extraction of embedded images. This also applies the recommendation by [~tilman] on the PDFBox dev [list | http://mail-archives.apache.org/mod_mbox/pdfbox-dev/201407.mbox/%3c53cff0ce.9090...@t-online.de%3e] to clear resources after handling each page.
[jira] [Updated] (TIKA-1374) Need to add code to look for OS-specific keys for embedded files within PDFs
[ https://issues.apache.org/jira/browse/TIKA-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1374:
------------------------------
    Description:
Embedded files in PDFs can be found by the general all purpose key we currently use via PDFBox: "F". However, embedded documents can also be stored under OS specific keys: "DOS", "Mac" and "Unix".
[~lehmi] confirmed on the PDFBox users [list|http://mail-archives.apache.org/mod_mbox/pdfbox-users/201407.mbox/%3c1572548479.2099779.1406198761475.open-xchange@patina.store%3e] that we might be missing embedded documents if we're not trying the OS specific keys as well. As Andreas points out, according to the spec the OS specific keys shouldn't be used any more, but I think we should support extraction for them.
My proposal is to pull all documents that are available by any of the four keys (well, via getEmbeddedFile() in PDFBox). This has the downside of potentially extracting basically duplicate documents, but I'd prefer to err on the side of extracting everything.
The code fix is trivial, and I'll try to commit it today. However, it will take me a bit of time to generate a test file that stores files under the OS specific keys. So, if anyone has an ASF-friendly file available or wants to take the task of generating one, please do.

  was:
Embedded files in PDFs can be found by the general all purpose key we currently use via PDFBox: "EF/F". However, embedded documents can also be stored under OS specific keys: "EF/DOS", "EF/Mac" and "EF/Unix".
[~lehmi] confirmed on the PDFBox users [list|http://mail-archives.apache.org/mod_mbox/pdfbox-users/201407.mbox/%3c1572548479.2099779.1406198761475.open-xchange@patina.store%3e] that we might be missing embedded documents if we're not trying the OS specific keys as well. As Andreas points out, according to the spec the OS specific keys shouldn't be used any more, but I think we should support extraction for them.
My proposal is to pull all documents that are available by any of the four keys (well, via getEmbeddedFile() in PDFBox).
The code fix is trivial, and I'll try to commit it today. However, it will take me a bit of time to generate a test file that stores files under the OS specific keys. So, if anyone has an ASF-friendly file available or wants to take the task of generating one, please do.
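A rough sketch of the "pull everything under any of the four keys" proposal, written against a plain `Map` rather than PDFBox so that it runs on its own. The map is an assumed stand-in for a PDF file specification, the string values stand in for embedded file contents, and `EmbeddedFileKeys` and `collect` are hypothetical names; the real change would go through PDFBox's file specification accessors instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model (not PDFBox code) of checking the general and OS-specific keys. */
final class EmbeddedFileKeys {

    // Key names as listed in the issue, with the general-purpose key first.
    static final String[] KEYS = {"F", "DOS", "Mac", "Unix"};

    /** Returns every entry found under any of the four keys, in KEYS order, duplicates included. */
    static List<String> collect(Map<String, String> fileSpec) {
        List<String> found = new ArrayList<>();
        for (String key : KEYS) {
            String embedded = fileSpec.get(key);
            if (embedded != null) {
                found.add(embedded);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("F", "contents-A");
        spec.put("Unix", "contents-A"); // same file stored again under an OS-specific key
        System.out.println(collect(spec)); // prints [contents-A, contents-A]
    }
}
```

The output shows the downside mentioned in the description: the same document can be extracted twice. The sketch deliberately does not de-duplicate, in line with the stated preference to err on the side of extracting everything.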
[jira] [Created] (TIKA-1374) Need to add code to look for OS-specific keys for embedded files within PDFs
Tim Allison created TIKA-1374:
---------------------------------

Summary: Need to add code to look for OS-specific keys for embedded files within PDFs
Key: TIKA-1374
URL: https://issues.apache.org/jira/browse/TIKA-1374
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Minor
Fix For: 1.6

Embedded files in PDFs can be found by the general all purpose key we currently use via PDFBox: "EF/F". However, embedded documents can also be stored under OS specific keys: "EF/DOS", "EF/Mac" and "EF/Unix".
[~lehmi] confirmed on the PDFBox users [list|http://mail-archives.apache.org/mod_mbox/pdfbox-users/201407.mbox/%3c1572548479.2099779.1406198761475.open-xchange@patina.store%3e] that we might be missing embedded documents if we're not trying the OS specific keys as well. As Andreas points out, according to the spec the OS specific keys shouldn't be used any more, but I think we should support extraction for them.
My proposal is to pull all documents that are available by any of the four keys (well, via getEmbeddedFile() in PDFBox).
The code fix is trivial, and I'll try to commit it today. However, it will take me a bit of time to generate a test file that stores files under the OS specific keys. So, if anyone has an ASF-friendly file available or wants to take the task of generating one, please do.
[jira] [Commented] (TIKA-1269) Self-hosted documentation for the JAX-RS Server
[ https://issues.apache.org/jira/browse/TIKA-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073164#comment-14073164 ]

Nick Burch commented on TIKA-1269:
----------------------------------

It's a bit hard to be sure on Miredot when most (all?) of the endpoints lack the documentation to make them look nice... Any chance someone (Lewis?) could add those javadocs and annotations etc to just one endpoint, so we can see Miredot in all its glory before we decide?
[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073077#comment-14073077 ]

Hudson commented on TIKA-1373:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #114 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/114/])
TIKA-1373 - Send html content to SAX events by using TagSoup (thaichat04: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1613051)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/code/SourceCodeParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/code/SourceCodeParserTest.java
[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073052#comment-14073052 ]

Hudson commented on TIKA-1373:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.6 #112 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/112/])
TIKA-1373 - Send html content to SAX events by using TagSoup (thaichat04: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1613051)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/code/SourceCodeParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/code/SourceCodeParserTest.java
[jira] [Resolved] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen resolved TIKA-1373.
-----------------------------------
    Resolution: Fixed
[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073042#comment-14073042 ]

Hong-Thai Nguyen commented on TIKA-1373:
----------------------------------------

HtmlParser skips the tags generated by JHighlight. I found a solution by using the TagSoup parser directly. Committed in r1613051.

As I mentioned in TIKA-1224, this parser is a quick & dirty approach to parsing source code files. Again, the _right_ approach is to have a dedicated parser per language that parses the elements deeply and builds events on the fly.