[jira] [Commented] (TIKA-1361) Update MP4Parser to 1.0.2

2014-07-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073596#comment-14073596
 ] 

Hudson commented on TIKA-1361:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #114 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/114/])
Update imports following TIKA-1361 changes, to match our current preference for 
explicit (not wildcard) imports (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1613252)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java
Patch from Matthias Krueger from TIKA-1361 - Upgrade MP4Parser to 1.0.2, add a 
custom Data Source and use that for explicit temp handling. This closes #14 
from Github (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1613249)
* /tika/trunk/tika-parsers/pom.xml
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/mp4/DirectFileReadDataSource.java
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java


> Update MP4Parser to 1.0.2
> -------------------------
>
> Key: TIKA-1361
> URL: https://issues.apache.org/jira/browse/TIKA-1361
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.5
> Reporter: Matthias Krueger
> Fix For: 1.6
>
>
> The currently used com.googlecode.mp4parser:isoparser version is 1.0-RC-1. 
> According to https://code.google.com/p/mp4parser/#Changes/Releases and 
> https://code.google.com/p/mp4parser/source/list there have been quite a few 
> improvements since then. Before tackling more metadata (such as in TIKA-852) 
> we should update to 1.0.2.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1361) Update MP4Parser to 1.0.2

2014-07-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073569#comment-14073569
 ] 

Hudson commented on TIKA-1361:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #116 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/116/])
Update imports following TIKA-1361 changes, to match our current preference for 
explicit (not wildcard) imports (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1613252)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java
Patch from Matthias Krueger from TIKA-1361 - Upgrade MP4Parser to 1.0.2, add a 
custom Data Source and use that for explicit temp handling. This closes #14 
from Github (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1613249)
* /tika/trunk/tika-parsers/pom.xml
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/mp4/DirectFileReadDataSource.java
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java




[jira] [Commented] (TIKA-1269) Self-hosted documentation for the JAX-RS Server

2014-07-24 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073562#comment-14073562
 ] 

Lewis John McGibbney commented on TIKA-1269:


Yep I am on it right now. Patch coming up [~gagravarr]

> Self-hosted documentation for the JAX-RS Server
> ---
>
> Key: TIKA-1269
> URL: https://issues.apache.org/jira/browse/TIKA-1269
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 1.5
>Reporter: Nick Burch
> Fix For: 1.7
>
> Attachments: TIKA-1269-miredot.patch, enable-enunciate.patch
>
>
> Currently, if you fire up the JAX-RS Tika Server and go to the root of the 
> server in a web browser, you get an empty page back. You have to know to head 
> over to https://wiki.apache.org/tika/TikaJAXRS to find out what the available 
> URLs are.
> We should self-host some simple documentation on the server at the root of 
> it, so that people can discover what it offers. Ideally, this should be 
> largely auto-generated based on the endpoints, so that we don't risk missing 
> things when we add new features.
> This will also allow us to potentially offer a sample running version of the 
> server for people to discover Tika with.





[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-24 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073557#comment-14073557
 ] 

Tyler Palsulich commented on TIKA-1373:
---

bq. HtmlParser skips tags generated by JHighlight.
Is there a particular reason?

> AutoDetectParser extracts no text when SourceCodeParser is selected
> ---
>
> Key: TIKA-1373
> URL: https://issues.apache.org/jira/browse/TIKA-1373
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.5
>Reporter: Andrés Aguilar-Umaña
>
> When using the AutoDetectParser in Java code, and the SourceCodeParser is 
> selected (e.g. for Java files), the handler gets no text.
> I have this test program:
> {code}
> String data = "public class HelloWorld {}";
> ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
> Parser autoDetectParser = new AutoDetectParser();
> BodyContentHandler bch = new BodyContentHandler(50);
> ParseContext parseContext = new ParseContext();
> Metadata metadata = new Metadata();
> // Content type hint under which the SourceCodeParser gets selected
> metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
> try {
>     autoDetectParser.parse(bais, bch, metadata, parseContext);
> } catch (Exception e) {
>     e.printStackTrace();
> }
> System.out.println("Text extracted: " + bch.toString());
> {code}
> It returns (using the SourceCodeParser): 
> {code}
> Text extracted:
> {code}
> But when I use this code:
> {code}
> String data = "public class HelloWorld {}";
> ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
> Parser autoDetectParser = new AutoDetectParser();
> BodyContentHandler bch = new BodyContentHandler(50);
> ParseContext parseContext = new ParseContext();
> Metadata metadata = new Metadata();
> // Same program, but with a plain-text content type hint
> metadata.set(Metadata.CONTENT_TYPE, "text/plain");
> try {
>     autoDetectParser.parse(bais, bch, metadata, parseContext);
> } catch (Exception e) {
>     e.printStackTrace();
> }
> System.out.println("Text extracted: " + bch.toString());
> {code}
> The Text Parser is used and I get:
> {code}
> Text extracted: public class HelloWorld {}
> {code}
> I have also tested this command: 
> {code}
> > java -jar tika-app-1.5.jar -t D:\text.java
>   (no text)
> {code}





[jira] [Resolved] (TIKA-1361) Update MP4Parser to 1.0.2

2014-07-24 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1361.
--

   Resolution: Fixed
Fix Version/s: 1.6



[jira] [Commented] (TIKA-1361) Update MP4Parser to 1.0.2

2014-07-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073543#comment-14073543
 ] 

ASF GitHub Bot commented on TIKA-1361:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/14


> Update MP4Parser to 1.0.2
> -
>
> Key: TIKA-1361
> URL: https://issues.apache.org/jira/browse/TIKA-1361
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.5
>Reporter: Matthias Krueger
>
> The currently used com.googlecode.mp4parser:isoparser version is 1.0-RC-1. 
> According to https://code.google.com/p/mp4parser/#Changes/Releases and 
> https://code.google.com/p/mp4parser/source/list there have been quite a few 
> improvements since then. Before tackling more metadata (such as in TIKA-852) 
> we should update to 1.0.2.





[GitHub] tika pull request: TIKA-1361: MP4Parser Update

2014-07-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/14


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Created] (TIKA-1376) Improve embedded file name extraction in PDFParser

2014-07-24 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1376:
-

 Summary: Improve embedded file name extraction in PDFParser
 Key: TIKA-1376
 URL: https://issues.apache.org/jira/browse/TIKA-1376
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial
 Fix For: 1.6


When we extract embedded files from PDFs, we are currently using the key in the 
PDEmbeddedFilesNameTreeNode as the file name that we store as the value of 
Metadata.RESOURCE_NAME_KEY in the embedded document's  metadata.

I think we should try to get the file name from PDComplexFileSpecification's 
getFilename() first.  If that is null, then we should fall back to the key 
value.
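
The proposed lookup order could be sketched roughly as below. This is only an illustration: EmbeddedFileNames and resolve() are hypothetical names, not existing Tika or PDFBox code, and the caller is assumed to pass in the result of PDComplexFileSpecification's getFilename() together with the name tree key.

```java
/** Hypothetical helper for illustration; not part of Tika or PDFBox. */
final class EmbeddedFileNames {
    private EmbeddedFileNames() {}

    /**
     * Picks the name to store under Metadata.RESOURCE_NAME_KEY.
     *
     * @param specFilename assumed to be the value returned by
     *                     PDComplexFileSpecification's getFilename(); may be null
     * @param nameTreeKey  the key under which the file was found in the name tree
     * @return the file specification's name if present, otherwise the key
     */
    static String resolve(String specFilename, String nameTreeKey) {
        if (specFilename != null) {
            return specFilename;
        }
        // Fall back to the current behaviour: use the name tree key.
        return nameTreeKey;
    }
}
```

Only null triggers the fallback here, as in the description above; whether an empty string should also fall back to the key is left open.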





Re: How should video files with audio be handled by parsers?

2014-07-24 Thread Nick Burch

On Wed, 23 Jul 2014, Ray Gauss wrote:
> 2) There are several PBCore instantiation properties that apply to 
> the entire file, like duration and tracks, that we'd want prefixed with 
> pbcore, so I think it would be odd to see:
>
>   pbcore:instantiationDuration=00:00:05.20
>   stream[0]/pbcore:essenceTrackType=Video


This structure does have the advantage that any tool can easily see that 
the second metadata key relates to a sub-stream / sub-track etc, without 
having to know anything about PBCore. That will make it easier for tools 
to exclude or handle these differently in a general way.
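
As a rough illustration of that point, a tool could split the stream prefix off with something like the sketch below, without knowing anything about the namespace behind it. StreamKeys and its methods are made-up names, and the stream[N]/ syntax is just the one from the example above, not an agreed format.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; the key syntax is not final. */
final class StreamKeys {
    // Matches keys such as "stream[0]/pbcore:essenceTrackType"
    private static final Pattern STREAM_KEY =
            Pattern.compile("^stream\\[(\\d+)\\]/(.+)$");

    private StreamKeys() {}

    /** Returns the stream index, or -1 if the key applies to the whole file. */
    static int streamIndex(String key) {
        Matcher m = STREAM_KEY.matcher(key);
        return m.matches() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** Returns the key with any stream prefix removed. */
    static String baseKey(String key) {
        Matcher m = STREAM_KEY.matcher(key);
        return m.matches() ? m.group(2) : key;
    }
}
```

A generic tool could then skip or group every key for which streamIndex() is not -1, whatever metadata standard the rest of the key comes from.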


(I can't think, off the top of my head, of another kind of thing that 
might need this structure, but I'm reluctant to nail it down to being only 
for PBCore if that'll cause us issues when we try to support something 
very similar in future)



Any chance you could get / fake a nearly-full set of metadata keys and 
values for a media file with (say) 3 streams? We can then generate pbcore 
prefixed and general prefixed versions, which should hopefully make it 
easier for other community members to compare and offer their input!


Nick

[jira] [Created] (TIKA-1375) Decrease memory consumption when extracting images from PDFs

2014-07-24 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1375:
-

 Summary: Decrease memory consumption when extracting images from 
PDFs
 Key: TIKA-1375
 URL: https://issues.apache.org/jira/browse/TIKA-1375
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Minor
 Fix For: 1.6


This patch applies changes made in PDFBOX-2101 to decrease memory consumption 
during extraction of embedded images.  This also applies the recommendation by 
[~tilman] on the PDFBox dev [list | 
http://mail-archives.apache.org/mod_mbox/pdfbox-dev/201407.mbox/%3c53cff0ce.9090...@t-online.de%3e]
 to clear resources after handling each page.





[jira] [Updated] (TIKA-1374) Need to add code to look for OS-specific keys for embedded files within PDFs

2014-07-24 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1374:
--

Description: 
Embedded files in PDFs can be found by the general all purpose key we  
currently use via PDFBox:  "F".  However, embedded documents can also be stored 
under OS specific keys: "DOS", "Mac" and "Unix".

[~lehmi] confirmed on the PDFBox users 
[list|http://mail-archives.apache.org/mod_mbox/pdfbox-users/201407.mbox/%3c1572548479.2099779.1406198761475.open-xchange@patina.store%3e]
 that we might be missing embedded documents if we're not trying the OS 
specific keys as well.  As Andreas points out, according to the spec the OS 
specific keys shouldn't be used any more, but I think we should support 
extraction for them.

My proposal is to pull all documents that are available by any of the four keys 
(well, via getEmbeddedFile() in PDFBox).  This has the downside of 
potentially extracting basically duplicate documents, but I'd prefer to err on 
the side of extracting everything.

The code fix is trivial, and I'll try to commit it today.  However, it will 
take me a bit of time to generate a test file that stores files under the OS 
specific keys.  So, if anyone has an ASF-friendly file available or wants to 
take the task of generating one, please do.

  was:
Embedded files in PDFs can be found by the general all purpose key we  
currently use via PDFBox:  "EF/F".  However, embedded documents can also be 
stored under OS specific keys: "EF/DOS", "EF/Mac" and "EF/Unix".

[~lehmi] confirmed on the PDFBox users 
[list|http://mail-archives.apache.org/mod_mbox/pdfbox-users/201407.mbox/%3c1572548479.2099779.1406198761475.open-xchange@patina.store%3e]
 that we might be missing embedded documents if we're not trying the OS 
specific keys as well.  As Andreas points out, according to the spec the OS 
specific keys shouldn't be used any more, but I think we should support 
extraction for them.

My proposal is to pull all documents that are available by any of the four keys 
(well, via getEmbeddedFile() in PDFBox).  The code fix is trivial, and I'll 
try to commit it today.  However, it will take me a bit of time to generate a 
test file that stores files under the OS specific keys.  So, if anyone has an 
ASF-friendly file available or wants to take the task of generating one, please 
do.


> Need to add code to look for OS-specific keys for embedded files within PDFs
> 
>
> Key: TIKA-1374
> URL: https://issues.apache.org/jira/browse/TIKA-1374
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.6
>
>
> Embedded files in PDFs can be found by the general all purpose key we  
> currently use via PDFBox:  "F".  However, embedded documents can also be 
> stored under OS specific keys: "DOS", "Mac" and "Unix".
> [~lehmi] confirmed on the PDFBox users 
> [list|http://mail-archives.apache.org/mod_mbox/pdfbox-users/201407.mbox/%3c1572548479.2099779.1406198761475.open-xchange@patina.store%3e]
>  that we might be missing embedded documents if we're not trying the OS 
> specific keys as well.  As Andreas points out, according to the spec the OS 
> specific keys shouldn't be used any more, but I think we should support 
> extraction for them.
> My proposal is to pull all documents that are available by any of the four 
> keys (well, via getEmbeddedFile() in PDFBox).  This has the downside of 
> potentially extracting basically duplicate documents, but I'd prefer to err 
> on the side of extracting everything.
> The code fix is trivial, and I'll try to commit it today.  However, it will 
> take me a bit of time to generate a test file that stores files under the OS 
> specific keys.  So, if anyone has an ASF-friendly file available or wants to 
> take the task of generating one, please do.





[jira] [Created] (TIKA-1374) Need to add code to look for OS-specific keys for embedded files within PDFs

2014-07-24 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1374:
-

 Summary: Need to add code to look for OS-specific keys for 
embedded files within PDFs
 Key: TIKA-1374
 URL: https://issues.apache.org/jira/browse/TIKA-1374
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Minor
 Fix For: 1.6


Embedded files in PDFs can be found by the general all purpose key we  
currently use via PDFBox:  "EF/F".  However, embedded documents can also be 
stored under OS specific keys: "EF/DOS", "EF/Mac" and "EF/Unix".

[~lehmi] confirmed on the PDFBox users 
[list|http://mail-archives.apache.org/mod_mbox/pdfbox-users/201407.mbox/%3c1572548479.2099779.1406198761475.open-xchange@patina.store%3e]
 that we might be missing embedded documents if we're not trying the OS 
specific keys as well.  As Andreas points out, according to the spec the OS 
specific keys shouldn't be used any more, but I think we should support 
extraction for them.

My proposal is to pull all documents that are available by any of the four keys 
(well, via getEmbeddedFile() in PDFBox).  The code fix is trivial, and I'll 
try to commit it today.  However, it will take me a bit of time to generate a 
test file that stores files under the OS specific keys.  So, if anyone has an 
ASF-friendly file available or wants to take the task of generating one, please 
do.
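
The "try all four keys" idea can be sketched as follows. This is a simplified model rather than the actual fix: the EF dictionary is represented as a plain Map, OsSpecificEmbeddedFiles is a hypothetical name, and the real PDFBox calls are not shown.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; models the EF dictionary as a plain Map. */
final class OsSpecificEmbeddedFiles {
    // The general key first, then the OS-specific ones named in this issue.
    static final String[] KEYS = {"F", "DOS", "Mac", "Unix"};

    private OsSpecificEmbeddedFiles() {}

    /**
     * Collects every embedded file stored under any of the four keys,
     * preserving the order of KEYS. Other entries are ignored.
     */
    static <T> Map<String, T> collect(Map<String, T> efEntries) {
        Map<String, T> found = new LinkedHashMap<String, T>();
        for (String key : KEYS) {
            T file = efEntries.get(key);
            if (file != null) {
                found.put(key, file);
            }
        }
        return found;
    }
}
```

This sketch makes no attempt to detect when two keys point at the same content.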





[jira] [Commented] (TIKA-1269) Self-hosted documentation for the JAX-RS Server

2014-07-24 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073164#comment-14073164
 ] 

Nick Burch commented on TIKA-1269:
--

It's a bit hard to be sure on Miredot when most (all?) of the endpoints lack 
the documentation to make them look nice... Any chance someone (Lewis?) could 
add those javadocs and annotations etc to just one endpoint, so we can see 
Miredot in all its glory before we decide?



[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073077#comment-14073077
 ] 

Hudson commented on TIKA-1373:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #114 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/114/])
TIKA-1373 - Send html content to SAX events by using TagSoup (thaichat04: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1613051)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/code/SourceCodeParser.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/code/SourceCodeParserTest.java




[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073052#comment-14073052
 ] 

Hudson commented on TIKA-1373:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #112 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/112/])
TIKA-1373 - Send html content to SAX events by using TagSoup (thaichat04: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1613051)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/code/SourceCodeParser.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/code/SourceCodeParserTest.java




[jira] [Resolved] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1373.


Resolution: Fixed



[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-24 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073042#comment-14073042
 ] 

Hong-Thai Nguyen commented on TIKA-1373:


HtmlParser skips the tags generated by JHighlight. I found a solution by using 
the TagSoup parser directly. Committed in r1613051.
As I mentioned in TIKA-1224, this parser is a quick & dirty approach to parsing 
source code files. Again, the _right_ approach is a dedicated parser per 
language that parses elements deeply and builds events on the fly.
