[jira] [Commented] (TIKA-591) Separate launcher process for forking JVMs

2015-02-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342034#comment-14342034
 ] 

Tyler Palsulich commented on TIKA-591:
--

Is there still interest in this, or is it superseded by tika-batch?

> Separate launcher process for forking JVMs
> -
>
> Key: TIKA-591
> URL: https://issues.apache.org/jira/browse/TIKA-591
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Jukka Zitting
>Assignee: Jukka Zitting
>Priority: Minor
>
> As a followup to TIKA-416, it would be good to implement at least optional 
> support for a separate launcher process for the ForkParser feature. The need 
> for such an extra process came up in JCR-2864 where a reference to 
> http://developers.sun.com/solaris/articles/subprocess/subprocess.html  was 
> made.
> To summarize, the problem is that the ProcessBuilder.start() call can result 
> in a temporary duplication of the memory space of the parent JVM. Even with 
> copy-on-write semantics this can be a fairly expensive operation and prone to 
> out-of-memory issues especially in large-scale deployments where the parent 
> JVM already uses the majority of the available RAM on a computer.
> A similar problem is also being discussed at HADOOP-5059.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (TIKA-590) Create facility for deeper introspection of media files

2015-02-28 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-590.

Resolution: Won't Fix

> Create facility for deeper introspection of media files
> ---
>
> Key: TIKA-590
> URL: https://issues.apache.org/jira/browse/TIKA-590
> Project: Tika
>  Issue Type: Wish
>  Components: metadata
>Reporter: Andre-John Mas
>
> This feature would allow applications to dig deeper into files to define 
> meta-data that is not presented as a tag in the file. For example a file that 
> has no duration information could with a little more work provide this 
> missing information. The idea is to let the API user make a difference 
> between data that is quick to retrieve and data that is slower to retrieve 
> because of the extra processing needed to get that information.





[jira] [Updated] (TIKA-579) DcXMLParser: DC metadata text in extracted body

2015-02-28 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-579:
-
Affects Version/s: (was: 0.8)
   1.8

> DcXMLParser: DC metadata text in extracted body
> ---
>
> Key: TIKA-579
> URL: https://issues.apache.org/jira/browse/TIKA-579
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.8
> Environment: N/A
>Reporter: Scott Severtson
>
> The DcXMLParser correctly extracts Dublin Core metadata text into the 
> Metadata object, but the metadata text is included in the extracted "body". 
> Sample XML document:
> ---
> 
> http://purl.org/dc/elements/1.1/";>
>   This is the title
>   Scott Severtson
>   This is the subject
>   This is the body text.
> 
> ---
> Sample code:
> ---
> URL xmlDocument = ...
> TikaConfig tikaConfig = new TikaConfig();
> ParseUtils.getStringContent(xmlDocument, tikaConfig, "application/xml");
> ---
> Actual output:
> ---
>   This is the title
>   Scott Severtson
>   This is the subject
>   This is the body text.
> ---
> Expected output:
> ---
>   This is the body text.
> ---
> The output is consistent when using ParseUtils *and* when using DcXMLParser 
> directly with a ContentHandler. The ContentHandler receives a single text 
> node containing concatenated metadata and body text, so there is no 
> opportunity to externally work around this issue. We would expect DcXMLParser 
> to remove DC nodes from the body prior to extracting the body text, to be 
> more consistent with other Tika parsers.
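
The behaviour requested above (leave DC element text out of the body) could be sketched with a plain SAX handler. This is only an illustration using the JDK's XML parser, not the DcXMLParser code; the class name DcSkippingTextHandler and the extractBody helper are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects text, but skips anything inside Dublin Core elements. */
final class DcSkippingTextHandler extends DefaultHandler {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    private final StringBuilder body = new StringBuilder();
    private int dcDepth = 0; // > 0 while we are inside a DC element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (dcDepth > 0 || DC_NS.equals(uri)) {
            dcDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (dcDepth > 0) {
            dcDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (dcDepth == 0) {
            body.append(ch, start, length);
        }
    }

    String bodyText() {
        return body.toString().trim();
    }

    /** Parses the given XML with namespace awareness and returns the non-DC text. */
    static String extractBody(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so DC elements report their namespace URI
        DcSkippingTextHandler handler = new DcSkippingTextHandler();
        factory.newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.bodyText();
    }
}
```

A real fix inside Tika would presumably still forward the DC values to the Metadata object; the sketch only shows the filtering half.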





[jira] [Commented] (TIKA-579) DcXMLParser: DC metadata text in extracted body

2015-02-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342031#comment-14342031
 ] 

Tyler Palsulich commented on TIKA-579:
--

+1. DC tags should be put into the Metadata. This is still a problem with 
1.8-SNAPSHOT.

> DcXMLParser: DC metadata text in extracted body
> ---
>
> Key: TIKA-579
> URL: https://issues.apache.org/jira/browse/TIKA-579
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.8
> Environment: N/A
>Reporter: Scott Severtson
>
> The DcXMLParser correctly extracts Dublin Core metadata text into the 
> Metadata object, but the metadata text is included in the extracted "body". 
> Sample XML document:
> ---
> 
> http://purl.org/dc/elements/1.1/";>
>   This is the title
>   Scott Severtson
>   This is the subject
>   This is the body text.
> 
> ---
> Sample code:
> ---
> URL xmlDocument = ...
> TikaConfig tikaConfig = new TikaConfig();
> ParseUtils.getStringContent(xmlDocument, tikaConfig, "application/xml");
> ---
> Actual output:
> ---
>   This is the title
>   Scott Severtson
>   This is the subject
>   This is the body text.
> ---
> Expected output:
> ---
>   This is the body text.
> ---
> The output is consistent when using ParseUtils *and* when using DcXMLParser 
> directly with a ContentHandler. The ContentHandler receives a single text 
> node containing concatenated metadata and body text, so there is no 
> opportunity to externally work around this issue. We would expect DcXMLParser 
> to remove DC nodes from the body prior to extracting the body text, to be 
> more consistent with other Tika parsers.





[jira] [Resolved] (TIKA-577) IndexOutOfBounds Exception looking for Picture in Word 03 doc that has no pictures

2015-02-28 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-577.
--
Resolution: Not a Problem

The document is corrupted. The POI error is now {{Caused by: 
java.lang.IllegalArgumentException: The end (4157) must not be before the start 
(4158)}}. It seems reasonable to bubble up this exception through Tika.

> IndexOutOfBounds Exception looking for Picture in Word 03 doc that has no 
> pictures
> --
>
> Key: TIKA-577
> URL: https://issues.apache.org/jira/browse/TIKA-577
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.8, 0.9
> Environment: Win Pro 7, x64, jdk1.6.0_22, jre 6.0.220.4
>Reporter: Dennis Adler
> Attachments: X'd Out Doc for Tika.doc
>
>
> When cracking a Word 03 document (which, unfortunately, I cannot upload -- it 
> has client-confidential data), an index out of bounds exception occurs in the 
> POI code used by the WordExtractor. To try to make up for the unavailable doc 
> file, I've included the results of a couple of hours stepping through the 
> code to find the failure point. The error occurs because point[0] = point[1] 
> = 301; upperbound of _paragraphs = 301. This is in the method 
> org.apache.poi.hwpf.usermodel.CharacterRun() .
> The method + line numbers are:
> public CharacterRun getCharacterRun(int index)
> line 792: int[] point = findRange(_paragraphs, _parStart, 
> Math.max(chpx.getStart(), _start), chpx.getEnd());
> line 794: PAPX papx = _paragraphs.get(point[0]);  // <<< This is the 
> source of the exception
> STACK at time of exception:
> Range.getCharacterRun(int) line 794
> PicturesTable.getAllPictures() line 191
> WordExtractor$PicturesSource.<init>(HWPFDocument) line 429
> WordExtractor$PicturesSource.<init>(HWPFDocument, WordExtractor#1) line 419
> WordExtractor.parse(POIFSFileSystem, XHTMLContentHandler) line 75
> OfficeParser.parse(CompositeParser).parse(InputStream, ContentHandler, 
> Metadata, ParseContext) line 187
> DefaultParser(CompositeParser).parse(InputStream, ContentHandler, Metadata, 
> ParseContext) line 197
> AutoDetectParser(CompositeParser).parse(InputStream, ContentHandler, 
> Metadata, ParseContext) line 197
> AutoDetectParser.parse(InputStream, ContentHandler, Metadata, ParseContext) 
> line 137
> ... (my project) ...
> As noted, this occurs in a Word 2003 doc which has no pictures (it is a 
> table); 147 character runs (0 - 146) found in first pass. Problem occurs on
> first pass (not sure if there will be others) on this run. Last run in this 
> code section from org.apache.poi.hwpf.model.PicturesTable.getAllPictures(),
> lines 186-191:
>   public List getAllPictures() {
> ArrayList pictures = new ArrayList();
> Range range = _document.getOverallRange();
> for (int i = 0; i < range.numCharacterRuns(); i++) {
>   CharacterRun run = range.getCharacterRun(i);
> Error occurs on getCharacterRun(i) when i = 146, which is the last run in the 
> range. If I change point[0] to 300 (in getCharacterRun), the call returns 
> nicely to 
> WordExtractor$PicturesSource(HWPFDocument) line 429, setting the List 
> all to an empty List. Fails again later on subsequent call to
> getAllPictures with same error.
> POTENTIAL FIX: if point[0] > papx.Length then return an EMPTY CharacterRun 
> for the paragraph in question.
> Cannot send repro document - contains confidential client data.





[jira] [Closed] (TIKA-539) Encoding detection is too biased by encoding in meta tag

2015-02-28 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-539.

Resolution: Fixed

> Encoding detection is too biased by encoding in meta tag
> 
>
> Key: TIKA-539
> URL: https://issues.apache.org/jira/browse/TIKA-539
> Project: Tika
>  Issue Type: Bug
>  Components: metadata, parser
>Affects Versions: 0.8, 0.9, 0.10
>Reporter: Reinhard Schwab
>Assignee: Ken Krugler
> Fix For: 1.8
>
> Attachments: TIKA-539.patch, TIKA-539_2.patch
>
>
> If the encoding in the meta tag is wrong, that encoding is still detected,
> even if the right encoding was set in the metadata beforehand (which can come 
> from the HTTP response header).
> test code to reproduce:
> static String content = "<html><head>\n"
>   + "<meta http-equiv=\"content-type\" content=\"application/xhtml+xml; charset=iso-8859-1\" />"
>   + "</head><body>Über den Wolken</body></html>\n";
>   /**
>* @param args
>* @throws IOException
>* @throws TikaException
>* @throws SAXException
>*/
>   public static void main(String[] args) throws IOException, SAXException,
>   TikaException {
>   Metadata metadata = new Metadata();
>   metadata.set(Metadata.CONTENT_TYPE, "text/html");
>   metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
>   System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>   InputStream in = new 
> ByteArrayInputStream(content.getBytes("UTF-8"));
>   AutoDetectParser parser = new AutoDetectParser();
>   BodyContentHandler h = new BodyContentHandler(1);
>   parser.parse(in, h, metadata, new ParseContext());
>   System.out.print(h.toString());
>   System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>   }
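
One possible precedence rule matching the report above is to trust a charset supplied by the caller (for example from the HTTP response header) before the one declared in the meta tag. The sketch below uses only java.nio.charset; CharsetChooser and its method names are hypothetical and say nothing about how Tika resolved the issue.

```java
import java.nio.charset.Charset;

/** Hypothetical helper: picks a charset, preferring the caller-supplied hint. */
final class CharsetChooser {
    private CharsetChooser() {}

    /** Returns the named charset, or null if the name is missing, illegal or unsupported. */
    static Charset parseOrNull(String name) {
        if (name == null || name.trim().isEmpty()) {
            return null;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException and UnsupportedCharsetException.
            return null;
        }
    }

    /** Caller-supplied charset wins, then the meta tag, then the fallback. */
    static Charset choose(String declaredByCaller, String declaredInMetaTag, Charset fallback) {
        Charset c = parseOrNull(declaredByCaller);
        if (c != null) {
            return c;
        }
        c = parseOrNull(declaredInMetaTag);
        return c != null ? c : fallback;
    }
}
```

With the values from the test code above, choose("UTF-8", "iso-8859-1", fallback) would pick UTF-8. A real detector would likely also check whether the bytes are plausible in the chosen charset.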





[jira] [Commented] (TIKA-524) Unification of HTML output from Office, OOXML and Open Document parsers

2015-02-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342026#comment-14342026
 ] 

Tyler Palsulich commented on TIKA-524:
--

Is there still interest/a possibility of implementing this? I haven't worked 
with the word document Parsers.

> Unification of HTML output from Office, OOXML and Open Document parsers
> ---
>
> Key: TIKA-524
> URL: https://issues.apache.org/jira/browse/TIKA-524
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 0.7
>Reporter: Geoff Jarrad
>Priority: Minor
>
> Word documents can easily be transformed, apparently without loss of 
> information, between common storage formats, such as .doc, .docx and .odt.
> However, when the above variants of a single document are analysed with the 
> respective Tika parsers, OfficeParser, OOXMLParser and OpenDocumentParser, 
> the resulting HTML output varies considerably between parsers.
> Given the latest advances in these parsers, it should now be feasible to: (i) 
> establish a common HTML representation that can adequately describe word 
> document content, and  (ii) modify the aforementioned parsers to conform to 
> this new standard.
> Points of interest include: headings, pre-formatted text and other styles, 
> headers and footers, tables, hyperlinks, etc.





[jira] [Closed] (TIKA-497) HtmlHandler should fix up incorrect capitalization of names in <meta http-equiv="xxx"> attributes before putting into metadata

2015-02-28 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-497.

Resolution: Fixed

> HtmlHandler should fix up incorrect capitalization of names in <meta http-equiv="xxx"> attributes before putting into metadata
> --
>
> Key: TIKA-497
> URL: https://issues.apache.org/jira/browse/TIKA-497
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 0.7
>Reporter: Ken Krugler
>Assignee: Ken Krugler
>Priority: Minor
>
> With the current behavior, you can get metadata entries that have 
> "Content-Type" and "content-type" as their names, because http-equiv 
> attribute values often use incorrect capitalization.
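
A minimal sketch of the kind of normalization described above, assuming the goal is the conventional Upper-Kebab form of HTTP header names ("content-type" becomes "Content-Type"). HeaderNames.canonicalize is a hypothetical helper, not the method used in HtmlHandler.

```java
/** Hypothetical helper: normalizes the capitalization of http-equiv / header names. */
final class HeaderNames {
    private HeaderNames() {}

    /** Upper-cases the first letter of each dash-separated part, lower-cases the rest. */
    static String canonicalize(String name) {
        String trimmed = name.trim();
        StringBuilder sb = new StringBuilder(trimmed.length());
        boolean upperNext = true;
        for (int i = 0; i < trimmed.length(); i++) {
            char c = trimmed.charAt(i);
            sb.append(upperNext ? Character.toUpperCase(c) : Character.toLowerCase(c));
            upperNext = (c == '-');
        }
        return sb.toString();
    }
}
```

Names with irregular capitalization (such as "ETag") would need a lookup table instead; the sketch ignores those.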





[jira] [Commented] (TIKA-465) LanguageIdentifier API enhancements

2015-02-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342023#comment-14342023
 ] 

Tyler Palsulich commented on TIKA-465:
--

Is there still interest in implementing this? If not, I'll close off later this 
week.

> LanguageIdentifier API enhancements
> ---
>
> Key: TIKA-465
> URL: https://issues.apache.org/jira/browse/TIKA-465
> Project: Tika
>  Issue Type: Improvement
>  Components: languageidentifier
>Reporter: Chris A. Mattmann
>Assignee: Ken Krugler
>Priority: Minor
>
> As originally reported by Jerome Charron in NUTCH-86, Jerome identified a set 
> of improvements for the LanguageIdentifier that we should consider in Tika:
> {quote}
> More information can be found in the following thread on the Nutch-Dev mailing 
> list:
> http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00569.html
> Summary:
> 1. LanguageIdentifier API changes. The similarity methods should return an 
> ordered array of language-code/score pairs instead of a simple String 
> containing the language-code.
> 2. Ensure consistency between LanguageIdentifier scoring and 
> NGramProfile.getSimilarity().
> {quote}
> I just wanted to capture the issue here in Tika, since I'm about to close it 
> out in Nutch since LanguageIdentification is something that can happen in 
> Tika-ville...
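
To illustrate the first proposed change (returning ordered language-code/score pairs rather than a single String), here is a small sketch. LanguageScore and LanguageRanking are hypothetical names, and it assumes that a higher score means a better match.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical value type: one language code with its similarity score. */
final class LanguageScore {
    final String language;
    final double score;

    LanguageScore(String language, double score) {
        this.language = language;
        this.score = score;
    }
}

/** Hypothetical helper: turns raw scores into a list ordered best match first. */
final class LanguageRanking {
    private LanguageRanking() {}

    static List<LanguageScore> rank(Map<String, Double> similarities) {
        List<LanguageScore> result = new ArrayList<>();
        for (Map.Entry<String, Double> e : similarities.entrySet()) {
            result.add(new LanguageScore(e.getKey(), e.getValue()));
        }
        // Assumption: higher score = more similar, so sort in descending order.
        result.sort(Comparator.comparingDouble((LanguageScore s) -> s.score).reversed());
        return result;
    }
}
```

Callers that only want the old behaviour can take the first element, while others can inspect the runner-up scores.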





[jira] [Commented] (TIKA-456) Support timeouts for parsers

2015-02-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342013#comment-14342013
 ] 

Tyler Palsulich commented on TIKA-456:
--

This could be useful for tika-batch, [~talli...@apache.org]!

> Support timeouts for parsers
> 
>
> Key: TIKA-456
> URL: https://issues.apache.org/jira/browse/TIKA-456
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Ken Krugler
>Assignee: Chris A. Mattmann
>
> There are a number of reasons why Tika could hang while parsing. One common 
> case is when a parser is fed an incomplete document, such as what happens 
> when limiting the amount of data fetched during a web crawl.
> One solution is to create a TikaCallable that wraps the Tika parser, and 
> then use this with a FutureTask. For example, when using a ParsedDatum POJO 
> for the results of the parse operation, I do something like this:
> parser = new AutoDetectParser();
> Callable<ParsedDatum> c = new TikaCallable(parser, contenthandler, 
> inputstream, metadata);
> FutureTask<ParsedDatum> task = new FutureTask<ParsedDatum>(c);
> Thread t = new Thread(task);
> t.start();
> ParsedDatum result = task.get(MAX_PARSE_DURATION, TimeUnit.SECONDS);
> And TikaCallable() looks like:
> class TikaCallable implements Callable<ParsedDatum> {
> public TikaCallable(Parser parser, ContentHandler handler, InputStream 
> is, Metadata metadata) {
> _parser = parser;
> _handler = handler;
> _input = is;
> _metadata = metadata;
> ...
> }
> public ParsedDatum call() throws Exception {
> 
> _parser.parse(_input, _handler, _metadata, new ParseContext());
> 
> }
> }
> This seems like it would be generally useful, as I doubt that we'd ever be 
> able to guarantee that none of the parsers being wrapped by Tika could ever 
> hang.
> One idea is to create a TimeoutParser that wraps a regular Tika Parser. E.g. 
> something like:
>   Parser p = new TimeoutParser(new AutoDetectParser(), 20, TimeUnit.SECONDS);
> Then the call to p.parse(...) would create a Callable (similar to the code 
> above) and use the specified timeout when calling task.get().
> One minus with this approach is that it creates a new thread for each parse 
> request, but I don't think the thread overhead is significant when compared 
> to the typical parser operation.
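
The wrapping idea above can be sketched without any Tika classes. TimeLimitedRunner below is a hypothetical stand-in for the proposed TimeoutParser: it runs an arbitrary Callable on its own thread and gives up after the configured timeout. A real TimeoutParser would put the call to parse(...) inside the Callable.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper, not part of Tika: runs a task with an upper time limit. */
final class TimeLimitedRunner {
    private final long timeout;
    private final TimeUnit unit;

    TimeLimitedRunner(long timeout, TimeUnit unit) {
        this.timeout = timeout;
        this.unit = unit;
    }

    /** Runs the task on a fresh daemon thread; throws TimeoutException if it takes too long. */
    <T> T run(Callable<T> task) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "time-limited-task");
            t.setDaemon(true); // a hung task must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Interrupts the worker; this only helps if the task reacts to interruption.
            future.cancel(true);
            throw e;
        } catch (ExecutionException e) {
            // Unwrap so callers see the task's own exception where possible.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

As with the FutureTask version, a parser stuck in a tight loop cannot be stopped from inside the same JVM. The timeout only frees the caller, which is one reason forked-process approaches are discussed elsewhere on this list.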





[jira] [Resolved] (TIKA-450) Document our issue tracking workflows

2015-02-28 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-450.
--
Resolution: Fixed

The [contributors page|http://tika.apache.org/contribute.html] covers this 
pretty well.

> Document our issue tracking workflows
> -
>
> Key: TIKA-450
> URL: https://issues.apache.org/jira/browse/TIKA-450
> Project: Tika
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Jukka Zitting
>Priority: Minor
>
> As noted by Nick on dev@, we don't currently document our development and 
> issue tracking practices very well. This makes it more difficult for new 
> contributors and committers to jump in. It would be good to have a page on 
> our web site about this.





[jira] [Commented] (TIKA-381) HtmlParser should strip linefeeds out of links

2015-02-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342009#comment-14342009
 ] 

Tyler Palsulich commented on TIKA-381:
--

This is still an issue in 1.8-SNAPSHOT.

{code}<a href="http://goog
le.com">link</a>{code}

turns into

{{<a href="http://google.com">link</a>}}

> HtmlParser should strip linefeeds out of links
> --
>
> Key: TIKA-381
> URL: https://issues.apache.org/jira/browse/TIKA-381
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.8
>Reporter: Ken Krugler
>Assignee: Ken Krugler
>
> A number of HTML pages contain links where the URL has a linefeed in the 
> middle of it.
> Browsers such as Firefox will automatically remove the character but Tika 
> passes it back, which results in a broken URL.
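
A minimal sketch of the clean-up described above: drop carriage returns and line feeds from an href value before handing it on. LinkCleaner is a hypothetical helper, not the HtmlParser implementation.

```java
/** Hypothetical helper: removes line breaks that ended up inside a link URL. */
final class LinkCleaner {
    private LinkCleaner() {}

    /** Returns the href without CR or LF characters; null stays null. */
    static String stripLineBreaks(String href) {
        if (href == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(href.length());
        for (int i = 0; i < href.length(); i++) {
            char c = href.charAt(i);
            if (c != '\n' && c != '\r') {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Only CR and LF are removed here. Whether tabs or surrounding whitespace should go as well is left open.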





[jira] [Updated] (TIKA-381) HtmlParser should strip linefeeds out of links

2015-02-28 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-381:
-
Affects Version/s: (was: 0.6)
   1.8

> HtmlParser should strip linefeeds out of links
> --
>
> Key: TIKA-381
> URL: https://issues.apache.org/jira/browse/TIKA-381
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.8
>Reporter: Ken Krugler
>Assignee: Ken Krugler
>
> A number of HTML pages contain links where the URL has a linefeed in the 
> middle of it.
> Browsers such as Firefox will automatically remove the character but Tika 
> passes it back, which results in a broken URL.





[jira] [Commented] (TIKA-289) Add magic byte patterns from file(1)

2015-02-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342005#comment-14342005
 ] 

Tyler Palsulich commented on TIKA-289:
--

Sounds great!

> Add magic byte patterns from file(1)
> 
>
> Key: TIKA-289
> URL: https://issues.apache.org/jira/browse/TIKA-289
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Reporter: Jukka Zitting
>Priority: Minor
>
> As discussed in TIKA-285, the file(1) command comes with a pretty 
> comprehensive set of magic byte patterns. It would be nice to get those 
> patterns included also in Tika.





[jira] [Reopened] (TIKA-289) Add magic byte patterns from file(1)

2015-02-28 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch reopened TIKA-289:
-

I've just checked, and there are actually a handful of mime types defined in 
the file magic which we don't have, doh! Let's add those few in, then close.

Also, I wonder if we should add a reference to the file magic (probably the 
[github version|https://github.com/file/file/tree/master/magic/Magdir]) to the 
detection or similar docs, to point new people to it as a possible source of 
suitably licensed magics if they can't otherwise find some?

> Add magic byte patterns from file(1)
> 
>
> Key: TIKA-289
> URL: https://issues.apache.org/jira/browse/TIKA-289
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Reporter: Jukka Zitting
>Priority: Minor
>
> As discussed in TIKA-285, the file(1) command comes with a pretty 
> comprehensive set of magic byte patterns. It would be nice to get those 
> patterns included also in Tika.





[jira] [Closed] (TIKA-289) Add magic byte patterns from file(1)

2015-02-28 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-289.

Resolution: Won't Fix

I agree, [~gagravarr]. Let's consider {{file}} as a reference when we need help 
with how to support a new type, rather than as a source to bulk import from.

> Add magic byte patterns from file(1)
> 
>
> Key: TIKA-289
> URL: https://issues.apache.org/jira/browse/TIKA-289
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Reporter: Jukka Zitting
>Priority: Minor
>
> As discussed in TIKA-285, the file(1) command comes with a pretty 
> comprehensive set of magic byte patterns. It would be nice to get those 
> patterns included also in Tika.





[jira] [Commented] (TIKA-375) Improve code quality metrics

2015-02-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341999#comment-14341999
 ] 

Tyler Palsulich commented on TIKA-375:
--

This is a great candidate for any new contributors who don't know where to 
start!

> Improve code quality metrics
> 
>
> Key: TIKA-375
> URL: https://issues.apache.org/jira/browse/TIKA-375
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Reporter: Jukka Zitting
> Attachments: TIKA-375.patch
>
>
> The Sonar report at http://sonar.zitting.name/project/index/3338 highlights a 
> number of code quality issues that we could fix with fairly little effort and 
> risk.





[jira] [Updated] (TIKA-375) Improve code quality metrics

2015-02-28 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-375:
-
Labels: newbie  (was: )

> Improve code quality metrics
> 
>
> Key: TIKA-375
> URL: https://issues.apache.org/jira/browse/TIKA-375
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Reporter: Jukka Zitting
>  Labels: newbie
> Attachments: TIKA-375.patch
>
>
> The Sonar report at http://sonar.zitting.name/project/index/3338 highlights a 
> number of code quality issues that we could fix with fairly little effort and 
> risk.





[jira] [Closed] (TIKA-354) ProfilingHandler should take a length-limiting parameter

2015-02-28 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-354.

Resolution: Not a Problem

Closing this off, unless you're still interested in getting it in, [~kkrugler]. 
We recently had a good improvement to detection speed (TIKA-1549).

> ProfilingHandler should take a length-limiting parameter
> 
>
> Key: TIKA-354
> URL: https://issues.apache.org/jira/browse/TIKA-354
> Project: Tika
>  Issue Type: Improvement
>  Components: languageidentifier
>Affects Versions: 0.5
>Reporter: Vivek Magotra
>Assignee: Ken Krugler
> Attachments: TIKA-354-2.patch, TIKA-354.patch
>
>
> ProfilingHandler currently parses the entire document (thereby analyzing 
> n-grams for the entire doc).
> ProfilingHandler should take a length-limiting parameter that allows a user 
> to specify the amount of data that should get analyzed.
> In fact, by default that limit should be set to something like 8K.
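
The length limit requested above might look roughly like this: a counter that accepts SAX-style character chunks but stops building n-grams once a character budget is used up. LimitedTrigramCounter is a hypothetical class for illustration and does not reproduce ProfilingHandler.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: counts 3-grams, but only over the first maxChars characters. */
final class LimitedTrigramCounter {
    private final int maxChars;
    private final Map<String, Integer> counts = new HashMap<>();
    private final StringBuilder window = new StringBuilder(3);
    private int consumed;

    LimitedTrigramCounter(int maxChars) {
        this.maxChars = maxChars;
    }

    /** Same shape as ContentHandler.characters(); input beyond the limit is ignored. */
    void characters(char[] ch, int start, int length) {
        int usable = Math.min(length, maxChars - consumed);
        for (int i = start; i < start + usable; i++) {
            if (window.length() == 3) {
                window.deleteCharAt(0); // slide the 3-character window forward
            }
            window.append(Character.toLowerCase(ch[i]));
            if (window.length() == 3) {
                counts.merge(window.toString(), 1, Integer::sum);
            }
        }
        consumed += Math.max(usable, 0);
    }

    /** True once the character budget is exhausted; callers may stop parsing then. */
    boolean isFull() {
        return consumed >= maxChars;
    }

    Map<String, Integer> counts() {
        return counts;
    }
}
```

With the suggested default, the constructor argument would be something like 8 * 1024.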





[jira] [Commented] (TIKA-369) Improve accuracy of language detection

2015-02-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341994#comment-14341994
 ] 

Tyler Palsulich commented on TIKA-369:
--

Is there any update on this? Language detection was recently significantly sped 
up in TIKA-1549.

> Improve accuracy of language detection
> --
>
> Key: TIKA-369
> URL: https://issues.apache.org/jira/browse/TIKA-369
> Project: Tika
>  Issue Type: Improvement
>  Components: languageidentifier
>Affects Versions: 0.6
>Reporter: Ken Krugler
>Assignee: Ken Krugler
> Attachments: Surprise and Coincidence.pdf, lingdet-mccs.pdf, 
> textcat.pdf
>
>
> Currently the LanguageProfile code uses 3-grams to find the best language 
> profile using Pearson's chi-square test. This has three issues:
> 1. The results aren't very good for short runs of text. Ted Dunning's paper 
> (attached) indicates that a log-likelihood ratio (LLR) test works much 
> better, which would then make language detection faster due to less text 
> needing to be processed.
> 2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact 
> value as a threshold for certainty. This is very sensitive to the amount of 
> text being processed, and thus gives false negative results for short runs of 
> text.
> 3. Certainty should also be based on how much better the result is for 
> language X, compared to the next best language. If two languages both had 
> identical sum-of-squares values, and this value was below the threshold, then 
> the result is still not very certain.
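
Point 3 above (certainty should also depend on the gap to the runner-up) can be sketched as follows. CertaintyCheck, maxDistance and minMargin are hypothetical names; it assumes each language has a distance where lower is better, as with a sum-of-squares measure.

```java
import java.util.Map;

/** Hypothetical sketch: certainty needs a good best score and a clear lead over second place. */
final class CertaintyCheck {
    private CertaintyCheck() {}

    /**
     * @param distances   language code to distance (lower is better)
     * @param maxDistance absolute threshold the best distance must not exceed
     * @param minMargin   minimum gap required between best and second-best distance
     */
    static boolean isReasonablyCertain(Map<String, Double> distances,
                                       double maxDistance, double minMargin) {
        double best = Double.POSITIVE_INFINITY;
        double second = Double.POSITIVE_INFINITY;
        for (double d : distances.values()) {
            if (d < best) {
                second = best;
                best = d;
            } else if (d < second) {
                second = d;
            }
        }
        return best <= maxDistance && (second - best) >= minMargin;
    }
}
```

Two languages with nearly identical distances are then reported as uncertain even when both fall below the absolute threshold.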





[jira] [Commented] (TIKA-289) Add magic byte patterns from file(1)

2015-02-28 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341993#comment-14341993
 ] 

Nick Burch commented on TIKA-289:
-

There are a few issues with integrating it:
 * Very few of the entries in the file magic list have mimetypes, only 
descriptions, so we'd need to manually review each one and search for a 
mimetype. (I see only 287 different mimetypes, as compared to the vast number 
of magic entries)
 * Many of the file magic entries include a little bit of parser logic too, 
with various bits of the matching being included in the description string, 
sometimes lots
 * Some of the matching is actually done with code (much like our container 
aware detectors), not the mime magic, see the {{src}} directory for those

The file magic and source code are a very good source of magic patterns, and 
sometimes also basic parser logic, but I'm not sure how practical a bulk import 
would be.

> Add magic byte patterns from file(1)
> 
>
> Key: TIKA-289
> URL: https://issues.apache.org/jira/browse/TIKA-289
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Reporter: Jukka Zitting
>Priority: Minor
>
> As discussed in TIKA-285, the file(1) command comes with a pretty 
> comprehensive set of magic byte patterns. It would be nice to get those 
> patterns included also in Tika.





[jira] [Closed] (TIKA-307) Better handling of partial/truncated input data to parsers

2015-02-28 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-307.

Resolution: Fixed

The Zip parser and other parsers are much more robust at this point. Can reopen 
if this is still an issue.

> Better handling of partial/truncated input data to parsers
> --
>
> Key: TIKA-307
> URL: https://issues.apache.org/jira/browse/TIKA-307
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 0.4
>Reporter: Ken Krugler
>
> Some parsers (e.g. ZipParser) can hang if they prematurely reach the end of 
> the input stream.
> As a way of avoiding this issue, Jukka had suggested the following approach 
> on the list:
> The input stream could be wrapped into a decorator that throws a tagged 
> IOException when the given size limit has been reached. This assumes that all 
> parsers will correctly propagate up an IOException (at the very least).
> Smarter parsers could cleanly close the emitted XHTML stream, potentially 
> adding a metadata entry that signifies that the extracted text has been 
> truncated.
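
The decorator suggested above could be sketched with plain java.io. SizeLimitedInputStream and SizeLimitExceededException are hypothetical names; the dedicated exception type plays the role of the "tagged" IOException so callers can tell truncation apart from other I/O failures.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical marker exception: the input was longer than the configured limit. */
final class SizeLimitExceededException extends IOException {
    SizeLimitExceededException(long limit) {
        super("Input exceeded the limit of " + limit + " bytes");
    }
}

/** Hypothetical decorator: passes through up to 'limit' bytes, then throws. */
final class SizeLimitedInputStream extends FilterInputStream {
    private final long limit;
    private long count;

    SizeLimitedInputStream(InputStream in, long limit) {
        super(in);
        this.limit = limit;
    }

    /** At the limit: a clean end of stream is fine, any further byte is an error. */
    private int atLimit() throws IOException {
        if (in.read() == -1) {
            return -1;
        }
        throw new SizeLimitExceededException(limit);
    }

    @Override
    public int read() throws IOException {
        if (count >= limit) {
            return atLimit();
        }
        int b = super.read();
        if (b != -1) {
            count++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (count >= limit) {
            return atLimit();
        }
        int allowed = (int) Math.min(len, limit - count);
        int n = super.read(buf, off, allowed);
        if (n > 0) {
            count += n;
        }
        return n;
    }
    // Note: skip() and mark()/reset() are not adjusted in this sketch.
}
```

This only helps if parsers propagate the IOException, as the description notes. A parser that catches and swallows it would still behave as before.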





[jira] [Closed] (TIKA-89) Rename MimeType and MimeTypes

2015-02-28 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-89.
---
Resolution: Fixed

> Rename MimeType and MimeTypes
> -
>
> Key: TIKA-89
> URL: https://issues.apache.org/jira/browse/TIKA-89
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Reporter: Jukka Zitting
>Assignee: Jukka Zitting
>Priority: Minor
>
> I'd like to rename the MimeType and MimeTypes classes respectively to 
> MediaType and MediaTypeRegistry. The rationale for this change is:
> a) The standard term for a MIME type is media type.
> b) MimeTypes is not just a collection of MimeType objects, ...Registry is 
> more appropriate.





[jira] [Updated] (TIKA-90) Allow thumbnails as document metadata

2015-02-28 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-90:

Priority: Minor  (was: Major)

> Allow thumbnails as document metadata
> -
>
> Key: TIKA-90
> URL: https://issues.apache.org/jira/browse/TIKA-90
> Project: Tika
>  Issue Type: New Feature
>  Components: general
>Reporter: Jukka Zitting
>Priority: Minor
>
> It would be nice if parser components could produce thumbnail images and 
> other non-string metadata when parsing documents.
> To do this, we could either generalize the current Metadata methods, or 
> introduce new methods for handling such non-string metadata.





[jira] [Commented] (TIKA-89) Rename MimeType and MimeTypes

2015-02-28 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341984#comment-14341984
 ] 

Nick Burch commented on TIKA-89:


I think this might have already been done? 

The {{MimeTypes}} class provides detection plus a method 
{{getMediaTypeRegistry()}} which returns a {{MediaTypeRegistry}}, and we have 
classes {{MimeType}} and {{MediaType}}, which are different things.

> Rename MimeType and MimeTypes
> -
>
> Key: TIKA-89
> URL: https://issues.apache.org/jira/browse/TIKA-89
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Reporter: Jukka Zitting
>Assignee: Jukka Zitting
>Priority: Minor
>
> I'd like to rename the MimeType and MimeTypes classes respectively to 
> MediaType and MediaTypeRegistry. The rationale for this change is:
> a) The standard term for a MIME type is media type.
> b) MimeTypes is not just a collection of MimeType objects, ...Registry is 
> more appropriate.





[jira] [Commented] (TIKA-291) Adobe InDesign support

2015-02-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341982#comment-14341982
 ] 

Tyler Palsulich commented on TIKA-291:
--

We still don't have support for this, but it seems like a worthy type to 
support.

> Adobe InDesign support
> --
>
> Key: TIKA-291
> URL: https://issues.apache.org/jira/browse/TIKA-291
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Jukka Zitting
>Priority: Minor
> Attachments: simple_test-1.indd
>
>
> It would be great if Tika could extract content from Adobe InDesign documents.





[jira] [Closed] (TIKA-288) Support override parsers in AutoDetectParser

2015-02-28 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-288.

Resolution: Duplicate

> Support override parsers in AutoDetectParser
> 
>
> Key: TIKA-288
> URL: https://issues.apache.org/jira/browse/TIKA-288
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 0.4
>Reporter: Ken Krugler
>Priority: Minor
>
> In some situations, being able to specify an alternative parser is useful 
> even when the general parser framework/full set of parsers is desired.
> For example, when processing HTML documents the current HtmlParser doesn't 
> pass through all of the tags that a vertical crawler might want.
> I'm proposing an alternative constructor, something like:
> public AutoDetectParser(Map)
> where class would be the class of the standard Tika parser, and Parser is the 
> override.
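
To make the proposal above concrete, here is a rough sketch of how such an override map might behave. It does not use Tika's real classes: SketchParser, the three parser classes, OverridableParserSet and OverrideDemo are all hypothetical stand-ins. The map's type parameters were lost from the signature quoted above, so the key/value types used here (standard parser class mapped to its replacement) are an assumption based on the surrounding sentence.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a parser interface, reduced to a name for illustration. */
interface SketchParser {
    String name();
}

class DefaultHtmlParser implements SketchParser {
    public String name() { return "default-html"; }
}

class DefaultPdfParser implements SketchParser {
    public String name() { return "default-pdf"; }
}

/** Hypothetical replacement that a vertical crawler might supply. */
class CrawlerHtmlParser implements SketchParser {
    public String name() { return "crawler-html"; }
}

/** Keeps the full standard parser set, swapping in overrides keyed by standard parser class. */
class OverridableParserSet {
    private final Map<String, SketchParser> byType = new LinkedHashMap<>();

    OverridableParserSet(Map<Class<? extends SketchParser>, SketchParser> overrides) {
        register("text/html", new DefaultHtmlParser(), overrides);
        register("application/pdf", new DefaultPdfParser(), overrides);
    }

    private void register(String mediaType, SketchParser standard,
                          Map<Class<? extends SketchParser>, SketchParser> overrides) {
        SketchParser override = overrides.get(standard.getClass());
        byType.put(mediaType, override != null ? override : standard);
    }

    SketchParser parserFor(String mediaType) {
        return byType.get(mediaType);
    }
}

class OverrideDemo {
    /** Returns the name of the parser chosen for a media type, with or without the HTML override. */
    static String parserName(String mediaType, boolean overrideHtml) {
        Map<Class<? extends SketchParser>, SketchParser> overrides = new HashMap<>();
        if (overrideHtml) {
            overrides.put(DefaultHtmlParser.class, new CrawlerHtmlParser());
        }
        return new OverridableParserSet(overrides).parserFor(mediaType).name();
    }
}
```

In this sketch only the HTML entry changes when an override is supplied; every other media type keeps its standard parser, which is the behaviour the description asks for.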





[jira] [Commented] (TIKA-289) Add magic byte patterns from file(1)

2015-02-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341980#comment-14341980
 ] 

Tyler Palsulich commented on TIKA-289:
--

Does anyone know if Tika integrated the magic from the file command?

> Add magic byte patterns from file(1)
> 
>
> Key: TIKA-289
> URL: https://issues.apache.org/jira/browse/TIKA-289
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Reporter: Jukka Zitting
>Priority: Minor
>
> As discussed in TIKA-285, the file(1) command comes with a pretty 
> comprehensive set of magic byte patterns. It would be nice to get those 
> patterns included also in Tika.





[jira] [Closed] (TIKA-272) Expose characters offsets information while parsing text-based inputs.

2015-02-28 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-272.

Resolution: Won't Fix

> Expose characters offsets information while parsing text-based inputs.
> --
>
> Key: TIKA-272
> URL: https://issues.apache.org/jira/browse/TIKA-272
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 0.4
>Reporter: David Causse
>Priority: Minor
>
> It would be useful to have access to actual character offset information when 
> parsing text-based files (I don't know whether this is interesting, usable, or 
> doable for binary formats).
> If I use Tika to parse HTML and feed the parsed strings into Lucene, I cannot 
> tell the Lucene analyzer where each character sits in the original input.
> If Tika exposed this information, unmodified Lucene analyzers could be used 
> behind Tika to implement, for example, highlighting in search results 
> (see the Google cache view).
> With the new Lucene Attribute API it could be fairly easy to provide a sort of 
> TikaOffsetRectifierTokenFilter in Lucene contrib and use a stack like Tika -> 
> unmodified Lucene analyzer -> Tika offset correction.





[jira] [Commented] (TIKA-94) Speech recognition

2015-02-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-94?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341978#comment-14341978
 ] 

Tyler Palsulich commented on TIKA-94:
-

This is similar to machine text translation in that it requires building a 
large external model. This would be a really cool feature, but I don't think 
it is worth making a direct part of Tika. Will close later this week if there 
is no objection.

> Speech recognition
> --
>
> Key: TIKA-94
> URL: https://issues.apache.org/jira/browse/TIKA-94
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Jukka Zitting
>Priority: Minor
>
> Like OCR for image files (TIKA-93), we could try using speech recognition to 
> extract text content (where available) from audio (and video!) files.
> The CMU Sphinx engine (http://cmusphinx.sourceforge.net/) looks promising and 
> comes with a friendly license.





[jira] [Commented] (TIKA-89) Rename MimeType and MimeTypes

2015-02-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341976#comment-14341976
 ] 

Tyler Palsulich commented on TIKA-89:
-

Is there still interest in renaming these?

> Rename MimeType and MimeTypes
> -
>
> Key: TIKA-89
> URL: https://issues.apache.org/jira/browse/TIKA-89
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Reporter: Jukka Zitting
>Assignee: Jukka Zitting
>Priority: Minor
>
> I'd like to rename the MimeType and MimeTypes classes respectively to 
> MediaType and MediaTypeRegistry. The rationale for this change is:
> a) The standard term for a MIME type is media type.
> b) MimeTypes is not just a collection of MimeType objects, ...Registry is 
> more appropriate.





[jira] [Closed] (TIKA-100) Structured PDF parsing

2015-02-28 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-100.

Resolution: Fixed

> Structured PDF parsing
> --
>
> Key: TIKA-100
> URL: https://issues.apache.org/jira/browse/TIKA-100
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Jukka Zitting
>Priority: Minor
>
> The PDF parser currently extracts and outputs document content as a single 
> string. PDFBox could be used to support structuring at least down to page and 
> paragraph (not sure how accurate) level.





FeedBack required for Geographic Parser

2015-02-28 Thread Gautham Shankar
Hi guys,

I am currently building a Geographic Parser for .iso19139 files under the
guidance of Prof. Chris Mattmann, since Apache Tika currently lacks the
support required to parse these files. I am working on the issue below:

https://issues.apache.org/jira/browse/TIKA-1479.


My progress is recorded at the link below.

https://wiki.apache.org/tika/TikaGeographicInformationParser

I would like you to comment on the key names that I have come up with for the
custom metadata; they could certainly be shortened.

I look forward to your invaluable feedback.

Regards
Gautham


[jira] [Commented] (TIKA-1479) Build a parser to extract data from .iso19139 format

2015-02-28 Thread Gautham Gowrishankar (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341942#comment-14341942
 ] 

Gautham Gowrishankar commented on TIKA-1479:


Hi guys,

I am continuing the work left by Prasanth and am currently implementing the 
.iso19139 parser. This directed research is being done under the guidance of 
Prof. Chris Mattmann. Progress so far:

1. I am able to extract metadata from one of the .iso19139 files crawled by 
Prasanth using the Apache SIS library.
I will update my work periodically on the page below.

https://wiki.apache.org/tika/TikaGeographicInformationParser


> Build a parser to extract data from .iso19139 format
> 
>
> Key: TIKA-1479
> URL: https://issues.apache.org/jira/browse/TIKA-1479
> Project: Tika
>  Issue Type: New Feature
>  Components: metadata, mime, parser
>Affects Versions: 1.6
>Reporter: Prasanth Iyer
>
> An initial crawl of the Acadis website (https://www.aoncadis.org/home.htm) 
> revealed that a number of the files on this website are of the .iso19139 
> type. Currently, Tika categorizes these files as text/plain since it does not 
> have a parser for this type of file. The need is to provide metadata support 
> and to build a parser for this kind of file.





Re: Curating Issues

2015-02-28 Thread Mattmann, Chris A (3980)
Hey Tyler if you want to take a whack, here are some criteria
I tend to use:

1. Bug report from 1+ years old.
  - Close it - either not reproducible, fixed in a later version
and never revisited, or no longer a serious bug since it’s
not a blocker.

2. Feature request from 1+ years old that no one has acted upon.
  - Good candidate for closing - if it was important someone would
have acted upon it.

3. Issue from 1+ years old with lots of discussion on it
  - Poke the issue - see if a consensus can be reached, if not
move forward and close.

4. Issue that is your own that you aren’t interested in anymore
that is 1+ years old
  - Close it - you didn’t work on it then, may not get back to it,
and no one else has.

5. Issue that is 2+ years old
  - Close, regardless, unless it has a patch

6. Issue that is 1+ years old, with patch, uncommitted
  - Try to apply the patch, or make a minimal effort to bring it current
with trunk and apply
  - if too much work ask for help
  - if 1+ weeks and no one replies, close it and move forward

There are more but that’s a start. I’ll check out this article,
thanks for sending it.

Cheers,
Chris


++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Tyler Palsulich 
Reply-To: "dev@tika.apache.org" 
Date: Saturday, February 28, 2015 at 8:53 PM
To: "dev@tika.apache.org" 
Subject: Curating Issues

>Hi Folks,
>
>I just read an article [0] about managing a large project's issues list.
>Tika currently has 331 open issues. Do we know if all of these have been
>"triaged"? At what point do we want to label an issue as stale and close
>it off? What is our preferred split between when to make an issue and when to
>send a message to the mailing list?
>
>Have a good weekend,
>Tyler
>
>[0] http://words.steveklabnik.com/how-to-be-an-open-source-gardener?r=1



Curating Issues

2015-02-28 Thread Tyler Palsulich
Hi Folks,

I just read an article [0] about managing a large project's issues list.
Tika currently has 331 open issues. Do we know if all of these have been
"triaged"? At what point do we want to label an issue as stale and close it
off? What is our preferred split between when to make an issue and when to
send a message to the mailing list?

Have a good weekend,
Tyler

[0] http://words.steveklabnik.com/how-to-be-an-open-source-gardener?r=1


[jira] [Commented] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification

2015-02-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341747#comment-14341747
 ] 

Hudson commented on TIKA-1561:
--

ABORTED: Integrated in tika-trunk-jdk1.7 #515 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/515/])
Fix for TIKA-1561 GCMD Directory Interchange Format (.dif) identification 
contributed by LukeLiush . This closes #32. (mattmann: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1662970)
* /tika/trunk/CHANGES.txt
* 
/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* /tika/trunk/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java
* 
/tika/trunk/tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java
* /tika/trunk/tika-core/src/test/resources/org/apache/tika/mime/brwNIMS_2014.dif
* 
/tika/trunk/tika-parsers/src/test/resources/test-documents/active_layer_arcss_grid_barrow_alaska_2012.dif
* 
/tika/trunk/tika-parsers/src/test/resources/test-documents/carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif


> GCMD Directory Interchange Format (.dif) identification
> ---
>
> Key: TIKA-1561
> URL: https://issues.apache.org/jira/browse/TIKA-1561
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.7
>Reporter: Luke sh
>Assignee: Chris A. Mattmann
>Priority: Trivial
> Fix For: 1.8
>
> Attachments: 
> carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif
>
>
> Cited from http://gcmd.nasa.gov/add/difguide/WRITEADIF.pdf:
> "The Directory Interchange Format (DIF) is metadata format used to create 
> directory entries that describe scientific data
> sets. A DIF holds a collection of fields, which detail specific information 
> about the data."
> A .dif file is well-formed XML that describes a scientific data set; the 
> schema (XSD) it follows is referenced inside the .dif file itself,
> e.g. http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/dif_v9.8.4.xsd
> The reason for opening this ticket is that a Tika parser for .dif files is 
> under consideration, and it needs support for identifying this type of XML 
> file.
> Although a .dif file is proper XML that can be parsed by an XML parser, 
> some of its fields might still need specific processing before they are 
> extracted and injected into Solr for analysis.
> It is therefore proposed that the type 'text/dif+xml', which extends 
> application/xml, be added to tika-mimetypes.xml to support detection of 
> this specific XML type, so that special processing can be applied to 
> these files.
> 
>
> namespaceURI="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"/>
>
>
> 
> Expected MIME type: text/dif+xml
> The following is the link to the dif format guide
> http://gcmd.nasa.gov/add/difguide/
> example dif files:
> 1) 
> https://www.aoncadis.org/dataset/id/005f3222-7548-11e2-851e-00c0f03d5b7c.dif
> 2) 
> https://www.aoncadis.org/dataset/id/0091cf0c-7ad3-11e2-851e-00c0f03d5b7c.dif
> 3) 
> https://www.aoncadis.org/dataset/id/02a6301c-3ab3-11e4-8ee7-00c0f03d5b7c.dif
> an example dif file has also been attached.





[jira] [Commented] (TIKA-1509) Create configurable strategies for composite parsers

2015-02-28 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341683#comment-14341683
 ] 

Chris A. Mattmann commented on TIKA-1509:
-

Fantastic, Tyler, great summary.

> Create configurable strategies for composite parsers
> 
>
> Key: TIKA-1509
> URL: https://issues.apache.org/jira/browse/TIKA-1509
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>
> Several parsers can handle the same mime type, and we are currently ordering 
> which parser is chosen (roughly) by the alphabetic order of the parser class 
> name.
> Let's allow users to configure strategies for picking parsers.
> See and contribute to full discussion here: 
> http://wiki.apache.org/tika/CompositeParserDiscussion
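
As a rough illustration of what a configurable strategy could look like next to the current alphabetical behaviour described above, here is a small self-contained sketch. It is not Tika's API and not a design taken from the wiki discussion: ParserSelection, its methods, and the org.example class names are hypothetical, and parsers are represented only by their class names.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ParserSelection {
    /** Approximates the behaviour described in the issue: alphabetical order of class name. */
    static final Comparator<String> ALPHABETICAL = Comparator.naturalOrder();

    /**
     * Hypothetical user-configured strategy: names listed earlier in `preferred` win;
     * names that are not listed fall back to alphabetical order.
     */
    static Comparator<String> preferring(List<String> preferred) {
        return Comparator
                .comparingInt((String name) -> {
                    int i = preferred.indexOf(name);
                    return i < 0 ? Integer.MAX_VALUE : i;
                })
                .thenComparing(Comparator.<String>naturalOrder());
    }

    /** Picks the best candidate parser class name for one mime type under a strategy. */
    static String choose(List<String> candidates, Comparator<String> strategy) {
        return candidates.stream()
                .min(strategy)
                .orElseThrow(() -> new IllegalArgumentException("no candidate parsers"));
    }

    public static void main(String[] args) {
        List<String> candidates =
                Arrays.asList("org.example.ZetaXmlParser", "org.example.AlphaXmlParser");
        System.out.println(choose(candidates, ALPHABETICAL));
        System.out.println(choose(candidates,
                preferring(Arrays.asList("org.example.ZetaXmlParser"))));
    }
}
```

Modelling the strategy as a Comparator keeps the alphabetical default and any user preference interchangeable; a real implementation would presumably read the preference list from configuration.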





[jira] [Comment Edited] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification

2015-02-28 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341674#comment-14341674
 ] 

Chris A. Mattmann edited comment on TIKA-1561 at 2/28/15 5:31 PM:
--

Merged Pull Request #32 and applied this patch to trunk in r1662970:

Thank you [~Lukeliush]!

{noformat}
[mattmann-0420740:~/tmp/tika] mattmann% svn commit -m "Fix for TIKA-1561 GCMD 
Directory Interchange Format (.dif) identification contributed by LukeLiush 
. This closes #32."
Sending        CHANGES.txt
Sending        tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
Sending        tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java
Sending        tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java
Adding         tika-core/src/test/resources/org/apache/tika/mime/brwNIMS_2014.dif
Adding         tika-parsers/src/test/resources/test-documents/active_layer_arcss_grid_barrow_alaska_2012.dif
Adding         tika-parsers/src/test/resources/test-documents/carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif
Transmitting file data ...
Committed revision 1662970.
[mattmann-0420740:~/tmp/tika] mattmann% 
{noformat}


was (Author: chrismattmann):
Merged Pull Request #32 and applied this patch to trunk in r1662970:

Thank you [~luke_lu]!

{noformat}
[mattmann-0420740:~/tmp/tika] mattmann% svn commit -m "Fix for TIKA-1561 GCMD 
Directory Interchange Format (.dif) identification contributed by LukeLiush 
. This closes #32."
Sending        CHANGES.txt
Sending        tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
Sending        tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java
Sending        tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java
Adding         tika-core/src/test/resources/org/apache/tika/mime/brwNIMS_2014.dif
Adding         tika-parsers/src/test/resources/test-documents/active_layer_arcss_grid_barrow_alaska_2012.dif
Adding         tika-parsers/src/test/resources/test-documents/carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif
Transmitting file data ...
Committed revision 1662970.
[mattmann-0420740:~/tmp/tika] mattmann% 
{noformat}

> GCMD Directory Interchange Format (.dif) identification
> ---
>
> Key: TIKA-1561
> URL: https://issues.apache.org/jira/browse/TIKA-1561
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.7
>Reporter: Luke sh
>Assignee: Chris A. Mattmann
>Priority: Trivial
> Fix For: 1.8
>
> Attachments: 
> carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif
>
>
> Cited from http://gcmd.nasa.gov/add/difguide/WRITEADIF.pdf:
> "The Directory Interchange Format (DIF) is metadata format used to create 
> directory entries that describe scientific data
> sets. A DIF holds a collection of fields, which detail specific information 
> about the data."
> A .dif file is well-formed XML that describes a scientific data set; the 
> schema (XSD) it follows is referenced inside the .dif file itself,
> e.g. http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/dif_v9.8.4.xsd
> The reason for opening this ticket is that a Tika parser for .dif files is 
> under consideration, and it needs support for identifying this type of XML 
> file.
> Although a .dif file is proper XML that can be parsed by an XML parser, 
> some of its fields might still need specific processing before they are 
> extracted and injected into Solr for analysis.
> It is therefore proposed that the type 'text/dif+xml', which extends 
> application/xml, be added to tika-mimetypes.xml to support detection of 
> this specific XML type, so that special processing can be applied to 
> these files.
> 
>
> namespaceURI="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"/>
>
>
> 
> Expected MIME type: text/dif+xml
> The following is the link to the dif format guide
> http://gcmd.nasa.gov/add/difguide/
> example dif files:
> 1) 
> https://www.aoncadis.org/dataset/id/005f3222-7548-11e2-851e-00c0f03d5b7c.dif
> 2) 
> https://www.aoncadis.org/dataset/id/0091cf0c-7ad3-11e2-851e-00c0f03d5b7c.dif
> 3) 
> https://www.aoncadis.org/dataset/id/02a6301c-3ab3-11e4-8ee7-00c0f03d5b7c.dif
> an example dif file has also been attached.





[jira] [Resolved] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification

2015-02-28 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-1561.
-
Resolution: Fixed

Merged Pull Request #32 and applied this patch to trunk in r1662970:

Thank you [~luke_lu]!

{noformat}
[mattmann-0420740:~/tmp/tika] mattmann% svn commit -m "Fix for TIKA-1561 GCMD 
Directory Interchange Format (.dif) identification contributed by LukeLiush 
. This closes #32."
Sending        CHANGES.txt
Sending        tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
Sending        tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java
Sending        tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java
Adding         tika-core/src/test/resources/org/apache/tika/mime/brwNIMS_2014.dif
Adding         tika-parsers/src/test/resources/test-documents/active_layer_arcss_grid_barrow_alaska_2012.dif
Adding         tika-parsers/src/test/resources/test-documents/carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif
Transmitting file data ...
Committed revision 1662970.
[mattmann-0420740:~/tmp/tika] mattmann% 
{noformat}

> GCMD Directory Interchange Format (.dif) identification
> ---
>
> Key: TIKA-1561
> URL: https://issues.apache.org/jira/browse/TIKA-1561
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.7
>Reporter: Luke sh
>Assignee: Chris A. Mattmann
>Priority: Trivial
> Fix For: 1.8
>
> Attachments: 
> carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif
>
>
> Cited from http://gcmd.nasa.gov/add/difguide/WRITEADIF.pdf:
> "The Directory Interchange Format (DIF) is metadata format used to create 
> directory entries that describe scientific data
> sets. A DIF holds a collection of fields, which detail specific information 
> about the data."
> A .dif file is well-formed XML that describes a scientific data set; the 
> schema (XSD) it follows is referenced inside the .dif file itself,
> e.g. http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/dif_v9.8.4.xsd
> The reason for opening this ticket is that a Tika parser for .dif files is 
> under consideration, and it needs support for identifying this type of XML 
> file.
> Although a .dif file is proper XML that can be parsed by an XML parser, 
> some of its fields might still need specific processing before they are 
> extracted and injected into Solr for analysis.
> It is therefore proposed that the type 'text/dif+xml', which extends 
> application/xml, be added to tika-mimetypes.xml to support detection of 
> this specific XML type, so that special processing can be applied to 
> these files.
> 
>
> namespaceURI="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"/>
>
>
> 
> Expected MIME type: text/dif+xml
> The following is the link to the dif format guide
> http://gcmd.nasa.gov/add/difguide/
> example dif files:
> 1) 
> https://www.aoncadis.org/dataset/id/005f3222-7548-11e2-851e-00c0f03d5b7c.dif
> 2) 
> https://www.aoncadis.org/dataset/id/0091cf0c-7ad3-11e2-851e-00c0f03d5b7c.dif
> 3) 
> https://www.aoncadis.org/dataset/id/02a6301c-3ab3-11e4-8ee7-00c0f03d5b7c.dif
> an example dif file has also been attached.





[jira] [Commented] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification

2015-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341673#comment-14341673
 ] 

ASF GitHub Bot commented on TIKA-1561:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/32


> GCMD Directory Interchange Format (.dif) identification
> ---
>
> Key: TIKA-1561
> URL: https://issues.apache.org/jira/browse/TIKA-1561
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.7
>Reporter: Luke sh
>Assignee: Chris A. Mattmann
>Priority: Trivial
> Fix For: 1.8
>
> Attachments: 
> carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif
>
>
> Cited from http://gcmd.nasa.gov/add/difguide/WRITEADIF.pdf:
> "The Directory Interchange Format (DIF) is metadata format used to create 
> directory entries that describe scientific data
> sets. A DIF holds a collection of fields, which detail specific information 
> about the data."
> A .dif file is well-formed XML that describes a scientific data set; the 
> schema (XSD) it follows is referenced inside the .dif file itself,
> e.g. http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/dif_v9.8.4.xsd
> The reason for opening this ticket is that a Tika parser for .dif files is 
> under consideration, and it needs support for identifying this type of XML 
> file.
> Although a .dif file is proper XML that can be parsed by an XML parser, 
> some of its fields might still need specific processing before they are 
> extracted and injected into Solr for analysis.
> It is therefore proposed that the type 'text/dif+xml', which extends 
> application/xml, be added to tika-mimetypes.xml to support detection of 
> this specific XML type, so that special processing can be applied to 
> these files.
> 
>
> namespaceURI="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"/>
>
>
> 
> Expected MIME type: text/dif+xml
> The following is the link to the dif format guide
> http://gcmd.nasa.gov/add/difguide/
> example dif files:
> 1) 
> https://www.aoncadis.org/dataset/id/005f3222-7548-11e2-851e-00c0f03d5b7c.dif
> 2) 
> https://www.aoncadis.org/dataset/id/0091cf0c-7ad3-11e2-851e-00c0f03d5b7c.dif
> 3) 
> https://www.aoncadis.org/dataset/id/02a6301c-3ab3-11e4-8ee7-00c0f03d5b7c.dif
> an example dif file has also been attached.





[GitHub] tika pull request: add mime detection with dif(TIKA-1561) support

2015-02-28 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/32


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification

2015-02-28 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341671#comment-14341671
 ] 

Chris A. Mattmann commented on TIKA-1561:
-

OK applied the Pull Request, cleaned up some issues (since this was generated 
before [~gagravarr]'s latest commits), and all tests pass:

{noformat}
[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Tika parent . SUCCESS [  1.810 s]
[INFO] Apache Tika core ... SUCCESS [ 18.987 s]
[INFO] Apache Tika parsers  SUCCESS [02:38 min]
[INFO] Apache Tika XMP  SUCCESS [  2.601 s]
[INFO] Apache Tika serialization .. SUCCESS [  2.393 s]
[INFO] Apache Tika application  SUCCESS [ 16.845 s]
[INFO] Apache Tika OSGi bundle  SUCCESS [ 19.185 s]
[INFO] Apache Tika server . SUCCESS [ 25.777 s]
[INFO] Apache Tika translate .. SUCCESS [  3.130 s]
[INFO] Apache Tika examples ... SUCCESS [  7.419 s]
[INFO] Apache Tika Java-7 Components .. SUCCESS [  2.902 s]
[INFO] Apache Tika  SUCCESS [  0.034 s]
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 04:20 min
[INFO] Finished at: 2015-02-28T09:13:00-08:00
[INFO] Final Memory: 89M/1486M
[INFO] 
[mattmann-0420740:~/tmp/tika] mattmann% 
{noformat}

Going to commit this now.


> GCMD Directory Interchange Format (.dif) identification
> ---
>
> Key: TIKA-1561
> URL: https://issues.apache.org/jira/browse/TIKA-1561
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.7
>Reporter: Luke sh
>Assignee: Chris A. Mattmann
>Priority: Trivial
> Fix For: 1.8
>
> Attachments: 
> carbon_isotopic_values_of_alkanes_extracted_from_paleosols.dif
>
>
> Cited from http://gcmd.nasa.gov/add/difguide/WRITEADIF.pdf:
> "The Directory Interchange Format (DIF) is metadata format used to create 
> directory entries that describe scientific data
> sets. A DIF holds a collection of fields, which detail specific information 
> about the data."
> A .dif file is well-formed XML that describes a scientific data set; the 
> schema (XSD) it follows is referenced inside the .dif file itself,
> e.g. http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/dif_v9.8.4.xsd
> The reason for opening this ticket is that a Tika parser for .dif files is 
> under consideration, and it needs support for identifying this type of XML 
> file.
> Although a .dif file is proper XML that can be parsed by an XML parser, 
> some of its fields might still need specific processing before they are 
> extracted and injected into Solr for analysis.
> It is therefore proposed that the type 'text/dif+xml', which extends 
> application/xml, be added to tika-mimetypes.xml to support detection of 
> this specific XML type, so that special processing can be applied to 
> these files.
> 
>
> namespaceURI="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"/>
>
>
> 
> Expected MIME type: text/dif+xml
> DIF format guide:
> http://gcmd.nasa.gov/add/difguide/
> Example .dif files:
> 1) 
> https://www.aoncadis.org/dataset/id/005f3222-7548-11e2-851e-00c0f03d5b7c.dif
> 2) 
> https://www.aoncadis.org/dataset/id/0091cf0c-7ad3-11e2-851e-00c0f03d5b7c.dif
> 3) 
> https://www.aoncadis.org/dataset/id/02a6301c-3ab3-11e4-8ee7-00c0f03d5b7c.dif
> An example .dif file is also attached.
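
For illustration only, the detection idea in the description (treat an XML file as 'text/dif+xml' when its root element is in the DIF namespace, otherwise fall back to application/xml) can be sketched with the JDK's StAX reader. This is not Tika's detector and not the committed patch: the class name DifSniffer and the method detect are hypothetical, and the root element name DIF used in main is an assumption. Only the namespace URI and the two type names come from the ticket.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Tika.
public class DifSniffer {
    // Namespace quoted in the ticket description.
    public static final String DIF_NS = "http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/";

    /** Returns "text/dif+xml" if the root element is in the DIF namespace, else "application/xml". */
    public static String detect(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not load DTDs or external entities while sniffing.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    // The first start tag is the root element; nothing after it is read.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return DIF_NS.equals(reader.getNamespaceURI())
                                ? "text/dif+xml" : "application/xml";
                    }
                }
                return "application/xml";
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Root element name "DIF" is assumed for the example.
        System.out.println(detect("<DIF xmlns=\"" + DIF_NS + "\"/>"));
        System.out.println(detect("<other/>"));
    }
}
```

Only the namespace of the root element is checked, so the sketch stops reading after the first start tag and never touches the rest of the file.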





[jira] [Updated] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification

2015-02-28 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1561:

Fix Version/s: 1.8



[jira] [Commented] (TIKA-1558) Create a Parser Blacklist

2015-02-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341556#comment-14341556
 ] 

Hudson commented on TIKA-1558:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #514 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/514/])
TIKA-1558 Support excluding (blacklisting) parsers from config, so you can use 
DefaultParser for all except certain parsers. Also supports child parsers of a 
composite parser from config, towards TIKA-1509 (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1662940)
* /tika/trunk/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java
* /tika/trunk/tika-core/src/main/java/org/apache/tika/parser/CompositeParser.java
* /tika/trunk/tika-core/src/main/java/org/apache/tika/parser/DefaultParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/config/TikaParserConfigTest.java


> Create a Parser Blacklist
> -
>
> Key: TIKA-1558
> URL: https://issues.apache.org/jira/browse/TIKA-1558
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tyler Palsulich
>Assignee: Tyler Palsulich
> Fix For: 1.8
>
>
> As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to 
> disable Parsers without pulling their dependencies out. In some cases (e.g. 
> disable all ExternalParsers), there may not be an easy way to exclude the 
> dependencies via Maven.
> So, an initial design would be to include another file like 
> {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a 
> new method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in 
> {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list 
> that are assignable to an element in 
> {{ServiceLoader#loadServiceProviderBlacklist}}.
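
The filtering step proposed in the description (drop every loaded provider that is assignable to a blacklisted class) could look roughly like the sketch below. It is an assumption-laden illustration, not code from Tika's ServiceLoader: the class BlacklistFilter and its filter method are hypothetical names, and plain JDK types stand in for parsers so the example runs without Tika on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika.
public class BlacklistFilter {
    /** Keeps only the providers that are not instances of any blacklisted class. */
    public static <T> List<T> filter(List<T> providers, List<Class<?>> blacklist) {
        List<T> kept = new ArrayList<>();
        for (T provider : providers) {
            boolean blocked = false;
            for (Class<?> banned : blacklist) {
                // isAssignableFrom also matches subclasses of a blacklisted type,
                // which is what would let one entry disable a whole parser family.
                if (banned.isAssignableFrom(provider.getClass())) {
                    blocked = true;
                    break;
                }
            }
            if (!blocked) {
                kept.add(provider);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-ins for providers: blacklisting Number removes the Integer and the Double.
        List<Object> providers = Arrays.<Object>asList("text", 1, 2.5);
        List<Class<?>> blacklist = Arrays.<Class<?>>asList(Number.class);
        System.out.println(filter(providers, blacklist));
    }
}
```

Matching on assignability rather than exact class equality is what makes the "disable all ExternalParsers" case from the description a single blacklist entry.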





[jira] [Commented] (TIKA-1509) Create configurable strategies for composite parsers

2015-02-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341555#comment-14341555
 ] 

Hudson commented on TIKA-1509:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #514 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/514/])
TIKA-1558 Support excluding (blacklisting) parsers from config, so you can use 
DefaultParser for all except certain parsers. Also supports child parsers of a 
composite parser from config, towards TIKA-1509 (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1662940)
* /tika/trunk/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java
* /tika/trunk/tika-core/src/main/java/org/apache/tika/parser/CompositeParser.java
* /tika/trunk/tika-core/src/main/java/org/apache/tika/parser/DefaultParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/config/TikaParserConfigTest.java


> Create configurable strategies for composite parsers
> 
>
> Key: TIKA-1509
> URL: https://issues.apache.org/jira/browse/TIKA-1509
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>
> Several parsers can handle the same mime type, and we are currently ordering 
> which parser is chosen (roughly) by the alphabetic order of the parser class 
> name.
> Let's allow users to configure strategies for picking parsers.
> See and contribute to full discussion here: 
> http://wiki.apache.org/tika/CompositeParserDiscussion
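
One possible shape for such a strategy, sketched under assumptions: a strategy is modelled as a Comparator over parser class names, with the alphabetical order mentioned in the description as the default and a user-supplied preference list as an alternative. ParserChooser, preferring and choose are hypothetical names, the class names in main are made up, and nothing here reflects the design discussed on the wiki page.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Tika.
public class ParserChooser {
    /** Behaviour described in the ticket: (roughly) alphabetical by class name. */
    public static final Comparator<String> ALPHABETICAL = Comparator.naturalOrder();

    /** Names earlier in 'preferred' win; names not listed fall back to alphabetical order. */
    public static Comparator<String> preferring(List<String> preferred) {
        Comparator<String> byRank = Comparator.comparingInt(name -> {
            int rank = preferred.indexOf(name);
            return rank < 0 ? Integer.MAX_VALUE : rank;
        });
        return byRank.thenComparing(ALPHABETICAL);
    }

    /** Picks the candidate that sorts first under the given strategy. */
    public static String choose(List<String> candidates, Comparator<String> strategy) {
        return Collections.min(candidates, strategy);
    }

    public static void main(String[] args) {
        // Made-up class names for two parsers that handle the same mime type.
        List<String> candidates = Arrays.asList("org.example.BParser", "org.example.AParser");
        System.out.println(choose(candidates, ALPHABETICAL));
        System.out.println(choose(candidates, preferring(Arrays.asList("org.example.BParser"))));
    }
}
```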





[jira] [Commented] (TIKA-1558) Create a Parser Blacklist

2015-02-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341529#comment-14341529
 ] 

Hudson commented on TIKA-1558:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #513 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/513/])
Start on unit testing for the new TIKA-1558 style parser blacklisting (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1662927)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/config
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/config/TikaParserConfigTest.java
* /tika/trunk/tika-parsers/src/test/resources/org/apache/tika/config
* /tika/trunk/tika-parsers/src/test/resources/org/apache/tika/config/TIKA-1558-blacklist.xml




[jira] [Commented] (TIKA-1558) Create a Parser Blacklist

2015-02-28 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341527#comment-14341527
 ] 

Nick Burch commented on TIKA-1558:
--

As of r1662940, one or more parsers can be excluded from {{DefaultParser}} 
via the config file, e.g. with a config like:
{code}
[XML config not preserved in this archive]
{code}

A config file like that will use the normal DefaultParser, but without the 
Tesseract or Executable parsers.

Is that enough to be able to back out the blacklist service file?
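
The XML from the comment did not survive in the archive, so the following is only a sketch of what such a config might contain and how the excluded class names could be read from it. The element and attribute names (properties, parsers, parser, parser-exclude, class) and the two excluded class names are assumptions chosen to match the comment's wording, not a copy of the original snippet. ExcludeConfigReader is a hypothetical helper that uses the JDK's DOM parser rather than TikaConfig.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika.
public class ExcludeConfigReader {
    // Assumed config layout; the real snippet was stripped from the archive.
    public static final String SAMPLE =
        "<properties>\n"
      + "  <parsers>\n"
      + "    <parser class=\"org.apache.tika.parser.DefaultParser\">\n"
      + "      <parser-exclude class=\"org.apache.tika.parser.ocr.TesseractOCRParser\"/>\n"
      + "      <parser-exclude class=\"org.apache.tika.parser.executable.ExecutableParser\"/>\n"
      + "    </parser>\n"
      + "  </parsers>\n"
      + "</properties>\n";

    /** Returns the class attribute of every parser-exclude element, in document order. */
    public static List<String> excludedClasses(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("parser-exclude");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(((Element) nodes.item(i)).getAttribute("class"));
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("Unreadable config", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(excludedClasses(SAMPLE));
    }
}
```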
