[jira] Updated: (SOLR-1274) Provide multiple output formats in extract-only mode for tika handler

2009-08-03 Thread Peter Wolanin (JIRA)

 [ https://issues.apache.org/jira/browse/SOLR-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Wolanin updated SOLR-1274:


Attachment: SOLR-1274.patch

Well, indeed - something like that works better.


> Provide multiple output formats in extract-only mode for tika handler
> ---------------------------------------------------------------------
>
>                 Key: SOLR-1274
>                 URL: https://issues.apache.org/jira/browse/SOLR-1274
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 1.4
>            Reporter: Peter Wolanin
>            Priority: Minor
>             Fix For: 1.4
>
> Attachments: SOLR-1274.patch, SOLR-1274.patch
>
>
> The proposed feature is to accept a URL parameter when using extract-only 
> mode to specify an output format.  This parameter might just overload the 
> existing "ext.extract.only" so that one can optionally specify a format, e.g. 
> false|true|xml|text, where true and xml give the same response (i.e. xml 
> remains the default).
> I had been assuming that I could choose among possible tika output
> formats when using the extracting request handler in extract-only mode
> as if from the CLI with the tika jar:
>    -x or --xml        Output XHTML content (default)
>    -h or --html       Output HTML content
>    -t or --text       Output plain text content
>    -m or --metadata   Output only metadata
> However, looking at the docs and source, it seems that only the xml
> option is available (hard-coded) in ExtractingDocumentLoader.java
> {code}
> serializer = new XMLSerializer(writer, new OutputFormat("XML", "UTF-8", true));
> {code}
> Providing at least a plain-text response seems to work if you change the 
> serializer to a TextSerializer (org.apache.xml.serialize.TextSerializer).
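
For illustration, the overloaded false|true|xml|text parameter described above could be mapped to an output mode roughly as follows. This is only a sketch and not the code in the attached patch: the class name ExtractOnlyParam, the Mode enum, and the parse method are hypothetical, and rejecting unknown values (rather than falling back to xml) is an assumption.

```java
import java.util.Locale;

public class ExtractOnlyParam {
    /** Hypothetical output modes: INDEX means "not extract-only". */
    public enum Mode { INDEX, XML, TEXT }

    /** Maps the raw parameter value (false|true|xml|text) to a mode. */
    public static Mode parse(String value) {
        if (value == null) {
            return Mode.INDEX; // parameter absent: normal indexing
        }
        String v = value.trim().toLowerCase(Locale.ROOT);
        switch (v) {
            case "":
            case "false":
                return Mode.INDEX;
            case "true": // true and xml give the same response
            case "xml":
                return Mode.XML;
            case "text":
                return Mode.TEXT;
            default:
                // Assumption: unknown formats are an error, not a silent default.
                throw new IllegalArgumentException("Unknown extract format: " + value);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("true")); // XML
        System.out.println(parse("text")); // TEXT
    }
}
```

The handler would then pick the serializer based on the returned mode, keeping xml as the default for backward compatibility.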

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1274) Provide multiple output formats in extract-only mode for tika handler

2009-08-03 Thread Peter Wolanin (JIRA)

 [ https://issues.apache.org/jira/browse/SOLR-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Wolanin updated SOLR-1274:


Attachment: SOLR-1274.patch

Here's a patch that's nearly there, but I'm missing something about how 
Java behaves.  The param is getting picked up, but this condition never 
evaluates to true, even when the param is parsed correctly:

{code}
  if (extractFormat == "text") {
{code}


If I set it to
{code}
  if (true) {
{code}

I get the desired text-only output.
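
A likely explanation for the behavior above: in Java, == on String operands compares object references, not characters, so a value built at runtime (such as one parsed from a request) is usually a different object from the literal "text" even when the contents match. Comparing with equals() checks the contents. The sketch below is standalone; the class and method names are made up for the demonstration, and new String("text") merely stands in for a parameter value read at runtime.

```java
public class StringCompareDemo {
    /** Compares by value; null-safe because the literal is the receiver. */
    public static boolean isTextFormat(String extractFormat) {
        return "text".equals(extractFormat);
    }

    public static void main(String[] args) {
        // Stand-in for a value parsed at runtime: a distinct String object
        // with the same characters as the literal "text".
        String extractFormat = new String("text");

        System.out.println(extractFormat == "text");     // false: different objects
        System.out.println(isTextFormat(extractFormat)); // true: same characters
    }
}
```

Putting the literal first, as in "text".equals(extractFormat), also avoids a NullPointerException when the parameter is absent.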

> Provide multiple output formats in extract-only mode for tika handler
> ---------------------------------------------------------------------
>
>                 Key: SOLR-1274
>                 URL: https://issues.apache.org/jira/browse/SOLR-1274
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 1.4
>            Reporter: Peter Wolanin
>            Priority: Minor
>             Fix For: 1.4
>
> Attachments: SOLR-1274.patch