[jira] [Commented] (TIKA-1508) Add uniformity to parser parameter configuration

2016-03-29 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217280#comment-15217280
 ] 

Thamme Gowda N commented on TIKA-1508:
--

[~talli...@mitre.org] [~chrismattmann]

Restarting the discussion on this issue.

Your suggestions regarding a specialized exception and a getter for params are 
already incorporated in the PR:
https://github.com/apache/tika/pull/91#commits-pushed-64db961

The suggestion related to the type system is still pending. I see what we gain 
from supporting typed values. 
However, I still have a concern, as I feel this is incomplete. What's our take on 
multivalued params (i.e. arrays of integers and strings as parameter values)?
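
To make the multivalued question concrete, here is a minimal sketch of one possible treatment. It is not the code in the PR: the `ParamValues` class, the type names `int` and `string`, and the comma-separated encoding are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: converts a raw config string into
// typed values, treating commas as separators for multivalued params.
final class ParamValues {

    // Parses e.g. ("int", "1, 2,3") into a list of three Integers.
    static List<Object> parse(String type, String raw) {
        List<Object> values = new ArrayList<>();
        for (String token : raw.split(",")) {
            String item = token.trim();
            switch (type) {
                case "int":
                    // Throws NumberFormatException on a non-numeric item.
                    values.add(Integer.parseInt(item));
                    break;
                case "string":
                    values.add(item);
                    break;
                default:
                    throw new IllegalArgumentException("Unsupported param type: " + type);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(parse("int", "1, 2,3"));
        System.out.println(parse("string", "audio/basic, audio/x-wav"));
    }
}
```

A single-valued param is then just a list of length one, so the parser side would not need two code paths. Whether a delimiter or repeated elements is the better encoding in the config file is left open here.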


> Add uniformity to parser parameter configuration
> 
>
> Key: TIKA-1508
> URL: https://issues.apache.org/jira/browse/TIKA-1508
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
> Fix For: 1.13
>
>
> We can currently configure parsers by the following means:
> 1) programmatically by direct calls to the parsers or their config objects
> 2) sending in a config object through the ParseContext
> 3) modifying .properties files for specific parsers (e.g. PDFParser)
> Rather than scattering the landscape with .properties files for each parser, 
> it would be great if we could specify parser parameters in the main config 
> file, something along the lines of this:
> {noformat}
> 
>   
> 2
> something or other
>   
>   audio/basic
>   audio/x-aiff
>   audio/x-wav
> 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1657) Allow easier XML serialization of TikaConfig

2016-03-09 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187699#comment-15187699
 ] 

Thamme Gowda N commented on TIKA-1657:
--

[~talli...@mitre.org][~gagravarr][~chrismattmann]
I am wondering whether you have considered creating model classes for 
all the configuration elements, and then using JAXB to convert 
to and from XML for (de)serialization?


> Allow easier XML serialization of TikaConfig
> 
>
> Key: TIKA-1657
> URL: https://issues.apache.org/jira/browse/TIKA-1657
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.13
>
> Attachments: TIKA-1558-blacklist-effective.xml, TIKA-1657v1.patch
>
>
> In TIKA-1418, we added an example for how to dump the config file so that 
> users could easily modify it.  I think we should go further and make this an 
> option at the tika-core level with hooks for tika-app and tika-server.  I 
> propose adding a main() to TikaConfig that will print the xml config file 
> that Tika is currently using to stdout.
> I'd like to put this into core so that e.g. Solr's DIH users can get by 
> without having to download tika-app separately.  
> There's every chance that I've not accounted for issues with dynamic loading 
> etc.  Also, I'd be ok with only having this available in tika-app and 
> tika-server if there are good reasons.
> Feedback?





[jira] [Commented] (TIKA-1508) Add uniformity to parser parameter configuration

2016-03-09 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187683#comment-15187683
 ] 

Thamme Gowda N commented on TIKA-1508:
--

1. Please let me know the final verdict once all of you agree on one approach; I 
will make the changes as per the recommendation.

2. +1. Agreed. I will update the code.

3. I really like the suggestion. It would allow us to validate parameters 
and fail early when they are wrong. 
However, I think it requires a lot of rework on the parser side as well: 
parsers have to declare which params they expect from the configuration file, 
and only then can we validate them. A simpler, lazier approach 
is to assume all params are valid, pass them all through, and let the 
parser raise an exception when there are errors. The current PR takes the latter 
approach. Let me know what you think.

4. +1. Agreed. I will update the code.

5. Anything that extends AbstractParser is now an instance of Configurable. 
Anything that is an instance of Configurable is checked and invoked with its 
params at instantiation time. So ParserDecorator, DelegatingParser and 
ParserPostProcessor are all covered. Yay! If no params are found in the config 
file, the call is made with an empty Map. It is then up to the 
implementation of each parser to make use of the params by overriding the 
configure() method. 

A & B) I think the Solr way is complex to implement considering that we don't gain 
much from the effort (as of now we can just call Integer.parseInt() or similar). 
It also introduces ambiguities between the type expected by parsers and the values 
supplied from the configuration.


That being said, I am open to all suggestions.
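
The "lazy" approach from point 3 can be sketched roughly as below. This is an illustration under assumptions rather than the code in the PR: the exact shape of `Configurable`, the `SampleParser` class, the `maxDepth` param name and the choice of `IllegalArgumentException` are all hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the Configurable contract discussed above.
interface Configurable {
    void configure(Map<String, String> params);
}

// Hypothetical parser that validates its own params when configured.
final class SampleParser implements Configurable {
    private int maxDepth = 10; // default used when the param is absent

    @Override
    public void configure(Map<String, String> params) {
        String raw = params.get("maxDepth");
        if (raw == null) {
            return; // empty or unrelated params: keep the defaults
        }
        try {
            maxDepth = Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            // The parser, not the config loader, rejects the bad value.
            throw new IllegalArgumentException("maxDepth must be an integer, got: " + raw, e);
        }
    }

    int getMaxDepth() {
        return maxDepth;
    }

    public static void main(String[] args) {
        SampleParser parser = new SampleParser();
        parser.configure(Collections.<String, String>emptyMap());
        Map<String, String> params = new HashMap<>();
        params.put("maxDepth", "3");
        parser.configure(params);
        System.out.println(parser.getMaxDepth());
    }
}
```

The config loader stays ignorant of param names and types here; the cost is that a typo in a param name is silently ignored, which is exactly what up-front declaration would catch.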









[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata

2016-02-29 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172679#comment-15172679
 ] 

Thamme Gowda N commented on TIKA-1663:
--

Yes, I would like to work on TIKA-1508, given a timeline of 6 to 8 days from now.


> Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata
> ---
>
> Key: TIKA-1663
> URL: https://issues.apache.org/jira/browse/TIKA-1663
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: digesting_parser_v1.patch
>
>
> It might be useful to integrate commons' DigestUtils and allow users to 
> easily add the MD5 or other supported hashes to the Metadata object.
> Anyone else find this of use?





[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata

2016-02-29 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172527#comment-15172527
 ] 

Thamme Gowda N commented on TIKA-1663:
--

[~chrismattmann] [~talli...@mitre.org] We need a SHA digest of the raw content for 
the MEMEX project.
I tried to enable the digesting parser by editing our config file:
{code}






.
{code}

This doesn't work, for the obvious reason that we haven't specified which digest 
algorithm to use.
After checking 
https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/DigestingParserTest.java,
 I found that DigestingParser is a flexible framework that takes constructor 
args. 

So, I propose two options:
1. We offer a few popular implementations, such as SHA and MD5 parsers, which 
don't need constructor args. This will let us activate them by editing the config 
XML file instead of source code.
2. We enhance the Tika configuration framework and these flexible parsers to accept 
runtime arguments, so that both flexibility and ease of use are preserved. For 
instance, if we can supply the digest algorithm name from the config file and let 
the DigestingParser use it at instantiation, then we don't need to edit the source 
code of applications.
{code}




  MD5
   



.
{code}

I vote for option 2. It is slightly more work, but I feel it is the 
way to go.
I do not know whether Tika already supports option 2, i.e. accepting runtime 
arguments from the config file.
I faced a similar issue with NamedEntityParser, but found a workaround by 
using System properties.
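
To illustrate option 2, here is a minimal sketch of a digester whose algorithm comes from a params map of the kind a config file could supply. It is not Tika's DigestingParser: the `ConfiguredDigester` class, the `algorithm` param name and the MD5 default are assumptions; only `java.security.MessageDigest` is a real JDK API.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not Tika's DigestingParser: picks the digest
// algorithm from a params map such as one read from a config file.
final class ConfiguredDigester {
    private final String algorithm;

    ConfiguredDigester(Map<String, String> params) {
        // "algorithm" is an assumed param name; MD5 is an arbitrary default.
        this.algorithm = params.getOrDefault("algorithm", "MD5");
        newDigest(); // fail early if the JVM does not know the algorithm
    }

    private MessageDigest newDigest() {
        try {
            return MessageDigest.getInstance(algorithm);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown digest algorithm: " + algorithm, e);
        }
    }

    // Returns the lowercase hex digest of the given bytes.
    String hexDigest(byte[] content) {
        StringBuilder hex = new StringBuilder();
        for (byte b : newDigest().digest(content)) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("algorithm", "SHA-1");
        System.out.println(new ConfiguredDigester(params).hexDigest(new byte[0]));
    }
}
```

Switching from MD5 to SHA-1 is then a config change only, which is the point of option 2.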






[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-05 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135458#comment-15135458
 ] 

Thamme Gowda N commented on TIKA-1851:
--

The shell script was the initial version we had for manual setup (we should have 
deleted it from source control to avoid this ambiguity). The Groovy script is 
what the build actually uses to automate the test setup. 
We chose Groovy over a shell script because it makes the setup portable to 
Windows, since it doesn't need tools like wget or curl.

I first tried to include the Groovy script in src/test/groovy, but that forced me 
to set up Groovy for the entire test goal of the project. The script is really 
part of the build tooling (i.e. Maven plugins) and not part of Tika's test 
cases, so it was kept outside.


As always, I am open to any suggestions you may have.

> Tika 2.0 - Move test resources from core to test-resources
> --
>
> Key: TIKA-1851
> URL: https://issues.apache.org/jira/browse/TIKA-1851
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
> Fix For: 2.0
>
>
> Let's try to move resources that are used for testing to the test-resources 
> module if possible: MockParser, DummyParser, TikaTest and the unit tests for 
> MockParser.  That should also allow us to drop the test-jar goal in 
> tika-core.  Anything else?
> Haven't actually tried this yet; there may be surprises.





[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-05 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135434#comment-15135434
 ] 

Thamme Gowda N commented on TIKA-1851:
--

Hi [~talli...@mitre.org] [~chrismattmann] and [~bobpaulin],

I spent some time debugging tests on the 2.x branch.

I think there's an issue with the test setup in this multi-module Maven project.
I found that when tests are run on parser-modules, the *tika-test-resources* 
artifact is downloaded from the public repository 
(http://repository.apache.org/snapshots/org/apache/tika/tika-test-resources/2.0-SNAPSHOT/tika-test-resources-2.0-XXX-tests.jar)
 instead of being built locally. Is this expected behaviour? 

This is probably the reason why the tests are failing.

I think the jar needs to be pulled from the tika-test-resources module's build 
rather than from the public snapshot repo.
Something like calling mvn install on tika-test-resources first, so that the 
tests on the other modules pass.
(Sorry, I haven't figured out a way to configure this in Maven yet.) Here 
is how you can reproduce this:
{code}
$ rm -r ~/.m2/repository/org/apache/tika/tika-test-resources/
$ rm tika-test-resources/src/test/resources/org/apache/tika/parser/ner/opennlp/*.bin

# This should download the NER files and result in SUCCESS. Otherwise it is a
# connection/proxy problem; we have fixed that on 1.x and just need to port it to 2.x.
$ mvn package -pl tika-test-resources

# Get into flight mode: unplug your internet connection.
# Once more, but this time offline. It did pass for me! Sounds good.
$ mvn package -pl tika-test-resources

# This fails because tika-test-resources is not built and so not available.
$ mvn test -pl tika-parser-modules/tika-parser-advanced-module/

# Let's install. Tests are skipped because we have just seen them pass in the previous step.
$ mvn install -pl tika-test-resources/ -DskipTests

# Did it pass? Not for me. It tries to download test-resources and fails. But we just
# installed the test resources to the local repo, so they should have been added to
# the classpath. Why not?
$ mvn test -pl tika-parser-modules/tika-parser-advanced-module/

# Get out of flight mode: plug the internet back in.
# Did it pass? Yes for me.
$ mvn test -pl tika-parser-modules/tika-parser-advanced-module/ -U
{code}






[jira] [Commented] (TIKA-1816) Lenient testing for NamedEntityParser

2016-02-02 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128912#comment-15128912
 ] 

Thamme Gowda N commented on TIKA-1816:
--

Looks Good.

I just confirmed that *tika-test-resources* dependency is added to the modules 
for the test goal. Indeed, this is best!


Thanks.

> Lenient testing for NamedEntityParser
> -
>
> Key: TIKA-1816
> URL: https://issues.apache.org/jira/browse/TIKA-1816
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Thamme Gowda N
>Assignee: Tim Allison
>  Labels: memex
> Fix For: 1.12
>
> Attachments: TIKA-1816-proxy-fix.patch
>
>
> NamedEntityParser has a hard setup requirement like downloading of NER models 
> from remote servers and adding them to classpath.
> These model files are huge and hence are not added to source control.
> So, the tests are most likely to fail in various environments.
> Make the best effort to set up the tests, but in the worst case skip tests 
> instead of failing the whole build process.





[jira] [Commented] (TIKA-1816) Lenient testing for NamedEntityParser

2016-02-01 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15127061#comment-15127061
 ] 

Thamme Gowda N commented on TIKA-1816:
--

[~talli...@mitre.org] Sure, I will have a look.

Correct me if I am wrong (I have been somewhat away from the 2.x discussions):
NER is now provided by *tika-parser-advanced-module*, so the tests should 
be set up there. Is that right?






[jira] [Comment Edited] (TIKA-1816) Lenient testing for NamedEntityParser

2016-01-08 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090290#comment-15090290
 ] 

Thamme Gowda N edited comment on TIKA-1816 at 1/9/16 1:04 AM:
--

[~talli...@mitre.org] Thanks for reporting.
Please test the provided patch with your proxy setup and let me know if there 
are any issues.

You are welcome to make any modifications to the model downloader program. As of 
now, the downloader uses the first active proxy from Maven's settings.
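
For readers unfamiliar with the idea, picking the first active proxy out of a Maven settings.xml could look roughly like the sketch below. This is not the code in the patch: the `MavenProxyLookup` class is hypothetical and uses plain JDK DOM parsing; the `<proxies>/<proxy>` elements with `<active>`, `<host>` and `<port>` children follow the Maven settings format. Only an explicit `<active>true</active>` counts here, which is a simplification.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not the downloader from the patch.
final class MavenProxyLookup {

    // Returns "host:port" of the first <proxy> marked active, or null if none.
    static String firstActiveProxy(String settingsXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(settingsXml.getBytes(StandardCharsets.UTF_8)));
            NodeList proxies = doc.getElementsByTagName("proxy");
            for (int i = 0; i < proxies.getLength(); i++) {
                Element proxy = (Element) proxies.item(i);
                if ("true".equals(text(proxy, "active"))) {
                    return text(proxy, "host") + ":" + text(proxy, "port");
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read Maven settings", e);
        }
    }

    // Text of the first child element with the given tag, or null if absent.
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<settings><proxies><proxy><active>true</active>"
                + "<host>proxy.example.org</host><port>8080</port></proxy></proxies></settings>";
        System.out.println(firstActiveProxy(xml));
    }
}
```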



was (Author: thammegowda):
NER model downloader uses proxy configured in maven's settings.xml

> Lenient testing for NamedEntityParser
> -
>
> Key: TIKA-1816
> URL: https://issues.apache.org/jira/browse/TIKA-1816
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Thamme Gowda N
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.12
>
> Attachments: TIKA-1816-proxy-fix.patch
>
>
> NamedEntityParser has a hard setup requirement like downloading of NER models 
> from remote servers and adding them to classpath.
> These model files are huge and hence are not added to source control.
> So, the tests are most likely to fail in various environments.
> Make the best effort to set up the tests, but in the worst case skip tests 
> instead of failing the whole build process.





[jira] [Updated] (TIKA-1816) Lenient testing for NamedEntityParser

2016-01-08 Thread Thamme Gowda N (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thamme Gowda N updated TIKA-1816:
-
Attachment: TIKA-1816-proxy-fix.patch

NER model downloader uses proxy configured in maven's settings.xml






[jira] [Created] (TIKA-1816) Lenient testing for NamedEntityParser

2015-12-20 Thread Thamme Gowda N (JIRA)
Thamme Gowda N created TIKA-1816:


 Summary: Lenient testing for NamedEntityParser
 Key: TIKA-1816
 URL: https://issues.apache.org/jira/browse/TIKA-1816
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Thamme Gowda N


NamedEntityParser has a hard setup requirement like downloading of NER models 
from remote servers and adding them to classpath.
These model files are huge and hence are not added to source control.
So, the tests are most likely to fail in various environments.

Make the best effort to set up the tests, but in the worst case skip tests 
instead of failing the whole build process.
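
The "skip instead of fail" idea boils down to checking for the model before the test body runs. A minimal sketch follows, assuming a hypothetical `ModelAvailability` helper and a hypothetical model file name; in a JUnit 4 test the boolean would typically be passed to `Assume.assumeTrue(...)` so the test is reported as skipped.

```java
// Hypothetical helper, not part of Tika's test suite.
final class ModelAvailability {

    // Returns true if the resource can be found on the classpath.
    static boolean isOnClasspath(String resourcePath) {
        return ModelAvailability.class.getClassLoader().getResource(resourcePath) != null;
    }

    public static void main(String[] args) {
        // The file name below is an assumption made for the example.
        String model = "org/apache/tika/parser/ner/opennlp/ner-person.bin";
        if (!isOnClasspath(model)) {
            System.out.println("NER model not found, skipping test");
            return;
        }
        System.out.println("NER model found, running test");
    }
}
```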





[jira] [Created] (TIKA-1815) Text content from parser is empty when NamedEntityParser is enabled

2015-12-19 Thread Thamme Gowda N (JIRA)
Thamme Gowda N created TIKA-1815:


 Summary: Text content from parser is empty when NamedEntityParser 
is enabled
 Key: TIKA-1815
 URL: https://issues.apache.org/jira/browse/TIKA-1815
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Thamme Gowda N
 Fix For: 1.12


When the NamedEntityParser is enabled, Tika#parseToString() and the other 
parse() methods produce an empty string.





[jira] [Commented] (TIKA-1787) Include Stanford Name Entity Recognition in Tika

2015-11-11 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001042#comment-15001042
 ] 

Thamme Gowda N commented on TIKA-1787:
--

With #61, the CoreNLP NER can be activated with the following steps:

- Add the CoreNLP jars and models to the classpath. If you are using Maven, add:
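{code}
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>${corenlp.version}</version>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>${corenlp.version}</version>
    <classifier>models</classifier>
</dependency>
{code}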

- Set the system property "ner.impl.class" to 
"org.apache.tika.parser.ner.corenlp.CoreNLPNERecogniser".
   You can do this either by calling `System.setProperty()` before instantiating 
the Tika parsers in code, or on the command line by passing 
"-Dner.impl.class=org.apache.tika.parser.ner.corenlp.CoreNLPNERecogniser" when 
launching the JVM.

- Activate the NamedEntityParser.

A demo project setup is at: https://github.com/thammegowda/tika-ner-corenlp
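
The programmatic variant of the second step can be sketched as below. The property name and class name are the ones given in the steps; the `NerImplSelector` class and the fallback class name are hypothetical and only show that the property must be set before anything reads it.

```java
// Hypothetical helper illustrating the system-property switch.
final class NerImplSelector {
    static final String PROPERTY = "ner.impl.class";

    // Returns the configured recogniser class name, or the fallback if unset.
    static String configuredImpl(String fallback) {
        return System.getProperty(PROPERTY, fallback);
    }

    public static void main(String[] args) {
        // Must run before the parsers are instantiated.
        System.setProperty(PROPERTY, "org.apache.tika.parser.ner.corenlp.CoreNLPNERecogniser");
        // "example.DefaultRecogniser" is a made-up fallback name.
        System.out.println(configuredImpl("example.DefaultRecogniser"));
    }
}
```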





> Include Stanford Name Entity Recognition in Tika
> 
>
> Key: TIKA-1787
> URL: https://issues.apache.org/jira/browse/TIKA-1787
> Project: Tika
>  Issue Type: Improvement
>  Components: mime, parser
>Affects Versions: 1.12
> Environment: Java 1.8, Mac OSX 10.11
>Reporter: Yueheng He
>Assignee: Chris A. Mattmann
>  Labels: features, newbie, test
> Fix For: 1.12
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Using the Stanford Name Entity Recognition, Tika will be able to extract name 
> entities like PERSON, ORGANIZATION, LOCATION, etc from the given text. The 
> extracted name entities will be added to the metadata





[jira] [Commented] (TIKA-1791) URI is not hierarchical exception when location model resource is inside a jar in classpath

2015-11-11 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000975#comment-15000975
 ] 

Thamme Gowda N commented on TIKA-1791:
--

Thanks for pointing out the issue.
I didn't anticipate changes to the configuration after the parser had started running. 

It's now handled in `initialize()`:
{code}
if (this.modelUrl != null && this.modelUrl.equals(modelUrl)) {
    // previously initialized for the same URL
    return;
}
{code}

If Tika's environments are really that dynamic (e.g. the files pointed to by URLs 
are frequently updated or deleted), then state probably shouldn't be kept. However, 
as you can see, it is a tradeoff against performance. If that is the case, I can 
revert to the older way.
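
To show the tradeoff in isolation, here is a small self-contained sketch of the same guard. It is not the GeoParser code: the `CachedModelHolder` class, its load counter and the `url()` helper are hypothetical, and the actual model loading is replaced by a counter increment.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical holder, not Tika's GeoParser: remembers which URL it was
// initialized with so the expensive load happens once per URL.
final class CachedModelHolder {
    private URL modelUrl;
    private int loadCount; // number of real loads, kept only for the demo

    synchronized void initialize(URL newUrl) {
        if (this.modelUrl != null && this.modelUrl.equals(newUrl)) {
            return; // previously initialized for the same URL
        }
        this.modelUrl = newUrl;
        loadCount++; // stand-in for actually reading the model
    }

    synchronized int getLoadCount() {
        return loadCount;
    }

    // Convenience for the demo: wraps the checked exception.
    static URL url(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        CachedModelHolder holder = new CachedModelHolder();
        holder.initialize(url("file:/tmp/model-a.bin"));
        holder.initialize(url("file:/tmp/model-a.bin")); // no reload
        System.out.println(holder.getLoadCount());
    }
}
```

Note that the guard only compares URLs, so a file that is replaced under the same URL is not reloaded; that is the staleness risk mentioned above. Also, `URL.equals` may resolve host names for network URLs, so comparing `toExternalForm()` strings is a common alternative.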



> URI is not hierarchical exception when location model resource is inside a 
> jar in classpath
> ---
>
> Key: TIKA-1791
> URL: https://issues.apache.org/jira/browse/TIKA-1791
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
> Environment: location model  file is placed inside a fat Jar (with 
> all the dependencies)
>Reporter: Thamme Gowda N
>
> {code:title=Stacktrace|borderStyle=solid}
> The following error happens when the location NER model resource is packaged 
> inside a jar and GeoTopicParser is enabled.
> Caused by: java.lang.IllegalArgumentException: URI is not hierarchical
>   at java.io.File.<init>(File.java:418)
>   at org.apache.tika.parser.geo.topic.GeoParserConfig.<init>(GeoParserConfig.java:33)
>   at org.apache.tika.parser.geo.topic.GeoParser.<init>(GeoParser.java:54)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at java.lang.Class.newInstance(Class.java:442)
>   at org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:559)
>   at org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:492)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:166)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:149)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:142)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:138)
>   at edu.usc.cs.ir.cwork.tika.Parser.<init>(Parser.java:45)
> {code}
> References:
> http://stackoverflow.com/questions/18055189/why-my-uri-is-not-hierarchical





[jira] [Commented] (TIKA-1791) URI is not hierarchical exception when location model resource is inside a jar in classpath

2015-11-10 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999085#comment-14999085
 ] 

Thamme Gowda N commented on TIKA-1791:
--

Thanks for the feedback. 

* The fix for the non-hierarchical URI is to use a URL instead of a URI and path 
string. (I learned that a file inside a ZIP archive can be addressed by a URL, but 
its URI is not hierarchical and so cannot be converted to a File.)
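
The failure can be reproduced without Tika. The sketch below uses only JDK classes; the `JarUriDemo` class and the example paths are made up for illustration. `new File(URI)` rejects a `jar:` URI with "URI is not hierarchical", whereas a plain `file:` URI is accepted. Reading such a resource through `URL.openStream()` or `ClassLoader.getResourceAsStream()` avoids the `File` conversion altogether.

```java
import java.io.File;
import java.net.URI;

// Demonstrates why a resource inside a jar cannot be turned into a File.
final class JarUriDemo {

    // Returns true if the URI can be converted to a java.io.File.
    static boolean convertibleToFile(URI uri) {
        try {
            new File(uri);
            return true;
        } catch (IllegalArgumentException e) {
            // For jar: URIs the message is "URI is not hierarchical".
            return false;
        }
    }

    public static void main(String[] args) {
        // Both paths are hypothetical examples.
        System.out.println(convertibleToFile(URI.create("file:/tmp/model.bin")));
        System.out.println(convertibleToFile(URI.create("jar:file:/tmp/app.jar!/model.bin")));
    }
}
```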

While modifying the NER model loading code to make the above change possible, I 
also made these changes:

* The NER model was previously reloaded on every `parse()` call. It is now reused 
via a state variable.
* The `isAvailable()` function previously launched an external 
process on every call to figure out whether the 'lucene-geo-gazetteer' 
command is available (it is invoked in `parse()`). It now uses a state 
variable.
* The model is loaded on the first call to `parse()` or `isAvailable()`, via lazy 
initialization. My tests showed that this is backward compatible. 

UPDATE: 
The test case is now unaltered. I was just checking whether the test cases 
pass a different parse context. The lazy initialization of the name extractor is 
guaranteed to work and thus shouldn't break existing usages. 
`GeoParserConfig.setNERModelPath(String)` is also preserved for 
users who already use it to supply the model path. However, 
`GeoParserConfig.getNERPath()` has been replaced with a URL getter.







[jira] [Commented] (TIKA-1791) URI is not hierarchical exception when location model resource is inside a jar in classpath

2015-11-09 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997843#comment-14997843
 ] 

Thamme Gowda N commented on TIKA-1791:
--

Resolved; a pull request has been created on GitHub:
https://github.com/apache/tika/pull/63






[jira] [Created] (TIKA-1791) URI is not hierarchical exception when location model resource is inside a jar in classpath

2015-11-09 Thread Thamme Gowda N (JIRA)
Thamme Gowda N created TIKA-1791:


 Summary: URI is not hierarchical exception when location model 
resource is inside a jar in classpath
 Key: TIKA-1791
 URL: https://issues.apache.org/jira/browse/TIKA-1791
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.11
 Environment: location model  file is placed inside a fat Jar (with all 
the dependencies)
Reporter: Thamme Gowda N


{code:title=Stacktrace|borderStyle=solid}
The following error happens when the location NER model resource is packaged inside 
a jar and GeoTopicParser is enabled.

Caused by: java.lang.IllegalArgumentException: URI is not hierarchical
    at java.io.File.<init>(File.java:418)
    at org.apache.tika.parser.geo.topic.GeoParserConfig.<init>(GeoParserConfig.java:33)
    at org.apache.tika.parser.geo.topic.GeoParser.<init>(GeoParser.java:54)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at java.lang.Class.newInstance(Class.java:442)
    at org.apache.tika.config.TikaConfig$XmlLoader.loadOne(TikaConfig.java:559)
    at org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:492)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:166)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:149)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:142)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:138)
    at edu.usc.cs.ir.cwork.tika.Parser.<init>(Parser.java:45)

{code}

References:
http://stackoverflow.com/questions/18055189/why-my-uri-is-not-hierarchical


