[jira] Resolved: (NUTCH-816) Add zip target to build.xml
[ https://issues.apache.org/jira/browse/NUTCH-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann resolved NUTCH-816. - Resolution: Fixed - fixed in r942427 > Add zip target to build.xml > --- > > Key: NUTCH-816 > URL: https://issues.apache.org/jira/browse/NUTCH-816 > Project: Nutch > Issue Type: Improvement > Components: build >Affects Versions: 1.0.0 > Environment: indep. of env. >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann > Fix For: 1.1 > > > Just like we have an ant tar target (pun intended) we should have an ant zip > target. I'd like to have this ready for the release and future releases. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Work started: (NUTCH-816) Add zip target to build.xml
[ https://issues.apache.org/jira/browse/NUTCH-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-816 started by Chris A. Mattmann. > Add zip target to build.xml > --- > > Key: NUTCH-816 > URL: https://issues.apache.org/jira/browse/NUTCH-816 > Project: Nutch > Issue Type: Improvement > Components: build >Affects Versions: 1.0.0 > Environment: indep. of env. >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann > Fix For: 1.1 > > > Just like we have an ant tar target (pun intended) we should have an ant zip > target. I'd like to have this ready for the release and future releases. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-816) Add zip target to build.xml
Add zip target to build.xml --- Key: NUTCH-816 URL: https://issues.apache.org/jira/browse/NUTCH-816 Project: Nutch Issue Type: Improvement Components: build Affects Versions: 1.0.0 Environment: indep. of env. Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.1 Just like we have an ant tar target (pun intended) we should have an ant zip target. I'd like to have this ready for the release and future releases. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-814) SegmentMerger bug
[ https://issues.apache.org/jira/browse/NUTCH-814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861401#action_12861401 ]

Chris A. Mattmann commented on NUTCH-814:
---

Hey Andrzej,

After you commit this, should I cut a new RC (rc #3)?

Cheers,
Chris

> SegmentMerger bug
> ---
>
> Key: NUTCH-814
> URL: https://issues.apache.org/jira/browse/NUTCH-814
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.1
> Reporter: Dennis Kubes
> Assignee: Andrzej Bialecki
> Fix For: 1.1
>
> Attachments: merger.patch
>
>
> Dennis reported:
> {quote}
> In the SegmentMerger.java file about line 150 we have this:
>     final SequenceFile.Reader reader =
>         new SequenceFile.Reader(FileSystem.get(job), fSplit.getPath(), job);
> Then about line 166 in the record reader we have this:
>     boolean res = reader.next(key, w);
> If I am reading that right, that would mean that the map task would loop
> over all records for a given file and not just a given split.
> {quote}
> Right, this should instead use SequenceFileRecordReader that already has the
> logic to handle splits. Patch coming shortly - thanks for spotting this! This
> could be the reason for "out of disk space" errors that many users reported.
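The effect Dennis describes in NUTCH-814 can be illustrated with a toy model. The sketch below does not use Hadoop classes: the class `SplitReadSketch`, its methods, and the idea of addressing records by list index are all assumptions made for illustration (real splits are byte ranges). It only shows why a reader that ignores its split bounds processes every record once per split.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a record file divided into splits; hypothetical, not Hadoop code. */
final class SplitReadSketch {

    /** Buggy pattern: ignores the split bounds and reads to the end of the file. */
    static List<String> readWholeFile(List<String> records, int splitStart, int splitLength) {
        List<String> out = new ArrayList<>();
        for (int pos = 0; pos < records.size(); pos++) {
            out.add(records.get(pos));
        }
        return out;
    }

    /** Split-aware pattern: reads only the records that belong to this split. */
    static List<String> readSplit(List<String> records, int splitStart, int splitLength) {
        List<String> out = new ArrayList<>();
        int end = Math.min(records.size(), splitStart + splitLength);
        for (int pos = splitStart; pos < end; pos++) {
            out.add(records.get(pos));
        }
        return out;
    }

    /** Counts the records processed when every split is read with the chosen strategy. */
    static int totalProcessed(List<String> records, int splitLength, boolean splitAware) {
        int total = 0;
        for (int start = 0; start < records.size(); start += splitLength) {
            total += splitAware
                ? readSplit(records, start, splitLength).size()
                : readWholeFile(records, start, splitLength).size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> records = List.of("a", "b", "c", "d", "e", "f");
        System.out.println(totalProcessed(records, 2, false)); // every split re-reads all records
        System.out.println(totalProcessed(records, 2, true));  // each record is read once
    }
}
```

In this model the work grows with the number of splits when the bounds are ignored, which is consistent with the suspicion voiced in the issue that the bug could inflate output and disk usage.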
[jira] Resolved: (NUTCH-812) Crawl.java incorrectly uses the Generator API resulting in NPE
[ https://issues.apache.org/jira/browse/NUTCH-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved NUTCH-812.
---

Fix Version/s: 1.1
Resolution: Fixed

- fixed in r935453. Thanks, Phil and Andrzej!

> Crawl.java incorrectly uses the Generator API resulting in NPE
> ---
>
> Key: NUTCH-812
> URL: https://issues.apache.org/jira/browse/NUTCH-812
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.1
> Reporter: Andrzej Bialecki
> Assignee: Chris A. Mattmann
> Priority: Critical
> Fix For: 1.1
>
>
> As reported by Phil Barnett on nutch-user:
> {quote}
> The Fix.
> In line 131 of Crawl.java
> Generate no longer returns segments like it used to. Now it returns segs.
> Line 131 needs to read
>     if (segs == null)
> instead of the current
>     if (segments == null)
> After that change and a recompile, crawl is working just fine.
> {quote}
[jira] Assigned: (NUTCH-812) Crawl.java incorrectly uses the Generator API resulting in NPE
[ https://issues.apache.org/jira/browse/NUTCH-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned NUTCH-812:
---

Assignee: Chris A. Mattmann

> Crawl.java incorrectly uses the Generator API resulting in NPE
> ---
>
> Key: NUTCH-812
> URL: https://issues.apache.org/jira/browse/NUTCH-812
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.1
> Reporter: Andrzej Bialecki
> Assignee: Chris A. Mattmann
> Priority: Critical
>
> As reported by Phil Barnett on nutch-user:
> {quote}
> The Fix.
> In line 131 of Crawl.java
> Generate no longer returns segments like it used to. Now it returns segs.
> Line 131 needs to read
>     if (segs == null)
> instead of the current
>     if (segments == null)
> After that change and a recompile, crawl is working just fine.
> {quote}
[jira] Work started: (NUTCH-812) Crawl.java incorrectly uses the Generator API resulting in NPE
[ https://issues.apache.org/jira/browse/NUTCH-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on NUTCH-812 started by Chris A. Mattmann.

> Crawl.java incorrectly uses the Generator API resulting in NPE
> ---
>
> Key: NUTCH-812
> URL: https://issues.apache.org/jira/browse/NUTCH-812
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.1
> Reporter: Andrzej Bialecki
> Assignee: Chris A. Mattmann
> Priority: Critical
>
> As reported by Phil Barnett on nutch-user:
> {quote}
> The Fix.
> In line 131 of Crawl.java
> Generate no longer returns segments like it used to. Now it returns segs.
> Line 131 needs to read
>     if (segs == null)
> instead of the current
>     if (segments == null)
> After that change and a recompile, crawl is working just fine.
> {quote}
[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java
[ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854767#action_12854767 ]

Chris A. Mattmann commented on NUTCH-570:
---

Hi Otis:

I think your logic is perfectly rational here. Maybe you could leave it open for another 48 hrs, and then close it out if you don't get any feedback from the original reporter, or those that were interested.

Cheers,
Chris

> Improvement of URL Ordering in Generator.java
> ---
>
> Key: NUTCH-570
> URL: https://issues.apache.org/jira/browse/NUTCH-570
> Project: Nutch
> Issue Type: Improvement
> Components: generator
> Reporter: Ned Rockson
> Assignee: Otis Gospodnetic
> Priority: Minor
> Attachments: GeneratorDiff.out, GeneratorDiff_v1.out
>
>
> [Copied directly from my email to nutch-dev list]
> Recently I switched to Fetcher2 over Fetcher for larger whole web fetches
> (50-100M at a time). I found that the URLs generated are not optimal because
> they are simply randomized by a hash comparator. In one crawl on 24 machines
> it took about 3 days to crawl 30M URLs. In comparison with old benchmarks I
> had set with regular Fetcher.java this was at least 3-fold more time.
> Anyway, I realized that the best situation for ordering can be approached by
> randomization, but in order to get optimal ordering, URLs from the same host
> should be as far apart in the list as possible. So I wrote a series of 2
> map/reduces to optimize the ordering and for a list of 25M documents it takes
> about 10 minutes on our cluster. Right now I have it in its own class, but I
> figured it can go in Generator.java and just add a flag in nutch-default.xml
> determining if the user wants to use it.
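The ordering goal stated in NUTCH-570, keeping URLs from the same host far apart in the fetch list, can be approximated in memory with a round-robin over per-host queues. This is a minimal sketch and not the reporter's two map/reduce jobs or the attached diffs: `HostInterleaveSketch` and `interleaveByHost` are hypothetical names, and round-robin is a simple stand-in that does not guarantee maximal spacing.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration: interleave URLs so that one host is not hit back to back. */
final class HostInterleaveSketch {

    static List<String> interleaveByHost(List<String> urls) {
        // Group URLs by host, preserving first-seen host order.
        Map<String, Deque<String>> byHost = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host == null) {
                host = ""; // URLs without a parsable host share one bucket
            }
            byHost.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
        }
        // Take one URL per host per round until every queue is empty.
        List<String> ordered = new ArrayList<>(urls.size());
        while (ordered.size() < urls.size()) {
            for (Deque<String> queue : byHost.values()) {
                String next = queue.poll();
                if (next != null) {
                    ordered.add(next);
                }
            }
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://a.example/1", "http://a.example/2", "http://a.example/3",
            "http://b.example/1", "http://c.example/1");
        interleaveByHost(urls).forEach(System.out::println);
    }
}
```

When one host dominates the list, as `a.example` does here, its remaining URLs still end up adjacent at the tail, which is why a spacing-aware scheme like the one described in the issue can do better than plain round-robin.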
[jira] Commented: (NUTCH-789) Improvements to Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853285#action_12853285 ] Chris A. Mattmann commented on NUTCH-789: - Hey Julien, Tika 0.7 is available from Maven central: http://repo1.maven.org/maven2/org/apache/tika/tika-parsers/ Cheers, Chris > Improvements to Tika parser > --- > > Key: NUTCH-789 > URL: https://issues.apache.org/jira/browse/NUTCH-789 > Project: Nutch > Issue Type: Improvement > Components: fetcher > Environment: reported by Sami, in NUTCH-766 >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann >Priority: Minor > Fix For: 1.1 > > Attachments: NutchTikaConfig.java, TikaParser.java > > > As reported by Sami in NUTCH-766, Sami has a few improvements he made to the > Tika parser. We'll track that progress here. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-789) Improvements to Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853212#action_12853212 ] Chris A. Mattmann commented on NUTCH-789: - Hey Julien -- okey dok, Tika 0.7 has been released. Feel free to upgrade, and close this one out...after that, I'll cut the Nutch 1.1 RC. Thanks! Cheers, Chris > Improvements to Tika parser > --- > > Key: NUTCH-789 > URL: https://issues.apache.org/jira/browse/NUTCH-789 > Project: Nutch > Issue Type: Improvement > Components: fetcher > Environment: reported by Sami, in NUTCH-766 >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann >Priority: Minor > Fix For: 1.1 > > Attachments: NutchTikaConfig.java, TikaParser.java > > > As reported by Sami in NUTCH-766, Sami has a few improvements he made to the > Tika parser. We'll track that progress here. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-794) Language Identification must use check the parse metadata for language values
[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852101#action_12852101 ] Chris A. Mattmann commented on NUTCH-794: - Hey Julien, yepper, I posted an RC of Tika 0.7, see: http://bit.ly/c7FZRc. If the VOTE passes on that in say the next 72 hours, I will push out a Tika 0.7 release to the mirrors. If everyone is OK with that, we can release Nutch 1.1 after...thoughts? > Language Identification must use check the parse metadata for language values > -- > > Key: NUTCH-794 > URL: https://issues.apache.org/jira/browse/NUTCH-794 > Project: Nutch > Issue Type: Bug > Components: parser >Reporter: Julien Nioche >Assignee: Julien Nioche > Fix For: 1.1 > > Attachments: NUTCH-794.patch > > > The following HTML document : > document 1 titlejotain > suomeksi > is rendered as the following xhtml by Tika : > xmlns="http://www.w3.org/1999/xhtml";>document 1 > titlejotain suomeksi > with the lang attribute getting lost. The lang is not stored in the metadata > either. > I will open an issue on Tika and modify TestHTMLLanguageParser so that the > tests don't break anymore -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-789) Improvements to Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852048#action_12852048 ] Chris A. Mattmann commented on NUTCH-789: - Folks, I'm going to put together an RC for Tika 0.7 and take care of JIRA now. Once I do that, we can try and close out this issue for 1.1. I should be able to do this before the 48 hr deadline I threw up for Nutch 1.1... > Improvements to Tika parser > --- > > Key: NUTCH-789 > URL: https://issues.apache.org/jira/browse/NUTCH-789 > Project: Nutch > Issue Type: Improvement > Components: fetcher > Environment: reported by Sami, in NUTCH-766 >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann >Priority: Minor > Fix For: 1.1 > > Attachments: NutchTikaConfig.java, TikaParser.java > > > As reported by Sami in NUTCH-766, Sami has a few improvements he made to the > Tika parser. We'll track that progress here. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0
[ https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852047#action_12852047 ]

Chris A. Mattmann commented on NUTCH-673:
---

Folks: if you get time to put together a patch for 1.1 or feel that this should go into 1.1, please see: http://bit.ly/c7tBv9 and comment in the next 48 hrs...

> Upgrade the Carrot2 plug-in to release 3.0
> ---
>
> Key: NUTCH-673
> URL: https://issues.apache.org/jira/browse/NUTCH-673
> Project: Nutch
> Issue Type: Improvement
> Components: web gui
> Affects Versions: 0.9.0
> Environment: All Nutch deployments.
> Reporter: Sean Dean
> Priority: Minor
>
> Release 3.0 of the Carrot2 plug-in was released recently.
> We currently have version 2.1 in the source tree and upgrading it to the
> latest version before 1.0-release might make sense.
> Details on the release can be found here:
> http://project.carrot2.org/release-3.0-notes.html
> One major change in requirements is for JDK 1.5 to be used, but this is also
> now required for Hadoop 0.19 so this wouldn't be the only reason for the
> switch.
[jira] Updated: (NUTCH-771) Add WebGraph classes to the bin/nutch script
[ https://issues.apache.org/jira/browse/NUTCH-771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-771: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 > Add WebGraph classes to the bin/nutch script > > > Key: NUTCH-771 > URL: https://issues.apache.org/jira/browse/NUTCH-771 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.1 > Environment: All, shell script >Reporter: Dennis Kubes >Assignee: Dennis Kubes > > Currently the webgraph jobs are called on the command line by calling main > methods on their classes. I propose to upgrade the bin/nutch shell script to > allow calling these jobs as well. This would include the webgraphdb, > linkrank, scoreupdater, and nodedumper jobs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-475) Adaptive crawl delay
[ https://issues.apache.org/jira/browse/NUTCH-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-475:

Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> Adaptive crawl delay
>
> Key: NUTCH-475
> URL: https://issues.apache.org/jira/browse/NUTCH-475
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Reporter: Doğacan Güney
> Attachments: adaptive-delay_draft.patch
>
>
> Current fetcher implementation waits a default interval before making another
> request to the same server (if crawl-delay is not specified in robots.txt).
> IMHO, an adaptive implementation will be better. If the server is under
> little load and can serve requests fast, then fetcher can ask for more pages
> in a given interval. Similarly, if the server is suffering from heavy load,
> fetcher can slow down (w.r.t. that host), easing the load on the server.
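One way to picture the adaptive behaviour proposed in NUTCH-475 is to scale the per-host delay with the last observed response time. The sketch below is an assumption-laden illustration, not the attached draft patch: the class name, the factor of 10, and the 500 ms / 10 s bounds are all invented for the example. A crawl delay from robots.txt, when present, is returned unchanged, matching the condition stated in the issue.

```java
/** Hypothetical illustration of an adaptive per-host crawl delay. */
final class AdaptiveDelaySketch {

    // Illustrative constants, not Nutch configuration values.
    static final long FACTOR = 10;
    static final long MIN_DELAY_MILLIS = 500;
    static final long MAX_DELAY_MILLIS = 10_000;

    /**
     * Returns the delay before the next request to the same host.
     *
     * @param responseTimeMillis time the host took to answer the last request
     * @param robotsDelayMillis  crawl delay from robots.txt, or a negative value if unspecified
     */
    static long nextDelayMillis(long responseTimeMillis, long robotsDelayMillis) {
        if (robotsDelayMillis >= 0) {
            return robotsDelayMillis; // an explicit crawl-delay always wins
        }
        long scaled = responseTimeMillis * FACTOR;
        // Fast hosts get a shorter delay, slow hosts a longer one, within fixed bounds.
        return Math.max(MIN_DELAY_MILLIS, Math.min(MAX_DELAY_MILLIS, scaled));
    }

    public static void main(String[] args) {
        System.out.println(nextDelayMillis(20, -1));    // fast host, clamped to the minimum
        System.out.println(nextDelayMillis(300, -1));   // moderate host
        System.out.println(nextDelayMillis(5_000, -1)); // slow host, clamped to the maximum
    }
}
```

A production version would likely smooth the response time over several requests rather than react to a single sample.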
[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-666: Patch Info: [Patch Available] > Analysis plugins for multiple language and new Language Identifier Tool > --- > > Key: NUTCH-666 > URL: https://issues.apache.org/jira/browse/NUTCH-666 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.1 > Environment: All >Reporter: Dennis Kubes >Assignee: Dennis Kubes > Attachments: NUTCH-666-1-20081126.patch, NUTCH-666-2-20091217-nf.patch > > > Add analysis plugins for czech, greek, japanese, chinese, korean, dutch, > russian, and thai. Also includes a new Language Identifier tool that used > the new indexing framework in NUTCH-646. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-666: Due Date: 27/Nov/08 (was: 27/Nov/08) Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 > Analysis plugins for multiple language and new Language Identifier Tool > --- > > Key: NUTCH-666 > URL: https://issues.apache.org/jira/browse/NUTCH-666 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.1 > Environment: All >Reporter: Dennis Kubes >Assignee: Dennis Kubes > Attachments: NUTCH-666-1-20081126.patch, NUTCH-666-2-20091217-nf.patch > > > Add analysis plugins for czech, greek, japanese, chinese, korean, dutch, > russian, and thai. Also includes a new Language Identifier tool that used > the new indexing framework in NUTCH-646. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-583) FeedParser empty links for items
[ https://issues.apache.org/jira/browse/NUTCH-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-583:

Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> FeedParser empty links for items
>
> Key: NUTCH-583
> URL: https://issues.apache.org/jira/browse/NUTCH-583
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.0.0
> Reporter: Enis Soztutar
> Assignee: Enis Soztutar
>
> FeedParser in feed plugin just discards the item if it does not have a
> <link> element. However, RSS 2.0 does not require the <link> element for
> each <item>.
> Moreover, sometimes the link is given in the <guid> element, which is a
> globally unique identifier for the item. I think we can search the url for an
> item first, then if it is still not found, we can use the feed's url, but
> with merging all the parse texts into one Parse object.
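The fallback order suggested in NUTCH-583 (the item's own link, then its globally unique identifier, which RSS 2.0 calls `guid`, then the feed's URL) can be sketched as a small helper. This is a hypothetical illustration, not code from the feed plugin: `FeedItemLinkSketch`, `resolveItemUrl` and `looksLikeUrl` are invented names, and the "absolute URI with a host" test is only one possible way to decide whether a guid is usable as a link.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical illustration: choose a URL for a feed item that may lack a link. */
final class FeedItemLinkSketch {

    /** Prefers the item link, then a URL-like guid, then the URL of the feed itself. */
    static String resolveItemUrl(String link, String guid, String feedUrl) {
        if (link != null && !link.trim().isEmpty()) {
            return link.trim();
        }
        if (guid != null && looksLikeUrl(guid.trim())) {
            return guid.trim();
        }
        return feedUrl;
    }

    /** Treats a value as a URL only if it parses as an absolute URI with a host. */
    static boolean looksLikeUrl(String value) {
        try {
            URI uri = new URI(value);
            return uri.isAbsolute() && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String feed = "http://example.org/feed";
        System.out.println(resolveItemUrl("http://example.org/a", "id-1", feed));
        System.out.println(resolveItemUrl(null, "http://example.org/p?id=1", feed));
        System.out.println(resolveItemUrl(null, "id-1", feed));
    }
}
```

Items that fall through to the feed URL all share one address, which is why the issue proposes merging their parse texts into a single Parse object.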
[jira] Updated: (NUTCH-628) Host database to keep track of host-level information
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-628: Patch Info: [Patch Available] Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 > Host database to keep track of host-level information > - > > Key: NUTCH-628 > URL: https://issues.apache.org/jira/browse/NUTCH-628 > Project: Nutch > Issue Type: New Feature > Components: fetcher, generator >Reporter: Otis Gospodnetic > Attachments: domain_statistics_v2.patch, > NUTCH-628-DomainStatistics.patch, NUTCH-628-HostDb.patch > > > Nutch would benefit from having a DB with per-host/domain/TLD information. > For instance, Nutch could detect hosts that are timing out, store information > about that in this DB. Segment/fetchlist Generator could then skip such > hosts, so they don't slow down the fetch job. Another good use for such a DB > is keeping track of various host scores, e.g. spam score. > From the recent thread on nutch-u...@lucene: > Otis asked: > > While we are at it, how would one go about implementing this DB, as far as > > its structures go? > Andrzej said: > The easiest I can imagine is to use something like . > This way you could store arbitrary information under arbitrary keys. > I.e. a single database then could keep track of aggregate statistics at > different levels, e.g. TLD, domain, host, ip range, etc. The basic set > of statistics could consist of a few predefined gauges, totals and averages. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-650) Hbase Integration
[ https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-650: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 > Hbase Integration > - > > Key: NUTCH-650 > URL: https://issues.apache.org/jira/browse/NUTCH-650 > Project: Nutch > Issue Type: New Feature >Affects Versions: 1.0.0 >Reporter: Doğacan Güney >Assignee: Doğacan Güney > Attachments: hbase-integration_v1.patch, hbase_v2.patch, > malformedurl.patch, meta.patch, meta2.patch, nofollow-hbase.patch, > NUTCH-650.patch, nutch-habase.patch, searching.diff, slash.patch > > > This issue will track nutch/hbase integration -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-716) Make subcollection index field multivalued
[ https://issues.apache.org/jira/browse/NUTCH-716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-716:

Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> Make subcollection index field multivalued
> ---
>
> Key: NUTCH-716
> URL: https://issues.apache.org/jira/browse/NUTCH-716
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Affects Versions: 1.0.0
> Reporter: Dmitry Lihachev
> Attachments: NUTCH-716_multivalued_subcollection.patch
[jira] Updated: (NUTCH-541) Index url field untokenized
[ https://issues.apache.org/jira/browse/NUTCH-541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-541:

Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> Index url field untokenized
> ---
>
> Key: NUTCH-541
> URL: https://issues.apache.org/jira/browse/NUTCH-541
> Project: Nutch
> Issue Type: New Feature
> Components: indexer, searcher
> Affects Versions: 1.0.0
> Reporter: Enis Soztutar
> Assignee: Enis Soztutar
>
> Url field is indexed as Store.YES, Index.TOKENIZED. We also need the
> untokenized version of the url field in some contexts:
> 1. For deleting duplicates by url (at search time). See NUTCH-455.
> 2. For restricting the search to a certain url (may be used in the case of
> RSS search where each entry in the RSS is added as a distinct document with
> (possibly) the same url).
>    query-url extends FieldQueryFilter so:
>    Query: url:http://www.apache.org/
>    Parsed: url:"http http-www http-www-apache www www-apache apache org"
>    Translated: +url:"http-http-www http-www-http-www-apache
>    http-www-apache-www www-www-apache www-apache apache org"
> 3. For accessing document(s) in the search servers (using query plugin).
> I suggest we add url as in index-basic and implement a query-url-untoken
> plugin.
>     doc.add(new Field("url", url.toString(), Field.Store.YES,
>         Field.Index.TOKENIZED));
>     doc.add(new Field("url_untoken", url.toString(), Field.Store.NO,
>         Field.Index.UN_TOKENIZED));
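The motivation in NUTCH-541 for a second, untokenized url field can be shown with a toy comparison. The sketch below uses neither Lucene nor Nutch: `UrlFieldSketch` and its methods are hypothetical, the tokenizer simply splits on non-alphanumeric characters (it does not reproduce the combined terms shown in the issue), and "tokenized match" is modelled as a phrase match on the token lists. Under those assumptions, a tokenized query for one URL also matches longer URLs, while an exact comparison does not.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration of tokenized versus untokenized url matching. */
final class UrlFieldSketch {

    /** Toy tokenizer: lower-cases and splits on anything that is not a letter or digit. */
    static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        for (String part : url.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    /** Phrase-style match: the query tokens appear contiguously in the indexed tokens. */
    static boolean matchesTokenized(String indexedUrl, String queryUrl) {
        return Collections.indexOfSubList(tokenize(indexedUrl), tokenize(queryUrl)) >= 0;
    }

    /** Untokenized match: the whole url is a single term and must be identical. */
    static boolean matchesUntokenized(String indexedUrl, String queryUrl) {
        return indexedUrl.equals(queryUrl);
    }

    public static void main(String[] args) {
        String indexed = "http://www.apache.org/nutch/";
        String query = "http://www.apache.org/";
        System.out.println(matchesTokenized(indexed, query));   // the longer url also matches
        System.out.println(matchesUntokenized(indexed, query)); // only an identical url matches
    }
}
```

This is the behaviour that makes the tokenized field unsuitable for deleting duplicates by url or restricting a search to exactly one url.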
[jira] Updated: (NUTCH-717) Make Nutch Solr integration easier
[ https://issues.apache.org/jira/browse/NUTCH-717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-717: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 > Make Nutch Solr integration easier > -- > > Key: NUTCH-717 > URL: https://issues.apache.org/jira/browse/NUTCH-717 > Project: Nutch > Issue Type: New Feature >Reporter: Sami Siren > > Erik Hatcher proposed we should provide a full solr config dir to be used > with Nutch-Solr. Now we only provide index schema. It would be considerably > easier to setup nutch-solr if we provided the whole conf dir that you could > use with solr like: > java -Dsolr.solr.home= -jar start.jar -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-573) Multiple Domains - Query Search
[ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-573:

Fix Version/s: (was: 1.1)

> Multiple Domains - Query Search
> ---
>
> Key: NUTCH-573
> URL: https://issues.apache.org/jira/browse/NUTCH-573
> Project: Nutch
> Issue Type: Improvement
> Components: searcher
> Affects Versions: 0.9.0
> Environment: All
> Reporter: Rajasekar Karthik
> Assignee: Enis Soztutar
> Attachments: multiTermQuery_v1.patch
>
>
> Searching multiple domains can be done on Lucene, but not that efficiently
> on Nutch.
> Query:
>     +content:"abc" +(site:"www.aaa.com" site:"www.bbb.com")
> works on Lucene but the same concept does not work on Nutch.
> In Lucene, it works with
>     org.apache.lucene.analysis.KeywordAnalyzer
>     org.apache.lucene.analysis.standard.StandardAnalyzer
> but NOT on
>     org.apache.lucene.analysis.SimpleAnalyzer
> Is Nutch analyzer based on SimpleAnalyzer? In this case, is there a
> workaround to make this work? Is there an option to change what analyzer
> Nutch is using?
> Just FYI, another solution (inefficient I believe) which seems to be working
> on Nutch:
>     -site:"ccc.com" -site:"ddd.com"
[jira] Updated: (NUTCH-573) Multiple Domains - Query Search
[ https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-573:

- pushing this out per http://bit.ly/c7tBv9

> Multiple Domains - Query Search
> ---
>
> Key: NUTCH-573
> URL: https://issues.apache.org/jira/browse/NUTCH-573
> Project: Nutch
> Issue Type: Improvement
> Components: searcher
> Affects Versions: 0.9.0
> Environment: All
> Reporter: Rajasekar Karthik
> Assignee: Enis Soztutar
> Attachments: multiTermQuery_v1.patch
>
>
> Searching multiple domains can be done on Lucene, but not that efficiently
> on Nutch.
> Query:
>     +content:"abc" +(site:"www.aaa.com" site:"www.bbb.com")
> works on Lucene but the same concept does not work on Nutch.
> In Lucene, it works with
>     org.apache.lucene.analysis.KeywordAnalyzer
>     org.apache.lucene.analysis.standard.StandardAnalyzer
> but NOT on
>     org.apache.lucene.analysis.SimpleAnalyzer
> Is Nutch analyzer based on SimpleAnalyzer? In this case, is there a
> workaround to make this work? Is there an option to change what analyzer
> Nutch is using?
> Just FYI, another solution (inefficient I believe) which seems to be working
> on Nutch:
>     -site:"ccc.com" -site:"ddd.com"
[jira] Updated: (NUTCH-729) NPE in FieldIndexer when BasicFields url doesn't exist
[ https://issues.apache.org/jira/browse/NUTCH-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-729: Due Date: 26/Mar/09 (was: 26/Mar/09) Patch Info: [Patch Available] Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 > NPE in FieldIndexer when BasicFields url doesn't exist > -- > > Key: NUTCH-729 > URL: https://issues.apache.org/jira/browse/NUTCH-729 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 0.9.0, 1.0.0 > Environment: All >Reporter: Dennis Kubes >Assignee: Dennis Kubes > Attachments: NUTCH-729-1-20090235.patch > > > There is a NullPointerException during a logging call in FieldIndexer when > there isn't a url for a document. Documents shouldn't be without urls but > since the FieldIndexer doesn't validate fields it is possible for it to > occur. Most often this happens when BasicFields is run with the wrong > segments directory and doesn't complain. It could also occur if using the > FieldIndexer to index things other than basic fields. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-460) RDF parser plugin
[ https://issues.apache.org/jira/browse/NUTCH-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-460: Patch Info: [Patch Available] - pushing this out per http://bit.ly/c7tBv9 > RDF parser plugin > - > > Key: NUTCH-460 > URL: https://issues.apache.org/jira/browse/NUTCH-460 > Project: Nutch > Issue Type: New Feature > Components: fetcher >Affects Versions: 0.9.0 >Reporter: Ricardo J. Méndez > Attachments: rubyspider-rdf.zip > > > I've written a couple plugins that I'd like to contribute. > RDFLinkParseFilter looks for links on the pages that point towards RDF > information, and tags the pages with metadata about the type of links they > hold. RDFLinkIndexingFilter indexes said metadata. RDFParser parses RDF > information from several possible formats using Jena, and extracts the links > that the file points to as Outlinks so that they can be fetched as well. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-460) RDF parser plugin
[ https://issues.apache.org/jira/browse/NUTCH-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-460: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 > RDF parser plugin > - > > Key: NUTCH-460 > URL: https://issues.apache.org/jira/browse/NUTCH-460 > Project: Nutch > Issue Type: New Feature > Components: fetcher >Affects Versions: 0.9.0 >Reporter: Ricardo J. Méndez > Attachments: rubyspider-rdf.zip > > > I've written a couple plugins that I'd like to contribute. > RDFLinkParseFilter looks for links on the pages that point towards RDF > information, and tags the pages with metadata about the type of links they > hold. RDFLinkIndexingFilter indexes said metadata. RDFParser parses RDF > information from several possible formats using Jena, and extracts the links > that the file points to as Outlinks so that they can be fetched as well. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-774) Retry interval in crawl date is set to 0
[ https://issues.apache.org/jira/browse/NUTCH-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-774:

Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> Retry interval in crawl date is set to 0
>
> Key: NUTCH-774
> URL: https://issues.apache.org/jira/browse/NUTCH-774
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 1.0.0
> Reporter: Reinhard Schwab
> Assignee: Andrzej Bialecki
> Attachments: NUTCH-774.patch, NUTCH-774_2.patch
>
>
> When I fetch and parse a feed with the feed plugin,
> http://www.wachauclimbing.net/home/impressum-disclaimer/feed/
> another crawl date is generated:
> http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/
> After fetching a second round,
> the dump in the crawl db still shows a retry interval with value 0.
> http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/
> Version: 7
> Status: 2 (db_fetched)
> Fetch time: Wed Dec 02 12:48:22 CET 2009
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 0 seconds (0 days)
> Score: 1.084
> Signature: db9ab2193924cd2d0b53113a500ca604
> Metadata: _pst_: success(1), lastModified=0
> A check should be done in DefaultFetchSchedule (or AbstractFetchSchedule) in
> the method setFetchSchedule.
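The check requested in NUTCH-774 amounts to refusing a non-positive retry interval and falling back to a default before the next fetch time is computed. The sketch below is a stand-alone illustration and not the attached patches: `FetchIntervalGuardSketch` and its methods are hypothetical and do not mirror the real `setFetchSchedule` signature, and the 30-day default is an assumed value used only for the example.

```java
/** Hypothetical illustration of guarding against a zero retry interval. */
final class FetchIntervalGuardSketch {

    /** Assumed default of 30 days, expressed in seconds; illustrative only. */
    static final int DEFAULT_INTERVAL_SECONDS = 30 * 24 * 60 * 60;

    /** Returns the stored interval if it is usable, otherwise the default. */
    static int effectiveIntervalSeconds(int storedIntervalSeconds) {
        return storedIntervalSeconds > 0 ? storedIntervalSeconds : DEFAULT_INTERVAL_SECONDS;
    }

    /** Computes the next fetch time in milliseconds from the last fetch time. */
    static long nextFetchTimeMillis(long lastFetchTimeMillis, int storedIntervalSeconds) {
        return lastFetchTimeMillis + effectiveIntervalSeconds(storedIntervalSeconds) * 1000L;
    }

    public static void main(String[] args) {
        long lastFetch = 1_000_000L;
        // With a stored interval of 0 the page would otherwise be due again immediately.
        System.out.println(nextFetchTimeMillis(lastFetch, 0));
        System.out.println(nextFetchTimeMillis(lastFetch, 3_600));
    }
}
```

Without such a guard, an entry with "Retry interval: 0 seconds" like the one in the dump is eligible for fetching again in every round.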
[jira] Updated: (NUTCH-677) Segment merge filering based on segment content
[ https://issues.apache.org/jira/browse/NUTCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-677: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 > Segment merge filering based on segment content > --- > > Key: NUTCH-677 > URL: https://issues.apache.org/jira/browse/NUTCH-677 > Project: Nutch > Issue Type: Improvement >Affects Versions: 0.9.0 >Reporter: Marcin Okraszewski > Attachments: MergeFilter.patch, MergeFilter_for_1.0.patch, > SegmentMergeFilter.java, SegmentMergeFilter.java, SegmentMergeFilters.java, > SegmentMergeFilters.java > > > I needed a segment filtering based on meta data detected during parse phase. > Unfortunately current URL based filtering does not allow for this. So I have > created a new SegmentMergeFilter extension which receives segment entry which > is being merged and decides if it should be included or not. Even though I > needed only ParseData for my purpose I have done it a bit more general > purpose, so the filter receives all merged data. > The attached patch is for version 0.9 which I use. Unfortunately I didn't > have time to check how it fits to trunk version. Sorry :( -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-479) Support for OR queries
[ https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-479: Patch Info: [Patch Available] Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 > Support for OR queries > -- > > Key: NUTCH-479 > URL: https://issues.apache.org/jira/browse/NUTCH-479 > Project: Nutch > Issue Type: Improvement > Components: searcher >Affects Versions: 1.0.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Attachments: nutch_0.9_OR.patch, or.patch, or.patch > > > There have been many requests from users to extend Nutch query syntax to add > support for OR queries, in addition to the implicit AND and NOT queries > supported now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-747) inject&Index metadatas and inherit these metadatas to all matching suburls
[ https://issues.apache.org/jira/browse/NUTCH-747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-747: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 > inject&Index metadatas and inherit these metadatas to all matching suburls > -- > > Key: NUTCH-747 > URL: https://issues.apache.org/jira/browse/NUTCH-747 > Project: Nutch > Issue Type: Improvement > Components: indexer, injector >Reporter: Marko Bauhardt > Attachments: index-metadata.patch, metadata.patch > > > Hi. > The following two patches support: > + injecting metadata for URLs into a metadatadb > url.com : : > ... > ... > + updating the parse_data metadata from a shard and writing the metadata to all > fetched URLs that start with a URL from the metadatadb > + this patch supports inheritance of metadata by all matching suburls > The second patch implements an index-metadata plugin. > + this plugin extracts all metadata from the parse_data of a shard and indexes > it. Which metadata are indexed can be configured in plugin.properties. > + to index, for example, the lang you have to configure plugin.properties: > lang=STORE,UNTOKENIZED > + that means the index plugin extracts metadata values with the key "lang"; if > they exist, all values are indexed stored and untokenized > Example > create start URLs in "/tmp/urls/start/urls.txt" > http://lucene.apache.org/nutch/apidocs-1.0/index.html > http://lucene.apache.org/nutch/apidocs-0.9/index.html > create metadata URLs in "/tmp/urls/metadata/urls.txt" > http://lucene.apache.org/nutch/apidocs-1.0/ version:1.0 > http://lucene.apache.org/nutch/apidocs-0.9/ version:0.9 > Inject Urls > bin/nutch inject crawldb /tmp/urls/start/ > bin/nutch org.apache.nutch.crawl.metadata.MetadataInjector metadatadb > /tmp/urls/metadata/ > Fetch & Parse & Update > bin/nutch generate crawldb segments > bin/nutch fetch segments/20090806105717/ > bin/nutch org.apache.nutch.crawl.metadata.ParseDataUpdater metadatadb > segments/20090806105717 > bin/nutch updatedb crawldb/ segments/20090806105717/ > Fetch & Parse & Update Again > ... > Index > bin/nutch invertlinks linkdb -dir segments/ > bin/nutch index index crawldb/ linkdb/ segments/20090806105717 > segments/20090806110127 > Check your Index > All urls starting with "http://lucene.apache.org/nutch/apidocs-1.0/ " are > indexed with "version:1.0". > All urls starting with "http://lucene.apache.org/nutch/apidocs-0.9/ " are > indexed with "version:0.9". > This issue is somewhat related to http://issues.apache.org/jira/browse/NUTCH-655 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
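The inheritance rule described above (a fetched URL receives the metadata of every injected URL it starts with) can be illustrated with a small prefix lookup. This is only a sketch under that assumption: PrefixMetadata and its methods are hypothetical and are not the code from the attached patches, which work against a metadatadb and segment parse_data.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the metadatadb described in the issue.
final class PrefixMetadata {
    private final Map<String, Map<String, String>> byPrefix = new LinkedHashMap<>();

    /** Registers one key/value pair for a URL prefix, e.g. "version" -> "1.0". */
    void inject(String urlPrefix, String key, String value) {
        byPrefix.computeIfAbsent(urlPrefix, p -> new LinkedHashMap<>()).put(key, value);
    }

    /** Collects the metadata of every injected prefix that the given URL starts with. */
    Map<String, String> metadataFor(String url) {
        Map<String, String> inherited = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> entry : byPrefix.entrySet()) {
            if (url.startsWith(entry.getKey())) {
                inherited.putAll(entry.getValue());
            }
        }
        return inherited;
    }

    public static void main(String[] args) {
        PrefixMetadata db = new PrefixMetadata();
        db.inject("http://lucene.apache.org/nutch/apidocs-1.0/", "version", "1.0");
        db.inject("http://lucene.apache.org/nutch/apidocs-0.9/", "version", "0.9");
        System.out.println(db.metadataFor("http://lucene.apache.org/nutch/apidocs-1.0/index.html"));
    }
}
```

A linear scan over prefixes is used for brevity; a larger metadatadb would probably want a sorted structure or a trie.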
[jira] Updated: (NUTCH-455) dedup on tokenized fields is faulty
[ https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-455: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 > dedup on tokenized fields is faulty > --- > > Key: NUTCH-455 > URL: https://issues.apache.org/jira/browse/NUTCH-455 > Project: Nutch > Issue Type: Bug > Components: searcher >Affects Versions: 0.9.0 >Reporter: Enis Soztutar > Attachments: IndexSearcherCacheWarm.patch > > > (From LUCENE-252) > Nutch uses several index servers, and the search results from these servers > are merged using a dedup field for deleting duplicates. The values from > this field are cached by Lucene's FieldCacheImpl. The default is the site > field, which is indexed and tokenized. However, for a tokenized field (for > example "url" in Nutch), FieldCacheImpl returns an array of terms rather than > an array of field values, so dedup'ing becomes faulty. The current FieldCache > implementation does not respect tokenized fields, and as described above > caches only terms. > So in the situation that we are searching using "url" as the dedup field, > when a Hit is constructed in IndexSearcher, the dedupValue becomes a token of > the url (such as "www" or "com") rather than the whole url. This prevents > using tokenized fields as the dedup field. > I have written a patch for Lucene and attached it in > http://issues.apache.org/jira/browse/LUCENE-252; this patch fixes the > aforementioned issue about tokenized field caching. However, building such a > cache for about 1.5M documents takes 20+ secs. The code in > IndexSearcher.translateHits() starts with > if (dedupField != null) > dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField); > and on the first call of search in IndexSearcher, the cache is built. > Long story short, I have written a patch against IndexSearcher, which in its > constructor warms up the caches of the wanted fields (configurable). I think we > should vote for LUCENE-252, and then commit the above patch with the latest > version of Lucene. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
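To make the failure mode above concrete, the sketch below compares deduplication on the whole field value with deduplication on a single token of it. It does not use Lucene: the tokenizer is a crude stand-in for an analyzer, and keeping the last token per document is an assumption made only to mimic a cache that holds one term per document.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class DedupKeyDemo {
    /** Crude stand-in for an analyzer: lower-cases and splits on non-alphanumeric characters. */
    static String[] tokenize(String value) {
        return value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
    }

    /** Number of distinct dedup keys when the whole field value is the key. */
    static int distinctByWholeValue(List<String> urls) {
        return new LinkedHashSet<>(urls).size();
    }

    /** Number of distinct dedup keys when only one token (here: the last) is kept per document. */
    static int distinctBySingleToken(List<String> urls) {
        Set<String> keys = new LinkedHashSet<>();
        for (String url : urls) {
            String[] tokens = tokenize(url);
            keys.add(tokens[tokens.length - 1]);
        }
        return keys.size();
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://www.example.com", "http://www.apache.com");
        System.out.println("whole value keys:  " + distinctByWholeValue(urls));
        System.out.println("single token keys: " + distinctBySingleToken(urls));
    }
}
```

With token keys, two unrelated hosts that share a token such as "com" collapse into one dedup group, which is the faulty behaviour the report describes.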
[jira] Updated: (NUTCH-540) some problem about the Nutch cache
[ https://issues.apache.org/jira/browse/NUTCH-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-540: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 > some problem about the Nutch cache > -- > > Key: NUTCH-540 > URL: https://issues.apache.org/jira/browse/NUTCH-540 > Project: Nutch > Issue Type: Bug > Components: searcher >Affects Versions: 0.9.0 > Environment: Red hat AS4 + Tomcat5.5 + Nutch0.9 >Reporter: crossany > Attachments: 1.gif, 1186733525.jpg > > > I am Chinese. > I am testing searches for Chinese words in Nutch. I installed Nutch 0.9 in Tomcat 5 on > Linux; the Tomcat charset is UTF-8, and I used Nutch to crawl a > Chinese website whose charset is also UTF-8. When I use Nutch on > Tomcat to search for a Chinese word, the title and > description in the search results display correctly, but when I click the cache link, the cached page > is displayed with the wrong charset, even though the cached > page's charset is also UTF-8. I found a website that uses Nutch, > http://www.synoo.com:8080/zh/ , and searching for a Chinese word there shows > the same error. > I used Luke to inspect the segments and it can display the Chinese words, so I think this may > be a bug. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-578) URL fetched with 403 is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-578: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 > URL fetched with 403 is generated over and over again > - > > Key: NUTCH-578 > URL: https://issues.apache.org/jira/browse/NUTCH-578 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 1.0.0 > Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I > have checked out the most recent version of the trunk as of Nov 20, 2007 >Reporter: Nathaniel Powell >Assignee: Dennis Kubes > Attachments: crawl-urlfilter.txt, NUTCH-578.patch, > NUTCH-578_v2.patch, NUTCH-578_v3.patch, NUTCH-578_v4.patch, nutch-site.xml, > regex-normalize.xml, urls.txt > > > I have not changed the following parameter in the nutch-default.xml: > <property> > <name>db.fetch.retry.max</name> > <value>3</value> > <description>The maximum number of times a url that has encountered > recoverable errors is generated for fetch.</description> > </property> > However, there is a URL which is on the site that I'm crawling, > www.teachertube.com, which keeps being generated over and over again for > almost every segment (many more times than 3): > fetch of http://www.teachertube.com/images/ failed with: Http code=403, > url=http://www.teachertube.com/images/ > This is a bug, right? > Thanks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-794) Language Identification must use check the parse metadata for language values
[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann resolved NUTCH-794. - Resolution: Fixed @julien -- I think this issue has been fixed in Tika right? If not, feel free to reopen, or better yet, re-file the issue against a post 1.1 Nutch release. Thanks! > Language Identification must use check the parse metadata for language values > -- > > Key: NUTCH-794 > URL: https://issues.apache.org/jira/browse/NUTCH-794 > Project: Nutch > Issue Type: Bug > Components: parser >Reporter: Julien Nioche >Assignee: Julien Nioche > Fix For: 1.1 > > Attachments: NUTCH-794.patch > > > The following HTML document : > document 1 titlejotain > suomeksi > is rendered as the following xhtml by Tika : > xmlns="http://www.w3.org/1999/xhtml";>document 1 > titlejotain suomeksi > with the lang attribute getting lost. The lang is not stored in the metadata > either. > I will open an issue on Tika and modify TestHTMLLanguageParser so that the > tests don't break anymore -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-609) Allow Plugins to be Loaded from Jar File(s)
[ https://issues.apache.org/jira/browse/NUTCH-609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-609: Due Date: 13/Feb/08 (was: 13/Feb/08) Patch Info: [Patch Available] Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 > Allow Plugins to be Loaded from Jar File(s) > --- > > Key: NUTCH-609 > URL: https://issues.apache.org/jira/browse/NUTCH-609 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.0.0 > Environment: All >Reporter: Dennis Kubes >Assignee: Dennis Kubes >Priority: Minor > Attachments: NUTCH-609-1-20080212.patch > > > Currently plugins cannot be loaded from a jar file. Plugins must be unzipped > in one or more directories specified by the plugin.folders config. I have > been thinking about an extension to PluginRepository or PluginManifestParser > (or both) that would allow plugins to be packaged into multiple independent jar > files and placed on the classpath. The system would search the classpath for > resources with the correct folder name and would load any plugins in those > jars. > This functionality would be very useful in making the nutch core more > flexible in terms of packaging. It would also help with web applications > where we don't want to have a plugins directory included in the webapp. > Thoughts so far are unzipping those plugin jars into a common temp directory > before loading. Another option is using something like commons vfs to > interact with the jar files. VFS essentially uses a disk-based temporary cache > for jar files, so it is pretty much the same solution. What are everyone > else's thoughts on this? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-251) Administration GUI
[ https://issues.apache.org/jira/browse/NUTCH-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-251: Patch Info: [Patch Available] Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 (comment from me: would be nice to get this into 1.2) > Administration GUI > -- > > Key: NUTCH-251 > URL: https://issues.apache.org/jira/browse/NUTCH-251 > Project: Nutch > Issue Type: Improvement >Affects Versions: 0.8 >Reporter: Stefan Groschupf >Priority: Minor > Attachments: hadoop_nutch_gui_v1.patch, Nutch-251-AdminGUI.tar.gz, > nutch_gui_plugins_v1.zip, nutch_gui_v1.patch > > > Having a web based administration interface would help to make nutch > administration and management much more user friendly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-477) Extend URLFilters to support different filtering chains
[ https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-477: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 > Extend URLFilters to support different filtering chains > --- > > Key: NUTCH-477 > URL: https://issues.apache.org/jira/browse/NUTCH-477 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.1 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Minor > Attachments: urlfilters.patch > > > I propose to make the following changes to URLFilters: > * extend URLFilters so that they support different filtering rules depending > on the context where they are executed. This functionality mirrors the one > that URLNormalizers already support. > * change their return value to an int code, in order to support early > termination of long filtering chains. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
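The second proposal above, an int return code that lets a long filtering chain terminate early, might be sketched as follows. The interface name, the three codes and the chain class are hypothetical illustrations; they are not the existing URLFilter API and not the contents of urlfilters.patch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical filter contract with an int result instead of a filtered URL string.
interface CodedUrlFilter {
    int CONTINUE = 0; // no decision, ask the next filter
    int ACCEPT = 1;   // accept immediately, skip the remaining filters
    int REJECT = 2;   // reject immediately, skip the remaining filters

    int filter(String url);
}

final class CodedFilterChain {
    private final List<CodedUrlFilter> filters;

    CodedFilterChain(List<CodedUrlFilter> filters) {
        this.filters = filters;
    }

    /** Runs filters in order and stops at the first ACCEPT or REJECT. */
    boolean accepts(String url) {
        for (CodedUrlFilter filter : filters) {
            int code = filter.filter(url);
            if (code == CodedUrlFilter.ACCEPT) {
                return true;
            }
            if (code == CodedUrlFilter.REJECT) {
                return false;
            }
        }
        return true; // assumption: a URL that no filter rejects is accepted
    }

    public static void main(String[] args) {
        CodedUrlFilter noImages = url -> url.endsWith(".gif") ? CodedUrlFilter.REJECT : CodedUrlFilter.CONTINUE;
        CodedUrlFilter sameSite = url -> url.startsWith("http://example.org/") ? CodedUrlFilter.ACCEPT : CodedUrlFilter.REJECT;
        CodedFilterChain chain = new CodedFilterChain(Arrays.asList(noImages, sameSite));
        System.out.println(chain.accepts("http://example.org/index.html"));
        System.out.println(chain.accepts("http://example.org/logo.gif"));
    }
}
```

Context-dependent rules, the first proposal, could then be modelled as one such chain per scope, in the way URLNormalizers are described as already doing.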
[jira] Updated: (NUTCH-564) External parser supports encoding attribute
[ https://issues.apache.org/jira/browse/NUTCH-564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-564: Patch Info: [Patch Available] Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 > External parser supports encoding attribute > --- > > Key: NUTCH-564 > URL: https://issues.apache.org/jira/browse/NUTCH-564 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 0.9.0 > Environment: All >Reporter: Antony Bowesman >Priority: Minor > Attachments: ExtParser_0.9.0.patch, ExtParser_1.0.0.patch > > > When an external component generates text, which is returned to the external > parser, it always converts the text using the default character set. > (os.toString()). For example, the returned text may be utf-8, but will not > be converted to a String correctly. > I added the attribute to the XML in plugin.xml > and this is then used to convert the text. > I have tested my original fix on my local 0.9 and include a patch, but have > also made an untested patch for trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
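The difference between os.toString() and decoding with an explicit character set can be shown with a few lines of plain Java. This is a sketch of the general technique rather than the attached patch: ExtParserEncodingDemo and its decode method are hypothetical, and the encoding string stands for whatever value the new plugin.xml attribute would supply.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExtParserEncodingDemo {
    /**
     * Decodes the bytes captured from an external command.
     * A null encoding falls back to the platform default, which is what os.toString() does.
     */
    static String decode(ByteArrayOutputStream os, String encoding) {
        Charset charset = (encoding == null) ? Charset.defaultCharset() : Charset.forName(encoding);
        return new String(os.toByteArray(), charset);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        os.write(utf8, 0, utf8.length);
        System.out.println(decode(os, "UTF-8"));      // decoded as the external tool wrote it
        System.out.println(decode(os, "ISO-8859-1")); // wrong charset, text comes out garbled
    }
}
```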
[jira] Updated: (NUTCH-750) HtmlParser plugin - page title extraction
[ https://issues.apache.org/jira/browse/NUTCH-750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-750: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 > HtmlParser plugin - page title extraction > - > > Key: NUTCH-750 > URL: https://issues.apache.org/jira/browse/NUTCH-750 > Project: Nutch > Issue Type: Improvement > Components: parser >Affects Versions: 1.0.0 >Reporter: Alexey Torochkov >Priority: Minor > Attachments: SkipBody.patch > > > A little improvement: try to extract the <title> tag from the body if it doesn't > exist in the head. > In the current version DOMContentUtils just skips everything after <body> in the getTitle() > method. > The attached patch allows changing this behavior (by default it doesn't change > anything) and can cope with webmasters' mistakes -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-664) Possibility to update already stored documents.
[ https://issues.apache.org/jira/browse/NUTCH-664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-664: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 > Possibility to update already stored documents. > --- > > Key: NUTCH-664 > URL: https://issues.apache.org/jira/browse/NUTCH-664 > Project: Nutch > Issue Type: Wish >Reporter: Sergey Khilkov >Priority: Minor > > We have a huge index of stored documents. It is a costly procedure to fetch > a page and merge indexes every time we update some information about the page. The > information can change 1-3 times per day. At the moment we have to store > the changed info in a database, but in that case we have lots of problems with > sorting, search restrictions and so on. Lucene itself allows deleting a single > document and adding a new one to an existing index. But there is a problem with > Hadoop... As I understand it, the Hadoop filesystem cannot write at > random positions. But it would be a great feature if Nutch were able to > update an already created index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0
[ https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-673: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 > Upgrade the Carrot2 plug-in to release 3.0 > -- > > Key: NUTCH-673 > URL: https://issues.apache.org/jira/browse/NUTCH-673 > Project: Nutch > Issue Type: Improvement > Components: web gui >Affects Versions: 0.9.0 > Environment: All Nutch deployments. >Reporter: Sean Dean >Priority: Minor > > Release 3.0 of the Carrot2 plug-in was released recently. > We currently have version 2.1 in the source tree and upgrading it to the > latest version before 1.0-release might make sense. > Details on the release can be found here: > http://project.carrot2.org/release-3.0-notes.html > One major change in requirements is for JDK 1.5 to be used, but this is also > now required for Hadoop 0.19 so this wouldn't be the only reason for the > switch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-310) Review Log Levels
[ https://issues.apache.org/jira/browse/NUTCH-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-310: Fix Version/s: (was: 1.1) Assignee: Chris A. Mattmann (was: Jerome Charron) - pushing this out per http://bit.ly/c7tBv9 (and assign to me, I think this can be closed but will wait until after 1.1 to revisit) > Review Log Levels > - > > Key: NUTCH-310 > URL: https://issues.apache.org/jira/browse/NUTCH-310 > Project: Nutch > Issue Type: Improvement >Affects Versions: 0.8 >Reporter: Jerome Charron >Assignee: Chris A. Mattmann >Priority: Minor > > Review of logs content and logs levels (see Commons Logging Best Practices : > http://jakarta.apache.org/commons/logging/guide.html#Message_Priorities_Levels) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-577) Use explicit tika-config.xml file to enable mime magic detection to be turned on and off
[ https://issues.apache.org/jira/browse/NUTCH-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-577: Due Date: 30/Nov/07 (was: 30/Nov/07) Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 > Use explicit tika-config.xml file to enable mime magic detection to be turned > on and off > > > Key: NUTCH-577 > URL: https://issues.apache.org/jira/browse/NUTCH-577 > Project: Nutch > Issue Type: Improvement > Components: mime_type_detector >Affects Versions: 1.0.0 > Environment: Mac Book Pro Intel Core Duo 2.0 Ghz, 2. 0 GB RAM, Mac OS > X 10.4, although improvement is indep. of env. >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann >Priority: Minor > > Currently, there is a configuration file for Tika (which the trunk in Nutch > uses for its mime type detection) called "tika-config.xml" left unexposed (a > default one lives in the tika-0.1-dev.jar file). Tika's mime system has two > config files it relies on: tika-mimetypes.xml (which Nutch has its own > version of, that overrides the version that comes with the tika jar file), > and tika-config.xml (to turn on or off magic char detection). We should > probably have a nutch version of tika-config.xml, so that Nutch users can > employ magic char mime detection. I'll get going on this in the next day or > so. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-763) Separate configuration files from resources to be included in the job file
[ https://issues.apache.org/jira/browse/NUTCH-763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-763: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 > Separate configuration files from resources to be included in the job file > -- > > Key: NUTCH-763 > URL: https://issues.apache.org/jira/browse/NUTCH-763 > Project: Nutch > Issue Type: Wish >Reporter: Julien Nioche >Priority: Minor > > One of the things I found confusing when I was learning Nutch was the fact > that the conf/ directory contains at the same time : > - configuration files for Hadoop / Nutch which are put in the jar files but > not used there > - resource files (e.g. filtering rules) which MUST be up to date in the job > file > I would separate the conf/ directory from say a resources/ directory which > would contain the rule files and other things to put in the job file. Unless > I am mistaken none of the configuration files need to be in the job file. I > know it is a very minor point, but that would probably simplify things and > make it easier for beginners to understand what has to be modified where. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-309) Uses commons logging Code Guards
[ https://issues.apache.org/jira/browse/NUTCH-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-309: Fix Version/s: (was: 1.1) - pushing this out per http://bit.ly/c7tBv9 > Uses commons logging Code Guards > > > Key: NUTCH-309 > URL: https://issues.apache.org/jira/browse/NUTCH-309 > Project: Nutch > Issue Type: Improvement >Affects Versions: 0.8 >Reporter: Jerome Charron >Assignee: Chris A. Mattmann >Priority: Minor > > "Code guards are typically used to guard code that only needs to execute in > support of logging, that otherwise introduces undesirable runtime overhead in > the general case (logging disabled). Examples are multiple parameters, or > expressions (e.g. string + " more") for parameters. Use the guard methods of > the form log.is<Priority>() to verify that logging should be performed, > before incurring the overhead of the logging method call. Yes, the logging > methods will perform the same check, but only after resolving parameters." > (description extracted from > http://jakarta.apache.org/commons/logging/guide.html#Code_Guards) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
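The code-guard pattern quoted above can be sketched as follows. To keep the example self-contained it uses java.util.logging and its isLoggable(Level) check as a stand-in for the commons-logging guard methods; CodeGuardDemo, describe and the call counter are hypothetical and exist only to make the saved work visible.

```java
import java.util.Arrays;
import java.util.logging.Level;
import java.util.logging.Logger;

final class CodeGuardDemo {
    static final Logger LOG = Logger.getLogger(CodeGuardDemo.class.getName());

    /** Counts how often the expensive message-building code actually ran. */
    static int expensiveCalls = 0;

    /** Stand-in for an expensive parameter expression, e.g. dumping a large structure. */
    static String describe(int[] data) {
        expensiveCalls++;
        return Arrays.toString(data);
    }

    static void process(int[] data) {
        // Guard: the string concatenation and describe() only run when FINE is enabled.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("processing " + describe(data));
        }
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.OFF);
        process(new int[] {1, 2, 3});
        System.out.println("expensive calls with logging off: " + expensiveCalls);
    }
}
```

Without the guard, describe(data) would be evaluated on every call and the result discarded whenever the level is disabled, which is the overhead the quoted guide warns about.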
[jira] Updated: (NUTCH-249) black- white list url filtering
[ https://issues.apache.org/jira/browse/NUTCH-249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-249: Fix Version/s: (was: 1.1) - push out per http://bit.ly/c7tBv9 > black- white list url filtering > --- > > Key: NUTCH-249 > URL: https://issues.apache.org/jira/browse/NUTCH-249 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Affects Versions: 0.8 >Reporter: Stefan Groschupf >Assignee: Dennis Kubes >Priority: Trivial > Attachments: blackWhiteListV2.patch, blackWhiteListV3.patch, bw.patch > > > Existing url filter mechanisms need to process each url against each filter > pattern. For very large filter sets this may not scale very well. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-801) Remove RTF and MP3 parse plugins
[ https://issues.apache.org/jira/browse/NUTCH-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843576#action_12843576 ] Chris A. Mattmann commented on NUTCH-801: - +1 on this from me, Julien. Sounds good. > Remove RTF and MP3 parse plugins > > > Key: NUTCH-801 > URL: https://issues.apache.org/jira/browse/NUTCH-801 > Project: Nutch > Issue Type: Improvement > Components: parser >Affects Versions: 1.0.0 >Reporter: Julien Nioche > Fix For: 1.1 > > > *Parse-rtf* and *parse-mp3* are not built by default due to licensing > issues. Since we now have *parse-tika* to handle these formats I would be in > favour of removing these 2 plugins altogether to keep things nice and simple. > The other plugins will probably be phased out only after the release of 1.1 > when parse-tika will have been tested a lot more. > Any reasons not to? > Julien -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-790) Some external javadoc links are broken
[ https://issues.apache.org/jira/browse/NUTCH-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833591#action_12833591 ] Chris A. Mattmann commented on NUTCH-790: - +1 to commit this. Thanks, Sami! > Some external javadoc links are broken > -- > > Key: NUTCH-790 > URL: https://issues.apache.org/jira/browse/NUTCH-790 > Project: Nutch > Issue Type: Improvement > Components: build >Reporter: Sami Siren >Assignee: Sami Siren >Priority: Trivial > Attachments: NUTCH-790.patch > > > Nutch javadoc links for lucene and hadoop are broken. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832866#action_12832866 ] Chris A. Mattmann commented on NUTCH-766: - - forgot to add in dep libs, added in r909269. Thanks! > Tika parser > --- > > Key: NUTCH-766 > URL: https://issues.apache.org/jira/browse/NUTCH-766 > Project: Nutch > Issue Type: New Feature >Reporter: Julien Nioche >Assignee: Chris A. Mattmann > Fix For: 1.1 > > Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, > sample.tar.gz, TikaParser.java > > > Tika handles a lot of different formats under the bonnet and exposes them > nicely via SAX events. What is described here is a tika-parser plugin which > delegates the parsing mechanism of Tika but can still coexist with the > existing parsing plugins which is useful for formats partially handled by > Tika (or not at all). Some of the elements below have already been discussed > on the mailing lists. Note that this is work in progress, your feedback is > welcome. > Tika is already used by Nutch for its MimeType implementations. Tika comes as > different jar files (core and parsers), in the work described here we decided > to put the libs in 2 different places > NUTCH_HOME/lib : tika-core.jar > NUTCH_HOME/tika-plugin/lib : tika-parsers.jar > Tika being used by the core only for its Mimetype functionalities we only > need to put tika-core at the main lib level whereas the tika plugin obviously > needs the tika-parsers.jar + all the jars used internally by Tika > Due to limitations in the way Tika loads its classes, we had to duplicate the > TikaConfig class in the tika-plugin. This might be fixed in the future in > Tika itself or avoided by refactoring the mimetype part of Nutch using > extension points. 
> Unlike most other parsers, Tika handles more than one Mime-type which is why > we are using "*" as its mimetype value in the plugin descriptor and have > modified ParserFactory.java so that it considers the tika parser as > potentially suitable for all mime-types. In practice this means that the > associations between a mime type and a parser plugin as defined in > parse-plugins.xml are useful only for the cases where we want to handle a > mime type with a different parser than Tika. > The general approach I chose was to convert the SAX events returned by the > Tika parsers into DOM objects and reuse the utilities that come with the > current HTML parser i.e. link detection, metatag handling but also means > that we can use the HTMLParseFilters in exactly the same way. The main > difference though is that HTMLParseFilters are not limited to HTML documents > anymore as the XHTML tags returned by Tika can correspond to a different > format for the original document. There is a duplication of code with the > html-plugin which will be resolved by either a) getting rid of the > html-plugin altogether or b) exporting its jar and make the tika parser > depend on it. > The following libraries are required in the lib/ directory of the tika-parser > : > > > > > > > > > > > > > > > > > > > > > > > There is a small test suite which needs to be improved. We will need to have > a look at each individual format and check that it is covered by Tika and if > so to the same extent; the Wiki is probably the right place for this. The > language identifier (which is a HTMLParseFilter) seemed to work fine. > > Again, your comments are welcome. Please bear in mind that this is just a > first step. > Julien > http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-789) Improvements to Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-789: Attachment: NutchTikaConfig.java TikaParser.java - updates contributed by Sami. I'll generate a diff and then re-attach. > Improvements to Tika parser > --- > > Key: NUTCH-789 > URL: https://issues.apache.org/jira/browse/NUTCH-789 > Project: Nutch > Issue Type: Improvement > Components: fetcher > Environment: reported by Sami, in NUTCH-766 >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann >Priority: Minor > Fix For: 1.1 > > Attachments: NutchTikaConfig.java, TikaParser.java > > > As reported by Sami in NUTCH-766, Sami has a few improvements he made to the > Tika parser. We'll track that progress here. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-789) Improvements to Tika parser
Improvements to Tika parser --- Key: NUTCH-789 URL: https://issues.apache.org/jira/browse/NUTCH-789 Project: Nutch Issue Type: Improvement Components: fetcher Environment: reported by Sami, in NUTCH-766 Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.1 As reported by Sami in NUTCH-766, Sami has a few improvements he made to the Tika parser. We'll track that progress here. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann resolved NUTCH-766. - Resolution: Fixed - committed in r909268. Added comments in nutch-default.xml near the parse-tika plugin.includes enable block. Sami, I'll create a new issue now to track your proposed updates to the Tika parser. I ran unit tests with the patch I committed, and they all passed. Thanks, Julien! > Tika parser > --- > > Key: NUTCH-766 > URL: https://issues.apache.org/jira/browse/NUTCH-766 > Project: Nutch > Issue Type: New Feature >Reporter: Julien Nioche >Assignee: Chris A. Mattmann > Fix For: 1.1 > > Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, > sample.tar.gz, TikaParser.java > > > Tika handles a lot of different formats under the bonnet and exposes them > nicely via SAX events. What is described here is a tika-parser plugin which > delegates the parsing mechanism to Tika but can still coexist with the > existing parsing plugins which is useful for formats partially handled by > Tika (or not at all). Some of the elements below have already been discussed > on the mailing lists. Note that this is work in progress, your feedback is > welcome. > Tika is already used by Nutch for its MimeType implementations. Tika comes as > different jar files (core and parsers), in the work described here we decided > to put the libs in 2 different places > NUTCH_HOME/lib : tika-core.jar > NUTCH_HOME/tika-plugin/lib : tika-parsers.jar > Tika being used by the core only for its Mimetype functionalities we only > need to put tika-core at the main lib level whereas the tika plugin obviously > needs the tika-parsers.jar + all the jars used internally by Tika > Due to limitations in the way Tika loads its classes, we had to duplicate the > TikaConfig class in the tika-plugin. This might be fixed in the future in > Tika itself or avoided by refactoring the mimetype part of Nutch using > extension points. 
> Unlike most other parsers, Tika handles more than one Mime-type which is why > we are using "*" as its mimetype value in the plugin descriptor and have > modified ParserFactory.java so that it considers the tika parser as > potentially suitable for all mime-types. In practice this means that the > associations between a mime type and a parser plugin as defined in > parse-plugins.xml are useful only for the cases where we want to handle a > mime type with a different parser than Tika. > The general approach I chose was to convert the SAX events returned by the > Tika parsers into DOM objects and reuse the utilities that come with the > current HTML parser i.e. link detection, metatag handling but also means > that we can use the HTMLParseFilters in exactly the same way. The main > difference though is that HTMLParseFilters are not limited to HTML documents > anymore as the XHTML tags returned by Tika can correspond to a different > format for the original document. There is a duplication of code with the > html-plugin which will be resolved by either a) getting rid of the > html-plugin altogether or b) exporting its jar and make the tika parser > depend on it. > The following libraries are required in the lib/ directory of the tika-parser > : > > > > > > > > > > > > > > > > > > > > > > > There is a small test suite which needs to be improved. We will need to have > a look at each individual format and check that it is covered by Tika and if > so to the same extent; the Wiki is probably the right place for this. The > language identifier (which is a HTMLParseFilter) seemed to work fine. > > Again, your comments are welcome. Please bear in mind that this is just a > first step. > Julien > http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
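Editor's sketch of the SAX-to-DOM approach described in the issue text above, using only JDK classes. It is not the plugin's code: to stay self-contained, a plain SAX parser reading a small XHTML string stands in for a Tika parser as the event source, and the class and method names (`SaxToDomSketch`, `toDom`, `countLinks`) are hypothetical. Once the events are in a DOM tree, DOM-based utilities such as link detection can run on the result regardless of the original document format.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical illustration only; a stand-in SAX source replaces Tika here.
public class SaxToDomSketch {

    /** Replays the SAX events produced for the given XHTML string into a DOM Document. */
    static Document toDom(String xhtml) throws Exception {
        // An identity TransformerHandler is a SAX ContentHandler that writes into a Result.
        SAXTransformerFactory stf = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = stf.newTransformerHandler();

        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        handler.setResult(new DOMResult(doc));

        // Stand-in event source: any component that emits SAX events could drive the handler.
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return doc;
    }

    /** Trivial DOM-side utility: counts anchor elements, as a link detector might start out. */
    static int countLinks(Document doc) {
        return doc.getElementsByTagName("a").getLength();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>See <a href=\"http://example.org/a\">A</a> and "
                + "<a href=\"http://example.org/b\">B</a>.</p></body></html>";
        Document doc = toDom(xhtml);
        System.out.println("links found: " + countLinks(doc));
    }
}
```

The handler-based design keeps the producer of the events and the DOM consumer decoupled, which is what allows filters written against DOM nodes to be reused for non-HTML inputs.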
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832588#action_12832588 ] Chris A. Mattmann commented on NUTCH-766: - @Julien: Sigh, no I didn't! :( That's probably why! Thanks for the help. I'll try it later today. If that passes, my +1 to commit. @Sami, regarding your updates, would you be OK with me creating another issue to track them, attaching your diffs as patches against this issue, once committed to the trunk? That way we'll make sure they get into 1.1, but we won't block this issue anymore from getting in. Let me know what you think, thanks. Cheers, Chris > Tika parser > --- > > Key: NUTCH-766 > URL: https://issues.apache.org/jira/browse/NUTCH-766 > Project: Nutch > Issue Type: New Feature >Reporter: Julien Nioche >Assignee: Chris A. Mattmann > Fix For: 1.1 > > Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, > sample.tar.gz, TikaParser.java
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832565#action_12832565 ] Chris A. Mattmann commented on NUTCH-766: - Hi Julien: {quote} @Chris : I just did a fresh co from svn, applied the patch v3 and unzipped sample.tar.gz onto the directory parse-tika and ran the test just as you did but could not reproduce the problem. Could there be a difference between your version and the trunk? {quote} I tried this process last night: 1. SVN up to r908832 2. download patch v3 3. download sample.tgz 4. apply patch v3 to r908832 5. untar sample.tgz into src/plugin/parse-tika, creating a sample folder in that dir 6. ant clean compile-core test Any idea why I'm seeing the error? Cheers, Chris > Tika parser > --- > > Key: NUTCH-766 > URL: https://issues.apache.org/jira/browse/NUTCH-766 > Project: Nutch > Issue Type: New Feature >Reporter: Julien Nioche >Assignee: Chris A. Mattmann > Fix For: 1.1 > > Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, > sample.tar.gz, TikaParser.java
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832398#action_12832398 ] Chris A. Mattmann commented on NUTCH-766: - I'm going to hold off on committing this tonight. I've updated the docs per Andrzej, and I've also updated CHANGES.txt, but when running: {code} ant clean compile-core test {code} I'm seeing these messages during plugin testing for parse-tika: {noformat} 2010-02-10 22:39:16,593 ERROR tika.TikaParser (TikaParser.java:getParse(63)) - Can't retrieve Tika parser for mime-type application/pdf - --- Testcase: testIt took 2.684 sec FAILED null junit.framework.AssertionFailedError at org.apache.nutch.tika.TestPdfParser.testIt(TestPdfParser.java:79) {noformat} It seems that the TikaConfig is not being found? I was looking at TikaParser#setConf and it seems that a default config is being created for Tika, but maybe not being loaded correctly? I need to look into this more... > Tika parser > --- > > Key: NUTCH-766 > URL: https://issues.apache.org/jira/browse/NUTCH-766 > Project: Nutch > Issue Type: New Feature >Reporter: Julien Nioche >Assignee: Chris A. Mattmann > Fix For: 1.1 > > Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832255#action_12832255 ] Chris A. Mattmann commented on NUTCH-766: - {quote} +1 to commit this... {quote} Awesome, Andrzej. Will do so tonight, PST, if I don't hear any objections between now and then... Thanks! Cheers, Chris > Tika parser > --- > > Key: NUTCH-766 > URL: https://issues.apache.org/jira/browse/NUTCH-766 > Project: Nutch > Issue Type: New Feature >Reporter: Julien Nioche >Assignee: Chris A. Mattmann > Fix For: 1.1 > > Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804546#action_12804546 ] Chris A. Mattmann commented on NUTCH-766: - Hi Sami: {quote} Chris, can you please explain me how keeping two components doing identical work would be more backwards compatible than having only 1? {quote} Sure, it's more of a configuration backwards-compat issue. For those folks who have gone to the trouble of customizing their Nutch configuration (nutch-site.xml, nutch-default.xml, or even parse-plugins), removing the parsing plugins (basically saying they don't exist anymore, so update your deployed configuration to use the tika-plugin) would require a configuration update in their deployed environments. Because of that, why don't we ease them into that upgrade with at least one released version before the plugins go away? It would make it easier from a configuration backwards-compat perspective. HTH, Chris > Tika parser > --- > > Key: NUTCH-766 > URL: https://issues.apache.org/jira/browse/NUTCH-766 > Project: Nutch > Issue Type: New Feature >Reporter: Julien Nioche >Assignee: Chris A. Mattmann > Fix For: 1.1 > > Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
[jira] Issue Comment Edited: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803709#action_12803709 ] Chris A. Mattmann edited comment on NUTCH-766 at 1/22/10 2:38 PM: -- {quote} Sure, but it would be silly to block the whole Tika plugin because Tika does not support such or such format as well as the original Nutch plugins. As I explained above we can configure which parser to use for which mimetype and use the Tika-plugin by default. Hopefully the Tika implementation will get better and better and there will be no need for keeping the old plugins. {quote} +1, I'm going to agree on this one here Julien. Other communities ;) have convinced me of the need for backwards compat and unobtrusiveness when bringing in new functionality or results. +1 to at least in Nutch 1.1 leaving the old plugins (perhaps mentioning they should be deprecated and replaced by the Tika functionality) and then removing them in 1.2 or 1.3. I got bogged down with my paid job, but I found some Apache time recently so this is tops on my list to tackle. Cheers, Chris was (Author: chrismattmann): {quote} Sure, but it would be silly to block the whole Tika plugin because Tika does not support such or such format as well as the original Nutch plugins. As I explained above we can configure which parser to use for which mimetype and use the Tika-plugin by default. Hopefully the Tika implementation will get better and better and there will be no need for keeping the old plugins. {quote} +1, I'm going to agree on this one here Julien. Other communities ;) have convinced me of the need for backwards compat and unobtrusiveness when bringing in new functionality or results. +1 to at least in Nutch 1.1 leaving the old plugins (perhaps mentioning they should be deprecated and replace by the Tika functionality) and then removing them in 1.2 or 1.3. I got bogged down with my paid job, but I found some Apache time recently so this is tops on my list to tackle. Cheers, Chris > Tika parser > --- > > Key: NUTCH-766 > URL: https://issues.apache.org/jira/browse/NUTCH-766 > Project: Nutch > Issue Type: New Feature >Reporter: Julien Nioche >Assignee: Chris A. Mattmann > Fix For: 1.1 > > Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803709#action_12803709 ]

Chris A. Mattmann commented on NUTCH-766:
-----------------------------------------

{quote}
Sure, but it would be silly to block the whole Tika plugin because Tika does not support such or such format as well as the original Nutch plugins. As I explained above we can configure which parser to use for which mimetype and use the Tika-plugin by default. Hopefully the Tika implementation will get better and better and there will be no need for keeping the old plugins.
{quote}

+1, I'm going to agree on this one here Julien. Other communities ;) have convinced me of the need for backwards compat and unobtrusiveness when bringing in new functionality or results. +1 to leaving the old plugins in place at least in Nutch 1.1 (perhaps mentioning that they should be deprecated and replaced by the Tika functionality) and then removing them in 1.2 or 1.3.

I got bogged down with my paid job, but I found some Apache time recently so this is tops on my list to tackle.

Cheers,
Chris

> Tika parser
> -----------
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
> Issue Type: New Feature
> Reporter: Julien Nioche
> Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
>
>
> Tika handles a lot of different formats under the bonnet and exposes them nicely via SAX events. What is described here is a tika-parser plugin which delegates the parsing to Tika but can still coexist with the existing parsing plugins, which is useful for formats partially handled by Tika (or not at all). Some of the elements below have already been discussed on the mailing lists. Note that this is work in progress, your feedback is welcome.
>
> Tika is already used by Nutch for its MimeType implementations. Tika comes as different jar files (core and parsers); in the work described here we decided to put the libs in 2 different places:
> NUTCH_HOME/lib : tika-core.jar
> NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Since the core uses Tika only for its Mimetype functionalities, we only need to put tika-core at the main lib level, whereas the tika plugin obviously needs the tika-parsers.jar + all the jars used internally by Tika.
>
> Due to limitations in the way Tika loads its classes, we had to duplicate the TikaConfig class in the tika-plugin. This might be fixed in the future in Tika itself or avoided by refactoring the mimetype part of Nutch using extension points.
>
> Unlike most other parsers, Tika handles more than one Mime-type, which is why we are using "*" as its mimetype value in the plugin descriptor and have modified ParserFactory.java so that it considers the tika parser as potentially suitable for all mime-types. In practice this means that the associations between a mime type and a parser plugin as defined in parse-plugins.xml are useful only for the cases where we want to handle a mime type with a different parser than Tika.
>
> The general approach I chose was to convert the SAX events returned by the Tika parsers into DOM objects and reuse the utilities that come with the current HTML parser, i.e. link detection and metatag handling; it also means that we can use the HTMLParseFilters in exactly the same way. The main difference though is that HTMLParseFilters are not limited to HTML documents anymore, as the XHTML tags returned by Tika can correspond to a different format for the original document. There is a duplication of code with the html-plugin which will be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar and making the tika parser depend on it.
>
> The following libraries are required in the lib/ directory of the tika-parser:
> [library list not preserved in this archive]
>
> There is a small test suite which needs to be improved. We will need to have a look at each individual format and check that it is covered by Tika and if so to the same extent; the Wiki is probably the right place for this. The language identifier (which is a HTMLParseFilter) seemed to work fine.
>
> Again, your comments are welcome. Please bear in mind that this is just a first step.
> Julien
> http://www.digitalpebble.com

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
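The SAX-to-DOM step mentioned in the description can be illustrated with the JDK alone. The following is only a sketch, not the code from the attached patches: the class name `SaxToDomSketch` and its helper methods are hypothetical, and a plain JDK XML parser stands in for a Tika parser as the source of SAX events. It replays the events into a DOM tree through an identity `TransformerHandler` and then walks the tree for link targets, roughly the kind of work the HTML parser utilities do.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical illustration class; not part of Nutch or Tika.
class SaxToDomSketch {

    /** Feeds the SAX events generated for the given XHTML into a DOM Document. */
    static Document toDom(String xhtml) throws Exception {
        // An identity TransformerHandler is a ContentHandler that copies
        // every SAX event it receives into the configured Result.
        SAXTransformerFactory tf = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = tf.newTransformerHandler();
        DOMResult result = new DOMResult();
        handler.setResult(result);

        // Assumption: a JDK SAX parser plays the role of the event source here.
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xhtml)));

        return (Document) result.getNode();
    }

    /** Collects the href attribute of every <a> element in document order. */
    static List<String> links(Document doc) {
        List<String> hrefs = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            if (a.hasAttribute("href")) {
                hrefs.add(a.getAttribute("href"));
            }
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<a href=\"http://nutch.apache.org/\">Nutch</a>"
                + "<p>text</p><a href=\"/about\">About</a></body></html>";
        System.out.println(links(toDom(xhtml)));
    }
}
```

Once the events are in a DOM tree, any utility written against `org.w3c.dom` can be reused regardless of the original document format, which is the point the description makes about HTMLParseFilters.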
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798718#action_12798718 ] Chris A. Mattmann commented on NUTCH-766: - Hi Julien: I have had a look and was trying to test it out but got sidetracked. Give me this week to try and put together a final reviewable/commitable patch, otherwise, it's all yours. Cheers, Chris > Tika parser > --- > > Key: NUTCH-766 > URL: https://issues.apache.org/jira/browse/NUTCH-766 > Project: Nutch > Issue Type: New Feature >Reporter: Julien Nioche >Assignee: Chris A. Mattmann > Fix For: 1.1 > > Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-766: Fix Version/s: 1.1 > Tika parser > --- > > Key: NUTCH-766 > URL: https://issues.apache.org/jira/browse/NUTCH-766 > Project: Nutch > Issue Type: New Feature >Reporter: Julien Nioche >Assignee: Chris A. Mattmann > Fix For: 1.1 > > Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
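The NUTCH-766 description says the Tika parser registers with "*" as its mimetype, so that the explicit associations in parse-plugins.xml matter only where a different parser should win. The lookup order this implies can be sketched as follows. This is an assumption about the behaviour, not the patched ParserFactory.java: the class `ParserSelectionSketch`, its methods, and the plugin ids used in the usage note are hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not the Nutch ParserFactory.
class ParserSelectionSketch {

    /** Mimetype value under which a catch-all parser is registered. */
    static final String WILDCARD = "*";

    // Maps a mimetype (or WILDCARD) to the id of the parser plugin to use.
    private final Map<String, String> parserByMimeType = new HashMap<>();

    /** Registers a parser for one mimetype, or for all of them via WILDCARD. */
    void register(String mimeType, String parserId) {
        parserByMimeType.put(mimeType, parserId);
    }

    /**
     * Returns the parser explicitly associated with the mimetype if there is
     * one, otherwise the wildcard parser, otherwise null.
     */
    String parserFor(String mimeType) {
        String explicit = parserByMimeType.get(mimeType);
        if (explicit != null) {
            return explicit;
        }
        return parserByMimeType.get(WILDCARD);
    }
}
```

With `register("*", "parse-tika")` and `register("text/html", "parse-html")`, a lookup for `text/html` yields the explicit parser while `application/pdf` falls through to the wildcard entry, which is how the old plugins could keep handling selected formats.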
[jira] Resolved: (NUTCH-777) Upgrading to jetty6 broke unit tests
[ https://issues.apache.org/jira/browse/NUTCH-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann resolved NUTCH-777. - Resolution: Fixed - fixed in r892350 > Upgrading to jetty6 broke unit tests > > > Key: NUTCH-777 > URL: https://issues.apache.org/jira/browse/NUTCH-777 > Project: Nutch > Issue Type: Bug > Components: build > Environment: My MacBook pro, JDK 1.6.0. >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann > Fix For: 1.1 > > > It seems that somewhere down the line, there was an upgrade to jetty6, which > broke unit tests, specifically TestFetcher and CrawlDBTestUtil. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-777) Upgrading to jetty6 broke unit tests
[ https://issues.apache.org/jira/browse/NUTCH-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792579#action_12792579 ]

Chris A. Mattmann commented on NUTCH-777:
-----------------------------------------

Okay with the changes I'm about to commit, we have:

{noformat}
copy-generated-lib:

test:
     [echo] Testing plugin: urlnormalizer-regex
    [junit] Running org.apache.nutch.net.urlnormalizer.regex.TestRegexURLNormalizer
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.286 sec
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 6.802 sec

test:

BUILD SUCCESSFUL
Total time: 5 minutes 52 seconds
[chipotle:~/src/nutch] mattmann%
{noformat}

Yay!

> Upgrading to jetty6 broke unit tests
> ------------------------------------
>
> Key: NUTCH-777
> URL: https://issues.apache.org/jira/browse/NUTCH-777
> Project: Nutch
> Issue Type: Bug
> Components: build
> Environment: My MacBook pro, JDK 1.6.0.
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Fix For: 1.1
>
>
> It seems that somewhere down the line, there was an upgrade to jetty6, which broke unit tests, specifically TestFetcher and CrawlDBTestUtil.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-777) Upgrading to jetty6 broke unit tests
[ https://issues.apache.org/jira/browse/NUTCH-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792566#action_12792566 ] Chris A. Mattmann commented on NUTCH-777: - I found this page, which shows the mapping from Jetty5 (which the Nutch test code used to depend on), to Jetty6: http://docs.codehaus.org/display/JETTY/Porting+to+jetty6 > Upgrading to jetty6 broke unit tests > > > Key: NUTCH-777 > URL: https://issues.apache.org/jira/browse/NUTCH-777 > Project: Nutch > Issue Type: Bug > Components: build > Environment: My MacBook pro, JDK 1.6.0. >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann > Fix For: 1.1 > > > It seems that somewhere down the line, there was an upgrade to jetty6, which > broke unit tests, specifically TestFetcher and CrawlDBTestUtil. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-777) Upgrading to jetty6 broke unit tests
[ https://issues.apache.org/jira/browse/NUTCH-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792565#action_12792565 ]

Chris A. Mattmann commented on NUTCH-777:
-----------------------------------------

Here is what I was getting with the latest Nutch trunk:

{noformat}
compile:

job:
      [jar] Building jar: /Users/mattmann/src/nutch/build/nutch-1.0.job

compile-core-test:
    [javac] Compiling 43 source files to /Users/mattmann/src/nutch/build/test/classes
    [javac] /Users/mattmann/src/nutch/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java:33: package org.mortbay.http does not exist
    [javac] import org.mortbay.http.HttpContext;
    [javac]     ^
    [javac] /Users/mattmann/src/nutch/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java:34: package org.mortbay.http does not exist
    [javac] import org.mortbay.http.SocketListener;
    [javac]     ^
    [javac] /Users/mattmann/src/nutch/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java:35: package org.mortbay.http.handler does not exist
    [javac] import org.mortbay.http.handler.ResourceHandler;
    [javac]     ^
    [javac] /Users/mattmann/src/nutch/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java:134: cannot find symbol
    [javac] symbol : class SocketListener
    [javac] location: class org.apache.nutch.crawl.CrawlDBTestUtil
    [javac] SocketListener listener = new SocketListener();
    [javac]     ^
    [javac] /Users/mattmann/src/nutch/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java:134: cannot find symbol
    [javac] symbol : class SocketListener
    [javac] location: class org.apache.nutch.crawl.CrawlDBTestUtil
    [javac] SocketListener listener = new SocketListener();
    [javac]     ^
    [javac] /Users/mattmann/src/nutch/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java:138: cannot find symbol
    [javac] symbol : class HttpContext
    [javac] location: class org.apache.nutch.crawl.CrawlDBTestUtil
    [javac] HttpContext staticContext = new HttpContext();
    [javac]     ^
..snip...
    [javac] /Users/mattmann/src/nutch/src/test/org/apache/nutch/fetcher/TestFetcher.java:167: cannot find symbol
    [javac] symbol : method getListeners()
    [javac] location: class org.mortbay.jetty.Server
    [javac] urls.add("http://127.0.0.1:" + server.getListeners()[0].getPort() + "/" + page);
    [javac]     ^
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] Note: Some input files use unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.
    [javac] 9 errors

BUILD FAILED
/Users/mattmann/src/nutch/build.xml:229: Compile failed; see the compiler error output for details.

Total time: 37 seconds
{noformat}

> Upgrading to jetty6 broke unit tests
> ------------------------------------
>
> Key: NUTCH-777
> URL: https://issues.apache.org/jira/browse/NUTCH-777
> Project: Nutch
> Issue Type: Bug
> Components: build
> Environment: My MacBook pro, JDK 1.6.0.
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Fix For: 1.1
>
>
> It seems that somewhere down the line, there was an upgrade to jetty6, which broke unit tests, specifically TestFetcher and CrawlDBTestUtil.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Work started: (NUTCH-777) Upgrading to jetty6 broke unit tests
[ https://issues.apache.org/jira/browse/NUTCH-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-777 started by Chris A. Mattmann. > Upgrading to jetty6 broke unit tests > > > Key: NUTCH-777 > URL: https://issues.apache.org/jira/browse/NUTCH-777 > Project: Nutch > Issue Type: Bug > Components: build > Environment: My MacBook pro, JDK 1.6.0. >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann > Fix For: 1.1 > > > It seems that somewhere down the line, there was an upgrade to jetty6, which > broke unit tests, specifically TestFetcher and CrawlDBTestUtil. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-777) Upgrading to jetty6 broke unit tests
Upgrading to jetty6 broke unit tests Key: NUTCH-777 URL: https://issues.apache.org/jira/browse/NUTCH-777 Project: Nutch Issue Type: Bug Components: build Environment: My MacBook pro, JDK 1.6.0. Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.1 It seems that somewhere down the line, there was an upgrade to jetty6, which broke unit tests, specifically TestFetcher and CrawlDBTestUtil. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Work started: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-766 started by Chris A. Mattmann. > Tika parser > --- > > Key: NUTCH-766 > URL: https://issues.apache.org/jira/browse/NUTCH-766 > Project: Nutch > Issue Type: New Feature >Reporter: Julien Nioche >Assignee: Chris A. Mattmann > Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned NUTCH-766: --- Assignee: Chris A. Mattmann > Tika parser > --- > > Key: NUTCH-766 > URL: https://issues.apache.org/jira/browse/NUTCH-766 > Project: Nutch > Issue Type: New Feature >Reporter: Julien Nioche >Assignee: Chris A. Mattmann > Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-185) XMLParser is configurable xml parser plugin.
[ https://issues.apache.org/jira/browse/NUTCH-185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved NUTCH-185.
------------------------------------

Resolution: Won't Fix
Fix Version/s: 1.1

See comments related to NUTCH-767 in this issue's comments section. Once we address NUTCH-767, we get this functionality for free...

> XMLParser is configurable xml parser plugin.
> --------------------------------------------
>
> Key: NUTCH-185
> URL: https://issues.apache.org/jira/browse/NUTCH-185
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher, indexer
> Affects Versions: 0.7.2, 0.8, 0.8.1
> Environment: OS Independent
> Reporter: Rida Benjelloun
> Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: parse-xml.patch, parse-xml.zip, parse-xml.zip
>
>
> The XML parser is a configurable plugin. It uses XPath and namespaces to do the mapping between the XML elements and Lucene fields.
> Information:
> 1- Copy "xmlparser-conf.xml" to the nutch/conf dir.
> 2- To index your custom XML file, you have to modify "xmlparser-conf.xml".
> This parser uses namespaces and XPath to parse XML content.
> The config file does the mapping between the XML nodes (using XPath) and Lucene fields.
> Example : [XML markup not preserved in this archive]
> 3- The xmlIndexerProperties element encapsulates a set of fields associated with a namespace.
> If the namespace is found in the XML document, the fields represented by the namespace will be indexed.
> Example : [XML markup not preserved in this archive; only the namespace URI http://purl.org/dc/elements/1.1/ survives]
> 4- It is possible to define a default namespace that will be applied when the parser does not find any namespace in the document, or when the namespace found in the XML document does not match the namespace defined in the xmlIndexerProperties.
> Example : [XML markup not preserved in this archive]

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
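The mapping idea in the NUTCH-185 description (XPath expressions plus a namespace selecting which values become index fields) can be sketched with the JDK's `javax.xml.xpath` API. This is not the plugin's code and does not reproduce the format of xmlparser-conf.xml, which is not preserved in this message. The class `XPathFieldMappingSketch`, the `dc` prefix binding, the field names and the sample document are all assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of the parse-xml plugin.
class XPathFieldMappingSketch {

    /** Builds a NamespaceContext that binds exactly one prefix to one URI. */
    static NamespaceContext singlePrefix(final String prefix, final String uri) {
        return new NamespaceContext() {
            public String getNamespaceURI(String p) {
                return prefix.equals(p) ? uri : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String u) {
                return uri.equals(u) ? prefix : null;
            }
            public Iterator<String> getPrefixes(String u) {
                return uri.equals(u)
                        ? Collections.singletonList(prefix).iterator()
                        : Collections.<String>emptyIterator();
            }
        };
    }

    /**
     * Evaluates each XPath in fieldToXPath against the document and returns
     * the non-empty results keyed by field name, in mapping order.
     */
    static Map<String, String> extract(String xml, Map<String, String> fieldToXPath,
                                       NamespaceContext ns) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for prefixed XPath steps to match
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(ns);

        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : fieldToXPath.entrySet()) {
            String value = xpath.evaluate(e.getValue(), doc);
            if (!value.isEmpty()) {
                fields.put(e.getKey(), value);
            }
        }
        return fields;
    }
}
```

A mapping such as `title -> /record/dc:title` only produces a field when the document actually uses the bound namespace, which mirrors point 3 of the description.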
[jira] Commented: (NUTCH-767) Update version of Tika for the MimeType detection
[ https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779476#action_12779476 ] Chris A. Mattmann commented on NUTCH-767: - Hi Julien, Thanks for pushing this forward. I'll take a look at this patch... Cheers, Chris > Update version of Tika for the MimeType detection > - > > Key: NUTCH-767 > URL: https://issues.apache.org/jira/browse/NUTCH-767 > Project: Nutch > Issue Type: Improvement >Reporter: Julien Nioche >Assignee: Chris A. Mattmann > Attachments: NUTCH-767.patch > > > The latest version of Tika requires a few changes to the MimeType > implementation. Tika is now split into several jars, so we need to place the > tika-core.jar in the main nutch lib. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (NUTCH-767) Update version of Tika for the MimeType detection
[ https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned NUTCH-767: --- Assignee: Chris A. Mattmann > Update version of Tika for the MimeType detection > - > > Key: NUTCH-767 > URL: https://issues.apache.org/jira/browse/NUTCH-767 > Project: Nutch > Issue Type: Improvement >Reporter: Julien Nioche >Assignee: Chris A. Mattmann > Attachments: NUTCH-767.patch > > > The latest version of Tika requires a few changes to the MimeType > implementation. Tika is now split into several jars, so we need to place the > tika-core.jar in the main nutch lib. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (NUTCH-714) Need a SFTP and SCP Protocol Handler
[ https://issues.apache.org/jira/browse/NUTCH-714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned NUTCH-714: --- Assignee: Chris A. Mattmann > Need a SFTP and SCP Protocol Handler > > > Key: NUTCH-714 > URL: https://issues.apache.org/jira/browse/NUTCH-714 > Project: Nutch > Issue Type: New Feature > Components: fetcher >Affects Versions: 0.9.0 >Reporter: Sanjoy Ghosh >Assignee: Chris A. Mattmann > Fix For: 0.8.2 > > > An SFTP and SCP Protocol handler is needed to fetch intranet content on an > SFTP or SCP server. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-714) Need a SFTP and SCP Protocol Handler
[ https://issues.apache.org/jira/browse/NUTCH-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680348#action_12680348 ] Chris A. Mattmann commented on NUTCH-714: - Hi Sanjoy, When you get a patch, let me know and I will work to integrate it. For reference, you were intending this as an upgrade for 0.8.2? I think we should probably do this as a post 1.0 upgrade (maybe 1.1)? Cheers, Chris > Need a SFTP and SCP Protocol Handler > > > Key: NUTCH-714 > URL: https://issues.apache.org/jira/browse/NUTCH-714 > Project: Nutch > Issue Type: New Feature > Components: fetcher >Affects Versions: 0.9.0 >Reporter: Sanjoy Ghosh >Assignee: Chris A. Mattmann > Fix For: 0.8.2 > > > An SFTP and SCP Protocol handler is needed to fetch intranet content on an > SFTP or SCP server. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-631) MoreIndexingFilter fails with NoSuchElementException
[ https://issues.apache.org/jira/browse/NUTCH-631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674219#action_12674219 ] Chris A. Mattmann commented on NUTCH-631: - Sami, +1. Sorry I didn't have time to get to this. Thanks for whipping it up. > MoreIndexingFilter fails with NoSuchElementException > > > Key: NUTCH-631 > URL: https://issues.apache.org/jira/browse/NUTCH-631 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.0.0 > Environment: Verified on CentOS and OSX >Reporter: Stefan Will >Assignee: Chris A. Mattmann >Priority: Blocker > Fix For: 1.0.0 > > Attachments: NUTCH-631.patch > > > I did a simple crawl and started the indexer with the index-more plugin > activated. The index job fails with the following stack trace in the task log: > java.util.NoSuchElementException > at java.util.TreeMap.key(TreeMap.java:433) > at java.util.TreeMap.firstKey(TreeMap.java:287) > at java.util.TreeSet.first(TreeSet.java:407) > at > java.util.Collections$UnmodifiableSortedSet.first(Collections.java:1114) > at > org.apache.nutch.indexer.more.MoreIndexingFilter.addType(MoreIndexingFilter.java:207) > at > org.apache.nutch.indexer.more.MoreIndexingFilter.filter(MoreIndexingFilter.java:90) > at > org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:111) > at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:249) > at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:52) > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:333) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:164) > I traced this down to the part in MoreIndexingFilter where the mime type is > split into primary type and subtype for indexing: > contentType = mimeType.getName(); > String primaryType = mimeType.getSuperType().getName(); > String subType = mimeType.getSubTypes().first().getName(); > Apparently Tika does not have a subtype for text/html. 
Furthermore, the > supertype for text/html is set as application/octet-stream, which I doubt is > what we want indexed. Don't we want primaryType to be "text" and subType to > be "html" ? > So I changed the code to: > contentType = mimeType.getName(); > String[] split = contentType.split("/"); > String primaryType = split[0]; > String subType = (split.length>1)?split[1]:null; > > This does what I think it should do, but perhaps I'm missing something ? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
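The string-split approach proposed in the report above can be sketched as a small standalone helper. This is a minimal illustration, not the code committed for NUTCH-631: the class name ContentTypeParts is hypothetical, and the split limit of 2 is an added assumption so that anything after the first "/" stays in the subtype. It does not strip parameters such as "; charset=utf-8".

```java
// Hypothetical helper (not part of Nutch): splits a content type name
// such as "text/html" into a primary type and a subtype.
final class ContentTypeParts {
    final String primaryType;
    final String subType; // null when the name contains no "/" separator

    private ContentTypeParts(String primaryType, String subType) {
        this.primaryType = primaryType;
        this.subType = subType;
    }

    static ContentTypeParts of(String contentType) {
        // Limit of 2 (an assumption beyond the original snippet): only the
        // first "/" separates the primary type from the subtype.
        String[] split = contentType.split("/", 2);
        String primary = split[0];
        String sub = (split.length > 1) ? split[1] : null;
        return new ContentTypeParts(primary, sub);
    }

    public static void main(String[] args) {
        ContentTypeParts parts = ContentTypeParts.of("text/html");
        System.out.println(parts.primaryType + " / " + parts.subType);
    }
}
```

Unlike calling first() on an empty sorted set, this never throws NoSuchElementException; a name without a "/" simply yields a null subtype, which the caller has to handle.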
[jira] Work started: (NUTCH-631) MoreIndexingFilter fails with NoSuchElementException
[ https://issues.apache.org/jira/browse/NUTCH-631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-631 started by Chris A. Mattmann. > MoreIndexingFilter fails with NoSuchElementException > > > Key: NUTCH-631 > URL: https://issues.apache.org/jira/browse/NUTCH-631 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.0.0 > Environment: Verified on CentOS and OSX >Reporter: Stefan Will >Assignee: Chris A. Mattmann >Priority: Blocker > Fix For: 1.0.0 > > > I did a simple crawl and started the indexer with the index-more plugin > activated. The index job fails with the following stack trace in the task log: > java.util.NoSuchElementException > at java.util.TreeMap.key(TreeMap.java:433) > at java.util.TreeMap.firstKey(TreeMap.java:287) > at java.util.TreeSet.first(TreeSet.java:407) > at > java.util.Collections$UnmodifiableSortedSet.first(Collections.java:1114) > at > org.apache.nutch.indexer.more.MoreIndexingFilter.addType(MoreIndexingFilter.java:207) > at > org.apache.nutch.indexer.more.MoreIndexingFilter.filter(MoreIndexingFilter.java:90) > at > org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:111) > at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:249) > at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:52) > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:333) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:164) > I traced this down to the part in MoreIndexingFilter where the mime type is > split into primary type and subtype for indexing: > contentType = mimeType.getName(); > String primaryType = mimeType.getSuperType().getName(); > String subType = mimeType.getSubTypes().first().getName(); > Apparently Tika does not have a subtype for text/html. Furthermore, the > supertype for text/html is set as application/octet-stream, which I doubt is > what we want indexed. Don't we want primaryType to be "text" and subType to > be "html" ? 
> So I changed the code to: > contentType = mimeType.getName(); > String[] split = contentType.split("/"); > String primaryType = split[0]; > String subType = (split.length>1)?split[1]:null; > > This does what I think it should do, but perhaps I'm missing something ? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (NUTCH-631) MoreIndexingFilter fails with NoSuchElementException
[ https://issues.apache.org/jira/browse/NUTCH-631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned NUTCH-631: --- Assignee: Chris A. Mattmann > MoreIndexingFilter fails with NoSuchElementException > > > Key: NUTCH-631 > URL: https://issues.apache.org/jira/browse/NUTCH-631 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.0.0 > Environment: Verified on CentOS and OSX >Reporter: Stefan Will >Assignee: Chris A. Mattmann >Priority: Blocker > Fix For: 1.0.0 > > > I did a simple crawl and started the indexer with the index-more plugin > activated. The index job fails with the following stack trace in the task log: > java.util.NoSuchElementException > at java.util.TreeMap.key(TreeMap.java:433) > at java.util.TreeMap.firstKey(TreeMap.java:287) > at java.util.TreeSet.first(TreeSet.java:407) > at > java.util.Collections$UnmodifiableSortedSet.first(Collections.java:1114) > at > org.apache.nutch.indexer.more.MoreIndexingFilter.addType(MoreIndexingFilter.java:207) > at > org.apache.nutch.indexer.more.MoreIndexingFilter.filter(MoreIndexingFilter.java:90) > at > org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:111) > at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:249) > at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:52) > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:333) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:164) > I traced this down to the part in MoreIndexingFilter where the mime type is > split into primary type and subtype for indexing: > contentType = mimeType.getName(); > String primaryType = mimeType.getSuperType().getName(); > String subType = mimeType.getSubTypes().first().getName(); > Apparently Tika does not have a subtype for text/html. Furthermore, the > supertype for text/html is set as application/octet-stream, which I doubt is > what we want indexed. 
Don't we want primaryType to be "text" and subType to > be "html" ? > So I changed the code to: > contentType = mimeType.getName(); > String[] split = contentType.split("/"); > String primaryType = split[0]; > String subType = (split.length>1)?split[1]:null; > > This does what I think it should do, but perhaps I'm missing something ? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-621) Nutch needs to declare it's crypto usage
[ https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-621: Affects Version/s: 0.7 0.7.1 0.7.2 0.8 0.8.1 0.9.0 Fix Version/s: 1.0.0 > Nutch needs to declare it's crypto usage > > > Key: NUTCH-621 > URL: https://issues.apache.org/jira/browse/NUTCH-621 > Project: Nutch > Issue Type: Task >Affects Versions: 0.7, 0.7.1, 0.7.2, 0.8, 0.8.1, 0.9.0 >Reporter: Grant Ingersoll >Assignee: Chris A. Mattmann >Priority: Blocker > Fix For: 1.0.0 > > Attachments: NUTCH-621.Mattmann.091008.step3.txt, > NUTCH-621.step1.Mattmann.090408.patch.txt, > NUTCH-621.step1.Mattmann.091008.patch.txt > > > Per the ASF board direction outlined at > http://www.apache.org/dev/crypto.html, Nutch needs to declare it's use of > crypto libraries (i.e. BouncyCastle, via PDFBox/Tika). > See TIKA-118. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-621) Nutch needs to declare it's crypto usage
[ https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann resolved NUTCH-621. - Resolution: Fixed - resolved in r699866 > Nutch needs to declare it's crypto usage > > > Key: NUTCH-621 > URL: https://issues.apache.org/jira/browse/NUTCH-621 > Project: Nutch > Issue Type: Task >Affects Versions: 0.7, 0.7.1, 0.7.2, 0.8, 0.8.1, 0.9.0 >Reporter: Grant Ingersoll >Assignee: Chris A. Mattmann >Priority: Blocker > Attachments: NUTCH-621.Mattmann.091008.step3.txt, > NUTCH-621.step1.Mattmann.090408.patch.txt, > NUTCH-621.step1.Mattmann.091008.patch.txt > > > Per the ASF board direction outlined at > http://www.apache.org/dev/crypto.html, Nutch needs to declare it's use of > crypto libraries (i.e. BouncyCastle, via PDFBox/Tika). > See TIKA-118. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-621) Nutch needs to declare it's crypto usage
[ https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635241#action_12635241 ] Chris A. Mattmann commented on NUTCH-621: - Folks, Based on Jukka's comments, I've gone ahead and updated Nutch's README file and completed step 4/4 of the crypto usage for Nutch: http://svn.apache.org/viewvc?rev=699866&view=rev Nutch is now fully compliant with Apache crypto reqts! Grant, if this is satisfactory, and you are +1, I will go ahead and close this issue. Thanks for everyone's help! Cheers, Chris > Nutch needs to declare it's crypto usage > > > Key: NUTCH-621 > URL: https://issues.apache.org/jira/browse/NUTCH-621 > Project: Nutch > Issue Type: Task >Reporter: Grant Ingersoll >Assignee: Chris A. Mattmann >Priority: Blocker > Attachments: NUTCH-621.Mattmann.091008.step3.txt, > NUTCH-621.step1.Mattmann.090408.patch.txt, > NUTCH-621.step1.Mattmann.091008.patch.txt > > > Per the ASF board direction outlined at > http://www.apache.org/dev/crypto.html, Nutch needs to declare it's use of > crypto libraries (i.e. BouncyCastle, via PDFBox/Tika). > See TIKA-118. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-621) Nutch needs to declare it's crypto usage
[ https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630445#action_12630445 ] Chris A. Mattmann commented on NUTCH-621: - Grant: Great, thanks. Okay, once you get back the email from the govt (which hopefully we will since perhaps they will CC nutch-dev@ on the reply), I will proceed with step 4: http://www.apache.org/dev/crypto.html#inform And update the appropriate Nutch README file here: http://svn.apache.org/repos/asf/lucene/nutch/trunk/README.txt with the crypto notice and then I think we're done! Cheers, Chris > Nutch needs to declare it's crypto usage > > > Key: NUTCH-621 > URL: https://issues.apache.org/jira/browse/NUTCH-621 > Project: Nutch > Issue Type: Task >Reporter: Grant Ingersoll >Assignee: Chris A. Mattmann >Priority: Blocker > Attachments: NUTCH-621.Mattmann.091008.step3.txt, > NUTCH-621.step1.Mattmann.090408.patch.txt, > NUTCH-621.step1.Mattmann.091008.patch.txt > > > Per the ASF board direction outlined at > http://www.apache.org/dev/crypto.html, Nutch needs to declare it's use of > crypto libraries (i.e. BouncyCastle, via PDFBox/Tika). > See TIKA-118. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-621) Nutch needs to declare it's crypto usage
[ https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-621: Attachment: NUTCH-621.Mattmann.091008.step3.txt Hey Doug: I've attached a text file containing an email template that we need you to send as the PMC Chair for Lucene, regarding Nutch's crypto status. Could you send it ASAP to the TO: addresses in the attached txt file, using the attached text email body, and then let me know when this has been completed? At that point, I'll move on to step 4. Thanks! Cheers, Chris > Nutch needs to declare it's crypto usage > > > Key: NUTCH-621 > URL: https://issues.apache.org/jira/browse/NUTCH-621 > Project: Nutch > Issue Type: Task >Reporter: Grant Ingersoll >Assignee: Chris A. Mattmann >Priority: Blocker > Attachments: NUTCH-621.Mattmann.091008.step3.txt, > NUTCH-621.step1.Mattmann.090408.patch.txt, > NUTCH-621.step1.Mattmann.091008.patch.txt > > > Per the ASF board direction outlined at > http://www.apache.org/dev/crypto.html, Nutch needs to declare it's use of > crypto libraries (i.e. BouncyCastle, via PDFBox/Tika). > See TIKA-118. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-621) Nutch needs to declare it's crypto usage
[ https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-621: Attachment: NUTCH-621.step1.Mattmann.091008.patch.txt Hey Grant: Sorry about this, but I put Nutch in the wrong place on the original patch you committed (I put it under the Incubator project -- which is incorrect). This new patch: 1. creates an entry for Apache Lucene as a top-level project with crypto Products 2. lists Nutch as one of those products, with 2 versions (dev, and releases 0.7 and later) Your commit mojo is appreciated! ^_^ Cheers, Chris > Nutch needs to declare it's crypto usage > > > Key: NUTCH-621 > URL: https://issues.apache.org/jira/browse/NUTCH-621 > Project: Nutch > Issue Type: Task >Reporter: Grant Ingersoll >Assignee: Chris A. Mattmann >Priority: Blocker > Attachments: NUTCH-621.step1.Mattmann.090408.patch.txt, > NUTCH-621.step1.Mattmann.091008.patch.txt > > > Per the ASF board direction outlined at > http://www.apache.org/dev/crypto.html, Nutch needs to declare it's use of > crypto libraries (i.e. BouncyCastle, via PDFBox/Tika). > See TIKA-118. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-621) Nutch needs to declare it's crypto usage
[ https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629611#action_12629611 ] Chris A. Mattmann commented on NUTCH-621: - Thanks, Grant! I will begin step 3 in a few hours... > Nutch needs to declare it's crypto usage > > > Key: NUTCH-621 > URL: https://issues.apache.org/jira/browse/NUTCH-621 > Project: Nutch > Issue Type: Task >Reporter: Grant Ingersoll >Assignee: Chris A. Mattmann >Priority: Blocker > Attachments: NUTCH-621.step1.Mattmann.090408.patch.txt > > > Per the ASF board direction outlined at > http://www.apache.org/dev/crypto.html, Nutch needs to declare it's use of > crypto libraries (i.e. BouncyCastle, via PDFBox/Tika). > See TIKA-118. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-621) Nutch needs to declare it's crypto usage
[ https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-621: Attachment: NUTCH-621.step1.Mattmann.090408.patch.txt Hi: Okey dok, could someone with site-dev karma commit the attached patch to: https://svn.apache.org/repos/asf/infrastructure/site/trunk/xdocs/licenses/exports/index.xml as specified in step 2 of http://www.apache.org/dev/crypto.html ? This will get us started. Once that's complete, I'll begin step 3, notifying the U.S. govt. Thanks, Chris > Nutch needs to declare it's crypto usage > > > Key: NUTCH-621 > URL: https://issues.apache.org/jira/browse/NUTCH-621 > Project: Nutch > Issue Type: Task >Reporter: Grant Ingersoll >Assignee: Chris A. Mattmann >Priority: Blocker > Attachments: NUTCH-621.step1.Mattmann.090408.patch.txt > > > Per the ASF board direction outlined at > http://www.apache.org/dev/crypto.html, Nutch needs to declare it's use of > crypto libraries (i.e. BouncyCastle, via PDFBox/Tika). > See TIKA-118. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Work started: (NUTCH-621) Nutch needs to declare it's crypto usage
[ https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-621 started by Chris A. Mattmann. > Nutch needs to declare it's crypto usage > > > Key: NUTCH-621 > URL: https://issues.apache.org/jira/browse/NUTCH-621 > Project: Nutch > Issue Type: Task >Reporter: Grant Ingersoll >Assignee: Chris A. Mattmann >Priority: Blocker > > Per the ASF board direction outlined at > http://www.apache.org/dev/crypto.html, Nutch needs to declare it's use of > crypto libraries (i.e. BouncyCastle, via PDFBox/Tika). > See TIKA-118. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-621) Nutch needs to declare it's crypto usage
[ https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12604409#action_12604409 ] Chris A. Mattmann commented on NUTCH-621: - Hi Grant: Thanks. The code does exist in nutch, in the parse-pdf plugin. It seems to be using PDFBox's decrypt functionality: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf/PdfParser.java?view=markup Judging by your comment, it sounds like this makes Nutch have to declare its crypto usage. I will work to move Nutch towards this. Thanks for the clarification. Cheers, Chris > Nutch needs to declare it's crypto usage > > > Key: NUTCH-621 > URL: https://issues.apache.org/jira/browse/NUTCH-621 > Project: Nutch > Issue Type: Task >Reporter: Grant Ingersoll >Assignee: Chris A. Mattmann >Priority: Blocker > > Per the ASF board direction outlined at > http://www.apache.org/dev/crypto.html, Nutch needs to declare it's use of > crypto libraries (i.e. BouncyCastle, via PDFBox/Tika). > See TIKA-118. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-621) Nutch needs to declare it's crypto usage
[ https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12603884#action_12603884 ] Chris A. Mattmann commented on NUTCH-621: - Hi Grant: Thanks for the poke on this. I was speaking with Jukka Zitting about this. Tika requires the crypto declaration because of its transitive Maven dependencies in its Parsing framework on the BouncyCastle libraries. Nutch, on the other hand, is using Tika at this point for mime detection only, and Nutch achieves its usage of Tika (0.1-incubating) by CM'ing only the Apache Tika 0.1 jar, and not making use of any of its transitive dependencies (which are inherently Parsing specific, and not Mime Detection specific). In addition, there was a similar thread discussed here: http://markmail.org/message/u7sjfzt7naknsv34 where the consensus was you don't need crypto notifications if you don't include any crypto libraries or use the related functionality in an included other library that has an optional dependency on a crypto library. So, I think that Nutch falls within that category. Would you agree? Thanks for your help and guidance. Cheers, Chris > Nutch needs to declare it's crypto usage > > > Key: NUTCH-621 > URL: https://issues.apache.org/jira/browse/NUTCH-621 > Project: Nutch > Issue Type: Task >Reporter: Grant Ingersoll >Assignee: Chris A. Mattmann >Priority: Blocker > > Per the ASF board direction outlined at > http://www.apache.org/dev/crypto.html, Nutch needs to declare it's use of > crypto libraries (i.e. BouncyCastle, via PDFBox/Tika). > See TIKA-118. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-618) Tika error "Media type alias already exists"
[ https://issues.apache.org/jira/browse/NUTCH-618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann resolved NUTCH-618. - Resolution: Fixed Fix Version/s: 1.0.0 - patch applied to trunk: http://svn.apache.org/viewvc?rev=663092&view=rev > Tika error "Media type alias already exists" > > > Key: NUTCH-618 > URL: https://issues.apache.org/jira/browse/NUTCH-618 > Project: Nutch > Issue Type: Bug > Components: mime_type_detector >Affects Versions: 1.0.0 >Reporter: Andrzej Bialecki >Assignee: Chris A. Mattmann > Fix For: 1.0.0 > > Attachments: NUTCH-618.Mattmann.patch.060108.2.txt, > NUTCH-618.Mattmann.patch.060108.txt > > Time Spent: 2h > Remaining Estimate: 0h > > After the upgrade to the latest Tika jar we see a lot of errors like this: > 2008-03-06 08:07:20,659 WARN org.apache.tika.mime.MimeTypesReader: Invalid > media type alias: text/xml > org.apache.tika.mime.MimeTypeException: Media type alias already exists: > text/xml > at org.apache.tika.mime.MimeTypes.addAlias(MimeTypes.java:312) > at org.apache.tika.mime.MimeType.addAlias(MimeType.java:238) > at > org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:168) > at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:138) > at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:121) > at > org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:56) > at org.apache.nutch.util.MimeUtil.(MimeUtil.java:58) > at org.apache.nutch.protocol.Content.(Content.java:85) > at > org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:226) > at > org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:523) > This is caused most likely by the duplicate tika-mimetypes.xml file - one > copy is embedded inside the Tika jar, the other is found in Nutch conf/ > directory. The one inside the jar seems to be more recent, so I propose to > simply remove the one we have in conf. 
[jira] Assigned: (NUTCH-621) Nutch needs to declare it's crypto usage
[ https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned NUTCH-621: --- Assignee: Chris A. Mattmann > Nutch needs to declare it's crypto usage > > > Key: NUTCH-621 > URL: https://issues.apache.org/jira/browse/NUTCH-621 > Project: Nutch > Issue Type: Task >Reporter: Grant Ingersoll >Assignee: Chris A. Mattmann >Priority: Blocker > > Per the ASF board direction outlined at > http://www.apache.org/dev/crypto.html, Nutch needs to declare it's use of > crypto libraries (i.e. BouncyCastle, via PDFBox/Tika). > See TIKA-118. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-621) Nutch needs to declare it's crypto usage
[ https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602100#action_12602100 ] Chris A. Mattmann commented on NUTCH-621: - Grant, Will do. Thanks. Cheers, Chris > Nutch needs to declare it's crypto usage > > > Key: NUTCH-621 > URL: https://issues.apache.org/jira/browse/NUTCH-621 > Project: Nutch > Issue Type: Task >Reporter: Grant Ingersoll >Assignee: Chris A. Mattmann >Priority: Blocker > > Per the ASF board direction outlined at > http://www.apache.org/dev/crypto.html, Nutch needs to declare it's use of > crypto libraries (i.e. BouncyCastle, via PDFBox/Tika). > See TIKA-118. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-618) Tika error "Media type alias already exists"
[ https://issues.apache.org/jira/browse/NUTCH-618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601519#action_12601519 ] Chris A. Mattmann commented on NUTCH-618: - Dennis Kubes tested this patch for me. According to Dennis, there were 2 lingering log warnings that still came up: 1. For alias: removing the ;exe removed one of the errors 2. removing the subclass from: removes the second of the errors. I am going to attach an updated patch that addresses these issues. Thanks, Chris > Tika error "Media type alias already exists" > > > Key: NUTCH-618 > URL: https://issues.apache.org/jira/browse/NUTCH-618 > Project: Nutch > Issue Type: Bug > Components: mime_type_detector >Affects Versions: 1.0.0 >Reporter: Andrzej Bialecki >Assignee: Chris A. Mattmann > Attachments: NUTCH-618.Mattmann.patch.060108.2.txt, > NUTCH-618.Mattmann.patch.060108.txt > > Time Spent: 2h > Remaining Estimate: 0h > > After the upgrade to the latest Tika jar we see a lot of errors like this: > 2008-03-06 08:07:20,659 WARN org.apache.tika.mime.MimeTypesReader: Invalid > media type alias: text/xml > org.apache.tika.mime.MimeTypeException: Media type alias already exists: > text/xml > at org.apache.tika.mime.MimeTypes.addAlias(MimeTypes.java:312) > at org.apache.tika.mime.MimeType.addAlias(MimeType.java:238) > at > org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:168) > at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:138) > at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:121) > at > org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:56) > at org.apache.nutch.util.MimeUtil.(MimeUtil.java:58) > at org.apache.nutch.protocol.Content.(Content.java:85) > at > org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:226) > at > org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:523) > This is caused most likely by the duplicate tika-mimetypes.xml file - one > 
copy is embedded inside the Tika jar, the other is found in Nutch conf/ > directory. The one inside the jar seems to be more recent, so I propose to > simply remove the one we have in conf. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-618) Tika error "Media type alias already exists"
[ https://issues.apache.org/jira/browse/NUTCH-618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-618: Attachment: NUTCH-618.Mattmann.patch.060108.2.txt Updated patch that includes the updates to tika-mimetypes.xml identified by Dennis Kubes. Thanks, Dennis! Dennis tested this on his testbed environment and it ran through great. So, I'd like to call for 24-48 hr review on the patch, and then if no objections, I'd like to commit it. Thanks! Cheers, Chris > Tika error "Media type alias already exists" > > > Key: NUTCH-618 > URL: https://issues.apache.org/jira/browse/NUTCH-618 > Project: Nutch > Issue Type: Bug > Components: mime_type_detector >Affects Versions: 1.0.0 >Reporter: Andrzej Bialecki >Assignee: Chris A. Mattmann > Attachments: NUTCH-618.Mattmann.patch.060108.2.txt, > NUTCH-618.Mattmann.patch.060108.txt > > Time Spent: 2h > Remaining Estimate: 0h > > After the upgrade to the latest Tika jar we see a lot of errors like this: > 2008-03-06 08:07:20,659 WARN org.apache.tika.mime.MimeTypesReader: Invalid > media type alias: text/xml > org.apache.tika.mime.MimeTypeException: Media type alias already exists: > text/xml > at org.apache.tika.mime.MimeTypes.addAlias(MimeTypes.java:312) > at org.apache.tika.mime.MimeType.addAlias(MimeType.java:238) > at > org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:168) > at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:138) > at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:121) > at > org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:56) > at org.apache.nutch.util.MimeUtil.(MimeUtil.java:58) > at org.apache.nutch.protocol.Content.(Content.java:85) > at > org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:226) > at > org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:523) > This is caused most likely by the duplicate tika-mimetypes.xml file - one > copy is embedded 
inside the Tika jar, the other is found in Nutch conf/ > directory. The one inside the jar seems to be more recent, so I propose to > simply remove the one we have in conf. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Work logged: (NUTCH-618) Tika error "Media type alias already exists"
[ https://issues.apache.org/jira/browse/NUTCH-618?page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#action_10651 ] Chris A. Mattmann logged work on NUTCH-618: --- Author: Chris A. Mattmann Created on: 01/Jun/08 06:23 PM Start Date: 01/Jun/08 06:23 PM Worklog Time Spent: 2h Work Description: produced candidate patch for review Issue Time Tracking --- Time Spent: 2h Remaining Estimate: 0h > Tika error "Media type alias already exists" > > > Key: NUTCH-618 > URL: https://issues.apache.org/jira/browse/NUTCH-618 > Project: Nutch > Issue Type: Bug > Components: mime_type_detector >Affects Versions: 1.0.0 >Reporter: Andrzej Bialecki >Assignee: Chris A. Mattmann > Attachments: NUTCH-618.Mattmann.patch.060108.txt > > Time Spent: 2h > Remaining Estimate: 0h > > After the upgrade to the latest Tika jar we see a lot of errors like this: > 2008-03-06 08:07:20,659 WARN org.apache.tika.mime.MimeTypesReader: Invalid > media type alias: text/xml > org.apache.tika.mime.MimeTypeException: Media type alias already exists: > text/xml > at org.apache.tika.mime.MimeTypes.addAlias(MimeTypes.java:312) > at org.apache.tika.mime.MimeType.addAlias(MimeType.java:238) > at > org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:168) > at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:138) > at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:121) > at > org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:56) > at org.apache.nutch.util.MimeUtil.(MimeUtil.java:58) > at org.apache.nutch.protocol.Content.(Content.java:85) > at > org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:226) > at > org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:523) > This is caused most likely by the duplicate tika-mimetypes.xml file - one > copy is embedded inside the Tika jar, the other is found in Nutch conf/ > directory. 
The one inside the jar seems to be more recent, so I propose to > simply remove the one we have in conf. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-618) Tika error "Media type alias already exists"
[ https://issues.apache.org/jira/browse/NUTCH-618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-618: Attachment: NUTCH-618.Mattmann.patch.060108.txt Hey Guys: Okey dok: here's a candidate patch. Could someone who already has an environment set up in which these types of errors were manifesting please try this patch out and see if it makes them go away? I'm thinking that the root of the issue is not so much that the MimeTypes object was being re-instantiated many, many times as that it wasn't being cached in the ObjectCache. We'll see. This attached patch passes all unit tests. So, please let me know what you think. Thanks! Cheers, Chris > Tika error "Media type alias already exists" > > > Key: NUTCH-618 > URL: https://issues.apache.org/jira/browse/NUTCH-618 > Project: Nutch > Issue Type: Bug > Components: mime_type_detector >Affects Versions: 1.0.0 >Reporter: Andrzej Bialecki >Assignee: Chris A. Mattmann > Attachments: NUTCH-618.Mattmann.patch.060108.txt > > > After the upgrade to the latest Tika jar we see a lot of errors like this: > 2008-03-06 08:07:20,659 WARN org.apache.tika.mime.MimeTypesReader: Invalid > media type alias: text/xml > org.apache.tika.mime.MimeTypeException: Media type alias already exists: > text/xml > at org.apache.tika.mime.MimeTypes.addAlias(MimeTypes.java:312) > at org.apache.tika.mime.MimeType.addAlias(MimeType.java:238) > at > org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:168) > at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:138) > at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:121) > at > org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:56) > at org.apache.nutch.util.MimeUtil.(MimeUtil.java:58) > at org.apache.nutch.protocol.Content.(Content.java:85) > at > org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:226) > at > 
org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:523) > This is caused most likely by the duplicate tika-mimetypes.xml file - one > copy is embedded inside the Tika jar, the other is found in Nutch conf/ > directory. The one inside the jar seems to be more recent, so I propose to > simply remove the one we have in conf. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
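The caching idea mentioned in the comment above (build the expensive MIME-type registry once and reuse it) can be sketched roughly as follows. This is only an illustration under stated assumptions: SimpleObjectCache is a hypothetical stand-in and is not Nutch's ObjectCache API, and the loader here returns a plain Object where the real code would parse tika-mimetypes.xml.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for a shared object cache; NOT Nutch's ObjectCache API.
final class SimpleObjectCache {
    private final Map<String, Object> objects = new ConcurrentHashMap<>();

    // Returns the cached object for the key, invoking the loader only
    // if no object has been stored under that key yet.
    Object getOrCreate(String key, Supplier<Object> loader) {
        return objects.computeIfAbsent(key, k -> loader.get());
    }

    public static void main(String[] args) {
        SimpleObjectCache cache = new SimpleObjectCache();
        AtomicInteger loads = new AtomicInteger();
        Supplier<Object> expensiveLoader = () -> {
            loads.incrementAndGet(); // stands in for parsing the MIME types file
            return new Object();
        };
        Object first = cache.getOrCreate("mime-types", expensiveLoader);
        Object second = cache.getOrCreate("mime-types", expensiveLoader);
        System.out.println((first == second) + " loads=" + loads.get());
    }
}
```

With a cache like this, the definitions are read once per cache instance rather than once per fetched document, which would also mean any load-time warnings are logged once instead of repeatedly.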
[jira] Commented: (NUTCH-618) Tika error "Media type alias already exists"
[ https://issues.apache.org/jira/browse/NUTCH-618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12600909#action_12600909 ]

Chris A. Mattmann commented on NUTCH-618:

Hey Andrzej:

Sorry, I haven't made much progress on the issue. My time has dwindled a bit in the past few months. If someone else has time and wants to reassign the issue, please feel free. Otherwise, I just returned from vacation and will have some free time this weekend, so if it can wait until then, I can at least prepare a draft patch and submit it for review.

Cheers,
Chris
[jira] Commented: (NUTCH-618) Tika error "Media type alias already exists"
[ https://issues.apache.org/jira/browse/NUTCH-618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576051#action_12576051 ]

Chris A. Mattmann commented on NUTCH-618:

Hey Andrzej:

bq. I noticed also another problem: o.a.n.u.MimeUtil doesn't use ObjectCache, so it instantiates MimeTypes over and over again. It should do this once for a given Configuration, and then use ObjectCache to store this object.

Yikes :/ Okay, I will get working on this right away. In addition, I will investigate the cause of the doubly loaded media types -- I'm not positive that it's due to the mime xml file also being present inside the tika jar file -- that's a default one, which we should be able to override (as we're doing in Nutch) if we need to.

Thanks!

Cheers,
Chris
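To make the "doubly loaded media types" symptom concrete, here is a minimal sketch of how feeding the same alias definitions into one registry twice would trigger an "already exists" rejection on the second pass. It is an assumption-laden illustration, not Tika's MimeTypes or MimeTypesReader code: `AliasRegistry`, `load`, and the single `text/xml` entry are hypothetical, and whether this is the actual cause in Nutch is exactly what is still under investigation in this thread.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, much-simplified alias registry; this is not Tika's MimeTypes class. */
final class AliasRegistry {
    private final Map<String, String> aliasToType = new HashMap<>();

    /** Registers an alias, rejecting one that is already present. */
    void addAlias(String type, String alias) {
        if (aliasToType.containsKey(alias)) {
            throw new IllegalStateException("Media type alias already exists: " + alias);
        }
        aliasToType.put(alias, type);
    }

    /** Loads (alias -> type) definitions and returns how many were rejected as duplicates. */
    int load(Map<String, String> definitions) {
        int rejected = 0;
        for (Map.Entry<String, String> e : definitions.entrySet()) {
            try {
                addAlias(e.getValue(), e.getKey());
            } catch (IllegalStateException ex) {
                rejected++; // a real reader might log a warning here instead
            }
        }
        return rejected;
    }
}

class AliasDemo {
    public static void main(String[] args) {
        Map<String, String> defs = new HashMap<>();
        defs.put("text/xml", "application/xml"); // illustrative entry only
        AliasRegistry registry = new AliasRegistry();
        System.out.println(registry.load(defs)); // first load: prints 0
        System.out.println(registry.load(defs)); // same definitions again: prints 1
    }
}
```

The sketch only shows that a second load into the same registry is enough to produce the rejection; it says nothing about which of the two definition files, if either, is responsible.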
[jira] Work started: (NUTCH-618) Tika error "Media type alias already exists"
[ https://issues.apache.org/jira/browse/NUTCH-618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on NUTCH-618 started by Chris A. Mattmann.